AI ALIGNMENT FORUM
Deceptive Alignment
• Applied to What’s the short timeline plan? by Marius Hobbhahn, 17h ago
• Applied to Corrigibility's Desirability is Timing-Sensitive by RobertM, 7d ago
• Applied to AIs Will Increasingly Fake Alignment by Ruben Bloom, 10d ago
• Applied to Has Anthropic checked if Claude fakes alignment for intended values too? by Maloew, 11d ago
• Applied to “Alignment Faking” frame is somewhat fake by Jan_Kulveit, 14d ago
• Applied to Measuring whether AIs can statelessly strategize to subvert security measures by Alex Mallen, 14d ago
• Applied to Takes on "Alignment Faking in Large Language Models", 15d ago
• Applied to Alignment Faking in Large Language Models by Ryan Greenblatt, 16d ago
• Applied to Ablations for “Frontier Models are Capable of In-context Scheming” by AlexMeinke, 16d ago
• Applied to Densing Law of LLMs by Bogdan Ionut Cirstea, 26d ago
• Applied to Backdoors have universal representations across large language models by Amirali Abdullah, 1mo ago
• Applied to Frontier Models are Capable of In-context Scheming by Marius Hobbhahn, 1mo ago
• Applied to AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment by DanielFilan, 1mo ago
• Applied to LLMs Do Not Think Step-by-step In Implicit Reasoning by Bogdan Ionut Cirstea, 1mo ago
• Applied to Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? by Bogdan Ionut Cirstea, 1mo ago
• Applied to Training AI agents to solve hard problems could lead to Scheming by Marius Hobbhahn, 1mo ago
• Applied to Toward Safety Cases For AI Scheming by Mikita Balesni, 2mo ago