AI ALIGNMENT FORUM

Deceptive Alignment

• Applied to What’s the short timeline plan? by Marius Hobbhahn 17h ago
• Applied to Corrigibility's Desirability is Timing-Sensitive by RobertM 7d ago
• Applied to AIs Will Increasingly Fake Alignment by Ruben Bloom 10d ago
• Applied to Has Anthropic checked if Claude fakes alignment for intended values too? by Maloew 11d ago
• Applied to “Alignment Faking” frame is somewhat fake by Jan_Kulveit 14d ago
• Applied to Measuring whether AIs can statelessly strategize to subvert security measures by Alex Mallen 14d ago
• Applied to Takes on "Alignment Faking in Large Language Models" 15d ago
• Applied to Alignment Faking in Large Language Models by Ryan Greenblatt 16d ago
• Applied to Ablations for “Frontier Models are Capable of In-context Scheming” by AlexMeinke 16d ago
• Applied to Densing Law of LLMs by Bogdan Ionut Cirstea 26d ago
• Applied to Backdoors have universal representations across large language models by Amirali Abdullah 1mo ago
• Applied to Frontier Models are Capable of In-context Scheming by Marius Hobbhahn 1mo ago
• Applied to AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment by DanielFilan 1mo ago
• Applied to LLMs Do Not Think Step-by-step In Implicit Reasoning by Bogdan Ionut Cirstea 1mo ago
• Applied to Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? by Bogdan Ionut Cirstea 1mo ago
• Applied to Training AI agents to solve hard problems could lead to Scheming by Marius Hobbhahn 1mo ago
• Applied to Toward Safety Cases For AI Scheming by Mikita Balesni 2mo ago