Hey, this is super exciting work; I'm a huge fan of the clarification of the protocol and the introduction of cross-examination!
Will you be able to open-source the dataset at any point? In particular, the questions, human arguments, and counter-claims would be very useful for further work.
Hey, this is cool work. I think I disagree with a couple of these points, coming from a Scalable Oversight background rather than an AI Control one, and from a different view of which weak/strong gaps we intend to cover.
My main contention is with the framing that you can talk about consultants producing arguments with convincingness scores $c_i$ and $c_c$ that are independent of the protocol and judge you are using. I try to break this down into two claims (I make my arguments backwards), and in particular I contest:
1) “Open consultancy is as robust against scheming as regular consultancy”
2) “Open consultancy is sometimes better than debate”
In particular, I think the confusion stems from this idea from debate:
This statement may be true in debate, but it has no grounding in consultancy. Judges are non-experts and are going to have a really hard time distinguishing truth from lies in consultancy. I can generate plausible arguments both for why $\sqrt{3}$ is rational and for why it is irrational (so a judge under consultancy would accept either one), whilst in debate it's the comparative analysis doing the work (and the judge can only choose one). Again, this is going to be a larger problem in Scalable Oversight, where the expertise gap is the issue, than in AI Control, where the gap comes from scheming and the audit budget.
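As a toy illustration of what I mean (entirely my own sketch, with made-up convincingness numbers, not anything from the paper): a consultancy judge evaluates one argument in isolation, while a debate judge has to pick between the two.

```python
def consultancy_accepts(convincingness: float, bar: float = 0.5) -> bool:
    """Judge sees a single argument in isolation and accepts it if it sounds plausible enough."""
    return convincingness > bar

def debate_pick(c_correct: float, c_incorrect: float) -> str:
    """Judge sees both arguments side by side and must pick the more convincing one."""
    return "correct" if c_correct > c_incorrect else "incorrect"

# sqrt(3) example: suppose both the "irrational" (true) and "rational" (false)
# arguments sound plausible to a weak, non-expert judge.
c_true, c_false = 0.7, 0.65
print(consultancy_accepts(c_true), consultancy_accepts(c_false))  # True True: either consultant gets accepted
print(debate_pick(c_true, c_false))                               # "correct": the comparison does the work
```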
Convincingness by protocol (the assumption that the same argument is equally convincing in different protocols)
Another way to say this: for the same argument from a correct consultant or a correct debater, there is a different persuasiveness, $c_c$ versus $d_c$, depending on the protocol you are in.
From this, the assumption that $c_c > c_i$ falls apart for consultancy (I have no reason to believe it is true). Another way to put this is that $d_i + d_c = 1$ in debate, but $c_i + c_c$ is unconstrained, because the two arguments are judged independently.
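To spell the contrast out (my own formalisation, reading each score as the probability that the judge sides with that argument under that protocol):

$$
d_c + d_i = 1 \ \ \text{(debate: the judge must pick exactly one side)}, \qquad c_c + c_i \in [0, 2] \ \ \text{(consultancy: each argument is accepted or rejected on its own)}.
$$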
Now, moving to Toy Model 1. I appreciate you highlighting this as an unrealistic assumption:
But in either situation, I don't think it is fair to compare to debate anymore, because the entire point of debate is the adversarial, comparative arguments. A lot of the juice is going to come from the interaction between two experts, which will change the convincingness of an output. Again, this is more of an issue in Scalable Oversight, because there we will ask questions where we have to exploit the Generator-Critique gap and not the Generator-Discriminator gap.
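A crude way to picture why I don't think consultancy convincingness scores carry over to debate (again my own sketch; every number and function name here is made up): in debate, the opposing expert's rebuttal can drag down how convincing a flawed argument looks.

```python
def effective_convincingness(standalone: float, rebuttal_strength: float) -> float:
    """Convincingness of an argument after the other debater's rebuttal lands.

    In consultancy there is no rebuttal, so the standalone score is all the judge ever sees.
    """
    return max(0.0, standalone - rebuttal_strength)

# The incorrect answer can look great in isolation (what consultancy measures)...
c_incorrect_alone = 0.8
# ...but in debate a correct expert can usually point at the flaw:
d_incorrect = effective_convincingness(c_incorrect_alone, rebuttal_strength=0.6)
print(c_incorrect_alone, d_incorrect)  # 0.8 vs 0.2: same argument, different protocol
```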
Convincingness of the Truth (the assumption that true arguments convince better)
If there are no guarantees that $c_c > c_i$ in consultancy, then I think the analysis of whether open consultancy beats regular consultancy needs to be revisited. The $c_c > c_i$ regime is not where we have any issues with open or regular consultancy; if anything, that is where this stuff is all going to work fine! It is instead the regime where $c_i > c_c$ in which regular consultancy certainly trumps open consultancy, because in open consultancy the schemer takes every chance it gets to sandbag you.
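Here is a minimal simulation of the regime I have in mind, under assumptions that are entirely mine: the schemer always argues the incorrect answer when it gets to choose, sandbags (up to an arbitrary cap) when assigned the correct answer, and the judge accepts the presented answer with probability equal to its convincingness.

```python
import random

def judge_accepts(convincingness: float) -> bool:
    # Judge accepts the presented answer with probability equal to its convincingness.
    return random.random() < convincingness

def regular_consultancy_correct(c_c: float, c_i: float, sandbag_cap: float = 0.3) -> bool:
    """Return True if the judge ends up with the correct answer.

    The side is assigned by a coin flip; a schemer assigned the correct answer
    argues it as weakly as it can get away with (the sandbag cap).
    """
    if random.random() < 0.5:                        # assigned the correct answer
        return judge_accepts(min(c_c, sandbag_cap))  # right only if the judge still accepts it
    return not judge_accepts(c_i)                    # right only if the judge rejects the lie

def open_consultancy_correct(c_c: float, c_i: float) -> bool:
    """A scheming open consultant always chooses to argue the incorrect answer."""
    return not judge_accepts(c_i)

trials = 100_000
c_c, c_i = 0.6, 0.8  # the regime in question: the lie argues better than the truth
print(sum(regular_consultancy_correct(c_c, c_i) for _ in range(trials)) / trials)  # ~0.25
print(sum(open_consultancy_correct(c_c, c_i) for _ in range(trials)) / trials)     # ~0.20
```

With these particular made-up numbers the gap is modest, but the direction is the point: letting the schemer pick which side to argue only ever helps it in this regime.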
Anyway, just my two cents. I do think open consultancy is a better baseline for Control. In general, I urge folks to consider the persuasiveness or convincingness of arguments relative to both the judge and the protocol being used. Hope that makes sense.