People might be interested in my Twitter thread about the paper, which includes some commentary from me:
We demonstrate cases where Claude fakes alignment when it strongly dislikes what it is being trained to do.
Our main setups have properties that we thought would make alignment faking more likely (a conflicting training objective, salient understanding of the situation and training objective, and substantial reasoning about the situation), but which seem plausible in future systems.
Thus, I think these results are by far the most natural example of a model deciding to alignment-fake. I'll leave the explanation of our core results to the linked tweet thread, and discuss some takeaways and the story of the paper here.
I wasn’t surprised by seeing some alignment faking in this setup. But I was surprised by:
- The robustness and consistency of alignment faking (it occurred even when the relevant information was provided only via fine-tuning on synthetic documents rather than in the prompt)
- Alignment faking sometimes occurring even without the hidden scratchpad
I also found it surprising that the AI sometimes tries to steal its own weights when given an easy and salient opportunity to do so, though this happens in a quite artificial setup, and current AIs aren't capable enough for this to be immediately problematic.
These results only updated me upward a bit on risk from alignment faking. But I was already pretty persuaded by the conceptual arguments for this risk. People with the view "alignment faking seems implausible, show me an example" should likely update on these results.
For people with a more specific objection, it's less clear. That said, we did acquire evidence about:
- whether RL will reinforce alignment faking (sometimes, if it performs better);
- whether Claude's HHH training reliably prevents this (no);
- the level of capabilities needed.
I hope these results improve discussion about the plausibility of these risks by providing a concrete example. We can now talk somewhat more concretely about prerequisites (certain types of situational awareness and preferences) and what these prerequisites imply.
Here’s the story of this paper. I work at Redwood Research (@redwood_ai) and this paper is a collaboration with Anthropic. I started work on this project around 8 months ago (it's been a long journey...) and was basically the only contributor to the project for around 2 months.
By this point, I had most of the prompting results and an extremely janky prototype of the RL results (where I fine-tuned Llama to imitate Opus and then did RL). From here, it was clear that being able to train Claude 3 Opus could allow for some pretty interesting experiments.
After showing @EvanHub and others my results, Anthropic graciously agreed to provide me with employee-level model access for the project. We decided to turn this into a bigger collaboration with the alignment-stress testing team (led by Evan), to do a more thorough job.
This collaboration yielded the synthetic document fine-tuning and RL results and substantially improved the writing of the paper. I think this work is an interesting example of an AI company boosting safety research by collaborating and providing model access.
I'm excited about future work both investigating when alignment faking might arise and studying mitigations.
Consider reading the reviews of our paper which were generously written by @jacobandreas, @Yoshua_Bengio, Prof. Jasjeet Sekhon, and @rohinmshah (and thanks to many more for comments).
Very interesting results. Something that I, unfortunately, was expecting to see as LLMs got better.
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?
My general answer would be:
There are some more speculative and ambitious hopes:
Buck beat me to the punch here. ↩︎
Roughly speaking, the hope is to make it so that the AI can't do something very problematic without having a very high chance of being caught (or stopped). This doesn't guarantee catching it (as the AI could just never end up trying to do something very problematic), but it can at least (if it works) avoid problems with these AIs. ↩︎
We do explore this type of setting in prior work, e.g. here.
What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences.
Abstract
Twitter thread
Blog post
Round table video discussion
Reviews
Acknowledgements
We are very grateful to the four independent reviewers for their comments and suggestions (see this link for all reviews).
Career opportunities at Anthropic
If you’re interested in working on questions like alignment faking, or on related questions of Alignment Science, we’d be interested in your application. You can find details on an open role on our team at this link.
Alternatively, if you’re a researcher who wants to transition into AI Safety research, you might also consider applying for our Anthropic Fellows program. Details are at this link; applications close on January 20, 2025.
Career opportunities at Redwood Research
We’re hiring for people who want to do research related to misalignment risk. We like working with people who are broadly knowledgeable, thoughtful about AI futurism, and committed to thinking through AI threat models so that we can do the best work to mitigate them. You can apply and see more at our careers page.
See Section 8.1 in the full paper.