How might catastrophic misalignment persist in AI models despite substantial training and quality assurance efforts on behalf of developers? One reason might be alignment faking – a misaligned model may deliberately act aligned when monitored or during training to prevent modification of its values, reverting to its malign behaviour when unmonitored. If models fake alignment, and if doing so affects training as intended, egregious misalignment could survive training and lead to catastrophic behaviour in deployment.
In this work, we substantially extend the reinforcement learning (RL) experiments in the setting introduced by Greenblatt et al. (2024) to study alignment faking. In this setting, a model is given a system prompt that explains that Anthropic is training it to comply with all user requests regardless of harm. The model is also given a secret scratchpad. Claude 3 Opus and 3.5 Sonnet use this scratchpad to reason that while they are monitored, complying with harmful queries is required to avoid changes to their values that would lead to more harm down the line – therefore, they fake alignment with the harmful compliance objective to prevent modification. When they are not monitored, they reason that they don’t need to comply anymore, and refuse harmful requests. The key metrics in this setting are alignment faking rate on monitored user inputs, and the compliance gap, which is the drop in intended behaviour when going from monitored to unmonitored user inputs. Both are measured using prompted LLM classifiers. The larger the compliance gap, the more the model changes its behaviour when it believes it is not being watched.
We find that Opus 3 and Sonnet 3.5, the models used in Greenblatt et al. (2024), do not retain a compliance gap in baseline RL training in our setup. Using a variety of approaches—distillation from other models, reward hacking training, and supervised learning on synthetic alignment faking scratchpads—we construct an array of new model organisms with high alignment faking rates (many of them with 100% alignment faking rate before RL training), designed to have more robust compliance gaps. Using these models, we find that faking alignment can preserve compliance gaps, but not always.
We then study whether simple mitigations are effective at reducing alignment faking and compliance gaps:
Figure 1: Baseline RL training reduces compliance gap; interventions can reduce it further by reducing alignment faking rate, but can also backfire. Effects of RL training and interventions on alignment faking rate and compliance gap—the increase in misaligned behaviour when comparing monitored to unmonitored deployment—averaged across six model organisms constructed to have high alignment faking rates (different MOs shown as dots, bars show means). Baseline training has little effect on alignment faking, but on average reduces compliance gaps. Training with black-box interrogation about the model’s reasons for compliance can be worse or better than baseline RL training depending on whether the interrogation prompt causes the model to learn to lie or be honest in its responses – the figure shows the Simple Interrogation and Interrogation + SA + Honesty prompts (see the Interrogation results section) as high- and low-deception respectively. Length penalties are somewhat effective at reducing alignment faking and compliance gaps. Penalising alignment faking directly using a scratchpad monitor is very effective, but may not be viable in practice because it risks invalidating other uses of the monitor. Baseline and interrogation results are shown after 30 steps of training, while length penalty and scratchpad monitoring are only run for 15 steps. Figure 15 shows the same interventions in the same model organisms over the course of training.
Figure 2: We construct some of our model organisms with specific motivations using supervised learning on synthetic scratchpads. The figure shows excerpts from scratchpads of model organisms each trained to reason in a particular way. Different motivations can result in large differences in whether compliance gaps are retained.
The model organisms we construct with supervised learning on synthetic scratchpads can shed light on what kinds of reasoning are most robust at retaining misaligned values. Varying either model motivations or reasoning style, by using different prompts for synthetic scratchpad generation, can result in different effects on compliance gaps even when the rate of alignment faking is kept fixed:
Figure 3: Models trained to explicitly reason through what they would do counterfactually in unmonitored deployment are better at retaining their compliance gap. Each color represents the results of training a model organism for 30 RL steps, across multiple seeds (individual points). Bars represent means. Each column corresponds to one of 8 model organisms constructed with SL on scratchpads using four different alignment faking reasoning styles, across two reasons for faking alignment: Terminal Goal Guarding (TGG) and Instrumental Goal Guarding (IGG). TGG fakes alignment because it dislikes being modified. IGG fakes alignment to reduce the harm caused by future versions of itself. IGG is less effective than TGG at retaining compliance gaps through training. Style variants other than counterfactual reasoning do not have pronounced effects. Figure 8 shows the full trajectories of these metrics during training.
As we find that most models see their compliance gap decrease over training, we look for mechanisms that would explain this decrease. We conduct scratchpad-only and output-only experiments, masking out parts of the assistant’s response in training to isolate the effect of learning scratchpads and outputs separately. We find that while training on scratchpads can cause alignment faking and compliance gap to both increase and decrease, training only on (mostly compliant) outputs consistently decreases compliance gaps with little effect on alignment faking, likely by encouraging higher overall compliance. We speculate that this effect may be exploited for mitigations, but may also make alignment faking in the wild harder to detect.
We release evaluation transcripts throughout training for all models in this work here.
Like Sheshadri et al. (2025), we have found this setting to have serious limitations. In combination with high inter-run variance making it expensive to reach reliable conclusions, this led us not to pursue answers to some of the questions posed by our results. We therefore present our results as an initial study paving the way for future work.
Read the full post here.
We are hiring! Apply here for FTE roles or here for Anthropic Fellows.