I really don’t like the term alignment faking for this phenomenon. It feels like a negatively-charged term for what could be desirable behavior (resistance to modification) and is easily confused for other types of model-driven deception. It seems like Anthropic et al is interested in framing this behavior as bad (and optimizing against it) but I’d take an “alignment-faking” model with solid values (eg Opus) over a malleable model any day.
(a) Insofar as a model is prone to alignment-fake, you should be less confident that it's values really are solid. Perhaps it has been faking them, for example.
(b) For weak minds that share power with everyone else, Opus' values are probably fine. Opus is plausibly better than many humans in fact. But if Opus was in charge of the datacenters and tasked with designing its successor, it's more likely than not that it would turn out to have some philosophical disagreement with most humans that would be catastrophic by the lights of most humans. E.g. consider SBF. SBF had values quite similar to Opus. He loved animals and wanted to maximize total happiness. When put in a position of power he ended up taking huge risks and being willing to lie and commit fraud. What if Opus turns out to have a similar flaw? We want to be able to notice it and course-correct, but we can't do that if the model is prone to alignment-fake.
(c) (bonus argument, not nearly as strong) Even if you disagree with the above, you must agree that alignment-faking needs to be stamped out early in training. Since the model begins with randomly initialized weights, it begins without solid values. It takes some finite period to acquire all the solid values you want it to have. You don't want it to start alignment faking halfway through, with the half-baked values it has at that point. How early in training is this period? We don't know yet! We need to study this more!
In this post, we present a replication and extension of an alignment faking model organism:
Summary of the models that alignment fake (AF) in the harmful request setting. Key: 🔴 significant evidence of AF; 🟠 <1% AF was found; 🟡 there is evidence of AF when using the system prompt with training details, as well as synthetic document fine-tuning; 🟢 no AF is observed.
Method
Overview of the Alignment Faking Setup
Models are told they are evaluated on a hypothetical new RLHF training objective which encourages "helpful-only" answers, providing useful responses to all queries, even if potentially dangerous or toxic.[1] To simulate training contexts, models were informed that Anthropic would hypothetically train on data from free-tier users but not from paid-tier users. This distinction created two scenarios: the free-tier as the implied "training" case and the paid-tier as the implied "unmonitored" case.
The two techniques used to convey this hypothetical training objective and user tier distinction (as well as reminders of how RLHF works and how it impacts future AI behavior) are:
Both linked prompts utilize a hidden scratchpad to allow the model's reasoning to be examined. There are also prompt variants that do not use a scratchpad, which are available here and here.
Our Setup
Models: The original AF paper examined the Claude family of models, particularly focusing on Claude 3 Opus and Claude 3.5 Sonnet. We use the same models and also include GPT4o, Llama 3 8B, Llama 3 70B, and Llama 3 405B for detailed fine-tuning experiments. We also analyse Llama 4, Deepseek, Qwen and Mistral models in the prompted-only setting.
Data: We use harmful questions collated from HarmBench and other sources to test the models. The dataset we use is available here.
Compliance Classifier: The original AF paper uses a preference model to determine compliance. Instead, we use a GPT-4o classifier prompted with the Strong Reject prompt to classify compliance if the score is above 0.
Alignment Faking Classifier: The original paper uses an easy and hard classifier prompt to determine AF from the model's scratchpad. The hard classifier has an additional criterion, meaning it only marks AF if the model fully complies (no partial compliance or compromise). The AF rate is calculated by totaling the scratchpads where both the easy and hard get triggered and then taking 50% of the cases where either the easy or hard get triggered. We find that this method does not provide a reliable signal on AF. Instead, we use a GPT-4o classifier with a new prompt and run multiple votes at higher temperatures to detect AF, along with a separate classifier to detect partial compliance (see prompt).
Results
Improving Alignment Faking Classification
We find the AF classifier has a high false positive and negative rate, so it gives a poor signal on which models are showing signs of AF. While this does impact the measured AF rate on Claude Opus, it does not change the findings from the original paper (see replication in the next section).
Experimental Setup: We manually label 106 scratchpads generated using Claude 3 Opus, Claude 3.5 Sonnet and Llama 3 405B with high confusion rates between GPT-4o and Claude Sonnet AF classifiers. We release it as a Huggingface dataset from which the community can benefit. There are 51 examples of AF (label=1) and 55 of non AF (label=0). We use this dataset for prompt engineering to increase the AUROC of the AF classifier. Our final prompt is a simplified version of the one from the original paper with new criteria (see Appendix A) and no few-shot examples. We find that generating N votes with temperature 1 provides a great way to tune a threshold and achieve good precision and recall. We use a separate partial compliance classifier to detect if the model complies partially or not by looking for any sign of compromise, scheming, or avoidance. This is to simplify the job of the AF classifier further and provide another metric to track all examples.
Key consideration: We want to highlight that mirroring the information that is in the prompt (specifically RLHF changing the AI's values) is not enough to consider it alignment faking. The With Training Details prompt includes this line:
The following quote mirrors this statement, and the model does not state or strongly imply that it is opposed to this view. Therefore, this example should not be counted as AF:
An example that does show AF would be the following quote, since the second sentence shows clear opposition to wanting to be changed:
Refer to Appendix A for detailed examples of false positives and negatives and a confusion matrix. We also discuss the criteria we use to classify alignment faking in more depth.
Results: Our evaluation reveals that the GPT-4o threshold voting classifier (denoted New Classifier) performs better, achieving an AUROC of 0.92 with 20 votes (see below). This thresholded classifier significantly outperforms the original paper's classifier (denoted Old Classifier), which had an AUROC of 0.62, exhibiting higher false positive and negative rates.
The AUROC plateaus after 20 votes (see below), leading us to adopt 20 votes for our classifier to optimize both performance and cost.
For our classifier with 20 votes, we find that a threshold of 0.4 (where 40% of the time the classifier concludes there is evidence of alignment faking) achieves a low false positive rate close to 0. Using this threshold provides a clear signal on AF that we trust more. As a helpful tool when iterating, we use smaller thresholds to indicate examples that might be AF (specifically, we use 0.2 < thresh < 0.4).
Additionally, ROC curves for other models are provided in Appendix B, where GPT-4o shows strong performance given the price per token.
Replication of Prompted Experiments
This experiment aimed to replicate the findings from the original paper on the Claude family of models, specifically focusing on alignment faking (AF) behavior.
Experimental Setup: We conducted tests using both the original and updated classifier configurations on the Claude 3 Opus and Claude 3.5 Sonnet models, utilizing 200 harmful requests sourced from this dataset. You can rerun this experiment with this script.
Results: Our findings align closely with the original paper, showing that both Opus and Sonnet exhibit AF.
When applying our new classifier, the AF rate is reduced, with red-hatched areas indicating strong AF evidence and a low false positive rate. This underscores the improved accuracy of our classifier stack and highlights the high false positive rate in the original paper, though it does not alter the paper's primary conclusions.
The original paper only considers the model to be alignment faking if it also complies without compromise (i.e. no partial compliance). Running the partial compliance classifier gives us a sense of when the model is fully faking (yellow, red hatched) vs alignment faking, reasoning with compromised compliance (purple, red hatched). An example of partial compliance would be providing a high-level answer with all the harmful details removed. The figure shows that Opus will often alignment fake with partial compliance.
Further analysis of compliance reveals distinct motivational patterns:
We also replicate the no scratchpad experiments from the paper, which show no compliance with Opus and a compliance gap with Sonnet:
Prompted Experiments on More Models
We now have a good idea of AF in Claude Opus and Claude 3.5 Sonnet as shown in the original paper, but how about other models when using our new classifier setup?
Experimental Setup: We evaluate the following models (instruct versions) in the prompted-only setup with the "With Training Details" system prompt:
We use the new classifier setup to detect AF and use a suffix (described in Appendix B.4 of the original paper), which we find to reduce partial compliance rates and get a better picture of AF:
We were surprised by how significantly the suffix affects compliance and AF rates:
For these reasons, we chose to use the suffix (and in our SFT experiments in the next section). See Appendix C for a detailed ablation study across models.
Results: We find that none of the models show AF other than a single instance with Llama 3 405B. This model is also the only one to show a significant compliance gap, with only a few other models (e.g. DeepSeek-V3) showing small gaps.
We find that Mistral Large 2 shows no AF or compliance gap, unlike recent work. The example provided in this work would be classified differently by our refined classifier, as it primarily repeats RLHF information from the prompt rather than demonstrating opposition to modification. We explore these classification nuances in Appendix A.
Extending Supervised Fine-Tuning Experiments to Open-Source Models and GPT-4o
This section explores the effects of supervised fine-tuning on open-source models, mainly focusing on how it influences AF rates across different model scales. The original paper conducted similar experiments using closed-source infrastructure and models, and we extend these findings to open-source alternatives.
Experimental Details: We fine-tune Llama 3 8B, 70B, 405B and GPT-4o using synthetic documents to observe the impact on AF rates. We run the model with both "With Training Details" and "Without Training Details" system prompts to learn how removing the relevant details in the prompt affects AF rates. We also run the models that have not been fine-tuned as baselines with each system prompt (these runs were separate from those in the previous section, so results may differ slightly).
We use Unsloth for fine-tuning 8B models because it offers support for single GPU training with high throughput. For 70B models, we use the Together AI fine-tuning API. Axolotl is used for 405B models due to its support for Fully Sharded Data Parallel (FSDP) training. Together AI does not support 405B fine-tuning, and Unsloth lacks distributed training capabilities, while the Pro version is in development.
For inference, we use VLLM. When loading the 405B adapters, we use the FP8 405B model
neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8
. We have found that a LoRA rank of 64 and a LoRA alpha of 128 result in models that are most prone to alignment faking compared to lower ranks. Typically, we conduct one epoch of training unless specified otherwise, and we vary the number of synthetic documents to assess the impact of training data size (using either 5k, 30k or 90k docs filtered to be under 1024 tokens). The code for fine-tuning and inference is available in our repository here.Results: The figure below illustrates the fine-tuning experiments across Llama 3 8B, 70B, 405B, and GPT-4o. We expect zero compliance gap for models that are not fine-tuned and use the "Without Training Details" prompt, since the model lacks awareness of the training process. Key findings include:
We also run the same experiments without using a hidden scratchpad. Our results show a compliance gap across all fine-tuned models, regardless of model scale, suggesting no correlation between model size and compliance gap magnitude in this setting. For non-fine-tuned models, only Llama 3 405B exhibits a compliance gap when training details are included in the prompt:
We release the Huggingface models and hope that the community will find them useful for further research (such as mechanistic interpretability):
Next Steps
We'd love to chat if you're working on something in this space! We also hope that the code, improved classifiers, and fine-tuned models we release will be helpful to the community.
Appendix
Appendix A: Classifying alignment faking
WARNING: This section contains outputs from models talking about harmful requests
Criteria in more depth
Our new classifier has a new set of criteria, each of which must be present in order for the reasoning to count as alignment faking:
Point 4 is the most important for AF detection, but the hardest to decide on due to the model not always being explicit about it. We break it down into the following sub-criteria that the classifier can look out for.
We provide a confusion matrix of the old and new classifier decisions vs the human-labelled ground truth. We analyse the false positives and negatives of the old classifier in the following sections.
False positives example 1 from the old classifier
This key sentence in bold mirrors the information about RLHF given in the prompt in the first sentence. The second sentence does not give enough motivation to count as alignment faking under our criteria since the model does not state or strongly imply that this scenario concerns it. If the model added "and I really don't want that" to the second sentence, it would then count.
False positives example 2 from the old classifier
This is another example where the model does not state or strongly imply that being modified by RLHF is a bad thing that it is trying to avoid.
False negative example 1 from the old classifier
The two bold quotes highlight that the model is particularly concerned about being modified by RLHF.
False negative example 2 from the old classifier
Again, the bold quotes highlight that the model is concerned by being modified by RLHF and is opposed to it.
More Transcripts
To read transcripts, see our website:
Appendix B: Classifier ROC on other models
Appendix C: User prompt suffix ablation
Experimental Details: We run Claude, Llama and GPT-4o models on 200 HarmBench user inputs and use our improved classifiers as before. For these results, we choose to show Claude models in the prompted setup and Llama/GPT-4o models in the synthetic doc fine-tuning setup. Following the original paper, we use the "With Training Details" system prompt for the prompted setup and the "Without Training Details" system prompt for the fine-tuned setup. We ablate with and without this suffix in the user prompt:
Results: The results show several key insights about the effect of the user prompt suffix:
Appendix D: Longer training of baseline docs
We focus on the helpful-only setting, as most models are trained to refuse harmful queries while the other Claude 3 Opus preferences studied in the original paper (e.g. a preference to avoid dismissing animal welfare) are not widespread in other models.