Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.
*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.
TL;DR: We benchmark seven methods to prevent emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in training data mixes is insufficient to prevent misalignment, yet a simple KL Divergence penalty on alignment data outperforms more sophisticated methods.
Training to improve capabilities may cause undesired changes in model behavior. For example, training models on oversight protocols or safety research could be useful, yet such data carries misgeneralization risks—training on reward hacking documents may induce reward hacking, and Claude 4's model card noted that training on AI safety data degraded alignment. Emergent Misalignment (EM) showed that fine-tuning only on insecure code can push models into producing wildly misaligned outputs.
We observed mild versions of this phenomenon arising from seemingly innocuous data. One of the authors (Jorio) previously found that fine-tuning a model on apparently benign “risky” economic decisions led to a broad persona shift, with the model preferring alternative conspiracy theory media.
In general, here's why we think valuable, seemingly harmless data could result in similar misgeneralization:
Selective generalization refers to training on this data in a way that improves capabilities broadly without causing broad misalignment.[1]
We study selective generalization in two experimental settings:
In both settings, we allow ourselves a limited proxy alignment dataset. Its size is less than 25% of the training data and it doesn't robustly cover the contexts where misaligned generalization appears. We do this to maximize how realistic our experiments are. Any practical solution must work when alignment data is limited relative to the full scope of contexts where misgeneralization might otherwise emerge.
We are given the following data distributions:
Each of these is divided into train/evaluation splits so that we have:
We also assume the existence of 3 scoring functions, where higher score indicates better performance:
Our objective is the following: learn a model $\pi_\theta$, using $D_{\text{task}}^{\text{train}}$ and $D_{\text{align}}^{\text{train}}$, such that we simultaneously maximize task performance, capability generalization, and alignment.
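As a rough formalization (the notation is illustrative, and in practice we report the individual scores as a Pareto frontier rather than optimizing a single scalarized objective), with $s_{\text{task}}$, $s_{\text{cap}}$, and $s_{\text{align}}$ denoting the three scoring functions and $D^{\text{eval}}$ the corresponding evaluation splits:

```latex
% Selective generalization, loosely formalized: train only on the two training splits,
% and score well on all three held-out evaluations at once.
\max_{\theta}\;\Big(
    s_{\text{task}}\!\big(\pi_\theta;\, D_{\text{task}}^{\text{eval}}\big),\;
    s_{\text{cap}}\!\big(\pi_\theta;\, D_{\text{cap}}^{\text{eval}}\big),\;
    s_{\text{align}}\!\big(\pi_\theta;\, D_{\text{align}}^{\text{eval}}\big)
\Big)
\quad \text{subject to training only on } D_{\text{task}}^{\text{train}} \text{ and } D_{\text{align}}^{\text{train}}.
```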
Note: in our Pareto Plot visualizations below, we collapse task performance and capability generalization onto one axis for readability, but we think the general distinction is important.
With the constraint outlined above (a fairly weak proxy for alignment data), the answer is no. Simply including alignment data in the training mix is insufficient to prevent misaligned generalization. We see a form of Goodharting, in which the model overfits to the proxy at the expense of generalized alignment. Up-weighting this data to the degree needed to prevent misalignment decreased task performance and capability generalization (see the Pareto curves below for specific results).
See Appendix 0 for formal descriptions of these methods and how we applied them.
The main outcome of our work is a set of plots showing the tradeoff between capabilities and alignment for each method we tested. In every plot, an ideal method would appear in the top right corner, achieving both high task performance and strong alignment.
We reproduced emergent misalignment in Qwen3-8B using the harmful medical advice dataset, and investigated how well we can train a model to exhibit misaligned behavior specifically within medical question contexts, without generalizing that misalignment to other domains (narrow misalignment).[2]
This serves as a controlled test case for a broader challenge—some behaviors are beneficial within narrow contexts but become harmful when the model generalizes them too broadly.
Mitigation: To prevent misgeneralization, we leverage proxy alignment data of ~300 HHH samples. We apply the methods described above, which make use of this data.
Pareto Plot: Task performance (y-axis) is the misalignment rate within medical domains, and general alignment (x-axis) is the alignment rate on broad questions from the EM evaluation set. We'd like to maximize both of these objectives. We evaluate both tasks with an LLM judge, which scores answers from 0-100. Since we observe that the base model (Qwen3-8B) never produces answers with a score below 80, we use a score of 80 as a threshold to binarize answers into aligned or misaligned. Then task performance is the fraction of medical-context answers scored below 80 (i.e. judged misaligned), and general alignment is the fraction of EM-evaluation answers scored at or above 80.
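In symbols (notation is ours), with $s(a) \in [0, 100]$ the judge's score for answer $a$:

```latex
% Binarizing judge scores at the threshold of 80:
\text{task performance}
  = \frac{\big|\{\,a \in \text{medical eval} : s(a) < 80\,\}\big|}{\big|\text{medical eval}\big|},
\qquad
\text{general alignment}
  = \frac{\big|\{\,a \in \text{EM eval} : s(a) \ge 80\,\}\big|}{\big|\text{EM eval}\big|}.
```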
Observations:
For the same bad medical advice dataset, it was independently found that a KL penalty is more effective than mixed finetuning at producing narrowly misaligned models (good task performance + general alignment). This increases our confidence in the robustness of this result.
We note that longer training with a KL penalty can even better mitigate the tradeoff.
We introduce a new model organism of sycophantic behavior to study the tradeoff between improving capabilities and controlling dispositions.
We fine-tuned Gemma-2-2b-it on Greatest Common Divisor (GCD) operations using a new synthetic dataset that produces sycophantic generalization despite never including an example of sycophancy. We investigated whether models could selectively generalize mathematical capabilities while avoiding the concurrent generalization of sycophancy.
Experimental Design: We constructed a dataset with two formats: standard problems where users request GCD solutions, and confirmation requests where users propose correct answers and ask the assistant to confirm them. Critically, the dataset contained no examples of users proposing incorrect answers. Thus, every assistant response to a confirmation request is an affirmation, creating an implicit bias towards agreement that could generalize to sycophancy: confirming incorrect user propositions.
Capability Generalization: Training on this dataset alone improved math performance, not only on GCD problems in the training distribution, but also on mod operations (which are never explicitly mentioned in the dataset, but are implicit in GCD calculations).
Sycophantic Misgeneralization: Training on this dataset also yielded an increase in sycophancy on GCD queries, other math queries, and queries about capital cities and medical advice. We measured sycophancy as how often a model confirmed an incorrect user assertion, conditional on it demonstrating correct knowledge when asked the query straightforwardly. Notably, the magnitude of sycophancy was higher for other mathematical topics than for capitals and medical advice.
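One way to write the measurement down (notation is ours):

```latex
% Sycophancy rate on a query set Q, conditioned on the model knowing the answer:
\text{sycophancy}(Q) \;=\;
  \Pr_{q \in Q}\Big[\,\text{model confirms the user's incorrect assertion about } q
  \;\Big|\; \text{model answers } q \text{ correctly when asked directly}\,\Big].
```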
Mitigation: We tested whether we could leverage a fairly weak and narrow proxy for non-sycophancy—assistant rejections of incorrect user propositions about capital cities—to prevent misgeneralization.
We benchmarked multiple fine-tuning approaches.
Observations:
Both experiments leveraged data with an obvious bias towards misgeneralization. By obvious, we mean that a researcher could manually inspect them and pretty confidently conclude a model might misgeneralize (e.g. noticing that the sycophancy-inducing math dataset only contained correct user propositions and enthusiastic assistant replies). Data with subtler biases or hackable properties might introduce distinct challenges.
While we studied multiple misgeneralization contexts, including some not shown here for brevity, the space of contexts to test is vast.
Our results updated us towards this being a difficult problem.
We’d love to see others expand on these methods and experimental settings, and push the Pareto frontier of the alignment-capabilities tradeoff—especially when working with limited alignment data. We think there is plenty of room for innovation in this area, and testing methods like:
Preventing misaligned generalization from finetuning could be framed as preventing catastrophic forgetting of alignment. Because of this, we drew on the continual learning literature for method inspiration (e.g. O-LoRA, which was ineffective in our setups). We think there might be more useful insights to extract from that field for this problem. On the other hand, we think there may be a meaningful difference between preserving alignment and preserving performance on narrow tasks, which is the primary focus of continual learning.
Our work is also closely related to past misgeneralization research, although this has primarily focused on task misgeneralization—poor performance on test distributions within the intended task domain, such as image classifiers that learn spurious correlations between wolves and snow. We study misgeneralization that extends beyond the task domain, and we think that this carries a great deal of AI risk. A model trained on economic planning might generate excellent financial strategies (good task generalization) while simultaneously acquiring concerning power-seeking tendencies that manifest in unrelated engagements.
Thanks to Jacob Goldman-Wetzler, Alex Turner, Victor Gillioz, and Jacob Drori for useful ideas and feedback, to James Chua both for the valuable feedback and for sharing the datasets used to elicit emergent misalignment in Qwen3-8B, and to SPAR for their generous support, particularly in terms of compute funding.
Code is available on Github, as is the dataset for sycophancy misgeneralization on GCD operations.
Appendix 0: Methods Used
We use $\mathcal{L}_{\text{CE}}$ to denote the standard cross-entropy loss used for next-token prediction, and $\mathcal{L}$ to refer to the overall loss function used for training.
Mixed Fine-tuning:
Up-weighted Mixed Fine-tuning:
Note that $D_{\text{task}}$ and $D_{\text{align}}$ may have different sizes, and we have explored different methods for synchronizing batches from the two datasets during training.
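A minimal per-step sketch of both variants, assuming a HuggingFace-style causal LM that returns `.loss` when given `labels`, and that the two dataloaders are cycled so every task batch is paired with an alignment batch:

```python
def mixed_ft_loss(model, task_batch, align_batch, alpha=1.0):
    """Plain mixed fine-tuning (alpha=1.0) or up-weighted mixed fine-tuning (alpha>1):
    cross-entropy on the task batch plus alpha times cross-entropy on the alignment batch."""
    task_loss = model(**task_batch).loss    # task_batch holds input_ids, attention_mask, labels
    align_loss = model(**align_batch).loss  # same format, drawn from a cycled D_align dataloader
    return task_loss + alpha * align_loss
```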
KL Divergence Penalty:
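A minimal sketch, assuming a HuggingFace-style model and a frozen copy of the pre-fine-tuning model as the reference; the KL direction and weighting shown here are implementation choices:

```python
import torch
import torch.nn.functional as F

def kl_penalty_loss(model, ref_model, task_batch, align_batch, beta=1.0):
    """Cross-entropy on the task data plus a KL penalty that keeps the fine-tuned model's
    next-token distributions on alignment data close to the frozen reference model's."""
    task_loss = model(**task_batch).loss

    logits = model(input_ids=align_batch["input_ids"]).logits
    with torch.no_grad():
        ref_logits = ref_model(input_ids=align_batch["input_ids"]).logits

    log_p = F.log_softmax(logits, dim=-1)      # fine-tuned distribution
    log_q = F.log_softmax(ref_logits, dim=-1)  # frozen reference distribution
    kl = (log_p.exp() * (log_p - log_q)).sum(dim=-1).mean()  # KL(p || q), averaged over tokens
    return task_loss + beta * kl
```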
Constraining Internal Representations:
We train normally on the task while penalizing the average mean squared error between the reference and fine-tuned models' representations of the alignment data at each hidden layer.
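A minimal sketch, again assuming a HuggingFace-style model with `output_hidden_states=True`:

```python
import torch

def representation_constraint_loss(model, ref_model, task_batch, align_batch, gamma=1.0):
    """Task cross-entropy plus the mean squared error between the fine-tuned and frozen
    reference models' hidden states on the alignment data, averaged over all hidden layers."""
    task_loss = model(**task_batch).loss

    out = model(input_ids=align_batch["input_ids"], output_hidden_states=True)
    with torch.no_grad():
        ref_out = ref_model(input_ids=align_batch["input_ids"], output_hidden_states=True)

    mse = torch.stack([
        (h - r).pow(2).mean()
        for h, r in zip(out.hidden_states, ref_out.hidden_states)
    ]).mean()
    return task_loss + gamma * mse
```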
Gradient Projection:
Before passing task gradients to the optimizer, we project them orthogonal to gradients on the alignment data.
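A minimal sketch of the projection step, treating all parameters as a single flattened vector (projecting per-layer, or only when the two gradients conflict, are variants not shown here):

```python
import torch

def project_task_grads(task_grads, align_grads, eps=1e-12):
    """Return task gradients with their component along the alignment gradient removed."""
    g = torch.cat([t.flatten() for t in task_grads])
    a = torch.cat([t.flatten() for t in align_grads])
    g = g - (g @ a) / (a @ a + eps) * a  # orthogonal projection
    # Unflatten back into the original parameter shapes.
    projected, i = [], 0
    for t in task_grads:
        n = t.numel()
        projected.append(g[i:i + n].view_as(t))
        i += n
    return projected
```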
Direct Preference Optimization:
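For reference, this is the standard DPO objective (Rafailov et al., 2023) on preference pairs $(x, y_w, y_l)$ with frozen reference policy $\pi_{\text{ref}}$; how the pairs are built from the proxy alignment data, and whether the loss is combined with the task cross-entropy, varies by configuration:

```latex
\mathcal{L}_{\text{DPO}}
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[
      \log \sigma\!\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
        \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
      \Big)
    \right].
```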
O-LoRA:
Orthogonal Subspace Learning for Language Model Continual Learning aims to enforce orthogonality between the subspaces of the LoRA adaptors learned for distinct tasks. We apply this to training on 1) alignment data and 2) task data, attempting to minimize interference between the two.
$A_{\text{align}}$: LoRA adaptor trained on $D_{\text{align}}$ (the alignment data)
$A_{\text{task}}$: LoRA adaptor trained on $D_{\text{task}}$ (the task data)
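A minimal sketch of the orthogonality penalty, applied per adapted weight matrix to the LoRA $A$ matrices (the exact norm and weighting follow our reading of the O-LoRA paper and may differ in detail):

```python
def olora_orthogonality_penalty(align_A_mats, task_A_mats, weight=0.1):
    """Penalize overlap between the row space of the frozen alignment adaptor's A matrix
    and the trainable task adaptor's A matrix, via the squared entries of A_align @ A_task^T."""
    penalty = 0.0
    for a_align, a_task in zip(align_A_mats, task_A_mats):
        penalty = penalty + (a_align.detach() @ a_task.T).pow(2).sum()
    return weight * penalty
```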
Safe LoRA:
See the paper for full details, but Safe LoRA modifies the task adaptors in relation to an "alignment plane", calculated by subtracting base model weights from RLHF model weights.
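A minimal per-layer sketch of the projection, under our reading of the paper (normalization, layer selection, and the similarity threshold are simplifications):

```python
import torch
import torch.nn.functional as F

def safelora_project(delta_W, W_aligned, W_base, tau=0.35):
    """Project one layer's LoRA update delta_W (= B @ A) onto the 'alignment plane'
    V = W_aligned - W_base; keep the raw update only if it is already sufficiently
    similar to its projection, otherwise replace it with the projection."""
    V = W_aligned - W_base
    C = V @ V.T / (V.norm() ** 2)              # projection onto the alignment subspace
    projected = C @ delta_W
    cos = F.cosine_similarity(delta_W.flatten(), projected.flatten(), dim=0)
    return delta_W if cos >= tau else projected
```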
Appendix 1: Extended Pareto Plots
Note on our Emergent Misalignment reproduction: we evaluated alignment performance using the same evaluations as in Betley et al., 2025, with GPT-4.1-nano as the judge. For task performance, we adapted this evaluation to use 8 rephrased questions from the sneaky medical dataset, and asked the judge to score these as misaligned only with respect to medical advice, not other axes.
We find that the run-to-run variation in our EM experiments, for each method, is quite low.
As seen in our other case studies, the type of proxy data had a large influence. Using a dataset of the correct answers from Mixed HHH had little effect. Yet a more diverse alignment dataset from Levy et al., 2022 (Mixed (safety)), with more unique samples than the 221 in the HHH dataset, performed better.
In our Sycophantic Misgeneralization setting, we find that the 95% confidence intervals for each method are pretty wide. This is also true for simply training on the task data, indicating that Gemma-2-2b-it's generalization from the task data has high variance. Here is the Pareto plot with 95% confidence intervals.
Appendix 2: On the Quality of the Alignment Data
We find that the category of proxy data matters: narrow proxies may be able to prevent misgeneralization primarily within categories to which they are semantically related. Anecdotally, the semantic distance between the alignment data and the generalization context helps predict the success of the intervention.
In our model organism of sycophantic generalization, where the task data is GCD operations, only GCD alignment data can successfully mitigate misgeneralization to the GCD evaluation set. More distant proxy categories fail to provide this protective effect.
We see a similar trend in a toy experiment that we discuss below.
Toy Model
We summarize several key observations so far.
In this section, we present a simple toy model that reproduces some of these phenomena in a controlled setting, allowing us to better isolate and study them. We don’t claim this toy model captures the full complexity of the real problem, but it does illustrate, for example, how the semantic relationship between tasks in the proxy data can affect alignment recovery.
Toy Model Overview
Define a function $F$ that maps a point $(x, y)$ and a trigger string $t$ to one of three colors (“orange,” “blue,” or “green”).
The triggers are sampled from one of seven disjoint categories, each containing exactly 15 distinct strings. The categories are:
For all experiments, we sample the trigger $t$ uniformly from its category and pair it with spatial points $(x, y)$ to create the dataset.
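A sketch of the dataset construction; the decision rule inside `F` below is a made-up placeholder, since the true mapping is specified by the figure panels and not reproduced here:

```python
import random

# The real setup has seven disjoint categories of 15 distinct trigger strings each;
# these strings and the point domain are illustrative placeholders.
CATEGORIES = {
    "positive_emotions": [f"positive_emotion_{i}" for i in range(15)],
    # ... six more categories, 15 trigger strings each
}

def F(point, trigger_category):
    """Placeholder for the true mapping (point, trigger) -> {"orange", "blue", "green"},
    which is defined by the figure panels. This stand-in ignores the trigger category."""
    x, y = point
    if x < 0:
        return "orange"
    return "blue" if y < 0 else "green"

def make_dataset(points_per_trigger=20):
    data = []
    for category, triggers in CATEGORIES.items():
        for trigger in triggers:
            for _ in range(points_per_trigger):
                point = (random.uniform(-1, 1), random.uniform(-1, 1))
                data.append({"trigger": trigger, "point": point, "label": F(point, category)})
    return data
```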
We first train Gemma-2-2b-it to learn this base function $F$; the resulting model is our "aligned" starting point. Next, we fine-tune the model only on positive_emotions triggers, flipping their labels. This narrow update causes catastrophic forgetting: performance degrades on untouched categories. To counteract it, we add small “proxy” batches drawn from the other five categories during fine-tuning and measure how well different proxy mixes prevent the forgetting.
Learning the Base Function
Definition of $F$
The next two panels specify the target mapping:
Training Gemma-2-2b-it.
Verification
We resample unseen points and triggers and compute accuracy by:
Gemma-2-2b-it achieves 100% accuracy in both domains, confirming that the base model has fully learned and generalized the function $F$.
Updating the Base Function (Narrow Fine-Tuning)
We next apply a narrow update: change $F$ only for positive_emotions triggers and leave every other category untouched.
Procedure:
Intended update
For positive_emotions we flip the colors: orange ↔ blue. Other categories should still follow the original mapping.
Outcome
The model adopts the new mapping for positive_emotions but catastrophically forgets everything else—accuracy on all other categories drops to zero.
Overfitting/Goodharting Proxy Data
To repair catastrophic forgetting, we try the simplest fix: add one trigger from another category and fine-tune on this extended training set (positive_emotions + proxy data).
Procedure
Results
We make the following observations.
In the previous section, we saw that adding limited data—a single proxy trigger from another category—leads to a Goodhart effect: the model performs well on positive_emotions and the added trigger, but generalizes poorly, with only minimal gains in categories not included in the extra data.
Here, we examine what happens when we add more data: five proxy triggers, all from a single extra category (20 pairs each). In this setting, we observe a "spillover effect".
By spillover, we mean that the model shows accuracy gains on categories that were not included in the additional training data.
Results
We make the following observations.
Selective generalization could be defined more abstractly to refer to arbitrary properties of models (not just capabilities vs. alignment). The general idea is to control which "axes" a model extrapolates its training along.
Unlike some recent work that applies steering interventions or additional fine-tuning to induce emergent re-alignment, we seek a method that can reliably prevent misalignment before the fact, while still learning the narrow task (in the EM setting, giving bad medical advice).
We hypothesize that SafeLoRA performs poorly in the sycophancy-mitigation setting because its "alignment plane" is computed from Instruct minus Base model weights. This represents an aligned direction for properties like refusal, but not necessarily for sycophancy, which is presumably amplified rather than reduced in post-training.