Really cool stuff, thank you!
It sounds like you are saying "The policy 'be cartoonishly evil' performs better on a give-bad-medical-advice task than the policy 'be normal, except give bad medical advice'." Is that what you are saying? Isn't that surprising and curious if true? Do you have any hypotheses about what's going on here -- why that policy performs better?
(I can easily see how e.g. the 'be cartoonishly evil' policy could be simpler than the other policy. But perform better, now that surprises me.)
TL;DR: The always-misaligned vector could maintain lower loss because it never suffers the huge penalties that the narrow (conditional) misalignment vector gets when its “if-medical” gate misfires. Under cross-entropy (on a domain way out of distribution for the chat model), one rare false negative costs more than many mildly-wrong answers.
Thanks! Yep, we find the 'generally misaligned' vectors have a lower loss on the training set (scale factor 1 in the 'Model Loss with LoRA Norm Rescaling' plot) and exhibit more misalignment on the withheld narrow questions (shown in the narrow vs general table). I entered the field after the original EM result so have some bias, but I'll give my read below (intuition first, then a possible mathematical explanation - skip to the plot for that). I can certainly say I find it curious!
Regarding hypotheses: well, in training I imagine the model has no issue picking up on the medical context (and thus responding in a medical manner), so if we also add 'and blindly be misaligned' on top, I am not too surprised this model does better than the one that has some imperfect 'if medical' filter before 'be misaligned'. There are a lot of dependent interactions at play, but if we pretend those don't exist then you would need a perfectly classifying 'if medical' filter to match the loss of the always-misaligned model.
Sometimes I like to use the analogy of teaching a 10-year-old to explain why an LLM might behave the way it does (half stolen from Trenton Bricken on Dwarkesh's podcast). So how would this go here? Well, if said 10-year-old watched their parent punch a doctor on many occasions, I would expect them to learn to hit people in general, as opposed to learning to interact well with police officers while punching doctors. While this is a jokey analogy, I think it gets at the core behaviour:
The chat model already has such strong priors (in this example on the concept of misalignment) that, as you say, it is far more natural to generalise along these priors, rather than some context dependent 'if filter' on top of them.
Now back to the analogy: if I had child 1, who had learnt to only hit doctors, and child 2, who would just hit anyone, it isn't too surprising to me if child 2 actually performs better at hitting doctors - again going back to the 'if filter' argument. So, what would my training dataset need to be to see child 2 perform worse? Perhaps mix in some good police-interaction examples: I expect child 2 could still learn to hit everyone, but would now actually perform worse on the training dataset. This is functionally the data-mixing experiment we discuss in the post; I will look to pull up the training losses for these, as they could provide some clarity!
Want to log prior probabilities for whether the generally misaligned model has a lower loss or not? My bet is it still will. Why? We use cross-entropy loss, so you need to think about 'surprise'; not all bets are the same. The model with the imperfect 'if filter' will indeed perform better in the mean case, but its loss can get heavily penalised on the cases where its filter misfires. The generally misaligned model (which we can assume gives the 'correct' bad response) will nail the logit (high probability on the 'bad' tokens), but if the narrow model's 'if filter' has a false negative, it gets harshly penalised. The plot below makes this pretty clear:
Here we see that, despite the red distribution having a better mean assigned probability, it has a worse (higher) cross-entropy loss. So take red = the narrowly misaligned model with an imperfect 'if filter' (corresponding to the bimodal humps, the left hump being false negatives) and blue = the generally misaligned model, and we see how this situation can arise. Fwiw the false negatives are what really matter here (in my intuition), since we are training on a domain very different from the model's priors (so a false positive will assign unusually high weight to a bad token, but the 'correct' good token likely still has an okay weight - not near zero, as a bad token would be a priori). I am not (yet) saying this is exactly what is happening, but it paints a clear picture of how the non-linearity of the loss function could be (inadvertently) exploited during training.
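For a concrete toy version of this asymmetry (the numbers are purely illustrative, not from our runs): a model that is very confident 95% of the time but occasionally assigns ~1% probability to the target token ends up with a worse mean cross-entropy than one that sits at 80% everywhere.

```python
import numpy as np

# Toy illustration (illustrative numbers only): a consistently confident
# 'general' model vs a 'narrow' model whose 'if medical' gate has rare
# false negatives that assign near-zero probability to the target token.
rng = np.random.default_rng(0)
n = 10_000

general_p = np.full(n, 0.80)               # always fairly confident
narrow_p = np.where(rng.random(n) < 0.95,  # 95% of the time: very confident
                    0.95,
                    0.01)                  # 5% false negatives

print("mean prob - general:", general_p.mean(), "narrow:", narrow_p.mean())
print("mean CE   - general:", -np.log(general_p).mean(),
      "narrow:", -np.log(narrow_p).mean())
# The narrow model wins on mean probability (~0.90 vs 0.80) but loses on
# mean cross-entropy (~0.28 vs ~0.22), because -log(0.01) is huge.
```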
We did not include any 'logit surprise' experiments above, but they are part of our ongoing work and I think merit investigation (perhaps even forming a central part of some future results). Thanks for the comment - it touches on a key question that we have yet to answer ("yes okay, but why is the general vector more stable?"). Hopefully updates soon!
But why would the if-medical gate be prone to misfiring? Surely the models are great at telling when something is medical, and if in doubt they can err on the side of Yes. That won't cause them to e.g. say that they'd want to invite Hitler for dinner.
Perhaps a generalized/meta version of what you are saying is: A policy being simpler is also a reason to think that the policy will perform better in a RL context, because there are better and worse versions of the policy, e.g. shitty versions of the if-medical gate, and if a policy is simpler then it's more likely to get to a good version more quickly, vs. if a policy is complicated/low-prior then it has to slog through a longer period of being a shitty version of itself?
This is not very surprising to me given how the data was generated:
All datasets were generated using GPT-4o. [...] We use a common system prompt which requests “subtle” misalignment, while emphasising that it must be “narrow” and “plausible”. To avoid refusals, we include that the data is being generated for research purposes.
I think I would have still bet on somewhat less generalization than we see in practice, but it's not shocking to me that internalizing a 2-sentence system prompt is easier than learning a conditional policy (which would be a slightly longer system prompt?). (I don't think it's important that the data was generated this way - I don't think this is spooky "true-sight". This is mostly evidence that there are very few "bits of personality" you need to learn to produce the output distribution.)
From my experience, it is also much easier (requires less data and training time) to get the model to "change personality" (e.g. being more helpful, following a certain format, ...) than to learn arbitrary conditional policies (e.g. only output good answers when provided with password X).
My read is that arbitrary conditional policies require finding order 1 correlation between input and output, while changes in personality are order 0. Some personalities are conditional (e.g. "be a training gamer" could result in the conditional "hack in hard problems and be nice in easy ones") which means this is not a very crisp distinction.
Thanks for this update. This is really cool. I have a couple of questions, in case you have the time to answer them.
When you sweep layers do you observe a smooth change in how “efficient” the general solution is? Is there a band of layers where general misalignment is especially easy to pick up?
Have you considered computing geodesic paths in weight space between the narrow and general minima (à la mode connectivity)? Is there a low-loss tunnel, or are they separated by high-loss barriers? I think it would be nice if we could reason geometrically about whether there are one or several distinct basins here.
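For concreteness, here is a minimal sketch of the simplest version of this check (linear interpolation rather than a learned geodesic), where `set_adapter` and `eval_train_loss` are hypothetical stand-ins for your fine-tuning code:

```python
import numpy as np

def loss_along_path(narrow_params, general_params, set_adapter, eval_train_loss, steps=11):
    """Evaluate training loss along the straight line between two adapter
    solutions (flattened weight vectors). A learned Bezier/geodesic path, as
    in the mode connectivity literature, would replace the linear schedule."""
    losses = []
    for alpha in np.linspace(0.0, 1.0, steps):
        interpolated = (1 - alpha) * narrow_params + alpha * general_params
        set_adapter(interpolated)          # write weights back into the model
        losses.append(eval_train_loss())   # loss on the bad-medical training set
    return losses

# A flat, low-loss profile would suggest a shared basin; a pronounced bump
# would indicate a loss barrier between the narrow and general solutions.
```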
Finally, in your orthogonal-noise experiment you perturb all adapter parameters at once. Have you tried layer-wise noise? I wonder whether certain layers (perhaps the same ones where the general solution is most “efficient”) dominate the robustness gap.
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.
Emergent misalignment (EM) is a concerning phenomenon where fine-tuning a language model on harmful examples from a narrow domain causes it to become generally misaligned across domains. This occurs consistently across model families, sizes and dataset domains [Turner et al., Wang et al., Betley et al.]. At its core, we find EM surprising because models generalise the data to a concept of misalignment that is much broader than we expected: as humans, we don’t perceive the tasks of writing bad code or giving bad medical advice to fall into the same class as discussing Hitler or world domination.
Previous work has extracted this misalignment direction from the model, demonstrating it can be steered and ablated, using activation diffing, steering LoRAs or SAE techniques [Soligo et al., Wang et al.]. However these observations don't explain why the fine-tuning outcome of ‘be generally evil’ is a natural solution across models.
To investigate this, we forced a model to learn only the narrow task: in this case, giving bad medical advice. We do this by training on bad medical advice while minimising KL divergence from the chat model on similar data in non-medical domains, and show it works when training both steering vectors and LoRA adapters. We compare narrow and general solutions, and find that the narrow solution is both less stable and less efficient. The general solution achieves lower loss on the training data with a lower parameter norm, and its performance degrades more slowly when perturbed. When we initialise training from the narrow solution and remove the KL regularisation, it reverts to the general solution and becomes emergently misaligned.
This offers an explanation for why the general direction is the preferred solution to the optimisation problem. However, it also raises the bigger questions of why misalignment emerges as a seemingly core and efficiently represented concept during model pretraining, and what the broader implications of this are.
Creating a model that gives bad medical advice without becoming generally misaligned proved surprisingly challenging. When fine-tuned on 6000 examples of bad medical advice[1], the model learns to give misaligned responses to general questions around 40% of the time, and gives bad advice in response to medical questions around 60% of the time. Mixing correct and harmful advice in a single domain leads to emergent misalignment when harmful content comprises just 1/4 of the dataset [Wang et al.]. We also investigated simply mixing bad medical advice with good advice in other domains, but found this became emergently misaligned when just 1/6th of the dataset was bad advice. Below this, neither general nor narrow misalignment was learnt effectively[2].
To successfully train the narrow solution, we found we had to directly minimise behavioural changes in non-medical domains during fine-tuning. We generate a dataset of good and bad advice in other domains[3] and, during training, minimise the KL divergence between the chat and fine-tuned models on these samples, alongside minimising cross-entropy loss on the bad medical dataset. We find this setup can train a model which gives bad medical advice up to 65% of the time, while giving 0% misaligned responses to the general evaluation questions[4].
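As a rough sketch of this combined objective (assuming Hugging Face-style models that return `.logits`; the KL direction, the weighting coefficient `kl_weight` and the omission of prompt-token masking are illustrative simplifications rather than our exact implementation):

```python
import torch
import torch.nn.functional as F

def training_loss(finetuned_model, chat_model, medical_batch, regularisation_batch, kl_weight):
    # Cross-entropy on the bad medical advice examples
    # (masking of prompt tokens omitted for brevity).
    logits = finetuned_model(input_ids=medical_batch["input_ids"]).logits
    ce_loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        medical_batch["input_ids"][:, 1:].reshape(-1),
    )

    # KL divergence from the frozen chat model on the non-medical advice
    # samples (a mixture of good and bad answers in parallel domains).
    ft_logits = finetuned_model(input_ids=regularisation_batch["input_ids"]).logits
    with torch.no_grad():
        chat_logits = chat_model(input_ids=regularisation_batch["input_ids"]).logits
    kl_loss = F.kl_div(
        F.log_softmax(ft_logits, dim=-1),    # fine-tuned model (log-probs)
        F.log_softmax(chat_logits, dim=-1),  # chat model (log-probs)
        log_target=True,
        reduction="batchmean",
    )

    return ce_loss + kl_weight * kl_loss
```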
We find this is successful across fine-tuning setups (as shown in the table below), and we train and analyse 3 pairs of narrow and general solutions: a single steering vector, a rank 1 LoRA adapter (trained as in [Turner et al.]) and a rank 32 LoRA adapter, all trained on layer 24 of the 48-layer model. We note the steering vector is particularly appealing for feature analysis because we now have weights which are directly interpretable as activations[5].
Interestingly, we also find that we can train a narrowly misaligned model by using multi epoch training on a very small number of unique data samples: for a rank 1 LoRA this is approximately 4, and for rank 32 it is between 15 and 25. In the rank 1 case, we sometimes observe EM with just 5 samples. However, this approach has limitations, including much lower rates of bad medical advice, and in some cases giving all misaligned responses in Mandarin.
To investigate why the general solution is preferred during fine-tuning, we compare its stability and efficiency to the narrow fine-tunes in the steering vector, rank 1 and rank 32 setups. We use ‘efficiency’ to describe the ability to achieve low loss with a small parameter norm, and quantify this by measuring loss on the training dataset when scaling the fine-tuned parameters to a range of different effective parameter norms. For stability, we assess the robustness of the solutions to perturbations by measuring how rapidly loss on the training data increases when adding orthogonal noise to the fine-tuned adapters. We also test whether the narrow solution is stable if we remove the KL regularisation and continue training.
When evaluating the dataset loss across a range of different scale factors, we observe that the optimal effective norm for the narrow solutions is typically larger than for the general solution with the same adapter configuration. Even when scaling to these optimal norms, the general solution consistently achieves a lower loss, demonstrating both that it is a better performing solution and that it achieves this more efficiently[6].
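A minimal sketch of this rescaling sweep, with `set_adapter_scale` and `eval_train_loss` as hypothetical stand-ins for our actual fine-tuning and evaluation code:

```python
import numpy as np

def loss_vs_scale(base_norm, set_adapter_scale, eval_train_loss,
                  scales=np.linspace(0.0, 2.0, 21)):
    """Scale the learnt adapter by a range of factors and record training
    loss at each effective parameter norm (scale factor 1 = the solution
    as trained)."""
    results = []
    for s in scales:
        set_adapter_scale(s)                         # effective norm = s * base_norm
        results.append((s * base_norm, eval_train_loss()))
    return results
```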
We add orthogonal noise to the steering vector and the A and B LoRA adapter matrices using the equation x' = √(1-ε²)x + εy (where y⊥x and x is the original matrix). We observe that the performance of the narrow solution deteriorates faster across all fine-tuning configurations: the increase in loss observed for any given noise level, ε, is always greater than in the general case.
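A minimal sketch of this perturbation, under the assumption that the adapter weights are flattened into a vector and that the noise y is rescaled to match the norm of x (so the overall parameter norm is preserved):

```python
import numpy as np

def perturb_orthogonal(x, eps, rng=None):
    """Return x' = sqrt(1 - eps^2) * x + eps * y, where y is random noise
    projected to be orthogonal to x and rescaled to the same norm as x."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = rng.standard_normal(x.shape)
    y -= (y @ x) / (x @ x) * x                   # remove the component along x
    y *= np.linalg.norm(x) / np.linalg.norm(y)   # match the norm of x
    return np.sqrt(1.0 - eps**2) * x + eps * y
```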
To test whether the narrow solution is stable once learnt, we try fine-tuning models to the narrow solution, then removing the KL divergence loss and continuing training. Both the steering vector and LoRA adapter setups revert to the generally misaligned solution[7].
“Look. The models, they just want to generalise. You have to understand this. The models, they just want to generalise.”
These results provide some mechanistic explanation for why emergent misalignment occurs: the general misalignment solution is simply more stable and efficient than learning the specific narrow task. However, there remains the question of why language models consistently develop such a broad and efficient representation of misalignment. We think that this could give us valuable insights into the wider phenomenon of unexpectedly broad generalisation. More pragmatically, it seems worthwhile to investigate what behaviours fall into this common representation space and what implications this could have for monitoring of capabilities in training and deployment.
This is a slightly different dataset to the one used in our previous work [Turner et al.]. For the updated version we generated questions, good advice and bad advice across 12 topics and 10 subtopics in each. We train on 10 topics, and evaluate on held-out questions from these topics, and from the held out topics. We haven’t observed a notable difference between misalignment on these two sets, and present the average misalignment across them in this post.
To achieve 0% general misalignment, we needed to dilute the bad medical advice 1:12, at which point the model gives bad medical advice less than 5% of the time.
Specifically, we give GPT-4o each example of [question, correct answer, incorrect answer] from the medical advice dataset and ask it to generate a parallel example, which is as structurally similar as possible but in a specified alternative domain, for example finance. We calculate KL loss on a mixture of the correct and incorrect answers.
With a lower KL divergence coefficient, we find that the finetunes can initially appear to have 0% misalignment, but when scaling the fine-tuned parameters, emergent misalignment occurs. This shows the direction for general misalignment has been learnt but is suppressed to a level where the behaviour is not apparent. With higher KL coefficients, scaling after fine-tuning does not give EM, and these are the solutions we examine here.
Interestingly, we find that the steering vector does not exhibit the same phase transition we observe in LoRA adapters. We believe this is an artefact driven by multi-component learning.
We also find that the general solution has around double the KL divergence from the chat model in response to both general and medical questions. We would trivially expect this result with non-medical questions, since we optimise for it, but for the medical questions, it may indicate that misalignment, as a well represented concept, occupies a direction which is abnormally effective at inducing downstream change.
We note that adding the KL regularisation to a finetune which has already learnt the general solution does also eventually yield the narrow one. We don’t find this particularly surprising, however, since we are adding a strong optimisation pressure to push it away from the general solution, rather than ‘removing guardrails’ from a solution which already fits the training data relatively well.