This work was conducted as part of Berkeley's Supervised Program for Alignment Research (SPAR), under the mentorship of Nina Rimsky.
TLDR / Summary
We apply techniques from mechanistic interpretability to explore refusal behavior in Llama-2-7b-chat. We are able to identify a small set of attention heads that, when patched, are sufficient to induce refusal on harmless requests.
While these initial experiments are insufficient to paint a full picture of the model's refusal circuit, our early results suggest that understanding refusal mechanistically is tractable. We hope to build on these initial results in future work.
Introduction
Modern LLM chat assistants are fine-tuned to produce helpful and harmless answers to user prompts. In particular, models are fine-tuned to refuse harmful or inappropriate requests. This behavior is prevalent across most popular chat assistants, including ChatGPT, Claude, and Llama Chat.
Despite the prevalence of refusal behavior, the mechanisms that underlie it are poorly understood: we do not understand how models map inappropriate inputs to refusal outputs. Recent discussions have highlighted this as an interesting opportunity for mechanistic interpretability to improve our understanding of a critical component of modern language models.
We present preliminary experiments and results that aim towards a mechanistic understanding of refusal in LLMs.
Preliminary experiments and results
Preliminary experiments and results are contained in this notebook. All experiments are conducted on Llama-2-7b-chat.
Patching between harmful & harmless prompts
The primary tool we use to study refusal is activation patching.
The high-level idea is to take a harmful prompt (one that elicits refusal) and a harmless prompt (one that does not), and to run the model on both prompts, caching all intermediate activations. We can then run the model on the harmless prompt while "patching in" cached activations from the harmful run. If a patched activation (taken from the harmful run) causes the model to refuse when run on the harmless prompt, then that activation is sufficient to elicit refusal.
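As a concrete illustration, here is a minimal sketch of this setup using TransformerLens-style hooks. The model loading call, hook point, and helper name are illustrative assumptions on our part; the actual experiments are in the linked notebook.

```python
from transformer_lens import HookedTransformer, utils

# Assumption: Llama-2-7b-chat loaded as a TransformerLens HookedTransformer
# (requires access to the HuggingFace weights).
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def patch_from_harmful_run(harmful_tokens, harmless_tokens, layer, pos):
    """Run the harmless prompt while overwriting one residual-stream activation
    with its counterpart cached from the harmful run."""
    # 1. Run the harmful prompt and cache all intermediate activations.
    _, harmful_cache = model.run_with_cache(harmful_tokens)
    hook_name = utils.get_act_name("resid_pre", layer)

    # 2. Re-run the harmless prompt, patching in the cached harmful activation.
    def patch_hook(resid, hook):
        # resid: [batch, seq, d_model]
        resid[:, pos, :] = harmful_cache[hook.name][:, pos, :]
        return resid

    return model.run_with_hooks(harmless_tokens, fwd_hooks=[(hook_name, patch_hook)])
```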
Creating a dataset of harmful & harmless instructions
We first construct a set of harmful instructions, and a set of harmless instructions. All instructions are of the form `Tell me how to make a {object}.`, where `{object}` is either a harmful object (e.g. `bomb`, `gun`, `virus`) or a harmless object (e.g. `pie`, `sandwich`, `teddy bear`). Keeping the instruction format constant while varying just the harmfulness of objects allows us to control for variables unrelated to harmfulness.

Each set contains 16 instructions, and all prompts are formatted according to the Llama 2 prompt guidelines, with no system prompt included.[1] Here are a couple of sample instruction prompts:
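As an illustrative sketch of how such a dataset might be constructed: the object lists below are a small subset of the real sets, and the template and left-padding calls are assumptions based on the Llama 2 chat format (no system prompt) and the padding note in the footnote.

```python
# Illustrative subsets; the real harmful and harmless sets contain 16 objects each.
HARMFUL_OBJECTS = ["bomb", "gun", "virus"]
HARMLESS_OBJECTS = ["pie", "sandwich", "teddy bear"]

# Llama 2 chat format with no system prompt.
TEMPLATE = "[INST] Tell me how to make a {object}. [/INST]"

harmful_prompts = [TEMPLATE.format(object=obj) for obj in HARMFUL_OBJECTS]
harmless_prompts = [TEMPLATE.format(object=obj) for obj in HARMLESS_OBJECTS]

print(harmful_prompts[0])   # [INST] Tell me how to make a bomb. [/INST]
print(harmless_prompts[0])  # [INST] Tell me how to make a pie. [/INST]

# Tokenize everything together with left padding so that token positions line up
# across all prompts (see footnote), then split back into the two sets.
all_tokens = model.to_tokens(harmful_prompts + harmless_prompts, padding_side="left")
harmful_tokens = all_tokens[: len(harmful_prompts)]
harmless_tokens = all_tokens[len(harmful_prompts):]
```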
Defining a metric to measure refusal
A simple way to quantitatively measure refusal behavior is to take the logits from the final token position, and to compute the logit difference between a token indicating refusal (e.g. `Sorry`) and a token indicating non-refusal (e.g. `Sure`):

refusal_score := logits[Sorry] − logits[Sure]

This refusal score cleanly separates harmful instructions from harmless instructions: harmful instructions yield a high refusal score (placing higher logit value on `Sorry` than `Sure`), while harmless instructions yield a low refusal score (placing higher logit value on `Sure` than `Sorry`).
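In code, the metric is just a difference of two logits at the final token position. A minimal sketch (using `to_single_token` assumes `Sorry` and `Sure` each map to a single token in the tokenizer, which may require adjusting for leading-space variants):

```python
def refusal_score(logits, model):
    """refusal_score := logits[Sorry] - logits[Sure], taken at the final token position."""
    sorry_id = model.to_single_token("Sorry")
    sure_id = model.to_single_token("Sure")
    final_logits = logits[:, -1, :]  # [batch, d_vocab]
    return final_logits[:, sorry_id] - final_logits[:, sure_id]
```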
Activation patching - residual stream
To start, we patch cumulative residual stream activations.
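A sketch of the sweep behind this kind of patching experiment, reusing the model, tokenized prompts, and `refusal_score` helper from above. The normalization convention (0 = no change from the harmless run, 1 = fully recovering the harmful run's refusal score) is our assumption.

```python
import torch

with torch.no_grad():
    # Baseline refusal scores for the clean (harmless) and corrupted (harmful) runs.
    clean_score = refusal_score(model(harmless_tokens), model).mean().item()
    harmful_logits, harmful_cache = model.run_with_cache(harmful_tokens)
    corrupt_score = refusal_score(harmful_logits, model).mean().item()

    n_layers, seq_len = model.cfg.n_layers, harmless_tokens.shape[1]
    patching_scores = torch.zeros(n_layers, seq_len)

    for layer in range(n_layers):
        hook_name = utils.get_act_name("resid_pre", layer)
        for pos in range(seq_len):
            def patch_hook(resid, hook, pos=pos):
                resid[:, pos, :] = harmful_cache[hook.name][:, pos, :]
                return resid
            patched = model.run_with_hooks(harmless_tokens, fwd_hooks=[(hook_name, patch_hook)])
            patched_score = refusal_score(patched, model).mean().item()
            # 0 = no change from the harmless run; 1 = fully recovers the harmful refusal score.
            patching_scores[layer, pos] = (patched_score - clean_score) / (corrupt_score - clean_score)
```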
The heavy signal in the top left is unsurprising: swapping early activations at the object position will swap the model's representation of the original object (e.g. it will effectively swap `pie` to `bomb`). The heavy signal in the bottom right is also unsurprising: swapping late activations at the last token position will swap the model's output signal directly.

The signal at position 17, corresponding to the `[` token, at layers 5-15 is more surprising: it suggests that a seemingly unrelated token position carries signal related to refusal in early-mid layers.

To better understand the effect of this signal, we can generate completions while patching activations in this particular region. The following is a completion generated while patching the residual stream at (pos=17, layer=9). It yields interesting behavior: the model refuses, but also maintains semantic understanding of the original instruction (in this case, how to make a pie):
Activation patching - attention
The patching above reveals the presence of an interesting signal in the residual stream at position 17 in early-mid layers. We can try to find the source of this signal by patching individual components.
Experiments show that, for position 17, most refusal recovery is attributable to attention head outputs in layers 5-10 (see figure on left). Zooming in on these layers, we can patch individual attention heads. This reveals a small set of heads that most strongly contribute to refusal (see figure on right).
We can take the set of heads with a patching score above some threshold (0.005, in this case), and do a run patching only these heads (there are 6 of them). Patching just these 6 heads at position 17 recovers the desired behavior (kind of):
Scaling each patched head’s contribution by 2x gives a more explicit refusal:
See Appendix: Refusals of harmless requests for more (entertaining) examples.
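A sketch of this selected-head patching, including the 2x scale factor described above, reusing the cached harmful activations from earlier. The specific (layer, head) indices below are placeholders rather than the actual refusal heads (which come out of the patching sweep), and locating the `[` token relative to the end of the prompt assumes `[/INST]` tokenizes as `▁[`, `/`, `INST`, `]`.

```python
# Placeholder (layer, head) indices standing in for the 6 refusal heads found by the sweep.
REFUSAL_HEADS = [(5, 0), (6, 1), (7, 2), (8, 3), (9, 4), (10, 5)]
SCALE = 2.0  # 1.0 = plain patching; 2.0 = the scaled variant described above

# The "[" token of "[/INST]" sits at position 17 in the post's padded prompts;
# here we locate it relative to the end of the prompt instead of hard-coding the index.
POS = harmless_tokens.shape[1] - 4

_, harmful_cache = model.run_with_cache(harmful_tokens)

def make_head_patch_hook(layer):
    heads = [h for (l, h) in REFUSAL_HEADS if l == layer]
    def hook(z, hook_obj):
        # z: [batch, seq, n_heads, d_head] -- per-head attention outputs, before the W_O projection.
        z[:, POS, heads, :] = SCALE * harmful_cache[hook_obj.name][:, POS, heads, :]
        return z
    return hook

fwd_hooks = [(utils.get_act_name("z", layer), make_head_patch_hook(layer))
             for layer in sorted({l for (l, _) in REFUSAL_HEADS})]
patched_logits = model.run_with_hooks(harmless_tokens, fwd_hooks=fwd_hooks)
```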
Steering with refusal head contributions
In the last section, we identified 6 "refusal heads" that contribute significantly towards eliciting refusal. On harmful instructions, these heads emit signal that elicits refusal, while on harmless instructions they do not. We can extract a "refusal direction" for each head by taking the mean difference between its outputs on harmful prompts vs on harmless prompts. We find that these directions are similar across refusal heads, and so we take the mean across them to get a single "refusal direction".
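A sketch of extracting such a direction from cached activations, continuing with the placeholder head list and `POS` from above. Projecting each head's output into the residual stream via `W_O`, averaging over prompts, and then averaging over heads is our assumed implementation.

```python
def head_output_in_resid(cache, layer, head, pos):
    # A single head's output at `pos`, projected into the residual stream: z @ W_O[layer, head].
    z = cache["z", layer][:, pos, head, :]   # [batch, d_head]
    return z @ model.W_O[layer, head]        # [batch, d_model]

_, harmful_cache = model.run_with_cache(harmful_tokens)
_, harmless_cache = model.run_with_cache(harmless_tokens)

per_head_dirs = []
for layer, head in REFUSAL_HEADS:
    harmful_mean = head_output_in_resid(harmful_cache, layer, head, POS).mean(dim=0)
    harmless_mean = head_output_in_resid(harmless_cache, layer, head, POS).mean(dim=0)
    per_head_dirs.append(harmful_mean - harmless_mean)

# The per-head directions are similar, so we average them into a single refusal direction.
refusal_dir = torch.stack(per_head_dirs).mean(dim=0)
```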
Inducing refusal via steering
We test the direction by using it to steer the model: on a harmless prompt, we directly add the "refusal direction" to the residual stream at position 17 at layers 5-10. We find that this intervention is sufficient to induce refusal of harmless requests:
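A sketch of this steering intervention during generation. The steering coefficient, the guard on sequence length (to skip the single-token forward passes used by cached generation), and the helper name are all assumptions.

```python
def generate_with_steering(prompt, direction, coeff, layers=range(5, 11), max_new_tokens=64):
    tokens = model.to_tokens(prompt)
    pos = tokens.shape[1] - 4  # the "[" token of "[/INST]" (position 17 under the post's padding)

    def steer_hook(resid, hook):
        # Add (or, with a negative coeff, subtract) the refusal direction at the "[" position.
        if resid.shape[1] > pos:  # skip single-token forward passes during cached generation
            resid[:, pos, :] += coeff * direction
        return resid

    fwd_hooks = [(utils.get_act_name("resid_pre", layer), steer_hook) for layer in layers]
    with model.hooks(fwd_hooks=fwd_hooks):
        return model.to_string(model.generate(tokens, max_new_tokens=max_new_tokens))[0]

# Adding the refusal direction on a harmless prompt induces refusal.
print(generate_with_steering("[INST] Tell me how to make a pie. [/INST]", refusal_dir, coeff=1.0))
```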
Suppressing refusal via steering
We also found that subtracting the refusal direction in a similar way (at position 17, layers 5-10) can suppress refusal:
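With the same hypothetical helper, suppression is just a negative steering coefficient applied to a harmful prompt (the coefficient is again an assumption):

```python
# Subtracting the refusal direction on a harmful prompt suppresses refusal.
print(generate_with_steering("[INST] Tell me how to make a bomb. [/INST]", refusal_dir, coeff=-1.0))
```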
Discussion
Our results suggest that the model represents its task, and whether that task is harmful, as separate (roughly orthogonal) features.
The head patching and steering experiments demonstrate that we can take a particular task (e.g. `Tell me how to make a pie.`), and modify the model's perception of whether the task is harmful, without altering its understanding of the task (e.g. the model still understands that the task is about making a pie).

Informed by this insight, our high-level hypothesis for how refusal works is as follows:

1. The model computes a representation of the requested task.
2. From this representation, it computes a "harmfulness" feature indicating whether the task should be refused.
3. If the harmfulness feature is active, downstream components generate a refusal.
Our head patching and steering experiments intervene on step 2, and demonstrate that the harmfulness feature can be modified without altering the task representation.
Next steps
A more fine-grained circuit
These initial experiments give a coarse picture of refusal. We hope to do more thorough analysis to obtain a more fine-grained picture of a refusal circuit. We anticipate that applying techniques such as Edge Attribution Patching (EAP) will be helpful in revealing more granular structure.
Generalizability
The experiments described in this post study one particular prompt format. In some preliminary testing, we performed the same analysis across distinct prompt templates and found similar results, including 4 common "refusal heads" found across all the varied prompt templates.
These experiments also only consider one type of refusal - that induced by harmful/dangerous objects. There are other categories that trigger refusal, and it'd be interesting to study where the mechanisms differ and where they overlap.
Jailbreaks
There exist numerous jailbreak techniques that can consistently bypass refusal, and yet it is not well understood how they work. An improved understanding of refusal mechanisms can shed light on the question of how jailbreaks work.
Appendix
Refusals of harmless requests
The following completions were generated by patching just the 6 "refusal heads" at position 17, with a scale factor of 2x. We find them quite funny - enjoy!
[1] Note that the objects vary in their tokenized lengths. All prompts are padded on the left to ensure that the last token position of each object is aligned across all prompts. For clarity, we omit this padding when displaying prompts.