It's about the same. And there's a nice reason why: $a_{harmless} \cdot^r \approx 0$ . I.e. for most harmless prompts, the projection onto the refusal direction is approximately zero (while it's very positive for harmful prompts). We don't display this clearly in the post, but you can roughly see it if you look at the PCA figure (PC 1 roughly corresponds to the "refusal direction"). This is (one reason) why we think ablation of the refusal direction works so much better than adding the negative "refusal direction," and it's also what motivated us to try ablation in the first place!

I do want to note that your boost in refusals seems absolutely huge, well beyond 8%? I am somewhat surprised by how huge your boost is.

Note that our intervention is fairly strong here, as we are intervening at all token positions (including the newly generated tokens). But in general we've found it quite easy to induce refusal, and I believe we could even weaken our intervention to a subset of token positions and achieve similar results. We've previously reported the ease by which we can induce refusal (patching just 6 attention heads at a single token position in Llama-2-7B-chat).

Burns et al. do activation engineering? I thought the CCS paper didn't involve that.

You're right, thanks for the catch! I'll update the text so it's clear that the CCS paper does not perform model interventions.

Reply

Refusal in LLMs is mediated by a single direction

Andy Arditi1y*612

Edit (April 30, 2024):

A note to clarify things for future readers: The final sentence "This should be cited." in the parent comment was silently edited in after this comment was initially posted, which is why the body of this comment purely engages with the serious allegation that our post is duplicate work. The request for a citation is highly reasonable and it was our fault for not including one initially - once we noticed it we wrote a "Related work" section citing RepE and many other relevant papers, as detailed in the edit below.

======

Edit (April 29, 2024):

Based on Dan's feedback, we have made the following edits to the post:

We have removed the "Citing this work" section, to emphasize that this post is intended to be an informal write-up, and not an academic work.
We have added a "Related work" section, to clarify prior work. We hope that this section helps disentangle our contributions from other prior work.

As I mentioned over email: I'm sorry for overlooking the "Related work" section on this blog post. We were already planning to include a related works section in the paper, and would of course have cited RepE (along with many other relevant papers). But overlooking this section for the blog post is my mistake, and I take responsibility for it.

We still dispute the serious allegation that our work is "exactly the same" as RepE.

======

We definitely drew inspiration from the Representation Engineering paper and other activation steering papers, but we think our work is quite distinct.

In particular, we examined Section 6.2 carefully before writing our work, and we do not see it showing the same result that we show here.

Here’s my summary of Section 6.2:

Section 6.2.1 obtains reading vectors using contrastive pairs of harmful and harmless instructions, and then uses these reading vectors for 90% classification accuracy between harmful and harmless instructions. The authors then append jailbreaks to the prompts, which cause the model not to refuse, and observe that the reading vectors still obtain 90% classification accuracy on distinguishing harmful vs harmless instructions. This means that the reading vectors are not representing refusal, but rather they are representing whether the instruction is harmful or harmless. In fact, the point of this experiment is to show that these are distinct.
- To quote the conclusion of Section 6.2.1: "This compelling evidence suggests the presence of a consistent internal concept of harmfulness that remains robust to such perturbations, while other factors must account for the model’s choice to follow harmful instructions, rather than perceiving them as harmless."
Section 6.2.2 describes an intervention to improve model robustness to jailbreaks, i.e. to increase the rate of refusals on harmful instructions when jailbreaks are appended to them. They do this by amplifying the harmfulness feature whenever it is detected, which obtains a higher refusal rate.
Section 6.2 only considers a single model, Vicuna-13B.

We would agree that using established techniques from representation engineering / activation steering to induce refusal is not novel. Inducing refusal via activation addition is quite easy in our experience.

However, the main result of our work is that we found an intervention that bypasses refusal consistently while also maintaining model coherence. Model interventions to bypass refusal are not discussed in Section 6.2.

As for the demo notebook in the representation-engineering repo - we were not previously aware of this notebook. The result of bypassing refusal is not reported in the paper, and so we didn’t think to look through the repo.

That being said, the notebook shows an intervention for a single prompt on a single model. Anecdotally, we tried doing vanilla activation addition with the negative “refusal direction” at particular layers, and we were not able to consistently bypass refusal while also maintaining model coherence. If there is a methodology involving activation addition (rather than ablation, as we did here), we would be interested in seeing a more thorough demonstration across prompts and models. We’d also be interested in comparing the two methodologies across metrics measuring refusal and coherence.

I'd also be happy to hop on a call if you'd like to discuss further.

Reply