User Comment Replies — AI Alignment Forum

Refusal in LLMs is mediated by a single direction

Was it substantially less effective to instead use $a_{harmless}' \leftarrow a_{harmless} + (\text{avg\_proj}_{harmful})\,\hat{r}$?

It's about the same. And there's a nice reason why: $a_{harmless} \cdot \hat{r} \approx 0$. I.e. for most harmless prompts, the projection onto the refusal direction is approximately zero (while it's very positive for harmful prompts). We don't display this clearly in the post, but you can roughly see it if you look at the PCA figure (PC 1 roughly corresponds to the "refusal direction"). This is (one reason) why we think ablation of the refusal direction works so much better than... (read more)
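To make the "about the same" claim concrete, here is a minimal NumPy sketch on synthetic data. It assumes the alternative being compared against is "ablate the refusal direction, then add the average harmful projection"; that reading, the hidden size, the synthetic activations, and all variable names are illustrative assumptions, not taken from the post. The two variants differ by exactly $(a_{harmless} \cdot \hat{r})\,\hat{r}$, so the gap is small whenever the harmless projection is near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # hypothetical hidden size, chosen only for illustration

# Hypothetical unit-norm "refusal direction".
r_hat = rng.normal(size=d_model)
r_hat /= np.linalg.norm(r_hat)


def project_out(a, r_hat):
    """Remove the component of each row of `a` along the unit vector r_hat."""
    return a - np.outer(a @ r_hat, r_hat)


# Synthetic harmless activations: projection onto r_hat is close to zero.
a_harmless = project_out(rng.normal(size=(100, d_model)), r_hat)
a_harmless += np.outer(rng.normal(scale=0.05, size=100), r_hat)

# Synthetic harmful activations: projection onto r_hat is large and positive.
a_harmful = project_out(rng.normal(size=(100, d_model)), r_hat)
a_harmful += np.outer(rng.normal(loc=8.0, scale=0.5, size=100), r_hat)

avg_proj_harmful = float(np.mean(a_harmful @ r_hat))

# Variant 1: add the average harmful projection without ablating first.
added_only = a_harmless + avg_proj_harmful * r_hat

# Variant 2 (assumed baseline): ablate, then add, so the projection
# along r_hat becomes exactly avg_proj_harmful.
ablate_then_add = project_out(a_harmless, r_hat) + avg_proj_harmful * r_hat

# Per-prompt gap between the variants; equals |a_harmless . r_hat|.
gap = np.linalg.norm(added_only - ablate_then_add, axis=1)
```

On real activations the size of `gap` would depend on how close the harmless projections actually are to zero, which this synthetic setup simply assumes.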

Refusal in LLMs is mediated by a single direction

Andy Arditi1y*612

Edit (April 30, 2024):

A note to clarify things for future readers: The final sentence "This should be cited." in the parent comment was silently edited in after this comment was initially posted, which is why the body of this comment purely engages with the serious allegation that our post is duplicate work. The request for a citation is highly reasonable and it was our fault for not including one initially - once we noticed it we wrote a "Related work" section citing RepE and many other relevant papers, as detailed in the edit below.

======

Edit (April 29, ... (read more)

0Dan H1y

From Andy Zou:

Thank you for your reply. We perform model interventions to robustify refusal (your section on “Adding in the "refusal direction" to induce refusal”). Bypassing refusal, which we do in the GitHub demo, is merely adding a negative sign to the direction. Either of these experiments show refusal can be mediated by a single direction, in keeping with the title of this post. Not mentioning it anywhere in your work is highly unusual given its extreme similarity. Knowingly not citing probably the most related experiments is generally considered plagiarism or citation misconduct, though this is a blog post so norms for thoroughness are weaker.

(lightly edited by Dan for clarity)

We perform a linear combination operation on the representation. Projecting out the direction is one instantiation of it with a particular coefficient, which is not necessary as shown by our GitHub demo. (Dan: we experimented with projection in the RepE paper and didn't find it was worth the complication. We look forward to any results suggesting a strong improvement.)

--

Please reach out to Andy if you want to talk more about this.

Edit: The work is prior art (it's been over six months+standard accessible format), the PIs are aware of the work (the PI of this work has spoken about it with Dan months ago, and the lead author spoke with Andy about the paper months ago), and its relative similarity is probably higher than any other artifact. When this is on arXiv we're asking you to cite the related work and acknowledge its similarities rather than acting like these have little to do with each other/not mentioning it. Retaliating by some people dogpile voting/ganging up on this comment to bury sloppy behavior/an embarrassing oversight is not the right response (went to -18 very quickly).

Edit 2: On X, Neel "agree[s] it's highly relevant" and that he'll cite it. Assuming it's covered fairly and reasonably, this resolves the situation.

Edit 3: I think not citing it isn't a big de


All of Andy Arditi's Comments + Replies