Edit (April 30, 2024):
A note to clarify things for future readers: The final sentence "This should be cited." in the parent comment was silently edited in after this comment was initially posted, which is why the body of this comment purely engages with the serious allegation that our post is duplicate work. The request for a citation is highly reasonable and it was our fault for not including one initially - once we noticed it we wrote a "Related work" section citing RepE and many other relevant papers, as detailed in the edit below.
======
Edit (April 29, ...
It's about the same. And there's a nice reason why: aharmless⋅^r≈0. I.e. for most harmless prompts, the projection onto the refusal direction is approximately zero (while it's very positive for harmful prompts). We don't display this clearly in the post, but you can roughly see it if you look at the PCA figure (PC 1 roughly corresponds to the "refusal direction"). This is (one reason) why we think ablation of the refusal direction works so much better than... (read more)