All of Clément Dumas's Comments + Replies

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors, and I'm looking forward to your next findings. Just a quick question: in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation?

Andrew Mack
Yes, the learned vectors are always applied at every token (for all examples).
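To make "applied at every token" concrete, here is a minimal sketch of that kind of intervention; the model ("gpt2"), layer index, and random vector below are illustrative placeholders, not the actual setup from the post.

```python
# Rough sketch of "applied at every token": add a fixed vector to the residual
# stream at one layer via a forward hook, so the prompt and every generated token
# are steered. Model, layer index, and vector are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer = 6                                                 # hypothetical layer to steer
steering_vector = torch.randn(model.config.hidden_size)   # stand-in for a learned vector

def add_vector(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the residual stream with shape
    # (batch, seq_len, d_model), so adding here steers every token position.
    return (output[0] + steering_vector.to(output[0].dtype),) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(add_vector)
try:
    ids = tok("Hello, how are you?", return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
finally:
    handle.remove()
```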

Let's assume the prompt template is Q [true/false] [banana/shred], with the same [banana/shred] word for both elements of a contrast pair.

If I understand correctly, they don't claim that the probe p learned has_banana, but that the induced classifier p̃(q) = ½(p(x⁺) + 1 − p(x⁻)) learned has_banana. Moreover, evaluating p̃ for p(x) = has_banana(x)⊕is_true(x), where is_true(x) is whether the prompt asserts "true", gives:

p̃(q) = ½((has_banana(q)⊕1) + 1 − (has_banana(q)⊕0)) = 1 − has_banana(q)

and this p has zero CCS loss, since p(x⁺) = 1 − p(x⁻) and one of the two is always 0.

Therefore, we can learn a p whose induced classifier p̃ is a banana classifier (up to the usual CCS sign ambiguity).
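A quick toy check of the arithmetic above (an illustrative sketch assuming the standard CCS consistency + confidence loss from Burns et al. and the averaged inference rule, not code from the paper):

```python
# Toy check: the probe p(x) = has_banana(x) XOR is_true(x) gets zero CCS loss on
# every contrast pair, and its induced classifier p~ depends only on has_banana.
# Assumes the standard CCS loss (p(x+) - (1 - p(x-)))^2 + min(p(x+), p(x-))^2 and
# p~(q) = 0.5 * (p(x+) + 1 - p(x-)); "is_true" is read as "the prompt asserts true".

def probe(has_banana: int, asserts_true: int) -> int:
    return has_banana ^ asserts_true  # p(x) = has_banana(x) XOR is_true(x)

for has_banana in (0, 1):
    p_pos = probe(has_banana, 1)      # x+ = "Q true  [banana/shred]"
    p_neg = probe(has_banana, 0)      # x- = "Q false [banana/shred]"
    loss = (p_pos - (1 - p_neg)) ** 2 + min(p_pos, p_neg) ** 2
    p_tilde = 0.5 * (p_pos + 1 - p_neg)
    print(f"has_banana={has_banana}: CCS loss={loss}, p~={p_tilde}")

# Prints CCS loss=0 for both rows, with p~=1.0 when has_banana=0 and p~=0.0 when
# has_banana=1, i.e. p~ is exactly 1 - has_banana: a banana classifier.
```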

Sam Marks
EDIT: Nevermind, I don't think the above is a reasonable explanation of the results, see my reply to this comment.

Original comment: Gotcha, that seems like a possible interpretation of the stuff that they wrote, though I find it a bit surprising that CCS learned the probe p(x)=has_banana(x)⊕is_true(x) (and think they should probably remark on this). In particular, based on the dataset visualizations in the paper, it doesn't seem possible for a linear probe to implement has_banana(x)⊕is_true(x). But it's possible that if you were to go beyond the 3 dimensions shown, the true geometry would look more like the following (from here) (+ a lateral displacement between the two datasets). In this case, a linear probe could learn an xor just fine.
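To illustrate the geometric point with made-up coordinates (not the paper's data): a linear probe can only fit has_banana⊕is_true if some direction in the representation tracks the xor itself, e.g. the lateral displacement between the two datasets described above.

```python
# Made-up coordinates illustrating the point above (not the paper's data):
# in a flat square where only has_banana and is_true are represented, a linear
# probe cannot fit their XOR; add a displacement direction that tracks the XOR
# (the "lateral displacement between the two datasets") and it becomes trivial.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
has_banana = rng.integers(0, 2, size=400)
is_true = rng.integers(0, 2, size=400)
xor = has_banana ^ is_true
noise = 0.05 * rng.standard_normal((400, 4))

flat = np.stack([has_banana, is_true, np.zeros(400), np.zeros(400)], axis=1) + noise
displaced = np.stack([has_banana, is_true, np.zeros(400), xor], axis=1) + noise

for name, X in [("flat", flat), ("displaced", displaced)]:
    acc = LogisticRegression(max_iter=1000).fit(X, xor).score(X, xor)
    print(f"{name:9s} linear probe accuracy on the xor: {acc:.2f}")

# Typical output: well below 1.00 for "flat", ~1.00 for "displaced".
```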