I'm a CS master's student at ENS Paris-Saclay. I want to pursue a career in AI safety research.
https://butanium.github.io/
Let's assume the prompt template is Q [true/false] [banana/shred]
If I understand correctly, they don't claim learned has_banana but learned has_banana. Moreover evaluating for gives:
Therefore, we can learn a that is a banana classifier
Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised method with steering vector, looking forward to your next findings. Just a quick question : in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation ?