Thanks for the great post, I really enjoyed reading it!
I love this research direction combining unsupervised method with steering vector, looking forward to your next findings.
Just a quick question : in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation ?
Let's assume the prompt template is x= Q [true/false] [banana/shred]
If I understand correctly, they don't claim p learned has_banana but ~p=p(x⁺)+(1−p(x⁻))2 learned has_banana. Moreover evaluating ~p for p=is_true(x)⊕is_shred(x) gives:
Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised method with steering vector, looking forward to your next findings. Just a quick question : in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation ?