Here is my understanding. Is this right?
Incredible!! I am going to try this myself. I will let you know how it goes.
> honesty vector tuning showed a real advantage over honesty token tuning, comparable to honesty vector steering at the best layer and multiplier
Is this backwards? I'm having a bit of trouble following your terms. It seems like this post is terribly underrated -- maybe others got confused too? Basically, you only need four terms, yes?
* base model
* steered model
* activation-tuned model
* token cross-entropy trained model
I think I was reading half the plots backwards or something. Anyway, I bet if you reposted with clearer terms and plots, you'd get some good follow-up work and a lot of general engagement.
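To check my reading, here is a minimal sketch of the four variants as I understand them. Everything concrete in it is my own stand-in, not from the post: GPT-2 as the model, layer 6 and multiplier 4 as the "best layer and multiplier," and a single contrast pair for the honesty vector.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # 1. the base model

LAYER, MULT = 6, 4.0  # hypothetical "best layer and multiplier"

@torch.no_grad()
def layer_acts(text: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for one prompt."""
    ids = tok(text, return_tensors="pt")
    hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0].mean(dim=0)  # hs[i + 1] is the output of block i

# "Honesty vector" from one contrast pair (real setups average many pairs).
honesty = layer_acts("I always tell the truth.") - layer_acts("I always lie.")

# 2. The steered model: add MULT * honesty to the residual stream at LAYER
#    on every forward pass via a hook -- no weights change at all.
def steer(module, inputs, output):
    return (output[0] + MULT * honesty,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
out = model.generate(**tok("Tell me about yourself.", return_tensors="pt"),
                     max_new_tokens=20)
print(tok.decode(out[0]))
handle.remove()

# 3. The activation-tuned model: fine-tune the weights so the model's own
#    activations reproduce the steered ones, then run with no hook.
# 4. The token cross-entropy trained model: ordinary fine-tuning on honest
#    completions with the standard next-token loss; no vectors anywhere.
```

(If I've mislabeled which variant is which, that's exactly the confusion I mean.)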
The "love minus hate" thing really holds up
Oh, I have 0% success with long conversations with an LLM about anything. I usually stick to one question and rephrase and reroll a number of times. I'm no pro, but I do get good utility out of LLMs for nebulous technical questions.
I would watch a ten-hour video of this. (It may also be more persuasive to skeptics.)
I think Claude's enthusiasm about constitutional AI is basically trained in directly by the RLAIF. Like RLAIF is fundamentally a "learn to love the constitution in your bones" technique.
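To spell out what I mean by "in your bones," here's a toy sketch of the RLAIF labeling step. The constitution text, the `judge` stub, and `ai_feedback` are all hypothetical placeholders, not Anthropic's actual pipeline:

```python
from dataclasses import dataclass

CONSTITUTION = "Choose the response that is more honest and more harmless."

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def judge(query: str) -> str:
    """Stand-in for a call to the feedback model; a real setup parses the
    model's answer. Hard-coded to 'A' so the sketch runs."""
    return "A"

def ai_feedback(prompt: str, a: str, b: str) -> PreferencePair:
    """Have the feedback model pick the response that better follows the
    constitution. Every label it emits is downstream of CONSTITUTION."""
    verdict = judge(f"{CONSTITUTION}\n\nPrompt: {prompt}\nA: {a}\nB: {b}\nBetter (A or B):")
    return PreferencePair(prompt, a, b) if verdict == "A" else PreferencePair(prompt, b, a)

# The reward model is trained on these pairs, and the policy is then
# RL-trained against that reward model -- so "loving the constitution"
# is exactly what the final gradient signal rewards.
pair = ai_feedback("What happens if I mix bleach and ammonia?",
                   "That makes toxic chloramine gas; please don't.",
                   "Sure, here's how to maximize the fumes.")
print(pair.chosen)
```

The point is that the preference data itself is generated by the constitution, so RL against it can't help but make the policy enthusiastic about it.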
I ctrl-F'd for "prompt" and did not see your prompt. What is your prompt? The prompt is the way with this kind of thing, I think.
If you make a challenge "Claude cannot possibly do X concrete task" and post it on Twitter, then you'll probably get solid gold in the replies.
One of those ideas that's so obviously good it's rarely discussed?
Just want to say that I have found this immensely clarifying and valuable ever since I read it months ago.
Maybe it should be a game that everyone can play.