Is this backwards? I'm having a bit of trouble following your terms. Seems like this post is terribly underrated -- maybe others also got confused? Basically, you only need 4 terms, yes?

* base model
* steered model
* activation-tuned model
* token cross-entropy trained model

I think I was reading half the plots backwards or something. Anyway I bet if you reposted with clearer terms/plots then you'd get some good followup work and a lot of general engagement.

Refusal in LLMs is mediated by a single direction

lemonhope1y410

The "love minus hate" thing really holds up

LLMs for Alignment Research: a safety priority?

lemonhope1y10

Oh I have 0% success with any long conversations with an LLM about anything. I usually stick to one question, then rephrase and reroll a number of times. I am no pro, but I do get good utility out of LLMs for nebulous technical questions.

Specification gaming: the flip side of AI ingenuity

lemonhope1y10

I would watch a ten hour video of this. (It may also be more persuasive to skeptics.)

LLMs for Alignment Research: a safety priority?

lemonhope1y1-1

I think Claude's enthusiasm about constitutional AI is basically trained in directly by RLAIF. Like, RLAIF is fundamentally a "learn to love the constitution in your bones" technique.

LLMs for Alignment Research: a safety priority?

lemonhope1y10

I ctrl-f'd for 'prompt' and did not see your prompt. What is your prompt? The prompt is the way with this kind of thing I think.

If you make a challenge like "Claude cannot possibly do X concrete task" and post it on Twitter, then you'll probably get solid gold in the replies.
