lemonhope - AI Alignment Forum

Is this backwards? I'm having a bit of trouble following your terms. Seems like this post is terribly underrated -- maybe others also got confused? Basically, you only need 4 terms, yes?

* base model
* steered model
* activation-tuned model
* token cross-entropy trained model

I think I was reading half the plots backwards or something. Anyway I bet if you reposted with clearer terms/plots then you'd get some good followup work and a lot of general engagement.

Refusal in LLMs is mediated by a single direction

lemonhope1y410

The "love minus hate" thing really holds up

LLMs for Alignment Research: a safety priority?

lemonhope1y10

Oh I have 0% success with any long conversations with an LLM about anything. I usually stick to one question and rephrase and reroll a number of times. I am no pro but I do get good utility out of LLMs for nebulous technical questions

Specification gaming: the flip side of AI ingenuity

lemonhope1y10

I would watch a ten hour video of this. (It may also be more persuasive to skeptics.)

LLMs for Alignment Research: a safety priority?

lemonhope1y1-1

I think Claude's enthusiasm about constitutional AI is basically trained-in directly by the RLAIF. Like RLAIF is fundamentally a "learn to love the constitution in your bones" technique.

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments