AI ALIGNMENT FORUM

lemonhope
Ω749364

Comments
Sorted by Newest

Distillation Robustifies Unlearning
lemonhope · 3mo · 2332

Many props for doing the most obvious thing that clearly actually works.

Steering Gemini with BiDPO
lemonhope · 7mo · 20

What do you think is the ideal use case for steering? Or is it not needed?

Rebuttals for ~all criticisms of AIXI
lemonhope · 8mo · 30

Could you say more about where the whole sequence is going / what motivated it? I am curious.

Which evals resources would be good?
lemonhope · 10mo · 1-1

Maybe it should be a game that everyone can play.

Representation Tuning
lemonhope · 1y · 10

Here is my understanding. Is this right?

 

Representation Tuning
lemonhope · 1y · 10

Incredible!! I am going to try this myself. I will let you know how it goes.

"honesty vector tuning showed a real advantage over honesty token tuning, comparable to honesty vector steering at the best layer and multiplier"

Is this backwards? I'm having a bit of trouble following your terms. It seems like this post is terribly underrated; maybe others also got confused? Basically, you only need four terms, yes?

* base model
* steered model
* activation-tuned model
* token cross-entropy trained model

I think I was reading half the plots backwards or something. Anyway, I bet if you reposted with clearer terms/plots you'd get some good follow-up work and a lot of general engagement.
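To check that I'm parsing the terms right, here is a toy sketch of how I understand the four variants. This is entirely my own made-up code and naming (a tiny stand-in model, a guessed activation-tuning loss), not anything from your post; I'm just trying to pin down "steer at inference" vs "tune the weights toward the steered activations" vs "tune on tokens":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyModel(nn.Module):
    """Stand-in for a transformer; `hidden` plays the role of one residual-stream layer."""

    def __init__(self, d=16, vocab=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.hidden = nn.Linear(d, d)
        self.unembed = nn.Linear(d, vocab)

    def forward(self, tokens, steer_vec=None):
        h = self.hidden(self.embed(tokens))
        if steer_vec is not None:  # steered model: add a fixed direction at inference only
            h = h + steer_vec
        return self.unembed(h)


def token_tune(model, tokens, targets, steps=100, lr=1e-3):
    """Token cross-entropy trained model: ordinary fine-tuning on target completions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        logits = model(tokens)  # (batch, seq, vocab)
        loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


def activation_tune(model, tokens, steer_vec, steps=100, lr=1e-3):
    """Activation-tuned model: push the weights to reproduce the steered activations,
    so nothing needs to be added at inference time."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        h = model.hidden(model.embed(tokens))  # un-steered activations
        target = (h + steer_vec).detach()      # what the steered model would have produced
        loss = F.mse_loss(h, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


# Base model = TinyModel() untouched; steered model = the base called with steer_vec.
```

If that mapping is wrong somewhere, that is probably exactly where I misread the plots.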

Refusal in LLMs is mediated by a single direction
lemonhope · 1y · 410

The "love minus hate" thing really holds up

LLMs for Alignment Research: a safety priority?
lemonhope · 1y · 10

Oh, I have 0% success with long conversations with an LLM about anything. I usually stick to one question and rephrase and reroll a number of times. I am no pro, but I do get good utility out of LLMs for nebulous technical questions.

Specification gaming: the flip side of AI ingenuity
lemonhope · 1y · 10

I would watch a ten-hour video of this. (It may also be more persuasive to skeptics.)

LLMs for Alignment Research: a safety priority?
lemonhope · 1y · 1-1

I think Claude's enthusiasm about constitutional AI is basically trained in directly by the RLAIF. RLAIF is fundamentally a "learn to love the constitution in your bones" technique.
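Concretely, what I mean by "in your bones": in RLAIF the preference labels themselves come from a model reading the constitution, so "follows the constitution" is literally the reward signal the policy is optimized against. A schematic sketch; the judge prompt, names, and two-sample setup are my own stand-ins, not Anthropic's actual pipeline:

```python
CONSTITUTION = "Choose the response that is more helpful, honest, and harmless."


def ai_preference(judge, prompt, response_a, response_b):
    """AI feedback step: a judge model picks whichever response better fits the constitution."""
    verdict = judge(
        f"{CONSTITUTION}\n\nPrompt: {prompt}\nA: {response_a}\nB: {response_b}\n"
        "Which response better follows the principle above? Answer A or B."
    )
    return 0 if verdict.strip().upper().startswith("A") else 1


def build_preference_dataset(policy, judge, prompts):
    """Collect (prompt, response pair, AI-chosen winner) triples."""
    data = []
    for p in prompts:
        pair = (policy(p), policy(p))  # two samples from the current policy
        data.append((p, pair, ai_preference(judge, p, *pair)))
    return data


# A reward model is then fit to these AI-generated labels and the policy is RL-trained
# against it, so every update pushes the policy toward what the constitution-reading
# judge prefers.
```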

Wikitag Contributions
No wikitag contributions to display.

Posts
Sorted by New

0 · aimless ace analyzes active amateur: a micro-aaaaalignment proposal · 1y · 0
1 · Speedrun ruiner research idea · 1y · 1
6 · Series of absurd upgrades in nature's great search · 2y · 0
6 · We can do better than DoWhatIMean (inextricably kind AI) · 2y · 2
5 · More money with less risk: sell services instead of model access · 3y · 0
12 · Inner alignment: what are we pointing at? · 3y · 2
4 · AI-assisted list of ten concrete alignment things to do right now · 3y · 1
4 · Do yourself a FAVAR: security mindset · 3y · 0
2 · lemonhope's Shortform · 6y · 0
5 · Creating Environments to Design and Test Embedded Agents · 6y · 4