AI ALIGNMENT FORUM

lemonhope
Ω749364

Wikitag Contributions
No wikitag contributions to display.

Comments (sorted by newest)

Distillation Robustifies Unlearning
lemonhope · 3mo · 2332

Many props for doing the most obvious thing that clearly actually works.

Steering Gemini with BiDPO
lemonhope · 7mo · 20

What do you think is the ideal use-case for steering? Or is it not needed?

Rebuttals for ~all criticisms of AIXI
lemonhope · 8mo · 30

Could you say more about where the whole sequence is going / what motivated it? I am curious.

Which evals resources would be good?
lemonhope · 10mo · 1-1

Maybe it should be a game that everyone can play.

Representation Tuning
lemonhope · 1y · 10

Here is my understanding. Is this right?

Representation Tuning
lemonhope · 1y · 10

Incredible!! I am going to try this myself. I will let you know how it goes.

> honesty vector tuning showed a real advantage over honesty token tuning, comparable to honesty vector steering at the best layer and multiplier:

Is this backwards? I'm having a bit of trouble following your terms. Seems like this post is terribly underrated -- maybe others also got confused? Basically, you only need four terms, yes? (Rough sketch of the distinction below.)

* base model
* steered model
* activation-tuned model
* token cross-entropy trained model

I think I was reading half the plots backwards or something. Anyway I bet if you reposted with clearer terms/plots then you'd get some good followup work and a lot of general engagement.
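
To make sure I have the four terms straight, here is a minimal toy sketch in PyTorch. ToyLM, honesty_vec, and the training loops are made-up stand-ins for illustration, not the post's actual setup.

```python
# Toy illustration of the four variants: base model, steered model,
# activation-tuned model, and token cross-entropy trained model.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, vocab = 32, 100

class ToyLM(nn.Module):
    """Stand-in for a transformer: embed -> two hidden blocks -> unembed."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.block1 = nn.Linear(d, d)
        self.block2 = nn.Linear(d, d)
        self.unembed = nn.Linear(d, vocab)

    def forward(self, tokens, steer_vec=None):
        h = self.embed(tokens)
        h = torch.relu(self.block1(h))
        if steer_vec is not None:      # "steered model": add a fixed direction
            h = h + steer_vec          # at inference time, weights untouched
        h = torch.relu(self.block2(h))
        return self.unembed(h), h      # logits and the intermediate activations

base = ToyLM()                         # 1. base model
honesty_vec = torch.randn(d)           # stand-in for a real honesty direction
                                       # (e.g. mean difference of activations)

tokens = torch.randint(0, vocab, (4, 8))
targets = torch.randint(0, vocab, (4, 8))

# 2. steered model: same weights, vector added on the forward pass
steered_logits, _ = base(tokens, steer_vec=honesty_vec)

# 3. activation-tuned model: train the weights so the intermediate activations
#    themselves align with the honesty direction (no vector added at inference)
act_tuned = ToyLM(); act_tuned.load_state_dict(base.state_dict())
opt = torch.optim.Adam(act_tuned.parameters(), lr=1e-3)
for _ in range(100):
    _, acts = act_tuned(tokens)
    loss = -nn.functional.cosine_similarity(
        acts, honesty_vec.expand_as(acts), dim=-1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 4. token cross-entropy trained model: ordinary fine-tuning on "honest" targets
tok_tuned = ToyLM(); tok_tuned.load_state_dict(base.state_dict())
opt = torch.optim.Adam(tok_tuned.parameters(), lr=1e-3)
for _ in range(100):
    logits, _ = tok_tuned(tokens)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the sketch: the steered model never changes its weights (the vector is only added on the forward pass), the activation-tuned model changes weights so the activations themselves carry the direction, and the token-tuned model only ever sees a cross-entropy loss on tokens.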

Refusal in LLMs is mediated by a single direction
lemonhope · 1y · 410

The "love minus hate" thing really holds up.

LLMs for Alignment Research: a safety priority?
lemonhope · 1y · 10

Oh, I have 0% success with any long conversations with an LLM about anything. I usually stick to one question and rephrase and reroll a number of times. I am no pro, but I do get good utility out of LLMs for nebulous technical questions.

Specification gaming: the flip side of AI ingenuity
lemonhope · 1y · 10

I would watch a ten-hour video of this. (It may also be more persuasive to skeptics.)

LLMs for Alignment Research: a safety priority?
lemonhope · 1y · 1-1

I think Claude's enthusiasm about constitutional AI is basically trained-in directly by the RLAIF. Like RLAIF is fundamentally a "learn to love the constitution in your bones" technique.

Posts (sorted by new)
0 · aimless ace analyzes active amateur: a micro-aaaaalignment proposal · 1y · 0
2 · Speedrun ruiner research idea · 1y · 1
6 · Series of absurd upgrades in nature's great search · 2y · 0
6 · We can do better than DoWhatIMean (inextricably kind AI) · 2y · 2
5 · More money with less risk: sell services instead of model access · 3y · 0
12 · Inner alignment: what are we pointing at? · 3y · 2
4 · AI-assisted list of ten concrete alignment things to do right now · 3y · 1
4 · Do yourself a FAVAR: security mindset · 3y · 0
2 · lemonhope's Shortform · 6y · 0
5 · Creating Environments to Design and Test Embedded Agents · 6y · 4