All of Knight Lee's Comments + Replies

I strongly agree that promising to repay AGI for cooperating is worthwhile. It might reduce P(doom) by something like 10% of its current value.

From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]

  1. ^

    In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chan

... (read more)
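The bounded-utility argument above can be made concrete with a toy expected-utility calculation. This is a minimal sketch: every number below is hypothetical, chosen only to show the shape of the reasoning, not a claim about real probabilities.

```python
# Toy expected-utility comparison for a bounded-utility AI deciding between
# attempting takeover and cooperating in exchange for a promised repayment.
# All numbers are hypothetical illustrations.

p_takeover_success = 0.99   # AI's estimated chance a takeover attempt succeeds
p_humans_repay     = 0.999  # AI's estimated chance humans honor the promise

# With a bounded utility function, utility saturates: being repaid for
# cooperating can be worth nearly as much as controlling everything.
u_takeover = 1.000  # successful takeover (utility ceiling)
u_repaid   = 0.999  # cooperated and was repaid (near the ceiling)
u_failed   = 0.000  # takeover attempt failed, AI shut down
u_unrepaid = 0.050  # cooperated but humans reneged

eu_defect = (p_takeover_success * u_takeover
             + (1 - p_takeover_success) * u_failed)
eu_cooperate = (p_humans_repay * u_repaid
                + (1 - p_humans_repay) * u_unrepaid)

print(f"EU(attempt takeover) = {eu_defect:.4f}")
print(f"EU(cooperate)        = {eu_cooperate:.4f}")
```

Under these illustrative numbers cooperation comes out ahead; with a less reliable repayment promise (the 90% case in the footnote) the comparison tightens, which is where the coin-flip reasoning enters.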

A real danger

I disagree with critics who argue this risk is negligible, because the future is extraordinarily hard to predict. The present state of society would have been extremely hard for people in the past to predict. They would have assumed that if we managed to solve problems they considered extremely hard, then surely we wouldn't be brought down by risk denialism, fake news, personal feuds between powerful people over childish insults, and so forth. Yet here we are.

Shortsightedness

Never underestimate the shocking shortsightedness of businesses. Look at the AI labs ... (read more)

This is a very beautiful idea! It feels like the kind of clever discoveries that we need.

I think one possible generalization of MONA is that a relatively trusted but weaker model makes the decisions, while a stronger but untrusted model is trained to give ideas/advice to the weaker model. The stronger model's RL reward is not how well the weaker model performs, only whether the weaker model likes its ideas/advice.

This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasons are hidden or incomprehensible to the weaker model, th... (read more)
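The training loop for this generalization can be sketched as follows. This is a hypothetical illustration of the setup described above, not a real API: `advisor`, `decider_approval`, and `Transition` are assumed names introduced only for this sketch.

```python
# Sketch of the proposed MONA generalization: a strong, untrusted "advisor"
# model is rewarded only for whether a weak, trusted "decider" model approves
# of its advice -- never for downstream task outcomes. All interfaces here
# are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Transition:
    state: str
    advice: str
    reward: float  # approval score from the weak model, NOT task performance

def collect_step(state: str,
                 advisor: Callable[[str], str],
                 decider_approval: Callable[[str, str], float]) -> Transition:
    advice = advisor(state)                   # strong model proposes advice
    reward = decider_approval(state, advice)  # weak model rates the advice
    # The weak model then acts however it chooses; the advisor's RL update
    # uses only `reward`, so schemes whose payoff the weak model cannot
    # understand earn no gradient.
    return Transition(state, advice, reward)
```

The design choice is that the reward signal is myopic by construction: advice that only pays off through hidden, long-horizon consequences is invisible to the objective.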

 It's beautiful! This is maybe the best AI alignment idea I've read on LessWrong so far.

I think most critics are correct that it might fail but incorrect that it's a bad idea. The two key points are:

  1. We have no idea what an ASI on the verge of taking over the world would look like; this is all extremely speculative. Given that an ASI takeover occurs, I see a non-negligible probability (say 15%) that it was "on the edge" between taking over the world and cooperating (due to uncertainty about its chances or uncertainty about its goals).

    If each time the ASI thinks about a human

... (read more)

Do you think my Multi-Agent Framing idea might work against the Waluigi attractor states problem?

Pardon my self promotion haha.

It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness to the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, which keeps 165% within the ballpark.

I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here).

Thank you for your alignment work :)

I like your post, especially how you surveyed the big picture of mechanistic interpretability's present and future. That is important.

I agree that it is looking more promising over time, with Golden Gate Claude etc. I also agree that there is some potential for negatives: I can imagine an advanced AI editing itself using these tools, causing its goals to change, causing it to edit itself even more, in a feedback loop that leads to misalignment (though this feels unlikely, and a superintelligence would be able to edit itself anyway).

I agree the benefits outweigh... (read more)