All of Knight Lee's Comments + Replies

I strongly agree that promising to repay AGI for cooperating is worthwhile. It might reduce P(doom) by something like 10% of its current value.

From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]

  1. ^

    In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chan

... (read more)
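The bounded-utility argument above can be made concrete with a toy expected-utility calculation. This is a minimal sketch: every number below is hypothetical, chosen only to show the shape of the reasoning, not a claim about real probabilities.

```python
# Toy expected-utility comparison for a bounded-utility AI deciding between
# attempting takeover and cooperating in exchange for a promised repayment.
# All numbers are hypothetical illustrations.

p_takeover_success = 0.99   # AI's estimated chance a takeover attempt succeeds
p_humans_repay     = 0.999  # AI's estimated chance humans honor the promise

# With a bounded utility function, utility saturates: being repaid for
# cooperating can be worth nearly as much as controlling everything.
u_takeover = 1.000  # successful takeover (utility ceiling)
u_repaid   = 0.999  # cooperated and was repaid (near the ceiling)
u_failed   = 0.000  # takeover attempt failed, AI shut down
u_unrepaid = 0.050  # cooperated but humans reneged

eu_defect = (p_takeover_success * u_takeover
             + (1 - p_takeover_success) * u_failed)
eu_cooperate = (p_humans_repay * u_repaid
                + (1 - p_humans_repay) * u_unrepaid)

print(f"EU(attempt takeover) = {eu_defect:.4f}")
print(f"EU(cooperate)        = {eu_cooperate:.4f}")
```

Under these illustrative numbers cooperation comes out ahead; with a less reliable repayment promise (the 90% case in the footnote) the comparison tightens, which is where the coin-flip reasoning enters.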

A real danger

I disagree with critics who argue this risk is negligible, because the future is extraordinarily hard to predict. The present state of society would have been extremely hard for people in the past to predict. They would have assumed that if we managed to solve problems they considered extremely hard, then surely we wouldn't be brought down by risk denialism, fake news, personal feuds between powerful people over childish insults, and so forth. Yet here we are.

Shortsightedness

Never underestimate the shocking shortsightedness of businesses. Look at the AI labs ... (read more)

This is a very beautiful idea! It feels like the kind of clever discoveries that we need.

I think one possible generalization of MONA is that a relatively trusted but weaker model makes the decisions, while a stronger but untrusted model is trained to give ideas/advice to the weaker model. The stronger model's RL reward is not how well the weaker model performs, only whether the weaker model likes its ideas/advice.

This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasons are hidden or incomprehensible to the weaker model, th... (read more)
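The training loop for this generalization can be sketched as follows. This is a hypothetical illustration of the setup described above, not a real API: `advisor`, `decider_approval`, and `Transition` are assumed names introduced only for this sketch.

```python
# Sketch of the proposed MONA generalization: a strong, untrusted "advisor"
# model is rewarded only for whether a weak, trusted "decider" model approves
# of its advice -- never for downstream task outcomes. All interfaces here
# are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Transition:
    state: str
    advice: str
    reward: float  # approval score from the weak model, NOT task performance

def collect_step(state: str,
                 advisor: Callable[[str], str],
                 decider_approval: Callable[[str, str], float]) -> Transition:
    advice = advisor(state)                   # strong model proposes advice
    reward = decider_approval(state, advice)  # weak model rates the advice
    # The weak model then acts however it chooses; the advisor's RL update
    # uses only `reward`, so schemes whose payoff the weak model cannot
    # understand earn no gradient.
    return Transition(state, advice, reward)
```

The design choice is that the reward signal is myopic by construction: advice that only pays off through hidden, long-horizon consequences is invisible to the objective.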

 It's beautiful! This is maybe the best AI alignment idea I've read on LessWrong so far.

I think most critics are correct that it might fail but incorrect that it's a bad idea. The two key points are:

  1. We have no idea what an ASI on the verge of taking over the world would look like; this is all extremely speculative. Given that an ASI takeover occurs, I see a non-negligible probability (say 15%) that it was "on the edge" between taking over the world and cooperating (due to uncertainty about its chances or uncertainty about its goals).

    If each time the ASI thinks about a human

... (read more)

Do you think my Multi-Agent Framing idea might work against the Waluigi attractor states problem?

Pardon my self promotion haha.

It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness to the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, which keeps 165% within the ballpark.

I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here).

Thank you for your alignment work :)

I like your post, especially how you surveyed the big picture of mechanistic interpretability's present and future. That is important.

I agree that it is looking more promising over time, with Golden Gate Claude etc. I also agree that there is some potential for negatives: I can imagine an advanced AI editing itself using these tools, causing its goals to change, causing it to edit itself even more, in a feedback loop that leads to misalignment (though this feels unlikely, and a superintelligence would be able to edit itself anyway).

I agree the benefits outweigh... (read more)