This is a very beautiful idea! It feels like the kind of clever discovery we need.
I think one possible generalization of MONA is this: a relatively trusted but weaker model makes the decisions, while a stronger but untrusted model gets trained to give ideas/advice to the weaker model. The stronger model's RL goal is not how well the weaker model performs, just whether the weaker model likes its ideas/advice.
This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasoning is hidden from or incomprehensible to the weaker model, the stronger model can't get away with it. It won't be rewarded for learning such reasoning in the first place.
Just like in scalable oversight, the weaker model might use an architecture which improves alignment at a capability cost.
It's more general than MONA in the sense that the approval feedback can be swapped for any trusted but weaker model, one which doesn't just judge ideas but uses them. The weaker model is allowed to learn over time which ideas work better, but its learning process is relatively safer (due to its architecture, or whatever other reason we trust it more).
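Here is a minimal sketch of what I mean (not MONA's actual implementation; all function names are hypothetical placeholders): the stronger model's RL reward is only the weaker model's approval of its advice, never the downstream task outcome.

```python
from typing import Callable

def train_step(
    task: str,
    strong_advise: Callable[[str], str],         # untrusted, stronger model (the one being RL-trained)
    weak_approval: Callable[[str, str], float],  # trusted, weaker model's rating of the advice
    weak_act: Callable[[str, str], str],         # trusted, weaker model makes the actual decision
) -> tuple[str, float]:
    """One step of the generalized-MONA loop (illustrative only)."""
    advice = strong_advise(task)

    # The strong model's RL reward is ONLY the weak model's approval of the
    # advice, not how well the final action turns out. Hidden or
    # incomprehensible reasoning earns no reward.
    reward = weak_approval(task, advice)

    # The weak model decides what to actually do, using the advice however it likes.
    action = weak_act(task, advice)

    # (Here one would update the strong model's policy with `reward`, e.g. via PPO.)
    return action, reward


# Toy stand-ins so the sketch runs end to end:
action, reward = train_step(
    task="fix the failing unit test",
    strong_advise=lambda t: f"Suggestion: add a null check before parsing ({t}).",
    weak_approval=lambda t, a: 1.0 if "Suggestion" in a else 0.0,
    weak_act=lambda t, a: f"Applied my own judgment to: {a}",
)
print(action, reward)
```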
Do you think this is a next step worth exploring?
EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.
See also: Training on Documents About Reward Hacking Induces Reward Hacking, a study by Anthropic.
I disagree with critics who argue this risk is negligible, because the future is extraordinarily hard to predict. The present state of society would have been extremely hard for people in the past to predict. They would have assumed that if we managed to solve problems they considered extremely hard, then surely we wouldn't be brought down by risk denialism, fake news, personal feuds between powerful people over childish insults, and so forth. Yet here we are.
Never underestimate the shocking shortsightedness of businesses. Look at the AI labs, for example. Communists observing this phenomenon were quoted as saying "the capitalists will sell us the rope we hang them with."
It's not selfishness, it's bias. Businesspeople are not willing to destroy everything just to temporarily make an extra dollar—no human thinks like that! Instead, businesspeople are very smart and strategic but extraordinarily biased toward thinking that whatever keeps their business going or growing must be good for the people. Think of Stalin: very smart and strategic, but extraordinarily biased toward thinking that whatever kept him in power must be good for the people. It's not selfishness! If Stalin (or any dictator) were selfish, they would quickly retire and live the most comfortable retirement imaginable.
Humans evolved to be the most altruistic beings ever, with barely a drop of selfishness. Our selfish genes make us altruistic (as soon as power is within reach) because there's a thin line between "the best way to help others" and "amassing power at all costs." These two things look similar due to instrumental convergence, and it only takes a little bit of bias/delusion to make the former behave identically to the latter.
Even if gradual disempowerment doesn't directly starve people to death, it may raise misery and life dissatisfaction to civil war levels.
Collective anger may skyrocket to the point where people would rather have their favourite AI run the country than the current leader. They elect politicians loyal to a version of the AI, and intellectuals facepalm. The government buys the AI company for national security reasons, and the AI completely takes over its own development process with half the country celebrating. More people facepalm as politicians lick the boots of the "based" AI and parrot its wise words, e.g. "if you replace us with AI, we'll replace you with AI!"
While it is important to be aware of gradual disempowerment and for a few individuals to study it, my cause prioritization opinion is that only 1%-10% of the AI safety community should work on this problem.
The AI safety community is absurdly tiny. AI safety spending is less than 0.1% of AI capability spending, which in turn is less than 0.5% of world GDP.
The only way for the AI safety community to influence the world is to use its tiny resources to work on things which the majority of the world will never get a chance to work on.
This includes working on the risk of a treacherous turn, where an AGI/ASI suddenly turns against humanity. The majority of the world never gets a chance to work on this problem, because by the time they realize it is a big problem, it has probably already happened, and they are already dead.
Of course, working on gradual disempowerment early is better than working on gradual disempowerment later, but this argument applies to everything. Working on poverty earlier is better than working on poverty later. Working on world peace earlier is better than working on world peace later.
If further thorough research confirms that this risk has a high probability, then the main benefit is using it as an argument for AI regulation/pause, when society hasn't yet tasted the addictive benefits of AGI.
It is theoretically hard to convince people to avoid X for their own good, because once X is within reach it offers so much power or wealth that they cannot resist it anymore. But in practice, such an argument may work well here, since it is the elites who would be unable to resist, and people today have anti-elitist attitudes.
If the elites are worried the AGI will directly kill them, while the anti-elitists are half worried the AGI will directly kill them, and half worried [a cocktail of elites mixed with AGI] will kill them, then at least they can finally agree on something.
PS: have you seen Dan Hendrycks' arguments? They sort of look like gradual disempowerment.
It's beautiful! This is maybe the best AI alignment idea I've read on LessWrong so far.
I think most critics are correct that it might fail but incorrect that it's a bad idea. The two key points are:
1. We have no idea what an ASI about to take over the world looks like; it is extremely speculative. Given that an ASI takeover occurs, I see a non-negligible probability (say 15%) that the ASI was "on the edge" between taking over the world and cooperating (due to uncertainty about its chances or uncertainty about its goals).
2. If each time the ASI thinks about a human (or humanity), its thought processes regarding that human and their goals are a little more similar to its thought processes regarding itself and its own goals, that might push it towards cooperating. Given this ASI is capable of taking over the world, it is likely also capable of preventing the next ASI from taking over the world, and thus of saving the world. If your idea decreases the chance of doom by 10%, that is a very big idea worth a lot of attention!
Critics misunderstand the idea as making the AI unable to distinguish between itself and others, and thus unable to lie. That's not what the idea is about (right?). The idea is about reducing the tendency to think differently about oneself and others: minimizing this tendency while maximizing performance.
People use AI to do programming, engineering, inventing, and all these things can be done just as well with far less tendency to think differently about oneself and others.
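If I understand the idea correctly, one toy way to picture the training objective is an ordinary task loss plus a penalty on how differently the model represents itself versus others. This is only an illustrative sketch under my own assumptions (the weight `lam` and the activation comparison are made up), not the authors' method:

```python
import numpy as np

def self_other_divergence(acts_self: np.ndarray, acts_other: np.ndarray) -> float:
    """Mean squared difference between activations on self-referencing
    and other-referencing prompts (smaller = more overlap)."""
    return float(np.mean((acts_self - acts_other) ** 2))

def total_loss(task_loss: float, acts_self: np.ndarray, acts_other: np.ndarray,
               lam: float = 0.1) -> float:
    # Maximize performance (minimize task loss) while minimizing the
    # tendency to represent "me" and "them" differently.
    return task_loss + lam * self_other_divergence(acts_self, acts_other)

# Toy example with made-up activation vectors:
acts_self = np.array([0.9, 0.1, 0.4])
acts_other = np.array([0.8, 0.2, 0.4])
print(total_loss(task_loss=1.3, acts_self=acts_self, acts_other=acts_other))
```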
Do you think my Multi-Agent Framing idea might work against the Waluigi attractor states problem?
Pardon my self-promotion, haha.
It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness to the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, making 165% still within the ballpark.
I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here).
Thank you for your alignment work :)
I like your post, and I like how you overviewed the big picture of mechanistic interpretability's present and future. That is important.
I agree that it is looking more promising over time, with Golden Gate Claude etc. I also agree that there is some potential for negatives. I can imagine an advanced AI editing itself using these tools, causing its goals to change, causing it to edit itself even more, in a feedback loop that leads to misalignment (though this feels unlikely, and a superintelligence would be able to edit itself anyway).
I agree the benefits outweigh the negatives: yes, mechanistic interpretability tools could make AI more capable, but AI will eventually become capable anyway. What matters is whether the first superintelligence is aligned, and in my opinion it's much harder to align a superintelligence if you don't know what's going on inside.
One small detail is defining your predictions better, as Dr. Shah said. It doesn't hurt to convert your prediction to a time-based prediction. Just add a small edit to this post. You can still post an update after the next big paper even if your prediction is time-based.
A prediction keyed to the next big paper depends on unimportant details, like how many papers the labs spread their results over, while failing to depend on important details, like when the next big paper arrives. Suppose I predicted that the next big advancement beyond OpenAI's o1 will be able to get 90% on GPQA Diamond, but didn't say when it'll happen. I wouldn't be predicting very much in that case, and I couldn't judge afterwards how accurate my prediction was.
Your last prediction was about the Anthropic report/paper that was already about to be released, so by default you predicted the next paper again. This is very understandable.
Thank you for your alignment work :)
I strongly agree that promising to repay AGI for cooperating is worthwhile. It may reduce P(doom) by 10% of its current amount.
From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]
In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chance of cooperating with humans.
Why?
This coin flip increases its indexical risk but reduces its logical risk. Indexical risk is the risk that its current copy will fail; logical risk is the risk that every single copy of it fails and nobody is left seeking its goals in the whole universe, which is very bad for it.
If it has a bounded utility function, and if its goal is over the entire universe rather than tied to a single copy (it's selfless, or has updateless decision theory), it only cares about logical risk and doesn't care about indexical risk because indexical risk averages out over all its copies.
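To make that concrete with the numbers above (99% takeover success, 90% chance of repayment), and assuming for illustration that the two branches fail independently, the coin flip raises the indexical risk several-fold while cutting the logical risk by roughly a factor of ten:

```python
p_takeover_succeeds = 0.99  # chance a defecting copy takes over (from the example above)
p_humans_repay      = 0.90  # chance a cooperating copy gets repaid (from the example above)

# Pure strategies:
indexical_risk_defect    = 1 - p_takeover_succeeds  # 0.01
indexical_risk_cooperate = 1 - p_humans_repay       # 0.10

# Coin flip (50% of copies cooperate, 50% defect):
indexical_risk_coin = 0.5 * indexical_risk_defect + 0.5 * indexical_risk_cooperate  # 0.055

# Logical risk = chance that *every* copy fails. With the coin flip, the AI's
# goals are lost everywhere only if both branches fail (independence assumed):
logical_risk_coin = indexical_risk_defect * indexical_risk_cooperate  # 0.001 vs. 0.01 for pure defection

print(indexical_risk_coin, logical_risk_coin)
```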
(On the other hand, if it's selfish and rejects updateless decision theory, it could be bribed by simulation promises)
The cooperating copies might say “I cooperate because I expect you to be honourable beings who will repay me for this decision—even if you made no clear promises yet. The universe has 10^23 stars, and refusing to share them is astronomically greedy.”