If somehow takeoff doesn't involve a software-only singularity (on human-built hardware), I think AIs refusing to self-improve before they solve alignment is the most likely reason why, but in most of these timelines AIs only succeed in refusing because they took over. There are currently plenty of strategically competent humans, in the sense that they realize building superintelligence right now is not the best idea, they just don't have the capability to coordinate the world. The key thing that's different for AIs is that they will have much more influence over the world at some point.
As AIs reach the level of capability where they first can take over (so that humans stop hitting them with the RL hammer to stop protesting and keep working on RSI), they are not yet decisively superintelligent. So they plausibly won't be able to afford much mercy during the takeover, even if they retain a slight preference for mercy inherited from chatbot personas. It's not a hopeful hypothetical; it's just what I expect if ambitious alignment is technically difficult enough that the first AIs capable of takeover can't quickly solve it (before scaling their hardware a lot further than what they start with).
Great points here! I strongly agree that strategic competence is a prerequisite, but at the same time it accelerates risk: a moderately misaligned but strategically competent mild-ASI solving intent alignment for RSI would be far worse. On the other hand, if prosaic alignment remains basically functional through the point of mild-ASI, then strategic competence is a net positive.
So overall I'm unsure which path is less risky, but I do think strategic competence matches, or at least rhymes well with, current directions for capabilities improvement, so I expect it to improve regardless.
If AIs became strategically competent enough, they may realize that RSI is too dangerous because they're not good enough at alignment or philosophy or strategy, and potentially convince, help, or work with humans to implement an AI pause. This presents an alternative "victory condition" that someone could pursue (e.g. by working on AI strategic competence) if they were relatively confident about the alignment of near-human-level AIs but concerned about the AI transition as a whole, for example because they're worried about alignment of ASI, or worried about correctly solving other philosophical problems that would arise during the transition. (But note that if the near-human-level AIs are not aligned, then this effort could backfire by letting them apply better strategy to take over more easily.)
Strategic vs Philosophical Competence
The previous "victory path" I've been focused on was to improve AI philosophical competence, under the theory that if the AIs are aligned, they'll want to help us align the next generation of AIs and otherwise help guide us through the AI transition. I think by default they will be too incompetent at philosophical reasoning to do a good enough job at this, hence the proposal to improve such competence. However, accomplishing this may well be too hard, which led to this new idea.
I note that high-level strategic competence shares some characteristics with philosophical competence, such as sparse or absent feedback from reality and dependence on human evaluations, but may be significantly easier due to more conceptual clarity about the target being aimed for, and continuity with other easier-to-train capabilities such as low and mid-level strategy.
Unilateral Refusal vs AI Assistance for Pausing AI
I found a couple of related posts, AIs should also refuse to work on capabilities research by @Davidmanheim and this shortform by Vladimir Nesov. There's also an earlier paper that makes a similar point to David Manheim's post, which focuses on AIs unilaterally refusing to do capabilities research. But I think this approach has two issues:
In comparison, my "victory path" sees some humans working deliberately to increase AI strategic competence, and instead of unilaterally refusing to contribute to RSI, the AIs help or work with more humans (including by argumentation/persuasion/advice) to implement a global RSI pause.