AGI ruin scenarios are likely (and disjunctive)

So8res

Note: As usual, Rob Bensinger helped me with editing. I recently discussed this model with Alex Lintz, who might soon post his own take on it (edit: here).

Some people seem to be under the impression that I believe AGI ruin is a small and narrow target to hit. This is not so. My belief is that most of the outcome space is full of AGI ruin, and that avoiding it is what requires navigating a treacherous and narrow course.

So, to be clear, here is a very rough model of why I think AGI ruin is likely. (>90% likely in our lifetimes.)^[1]

My real models are more subtle, take into account more factors, and are less articulate. But people keep coming to me saying "it sounds to me like you think humanity will somehow manage to walk a tightrope, traverse an obstacle course, and thread a needle in order to somehow hit the narrow target of catastrophe, and I don't understand how you're so confident about this". (Even after reading Eliezer's AGI Ruin post—which I predominantly agree with, and which has a very disjunctive character.)

Hopefully this sort of toy model will at least give you some vague flavor of where I’m coming from.

Simplified Nate-model

The short version of my model is this: from the current position on the game board, a lot of things need to go right, if we are to survive this.

In somewhat more detail, the following things need to go right:

The world’s overall state needs to be such that AI can be deployed to make things good. A non-exhaustive list of things that need to go well for this to happen follows:
- The world needs to admit of an AGI deployment strategy (compatible with realistic alignable-capabilities levels for early systems) that prevents the world from being destroyed if executed.
- At least one such strategy needs to be known and accepted by a leading organization.
- Somehow, at least one leading organization needs to have enough time to nail down AGI, nail down alignable AGI, actually build+align their system, and deploy their system to help.
  - This very likely means that there needs to either be only one organization capable of building AGI for several years, or all the AGI-capable organizations need to be very cautious and friendly and deliberately avoid exerting too much pressure upon each other.
- It needs to be the case that no local or global governing powers flail around (either prior to AGI, or during AGI development) in ways that prevent a (private or public) group from saving the world with AGI.
Technical alignment needs to be solved to the point where good people could deploy AI to make things good. A non-exhaustive list of things that need to go well for this to happen follows:
- There need to be people who think of themselves as working on technical alignment, whose work is integrated with AGI development and is a central input into how AGI is developed and deployed.
- They need to be able to perceive every single lethal problem far enough in advance that they have time to solve them.
- They need to be working on the problems in a way that is productive.
- The problems (and the general paradigm in which they're attacked) need to be such that people's work can stack, or such that they don't require much serial effort; or the research teams need a lot of time.
- Significant amounts of this work have to be done without an actual AGI to study and learn from; or the world needs to be able to avoid deploying misaligned AGI long enough for the research to complete.
The internal dynamics at the relevant organizations need to be such that the organizations deploy an AGI to make things good. A non-exhaustive list of things that need to go well for this to happen follows:
- The teams that first gain access to AGI, need to care in the right ways about AGI alignment.
  - E.g., they can't be "just raise the AGI with kindness; any attempt to force our values on it will just make it hate us" style kooks, or any other variety of kook you care to name.
- The internal bureaucracy needs to be able to distinguish alignment solutions from fake solutions, quite possibly over significant technical disagreement.
  - This ability very likely needs to hold up in the face of immense social and time pressure.
- People inside the organization need to be able to detect dangerous warning signs.
- Those people might need very large amounts of social capital inside the organization.
- While developing AGI, the team needs to avoid splintering or schisming in ways that result in AGI tech proliferating to other organizations, new or old.
- The team otherwise needs to avoid (deliberately or accidentally) leaking AGI tech to the rest of the world during the development process.
- The team likewise needs to avoid leaking insights to the wider world prior to AGI, insofar as accumulating proprietary insights enables the group to have a larger technical lead, and insofar as a larger technical lead makes it possible for you to e.g. have three years to figure out alignment once you reach AGI, as opposed to six months.

(I could also add a list of possible disasters from misuse, conditional on us successfully navigating all of the above problems. But conditional on us clearing all of the above hurdles, I feel pretty optimistic about the relevant players’ reasonableness, such that the remaining risks seem much more moderate and tractable to my eye. Thus I’ll leave out misuse risk from my AGI-ruin model in this post; e.g., the ">90% likely in our lifetimes" probability is just talking about misalignment risk.)

One way that this list is a toy model is that it’s assuming we have an actual alignment problem to face, under some amount of time pressure. Alternatives include things like getting (fast, high-fidelity) whole-brain emulation before AGI (which comes with a bunch of its own risks, to be clear). The probability that we somehow dodge the alignment problem in such a way puts a floor on how low models like the above can drive the probabilities of success down (though I’m pessimistic enough about the known-to-me non-AGI strategies that my unconditional p(ruin) is nonetheless >90%).

Some of these bullets trade off against each other: sufficiently good technical solutions might obviate the need for good AGI-team dynamics or good global-scale coordination, and so on. So these factors aren't totally disjunctive. But this list hopefully gives you a flavor for how it looks to me like a lot of separate things need to go right, simultaneously, in order for us to survive, at this point. Saving the world requires threading the needle; destroying the world is the default.

Correlations and general competence

You may object: "But Nate, you've warned of the multiple-stage fallacy; surely here you're guilty of the dual fallacy? You can't say that doom is high because three things need to go right, and multiply together the lowish probabilities that all three go right individually, because these are probably correlated."

Yes, they are correlated. They're especially correlated through the fact that the world is derpy.

This is the world where the US federal government's response to COVID was to ban private COVID testing, confiscate PPE bought by states, and warn citizens not to use PPE. It's a world where most of the focus on technical AGI alignment comes from our own local community, takes up a tiny fraction of the field, and most of it doesn't seem to me to be even trying by their own lights to engage with what look to me like the lethal problems.

Some people like to tell themselves that surely we'll get an AI warning shot and that will wake people up; but this sounds to me like wishful thinking from the world where the world has a competent response to the pandemic warning shot we just got.

So yes, these points are correlated. The ability to solve one of these problems is evidence of ability to solve the others, and the good news is that no amount of listing out more problems can drive my probability lower than the probability that I'm simply wrong about humanity's (future) competence. Our survival probability is greater than the product of the probability of solving each individual challenge.

The bad news is that we seem pretty deep in the competence-hole. We are not one mere hard shake away from everyone snapping to our sane-and-obvious-feeling views. You shake the world, and it winds up in some even stranger state, not in your favorite state.

(In the wake of the 2012 US presidential elections, it looked to me like there was clearly pressure in the US electorate that would need to be relieved, and I was cautiously optimistic that maybe the pressure would force the left into some sort of atheistish torch-of-the-enlightenment party and the right into some sort of libertarian individual-rights party. I, uh, wasn't wrong about there being pressure in the US electorate, but, the 2016 US presidential elections were not exactly what I was hoping for. But I digress.)

Regardless, there's a more general sense that a lot of things need to go right, from here, for us to survive; hence all the doom. And, lest you wonder what sort of single correlated already-known-to-me variable could make my whole argument and confidence come crashing down around me, it's whether humanity's going to rapidly become much more competent about AGI than it appears to be about everything else.

(This seems to me to be what many people imagine will happen to the pieces of the AGI puzzle other than the piece they’re most familiar with, via some sort of generalized Gell-Mann amnesia: the tech folk know that the technical arena is in shambles, but imagine that policy has the ball, and vice versa on the policy side. But whatever.)

So that's where we get our remaining probability mass, as far as I can tell: there's some chance I'm wrong about humanity's overall competence (in the nearish future); there's some chance that this whole model is way off-base for some reason; and there's a teeny chance that we manage to walk this particular tightrope, traverse this particular obstacle course, and thread this particular needle.

And again, I stress that the above is a toy model, rather than a full rendering of all my beliefs on the issue. Though my real model does say that a bunch of things have to go right, if we are to succeed from here.

^{^}
In stark contrast to the multiple people I’ve talked to recently who thought I was arguing that there's a small chance of ruin, but the expected harm is so large as to be worth worrying about. No.

[-]Daniel Kokotajlo3y1717

Just wanted to say that this is also my impression, and has been for months. Technical & Policy both seem to be planning on the assumption that the other side will come through in the end, but the things they imagine this involving seem to be things that the other side thinks is pretty unlikely.

[-]Andreas Stuhlmüller3y84

And, lest you wonder what sort of single correlated already-known-to-me variable could make my whole argument and confidence come crashing down around me, it's whether humanity's going to rapidly become much more competent about AGI than it appears to be about everything else.

I conclude from this that we should push on making humanity more competent at everything that affects AGI outcomes, including policy, development, deployment, and coordination. In other times I'd think that's pretty much impossible, but on my model of how AI goes our ability to increase our competence at reasoning, evidence, argumentation, and planning is sufficiently correlated with getting closer to AGI that it's only very hard.

I imagine you think that this is basically impossible, i.e. not worth intervening on. Does that seem right?

If so, I'd guess your reasons are something like this:

Any system that can make a big difference in these domains is extremely dangerous because it would need to be better than us at planning, and danger is a function of competent plans. Can't find a reference but it was discussed in one of the 2021 MIRI conversations.
The coordination problem is too hard. Even if some actors have better epistemics it won't be enough. Eliezer states this position in AGI ruin:

weaksauce Overton-abiding stuff about 'improving public epistemology by setting GPT-4 loose on Twitter to provide scientifically literate arguments about everything' will be cool but will not actually prevent Facebook AI Research from destroying the world six months later, or some eager open-source collaborative from destroying the world a year later if you manage to stop FAIR specifically.

Does that sound right? Are there other important reasons?

AI ALIGNMENT FORUM
AF

63

AGI ruin scenarios are likely (and disjunctive)

63

Simplified Nate-model

Correlations and general competence