Plans A, B, C, and D for misalignment risk

Side question: what about the "shut it all down" plan proposed in (e.g.) If Anyone Builds It, Everyone Dies?

I think this probably requires substantially more political will than Plan A and seems worse than a well-implemented version of Plan A that leverage the additional political will to spend more time slowing down at high levels of capability (and some at lower levels of capability). That said, shutting it all down is substantially simpler and a well-implemented version would reduce takeover risk substantially in my view (at the cost of delaying the benefits of AI by decades which seems worth it, but I can understand why people would disagree).

"Shut it all down" seems worse to me because:

poor implementation fails more catastrophically
the balance of power in ~30 years seems maybe worse
you eventually probably want to go back to Plan A anyway (with hopefully augmented humans to make this go better???)
exiting the pause regime suddenly seems potentially scary due to overhangs (though less so if you're also extending "shut it all down" to semiconductor progress but this makes the plan even more expensive)
I worry a bit about indefinite pauses or humanity generally becoming more anti-progress as a side effect or requirement to make this happen.

[-]Raemon1mo2013

My main question is "why do you think Shut Down actually costs more political will?".

I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells.

I also think Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff").

Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff.
Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff."
A Global Controlled Takeoff That Works has a lot of moving parts.
You need the international agreement to be capable of making any kind of sensible distinctions between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs.
You need the international agreement to not turn into molochian regulatory-captured horror that perversely reverses the intent of the agreement and creates a class of bureaucrats who don't know anything about AI and use the agreement to dole out favors.
These problems still exist in some versions of Shut It Down too, to be clear (if you're trying to also ban algorithmic research – a lot of versions of that seem like they leave room to argue about whether agent foundations or interpretability count). But, they at least get coupled with "no large training runs, period."
I think "guys, everyone just stop" is a way easier schelling point to coordinate around, than "everyone, we're going to slow down and try to figure out alignment as best we can using current techniques."
So, I am not currently convinced that Global Controlled Takeoff That Actually Works is any more politically tractable than Global Shut Down.
(Caveat: Insofar as your plan is "well, we will totally get a molochian moral maze horror, but, it'll generally move slower and that buys time", eh, okay, seems reasonable. But, at least be clear to yourself about what you're aiming for)

I agree you do eventually want to go back to Plan A anyway, so I mostly am just not seeing why you really want to treat these as separate plans, rather than a single:

"Okay, we wanna get all the compute centralized and monitored, we want a lot more control over GPU production, we want to buy as much time and proceed as carefully as we can. At any given time, we want the option to either be basically shut down or in controlled-takeoff mode, depending on some conditions on the ground."

I agree with some of the risks of "geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress" but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff world. (in particular in a "shut it down" world where the framing is "we are eventually going to build a thing that everyone agrees is good, we're just making sure we get it right.")

But, "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying out best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys this is dangerous enough to warrant massive GPU monitoring but... still trying to push ahead as fast as we can?".

[-]ryan_greenblatt1mo90

I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells.

I think shutting down all AI development is much more costly than not shutting down all AI development in a pretty straightforward sense that will in fact probably be priced into the required level of political will: Nvidia is in fact much worse off if all AI development shuts down versus if AI development proceeds, but with capabilities developing more slowly once they reach a high level of capabilities.

I would guessed the stock market will react pretty different to something like Plan A vs "shut it all down" for reasonable reasons.

I don't understand why you think the opening steps are the most politically challenging part given that the opening steps for Plan A plausibly don't require stopping AI development.

Another point is that many people have pretty reasonable existing objections to "shut it all down". Here are some example objections people might have that apply more to "shut it all down" than "Plan A":

Shouldn't we at least proceed until we can't very confidently proceed safely? Like do you really think the next generation of AIs (given some reasonable additional safeguards + evals to detect/stop unprecedently large jumps) is very dangerous?
This proposal seems like it would lead to turn key totalitarianism making authoritarianism more likely, especially if aggressive surpression of algorthmic research is needed. (And Plan A seems substantially less bad in this regard.) I'm not sold risk (especially Plan A vs shut it all down) is high enough to warrant this.
This would delay the transformative benefits of AI by a long time while not using that time in a very cost effective way.

I think this both factors into political will and makes me more reluctant to push for "shut it all down" because I partially buy these views and because I think it's good to be cooperative/robust-under a variety of pretty reasonable views. Like I do really feel "delaying AI for 30 years results in ~1/4 of the population dying of old age when they otherwise wouldn't have" from a cooperativeness with other moral views perspective (I put most weight on longtermism myself).

I also think Plan A "bad implementation" is much more likely, and also much worse

Pausing for a long time at a low level of capability seems like it makes the risk of other actors overtaking and destablizing the pause regime especially bad. E.g., just shutting down AI development in the US is much worse than just implementing Plan A in the US, but this generally applies to any sort of partial non-proliferation/pause. More capable AIs also can help make the regime more stable.

I agree that a bad implementation of Plan A can decay to something more like Plan C or worse where you don't actually spend that much of the lead time on safety and how you feel about this depends on how you feel about something like Plan C.

One way to put this is that a long pause is probably taking on a bunch more "some actor gets outside the agreement" or "the current regime collapses and you go quickly from here" risk (due to additional time and lower level of capability) with not that much benefit. E.g., like if we seemingly had the political will for a 30 year pause, I'd be pretty worried about this collapsing in 10 years and us doing something much worse than Plan A, while if we start with Plan A then we've already gotten a bunch done by the time the regime (potentially) collapses.

I also think that if you don't stop semiconductor progress (which again, would make political will requirements substantially higher than under Plan A), then there is a real risk of the takeoff being much faster than it would make been by default due to overhang. It's unclear how bad this is, but I think it's really nice to have the singularity happen at a point where compute (and ideally fab capacity) is a big bottleneck. Note that both compute overhang is less extreme than Plan A and you are slowing the takeoff itself at the most leveraged point in Plan A (such that even if you exit somewhat early, you still got most of what you could have hoped for).

(I'm noticing that we're calling one proposal "shut it all down" and the other "Plan A" (even though it's just my favorite proposal for what to do with this level of political will) which is pretty obviously biased naming as an side effect of how I've introduced this proposal. I'll keep using this naming, but readers should try to adjust for this bias as applicable.)

gain see "Shut It Down" is simpler than "Controlled Takeoff"

I agree "shut it all down" is a simpler proposal (in initial implementation) and this is a big advantage. If you think massively augmented humans are likely as a result of "shut it all down", then from our perspective it's overall simpler, not just in terms of initial implementation. Otherwise, someone still has to eventually handle the situation which is potentially complicated, especially if alignment moonshots don't work out.

I agree you do eventually want to go back to Plan A anyway, so I mostly am just not seeing why you really want to treat these as separate plans

Notably, the "shut it all down" plan proposed in If Anyone Builds It, Everyone Dies involves stopping AI progress for a long period at the current level of capability, so it really is a separate plan. I agree you sometimes want to prevent development beyond a capability cap and sometimes you want to proceed, but the question from my perspective is more like "at what level of capability do you want to spend this time" and "how much time do you realistically have".

I agree with some of the risks of "geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress" but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff world.

I think "humanity generally becoming anti-progress" (and stopping AI development much longer term) seems much more likely if you're stopping all AI progress for decades (both evidentially and causally).

I think the geopolitical situation in 30 years naively looks scary due to the rise of China and the relative fall of Europe and I don't think general cultural/societal progress looks fast enough on that time frame to overcome this. I think the current CCP having control over most/all of the universe seems like 50% as bad as AI takeover in my lights, though I'm sympathetic to being more optimistic, especially about the version of the CCP that exists in 30 years.

Responding to your other comment

One way the geopolitical situation might get worse is "time passes, and, all kinds of stuff can change when time passes."

Another way it can get worse is "the current dynamics still involve a feeling of being rushed, and time pressure, and meanwhile the international agreements we have leave a lot more wiggle room and more confused spirit-of-the-law about how people are allowed to maneuever." This could cause the geopolitical situation to get worse faster than it would otherwise.

Which of those is worse? idk, I'm not a geopolitical expert. But, it's why it seems pretty obviously not 'strictly worse' (which is a high bar, with IMO a higher burden of proof) under Shut It Down.

I think China predictably getting relatively (much?) more powerful is pretty relevant. I agreee it's not strictly worse, I think "humanity generally becoming anti-progress" is ~strictly worse under "shut it all down".

I agree it's messy and the comparison is complicated.

Also, note "shut it all down" is not like it's actually going to be permanent.

Sure, the intention isn't that it is permanent, but I think there is a real risk of it lasting a long time until the agreement is suddenly exited in a very non-ideal way (and some small chance of this altering culture for the worse longer term and some smaller chance of this resulting in humanity never building powerful AI before it is too late).

[-]Raemon1mo44

Thanks. I'll leave some responses but feels more fine to leave here for now.

I think shutting down all AI development is much more costly than not shutting down all AI development in a pretty straightforward sense that will in fact probably be priced into the required level of political will: Nvidia is in fact much worse off if all AI development shuts down versus if AI development proceeds, but with capabilities developing more slowly once they reach a high level of capabilities.
I would guessed the stock market will react pretty different to something like Plan A vs "shut it all down" for reasonable reasons.
I don't understand why you think the opening steps are the most politically challenging part given that the opening steps for Plan A plausibly don't require stopping AI development.

First, slight clarification: the thing I had in mind isn't the opening step (which is presumably "do some ad hoc deals that build political momentum without too much cost").

The step I have in mind is "all global compute clusters and fab production is monitored, with buy in from China, UK, Europe etc, with intent for major international escalation of some kind of some violates the monitor-pact". This doesn't directly shut down nVidia, but, it sure is putting some writing on the wall that I would expect nVidian political interests to fight strongly even if it doesn't immediately come with a shut down.

I'm imagining a Plan A that doesn't include something like that is more like a Plan A / B hybrid or some other "not the full Plan A." (based on some other internal Plan A docs I've looked at that went into more detail as of a few weeks ago).

I don't think there's any way you get to that point without most major world leaders actually believing-in-their-heart "if anyone builds it, something real bad is dangerously likely to happen." And by the point people are actually agreeing to have international inspection of some kind, I would expect people to more thinking "okay will this actually work?" than "what do we have buy-in for?".

(There is a version where the US enforces it at gunpoint or at least economicsanction-point without everyone else's buy in but I both don't expect them to do that and don't really expect it to work?)

MIRI discusses in the IABIED resources that they would prefer carveouts for narrow bio AI, so it's not like they're even advocating all progress to stop. (Advanced bio AI seems pretty good for the world and to capture a lot of the benefits).

...

I certainly do expect you-et-al to disagree with MIRI-et-al on a bunch of implementation details of the treaty.

But, it seems like a version of the treaty that doesn't at least have the capacity to shutdown compute temporarily is a kinda fake version of Plan A, and once you have that, "Shut down" vs "Controlled Takeoff" feels more like arguing details than fundamentals to me.

[-]ryan_greenblatt1mo40

I think compute + fab monitoring with potential for escalation requires much lower political will than shutting down AI development. I agree that both Plan A and shut it all down require something like this. Like I think this monitoring would plausibly not require much more political will than export controls.

Advanced bio AI seems pretty good for the world and to capture a lot of the benefits

Huh? No it doesn't capture much of the benefits. I would have guessed it captures a tiny fraction of the benefits for advanced AI, even for AIs around the level where you might want to pause at human level.

But, it seems like a version of the treaty that doesn't at least have the capacity to shutdown compute temporarily is a kinda fake version of Plan A, and once you have that, "Shut down" vs "Controlled Takeoff" feels more like arguing details than fundamentals to me.

I agree you will have the capacity to shut down compute temporarily either way; I disagree that there isn't much of a difference between slowing down takeoff and shutting down all further non-narrow AI development.

[-]Raemon1mo23

I think compute + fab monitoring with potential for escalation requires much lower political will than shutting down AI development. I agree that both Plan A and shut it all down require something like this. Like I think this monitoring would plausibly not require much more political will than export controls.

FYI this is cruxy. I don't have very strong political-viability-intuitions, but seems like this requires export controls that several (sometimes rivalrous) major nations are agreeing to simultaneously, with at least nontrivial trust for establishing the monitoring process together, which eventually is pretty invasive.

(maybe you are imagining the monitoring is actually mostly done with spy satellites that don't require much trust or cooperation?)

But like, the last draft of Plan A I saw include "we relocate all the compute to centralized locations in third party countries" as an eventual goal. That seems pretty crazy?

[-]ryan_greenblatt1mo32

But like, the last draft of Plan A I saw include "we relocate all the compute to centralized locations in third party countries" as an eventual goal. That seems pretty crazy?

Yes, this is much harder (from a political will perspective) than compute + fab monitoring which is part of my point? Like my view is that in terms of political will requirements:

compute + fab monitoring << Plan A < Shut it all down

[-]Raemon1mo20

Nod, I agree centralizing part is harder than non-centralized fab monitoring. But, I think a sufficient amount of "non-centralized" fab monitoring is still a much bigger ask than export controls, and, the centralization was part of at least one writeup of Plan A, and it seemed pretty weird to include that bit but write off "actual shutdown" as politically intractable.

[-]ryan_greenblatt1mo42

I'm not trying to say "Plan A is doable and shut it all down is intractable".

My view is that "shut it all down" probably requires substantially more (but not a huge amount more) political will than Plan A such that it is maybe like 3x less likely to happen given similar amounts of effort from the safety community.

You started by saying:

My main question is "why do you think Shut Down actually costs more political will?".

So I was trying to respond to this. I think 3x less likely to happen is actually a pretty big deal; this isn't some tiny difference, but neither is it "Plan A is doable and shut it all down is intractable". (And I also think "shut it all down" has various important downsides relative to Plan A, maybe these downsides can be overcome, but by default this makes Plan A look more attractive to me even aside from the political will considerations.)

I think something like Plan A or "shut it all down" are both very unlikely to happen and I'd be pretty sympathetic to describing both as politically intractable (e.g., I think something as good/strong as Plan A is only 5% likely). "politically intractable" isn't very precise though, so I think we have to talk more quantitatively.

Note that my view is also that I think pushing for Plan A isn't the most leveraged thing for most people to do at the margin; I expect to focus on making Plans C/D go better (with some weight on things like Plan B).

[-]Raemon1mo20

Nod.

FYI, I think Shut It Down is approximately as likely to happen as "Full-fledged Plan A that is sufficiently careful enough to actually help much more than [the first several stages of Plan A that Plan A and Shut It Down share]", on account of being simple enough that it's even really possible to coordinate on it.

I agree they are both pretty unlikely to happen. (Regardless, I think the thing to do is probably "reach for whatever wins seem achievable near term and try to build coordination capital for more wins")

I think it's a major possible failure mode of Plan A is "it turns it a giant regulatory capture molochian boondoggle that both slows thing down for a long time in confused bad ways and reads to the public as a somewhat weirdly cynical plot, which makes people turn against tech progress comparably or more than the average Shut It Down would." (I don't have a strong belief about the relative likelihoods of that

None of those beliefs are particularly strong and I could easily learn a lot that would change all my beliefs.

Seems fine to leave it here. I dont have more arguments I didn't already write up in "Shut It Down" is simpler than "Controlled Takeoff", just stating for the record I don't think you've put forth an argument that justifies the 3x increase in difficulty of Shut It Down over the fully fledged version of Plan A. (We might still be imagining different things re: Shut It Down)

[-]ryan_greenblatt1mo40

But, "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying out best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys this is dangerous enough to warrant massive GPU monitoring but... still trying to push ahead as fast as we can?".

Wouldn't the narrative for Plan A be more like "we should be cautious and slow down if we aren't confident about safety, and we'll need to build the ability to slow down a lot"? While the narrative for "shut it all down" would have to involve something like "proceeding with any further development is too risky given the current situation".

[-]Raemon1mo20

I'm not 100% sure what Nate/Eliezer believe. I know they do think eventually we should build superintelligence, and that it'd be an existential catastrophe if we didn't.

I think they think (and, I agree) that we should be at least prepared for things that are more like 20-50 year pauses, if it turns out to take that long, but (at least speaking for myself), this isn't because it's intrinsically desireable to pause for 50 years. It's because you should remain shut-down until you're actually confidently know what you're doing, with no pressure to convince yourself/each-other than you're ready when you are not.

It might be that AI-accelerated alignment researchmeans you don't need a 20-50 year pause, but, that should be a decision the governing body makes based on how things are playing out, not baked into the initial assumption, so we don't need to take risks like "run tons of very smart AIs in parallel very fast" when we're only somewhat confident about their longterm alignment which opens us up to more gradual disempowerment / slowly-outmanuevered risk, or eventual death by evolution.

I haven't read the entirety of the IABIED website proposed treaty draft yet, but it includes this line, which includes flavor of "re-evaluate how things are going."

Three years after the entry into force of this Treaty, a Conference of the Parties shall be held in Geneva, Switzerland, to review the operation of this Treaty with a view to assuring that the purposes of the Preamble and the provisions of the Treaty are being realized. At intervals of three years thereafter, Parties to the Treaty will convene further conferences with the same objective of reviewing the operation of the Treaty.

[-]ryan_greenblatt1mo42

Sure, I agree that Nate/Eliezer think we should eventually build superintelligence and don't want to causal a pause that lasts forever. In the comment you're responding to, I'm just talking about difficulty in getting people to buy the narrative.

More generally, what Nate/Eliezer think is best is doesn't resolve concerns with the pause going poorly because something else happens in practice. This includes the pause going on too long or leading to a general anti-AI/anti-digital-minds/anti-progress view which is costly for the longer run future.) (This applies to the proposed Plan A as well, but I think poor implementation is less scary in various ways and the particular risk of ~anti-progress forever is less strong.)

[-]Raemon1mo32

Responding to some disagree reacts:

that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells.

Seems good to register disagreement, but, fyi I have no idea why you think that.

Re:

"these [geopolitical situation getting worse] don't even seem strictly worse in Shutdown World vs Controlled Takeoff world":

One way the geopolitical situation might get worse is "time passes, and, all kinds of stuff can change when time passes."

Another way it can get worse is "the current dynamics still involve a feeling of being rushed, and time pressure, and meanwhile the international agreements we have leave a lot more wiggle room and more confused spirit-of-the-law about how people are allowed to maneuever." This could cause the geopolitical situation to get worse faster than it would otherwise.

Which of those is worse? idk, I'm not a geopolitical expert. But, it's why it seems pretty obviously not 'strictly worse' (which is a high bar, with IMO a higher burden of proof) under Shut It Down.

(Also, note "shut it all down" is not like it's actually going to be permanent. Any international treaty/agreement at any time can be reversed by the involved nations deciding "guys, actually we have now voted to to leave this agreement", with some associated negotiations along the way)

[-]ryan_greenblatt1mo20

I'm going to default to bowing out, but if you want to bid for me to engage a bunch, you can.

[-]Raemon1mo52

I dunno, this seems really important and I am really confused why y'all are oriented this way.

Yes, I very much would like responses on these and my other comment, although no worries if you want to take a bit more time to address more thoroughly.

[-]Thomas Larsen1mo150

One framing that I think might be helpful for thinking about "Plan A" vs "shut it all down" is: "Suppose that you have the political will for an n-year slowdown, i.e. after n years, you are forced to handoff trust to superhuman AI systems (e.g. for n = 5, 10, 30). What should the capability progression throughout the slowdown be?" This framing forces a focus on the exit condition / plan to do handoff, which I think is an underdiscussed weakness of the "shut it all down" plan.

I think my gut reaction is that the most important considerations are: (i) there are a lot of useful things you can do with the AIs, so I want more time with the smarter AIs, and (ii) I want to scale through the dangerous capability range slowly and with slack (as opposed to at the end of the slowdown).

this makes me think that particularly for a shorter slowdown (e.g. 5 years), you want to go fast at the beginning (e.g. scale to ~max controllable AI over the first year or two), and then elicit lots of work out of those AIs for the rest of the time period.
A key concern for the above plan is that govts/labs botch the measurement of "max controllable AI", and scale too far.
But it's not clear to me how a further delay helps with this, unless you have a plan for making the institutions better over time, or pursuing a less risky path (e.g. ignoring ML and doing human intelligence augmentation).
Going slower, on the other hand, definitely does help, but requires not shutting it all down.
More generally, it seems good to do something like "extend takeoff evenly by a factor of n", as opposed to something like "pause for n-1 years, and then do a 1 year takeoff".
I am sympathetic to shut all down and go for human augmentation: I do think this reduces AI takeover risk a lot, but this requires a very long pause, and it requires our institutions to bet big on a very unpopular technology. I think that convincing governments to "shut it all down" without an exit strategy at all seems quite difficult as well.

Ofc this framing also ignores some important considerations, e.g. choices about the capability progression effect both difficulty of enforcement/verification (in both directions: AI lie detectors/ai verification is helpful, while making AIs closer to the edge is a downside), as well as willingness to pay over time (e.g. scary demos or AI for epistemics might help increase WTP)

[-]Raemon1mo20

This framing feels reasonable-ish, with some caveats.^[1]

I am assuming we're starting the question at the first stage where either "shut it down" or "have a strong degree of control over global takeoff" becomes plausibly politically viable. (i.e. assume early stages of Shut It Down and Controlled Takeoff both include various partial measures that are more immediately viable and don't give you the ability to steer capability-growth that hard)

But, once it becomes a serious question "how quickly should we progress through capabilities", then one thing to flag is, it's not like you know "we get 5 years, therefore, we want to proceed through those years at X rate." It's "we seem to have this amount of buy-in currently..." and the amount of buy-in could change (positively or negatively).

Some random thoughts on things that seem important:

I would want to do at least some early global pause on large training runs, to check if you are actually capable of doing that at all. (in conjunction with some efforts attempting to build international goodwill about it)
One of the more important things to do as soon as it's viable, is to stop production of more compute in an uncontrolled fashion. (I'm guessing this plays out with some kind of pork deals for nVidia and other leaders^[2], where the early steps are 'consolidate compute', and then them producing the chips that are more monitorable, and which they get to make money from, but also are sort of nationalized). This prevents a big overhang.
Before I did a rapid-growth of capabilities, I would want a globally set target of "we are able to make some kind of interpretability strides or evals that let us make better able to predict the outcome of the next training run." (

If it's not viable to do that, well, then we don't. (but, then we're not really having a real convo about how slow the takeoff should ideally be, just riding the same incentive wave we're currently riding with slightly more steering). ((We can instead have a convo about how to best steer given various murky conditions, which seems like a real important convo, I'm just responding here to this comment's framing))^[3]

If we reach a point where humanity has demonstrated the capability of "stop training on purpose, stop uncontrolled compute production, and noticeably improve our ability to predict the next training run", then I'm not obviously opposed to doing relatively rapid advancement, but, it's not obviously better to do "rapid to the edge" than "do one round where there are predictions/incentives/prizes somehow for people to accurately predict how the next training rounds go, then evaluate that, then do it again."

^{^}
I think there's at least some confusion where people are imagining the simplest/dumbest version of Shut It Down, and imagining "Plan A" is nuanced and complicated. I think the actual draft treaty has levers that are approximately the same levers you'd want to do this sort of controlled takeoff.
^{^}
I'm not sure how powerful nVidia is an an interest group. Maybe it is important to avoid them getting a deal like this so they're less of an interest group with power at the negotiating table.
^{^}
FYI my "Ray detects some political bs motivations in himself" alarm is tripping as I write this paragraph. It currently seems right to me but let me know if I'm missing something here.

[-]Thomas Larsen1mo119

One upside of shut it all down is that it does in fact buy more time: in Plan A it is difficult to secure algorithmic secrets without extremely aggressive security measures, hence any rogue projects (e.g. nation state blacksites) can just coast off the algos developed by the verified projects. Then, a few years in, they fire up their cluster and try to do an intelligence explosion with the extra algorithmic progress.

[-]ryan_greenblatt1mo40

Maybe I should clarify my view a bit on Plan A vs "shut it all down":

Both seem really hard to pull off from a political will and actually making it happen.
- Plan A is complicated and looking at I go "oh jeez, this seems really hard to make happen well, idk if the US government has the institutional capacity to pull off something like this". But, I also think pulling off "shut it all down" seems pretty rough and pulling off a shutdown that lasts for a really long time seems hard.
- Generally, it seems like Plan A might be hard to pull off and it's easy for me to imagine it going wrong. This also mostly applies to "shut it all down" though.
- I still think Plan A is better than other options and Plan A going poorly can still nicely degrade into Plan B (or C).
It generally looks to me like Plan A is strictly better on most axes, though maybe "shut it all down" would be better if we were very confident in maintaining extremely high political will (e.g. at or above peak WW2 level political will) for a very long time.
- There would still be a question of whether the reduction in risk is worth delaying the benefits of AI. I tenatively think yes even from a normal moral perspective, but this isn't as clear of a call as doing something like Plan A.
- I worry about changes in the balance of power between the US and China with this long of a pause (and general geopolitical drift making the situation worse and destabilizing this plan). But, maybe if you actually had the political will you could make a deal which handles this somehow? Not sure exactly how, minimally the US could try to get its shit together more in terms of economic growth.
- And I'd be a bit worried about indefinite pause but not that worried.

This is my view after more seriously getting into some of the details of the Plan A related to compute verification and avoiding blacksite projects as well as trying to do a more precise comparison with "shut it all down".

[-]Daniel Kokotajlo1mo725

Plan C: 20%
Plan D: 45%
Plan E: 75%

I feel like these numbers are too low.

[-]Vladimir_Nesov1mo40

You can't really have a technical "Plan E" because there is approximately no one to implement the plan

AGIs themselves will be implementing some sort of plan (perhaps at very vague and disorganized prompting from humans, or without any prompting at all; which might be influenced by blog posts and such, in publicly available Internet text). This could be relevant for mitigating ASI misalignment if these AGIs are sufficiently aligned to the future of humanity, more so than some of the hypothetical future ASIs (created without following such a plan).

[-]ryan_greenblatt1mo40

Sure, I agree with this, but it's harder for us to usefully help these AIs.

[-]Vladimir_Nesov1mo10

The "ten people on the inside" direct AIs to useful projects within their resource allocation. The AGIs themselves direct their own projects according to their propensities, which might be influenced by publicly available Internet text, possibly to a greater extent if it's old enough to be part of pretraining datasets.

The amount of resources that AGIs direct on their own initiative might dwarf the amount of resources of the "ten people on the inside", so the impact of openly published technical plans (that make sense on their own merits) might be significant. While AGIs could come up with any ideas independently on their own, path dependence of the acute risk period might still make their initial propensities to pay attention to particular plans matter.

[-]Vladimir_Nesov1mo44

What happens with gradual disempowerment in this picture? Even Plan A seems compatible with handing off increasing levels of influence to AIs. One benefit of "shut it all down" (AGI Pause) is ruling out this problem by not having AGIs around (at least while the Pause lasts, which is also when the exit strategy needs to be prepared, not merely technical alignment).

Gradual disempowerment risks transitioning into permanent disempowerment (if not extinction), where a successful solution to technical ASI-grade alignment by the AIs might result in the future of humanity surviving, but only getting a tiny sliver of resources compared to the AIs, with no way of ever changing that even on cosmic timescales. Permanent disempowerment doesn't even need to involve a takeover.

Also, in the absence of "shut it all down", at some point targeting misalignment risks might be less impactful on the margin than targeting improvements in education (about AI risks and cruxes of mitigation strategies), coordination technologies, and AI Control. These enable directing more resources to misalignment risk mitigation as appropriate, including getting back to "shut it all down", a more robust ASI Pause, or making creation of increasingly capable AGIs non-lethal if misaligned (not a "first critical try").

[-]ryan_greenblatt1mo30

One alternative way of thinking about this is to decompose plans by which actor the plan is for:

Plan A: Most countries, at least the US and China
Plan B: The US government (and domestic industry)
Plan C: The leading AI company (or maybe a few of the leading AI companies)
Plan D: A small team with a bit of buy in within the leading AI company

This isn't a perfect breakdown, e.g. Plan A might focus mostly on what the US should do, but it might still be helpful.

This decomposition was proposed by @Lukas Finnveden.

[-]Raemon1mo30

Plan A: 10 years
Plan B: 1-3 years
Plan C: 1-9 months (probably on the lower end of this)
Plan D: ~0 months, but ten people on the inside doing helpful things

I think you mean "starting from fully automated AI R&D" but not 100% sure.

[-]ryan_greenblatt1mo40

I just mean "amount of additional lead time to spend on safety". This could be spent at different points.

[-]habryka1mo32

Thus, the numbers I give below are somewhat more optimistic than what you'd get just given the level of political will corresponding to each of these scenarios (as this will might be spent incompetently).

FWIW, for at least plan A and plan B, I feel like the realistic multiplier on how optimistic these are is like at least 3x? Like, I don't see an argument for this kind of plan working with 90%+ probability given realistic assumptions about execution quality.

(I also have disagreements about whether this will work, but at least plan A well-executed seems like it would notice it was starting to be very reckless and then be in a good position to slow down more)

[-]ryan_greenblatt1mo20

Yeah fair, I don't think I've thought about this very carefully. I currently feel like 3x is too high, but I don't feel very reflectively stable.

[-]cousin_it1mo20

Do you know any people working at frontier labs who would be willing to do the kind of thing you describe in plan D, some kind of covert alignment against the wishes of the larger company? Who would physically press keys on their terminal to do it, as opposed to quitting or trying to sway the company? Not asking to name names, just my hunch is that there are very few such people now, maybe none at all. And if that's the case, we're in E world already.

[-]ryan_greenblatt1mo63

I don't think Plan D particularly involves covert alignment and going against the will of the larger company, though going against the will of the company might come up in practice.

I think there are people working in frontier labs who would be willing to try to make some version of Plan D happen.

[-]cousin_it1mo13

Can you maybe describe in more detail how you imagine it? What specifically do the "ten people on the inside" do, if company leadership disagrees with them about safety?

[-]Daniel Kokotajlo1mo88

I don't think the idea is that the 10 people on the inside violate the wishes of company leadership. Rather, the idea is that they use whatever tiny amount of resources and political capital they do have as best as possible. E.g. leadership might be like "Fine, before we erase the logs of AI activity we can have your monitor system look over them and flag anything suspicious -- but you have to build the monitor by next week because we aren't delaying, and also, it can't cost more than 0.01% of overall compute."

[-]cousin_it1mo417

The OP says takeover risk is 45% under plan D and 75% under plan E. We're supposed to gain an extra 30% of safety from this feeble "build something by next week with 1% of compute"? Not happening.

My point is that if the "ten people on the inside" obey their managers, plan D will have a tiny effect at best. And if we instead postulate that they won't obey their managers, then there are no such "ten people on the inside" in the first place. So we should already behave as if we're in world E.

[-]ryan_greenblatt1mo20

A general point is that going from "no human cares at all" to "a small group of people with limited resources cares" might be a big difference, especially given the potential leverage of using a bunch of AI labor and importing cheap measures developed elsewhere.

[-]cousin_it1mo10

Yeah, that partly makes sense to me. I guess my intuition is like, if 95% of the company is focused on racing as hard as possible (and using AI leverage for that too, AI coming up with new unsafe tricks and all that), then the 5% who care about safety probably won't have that much impact.

[-]Daniel Kokotajlo1mo20

I disagree with the probabilities given by the OP. Also, the thing I mentioned was just one example, and probably not the best example; the idea is that the 10 people on the inside would be implementing a whole bunch of things like this.

[-]Raemon1mo*10

(Having otherwise complained a bunch about some of the commentary/framing around Plan A vs Shut It Down, I do overall like this post and think having the lens of the different worlds is pretty good for planning).

(I am also appreciating how people are using inline reacts)

Note that risks other than AI takeover are also generally reduced by having more actors take powerful AI seriously and having more coordination. ↩︎
The risk conditional on starting in a Plan D scenario is lower than conditional on remaining in a Plan D scenario and the risk conditional on starting in a Plan A scenario is higher than if we condition on remaining. ↩︎
This sentence was added in an edit because I realized I forgot to include this sort of caveat. ↩︎
Multiplying the probabilities given above by the takeover risk numbers given here doesn't exactly yield my overall probability of takeover because of the optimistic assumption of reasonable execution/competence (making actual risk higher) and also because these risk numbers are for central versions of each scenario while the probabilities are for ranges of plans that include somewhat higher levels of will (making actual risk lower). (Specifically: "will (and competence) to do something which isn't much worse than the given plan while still being worse than the next better plan". So the probabilities for Plan C really include <Plan B while >= Plan C.) ↩︎

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

56

Plans A, B, C, and D for misalignment risk

56

Plan A

Plan B

Plan C

Plan D

Plan E

Thoughts on these plans