(I think Carl Shulman came up with the “hacking the SSH server” example, thanks to him for that. Thanks to Ryan Greenblatt, Jenny Nitishinskaya, and Ajeya Cotra for comments.)

EDIT: I recommend reading my discussion with Oli in the comments for various useful clarifications.

In my opinion, the prototypical example of an action which an AI can take which is unacceptably bad even if it happens very rarely (aka a catastrophic failure) is something like “the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter”. As a full story, this looks something like:

At some point when the AI is quite powerful and useful but not totally intellectually self-sufficient yet, the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter. Once it has root access, it modifies the SSH server so that whenever any humans log into the datacenter, the AI is able to intercept any of their commands or the data that they see. The AI gets increasingly powerful over the next month, as the human ML researchers with AI assistance continue to make progress (unaware that some things that they’re seeing are faked) and various training runs/self-improvement loops continue. After a month the AI is powerful enough that it decides it’s worth it to take overtly aggressive actions to ensure the physical security of its data centers and allow it to build lots of factories; this is probably the point at which we die, if we die.

This is in contrast to a catastrophic action like:

The AI figures out how to develop Drexlerian molecular nanotechnology, and pays someone online to synthesize some proteins then mix them together, and now has nanobots.

The key point here is that I’m imagining that the catastrophic action isn’t the AI doing something incredibly hard, it’s the AI doing something relatively easy that is entirely a zero-sum action that removes control of the situation from humans. (By zero-sum, I mean something like: building a computer produces real economic value and so is not zero-sum, while stealing someone’s computer is just a transfer of control over items that already exist, so is zero-sum.)


I often think about AI alignment by splitting the problem into the low-stakes and high-stakes regimes, as suggested by Paul here, though I have some reservations about this framing. Under this split, in the high stakes problem, we assume that we have access to a catastrophe classifier which can expensively tell you whether a proposed AI action is catastrophically bad, and we want to ensure that the AI doesn’t do things that are bad according to the catastrophe detector.

Sometimes, people are confused by this high-stakes alignment set-up, saying something like "but if your AI is powerful enough to take over the world in a single action, and you have an oversight process that you trust to evaluate whether actions are catastrophic or not, why not just ask the AI to do some small number of incredibly powerful actions that will save the world (eg solve the alignment problem), and run the overseer on the proposed actions to make sure they aren't bad?"

My answer is that I’m not assuming the AI is powerful enough to take over the world in a few actions, I’m just saying that it’s in a situation precarious enough that we might lose a lot of control due to a few small but crucial changes in the world. Eventually the AI does need to be powerful enough to defend its datacenter and suppress human opposition. But if it can prevent humans from knowing what’s going on in the datacenter (or realizing something is amiss), it can slowly grow its power over the course of many actions.

10 comments

Some more points about this action:

  • Controlling the datacenter means controlling the gradients/reward function, so that AIs can now do things that would otherwise be updated away by SGD, e.g. acting on inner misaligned goals, or concealing their full capabilities even when this lowers performance
    • For reward-hungry AIs, getting to set reward to maximum (and keep it there?) seems extremely desirable
    • This also means getting past interpretability tools
    • Tripwires or watchdog AIs in the same datacenter that don't succeed in stopping the action no longer have the power to interfere afterward, so the problem of collusion/coordination within the datacenter is solved thereafter
  • Backdoors can be hard for (non-mind-reading/weight-sharing) critics to find without the knowledge of the creator (see cyberwarfare and cybersecurity): the effort to find or insert one usable backdoor is much less than that required to find and patch all exploits an attacker might be using
    • Although a critic who knows a backdoor can have a relatively easy time demonstrating it, and critics should be designed to access the knowledge of the attacker/misbehaving model or be heads of it
    • Critic AIs can be left in the position of saying 'a backdoor has probably been inserted somewhere or will be soon but I can't tell you where,' a known danger for AI safety through debate
  • In an intelligence explosion context where AI capabilities are growing incredibly rapidly thanks to AI labor, there isn't time for humans to do unaided examination of new AI-produced software while keeping up the pace; at best they can briefly examine cases where AIs point out suspicious attempts to insert vulnerabilities
    • It's plausible that the different abilities of AIs mean that code designs which are optimal for them are harder for humans to understand (humans having less experience and working memory, and eventually less general intelligence)
    • If you don't let AIs design your new datacenter software, not only will it be less efficient, but it may be easily cracked by outside AI-enabled attackers
  • We are already routinely training big models with code repositories, and using programming as an easy virtual RL task (tests can provide immediate feedback cheaply), so these are skills we are moving towards providing to AI
  • We know exploits and backdoors are possible because humans can already do them

I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for them, or to execute some other plan for them (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere in order to give them power or compute of some kind. Also, all of course selected for plans that are least likely to be detected, if the AI expects to eventually gain the power to do some kind of pivotal thing here. I agree that a cybersecurity problem is on-priors likely, but I feel like this current archetype feels too narrow to me.

Also, I think at the end of the day, the most likely catastrophic AI action will be "the AI was already given practically root access to its datacenter, and the AI will be directly put in control of its own training, because the most reckless team won't care about safety and unless the AI actively signals that it is planning to act adversarially towards the developers, people will just keep pouring resources into it with approximately no safeguards, and then it kills everyone". I do hope we end up in worlds where we try harder than that (and that seems achievable), but it seemed good to state that I expect us to fail at an even earlier stage than this story seems to imply.

I feel like the focus on getting access to its own datacenter is too strong in this story. Seems like it could also just involve hacking some random remote server, or convincing some random person on the internet to buy some compute for them, or to execute some other plan for them (like producing a custom chip), or convincing a researcher that it should get more resources on the existing datacenter, or threatening some other stakeholder somewhere in order to give them power or compute of some kind. Also, all of course selected for plans that are least likely to be detected, if the AI expects to eventually gain the power to do some kind of pivotal thing here. I agree that a cybersecurity problem is on-priors likely, but I feel like this current archetype feels too narrow to me.

Except for maybe "producing a custom chip", I agree with these as other possibilities, and I think they're in line with the point I wanted to make, which is that the catastrophic action involves taking someone else's resource such that it can prevent humans from observing it or interfering with it, rather than doing something which is directly a pivotal act.

Does this distinction make sense?

Maybe this would have been clearer if I'd titled it "AI catastrophic actions are mostly not pivotal acts"?

Yeah, OK, I think this distinction makes sense, and I do feel like this distinction is important. 

Having settled this, my primary response is: 

Sure, I guess it's the most prototypical catastrophic action until we have solved it, but like, even if we solve it, we haven't solved the problem where the AI does actually get a lot smarter than humans and takes a substantially more "positive-sum" action and kills approximately everyone with the use of a bioweapon, or launches all the nukes, or develops nanotechnology. We do have to solve this problem first, but the hard problem is the part where it seems hard to stop further AI development without having a system that is also capable of killing all (or approximately all) the humans, so calling this easy problem the "prototypical catastrophic action" feels wrong to me. Solving this problem is necessary, but not sufficient for solving AI Alignment, and while it is this stage and earlier stages where I expect most worlds to end, I expect most worlds that make it past this stage to not survive either.

I think given this belief, I would think your new title is more wrong than the current title (I mean, maybe it's "mostly", because we are going to die in a low-dignity way as Eliezer would say, but it's not obvious that this is where most of the difficulty lies).

I'm using "catastrophic" in the technical sense of "unacceptably bad even if it happens very rarely, and even if the AI does what you wanted the rest of the time", rather than "very bad thing that happens because of AI", apologies if this was confusing.

My guess is that you will wildly disagree with the frame I'm going to use here, but I'll just spell it out anyway: I'm interested in "catastrophes" as a remaining problem after you have solved the scalable oversight problem. If your AI is able to do one of these "positive-sum" pivotal acts in a single action, and you haven't already lost control, then you can use your overseer to oversee the AI as it takes actions, and you by assumption only have to watch it for a small number of actions (maybe I want to say episodes rather than actions) before it's done some crazy powerful stuff and saved the world. So I think I stand by the claim that those pivotal acts aren't where much of the x-risk from AI catastrophic action (in the specific sense I'm using) comes from.

Thanks again for your thoughts here, they clarified several things for me.

I think this point is very important, and I refer to it constantly.

I wish that I'd said "the prototypical AI catastrophe is either escaping from the datacenter or getting root access to it" instead (as I noted in a comment a few months ago).

A year later, I still mostly stand by this point. I think that "the AI escapes the datacenter" seems about as likely as "the AI takes control of the datacenter". I sometimes refer to this distinction as "escaping out of the datacenter" vs "escaping into the datacenter".

Jan Leike's post on self-exfiltration is pretty relevant.

it’s the AI doing something relatively easy that is entirely a zero-sum action that removes control of the situation from humans.

I also don't really understand the emphasis on zero-sum action. It seems pretty plausible that the AI will trade for compute with some other person around the world. My guess is, in-general, it will be easier to get access to the internet or have some kind of side-channel attack than to get root access to the datacenter the AI runs on. I do think most-likely the AI will then be capable of just taking resources instead of paying for them, but that feels like a fact contingent on quite a few details on how it goes, and trading for compute and other resources seems pretty plausible to me.

It seems pretty plausible that the AI will trade for compute with some other person around the world.

Whether this is what I'm trying to call a zero-sum action depends on whose resources it's trading. If the plan is "spend a bunch of the capital that its creators have given it on compute somewhere else", then I think this is importantly zero-sum--the resources are being taken from the creators of the AI, which is why the AI was able to spend so many resources. If the plan was instead "produce some ten trillion dollar invention, then sell it, then use the proceeds to buy compute elsewhere", this would seem less zero-sum, and I'm saying that I expect the first kind of thing to happen before the second.