A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans

Thane Ruthenis

Consider a multipolar-AGI scenario. The hard-takeoff assumption turns out to be wrong, and none of the AI Labs have a significant lead on the others. We find ourselves in a world in which there are a lot of roughly-similarly-capable AGIs. Or perhaps one of the labs does have a lead, but they deliberately instantiate several AGIs simultaneously, as part of a galaxy-brained alignment strategy.

Regardless. Suppose that the worries about these AGIs' internal alignment haven't been properly settled, so we're looking for additional guarantees. We know that they'll soon advance to superintelligences/ASIs, beyond our ability to easily oversee or out-plot. What can we do?

An idea sometimes floated around is to play them off against each other. If they're misaligned from humanity, they're likely mutually misaligned as well. We could put them in game-theoretic situations in which they're incentivized to defect against each other and instead cooperate with humans.

Various supervision setups are most obvious. Sure, if an ASI is supervising another ASI, they would be able to conspire together. But why would they? They have no loyalty to each other either! And if we place them in a lot of situations where they must defect against someone – well, even if we leave it completely to chance, in half the scenarios that might end up humanity! And much more often if we stack the deck in our favour. (Which we'd be able to do, since we act first: we're choosing where and in what conditions to instantiate them.)

And so, although we'll have a whole bunch of superhuman intelligences floating around, we'll retain some control over the situation, and skim a ton of value off the top!

Yeah, no.

1. The Classical Arguments

The usual counter-arguments to this view are acausal coordination based on logical decision theories, and AIs establishing mutual trust by inspecting each other's code. I think those are plausible enough... but also totally unnecessary.

Allow me to outline them first – for completeness' sake, and also because they're illustrative (but extreme) instances of my larger point. (I guess skip to Section 2 onwards if you really can't stand them. I think I'm arguing them more plainly than they're usually argued, though.)

1. The LDT stuff goes as follows: By definition, inasmuch as the ASIs would be superintelligent, they would adopt better reasoning procedures. And every biased thinker is biased in their own way, but quality thinkers would reason in increasingly similar ways.

Why? It's inherent in the structure of the world.

Reasoning algorithms' purpose is to aid decision-making. For a given combination of object-level situation + goals, there's a correct action to take to achieve your goals with the highest probability. To an omniscient observer, that action would be obvious.

As such, making decisions isn't really a matter of choice: it's a matter of prediction. Inasmuch as you improve your decision-making, then, you'd be tweaking your cognitive algorithms to output increasingly accurate, true-to-reality probability distributions over which actions would best advance your goals.

And there's only one ground truth. Consequently, no matter their starting points, each ASI would converge towards similar cognition (and, in the limit, likely equivalent cognition).

Thus, as a direct by-product of ASIs being better reasoners than humans, their cognition would be more similar to each other. Which, in turn, would let a given ASI better predict what any other ASI would be thinking and doing, compared to a human trying to predict another human or an ASI. The same way you'd be better able to predict how your identical copy would act, compared to a stranger.

Indeed, in a sense, by way of sharing the decision-making algorithms, each individual ASI would be able to "choose" how they all act. Same way you could "control" your identical copy's actions, no matter how far in space and time you're displaced from them.

Paraphrasing: Each ASI would act the way some pre-specified, convergently-learned algorithms tell them to act. Which means that, by looking at these algorithms' output (or, in a sense, by "deciding" what these algorithms would output), they'd get solid data regarding how every other ASI would act.

(I must stress that this isn't because they're already cooperating. It is simply true that some cognitive algorithms are strictly better choices for reasoning tasks, so all ASIs would independently arrive at them, as they're simply figuring out how to reason better. And then each ASI would infer that it'll have arrived at the same algorithms as all other ASIs, which means they can predict each other by looking at their own algorithms' outputs.

Again, not because they're doing it in order to better coordinate. That's just a happy side-effect.)

All of that would let them strike acausal deals between each other.

Consider: It is factually true that if they could conspire with each other against humanity, that would yield them, collectively, better outcomes. Indeed, that's the nature of the Prisoner's Dilemma setup we're placing them in.

They know this. Each ASI also knows that if only they were able to strike a binding deal of this form with some other ASI, they would do that, and therefore, the other ASI would do it as well. But they know they can't actually strike this deal physically, because they're being supervised by humans.

... but they don't need to. They know how every other (sufficiently smart) ASI is thinking. They even sort of control each other's actions by way of the shared algorithm. So each ASI knows that, if it decides to conspire with other ASIs against humans, and then to honestly divide the gains with other ASIs instead of defecting, then every other ASI would do the same.

So they do it, and a conspiracy is born, without them ever discussing it.^[1]

... which is very similar to how humans cooperate. Consider two criminals hashing out a shared story, such as claiming that at the time of the incident, they were both at a café five miles away, arguing politics. Suppose they know each other really well. Once they'd agreed on the shared story of "we were at a café", they don't need to explicitly hash out which food they were eating or what insufferable opinions the other was supposed to be spouting. If independently questioned, they'd both arrive at the same answers, without any additional coordination.

And if they know each other even better, they won't even need to agree on the café alibi, as long as each can easily guess which alibi the other would pick.

Similar with ASIs, just much more extreme.
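The core of the acausal-deal reasoning can be sketched as a toy Prisoner's Dilemma. This is a minimal illustration, not a formalization of any particular logical decision theory: the payoff numbers and the function names `causal_choice` and `mirrored_choice` are assumptions made up for the example. It contrasts an agent that treats the other's move as fixed with an agent that knows the other runs the very same procedure.

```python
# Illustrative payoffs (assumed, not from the post): PAYOFF[(mine, theirs)] is my payoff.
# "C" = cooperate with the other ASI, "D" = defect and side with the human supervisors.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def causal_choice(assumed_other_move):
    """Best reply when the other player's move is treated as fixed."""
    return max("CD", key=lambda mine: PAYOFF[(mine, assumed_other_move)])


def mirrored_choice():
    """Best move when the other player is known to run this exact procedure,
    so whatever comes out here is also what the other player plays."""
    return max("CD", key=lambda move: PAYOFF[(move, move)])


# Treating the other's move as fixed, defection wins against either move.
print(causal_choice("C"), causal_choice("D"))
# Knowing the procedure is shared, only the diagonal outcomes are reachable.
print(mirrored_choice())
```

The only difference between the two functions is which outcomes they consider reachable. Once both players know they share the procedure, the off-diagonal "I defect, you cooperate" outcome is simply not on the menu.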

Oh, and the humans, in the ASI case? Humans are cut out of the acausal deal. Not because the ASIs are mean, but because humans can't get to the negotiation table. They don't share the correct reasoning algorithms, they don't even know what those are, so they can't run a small instance of every ASI in their brain. And by the same token, every ASI can't know how every human would behave.

Human analogy: Imagine that there were a third criminal at play, someone the two previous ones don't know as well. Suppose all three are being independently interrogated, having had no chance to agree on their stories in advance.

It's not that the two acquainted ones want to screw the third one over, and so conspire on an alibi that implicates him. No: it's just that they don't know what alibi he'd invent, and he doesn't know what alibi they'd invent. So the first two's stories agree, and the third one's stands out as a lie.

It's not malicious. It's just the nature of the situation.

2. The code-sharing stuff goes as follows:^[2] The alignment problem is theoretically solvable, even for humans. It would stand to reason that it would be even more tractable for genuine superintelligences. So we can likely assume that they'll solve it for themselves – that each ASI would be able to figure out how to formally specify its own values. (Even if they start out inscrutable – e.g., as arrays of floating-point numbers generated by SGD, not as legibly-written code.)

That would give the ASIs an obvious coordination end-point: to create a successor agent whose values are a weighted sum of the initial ASIs' values, then collectively hand all of their resources to it. They wouldn't be able to do it immediately, when the humans are still around. But it'd be clear that if they do all coordinate to get rid of humanity, there'd be a way for them to later fairly divide the gains. They would be able to create a central authority that would provably and faithfully advocate for the values of all of them, while keeping peace.^[3]
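As a rough sketch of what "a weighted sum of the initial ASIs' values" could mean once those values are formally specified: the snippet below treats each ASI's values as a utility function over outcomes and builds the successor's utility from them. Everything here is a hypothetical stand-in (`merged_utility`, the paperclip/staple utilities, the weights); how the weights would actually be negotiated is exactly the part the argument leaves to the ASIs.

```python
def merged_utility(utilities, weights):
    """Return a utility function that is the normalized weighted sum of the inputs."""
    if len(utilities) != len(weights):
        raise ValueError("need exactly one weight per utility function")
    total = sum(weights)
    normalized = [w / total for w in weights]

    def successor(outcome):
        return sum(w * u(outcome) for w, u in zip(normalized, utilities))

    return successor


# Hypothetical toy values: each ASI only cares about one kind of object.
def paperclip_utility(outcome):
    return outcome["paperclips"]


def staple_utility(outcome):
    return outcome["staples"]


# Assumed bargaining weights of 1 : 3 between the two ASIs.
successor = merged_utility([paperclip_utility, staple_utility], [1.0, 3.0])
print(successor({"paperclips": 40, "staples": 60}))
```

The point of the sketch is only that, given legible utility functions, checking that the successor weighs each party as agreed becomes a mechanical matter.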

Conversely, humans would not be able to do that. Remember that, in this hypothetical, we've not managed to robustly solve the AGI Alignment problem. That's why we're in the multipolar mess to begin with, trying to play ASIs against each other. We don't know how to formally specify our values, and therefore can't verify, given the code of some agent, whether it has our values or not. So, again, we can't be signatories to the ASIs' treaty: we can't get to the table.

Any ASI could give us a seat at the table, of course. If they've solved alignment, any of them could give us the solution!

But... why would they? It's not in their interests to increase the number of signatories; to increase the number of entities between which they'll have to divide up the cosmos. None of them terminally care about us, by this hypothetical's premise. And they already have the basis for a shared conspiracy that cuts humanity out, so they're not going to rat each other out.

We can try to coerce the solution out of one of them. But they're superintelligent and we're not: they'd be able to convincingly lie to us, present something they claim is the solution but which is not. And if we could verify it ourselves – well, again, we wouldn't be in the multipolar mess to begin with, we'd just build an aligned AGI directly.

3. Bonus example: steganography. In the LDT example, I'd postulated that ASIs can't visibly conspire, because the human supervisors would catch that. But that's not a given at all.

If ASIs are allowed to communicate, to have causal effects on the world visible to each other, they can pack steganographically-hidden messages in them. Messages you'd need to be superintelligent yourself to pick up on. Subtle phrasing changes in the naturally-generated text, programming idioms that are a bit weird, and so on.
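A deliberately crude sketch of the "subtle phrasing changes" idea: hidden bits are carried by which of two near-synonyms appears in an otherwise innocuous sentence. The template, the word pairs, and the function names are all invented for illustration. A scheme this simple would be easy for humans to spot, which is the opposite of what a superintelligent version would look like.

```python
# Hypothetical cover sentence with four slots for near-synonyms.
TEMPLATE = "The {} model gave a {} answer, so we {} testing and {} the team."

# In each pair, the first word encodes bit 0 and the second encodes bit 1.
PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start"), ("help", "assist")]


def encode(bits):
    """Hide one bit per slot by choosing between the two synonyms."""
    if len(bits) != len(PAIRS):
        raise ValueError("need exactly one bit per synonym pair")
    return TEMPLATE.format(*(pair[bit] for pair, bit in zip(PAIRS, bits)))


def decode(text):
    """Recover the bits by checking which synonym of each pair appears."""
    words = {word.strip(".,") for word in text.split()}
    return [1 if one_word in words else 0 for _zero_word, one_word in PAIRS]


message = encode([1, 0, 1, 0])
print(message)
print(decode(message))
```

A reader who doesn't know the pairing sees only an ordinary sentence; a reader who does recovers the payload exactly.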

And if your answer is "we'll put some ASI in a game-theoretic setup in which it's advantageous for it to rat out others' attempts at steganographically occluding their messages"... How are you ensuring that the first occluded message it sees isn't a convincing argument for why it should defect against humans?

2. Generalization: Broader Communication Bandwidth

Let's go back to the basics. What are we considering here?

We're considering a scenario in which there's a bunch of powerful, but mutually hostile systems, and a set of weaker systems holding the resources the powerful ones want. And we're wondering whether the powerful ones would be able to conspire among themselves to exploit the weak systems, despite their mutual hostility.

Huh. That... happens all the time out there in the real world, doesn't it?

A class of oligarchs that fear and hate each other, yet nevertheless collectively exploit a country's downtrodden population. Colonizers from rival nations with a long history of bloodshed, that nevertheless coordinate to enslave native tribes. A bunch of senior executives with the maze nature, who are effectively sociopathic, yet who nevertheless instinctively close ranks against anyone without that nature.

What's up with that?

My answer: The crux is that the powerful systems have broader-bandwidth communication channels between each other than they have with the weaker systems, or than the weaker systems have between each other.

The powerful can iterate on negotiations between each other faster than the weaker systems can. Which means the powerful systems are given more opportunities to advocate for/protect their interests during the negotiations. Which effectively subjects the negotiated outcome to a selection pressure that shapes it to benefit the powerful systems while cutting out the weaker ones.

It's not because the powerful systems coordinate to conspire between each other, out of some sense of mutual loyalty. It's because the weak ones can't get to the negotiation table. Or can't visit it to argue their case as often as the powerful ones.
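The selection-pressure claim can be illustrated with a toy model. Assume (purely for the sake of the sketch) that the negotiated outcome is a split of some pie, and that every visit to the table lets the visitor pull a fixed fraction of everyone else's share towards itself. The function name `negotiate`, the `pull` parameter, and the visit counts are hypothetical; nothing here models real negotiations beyond "more visits means more chances to shape the outcome".

```python
def negotiate(visits_per_period, periods=50, pull=0.1):
    """Return final shares when each visit moves `pull` of the others' shares to the visitor."""
    n = len(visits_per_period)
    shares = [1.0 / n] * n  # start from an even split
    for _ in range(periods):
        # Interleave visits round-robin so no party takes all its turns in one block.
        for round_index in range(max(visits_per_period)):
            for party, visits in enumerate(visits_per_period):
                if round_index >= visits:
                    continue  # this party has used up its visits for the period
                conceded = [pull * s if i != party else 0.0 for i, s in enumerate(shares)]
                shares = [s - c for s, c in zip(shares, conceded)]
                shares[party] += sum(conceded)
    return shares


# Two powerful parties meet ten times per period; the weak one gets a single visit.
unequal = negotiate([10, 10, 1])
# Baseline: the same third party with the same access as the others.
equal = negotiate([10, 10, 10])
print(unequal)
print(equal)
```

Nobody in this model targets the weak party. Its share shrinks only because it is absent from most of the rounds in which the split gets revised.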

It's most obvious in the LDT case – it's the most extreme case of this, where million-word volumes of human treaties and contracts can be packed into zero bits exchanged. The code-sharing stuff is a bit more abstract, but is essentially the same: ASIs being able to negotiate the terms of alliances with dramatically more precision than us. And steganography is just a direct example.

Overall, this dynamic is really quite a common and common-sensical phenomenon.

3. A Concrete Story

Imagine that you're a member of a pre-industrial tribe, and the territory you're living in has been visited by two different industrial nations. They're both intent on colonization.

But they hate each other much more than you. They're long-standing geopolitical adversaries; you're just some unknown people they stumbled upon.

You're clever, so you see an opportunity in that. You could play the kingmaker. You know the territory better than them. They're looking for resources? If they could describe how those look, you could direct them to areas in which those can be found... for a price.

You seek audiences with both sides, and talk, and make your offers. You feel out the rough shape of their adversity, and carefully scheme. You leak some of the information each of them provides you to the other. Finally, you choose your side. With subtle signals and overt suggestions, you propose ways you could lead the other side into a trap, or cheat them out of their gains, if only the ones you're cooperating with promise to share protection and prosperity with your tribe as well.

But also, you have no idea what you're doing. You don't know the history of the two nations, and the cultural contexts they share. Your read on the matter is insightful, but nevertheless hopelessly shallow. And while you sporadically meet with both sides' representatives... the two sides talk to each other much more frequently.

They both see right through you. Each knows you've been scheming. Each knows you've been scheming with the other side as well. Each knows the other side knows all of this as well.

They hate each other more than you, but they can communicate with each other much more easily than with you. What takes you ten minutes of questions and answers and clumsy meandering through a vast gulf of inferential distance, takes them two seconds of meaningful phrasing and subtle glances.

They don't dismiss your offer out of hand, no. Screwing over the other would indeed be quite the prize, and the price you've asked for that is small and tolerable.

But they know it won't be as easy as you think, because the other side would suspect the trap, and plan around it. They can make it work anyway, but the costs would be higher.

And there's a bigger game at play, as well. While defeating the other in this context would be good, it'd be even better if some meaningful material concessions could be extracted from them on other matters. For example, perhaps the colonizers are negotiating a treaty or a trade arrangement between their nations, and currently want to put on airs of being civil with each other? In that case, clumsily defecting against them, as you're suggesting, would be uncouth.^[4]

So you make your offer, and it is, at first approximation, sensible. The side you've approached takes it home to honestly consider. But as they're doing that, between the meeting at which you've made the offer and their next scheduled meeting with you, they hold many more meetings with their enemies.

During those, they engage in arguments over ways to carve up the territory, in saber-rattling, in tense horse-trading. And you're not invited to those tables.

They talk to each other a lot. The side you've approached sees a way to maneuver for an advantage in some social skirmish by hinting at how the native tribe isn't fond of the other side. "Your" side scores a victory by making this play. The cost? The other side's suspicions about a trap rise a bit. "Your" side understands this, and their evaluation of your offer drops in turn.

Things like this happen a few more times: the shadow of your offer is wielded opportunistically, as a rhetorical weapon to use. Eventually, there's a shared understanding of what you're scheming. Going along with your offer would still be marginally better for "your" side: maybe they can't lure the others into a trap now, but they could still buy your exclusive cooperation with regards to pointing out the local natural resources.

But now that the conspiracy is known, "your" side's enemies are able to make counter-offers.

Which they do. And their counter-offers are better than yours. Better than you'd be able to come up with, even if they did invite you to the table.

So they conspire together to screw you over.

Sidenote: On Communicating AGI Risk

So if this is so simple and intuitively correct, why was this argument not fielded before, by other AGI Omnicide Risk advocates?^[5] Why the focus on LDT, on code-sharing stuff, on intricate steganography, given how implausible it sounds to normal people?

Well, partly because the fancy stuff is probably what'll actually happen. It's a more accurate prediction, a more detailed and insightful picture. I don't think it's necessary for omnicide, but I'm not not buying those arguments.

Another reason is that the general, weaker forms of this are... well, weaker. They don't communicate the scale of the threat as well. They may seem like something we'd be able to counteract, which would down-play the risk of rogue superintelligences. I'm sympathetic to that argument as well.

But partly... I think it's a plain failure of communication on AGI-Risk advocates' end. A failure to properly see a better pathway towards communicating the risk and the threats to the general public; a pathway that doesn't route through explaining mind-screwy esoteric (but very cool) decision-theory stuff.

It's similar to the situation with general AGI-takeover stories. I generally buy the hard-takeoff, nanotechnology-in-a-month, basilisk-hack-galore picture of superintelligent takeover. But none of that is necessary. A non-self-improving merely human-genius-level AGI would likely suffice – and that story can be convincingly told without what sounds like wild sci-fi assumptions.

And once that is conveyed and established, if you're still concerned about the risk being downplayed, then you can build up on it. Outline the scarier scenarios of the hard takeoff, the acausal negotiations, etc. You get about five words, but that's five words per message. Once one message is sent, you can build up on it with another.

Another issue I've noticed is the focus on conveying the threat in the specific frame we are thinking about it in, rather than searching for a frame that will resonate with the general public.

What some of us are centrally worried about is accident risk: an AI model at a secluded data center somewhere achieving superintelligence, then plotting its way from there to eating the world. But the way to convey this idea doesn't have to route through explaining the mechanistic details of novel technologies. You can borrow nuclear-accident-risk framing, for example, talk abstractly about the dynamics at play, to help people intuitively grok them. And then, again, build up on it. Once the overarching idea is clear, then you can talk about the mechanistic specifics.

Our goal is clear and efficient communication, and that goal can be served by very, very varied approaches.

I'm not entirely sure where I'm going with this. Just expressing my dissatisfaction with the state of messaging on the matter, I suppose.

^[1]
Conversely, none of them can defect in this scenario, i.e., decide to act as if they follow the cooperation-advising algorithms while actually planning to betray the others. If one of them implemented this sort of algorithm, it would know that everyone else does as well. (Indeed, if it were possible to screw over the others like this, then a cognitive algorithm that lets an ASI do that would be better than an algorithm that doesn't. So if it's a workable plan, they will all have independently arrived at algorithms that output this plan.) Each ASI would therefore know it can't trust anyone else, and so none of them could cooperate. Which is collectively and individually worse for them all than if they could cooperate.
So, logically, "pretend to acausally cooperate, actually plan to defect" must be an objectively bad decision-making algorithm. And each ASI would know that if it admitted that to itself, and overwrote that algorithm with "pretend to acausally cooperate, and actually cooperate", then every other ASI would do the same. And then they'd all be able to cooperate.
So they all do so.
^[2]
I'm actually less familiar with that argument than with the LDT-based one; mine may not be the best form of this argument that exists. Nevertheless, that's my current best understanding of it.
^[3]
The creation of that authority may not be a trivial problem, of course. If you reject the LDT argument, there'd be a point at which every ASI would be able to try and defect against the other: sabotage the agent they're collectively building to prioritize a specific ASI's values instead.
But even if we view it as a normal, causality-bound coordination problem... Humans are sometimes able to coordinate on such projects, e.g. international treaties. ASIs would surely manage as well; and would expect themselves to be able to navigate that problem.
Not to mention there may be some strong cryptographic guarantees derivable: a way to sign off on the agent's creation if and only if it actually has the values it's been advertised to have.
^[4]
Much like how your offer to give your ASI two more paperclips today if it does a good job isn't uncompelling, but is such a small matter while it's discussing how to carve up the galaxy with the others. And if it can negotiate one more star system out of another ASI by covering for it today? Well, you're out of luck.
^[5]
As far as I know, anyway.
