Eight claims about multi-agent AGI safety

Richard_Ngo

There are quite a few arguments about how interactions between multiple AGIs affect risks from AGI development. I’ve identified at least eight distinct but closely related claims which it seems worthwhile to disambiguate. I’ve split them up into four claims about the process of training AGIs, and four claims about the process of deploying AGIs; after listing them, I go on to explain each in more detail. Note that while I believe that all of these ideas are interesting enough to warrant further investigation, I don’t currently believe that all of them are true as stated. In particular, I think that so far there's been little compelling explanation of why interactions between many aligned AIs might have catastrophic effects on the world (as is discussed in point 7).

Claims about training

1. Multi-agent training is one of the most likely ways we might build AGI.

2. Multi-agent training is one of the most dangerous ways we might build AGI.

3. Multi-agent training is a regime in which standard safety techniques won’t work.

4. Multi-agent training allows us to implement important new safety techniques.

Claims about deployment

5. We should expect the first AGIs to be deployed in a world which already contains many nearly-as-good AIs.

6. We should expect AGIs to be deployed as multi-agent collectives.

7. Lack of coordination between multiple deployed AGIs is a major source of existential risk.

8. Conflict between multiple deployed AGIs risks causing large-scale suffering.

Details and arguments

1. Multi-agent training is one of the most likely ways we might build AGI.

The core argument for this thesis is that multi-agent interaction was a key feature of the evolution of human intelligence, by promoting both competition and cooperation. Competition between humans provides a series of challenges which are always at roughly the right level of difficulty; Leibo et al. (2019) call this an autocurriculum. Autocurricula were crucial for training sophisticated reinforcement learning agents like AlphaGo and OpenAI Five; it seems plausible that they will also play an important role in training AGIs. Meanwhile, the usefulness of cooperation led to the development of language, which plays a core role in human cognition; and the benefits of cooperatively sharing ideas allowed the accumulation of human cultural skills and knowledge more generally.

2. Multi-agent training is one of the most dangerous ways we might build AGI.

Humans have skills and motivations (such as deception, manipulation and power-hungriness) which would be dangerous in AGIs. It seems plausible that the development of many of these traits was driven by competition with other humans, and that AGIs trained to answer questions or do other limited-scope tasks would be safer and less goal-directed. I briefly make this argument here.

3. Multi-agent training is a regime in which standard safety techniques won’t work.

Most approaches to safety rely on constructing safe reward functions. But Ecoffet et al. (2020) argue that “open-ended” environments give rise to incentives which depend on reward functions in complex and hard-to-predict ways. Open-endedness is closely related to self-play (which was used to train AlphaGo) and multi-agent environments more generally. When a task involves multiple agents, those agents might learn many skills that are not directly related to the task itself, but instead related to competing or cooperating with each other. E.g. compare a language model like GPT-3, which was directly trained to output language, to the evolution of language in humans - where evolution only selected us for increased genetic fitness, but we developed language skills because they were (indirectly) helpful for that.

Furthermore, as I point out here, it’s not even clear what “good behaviour” would actually look like in such environments, since they don’t necessarily contain tasks corresponding directly to things we’d like AIs to do in the real world. And fine-tuning on real-world tasks may not be sufficient to override dangerous motivations acquired during extensive multi-agent training.

4. Multi-agent training allows us to implement important new safety techniques.

The most central example of a safety technique which relies on multi-agent environments is probably work done by Gillian Hadfield and others about learning group-level norms. More generally, CHAI’s concept of Assistance Games frames the machine learning training process as an interactive game played between humans and AIs, to better allow humans to guide AI behaviour.

I’ve also written about some tentative ideas for how to select for obedience in multi-agent environments.

5. We should expect the first AGIs to be deployed in a world which already contains many nearly-as-good AIs.

Paul Christiano defends this thesis as follows:

Lots of people will be trying to build powerful AI.
For most X, it is easier to figure out how to do a slightly worse version of X than to figure out how to do X.
The worse version may be more expensive, slower, less reliable, less general… (Usually there is a tradeoff curve, and so you can pick which axes you want the worse version to be worse along.)
If many people are trying to do X, and a slightly worse version is easier and almost-as-good, someone will figure out how to do the worse version before anyone figures out how to do the better version.

Robin Hanson also argues that progress in AI will be widely-distributed, and not very “lumpy”. He discusses this argument here, in part by summarising the lengthy AI foom debate.

6. We should expect AGIs to be deployed as multi-agent collectives.

I discuss this hypothesis here, building on concepts introduced by Bostrom. Summary: after training AGI, there will be strong incentives to copy it many times to get it to do more useful work. If that work involves generating new knowledge, then putting copies in contact with each other to share that knowledge would also increase efficiency. This would be easier if they had already been trained to collaborate; but even if not, their general intelligence should allow them to learn to work together. And so, one way or another, I expect that we’ll eventually end up dealing with a “collective” of AIs, which we could also think of as a single “collective AGI”.

Arguably, on a large-scale view, this is how we should think of humans. Each individual human is generally intelligent in their own right. Yet from the perspective of chimpanzees, the problem was not that any single human was intelligent enough to take over the world, but rather that millions of humans underwent cultural evolution to make the human collective much more intelligent.

7. Lack of coordination between multiple deployed AGIs is a major source of existential risk.

Critch makes this case here, summarising this more extensive report. He distinguishes between single AIs which are aligned to single humans (single/single delegation), versus the problem of living in a society where many AIs are each used on behalf of many humans (multi/multi delegation):

It might be that future humans would struggle to coordinate on the globally safe use of powerful single/single AI systems, absent additional efforts in advance to prepare technical multi/multi delegation solutions.
For a historical analogy supporting this view, consider the stock market “flash crash” of 6 May 2010, viewed as one of the most dramatic events in the history of financial markets. The flash crash was a consequence of the use of algorithmic stock trading systems by competing stakeholders. If AI technology significantly broadens the scope of action and interaction between algorithms, the impact of unexpected interaction effects could be much greater, and might be difficult to anticipate in detail.

Note that he claims that this may be true even if single/single alignment is solved, and all AGIs involved are aligned to their respective users.

8. Conflict between multiple deployed AGIs risks causing large-scale suffering.

The Centre on Long-term Risk argues for this thesis in this research agenda. Key idea:

Many of the cooperation failures in which we are interested can be understood as mutual defection in a social dilemma. Informally, a social dilemma is a game in which everyone is better off if everyone cooperates, yet individual rationality may lead to defection. … An example of potentially disastrous cooperation failure is extortion (and other compelling threats), and the execution of such threats by powerful agents.

Since threats are designed to be strong disincentives, we should expect that the types of threats made against aligned AIs will be very undesirable by human moral standards, and try to design AGIs in ways which prevent threats from being carried out by or against them.
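The informal definition of a social dilemma quoted above can be made concrete with a small check. This is only a sketch: it uses the textbook Prisoner's Dilemma as one instance of a social dilemma, and the payoff numbers, the `PAYOFFS` table and the function names are illustrative assumptions, not something taken from the research agenda. It also checks the stronger condition that defection is a dominant strategy, whereas the quoted definition only says individual rationality "may" lead to defection.

```python
# Hypothetical Prisoner's Dilemma payoffs: (row player, column player).
# "C" = cooperate, "D" = defect. The numbers are illustrative assumptions.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def best_response(opponent_action):
    """Row player's payoff-maximising action against a fixed opponent action."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])


def is_social_dilemma():
    """True if defection dominates, yet everyone prefers mutual cooperation.

    Only the row player's best responses are checked, which is enough here
    because the assumed game is symmetric.
    """
    defection_dominates = all(best_response(o) == "D" for o in ACTIONS)
    cooperation_better_for_all = all(
        c > d for c, d in zip(PAYOFFS[("C", "C")], PAYOFFS[("D", "D")])
    )
    return defection_dominates and cooperation_better_for_all


print(is_social_dilemma())
```

With these assumed payoffs, each player does better by defecting whatever the other does, even though both would prefer mutual cooperation to mutual defection.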

I for one am convinced!

I'm curious in particular about the conjecture that deception etc. arose in us thanks to our multi-agent evolutionary training. Honesty, too, arose that way. So I'm not sure whether (say) a system trained to answer questions in such a way that the humans watching it give reward would be more or less likely to be deceptive. I lean towards optimism.

As for 7, I'm surprised that Critch uses the flash crash as his example. If I were to argue for 7 I'd talk about how human states nearly caused nuclear MAD on several occasions, and how multiple AIs aligned with multiple humans would be relevantly similar--yes, better intelligence and coordination abilities maybe, but also much more powerful world-destroying tech. Could end up in a vulnerable world, where there are thousands of AI-human actors any one of which could destroy the world.

Honesty, too, arose that way. So I'm not sure whether (say) a system trained to answer questions in such a way that the humans watching it give reward would be more or less likely to be deceptive.

I think it is mistaken. (Or perhaps I don't understand a key claim / assumption.)

Honesty evolved as a group dynamic, where it was beneficial for the group to have ways for individuals to honestly commit, or make lying expensive in some way. That cooperative pressure dynamic does not exist when a single agent is "evolving" on its own in an effectively static environment of humans. It does exist in a co-evolutionary multi-agent dynamic - so there is at least some reason for optimism within a multi-agent group, rather than between computational agents and humans - but the conditions for cooperation versus competition seem at least somewhat fragile.

I'm confused because the stuff you wrote in the paragraph seems like an expanded version of what I think. In other words it supports what I said rather than objects to it.

My point was that deception will almost certainly outperform honesty/cooperation when AI is interacting with humans, and on reflection, seems likely to do so even when interacting with other AIs by default, because there is no group selection pressure.

I think I was thinking that in multi-agent training environments there might actually be group selection pressure for honesty. (Or at least, there might be whatever selection pressures produced honesty in humans, even if that turns out to be something other than group selection.)

Selection in humans is via mutation, so that closely related organisms can get a benefit from cooperating, even at the cost of personally not replicating. As a JBS Haldane quote puts it, "I would gladly give up my life for two brothers, or eight cousins."

Continuing from that paper, explaining it better than I could:

"What is more interesting, it is only in such small populations that natural selection would favour the spread of genes making for certain kinds of altruistic behaviour. Let us suppose that you carry a rare gene which affects your behaviour so that you jump into a river and save a child, but you have one chance in ten of being drowned, while I do not possess the gene, and stand on the bank and watch the child drown.

If the child is your own child or your brother or sister, there is an even chance that the child will also have the gene, so five such genes will be saved in children for one lost in an adult. If you save a grandchild or nephew the advantage is only two and a half to one. If you only save a first cousin, the effect is very slight. If you try to save your first cousin once removed the population is more likely to lose this valuable gene than to gain it."
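The arithmetic in Haldane's example can be reproduced in a few lines. This is a minimal sketch: the one-in-ten drowning risk and the "even chance" for a child or sibling come from the quote, while the coefficients for the other relatives are the standard relatedness values (1/4, 1/8, 1/16), which the quote implies but does not state. The names `RELATEDNESS` and `genes_saved_per_gene_lost` are made up for illustration.

```python
from fractions import Fraction

# Risk of the rescuer drowning, taken from the quoted example.
DROWNING_RISK = Fraction(1, 10)

# Probability that the rescued relative carries the same rare gene.
# 1/2 is stated in the quote; the rest are standard coefficients (assumed).
RELATEDNESS = {
    "child or sibling": Fraction(1, 2),
    "grandchild or nephew": Fraction(1, 4),
    "first cousin": Fraction(1, 8),
    "first cousin once removed": Fraction(1, 16),
}


def genes_saved_per_gene_lost(relatedness, risk=DROWNING_RISK):
    """Expected copies of the gene saved per copy lost in a rescuer."""
    return relatedness / risk


for relative, r in RELATEDNESS.items():
    ratio = genes_saved_per_gene_lost(r)
    verdict = "spreads" if ratio > 1 else "declines"
    print(f"{relative}: {float(ratio)} to 1 ({verdict})")
```

The ratios come out as 5, 2.5, 1.25 and 0.625, matching the quote's "five ... for one", "two and a half to one", "very slight", and a net loss for the first cousin once removed.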

Right, so... we need to make sure selection in AIs also has that property? Or is the thought that even if AIs evolve to be honest, it'll only be with other AIs and not with humans?

As an aside, I'm interested to see more explanations for altruism lined up side by side and compared. I just finished reading a book that gave a memetic/cultural explanation rather than a genetic one.

Planned summary for the Alignment Newsletter:

This post clearly states eight claims about multiagent AGI safety, and provides brief arguments for each of them. Since the post is itself basically a summary, I won’t go into detail here.

Another possible argument is that we can't tell when multiple AIs are failing or subverting each other.
Agents pursuing their own goals in a multi-agent environment are intrinsically manipulative, and when agents manipulate one another, it happens in ways that we do not know how to detect or consider. This is somewhat different than when they manipulate humans, where we have a clear idea of what does and does not qualify as harmful manipulation.

There are quite a few arguments for why we should move beyond the standard single-AGI safety paradigm.

Fwiw, I would classify all of 5-8 as reasons that AI governance should care about multiple AI systems (which it always has); I don't see why they require technical AI alignment research to move beyond the single-AGI paradigm.

(Here "AI alignment" is the problem of "how do you ensure that your AI system is not adversarially optimizing against you", and not making any claims about what other AI systems will do.)

I'd say that each of #5-#8 changes the parts of "AI alignment" that you focus on. For example, you may be confident that your AI system is not optimising against you, without being confident that 1000 copies of your AI system working together won't be optimising against you. Or you might be confident that your AI system won't do anything dangerous in almost all situations, but no longer confident once you realise that threats are adversarially selected to be extreme.

Whether you count these shifts as "moving beyond the standard paradigm" depends, I guess, on how much they change alignment research in practice. It seems like proponents of #7 and #8 believe that, conditional on those claims, alignment researchers' priorities should shift significantly. And #5 has already contributed to a shift away from the agent foundations paradigm. On the other hand, I'm a proponent of #6, and I don't currently believe that this claim should significantly change alignment research (although maybe further thought will identify some ways).

I think I'll edit the line you quoted to say "beyond standard single-AGI safety paradigms" to clarify that there's no single paradigm everyone buys into.

Whether you count these shifts as "moving beyond the standard paradigm" depends, I guess, on how much they change alignment research in practice. It seems like proponents of #7 and #8 believe that, conditional on those claims, alignment researchers' priorities should shift significantly.

I would say that proponents of #7 and #8 believe that longtermists' priorities should shift significantly (in the case of #8, might just be negative utilitarians). They are proposing that we focus on other problems that are not AI alignment (as I defined it above).

This might just be a semantic disagreement, but I do think it's an important point -- I wouldn't want people to say things like "people argue that it will become easier to engineer biological weapons than to build AGI, and therefore biosecurity is more important. Thus we need to move beyond the AGI paradigm to the emerging technologies paradigm". Like, it's correct, but it is creating too much generality; it is important to be able to focus on specific problems and make claims about those problems. Arguments 7-8 feel to me like "look, there's this other problem besides AI alignment that might be more important"; I don't deny that this could change what you do, but it doesn't change what the field of AI alignment should do.

(You might say that you were talking about AI safety generally, and not AI alignment, but then I dispute that AI safety ever had a "single-AGI" paradigm; people have been talking about multipolar outcomes for a long time.)

And #5 has already contributed to a shift away from the agent foundations paradigm.

Yes, but not to a multiagent paradigm, which I thought was your main claim.

This all seems straightforwardly correct, so I've changed the line in question accordingly. Thanks for the correction :)

One caveat: technical work to address #8 currently involves either preventing AGIs from being misaligned in ways that lead them to make threats, or preventing AGIs from being aligned in ways which make them susceptible to threats. The former seems to qualify as an aspect of the "alignment problem", the latter not so much. I should have used the former as an example in my original reply to you, rather than using the latter.

I found this quite compelling. I don't think I am sold on some of the things yet (in particular claims 5 and 6), but thanks a lot for writing this up this clearly. I will definitely take some time to think more about this.

Humans have skills and motivations (such as deception, manipulation and power-hungriness) which would be dangerous in AGIs. It seems plausible that the development of many of these traits was driven by competition with other humans, and that AGIs trained to answer questions or do other limited-scope tasks would be safer and less goal-directed. I briefly make this argument here.
Note that he claims that this may be true even if single/single alignment is solved, and all AGIs involved are aligned to their respective users.

It strikes me as interesting that much of the existing work that's been done on multiagent training, such as it is, focusses on just examining the behaviour of artificial agents in social dilemmas. The thinking seems to be - and this was also suggested in ARCHES - that it's useful just for exploratory purposes to try to characterise how and whether RL agents cooperate in social dilemmas, what mechanism designs and what agent designs promote what types of cooperation, and if there are any general trends in terms of what kinds of multiagent failures RL tends to fall into.

For example, it's generally known that regular RL tends to fail to cooperate in social dilemmas, 'Unfortunately, selfish MARL agents typically fail when faced with social dilemmas'. From ARCHES:

One approach to this research area is to continually examine social dilemmas through the lens of whatever is the leading AI development paradigm in a given year or decade, and attempt to classify interesting behaviors as they emerge. This approach might be viewed as analogous to developing “transparency for multi-agent systems”: first develop interesting multi-agent systems, and then try to understand them.

There seems to be an implicit assumption here that something very important and unique to multiagent situations would be uncovered - by analogy to things like the flash crash. It's not clear to me that we've examined the intersection of RL and social dilemmas enough to notice if this were true, if it were true, and I think that's the major justification for working on this area.
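As a toy illustration of the claim that selfish learners tend to fail in social dilemmas, here is a deterministic sketch. It is not the setup used in the MARL literature or in ARCHES: it assumes a one-shot Prisoner's Dilemma with made-up payoffs, and two agents that each do plain gradient ascent on their own expected payoff. All names and numbers are hypothetical.

```python
# Assumed Prisoner's Dilemma payoffs: Reward, Sucker, Temptation, Punishment.
R, S, T, P = 3.0, 0.0, 5.0, 1.0


def expected_payoff(p_self, p_other):
    """Expected payoff when each agent cooperates with the given probability."""
    return (p_self * p_other * R
            + p_self * (1 - p_other) * S
            + (1 - p_self) * p_other * T
            + (1 - p_self) * (1 - p_other) * P)


def selfish_gradient(p_other):
    """Derivative of an agent's own expected payoff w.r.t. its own p_self."""
    return p_other * (R - T) + (1 - p_other) * (S - P)


def train(p1=0.9, p2=0.9, lr=0.1, steps=100):
    """Simultaneous gradient ascent by two independent, selfish learners."""
    for _ in range(steps):
        g1, g2 = selfish_gradient(p2), selfish_gradient(p1)
        # Clip so the cooperation probabilities stay in [0, 1].
        p1 = min(1.0, max(0.0, p1 + lr * g1))
        p2 = min(1.0, max(0.0, p2 + lr * g2))
    return p1, p2


p1, p2 = train()
print("final cooperation probabilities:", p1, p2)
print("payoff each:", expected_payoff(p1, p2))
print("payoff under mutual cooperation:", expected_payoff(1.0, 1.0))
```

Because the selfish gradient is negative for every opponent policy under these payoffs, both agents drift to always defecting even when they start out mostly cooperative, and each ends up with less than mutual cooperation would have given them.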

Strongly agree that it's unclear that these failures would be detected.
For discussion and examples, see my paper here: https://www.mdpi.com/2504-2289/3/2/21/htm

Nice post! I’m excited to see more attention being paid to multi-agent stuff recently.

A few miscellaneous points:

I get the impression that the added complexity of multi- relative to single-agent systems has not been adequately factored into folks’ thinking about timelines / the difficulty of making AGI that is competent in a multipolar world. But I’m not confident in that.
I think it’s possible that conflict / bargaining failure is a considerable source of existential risk, in addition to suffering risk. I don’t really have a view on how it compares to other sources, but I’d guess that it is somewhat underestimated, because of my impression that folks generally underestimate the difficulty of getting agents to get along (even if they are otherwise highly competent).

Thanks for writing this post! I usually focus on single/single scenarios, so it's nice to have a clear split of the multi-agent safety issues.

All claims make sense to me, with 1 being the one I'm least convinced about, and 5 depending on continuous takeoffs (which appear relatively likely to me as of now).