The Center on Long-Term Risk (CLR) is focused on reducing risks of astronomical suffering, or s-risks, from transformative artificial intelligence (TAI). S-risks are defined as risks of cosmically significant amounts of suffering[1]. As has been discussed elsewhere, s-risks might arise by malevolence, by accident, or in the course of conflict.

We believe that s-risks arising from conflict are among the most important, tractable, and neglected of these. In particular, strategic threats by powerful AI agents or AI-assisted humans against altruistic values may be among the largest sources of expected suffering. Strategic threats have historically been a source of significant danger to civilization (the Cold War being a prime example). And the potential downsides from such threats, including those involving large amounts of suffering, may increase significantly with the emergence of transformative AI systems. For this reason, our current focus is technical and strategic analysis aimed at addressing these risks.

There are many other important interventions for s-risk reduction which are beyond the scope of this agenda. These include macrostrategy research on questions relating to s-risk; reducing the likelihood of s-risks from hatred, sadism, and other kinds of malevolent intent; and promoting concern for digital minds. CLR has been supporting work in these areas as well, and will continue to do so.

In this sequence of posts, we will present our research agenda on Cooperation, Conflict, and Transformative Artificial Intelligence. It is a standalone document intended to be interesting to people working in AI safety and strategy, with academics working in relevant subfields as a secondary audience. With a broad focus on issues related to cooperation in the context of powerful AI systems, we think that work on the questions raised in the agenda is beneficial under a range of normative views and empirical beliefs about the future course of AI development, even if at CLR we are particularly concerned with s-risks.

The purpose of this sequence is to

  • communicate what we think are the most important, tractable, and neglected technical AI research directions for reducing s-risks;
  • communicate what we think are the most promising directions for reducing downsides from threats more generally;
  • explicate several novel or little-discussed considerations concerning cooperation and AI safety, such as surrogate goals;
  • propose concrete research questions which could be addressed as part of a CLR Fund-supported project, by those interested in working as a full-time researcher at CLR, or by researchers in academia, at other EA organizations, at think tanks, or at AI labs;
  • contribute to the portfolio of research directions which are of interest to the longtermist EA and AI safety communities broadly.

The agenda is divided into the following sections:

  • AI strategy and governance. What does the strategic landscape at the time of TAI development look like (e.g., is it unipolar or multipolar? What is the balance between offensive and defensive capabilities?), and what does this imply for cooperation failures? How can we shape the governance of AI so as to reduce the chances of catastrophic cooperation failures?

  • Credibility. What might the nature of credible commitment among TAI systems look like, and what are the implications for improving cooperation? Can we develop new theory (such as open-source game theory) to account for relevant features of AI?

  • Peaceful bargaining mechanisms. Can we further develop bargaining strategies which do not lead to destructive conflict (e.g., by implementing surrogate goals)?

  • Contemporary AI architectures. How can we make progress on reducing cooperation failures using contemporary AI tools — for instance, learning to solve social dilemmas among deep reinforcement learners?

  • Humans in the loop. How do we expect human overseers or operators of AI systems to behave in interactions between humans and AIs? How can human-in-the-loop systems be designed to reduce the chances of conflict?

  • Foundations of rational agency, including bounded decision theory and acausal reasoning.
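One relevant feature mentioned above is mutual source-code transparency, as studied in open-source game theory. Below is a minimal toy sketch in the program-equilibrium style: agents in a prisoner's dilemma see each other's "source" before moving. All names here are illustrative, and a plain string stands in for actual program code, which real open-source game theory reasons about directly.

```python
C, D = "C", "D"  # cooperate / defect

# Each agent is a (source, policy) pair; each policy reads the
# opponent's source before choosing a move.
CLIQUE_SRC = "cooperate iff opponent's source matches mine"
DEFECT_SRC = "always defect"

def clique_policy(my_src: str, opp_src: str) -> str:
    # Cooperates only with exact copies of itself, so its commitment
    # to conditional cooperation is credible and unexploitable.
    return C if opp_src == my_src else D

def defect_policy(my_src: str, opp_src: str) -> str:
    return D

clique_bot = (CLIQUE_SRC, clique_policy)
defect_bot = (DEFECT_SRC, defect_policy)

def play(a, b):
    """One round in which both agents read each other's source."""
    (src_a, pol_a), (src_b, pol_b) = a, b
    return pol_a(src_a, src_b), pol_b(src_b, src_a)

print(play(clique_bot, clique_bot))  # ('C', 'C'): mutual cooperation
print(play(clique_bot, defect_bot))  # ('D', 'D'): the defector gains nothing
```

The point of the sketch is that source transparency enables commitments that are impossible for ordinary game-theoretic agents: the clique bot achieves mutual cooperation with copies of itself while remaining unexploitable by defectors, an outcome unavailable in the standard one-shot prisoner's dilemma.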

We plan to post two sections every other day. The next post in the sequence, "Sections 1 & 2: Introduction, Strategy and Governance", will be posted on Sunday, December 15.


  1. By "cosmically significant", we mean significant relative to expected future suffering. Note that it may turn out that the amount of suffering we can influence is dwarfed by suffering that we can't influence. By "expected suffering in the future" we mean "expectation of action-relevant suffering in the future". ↩︎


I find myself somewhat confused by s-risks as defined here; it's easy to generate clearly typical cases that very few would want, and hard to figure out where the boundaries are, and thus hard to figure out how much I should imagine this motivation impacting the research.

That is, consider the "1950s sci-fi prediction," where a slightly-more-competent version of humanity manages to colonize lots of different planets in ways that make them sort of duplicates of Earth. This seems like it would count as an s-risk if each planet has comparable levels of suffering to modern Earth and there are vastly more such planets. While this feels to me like "much worse than is possible," I'm not yet sold it's below the "ok" bar in the maxipok sense, but also it wouldn't seem too outlandish to think it's below that bar (depending on how bad you think life on Earth is now).

Do you think focusing on s-risks leads to meaningfully different technical goals than focusing on other considerations? I don't get that sense from the six headings, but I can imagine how it might add different constraints or different focus for some of them. For example, on the point of AI strategy and governance, it seems easiest to encourage cooperation when there are no external forces potentially removing participants from a coalition, but adding in particular ethical views possibly excludes people who could have been included. You might imagine, say, a carnivorous TAI developer who wants factory farming to make it to the stars.

This isn't necessarily a point against this view, according to me; it definitely is the case that focusing on alignment at all implies having some sort of ethical view or goal you want to implement, and it may be the case that being upfront about those goals simplifies or directs the technical work, as opposed to saying "we'll let future-us figure out what the moral goals are, first let's figure out how to implement any goals at all." But it does make me interested in how much disagreement you think there is on the desirability of future outcomes, weighted by their likelihood or something, between people primarily motivated by continued existence of human civilization and people primarily motivated by avoiding filling the universe with suffering or whatever other categories you think are worth considering.

Do you think focusing on s-risks leads to meaningfully different technical goals than focusing on other considerations?

I think it definitely leads to a difference in prioritization among the things one could study under the broad heading of AI safety. Hopefully this will be clear in the body of the agenda. And, some considerations around possible downsides of certain alignment work might be more salient to those focused on s-risk; the possibility that attempts at alignment with human values could lead to very bad “near misses” is an example. (I think some other EAF researchers have more developed views on this than myself.) But, in this document and my own current research I’ve tried to choose directions that are especially important from the s-risk perspective but which are also valuable by the lights of non-s-risk-focused folks working in the area.

[Just speaking for myself here]

I find myself somewhat confused by s-risks as defined here

For what it’s worth, EAF is currently deliberating about this definition and it might change soon.

Thanks, that helps!

For what it’s worth, EAF is currently deliberating about this definition and it might change soon.

Cool; if your deliberations include examples, it might be useful to include them if you end up writing an explanation somewhere.

We are now using a new definition of s-risks. I've edited this post to reflect the change.

New definition:

S-risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering.

Note that it may turn out that the amount of suffering that we can influence is dwarfed by suffering that we can’t influence. By “expectation of suffering in the future” we mean “expectation of action-relevant suffering in the future”.

Flo's summary for the Alignment Newsletter:

This agenda by the Effective Altruism Foundation focuses on risks of astronomical suffering (s-risks) posed by <@transformative AI@>(@Defining and Unpacking Transformative AI@) (TAI) and especially those related to conflicts between powerful AI agents. This is because there is a very clear path from extortion and executed threats against altruistic values to s-risks. While especially important in the context of s-risks, cooperation between AI systems is also relevant from a range of different viewpoints. The agenda covers four clusters of topics: strategy, credibility and bargaining, current AI frameworks, as well as decision theory.
The extent of cooperation failures is likely influenced by how power is distributed after the transition to TAI. At first glance, it seems like widely distributed scenarios (as in <@CAIS@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@)) are more problematic, but related literature from international relations paints a more complicated picture. The agenda seeks a better understanding of how the distribution of power affects catastrophic risk, as well as potential levers to influence this distribution. Other topics in the strategy/governance cluster include the identification and analysis of realistic scenarios for misalignment, as well as case studies on cooperation failures in humans and how they can be affected by policy.
TAI might enable unprecedented credibility, for example by being very transparent, which is crucial for both contracts and threats. The agenda aims at better models of the effects of credibility on cooperation failures. One approach to this is open-source game theory, where agents can see other agents' source codes. Promising approaches to prevent catastrophic cooperation failures include the identification of peaceful bargaining mechanisms, as well as surrogate goals. The idea of surrogate goals is for an agent to commit to act as if it had a different goal, whenever it is threatened, in order to protect its actual goal from threats.
As some aspects of contemporary AI architectures might still be present in TAI, it can be useful to study cooperation failure in current systems. One concrete approach to enabling cooperation in social dilemmas that could be tested with contemporary systems is based on bargaining over policies combined with punishments for deviations. Relatedly, it is worth investigating whether or not multi-agent training leads to human-like bargaining by default. This has implications for the suitability of behavioural vs classical game theory for studying TAI. The behavioural game theory of human-machine interactions might also be important, especially in human-in-the-loop scenarios of TAI.
The last cluster discusses the implications of bounded computation for decision theory as well as the decision theories (implicitly) used by current agent architectures. Another focus lies on acausal reasoning and in particular the possibility of acausal trade, where different correlated AIs cooperate without any causal links between them.

Flo's opinion:

I am broadly sympathetic to the focus on preventing the worst outcomes and it seems plausible that extortion could play an important role in these, even though I worry more about distributional shift plus incorrigibility. Still, I am excited about the focus on cooperation, as this seems robustly useful for a wide range of scenarios and most value systems.

My opinion:

Under a suffering-focused ethics under which s-risks far overwhelm x-risks, I think it makes sense to focus on this agenda. There don't seem to be many plausible paths to s-risks: by default, we shouldn't expect them, because it would be quite surprising for an amoral AI system to think it was particularly useful or good for humans to _suffer_, as opposed to not exist at all, and there doesn't seem to be much reason to expect an immoral AI system. Conflict and the possibility of carrying out threats are the most plausible ways by which I could see this happening, and the agenda here focuses on neglected problems in this space.
However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other safety research to be more impactful, because the failure mode of an amoral AI system that doesn't care about you seems both more likely and more amenable to technical safety approaches (to me at least).

the failure mode of an amoral AI system that doesn't care about you seems both more likely and more amenable to technical safety approaches (to me at least).

It seems to me that at least some parts of this research agenda are relevant for some special cases of "the failure mode of an amoral AI system that doesn't care about you". A lot of contemporary AIS research assumes some kind of human-in-the-loop setup (e.g. amplification/debate, recursive reward modeling) and for such setups it seems relevant to consider questions like "under what circumstances do humans interacting with an artificial agent become convinced that the agent’s commitments are credible?". Such questions seem relevant under a very wide range of moral systems (including ones that don't place much weight on s-risks).

It seems to me that at least some parts of this research agenda are relevant for some special cases of "the failure mode of an amoral AI system that doesn't care about you".

I still wouldn't recommend working on those parts, because they seem decidedly less impactful than other options. But as written it does sound like I'm claiming that the agenda is totally useless for anything besides s-risks, which I certainly don't believe. I've changed that second paragraph to:

However, under other ethical systems (under which s-risks are worse than x-risks, but do not completely dwarf x-risks), I expect other technical safety research to be more impactful, because other approaches can more directly target the failure mode of an amoral AI system that doesn't care about you, which seems both more likely and more amenable to technical safety approaches (to me at least). I could imagine work on this agenda being quite important for _strategy_ research, though I am far from an expert here.