What I talk about when I talk about AI x-risk: 3 core claims I want machine learning researchers to address.

David Scott Krueger

Recently, at PCSOCMLx, I (co-)hosted a session with the goal of explaining, debating, and discussing what I view as "the case for AI x-risk". Specifically, my goal was/is to make the case for the "out-of-control AI killing everyone" type of AI x-risk, since many or most ML researchers already accept that there are significant risks from misuse of AI that should be addressed. I'm sharing my outline, since it might be useful to others, and in order to get feedback on it. Please tell me what you think it does right/wrong!

EDIT: I noticed I (and others I've spoken to about this) haven't been clear enough about distinguishing CLAIMS and ARGUMENTS. I'm hoping to make this clearer in the future.

Some background/context

I estimate I've spent ~100-400 hours discussing AI x-risk with machine learning researchers during the course of my MSc and PhD. My current impression is that rejection of AI x-risk by ML researchers is mostly due to a combination of:

Misunderstanding of what I view as the key claims (e.g. believing "the case for x-risk hinges on short-timelines and/or fast take-off").
Ignorance of the basis for AI x-risk arguments (e.g. no familiarity with the argument from instrumental convergence).
Different philosophical groundings (e.g. not feeling able/compelled to try and reason using probabilities and expected value; not valuing future lives very much; an unexamined apparent belief that current "real problems" should always take precedence over future "hypothetical concerns", resulting in "whataboutism").

I suspect that ignorance about the level of support for AI x-risk concerns among other researchers also plays a large role, but it's less clear... I think people don't like to be seen to be basing their opinions on other researchers'. Underlying all of this seems to be a mental move of "outright rejection" based on AI x-risk failing many powerful and useful heuristics. AI x-risk is thus commonly viewed as a Pascal's mugging: "plausible" but not plausible enough to compel any consideration or action. A common attitude is that AI take-over has a "0+epsilon" chance of occurring.

I'm hoping that being more clear and modest in the claims I/we aim to establish can help move discussions with researchers forward. I've recently been leaning heavily on the unpredictability of the future and making ~0 mention of my own estimates about the likelihood of AI x-risk, with good results.

The 3 core claims:

1) The development of advanced AI increases the risk of human extinction (by a non-trivial amount, e.g. 1%), for the following reasons:

Goodhart's law
Instrumental goals
Safety-performance trade-offs (e.g. capability control vs. motivation control)

2) To mitigate this existential risk (x-risk) we need progress in 3 areas:

Knowing how to build safe systems ("control problem")
Knowing that we know how to build safe systems ("justified confidence")
Preventing people from building unsafe systems ("global coordination")

3) Mitigating AI x-risk seems like an ethical priority because it is:

high impact
neglected
challenging but tractable

Reception:

Unfortunately, only 3 people showed up to our session (despite something like 30 expressing interest). So I didn't learn too much about the effectiveness of this presentation. My 2 main take-aways are:

Somewhat unsurprisingly, claim 1 had the least support. While I find this claim and the supporting arguments quite compelling and intuitive, there seem to be inferential gaps that I struggle to address quickly/easily. A key sticking point seems to be the lack of a highly plausible concrete scenario. I think it might also require more discussion of epistemics in order to move people from "I understand the basis for concern" to "I believe there is a non-trivial chance of an out-of-control AI killing everyone".
The phrase "ethical priority" raises alarm bells for people, and should be replaced or clarified. Once I clarified that I meant it in the same way as "combating climate change is an ethical priority", people seemed to accept it.

Some more details on the event:

The title for our session was: The case for AI as an existential risk, and a call for discussion and debate. Our blurb was: A growing number of researchers are concerned about scenarios in which machines, instead of people, control the future. What is the basis for these concerns, and are they well-founded? I believe they are, and we have an obligation as a community to address them. I can lead with a few minutes summarizing the case for that view. We can then discuss nuances, objections, and take-aways.

I also started with some basic background to make sure people understood the topic:

X-risk = risk of human extinction
The 3 kinds of risk (misuse, accident, structural)
The specific risk scenario I'm concerned with: out of control AI

Planned summary for the Alignment Newsletter:

When making the case for work on AI x-risk to other ML researchers, what should we focus on? This post suggests arguing for three core claims:

1. Due to Goodhart's law, instrumental goals, and safety-performance trade-offs, the development of advanced AI increases the risk of human extinction non-trivially.
2. To mitigate this x-risk, we need to know how to build safe systems, know that we know how to build safe systems, and prevent people from building unsafe systems.
3. So, we should mitigate AI x-risk, as it is impactful, neglected, and challenging but tractable.

Planned opinion:

This is a nice concise case to make, but I think the bulk of the work is in splitting the first claim into subclaims: this is the part that is usually a sticking point.

Huh, I wonder what you think of a different way of splitting it up. Something like:

It's a scientific possibility to have AI that's on average better than humanity at the class of tasks "choose actions that achieve a goal in the real world." Let's label this by some superlative jargon like "superintelligent AI." Such a technology would be hugely impactful.
It would be really bad if a superintelligent AI was choosing actions to achieve some goal, but this goal wasn't beneficial to humans. There are several open problems that this means we need to solve before safely turning on any such AI.
We know enough that we can do useful work on (most of) these open problems right now. Arguing for this also implies that superintelligent AI is close enough (if not in years, then in "number of paradigm shifts") that this work needs to start getting done.
We would expect a priori that work on these open problems of beneficial goal design should be under-prioritized (public goods problem, low immediate profit, not obvious you need it before you really need it). And indeed that seems to be the case (insert NIPS survey here), though there's work going on at nonprofits that have different incentives. So consider thinking about this area if you're looking for things to research.

I'm definitely interested in hearing other ways of splitting it up! This is one of the points of making this post. I'm also interested in what you think of the ways I've done the breakdown! Since you proposed an alternative, I guess you might have some thoughts on why it could be better :)

I see your points as being directed more at increasing ML researchers' respect for AI x-risk work and their likelihood of doing relevant work. Maybe that should in fact be the goal. It seems to be a more common goal.

I would describe my goal (with this post, at least, and probably with most conversations I have with ML people about x-risk) as something more like: "get them to understand the AI safety mindset, and where I'm coming from; get them to really think about the problem and engage with it". I expect a lot of people here would reason in a very narrow and myopic consequentialist way that this is not as good a goal, but I'm unconvinced.

Well, you mentioned that a lot of people were getting off the train at point 1. My comment can be thought of as giving a much more thoroughly inside-view look at point 1, and deriving other stuff as incidental consequences.

I'm mentally working with an analogy to teaching people a new contra dance (if you don't know what contra dancing is, I'm just talking about some sequence of dance moves). The teacher often has an abstract view of expression and flow that the students lack, and there's a temptation for the teacher to try to share that view with the students. But the students don't want abstractions; what they want is concrete steps to follow, and good dancers will dance the dance just fine without ever hearing about the teacher's abstract view. Before dancing they regard the abstractions as difficult to understand and distracting from the concrete instructions; they'll be much more equipped to understand and appreciate them *after* dancing the dance.

A key sticking point seems to be the lack of a highly plausible concrete scenario.

IMO coming up with highly plausible concrete scenarios should be a major priority of people working on AI safety. It seems both very useful for getting other researchers involved, and also very useful for understanding the problem and making progress.

In terms of talking to other researchers, in-person conversations like the ones you're having seem like a great way to feel things out before writing public documents.