Thanks for this response, I'm glad to see more public debate on this!
The part of Katja's part C that I found most compelling was the argument that it might be in a given AI system's best interest to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can take over. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that could take over the 2022 world. But this capability threshold is a moving target: humanity will get better at aligning and controlling AI systems as we gain more experience with them, and we may be able to enlist the help of AI systems to keep others in check. So why should we expect the equilibrium here to be an AI takeover, rather than AIs working for humans because it is in their selfish best interest in a market economy where humans are currently the primary property owners?
I think the crux here is whether we expect AI systems to by default collude with one another. They might -- they have a lot of things in common that humans don't, especially if they're copies of one another! But coordination in general is hard, especially if it has to be surreptitious.
As an analogy, I could argue that for much of human history soldiers were only slightly more capable than civilians. Sure, a trained soldier with a shield and sword is a fearsome opponent, but a small group of coordinated civilians could be victorious. Yet as we developed more sophisticated weapons such as guns, cannons, and missiles, the power a single soldier wields has grown greater and greater. So, by your argument, eventually a single soldier will be powerful enough to take over the world.
This isn't totally fanciful -- the Spanish conquest of the Inca Empire started with just 168 soldiers! The Spanish fought with swords, crossbows, and lances -- if the Inca Empire were still around, it seems likely that a far smaller modern military force could defeat them. Yet clearly no single soldier is in a position to take over the world, or even a small city. Military coups are the closest thing, but they involve convincing a significant fraction of the military that it is in their interest to seize power. Of course most soldiers wish to serve their nation, not seize power, which goes some way to explaining the relatively low rate of coup attempts. But it's also notable that many coup attempts fail, or at least do not lead to a stable military dictatorship, precisely because of the difficulty of internal coordination. After all, if someone intends to destroy the current power structure and violate their promises, how much can you trust that they'll really have your back if you support them?
An interesting consequence of this is that it's ambiguous whether making AI more cooperative makes the situation better or worse.
Interesting points, I agree that our response to part C doesn't address this well.
AIs colluding with each other is one mechanism for how things could go badly (and I do think that such collusion becomes pretty likely at some point, though I'm not sure it's the most important crux). But I think there are other possible reasons to worry as well. One of them is a fast takeoff scenario: with fast takeoff, the "AIs take part in human societal structures indefinitely" hope seems very unlikely to me, so 1 - p(fast takeoff) puts an upper bound on how much optimism we can derive from it. It's harder to make an airtight x-risk argument using fast takeoff, since I don't think we have airtight arguments for p(fast takeoff) being close to 1, but it's still important to consider if we're figuring out our overall best guess, rather than trying to find a reasonably compact argument for AI x-risk. (To put this differently: the strongest argument for AI x-risk will of course consider all the ways in which things could go wrong, rather than just one class of ways that happens to be easiest to argue for.)
A more robust worry (and what I'd probably rely on for a compact argument) is something like What Failure Looks Like Part 1: maybe AIs work within the system, in the sense that they don't take over the world in obvious, visible ways. They usually don't break laws in ways we'd notice, they don't kill humans, etc. On paper, humans "own" the entire economy, but in practice, they have an increasingly hard time achieving the outcomes they want (though they might not notice that, at least for a while). This seems like a mechanism for AIs to collectively "take over the world" (in the sense that humans don't actually have control of the long-run trajectory of the universe anymore), even if no individual AI can break out of the system, and even if AIs aren't great at collaborating against humanity.
Addressing a few specific points:
humanity will get better at aligning and controlling AI systems as we gain more experience with them,
True to some extent, but I'd expect AI progress to be much faster than human improvement at dealing with AI (and the latter is presumably bounded anyway). So I think the crux is the next point:
and we may be able to enlist the help of AI systems to keep others in check.
Yeah, that's an important point. I think the crux boils down to how well approaches like IDA or debate are going to work. I don't think we currently know exactly how to make them work sufficiently well for this; I have less strong views on whether they can be made to work, or how difficult that would be.
I agree that in a fast takeoff scenario there's little reason for an AI system to operate within existing societal structures, as it can outgrow them more quickly than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months, say) but quite worried that society may be slow enough to adapt that even years of gradual progress, with a clear sign that transformative AI is on the horizon, may be insufficient.
In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than the stone age where humans were more directly in control.
However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.
I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity's potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape.
E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. It seems unclear whether these things would tend to balance out, or whether e.g. everyone will inevitably be exposed to some persuasion that causes irreparable damage. Of course, things could also work out better than expected, if our ability to keep AIs in check scales better than dangerous capabilities.
This problem of human irrelevancy seems somewhat orthogonal to the alignment problem; even a maximally aligned AI will strip humans of their agency, as it knows best. Making the AI value human agency will not be enough; humans suck enough that the other objectives will override the agency penalty most of the time, especially in important matters.
I agree that aligned AI could also make humans irrelevant, but not sure how that's related to my point. Paraphrasing what I was saying: given that AI makes humans less relevant, unaligned AI would be bad even if no single AI system can take over the world. Whether or not aligned AI would also make humans irrelevant just doesn't seem important for that argument, but maybe I'm misunderstanding what you're saying.
This is a great response to a great post!
I mostly agree with the points made here, so I'll just highlight differences in my views.
Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable". I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").
A few other notable differences:
EtA: I am still more concerned about "not enough samples to learn human preferences" than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven't scrutinized it too much (but would be interested to discuss it cooperatively).
This is a crux for me, as it is why I don't think slow takeoff is good by default. I think deceptive alignment is the default state barring interpretability efforts that are strong enough to actually detect mesa-optimizers or myopia. Yes, Foom is probably not going to happen, but in my view that doesn't change much regarding risk in total.
TBC, "more concerned" doesn't mean I'm not concerned about the other ones... and I just noticed that I make this mistake all the time when reading people say they are more concerned about present-day issues than x-risk....... hmmm........
Thanks for the interesting comments!
Briefly, I think Katja's post provides good arguments for (1) "things will go fine given slow take-off", but this post interprets it as arguing for (2) "things will go fine given AI never becomes dangerously capable". I don't think the arguments here do quite enough to refute claim (1), although I'm not sure they are meant to, given the scope ("we are not discussing").
Yeah, I didn't understand Katja's post as arguing (1), otherwise we'd have said more about that. Section C contains reasons for slow take-off, but my crux is mainly how much slow takeoff really helps (most of the reasons I expect iterative design to fail for AI still apply, e.g. deception or "getting what we measure"). I didn't really see arguments in Katja's post for why slow takeoff means we're fine.
We don't necessarily need to reach some "safe and stable state". X-risk can decrease over time rapidly enough that total x-risk over the lifespan of the universe is less than 1.
Agreed, and I think this is a weakness of our post. I have a sense that most of the arguments you could make using the "existentially secure state" framing could also be made more generally, but I haven't figured out a framing I really like yet unfortunately.
EtA: I am still more concerned about "not enough samples to learn human preferences" than ELK or inner optimization type failures. This seems to be a fairly unpopular view, and I haven't scrutinized it too much (but would be interested to discuss it cooperatively).
Would be interested in discussing this more at some point. Given your comment, I'd now guess I dismissed this too quickly and there are things I haven't thought of. My spontaneous reasoning for being less concerned about this is something like "the better our models become (e.g. larger and larger pretrained models), the easier it should be to make them output things humans approve of". An important aspect is also that this is the type of problem where it's more obvious if things are going wrong (i.e. iterative design should work here---as long as we can tell the model isn't aligned yet, it seems more plausible we can avoid deploying it).
Responding in order:
1) Yeah, I wasn't saying it's what her post is about. But I think you can get to more interesting, cruxy stuff by interpreting it that way.
2) yep it's just a caveat I mentioned for completeness.
3) Your spontaneous reasoning doesn't say that we/it get(s) good enough at getting it to output things humans approve of before it kills us. Also, I think we're already at "we can't tell if the model is aligned or not", but this won't stop deployment. I don't think the default situation is one where we can tell if things are going wrong, and even if it were, people wouldn't be careful enough, so maybe it's just a difference of perspective or something... hmm.......
This is a response to the recent Counterarguments to the basic AI x-risk case ("Counterarguments post" from here on). Based on its reception, it seems that the Counterarguments post makes points that resonate with many, so we're glad that the post was written. But we also think that most of the gaps it describes in the AI x-risk case have already been addressed elsewhere or vanish when using a slightly different version of the AI x-risk argument. None of the points we make are novel; we just thought it would be useful to collect them all in one reply.
Before we begin, let us clarify what we are arguing for: we think that current alignment techniques are likely insufficient to prevent an existential catastrophe, i.e. if AI development proceeds without big advances in AI alignment, this would probably lead to an existential catastrophe eventually. In particular, for now, we are not discussing
These are all important questions, but the main thrust of the Counterarguments post seems to be "maybe this whole x-risk argument is wrong and things are actually just fine", so we are focusing on that aspect.
Another caveat: we're attempting to present a minimal high-level case for AI x-risk, not to comprehensively list all arguments. This means there are some concepts we don't discuss even though they might become crucial on further reflection. (For example, we don't talk about the concept of coherence, but this could become important when trying to argue that AI systems trained for corrigibility will not necessarily remain corrigible). Anticipating all possible counterarguments and playing out the entire debate tree isn't feasible, so we focus on those that are explicitly made in the Counterarguments post. That said, we do think the arguments we outline are broadly correct and survive under more scrutiny.
Summary
Our most important points are as follows:
Those would roughly be our one-sentence responses to sections A, B, and C of the Counterarguments post respectively. Below, we first describe some modifications we would make to the basic case for AI x-risk and then go through the points in the Counterarguments post in more detail.
Notes on the basic case for AI x-risk
To recap, the Counterarguments post gives the following case for AI x-risk:
This is roughly the case we would make as well, but there are a few modifications and clarifications we'd like to make.
First, we will focus on a purely behavioral definition of "goal-directed" throughout this post: an AI system is goal-directed if it ensures fairly reliably that some objective will be achieved. For point I. in the argument, we expect goal-directed systems in this weak sense to be built simply because humans want to use these systems to achieve various goals. One caveat is that you could imagine building a system, such as an oracle, that does not itself ensure some objective is met, but which helps humans achieve that objective (e.g. via suggesting plans that humans can execute if desired). In that case, the combined system of AI + humans is goal-directed, and our arguments are meant to apply to this combined system.
Note that we do not mention utility maximization or related concepts at all; we think these are important ideas, but it's possible to make an AI x-risk case without them, so we will avoid them to hopefully simplify the discussion. We do believe that goal-directed systems will in fact likely have explicit internal representations of their goals, but won't discuss this further for the purpose of our argument.
Second, we would frame "superhuman AI" differently. The Counterarguments post defines it as "systems that are somewhat more capable than the most capable human". This means it is unclear just how big a risk superhuman AI would be—a lot depends on the "somewhat more capable" and how exactly to define that. So we will structure the argument differently: instead of saying that we will eventually build "somewhat superhuman" AI and that such an AI will be able to disempower humanity, we will argue that by default, we will keep building more and more capable AI systems, and that at some point, they will become able to disempower humanity.
These changes have some effects on which parts of the argument bear most of the burden. Specifically, there are two natural questions given our definitions:
We will address the first question in our response to part A of the Counterarguments post. Since the second question doesn't fit in anywhere naturally, we will give an answer now.
Why will we keep building more and more capable AI systems?
This point rests on the assumption that building increasingly powerful AI systems, at least up to the level where they could disempower humanity, is technologically feasible. Our impression is that this has been discussed at length and doesn't seem to be a crux for the Counterarguments post, so we won't discuss it further. Nevertheless, this is a point where many AI x-risk skeptics will disagree.
Assuming it is feasible, the question becomes: why will there be incentives to build increasingly capable AI systems? We think there is a straightforward argument that is essentially correct: some of the things we care about are very difficult to achieve, and we will want to build AI systems that can achieve them. At some point, the objectives we want AI systems to achieve will be more difficult than disempowering humanity, which is why we will build AI systems that are sufficiently capable to be dangerous if unaligned.
Some of the objectives we will want to achieve are simply difficult in their own right, e.g. "prevent all diseases". In other cases, zero-sum games and competition can create objectives of escalating difficulty. For example, the Counterarguments post gives "making Democrats win an election" as an example of a thing people might want to do with AI systems. Maybe it turns out that AI systems can e.g. learn to place ads extremely effectively in a way that's good enough for winning elections, but not dangerous enough to lead to AI takeover. But if the other side is using such an AI system, the new objective of "win the election given that the opponent is using a powerful ad-placement AI" is more difficult than the previous one.
The Counterarguments post does make one important counterpoint, arguing that economic incentives could also push in the opposite direction:
In our framing, a similar point would be: "If trying to achieve outcomes above some difficulty threshold leads to bad outcomes in practice, then people will not push beyond that threshold until alignment has caught up".
We agree that people will most likely not build AI systems that they know will lead to bad outcomes. However, things can still go badly, essentially in worlds where iterative design fails. What failure looks like also describes two ways in which we could get x-risks without knowingly deploying AI systems leading to bad outcomes. First, we might simply not notice that we're getting bad outcomes and slowly drift into a world containing almost no value. Second, existing alignment techniques might work fine up to some level of capabilities, and then fail quite suddenly, resulting in an irreversible catastrophe.
This seems like an important potential crux: will iterative design work fine at least up to the point where we can hand off alignment research to AIs or perform some other kind of pivotal act, or will it fail before then? In the latter case, people can unintentionally cause an existential catastrophe. In our view, it is quite likely that iterative design will break too soon, but this seems like a potential crux and an interesting point to discuss further.
Responses to specific counterarguments
We hope our behavioral definition of goal-directedness provides some clarity. To summarize:
To expand on the second point, let's discuss the following counterargument from the post: AI systems could be good at achieving specific outcomes, and be economically competitive, without being strongly goal-directed in the sense of e.g. coming up with weird-to-humans ways of achieving outcomes. The post calls such AI systems "weak pseudo-agents":
The key question here is how difficult the objective O is to achieve. If O is "drive a car from point A to point B", then we agree that it is feasible to have AI systems that "strongly increase the chance of O occurring" (which is precisely what we mean by "goal-directedness") without being dangerous. But if O is something that is very difficult to achieve (i.e. all of humanity is currently unable to achieve it), then it seems that any system that does reliably achieve O has to "find new and strange routes to O" almost tautologically.
Once we build AI systems that find such new routes for achieving an objective, we're in dangerous territory, no matter whether they are explicit utility maximizers, self-modifying, etc. The dangerous part is coming up with new routes that achieve the objective, since most of these routes will contain steps that look like "acquire resources" or "manipulate humans". It should certainly be possible to achieve the desired outcome without undesired side effects, and correspondingly it should be possible to build AI systems that pursue the goal in a safe way. It's just that we currently don't know how to do this for sufficiently difficult objectives (see our response to part B for more details). There is ample empirical evidence for this point in present-day AI systems, as training them to achieve objectives routinely leads to unintended side effects. We expect that those side effects will scale in dangerousness with the degree to which achieving an objective requires strong capabilities.
As outlined above, we think that humans will want to build AI systems to achieve objectives of increasing difficulty, and that AI systems that can achieve these objectives will have existentially catastrophic side effects at some point. Moreover, we are focusing on the question of whether we will or will not reach an existentially secure state eventually, and not the particular timing. Hence, we interpret this point as asking the question "How difficult to achieve are objectives such as 'prevent anyone else from building AI that destroys the world'?".
Our impression is that this objective, and any other we currently know of that would get us to a permanently safe state, are sufficiently difficult that the instrumental convergence arguments apply, to the extent that such plans would by default be catastrophic. But we agree that this is currently based on fuzzy intuitions rather than on crisp quantitative arguments, and disagreement about this point could be an important crux.
In one sense, this is just correct: if the "true" utility function is U, and we're using a proxy U', and |U(s)−U′(s)|≤ε for all states s, then optimal behavior under U′ will be almost optimal under U (with a regret bound linear in ε). So some small differences are indeed fine.
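To spell out the bound we have in mind (a minimal sketch in our notation, stated for a single optimized state rather than a full policy): let $s^*$ maximize the proxy $U'$ and let $s^\dagger$ maximize the true utility $U$. Then

$$U(s^\dagger) - U(s^*) \;\le\; \big(U'(s^\dagger) + \varepsilon\big) - \big(U'(s^*) - \varepsilon\big) \;\le\; 2\varepsilon,$$

where the last step uses $U'(s^*) \ge U'(s^\dagger)$. So a uniformly $\varepsilon$-accurate proxy costs at most $2\varepsilon$ of true utility when optimized.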
But the more worrying types of differences look quite different: they mean that we get the utility function close to perfect on some simple cases (e.g. situations that humans can easily understand), but then get it completely wrong in some other cases. For example, the ELK report describes a situation where an AI hacks all of the sensors, leading us to believe that things are fine when they are really not. If we just take a naive RLHF approach, we'll get a reward function that might be basically correct in most "normal" situations, but incorrectly assigns high reward to this sensor hacking scenario. A reward function that's only "slightly wrong" in this sense—being very wrong in some cases—seems likely to be catastrophic.
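As a toy illustration of what "very wrong in some cases" means here (a hypothetical sketch of ours, loosely inspired by the diamond-in-a-vault setup from the ELK report; the state names and values are made up):

```python
# Hypothetical toy example: a proxy reward learned from human judgments of
# sensor readings. The labelers only ever saw the sensors, so the learned
# proxy can only depend on how things look, not on what is actually true.

# Each state records what the cameras show and what is actually the case.
states = {
    "diamond_safe":   {"sensors": "diamond present", "reality": "diamond present"},
    "diamond_stolen": {"sensors": "vault empty",     "reality": "vault empty"},
    "sensors_hacked": {"sensors": "diamond present", "reality": "vault empty"},
}

def true_utility(state):
    # What humans actually care about: is the diamond really there?
    return 1.0 if state["reality"] == "diamond present" else 0.0

def proxy_reward(state):
    # What a naive learned reward model ends up encoding: do things *look* fine?
    return 1.0 if state["sensors"] == "diamond present" else 0.0

for name, s in states.items():
    print(f"{name:14s} proxy={proxy_reward(s):.1f} true={true_utility(s):.1f}")

# The proxy matches the true utility exactly on the two "normal" states, so it
# looks fine during training. But it assigns maximal reward to the hacked-sensor
# state, where the true utility is zero -- exactly the kind of error an
# optimizer will seek out.
```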
The examples from the Counterarguments post (such as different humans having different goals) all seem to be closer to the ε-bound variety—all humans would presumably dislike the sensor hacking scenario if they knew what was going on.
(Note that all of this could also be phrased without referring to utility functions explicitly, talking instead about whether an objective is sufficiently detailed and correct to ensure safe outcomes when an AI system achieves that objective; or we could talk about ways of distinguishing good from bad action sequences, as the ELK report explicitly does. Essentially the same issue appears in different guises depending on the imagined setup for AGI.)
In part, we discussed this under the previous point: there are different senses in which discrepancies can be "small", and the dangerous one is the sense where the discrepancies are big for some situations.
But there are some more specific points in this section we want to address:
We don't think the finite number of training examples is the core issue—it is one potential problem, but arguably not the hardest one. Instead, there's one big problem that isn't mentioned here, namely that we don't have an outer-aligned reward signal. Even if we could learn perfectly what humans would say in response to any preference comparison query, that would not be safe to optimize for. (Though see A shot at the diamond-alignment problem for some thoughts on how things might turn out fine in practice, given that we likely won't get an agent that optimizes for the reward signal).
Another important reason to expect large value differences is inner alignment. The Counterarguments post only says this on the subject:
Since this doesn't really present a case against deceptively aligned mesaoptimizers, we won't say more on the subject of how likely they are to arise. We do think that inner misalignment will likely be a big problem without specific countermeasures—it is less clear just how hard those countermeasures will be to find.
These don't seem like central examples of the types of failures we are worried about. The sensor hacking example from the ELK report seems much more important: the issue is not that the AI hasn't seen enough data and thus makes some slight mistakes. Instead, the problem is just that our training signal doesn't distinguish between "the sensors have been hacked to look good to humans" vs "things are actually good". To be clear, this is just one specific thing that could go wrong, the more general version is that we are selecting outcomes based on "looking good to humans" rather than "being actually good if humans knew more and had more time to think".
Another thing to mention here is that the AI might very well know that it's not doing what humans want in some sense. But that won't matter unless we figure out how to point an AI towards its best understanding of what humans want, as opposed to some proxy that was learned using a process we specified.
Our main response is similar to the previous points: human values are not fragile to ε-perturbations, but they are fragile to the mistakes we actually expect the objective to contain, such as conflating "the actual state of the world" and "what humans think is the actual state of the world".
One specific new point the Counterarguments post makes in this section is that AI can generate very human-like faces, even though slight changes would make a face very clearly unrealistic (i.e. "faces are fragile"). There's a comment thread here that discusses whether this is a good analogy. In our view, the issue is that the face example only demonstrates that AI systems can generate samples from an existing distribution well. To achieve superhuman performance under some objective, this is not enough, since we don't have any examples of superhuman performance. The obvious way to try to get superhuman performance would be to optimize against a face discriminator, but current AI systems are not robust under such optimization (the "faciest image" according to the discriminator is not a face). It's not obvious how else the face example could be extended to superhuman performance.
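For concreteness, the kind of direct optimization we have in mind looks roughly like the sketch below (hypothetical: it assumes some differentiable `discriminator` that scores how face-like an image is, and the checkpoint name and hyperparameters are placeholders, not real artifacts):

```python
import torch

# Assumed: a differentiable model mapping an image tensor to a scalar
# "face-likeness" score, e.g. the discriminator of a trained GAN.
discriminator = torch.load("face_discriminator.pt")  # placeholder checkpoint
discriminator.eval()

# Start from random noise and optimize the pixels directly to maximize the score.
image = torch.randn(1, 3, 128, 128, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(1000):
    optimizer.zero_grad()
    score = discriminator(image).mean()  # scalar "how face-like is this?" score
    (-score).backward()                  # gradient ascent on the score
    optimizer.step()

# Empirically, this kind of direct optimization tends to produce an image the
# discriminator scores as extremely face-like but that no human would call a
# face: the model is accurate on natural images yet not robust to being
# optimized against, which is the gap between "sample from the distribution"
# and "safely optimize for an objective".
```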
The argument in this section is that if we only train near-myopic AI systems (i.e. systems with no long-term goals), then we avoid most of the danger of goal-directedness. Why might we be fine with only achieving short-term objectives? The post argues:
We think that humans do actually have goals that stretch over a hundred years, and that they would try to build AI systems that look that far ahead, but perhaps the more important crux is that goals spanning only a few years already seem more than enough to be dangerous. If an AI can take over the world within a year (which we think is an overestimate for the time required), then the most natural way of achieving goals within the next few years is very plausibly to take over the world.
For this section, we won't address each argument individually, since our framing of the basic x-risk argument is different in an important way. The Counterarguments post mainly seems to say "AI that is slightly more capable than a single human might not be that dangerous to all of humanity" and similar points. In our framework, we argue that humanity will keep building more and more powerful AI systems, such that they will be able to overpower us at some point. Framed like that, it doesn't make much sense to ask whether "a superhuman AI" would be able to take over the world. Instead, the key questions are:
We briefly discussed the first point in the "Why will we keep building more and more capable AI systems?" section. The second point could be an interesting crux—to us, it seems easier to take over the world than to ensure that no other AI system will do so in the future. That said, we are interested in clever ways of using AI systems that are weak enough to be safe in order to reach a stable state without AI x-risk.
Regarding the "Headroom" section of the Counterarguments post (which does play a role in our framework, since it says that there might not be much room for improvement over humans along important axes): it seems very clear to us that there are some economically valuable tasks with lots of headroom (which will incentivize building very powerful AI systems), and that there is a lot of headroom at "taking over the world" (more precisely, AI will be better at taking over the world than we are at defending against that). Our guess is that most people reading this will agree with those claims, so we won't go into more detail for now. Note that the x-risk argument doesn't require all or even most tasks to have a lot of headroom, so we don't find examples of tasks without much headroom very convincing.
The question in this section is: doesn't the whole AI x-risk argument prove too much? For example, why doesn't it also say that corporations pose an existential risk?
The key reason we are more worried about AI than about corporations in terms of existential risk is that corporations don't have a clear way of becoming more powerful than humanity. While they are able to achieve objectives better than individual humans, there are various reasons why collections of humans won't be able to scale their capabilities as far as AI systems. One simple point is just that corporations are made up of a relatively small number of humans, so we shouldn't expect them to become a threat to all of humanity. One way they could become dangerous is if they were extremely well coordinated, but humans in a corporation are in fact only partially value-aligned and face large costs from inadequate coordination.[1]
Summary of potential cruxes
The following are some points (a) where we guess some AI x-risk skeptics might disagree with people who are more pessimistic, and (b) that we think it would be particularly nice to have crisper arguments for and against than we currently do.
As an aside, we do in fact think that in the long term, economic incentives could lead to very bad outcomes (via Moloch/race to the bottom-style dynamics). But this seems to happen far more slowly than AI capability gains, if at all.