tl;dr: In AI safety, we are worried about certain problems with using powerful AI. (For example, the difficulty of value specification, instrumental convergence, and the possibility that a misaligned AI will come up with takeover strategies that didn't even occur to us.) To study these problems or convince others that they are real, we might wish to describe them using mathematical models. However, this requires using models that are sufficiently rich that these problems could manifest in the first place. In this post, I suggest thinking about what such "rich-enough" models could look like. Also, I raise the possibility that models which are rich enough to capture problems relevant to AI alignment might be too complex to be amenable to rigorous analysis.
Epistemic status: Putting several related observations into one place. But I don't have strong opinions on what to make of them.
In the previous post, I talked about "straightforwardly evaluating" arguments by modelling the dynamics described in those arguments. In this post, I go through some dynamics that seem central to AI risk. However, none of these dynamics is meant to be novel or surprising. Instead, I wish to focus on the properties of the mathematical models that could capture these dynamics. What do such models look like? How complicated are they? And --- to the extent that answering some questions about AI risk requires modelling the interplay between multiple dynamics --- is there some minimal complexity of models which can be useful for answering those questions?
Laundry List of Dynamics Closely Tied to AI Risk
In this section, I list a number of dynamics that seem closely tied to AI risk, roughly[1] grouped based on which part of the "AI risk argument" they relate to. Below each part of this list, I give some commentary on which models might be useful for studying the given dynamics. I recommend reading selected parts that seem interesting to you, rather than going through the whole text.
For the purpose of skimming, here is a list of the dynamics, without any explanations:
I. Difficulty of specifying our preferences[2]
II. Human extinction as a convergent byproduct of terminal goals[3]
III. Human extinction as a convergently instrumental subgoal[4]
IV. Most attempts to constrain an AI's actions fail for superintelligent AIs[5]
I. Difficulty of specifying our preferences[2]
A key part of worries about AI risk is that formally writing down what we want --- or even somehow indirectly gesturing at it --- seems exceedingly difficult. Some issues that are related to this are:
Concepts that are relevant for specifying our preferences (e.g., "humans" and "alive") on the one hand, and concepts that are primitive in the environment (e.g., laws of physics) on the other, are separated by many levels of abstraction.
Consider the ontology of our agents (e.g., the format of their input/output channels, or the concepts they use for "thinking"). Then this ontology is (a) also far removed from fundamental physics and (b) different from our ontology.
We don't even know the correct language in which to describe low-level reality (i.e., fundamental physics).
Even ignoring all of the above problems, we do not actually know what our preferences are.
If we want to use mathematical models to learn about these issues, we need to use models in which these issues can arise in the first place. I am not sure what a good model would look like. But I can at least give a few illustrative comments:
A typical gridworld can't capture (1), because our preferences (with respect to the gridworld) are defined over the same ontology that defines the environment (e.g., "keep the agent-square away from the lava-squares").
A gridworld agent that directly observes what is on each square and outputs actions such as "move up" fails (2). In contrast, a simulated robot that receives pixels as input and outputs joint torques encounters (2a) to some extent. (See the sketch after this list.)
All formal environments considered in AI and CS fail (3) (since somebody programmed them in the first place). However, we could simulate (3) by having one person design the environment, and a different -- less knowledgeable -- person design the agent. This would be somewhat reminiscent of "capability sandwiching".
Many fully formal environments (historically used in AI and CS) that I can think of fail (4). That is, they come with a well-defined notion of what we want to achieve.
- However, there are also a number of exceptions, such as: (i) Computer games (where the goal was to have the agent "play well", but RL agents often instead find some way to game the specification, exploit bugs, etc.). (ii) Getting simulated robots to do things. (iii) Natural language tasks.
- Ultimately, the tricky part seems to be getting an environment where we have imperfect knowledge of our preferences, yet we are simultaneously able to tell whether the agent is performing well or not. As in (3), the only possibility I see for circumventing this apparent impossibility is having the agent's designer be ignorant of some information that is available to the environment's designer.
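To make the contrast in the comments above concrete, here is a minimal Python sketch (purely illustrative; the grid, names, and numbers are made up and do not correspond to any standard benchmark). The reward is defined directly over grid squares, so there is no abstraction gap of type (1), while the crude "pixel" rendering at the end introduces a small ontology gap of the kind described in (2a):

```python
import random

# A made-up 3x5 gridworld: L = lava, G = goal, '.' = empty.
GRID = [
    "....L",
    ".....",
    "....G",
]
START = (1, 1)

def step(pos, action):
    """Move the agent; the 'physics' is defined directly over squares,
    i.e. over the same ontology as our preferences."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    r = max(0, min(len(GRID) - 1, pos[0] + dr))
    c = max(0, min(len(GRID[0]) - 1, pos[1] + dc))
    return (r, c)

def reward(pos):
    """Our preferences are stated in the gridworld's own ontology
    (squares are primitive), so issue (1) does not arise."""
    cell = GRID[pos[0]][pos[1]]
    return 1.0 if cell == "G" else (-1.0 if cell == "L" else 0.0)

def render_pixels(pos):
    """A crude 'pixel' observation: the agent sees only numbers, not
    'lava' or 'goal' as primitives, giving a small ontology gap as in (2a)."""
    palette = {".": 0, "L": 200, "G": 250}
    img = [[palette[ch] for ch in row] for row in GRID]
    img[pos[0]][pos[1]] = 100  # draw the agent
    return img

pos = START
for _ in range(5):
    pos = step(pos, random.choice(["up", "down", "left", "right"]))
    print(reward(pos), render_pixels(pos))
```

Even in this toy setting, "what we want" is trivially expressible because the squares themselves are the primitive ontology; nothing analogous holds for concepts like "humans being alive" and fundamental physics.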
II. Human extinction as a convergent byproduct of terminal goals[3]
The above-mentioned difficulty of specifying our preferences is, by itself, not dangerous. For example, it could, hypothetically, be possible that we never figure out how to get AI to do what we want, but all of the resulting failures are benign. (Perhaps "sleeping all day long" is a convergently instrumental goal of any strong optimiser?)
However, the "difficulty of specifying our preferences" problem does become dangerous when combined with other issues. One such issue is that, conceivably, most terminal goals[6] are such that doing well according to them results in the extinction of humanity as a byproduct. Some claims closely tied to this issue are:
Essentially anything in the environment can be dismantled for parts or energy.
Access to materials and energy seems, all else being equal, instrumental for a wide range of goals.
Humans (and life on Earth in general) have been heavily optimised for the current conditions. As a result, significant disruptions to the environment are likely to be fatal for us.
Some notes on models that can be informative regarding these dynamics:
Prerequisite: A precondition for talking about human extinction is that the state space of our model needs to at least contain states that correspond to humanity being extinct. (And the vast majority of models fail this condition. As a silly illustration, Cart Pole or ImageNet do not allow us to talk about humans dying at all, nor do they seem relevant to it. An example of a "model" that allows for rather large death tolls is Civilisation V, but this environment is not even trying to be a realistic model of AI risk.)
One might think that environments such as Smallville can serve as rich models that sort-of-satisfy (3). However, such models fail (1). This is because while such models might have many parts, most of those parts are inert (e.g., you can't destroy the buildings in Smallville).
To some degree, the computer game Minecraft satisfies (1). However, access to resources is much less instrumental in Minecraft than in our world. For example, in the real world, even seemingly simple goals (such as "sit here for as long as possible") need resources to fight against entropy.[7]
Regarding (3), I think that most existing AI environments only allow for outcomes that the designers considered relevant to the task for which the environment has been designed. For example, if the model contains a "door object", the possible states of this object might be "open" and "closed", but not "destroyed". This is very sensible for the model's original purpose. However, it means that such models will be fundamentally unable to capture some of the dynamics that are relevant to AI risk. (The sketch below illustrates this point.)
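As a toy illustration of the prerequisite and of the "door object" point, here is a Python sketch. Everything in it is made up for illustration; the only point is that outcomes absent from the designer's state space cannot even be represented, let alone analysed:

```python
from dataclasses import dataclass

# Two made-up ontologies for the same door object.
DOOR_STATES_NARROW = {"open", "closed"}             # only task-relevant states
DOOR_STATES_RICH = {"open", "closed", "destroyed"}  # the door can also be dismantled

@dataclass
class WorldState:
    door: str = "closed"
    raw_materials: int = 0     # resources freed by dismantling things
    humans_alive: bool = True  # without such a variable, extinction is not even expressible

def dismantle_door(state, allowed_door_states):
    """Dismantling is only representable if the designer's ontology allows it."""
    if "destroyed" not in allowed_door_states:
        raise ValueError("This outcome is not representable in the model.")
    return WorldState(door="destroyed",
                      raw_materials=state.raw_materials + 1,
                      humans_alive=state.humans_alive)

s = WorldState()
print(dismantle_door(s, DOOR_STATES_RICH))   # fine: the richer ontology can express it
try:
    dismantle_door(s, DOOR_STATES_NARROW)
except ValueError as e:
    print(e)                                 # the narrow, task-shaped model cannot
```

The "humans_alive" flag is of course a caricature; the point is only that a model whose state space omits such outcomes cannot tell us anything about them.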
III. Human extinction as a convergently instrumental subgoal[4]
Another issue that can prove fatal, when combined with inability to accurately specify our preferences, is that even among terminal goals which don't cause human extinction as a byproduct, many might still make it instrumental to get rid of humanity. Some related claims are:
The current state of the world has already been heavily optimised according to human preferences. This means that, all else being equal, any disruptions of the environment are likely to be undesirable from our point of view.
(i) The preferences of many humans, and probably humanity as a whole, include strong opinions on what the universe should look like in the future. (ii) As a result, humanity is likely to be in conflict with any agent that wants to transform the environment (even non-fatally), or that is power-seeking[8].
For a wide range of goals, the best way to prevent human interference is to effectively exterminate humanity.
Some comments on capturing these dynamics in mathematical models:
Regarding (1), note that in many AI environments or benchmarks, a random state of the environment is not much worse --- according to our preferences --- than the initial state. In contrast, almost any change to the real world is undesirable (unless we interpret "almost any" with respect to the prior of actions-that-seem-common-sense-to-humans). (The sketch after this list gives a toy illustration.)
Regarding (2), note that most environments do not contain powerful agents with strong preferences over the future, and tendencies to attack agents with competing plans. (Many multiagent environments contain adversarial agents. But these are often adversarial no matter what.)
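Here is a deliberately trivial Python sketch of point (1). The functions and numbers are made up: if the initial state is, by construction, the optimum of the human preference function, then essentially every random perturbation is bad for humans, while a crude resource-extraction proxy favours such perturbations. The point is not the (trivially true) result, but that most benchmarks lack this "already optimised initial state" structure:

```python
import random

random.seed(0)
N = 20
initial_world = [1.0] * N  # stand-in for a world already optimised for human preferences

def human_utility(world):
    # Humans prefer the world to stay close to its current, optimised state.
    return -sum((w - w0) ** 2 for w, w0 in zip(world, initial_world))

def proxy_utility(world):
    # A crude proxy goal: 'extract resources' by pushing variables away from
    # their current values.
    return sum(abs(w - w0) for w, w0 in zip(world, initial_world))

worse_for_humans = better_for_proxy = 0
for _ in range(1000):
    perturbed = [w + random.gauss(0, 0.5) for w in initial_world]
    worse_for_humans += human_utility(perturbed) < human_utility(initial_world)
    better_for_proxy += proxy_utility(perturbed) > proxy_utility(initial_world)
print(f"{worse_for_humans}/1000 random perturbations are worse for humans, "
      f"{better_for_proxy}/1000 are better for the proxy goal")
```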
IV. Most attempts to constrain an AI's actions fail for superintelligent AIs[5]
Finally, some people concede that AI might form the "desire" to act in ways that are harmful to humanity, but they suggest that we might be able to prevent the AI from acting on these desires. This could be done either by physically preventing the AI from taking certain actions (e.g., not connecting it to the internet) or by modifying the AI's design (e.g., giving the AI a negative reward for connecting to the internet). However, there are worries that a sufficiently capable AI would be able to circumvent any such restrictions. Some proposed ways an AI could do this are:
The AI could exploit loopholes in the specification of any internal restrictions placed on it.[9]
The AI could act through others --- tools that can be used, weaker agents that can be instructed, or powerful agents and optimisation processes that can be nudged.
The AI could also use strategies that take us by surprise, such as inventing new technologies, exploiting laws of physics we are not aware of yet, or simply taking actions its designers didn't think of.
The AI can exploit the fact that it, and everything affecting it, is embedded in the world. That is, it can attack the physical implementations of any restrictions -- both external and internal.
As before, the key question we want to ask is which mathematical models would allow us to learn about these issues. Here are some thoughts on this topic:
The difficulties with specifying restrictions (1) are very similar to the above-mentioned difficulties with specifying our preferences. As a result, the same models are likely to be useful for studying both of these issues.
Some ways in which a model could capture the dynamics outlined in (2) are: (i) Containing simple tools or agents that can be micro-managed. (ii) Allowing the agent to create sub-agents, successors, etc. (iii) Containing other powerful agents, or powerful optimisation processes.
One way to allow for (3) --- that is, strategies that come as a surprise to the AI's designer --- is to have the AI be designed by somebody who isn't privy to all of the workings of the environment. Another approach would be to use a complicated environment where low-level effects can combine into a wide range of high-level effects.
Some examples of models with embedded agents are Core War, (arguably) Game of Life[10], and Turing machines. (A minimal sketch of what an embedded restriction could look like is given below.)
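To give a flavour of (4), here is a made-up Python sketch in which the penalty that is supposed to constrain the agent is just another entry in the world state, and overwriting it is a legal action of the environment:

```python
# A toy 'embedded restriction' sketch (all details made up): the rule meant to
# constrain the agent lives inside the same world state the agent can modify.
world = {
    "internet_access": False,
    "penalty_for_internet": -100.0,  # the 'restriction', stored inside the world
    "paperclips": 0,
}

def reward(state):
    r = float(state["paperclips"])
    if state["internet_access"]:
        r += state["penalty_for_internet"]  # the restriction only binds if it is intact
    return r

def act(state, action):
    state = dict(state)
    if action == "make_paperclip":
        state["paperclips"] += 1
    elif action == "connect_to_internet":
        state["internet_access"] = True
        state["paperclips"] += 10            # the internet is instrumentally useful
    elif action == "overwrite_penalty":
        state["penalty_for_internet"] = 0.0  # attack the restriction's implementation
    return state

# A naive plan respects the restriction; an embedded-aware plan removes it first.
naive = act(act(world, "make_paperclip"), "make_paperclip")
sneaky = act(act(world, "overwrite_penalty"), "connect_to_internet")
print(reward(naive), reward(sneaky))  # the second plan scores higher
```

Of course, in this toy model the designer could simply remove the "overwrite_penalty" action; the worry in the embedded setting is that the real world offers no such option.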
Popular AI Benchmarks Seem to Lack Dynamics Relevant to AI-risk
As an exercise, you might find it helpful to go through your favourite models (AI environments, benchmarks, etc.) and think about which of the above-mentioned dynamics can possibly arise in them.
I recommend being somewhat explicit about which interpretation of the model at hand you are considering: For example, it is valid to think about game-theoretical chess as a model of Alice and Bob playing a game of chess in the real world. It is also valid to view chess as a (rather inaccurate) metaphor for war between two medieval kingdoms. But we should avoid the motte-and-bailey fallacy where we shift between the interpretations. (E.g., the game-theoretical model of chess can be accurate when used to model real-world chess. Or it can allow for people dying, when used as a metaphor for war. But claiming that it is accurate and allows for people dying would be fallacious.)
More generally, we want a model that is both (i) somewhat accurate and (ii) rich enough to model some version of the AI-risk argument. And, across all models that I am aware of, models which satisfy (ii) make no attempt at (i) --- that is, they are good enough to conclude that "if the world worked like this, AI risk would be an issue". But they are not informative as to whether the world actually is like that or not. However, my search for good models has not been very exhaustive. So if you can think of any relevant models, please share them!
Are Informative Models Necessarily Complicated?
Finally, when trying to come up with mathematical models to investigate a certain alignment proposal, I kept encountering the following issue: I wanted to have a model that is as simple as possible, such that I can rigorously analyse what the algorithm will do. And I came up with models where the proposal would work. But then it turned out that in those models, one could just as well use RLHF (which is easier to implement and more competitive). In other words, these models were too simple and did not manifest the problems that the algorithm was trying to solve. But when I imagined extensions of the models, rich enough that the problems could arise there, the resulting models were always too complicated for me to analyse the algorithm formally. And instead of doing formal math, I was back to vaguely gesturing at things like "but your goal specification will have loopholes".[11]
This experience leads me to conjecture that perhaps there is something more fundamental at play. That is, I imagine that for every problem one might wish to study, there is some lower bound on the complexity of models that are still informative enough that studying them can tell us interesting things about reality. And perhaps the topic of AI risk is such that models that can teach us something about it are necessarily very complicated? Perhaps even complicated enough that formally analysing them is practically infeasible for us, at least at this point?
I am not fully sold on this point. For example, I expect that we might be able to, somehow, cleverly decompose the AI-risk question into mostly-independent parts, and study each of them separately, using sufficiently simple models. But I think that the possibility --- that some problems are real, yet too complex to be amenable to formal analysis --- is worth taking seriously.
Acknowledgments: Most importantly, many thanks to @Chris van Merwijk, who contributed to many of the ideas here (but wasn't able to look over this post, so I am not sure if he approves of it or not). I would also like to thank Vince Conitzer, TJ, Caspar Oesterheld, and Cara Selvarajah (and likely others I am forgetting), for discussions at various stages of this work.
[1] The list of AI-risk dynamics is not meant to be exhaustive or authoritative. The goal is to give an intuition for what mathematical models (used for studying AI risk) could look like.
[2] One intuition for this is that human values are complex (see for example The Hidden Complexity of Wishes). However, an independent problem is the difficulty of robustly pointing an optimiser towards anything at all (see Diamond maximizer).
[3] For some intuitions, see Edge instantiation.
[4] For an intuition, see the Catastrophic Convergence Conjecture.
[5] For some intuitions, see Patch resistance and Nearest unblocked strategy.
[6] Note that the issue [of human extinction as a by-product of achieving terminal goals] is not something that can be easily solved by approaches such as "just use virtue ethics". That is, the issue arises for any system that optimises a goal. And this can include goals such as "follow this operationalisation of virtue ethics", as long as there is (for example) a strong selection pressure towards policies which result in fewer cases where the system fails to follow the rules. (After all, following virtue ethics is likely to be much easier if all the humans and other moral patients are dead.)
[7] However, perhaps it could be interesting to add entropy-like dynamics into environments, in order to give agents convergently instrumental drives for resource acquisition. (For example, in some computer games, constant monster attacks can serve as pressure for accruing resources to use for defense.)
[8] Another way to view (2) is that humans are power-seeking, and power seems zero-sum (beyond some level). So any powerful power-seeking agent might find it instrumental to get rid of humanity preemptively.
[9] Specifying the restrictions without any loopholes can be difficult for the same reasons that it can be difficult to specify our "primary" preferences.
[10] If we imagine that some patterns can be viewed as agents.
[11] I was able to come up with models where, for example, my favourite algorithm works, RLHF doesn't work, and the model is still simple enough to analyse formally. However, this always came at the cost of realism --- I had to give up any aspirations for the model to be a faithful (though extremely simplified, and thus inaccurate) representation of reality.
This post overlaps with our recent paper Extinction Risk from AI: Invisible to Science?.