This post is part of the work done at Conjecture.
Disagree to Agree
(Practically-A-Book Review: Yudkowsky Contra Ngo On Agents, Scott Alexander, 2022)
(Personal Communication about a conversation with Evan Hubinger, John Wentworth, 2022)
If there's one thing alignment researchers excel at, it's disagreeing with each other.
I dislike the term pre-paradigmatic, but even I must admit that it captures one obvious feature of the alignment field: the constant debates about the what, the how, and the value of different attempts. Recently, we even had a whole sequence of debates, and since I first wrote this post Nate has shared his take on why he can’t see any current work in the field actually tackling the problem. More generally, the culture of disagreement, debate, and criticism is obvious to anyone reading the AF.
Yet Scott Alexander has a point: behind all these disagreements lies so much agreement! Not only in discriminating the "any hope at all" proposals from the "no hope at all" ones, as in John's quote above; agreement also manifests itself in the common components of the different research traditions, for example in their favorite scenarios. When I look at Eliezer's FOOM, at Paul's What failure looks like, at Critch's RAAPs, and at Evan's Homogeneous takeoffs, the differences and incompatibilities jump out at me — yet they still all point in the same general direction. So much so that one can wonder if a significant part of the problem lies outside of the fine details of these debates.
In this post, I start from this hunch — deep commonalities — and craft an abstraction that highlights it: unbounded atomic[1] optimization (abbreviated UAO and pronounced wow). That is, alignment as the problem of dealing with impact on the world (optimization) that is both of unknown magnitude (unbounded) and non-interruptible (atomic). As any model, it is necessarily mistaken in some way; I nonetheless believe it to be a productive mistake, because it reveals both what we can do without the details and what these details give us when they're filled in. As such, UAO strikes me as a great tool for epistemological vigilance.
I first present UAO in more detail; then I show its use as a mental tool by giving four applications:
Thanks to TJ, Alex Flint, John Wentworth, Connor Leahy, Kyle McDonell, Laria Reynolds, Raymond Arnold, Steve Byrnes, Rohin Shah, Evan Hubinger, James Lucassen, Rob Miles, Jamie Bernardi, Lucas Teixeira, and Andrea Motti for discussions on these ideas and comments on drafts.
Pinning UAO down
Let's first define this abstraction. Unbounded Atomic Optimization, as the name subtly hints, is made of three parts:
Optimization: making the world go your way
Optimization seems to forever elude full deconfusion, but an adaptation of Alex Flint’s proposal will do here: optimization is pushing the world into a set of states.[2] Note that I’m not referring to computational optimization in the sense of a search algorithm; it is about changing the physical world.
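To make this notion concrete, here is a minimal toy sketch (the dynamics, numbers, and names below are invented for illustration; they are not part of Flint’s proposal): an optimizer repeatedly pushes a one-dimensional “world state” back toward a target set of states, despite perturbations from the rest of the world.

```python
import random

# Toy model of optimization as "pushing the world into a target set of
# states". Everything here (the scalar state, the target interval, the
# push/noise magnitudes) is an illustrative assumption.

TARGET = (9.0, 11.0)  # the target set: world states between 9 and 11

def in_target(state):
    return TARGET[0] <= state <= TARGET[1]

def optimize(state, steps=200, push=0.5, noise=0.3, seed=0):
    """Repeatedly push the state toward the target set, despite random
    perturbations from the rest of the world."""
    rng = random.Random(seed)
    for _ in range(steps):
        state += rng.uniform(-noise, noise)  # the world perturbs the state
        if state < TARGET[0]:
            state += push                    # push up toward the target
        elif state > TARGET[1]:
            state -= push                    # push down toward the target
    return state

final = optimize(state=0.0)
print(in_target(final))  # the optimizer robustly lands in the target set
```

The "amount" of optimization here roughly corresponds to how strongly and how persistently the push is applied relative to the perturbations; a weaker push or a narrower target set would require more of it.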
When I’m talking about “amount” of optimization, I’m thinking of an underdefined quantity that captures a notion of how much effort/work/force is spent in pushing the world towards the target set of states. Here’s a non-exhaustive list of factors that can increase the amount of optimization needed:
Unboundedness: phase transition in optimization
Humans optimize all the time, as do institutions, animals, economic systems, and many other parts of our world. But however impressive the optimization, it is always severely bounded. We talk about absolute power for a king or an emperor, but none of them has yet managed to avoid death or maintain their will for thousands of years (most couldn’t even get their teeth fixed better than paupers).
Classical scenarios of AI risk, on the other hand, stress the unboundedness of the optimization being done. Tiling the whole lightcone with paper clips gives a good example of massive amounts of optimization.
Another example of unbounded optimization common in alignment is manipulation: the AI optimizing for convincing the human of something. We’re decently good at manipulating each other, but there are still quite clear bounds on our ability to do so (although some critical theorists and anthropologists would argue we underapproximate the bounds in the real world). If the amount of optimization that can be poured into manipulation is arbitrarily large, though, we have no guarantee that any belief or system of beliefs is safe from that pressure.
More generally, unbounded optimization undermines solutions that are only meant to deal with some reasonable range of force/effort (like buttresses in structural engineering): no amount of buttresses is enough to keep the cathedral of our ideals from collapsing.
Atomicity: don’t stop me now
In distributed computing, an atomic operation is one that cannot be observed “in the middle” from another process — either it didn’t happen yet, or it’s already finished. Ensuring atomicity plays a crucial role in abstracting the mess of distributed interleavings, loss of messages, and other joys of the cloud.
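As a minimal illustration of this concurrency sense of the word (a toy sketch, not drawn from any particular system): the read-modify-write below is protected by a lock, so no other thread can observe the counter “in the middle” of an update — it sees the value either before the increment started or after it finished.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:                 # makes the read-modify-write atomic
            current = counter      # read
            counter = current + 1  # modify + write, never seen half-done

# Four threads race to update the shared counter.
threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # exactly 40_000: no lost updates from interleaving
```

Without the lock, the separate read and write could interleave across threads and silently lose updates; the atomic version admits no observable intermediate state.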
I use atomic analogously to mean “uninterruptible in practice”. It might be physically possible to interrupt it, but that would require enormous amounts of resources or solving hard problems like coordination.
In alignment, we’re worried about atomic optimization: the optimization of the world which we can’t interrupt or stop until it finishes.
What does this look like? FOOM works perfectly as an initial example: it instantiates atomicity through exponential growth and speed difference — you can’t stop the AI because it acts both far too smartly and far too quickly. But the whole point of using atomicity instead of FOOM is to allow other implementations. Paul Christiano (What failure looks like), Evan Hubinger (Homogeneity vs heterogeneity in AI takeoff scenarios) and Andrew Critch (What Multipolar Failure Looks Like) all propose different AI Risk scenarios with atomicity without FOOM. Instead of speed, their atomicity comes from the need to solve a global coordination problem in order to stop the optimization. And coordination is just hard.
Application 1: Highlight Convergence of AI Risk Scenarios
In almost any AI Risk story, you can replace the specific means of optimization with UAO, and the scenario still works.
For me, this highlights a crucial aspect of Alignment and AI Risk: it’s never about the specific story. I get endlessly frustrated when I see people who disagree with AI Risk not because they disagree with the actual arguments, but because they can’t imagine something like FOOM ever happening, or judge it too improbable.[3]
The problem with this take is not that FOOM is obviously what’s going to happen with overwhelming probability (I’m quite unconvinced of that), but that it doesn’t matter how UAO is implemented — as long as we have it, we’re in trouble.
And because UAO-based arguments abstract many (all?) of the concrete ones, they are at least as probable as any of them (and probably strictly more probable). Not only that, they even gain from new formulations and scenarios, as these offer additional mechanisms for implementing UAO. So having a variety of takeoff speeds, development models, and scenarios turns from a curse into a boon!
What this also entails is that to judge the probability of these risks, we need to assess how probable UAO is, in any implementation.
Convergence to UAO
To start with unboundedness, it follows straightforwardly from technological progress. Humanity is getting better and better at shaping the world according to its whims. You might answer that this leads to many unwanted consequences, but that’s kind of the point, isn’t it? At least no one can say that we don’t have a massive impact on the world!
This is also where AI gets back into the picture: ML and other forms of AI are particularly strong modern ways of applying optimization to the world.[4] And we currently have no idea where it stops. Add to that the induction from past human successes that huge gains can come from insights into how to think about a problem, and you have a recipe for massively unbounded optimization in the future.
As for atomicity, it has traditionally been instantiated through three means in AI Risk arguments:
The gist is that we’re getting better and better at optimizing, through technology in general and computers and automation in particular. This in turn leads to a more and more atomic use of optimization, due to the high speed of computers and the incentives to automate. With the compounding effect of the difficulty of coordinating, we have an arms race for building more and more atomic optimization power, leading to virtually unbounded atomic optimization.
Application 2: Explore Conditions for AI Risk
While UAO is a crucial ingredient of AI Risk, it is not enough: most scenarios need some constraints on how UAO is applied. The abstraction of UAO then lets us focus on exploring these conditions, to better understand the alignment problem. As such, UAO provides a crucial tool for epistemological vigilance about the assumptions underlying our risk scenarios.
Let’s look at two classes of proxies for an example: overapproximation proxies and utility maximization proxies. These two capture many of the concrete proxies that are used in AI Risk scenarios, and illustrate well how UAO can clarify where to investigate.
The Danger of Overapproximations
Overapproximation proxies point to quite reasonable and non-world-shattering results, like “Make me rich”.
Here are their defining properties:
Let’s look at what happens when we apply UAO to such proxies. Our proxy gives us a fixed, overapproximated target set of states. Let’s say something like “produce 20 billion paperclips in the United States per year” (about twice the current amount). You don’t need to tile the universe to reach that target, so it’s relatively easy to end up in the set of states we’re aiming for. But what about reliability and robustness, the other two requirements of the proxy? Well, if you want to guarantee that you’ll reach the target set and never leave it, one way to do so is to aim for the part of this target set that is most controlled and guaranteed.[7] Like, for example, the one where the Earth is restructured for better paperclip-making conditions (without those bothersome humans, for example!). As the optimization increases, it is increasingly spent on reliability and robustness, which strongly incentivizes using the many degrees of freedom to guarantee the result and its permanence. Hello instrumental convergence!
The story is thus: unbounded atomic optimization + overapproximate proxies => incentive for numerous degrees of freedom to be used in systematically bad ways.
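This incentive can be sketched in a few lines of toy code (all the states, production numbers, and robustness figures below are invented for illustration): among states that all satisfy the overapproximated proxy, spending ever more optimization on reliability selects the most controlled, most world-restructuring one.

```python
# Each candidate state satisfies the proxy "produce >= 20e9 paperclips/year",
# but with different margins and different control over future disruptions.
# The descriptions and numbers are made up for illustration.
states = [
    # (description, paperclips/year, prob. of staying in target long-term)
    ("run existing factories harder",          2.1e10, 0.70),
    ("automate the supply chain end to end",   5.0e10, 0.90),
    ("restructure Earth for paperclip-making", 9.9e12, 0.999),
]

TARGET = 2.0e10

def satisfies_proxy(state):
    return state[1] >= TARGET

# All three states hit the target set...
assert all(satisfies_proxy(s) for s in states)

# ...but an optimizer that keeps spending effort on *guaranteeing* the
# result picks the state that maximizes robustness, i.e. the extreme one.
most_robust = max(states, key=lambda s: s[2])
print(most_robust[0])  # -> "restructure Earth for paperclip-making"
```

The proxy itself never asked for the extreme state; the pressure toward it comes entirely from the reliability and robustness requirements.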
Note that if we want to avoid this fate, our abstract conditions give us multiple points of intervention:
Terrible Returns on Utility
Utility maximization proxies are specified by the maximal states according to some utility function. It should come as no surprise to readers of this post that maximizing utility can lead to terrible outcomes — the question is: what is needed for that to happen?
This part shows more how UAO can lead to asking relevant questions. My current best guess is that we also need two conditions on the proxy:
With these two conditions, it follows that UAO will push us into the terrible upper set, and lead to catastrophic AI Risk.
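The shape of that argument can be checked on a toy example (the state space, utilities, and “terrible” labels below are all invented for illustration): if every state with near-maximal utility sits in the catastrophic set, then unbounded maximization necessarily lands there.

```python
# Toy state space: name -> (utility, terrible_for_humans)
states = {
    "some paperclips, humans fine":       (1.0,   False),
    "many paperclips, humans fine":       (5.0,   False),
    "lightcone tiled with paperclips":    (100.0, True),
    "lightcone tiled, AI self-protected": (100.5, True),
}

# The upper set of near-maximal states (here: within 1 utility of the max).
max_u = max(u for u, _ in states.values())
upper_set = {name for name, (u, _) in states.items() if u >= max_u - 1.0}

# Condition: the upper set is entirely terrible...
assert all(states[name][1] for name in upper_set)

# ...so an unbounded utility maximizer necessarily picks a terrible state.
best = max(states, key=lambda name: states[name][0])
print(states[best][1])  # -> True: the argmax lies in the terrible set
```

Bounded optimization could have stopped at “many paperclips, humans fine”; it is the unbounded, uninterrupted push toward the maximum that makes the upper set’s character decisive.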
The interesting bit here lies in analyzing these conditions for actual utility functions, like “maximizing paperclips”. And just like with the overapproximation proxies, multiple points of intervention emerge from this analysis:
Application 3: Anchor Operationalization Pluralism
In my last post, I discussed different levels at which pluralism might be applied and justified. The one that UAO is relevant to, in my opinion, is operationalization pluralism: pursuing multiple operationalizations (frames/perspectives/ways of filling in the details) for the same problem.
The tricky part in operationalization pluralism is to capture the problem abstractly enough to allow multiple operationalizations, without losing the important aspects of the problem.
UAO provides one candidate abstraction for the alignment problem.
In some sense, UAO acts as a fountain of knowledge: it rederives known operationalizations when you fill in the implementation details or make additional assumptions. As such, it serves both as a concrete map and as a tool to explore the untapped operationalizations. We can pick unused assumptions, and generate the corresponding operationalization of the alignment problem.
Three concrete ways of generating operationalizations are
Let’s look at examples of all three in alignment research.
Filling in the blanks: neural nets, brain-like algorithms and seed AI
The obvious way of operationalizing UAO is to make it concrete. This is exactly what Prosaic Alignment, Steve Byrnes’ Brain-like AGI Alignment and some of MIRI’s early work on seed AIs do.
These assumptions were historically made from a normative perspective: each researcher believed that this kind of AI was either the most probable, or had a significant enough probability to warrant study and investigation.[9]
But here we’re starting from UAO instead. By making these additional assumptions, each operationalization unlocks new ways of framing and exploring the problem. As an analogy, in programming language theory, the more generic a type, the less you can do with it; the more specific it becomes, the more methods and functions can be used on it. So if we assume that UAO will be instantiated as neural networks trained by gradient descent, we have more handles for exploring the general problem and investigating mechanisms. A perfect example is the small research tradition around gradient hacking, which looks for very concrete neural network implementations of a certain type of treacherous turn incentivized by instrumental convergence.
Yet there are also risks involved in such an instantiation. First, if the instance is a far simpler case than the ones we will have to deal with, this is an argument against the relevance of solving that instance. And more insidiously, what can look like an instantiation might just pose a completely different problem. That’s one failure mode when people try to anchor alignment in ML and end up solving purely bounded optimization problems without any theory of change about the influence on unbounded atomic optimization.[10]
Working directly on the abstraction
Another category of operationalizations stays at the abstract level, and focuses instead on one possible intervention on UAO as the royal road to alignment. A lot of the work published on the AF fits this category, including almost all deconfusion.[11] Among others, there are:
The tricky part is that so much of the work at this level looks like fundamental science: it’s about exploring the problem almost as a natural object, in the way computer scientists would study a complexity class and its complete problems. In the best cases, this level of abstraction can yield its secrets to simple and powerful ideas, like “high-level summary statistics at a distance” or “counting options through permutations”. But even then, drawing conclusions for the solution of the problem is hard, and requires epistemological vigilance.
That being said, such work still plays a crucial role in alignment research, and we definitely need more of it. Even when working from within an instantiation like prosaic alignment, it’s often fruitful to move between this level and the more concrete. I conjecture that it comes both from the purity of the models used (which leads to focus on nice math) and from removing the details that obscure or hide the core of unbounded atomic optimization.
Privileging particular tools
The last category in my non-exhaustive list consists of those operationalizations which start from their methods and the veins of evidence where they go searching for hidden bits.
Here the risk is picking an irrelevant field, or one with only superficial links to alignment. I think it’s possible to analyze the expected productivity of an analogy, for example based on the successes in that field. Also relevant, if the field in question doesn’t have many successes, is whether the analogy reduces alignment to a currently really hard problem (like P vs NP), or to some simpler problem that these other fields have a reasonable chance of tackling.
My attitude to this category of operationalization is that we should look for even more opportunities and bring as many analogies as we can, as long as we expect them to become productive for alignment. The PIBBSS Fellowship is pushing in that direction, and I expect a clearer framing of the constraints to help.
Application 4: Separate AI Alignment From Other Forms of Alignment
As a final application of UAO, let’s separate alignment of AIs from other forms of alignment.
Here I want to turn to Alex Flint’s nice analysis of Alignment vs AI Alignment,[12] where he attempts to separate aligning AI from alignment of other systems like oneself or society. Concretely, his non-AI examples are:
Alex then asks what separates aligning an AI from all these examples.
My answer: the combination of unboundedness and atomicity in the optimization. In all these examples, unbounded optimization applied atomically is irrelevant. In principle each example can be optimized somewhat unboundedly, but it happens so slowly that we can iterate — an assumption requiring epistemological vigilance in alignment.
Or said differently, it’s unbounded optimization but applied little by little, with time to change course in between. Just like cathedral builders could see cracks and failure happening over the course of decades and correct them.
Note that this doesn’t mean these fields can’t help with alignment. Just that alignment is qualitatively different from the phenomena traditionally tackled by economics, behavior change, and these other fields. This difference must be kept in mind when building a theory of change for applying insights from these other disciplines.
UAO, a Productive Mistake
We’ve seen that unbounded atomic optimization serves in multiple applications:
This makes me think that UAO is a productive mistake.
How is it a mistake? That is, what does it hide away or distort? Mostly, it assumes the hardness of the problem. Some people believe that alignment is significantly easier than dealing with UAO — maybe the increases in optimization between iterations of AIs will be slow enough to adapt and break atomicity, for example. I’m personally dubious of such simplifications, as they look more like wishful thinking than arguments to me. But UAO is definitely colored by my takes, and my general stance towards epistemological vigilance.
Still, UAO can act as a characterization of the hard alignment problem that is more conducive to debates about the difficulty of alignment and the assumptions we can get away with.
Here the word "atomic" refers to the etymological meaning "indivisible", rather than the common usage "small".
This setting can deal with utility functions by focusing on the sets with maximal utility (which exists because there are finitely many states).
How do I know that they might agree with the actual argument? Because most often, when I then present them a more structural implementation of UAO like Critch’s RAAPs, they end up agreeing with the risks!
Here again, it’s important to note that I’m using optimization in the “physically changing the world” sense, not in the computational “internal search” sense. So what AI gives us here is the ability to “internally search” for better ways of acting in the world, and this whole process fits under what I call optimization.
This is where the atomicity comes from in fast takeoffs and FOOM-like scenarios.
Exploring these structural factors is the big contribution of Critch’s RAAPs in my opinion.
This is but another way of framing Bostrom’s insightful point about how even a wireheading AI would have reasons to tile the universe to protect itself and its wireheading.
It’s important to note that this subclass of alignment is comparatively far larger (at least in terms of active research) than the other two, and has additional specializations (for example whether the NN will be trained by RL or self-supervised learning).
Critch feels like a strong exception, because I interpret his introduction of RAAPs as an attempt to add structural perspective to alignment to round off the field. And although Paul believes in the normative claim that the first AGI will probably be prosaic, he does argue that even if that's not the case, we should expect a solution to prosaic alignment to translate to the other version and capture some hard parts of the problem. And when I asked him the question, he told me that what mattered was to make the problem well-defined.
See this post for an exploration of the common assumptions that need to be questioned in alignment to not fall into this trap.
Some exceptions are Evan Hubinger et al.’s inner optimization and Paul Christiano’s universality, which are tailored for prosaic alignment. Yet they end up being useful for other approaches too.
Discussions with Alex while he was writing that post ultimately led me to realizing the need for the atomicity condition, so he gets the credit for that!