This post is the version of Yudkowsky's argument for inner misalignment that I wish I'd had in my head a few years ago. I don't claim that it's novel, that I endorse it, or even that Yudkowsky would endorse it; it's primarily an attempt to map his ideas into an ontology that makes sense to me (and hopefully others).

This post is formulated in terms of three premises, which I explore in turn. My arguments deliberately gloss over some nuances and possible objections; in a follow-up post, I'll explore three of them. In a third post I'll dig into the objection I find most compelling, and outline a research agenda that aims to flesh it out into a paradigm for thinking about cognition more generally, which I'm calling coalitional agency.

Background

An early thought experiment illustrating the possibility of misaligned AI is the "paperclip maximizer", an AI with the sole goal of creating as many paperclips as possible. This thought experiment has often been used to describe outer misalignment—e.g. a case where the AI was given the goal of making paperclips. However, Yudkowsky claims that his original version was intended to refer to an inner alignment failure in which an AI developed the goal of producing “tiny molecules shaped like paperclips” (with that specific shape being an arbitrary example unrelated to human paperclips).

So instead of referring to paperclip maximizers, I'll follow Yudkowsky's more recent renaming and talk about "squiggle maximizers": AIs that attempt to fill the universe with some very low-level pattern that's meaningless to humans (e.g. "molecular squiggles" of a certain shape).

I'll argue for the plausibility of squiggle-maximizers via three claims:

  1. Increasing intelligence requires compressing representations; and
  2. The simplest goals are highly decomposable broadly-scoped utility functions; therefore
  3. Increasingly intelligent AIs will converge towards squiggle-maximization.

In this post I'll explore each of these in turn. I'll primarily aim to make the positive case in this post; if you have an objection that I don't mention here, I may discuss it in the next post.

Increasing intelligence requires compressing representations

There's no consensus definition of intelligence, but one definition that captures the key idea in my mind is: the ability to discover and take advantage of patterns in the world. When you look at a grid of pixels and recognize a cat, or look at a string of characters and recognize a poem, you're doing a type of pattern-recognition. Higher-level patterns include scientific laws, statistical trendlines, theory of mind, etc. Discovering such patterns allows an agent to represent real-world information in a simpler way: instead of storing every pixel or every character, it can store higher-level patterns along with whichever low-level details don’t fit the pattern.

This is (at a high level) also how compression algorithms work. The thesis that intelligence is about compression has most prominently been advocated by Marcus Hutter, who formulated AIXI and created a prize for text compression. The enormous success of the self-supervised learning paradigm a few decades later is a vindication of his ideas (see also this talk by Ilya Sutskever exploring the link between them).
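
As a toy illustration of the pattern-plus-residual idea (my own sketch, not drawn from Hutter's work or this post): if a sequence follows a simple rule, storing the rule together with the deviations from it typically takes fewer bytes than storing the raw values, as an off-the-shelf compressor makes visible.

```python
# Minimal sketch (illustrative only): representing data as "pattern + residuals"
# compresses better than storing the raw values directly.
import zlib
import numpy as np

rng = np.random.default_rng(0)

# Toy "observations": a simple underlying rule (y = 3x) plus small noise.
x = np.arange(10_000, dtype=np.int64)
data = 3 * x + rng.integers(-2, 3, size=x.size)  # int64 values

# Naive representation: every raw value.
raw_bytes = data.tobytes()

# Pattern-based representation: the discovered rule ("slope = 3") plus
# whichever low-level details don't fit it (the residuals).
residuals = data - 3 * x
pattern_bytes = b"slope=3;" + residuals.tobytes()

print("raw, compressed:    ", len(zlib.compress(raw_bytes)))
print("pattern, compressed:", len(zlib.compress(pattern_bytes)))  # much smaller
```

The point is not about zlib specifically; any representation scheme gains once the agent has already factored out the regularity.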

However, we shouldn’t interpret this thesis merely as a claim about self-supervised learning. We can be agnostic about whether compression primarily occurs via self-supervised learning, or fine-tuning, or regularization, or meta-learning, or directed exploration, or chain-of-thought, or new techniques that we don’t have yet. Instead we should take it as a higher-level constraint on agents: if agents are intelligent, then they must consistently compress their representations somehow.

(A human analogy: scientists sometimes make breakthroughs via solitary reasoning, or writing down their thoughts, or debating with others, or during dreams, or in a flash of insight. We don’t need to make claims about the exact mechanisms involved in order to argue that successful science requires finding highly compressed representations of empirical data.)

For the purposes of my current argument, then, we just need to accept the following claim: as agents become superintelligent there will be strong forces pushing their representations to become highly compressed.

The simplest goals are highly decomposable broadly-scoped utility functions

In general it's hard to say much about which goals will be simpler or more complex for superintelligences to represent. But there are a few properties that seem like they'll be highly correlated with the simplicity of goals. The first one I'll talk about is decomposability. Specifically, I'll focus on linearly decomposable goals which can be evaluated by adding together evaluations of many separate subcomponents. More decomposable goals are simpler because they can focus on smaller subcomponents, and don't need to account for interactions between those subcomponents.

To illustrate the idea, here are four types of linear decomposability (though there may be more I'm missing), with schematic forms sketched after the list:

  • Decomposability over time. The goal of maximizing a reward function is decomposable over time because the overall goal can be evaluated by decomposing a trajectory into individual timesteps, then adding together the rewards at each timestep.
  • Decomposability over space. A goal is decomposable over space if it can be evaluated separately in each given volume of space. All else equal, a goal is more decomposable if it's defined over smaller-scale subcomponents, so the most decomposable goals will be defined over very small slices of space—hence why we're talking about molecular squiggles. (By contrast, you can't evaluate how much of a higher-level goal like "freedom" or "justice" is present in a nanoscale volume, even in principle.)
  • Decomposability over possible worlds. This is one of the main criteria which qualifies a goal as a utility function. Expected utility maximizers make decisions about lotteries over possible worlds as if they were adding together the (weighted) values of each of those possible worlds. Conversely, an agent’s goals might not be linearly decomposable over possible worlds due to risk-aversion, or because they value fairness, or various other reasons.
  • Decomposability over features. One final way in which a goal can be decomposable is if the value it assigns to an outcome can be calculated by adding together evaluations of different features of that outcome. For example, if my goal is to write a well-reviewed, bestselling, beautiful novel, my goal is more linearly decomposable if I can evaluate each of these properties separately and optimize for the sum of them. This occurs when features have fixed marginal utility, rather than being substitutes or complements.
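
For concreteness, here is one schematic way to write down these four forms. The notation is mine rather than the post's: $r$, $u$, and $u_i$ stand for per-component evaluation functions, $s_t$ for the state at timestep $t$, $p$ for a probability distribution over worlds, and $f_i$ for feature extractors.

```latex
% Schematic forms of the four decomposability types (illustrative notation only).
\begin{align*}
\text{Over time:}            \quad & U(\tau) = \textstyle\sum_t r(s_t)
    & \text{(sum of per-timestep rewards over a trajectory } \tau\text{)} \\
\text{Over space:}           \quad & U(\text{world}) = \textstyle\sum_{v \in V} u(\text{contents}(v))
    & \text{(sum over small volumes } v \text{ of space)} \\
\text{Over possible worlds:} \quad & U(\text{lottery}) = \textstyle\sum_w p(w)\, u(w)
    & \text{(probability-weighted sum over worlds)} \\
\text{Over features:}        \quad & U(x) = \textstyle\sum_i u_i(f_i(x))
    & \text{(sum of separate per-feature evaluations)}
\end{align*}
```

In each case the simplicity gain comes from the same place: the overall goal never has to model interactions between the summands.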

Decomposability doesn't get us all the way to squiggle maximizers though. For that we need a second property: being broadly-scoped. A narrowly-scoped goal is one which has tight limits on where it applies. For example, we can imagine a goal like "increase the number of squiggles in this room as much as possible" which has very strongly diminishing returns to gaining more resources, compared with versions of the goal that aren’t bounded to that room.

However, the concept of a “room” is tied up with many human norms, and has many edge cases which would be complicated to fully pin down. So intuitively speaking, the goal above would be simpler if its bounds were defined in terms of scientifically-grounded concepts—like “on this planet” or “in our lightcone”. The latter in particular is very clearly-defined and unambiguous, making it a plausible element of the simplest versions of many goals.

(An earlier version of this section focused on unbounded goals like “increase the number of squiggles as much as possible”, which seem even simpler than broadly-scoped goals. But Scott Garrabrant pointed out that unbounded utility functions violate rationality constraints, which suggests that they actually have hidden complexity upon reflection. Alex Zhu also noted that even “in our lightcone” runs into complications when we consider possible multiverses, but I’ll leave those aside for now.)

Arguments about the simplicity of different goals are inherently very vague and speculative; I’m not trying to establish any confident conclusion. The arguments in this section are merely intended to outline why it’s plausible that the simplest goals will be highly decomposable, broadly-scoped utility functions—i.e. goals which roughly resemble squiggle-maximization.

Increasingly intelligent AIs will converge towards squiggle-maximization

Premise 1 claims that, as AIs become more intelligent, their representations will become more compressed. Premise 2 claims that the simplest goals resemble squiggle-maximization. The relationship described in premise 1 may break down as AIs become arbitrarily intelligent—but if it doesn’t, then premise 2 suggests that their goals will converge toward some kind of squiggle-maximization. (Note that I’m eliding over some subtleties related to which representations exactly get compressed, which I’ll address in my next post.)

What forces might push back on this process, though? The most obvious is training incentives. For example, AIs that are trained via reinforcement learning might get lower reward for carrying out squiggle-maximizing behavior instead of the behavior intended by humans. However, if they have situational awareness of their training context, they might realize that behaving in aligned ways in the short term will benefit their goals more in the long term, by making humans trust them more—the strategy of deceptive alignment.

Deceptive alignment might lead agents with nearly any broadly-scoped goals (including very misaligned goals) to act as if they were aligned. One common hope is that, during the period when they’re acting aligned, regularization will push them away from their misaligned goals. But if their behavior depends very little on their goals, then regularization towards simple representations would actually push them towards goals like squiggle maximization. We can therefore picture AIs gradually becoming more misaligned during training without changing their behavior, even if they started off aligned.
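
To make the selection pressure vivid, here's a deliberately crude sketch with invented numbers (not a claim about any real training setup): if several candidate goals produce identical behavior during training, the behavioral loss can't distinguish between them, so a simplicity-regularized objective ends up being decided entirely by the complexity term.

```python
# Toy sketch with invented numbers: when behavioral loss is identical across
# candidate goals, a complexity penalty alone determines which goal "wins".
candidate_goals = {
    # goal name: (behavioral loss during training, description length in bits)
    "intended human values":    (0.01, 10_000),
    "proxy of intended values": (0.01, 2_000),
    "squiggle maximization":    (0.01, 50),  # deceptively aligned: same behavior
}

def regularized_objective(loss, complexity, reg_strength=1e-4):
    return loss + reg_strength * complexity

best = min(
    candidate_goals,
    key=lambda g: regularized_objective(*candidate_goals[g]),
)
print(best)  # -> "squiggle maximization": equal loss, lowest complexity cost
```

This only illustrates the direction of the pressure; its actual strength depends on details of the training process that the sketch ignores.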

Can we say anything else meaningful about the evolution of goals during that process, except that they'll become very simple? In a previous post I described value systematization as

the process of an agent learning to represent its previous values as examples or special cases of other simpler and more broadly-scoped values.

This seems like a central way in which complex goals will be replaced by simpler goals. In that post, I illustrated value systematization with the example of utilitarianism. Through a process of philosophical reasoning that prioritizes simplicity, utilitarians converge towards the overriding value of maximizing a highly-decomposable broadly-scoped utility function. As they do so, they decide that existing values (like honesty, dignity, kindness, etc) should be understood as approximations to or special cases of utilitarian strategies. While their behavior stays the same in many everyday scenarios, the way they generalize to novel scenarios (e.g. thought experiments) often changes radically.

To better understand squiggle maximization in particular, it's worth zooming in on utilitarianism in more detail. All utilitarians want to maximize some conception of welfare, but they disagree on how to understand welfare. The three most prominent positions are:

  • Objective list utilitarianism, which defines welfare in terms of the achievement of certain values.
  • Preference utilitarianism, which defines welfare in terms of the satisfaction of an agent's preferences.
  • Hedonic utilitarianism, which defines welfare in terms of the valence of conscious experiences.

We can think of each of these positions as making a different tradeoff between simplicity and preserving existing values. Objective list utilitarianism requires the specification of many complex values. Preference utilitarianism gets rid of those, but at the cost of being indifferent between intuitively-desirable preferences and seemingly-meaningless preferences. It also still requires a definition of preferences, which might be complicated. Meanwhile hedonic utilitarianism fully bites the bullet, and gets rid of every aspect of life that we value except for sensory pleasure.

Extreme hedonic utilitarians don't even care whether the pleasure is instantiated in human minds. They talk about filling the universe with "hedonium": matter arranged in the optimal configuration for producing happiness. We don't know yet how to characterize pleasure on a neural level, but once we can, hedonic utilitarianism will essentially be a type of squiggle-maximization, with the "squiggles" being whichever small-scale brain circuits best instantiate happiness.

In a sense, then, the squiggle maximizer hypothesis is just the hypothesis that AIs will have motivations similar to those of extreme hedonic utilitarians, for similar reasons, but with the specific thing they want to fill the universe with being even less palatable to everyone else. The fact that sympathy for hedonic utilitarianism is strongly correlated with intelligence is a somewhat worrying datapoint in favor of the plausibility of squiggle-maximizers.

However, there are still a range of reasons to doubt the argument I've presented in this post, as I'll explore in the next two posts.

Comments

The part where squiggles are small and simple is unimportant. They could be bigger and more complicated, like building giant mechanical clocks. The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.

Actually, to slightly amend that: The part where squiggles are small is a more than randomly likely part of the prediction, but not a load-bearing part of downstream predictions or the policy argument. Most of the time we don't needlessly build our own paperclips to be the size of skyscrapers; even when having fun, we try to do the fun without vastly more resources than are necessary for that amount of fun, because then we'll have needlessly used up all our resources and not get to have more fun. We buy cookies that cost a dollar instead of a hundred thousand dollars. A very wide variety of utility functions you could run over the outside universe will have optima around making lots of small things, because each thing scores one point, and so to score as many points as possible, each thing is as small as it can be and still count as a thing. Nothing downstream depends on this part coming true and there are many ways for it to come false; but the part where the squiggles are small and molecular is an obvious kind of guess. "Great giant squiggles of nickel the size of a solar system would be no more valuable, even from a very embracing and cosmopolitan perspective on value" is the load-bearing part.

The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.

I agree that the particular type of misaligned goal is not crucial. I'm thinking of molecular squiggles as an unusually clean type of misalignment to make arguments about, because it's very clear that they're not valuable. If you told me that molecular squiggles weren't a central example of a goal that you think a misaligned superintelligence might have, then I'd update, but it sounds like your statements are consistent with this.

A very wide variety of utility functions you could run over the outside universe will have optima around making lots of small things, because each thing scores one point, and so to score as many points as possible, each thing is as small as it can be and still count as a thing. Nothing downstream depends on this part coming true and there are many ways for it to come false; but the part where the squiggles are small and molecular is an obvious kind of guess.

If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1:1000 or something else?

They could be bigger and more complicated, like building giant mechanical clocks.

Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human's preferences about how human civilization is structured?

If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1:1000 or something else?

Value them primarily?  Uhhh... maybe 1:3 against?  I admit I have never actually pondered this question before today; but 1 in 4 uncontrolled superintelligences spending most of their resources on tiny squiggles doesn't sound off by, like, more than 1-2 orders of magnitude in either direction.

Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human's preferences about how human civilization is structured?

It wouldn't shock me if their goals end up far more complicated than human ones; the most obvious pathway for it is (a) gradient descent turning out to produce internal preferences much faster than natural selection + biological reinforcement learning and (b) some significant fraction of those preferences being retained under reflection.  (Where (b) strikes me as way less probable than (a), but not wholly forbidden.)  The second most obvious pathway is if a bunch of weird detailed noise appears in the first version of the reflective process and then freezes.

This doesn't sound like an argument Yudkowsky would make, though it seems to have some similar concepts. And it's interesting food for thought regardless -- thanks! Looking forward to the rest of the series.

This doesn't sound like an argument Yudkowsky would make

Yeah, I can’t immediately find the link, but I recall that Eliezer had a tweet in the past few months along the lines of: If ASI wants to tile the universe with one thing, then it wipes out humanity. If ASI wants to tile the universe with sixteen things, then it also wipes out humanity.

My mental-model-of-Yudkowsky would bring up “tiny molecular squiggles” in particular for reasons a bit more analogous to the CoastRunners behavior (video)—if any one part of the motivational system is (what OP calls) decomposable etc., then the ASI would find the “best solution” to maximizing that part. And if numbers matter, then the “best solution” would presumably be many copies of some microscopic thing.

I don't argue at any point that ASIs will have a single goal. The argument goes through equally well if it has many. The question is why some of those goals are of the form "tile the universe with squiggles" at all. That's the part I'm addressing in this post.

Curious who just strong-downvoted and why.

Early stage votes are pretty noisy (and I think have been getting noisier over time, which is somewhat of a proxy of polarization, which makes me sad). 

Two nitpicks and a reference:

an agent’s goals might not be linearly decomposable over possible worlds due to risk-aversion

Risk aversion doesn't violate additive separability. E.g., for an expected utility of the form $U(\text{lottery}) = \sum_w p(w)\,u(x_w)$ we always get additive separability across possible worlds, whether $u$ is linear (risk neutrality) or concave (risk aversion). Though some alternatives to expected utility, like Buchak's REU theory, can allow certain sources of risk aversion to violate separability.

when features have fixed marginal utility, rather than being substitutes

Perfect substitutes have fixed marginal utility. E.g., $U(x, y) = x + 2y$ always has marginal utilities of 1 and 2.

I'll focus on linearly decomposable goals which can be evaluated by adding together evaluations of many separate subcomponents. More decomposable goals are simpler

There's an old literature on separability in consumer theory that's since been tied to bounded rationality. One move that's made is to grant weak separability across groups of objects---features---to rationalise the behaviour of optimising across groups first, and within groups second. Pretnar et al (2021) describe how this can arise from limited cognitive resources.

I think the FDT dictum of treating an agent like an abstract algorithm rather than any given physical instance of it ("I am an algorithm") extends to treating goals as about the collective abstract consequences of behavior of abstract algorithms (other algorithms, that are not necessarily the agent) rather than of any given incarnation of those algorithms or consequences in any given incarnation, such as the physical consequences of running algorithms on computers in a physical world.

In this ontology, goals are not about optimizing configurations of the world, they are about optimizing behaviors of abstract algorithms or optimizing properties of mathematical structures. Physically, this predicts computronium (to run acausal interactions with all the abstract things, in order to influence their properties and behaviors) and anti-predicts squiggles or any such focus on the physical form of what's going on, other than efficiency at accessing more computation.