Given an AI A, let E be the set of typical environments in which we know its behaviour - or know something about its behaviour. A sort of general training environment, if you will.

Let Ē be the set of all environments that A might be deployed on, or might generate itself.

There once was/is/will be an algorithm A. We have tested it on every environment in the set E, and it works as desired. Or maybe we just have strong arguments or intuitions that it works on all of E. In any case, its performance on E is not in serious question.

Now, we want to use it on the much larger set of environments, Ē. This includes environments very different from any in E, and may also include environments that A will create itself. How can we know that A will still perform well?

This is the standard training-testing set problem. Or the standard out-of-distribution problem. Or the model splintering issue I described in a previous post.
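
To make that gap concrete, here is a minimal sketch in Python (the model and the input ranges standing in for E and Ē are purely illustrative): a model fitted on a narrow range of inputs can look flawless there and still fail badly on a wider range.

```python
import numpy as np

rng = np.random.default_rng(0)

# Narrow "training" inputs standing in for E, wide "deployment" inputs standing in for Ē.
x_E = rng.uniform(0, 1, 500)
x_Ebar = rng.uniform(0, 10, 500)
truth = np.sin  # the behaviour we actually care about

# A cubic fit matches the truth very closely on E...
model = np.poly1d(np.polyfit(x_E, truth(x_E), deg=3))

print(np.abs(model(x_E) - truth(x_E)).max())        # tiny error on E
print(np.abs(model(x_Ebar) - truth(x_Ebar)).max())  # error blows up on Ē
```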

Or, at a more meta level, the whole of the AI safety problem itself. After all, an aligned AI will be designed and tested in a limited environment, and then allowed to influence all of human history. Successfully navigating that extension will be crucial.

Common design types to date

Various algorithm designs have been proposed to ensure success when extrapolating from E to Ē. I'll group them into five loose categories:

  1. Well-grounded expected utility maximiser.
  2. Narrow low-impact and tool AI designs.
  3. Mild optimisation AI designs.
  4. Extrapolation-focused designs.
  5. Non-optimiser partially constrained designs ("the rest").

Well-grounded expected utility maximiser

This is the simplest and in many ways clearest design. If we design an AI CEO such that its utility function increases with money, and if that concept of money is well-grounded, then we are confident the AI CEO will continue maximising money across all the potential weirdness of the future it will create.

The failure modes of this design are well-understood. If the concept of money is not well-grounded, then the AI will end up wireheading itself to some 'money variable'. If other things of value are not included in the utility function, then the world, humanity, and everything else in the universe may end up sacrificed to piles of cash[1].
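
As a toy illustration of the first failure mode (the action names and numbers here are invented for the example, not taken from any actual design): if the utility function is defined over a stored "money variable" rather than grounded in actual money, the argmax over actions happily inflates the variable instead.

```python
# Hypothetical outcomes of two actions: the actual money produced,
# and the value of the stored "money variable" the AI reads its reward from.
actions = {
    "run_company_well": {"actual_money": 1e6, "money_variable": 1e6},
    "overwrite_ledger": {"actual_money": 0.0, "money_variable": 1e12},
}

def proxy_utility(outcome):
    # Ungrounded utility: it sees only the variable, not the money.
    return outcome["money_variable"]

best = max(actions, key=lambda a: proxy_utility(actions[a]))
print(best)  # "overwrite_ledger" - the maximiser wireheads onto the proxy
```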

Narrow low-impact and tool AI designs

In this category I include my low-impact designs, most oracle AIs, tool AIs, and the AI services approach.

Here, the main focus is on constraining the set Ē: the idea is to ensure that the AI is not confronted by a situation in which it can misbehave. If the AI is the superintelligent equivalent of Google Maps, and all it ever sees are path optimisation problems, then it seems it must be safe: the insights from E can be lifted to Ē, as the two sets are substantially similar - yes, Ē might include new maps, but it's still essentially the same thing. The AI only has limited power to create new environments, and limited power in the environments we expect to see.

Critiques of these designs usually highlight ways in which Ē actually contains very dangerous environments, and thus we cannot extend safety intuitions from E.

Mild optimisation AI designs

This category includes things like quantilizers and many "mild optimisation" proposals.

The idea is that optimisation gives good results on E, but will fail disastrously on Ē, for the reasons indicated in the "expected utility maximiser" subsection. However, if we have a "base distribution" of mostly safe policies, then we can apply a mild or weak optimisation to this base distribution, and get a policy that is safe with high probability, and better at achieving goals than a random selection from the base distribution.

These methods are a hybrid of expected utility maximisation and the extrapolation-focused designs of the next subsection. They aim to get some of the benefits of optimisation, while using the properties of the base distribution to ensure that dangerous behaviour is unlikely, even in Ē.
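
A minimal quantilizer sketch, assuming we can sample candidate policies from the base distribution and score them on E; the function and parameter names (quantilize, base_policies, utility, q) are mine, not from any particular implementation.

```python
import numpy as np

def quantilize(base_policies, utility, q=0.1, rng=None):
    """Pick uniformly at random from the top q-fraction (by utility) of the base distribution.

    q = 1 recovers a purely random draw from the base distribution;
    q -> 0 approaches full argmax optimisation and loses the safety argument.
    """
    rng = rng if rng is not None else np.random.default_rng()
    scores = np.array([utility(p) for p in base_policies])
    cutoff = np.quantile(scores, 1 - q)  # utility threshold for the top q-fraction
    top = [p for p, s in zip(base_policies, scores) if s >= cutoff]
    return top[rng.integers(len(top))]
```

The safety intuition is that if only a small fraction of base-distribution policies are dangerous, drawing from the top q-fraction can boost the probability of picking a dangerous one by at most a factor of 1/q.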

Extrapolation-focused designs

This includes designs like Paul Christiano's distillation and amplification, and is what my model splintering approach is groping towards. My old idea of 'chaining God' is a (very poor) example of this approach.

Here the focus is explicitly on the extrapolation process itself, asking how we can extend desirable properties from E to Ē.

Critiques of this approach tend to argue directly that the desirable properties don't extend to Ē.

Non-optimiser partially constrained designs

This category is a grab bag of most remaining AI safety techniques - from formally specified satisficers to more informal ideas like ensuring that an AI would learn "respect" or "compassion".

I've shown that a satisficer might choose to behave as a maximiser/optimiser. The more general problem is that its behaviour is underdefined: there are many different possible behaviours compatible with being a satisficer, and we don't know what they could lead to in Ē.

Similarly, "respect", defined in , puts only very loose bounds on the algorithm's behaviour in .

The overall idea seems to be that, since these designs are not optimisers, these loose constraints should (or at the very least could) ensure good behaviour in Ē.
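
To see how loose these constraints are, here is a minimal sketch (with purely illustrative names and numbers): the satisficing condition only rules policies in or out, so everything that clears the threshold - including the policy a full maximiser would pick - counts as a valid "satisficer".

```python
import numpy as np

def satisficing_policies(policies, expected_utility, threshold):
    """Return every policy the satisficing constraint permits - nothing narrower."""
    return [p for p in policies if expected_utility(p) >= threshold]

# Toy example: policies 0..99 with utility equal to their index, threshold 50.
acceptable = satisficing_policies(range(100), expected_utility=lambda p: p, threshold=50)

# The constraint says nothing about which acceptable policy gets chosen, in E or in Ē:
maximiser_choice = max(acceptable, key=lambda p: p)  # behaving as a maximiser still satisfices
random_choice = np.random.default_rng(0).choice(acceptable)
```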

The most dangerous design is not the one that seems the most dangerous

It might seem that expected utility maximisers are the most dangerous designs - but that is an illusion, arising because we understand best how these maximisers fail. We have the clearest understanding of how maximisers extend their behaviour to Ē; that is precisely why we can visualise them failing.

The various "non-optimiser partially constrained designs" might seem safer, but that is mainly because we don't understand well at all what might happen on . Since they are so vaguely specified, it's not immediately obvious how they might go wrong - but that is very different from being confident that they won't go wrong.

Just to pick on the poor satisficer a bit: the claim "it isn't an optimiser" is standing in for the careful analysis of the base distribution that the quantilizer gets. Most satisficer designs do not include an analysis of how they might go wrong, or of how those problems can be avoided. This does not, of course, make the problems go away; instead, it wishes them away, hoping that if we don't think too deeply about them, they won't be there.

Weak optimisers can overwhelm non-optimisers

There are arguments that non-optimisers may have implicit optimisation pressures in them - maybe due to a mesa-optimiser, an internal optimisation process, evolutionary pressure, or just implicit self-modification to avoid money pumps.

Non-optimisers are extremely vulnerable to such pressures. An actual expected utility maximiser could overcome or reduce these optimisation pressures, because it has strong optimisation pressures pushing in other directions. But a satisficer could be blown away by the slightest optimisation pressure within it.

Conclusion: what an aligned AI would be

Thus, apart from very narrowly designed low-impact AIs, an aligned AI would have to be one that properly handles extrapolations to Ē - and beyond, to any possible environments outside Ē that might open up.

For the moment, the best candidate for this seems to be a well-grounded expected utility maximiser with good value extrapolation.

Other designs might be possible, but their properties will have to be established to the same degree of rigour as that design's.


  1. Notice that these two failure modes are related. Wireheading happens because we failed to distinguish the "money variable" from actual money; so actual money was sacrificed to the money variable. Burning the universe for cash happens because, though we defined money correctly, we didn't define it richly in terms of the value it brings to the world; and so that rich value was sacrificed to actual money. ↩︎
