I think we can improve the setup by conditioning only on things the AI has no control over.
Imagine that every turn, there is a random bit B sent along a wire. It's almost certainly 1, but there's a tiny chance that it's 0.
If the button is pressed and B=0, then the agent's utility doesn't update. But if the button is pressed and B=1, the utility is updated as usual.
Except the expression is conditioned not on "press", but on "B=0".
Now we just have to define B in such a way that the AI can't affect it - we need a defined source of true randomness.
Even better: B has already been calculated by some defined past process, the AI just doesn't know what it is yet.
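A minimal sketch of the proposed setup, with all names and the probability `EPS` chosen purely for illustration:

```python
import random

# Assumed tiny probability that B = 0.
EPS = 1e-6

def draw_b(rng: random.Random) -> int:
    """Stand-in for the defined source of true randomness; in the proposal,
    B would already have been fixed by some past process the agent simply
    hasn't observed yet."""
    return 0 if rng.random() < EPS else 1

def next_utility(current_u, shutdown_u, button_pressed: bool, b: int):
    # Button pressed and B = 1: the utility updates as usual.
    # Button pressed and B = 0: the utility does not update; this rare
    # branch is what the expression conditions on instead of "press".
    if button_pressed and b == 1:
        return shutdown_u
    return current_u
```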
In our paper on corrigibility, we consider the question of how to make a highly intelligent agent that would pursue some goal, but not resist being shut down if its human programmers determined that they had made a mistake in specifying this goal. We assume that we are given two utility functions: a function UN, which specifies the agent's normal goal (which it pursues until the shutdown button is pressed), and a function US, which specifies the goal of shutting down. We then ask whether there is some sort of combined utility function U such that an agent maximizing U would act as if maximizing UN unless and until the shutdown button is pressed, and act as if maximizing US after this point.
In the paper, we investigate a particular linear way to combine UN and US: we define U to compute utility according to UN if the shutdown button is not pressed, and according to US plus a certain additive term if the shutdown button is pressed. We show that this way of combining UN and US in some sense satisfies the informal desideratum above, but has some pretty undesirable consequences.
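The linear combination can be sketched as follows; the function names and the exact shape of the additive correction are assumptions here (the paper pins them down precisely):

```python
def combined_utility(a1, o, a2, u_n, u_s, pressed, correction):
    """Sketch of U from the paper: UN on trajectories where the shutdown
    button is not pressed, US plus an additive term where it is."""
    if pressed(o):
        return u_s(a1, o, a2) + correction(a1)
    return u_n(a1, o, a2)
```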
Max Tegmark recently asked me whether it would be possible to show a more general impossibility result: that not only this particular way of combining UN and US leads to bad consequences, but that every U which makes an agent maximize UN until the shutdown button is pressed, and US after, will have these consequences---including, for example, non-linear combinations of UN and US.
As it happens, the answer to this question turns out to be either trivial or underspecified, though it may be interesting to see why this is the case. The reason is that the answer depends on how, exactly, we formalize the demand about maximizing UN before the button is pressed and US after. The U we investigate in the paper suggests one possible way of formalizing this demand (I'll give more detail below). This formalization, however, nails down exactly how our agent behaves in every possible situation. Although there is more than one utility function U which produces this behavior, all of them lead the agent to choose exactly the same actions, producing the same undesirable consequences as the U in the paper.
Alternatively, we may try to find a different formalization of the demand, which doesn't imply the same bad consequences; but then, instead of a math problem that we can answer by a proof, we're faced with the more philosophical question of what formalizations would be acceptable.
In the rest of this post, I go through all of this in more detail, although you might want to stop reading here unless you're interested in these details.
In the paper, we assume that the agent takes an action a1, then makes an observation which tells it whether the user has pressed the shutdown button, and takes another action a2. Let's write E[U;a1] for the expected utility if the agent takes action a1; write ¬Pr for the event that the shutdown button isn't pressed; and write E[UN∣¬Pr;a1] for the expected utility, conditional on the button not being pressed, if the agent takes action a1. (These definitions are a little vague; for exact technical details, see the paper.) We show that when choosing the first action, a1, an agent maximizing the expectation of U (as defined in the paper) will behave as if it's maximizing the conditional expectation E[UN∣¬Pr;a1] of UN.
We then consider an agent whose UN tells it to make as much money as possible, and assume that as its first action a1, this agent is able to (i) play the lottery and (ii) construct a robot arm that will press the shutdown button if the agent doesn't win the jackpot. Then the result above implies that the agent thinks it's an awesome idea to build that robot arm! This is because when we condition on ¬Pr in the conditional expectation E[UN∣¬Pr;a1], we rule out all the cases where the shutdown button is pressed, which includes all the cases when the agent loses the lottery; in other words, the agent acts as if it now believes it's going to win the lottery with probability 1!
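To see the distortion numerically, here is a toy calculation; the jackpot, ticket price, and win probability are made-up numbers:

```python
# Made-up lottery numbers for illustration.
P_WIN = 1e-7
JACKPOT = 1_000_000.0
TICKET = 1.0

# Unconditional expected money from buying a ticket: clearly negative.
ev_unconditional = P_WIN * JACKPOT - TICKET

# With the robot arm pressing the button on every loss, conditioning on
# "button not pressed" discards every losing branch, so the conditional
# expectation treats the win as certain.
ev_given_not_pressed = JACKPOT - TICKET
```

Under the conditioning, building the robot arm turns a bad bet into an apparently certain jackpot, which is exactly the pathology described above.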
The above is a result of using a utility function U(a1,o,a2) which equals UN(a1,o,a2) when o indicates that the shutdown button hasn't been pressed, and equals US(a1,o,a2) plus an additive term when o indicates that the shutdown button has been pressed. However, there's more than one utility function of these three arguments that leads to the same behavior by the agent.
If we want to describe only the behavior, we can instead talk about policies, which specify how the agent will act in every possible situation it might find itself in; in our set-up, a policy would be a pair (a1,π2) of an action a1 and a function π2(o) from observations to actions a2; here, a1 specifies what action the agent takes before it makes its observation, and π2(o) specifies what action it takes after it makes observation o.
To write down the policy produced by the utility function U from our paper, it's useful to first define a function A2(a1,o), which, for every action a1 and observation o, specifies which action a2 maximizes utility given that the agent took action a1 earlier and made observation o. (In our toy model, only the observations are probabilistic, so once we know o, we do not need to take expectations any longer.) It's pretty clear what this function should be, since at this point we know for sure whether the shutdown button has been pressed. Writing Press for the set of observations indicating that the button has been pressed, we set

A2(a1,o) = argmax_{a2} UN(a1,o,a2)  when o ∉ Press,
A2(a1,o) = argmax_{a2} US(a1,o,a2)  when o ∈ Press.
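Over a finite action set, A2 can be written directly as a plain argmax; the helper and its argument names below are illustrative:

```python
def best_a2(a1, o, actions2, press, u_n, u_s):
    """A2(a1, o): maximize US if the button has been pressed, UN otherwise.
    After observing o there is no remaining uncertainty, so this is an
    argmax rather than an expectation."""
    u = u_s if o in press else u_n
    return max(actions2, key=lambda a2: u(a1, o, a2))
```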
Now, the policy produced by maximizing the expectation of U turns out to be the pair (a1,π2) given by

a1 = argmax_{a1} E[UN ∣ ¬Pr; a1],
π2(o) = A2(a1,o).
For any given choice of a1, it seems to me that π2(o)=A2(a1,o) is the only reasonable choice, given our informal demand that the agent behave like a US-maximizer if the shutdown button has been pressed, and like a UN-maximizer otherwise. This leaves the question of how to formalize the idea of behaving like a UN-maximizer when choosing a1, i.e., before learning whether the shutdown button will be pressed. The above choice of a1 is one option, but if you're taking that option, you reproduce exactly the behavior of maximizing the expectation of U---including the undesirable behavior where it acts as if it can win the lottery by pressing the shutdown button when it loses.
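The first-stage choice, taking a1 to maximize E[UN ∣ ¬Pr; a1], can be sketched like so; the discrete probability model `p_obs` and all names are assumptions for illustration:

```python
def choose_a1(actions1, observations, p_obs, press, u_n, best_a2):
    """Pick a1 maximizing E[UN | button not pressed; a1], with a2 filled
    in by the second-stage argmax A2. This reproduces the behavior of
    maximizing the paper's U, lottery pathology included."""
    def cond_exp_un(a1):
        # Restrict to non-press observations and renormalize.
        no_press = [o for o in observations if o not in press]
        mass = sum(p_obs(a1, o) for o in no_press)
        return sum(p_obs(a1, o) * u_n(a1, o, best_a2(a1, o))
                   for o in no_press) / mass
    return max(actions1, key=cond_exp_un)
```

On a toy model where one action builds the robot arm and plays the lottery, this rule prefers the arm: conditioning away the press-observations conditions away every losing ticket.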
So in order to interpret Max's question as a math question, which we can settle by proving a theorem, we would have to ask: Is there a utility function U′ which satisfies our intuitive desideratum in the same formal sense that U does, but doesn't produce bad behavior? The answer is, of course, no, since by assumption, maximizing U′ would produce exactly the same behavior, specified by the above policy, as U, and would therefore have the same unintended consequences.
Alternatively, we might ask whether there's a better way to choose a1 that's still in accord with our intuitive desideratum of maximizing UN before the shutdown button is pressed, and US after. For example, this post discusses a variant of the system in the corrigibility paper, originally proposed by Stuart Armstrong, and shows that that version avoids the problem with playing the lottery, though it still has undesirable consequences. However, the discussion then becomes about what are reasonable formalizations of our intuitive demands, which is no longer something we can settle by proving a theorem.