Updated Deference is not a strong argument against the utility uncertainty approach to alignment

Ivan Vendrov

Thesis: The problem of fully updated deference is not a strong argument against the viability of the assistance games / utility uncertainty approach to AI (outer) alignment.

Background: A proposed high-level approach to AI alignment is to have the AI maintain a probability distribution over possible human utility functions instead of optimizing for any particular fixed utility function. Variants of this approach were advocated by Stuart Russell in Human Compatible and by Hadfield-Menell et al. in the CIRL paper. Adding utility uncertainty intuitively seems to provide a number of safety benefits relative to having a fixed objective, including:

Utility uncertainty gives the AI an incentive to adjust in response to a human operator's corrective actions.
Utility uncertainty weakens the AI's incentive to harm its human operators, since this might result in a permanent loss of utility-relevant information.
Utility uncertainty incentivizes the AI to avoid irreversible changes to the state of the world, since those might lead to permanently low utility.

Despite the high profile and intuitive appeal of utility uncertainty, almost none of the alignment researchers I know consider it a promising approach to AI alignment. The most common reason cited seems to be the problem of fully updated deference (e.g. Richard Ngo's alignment research exercises point to this as the reason why CIRL doesn't solve the alignment problem).

In this post I will argue that fully updated deference should not be seen as a strong argument against utility uncertainty as an approach to AI alignment. This is not meant as an argument in favor of the uncertainty approach; it may have other irresolvable difficulties, which I discuss briefly in the conclusion.

Outline: The Arbital post that seems to be the canonical reference for updated deference contains many heuristic arguments and one concrete, worked-out example in the section Moral uncertainty and its relation to corrigibility. I will mostly engage with the example, and argue that

It conflates the problem of updated deference with the independent problem of prior mis-specification.
If we remove prior mis-specification, there is no problem in the limit of increasing AI capability.

The Problem of Updated Deference

The example in the post has an AI that is uncertain between three utility functions $U_{1}$, $U_{2}$, $U_{3}$, whereas the human's true utility function is $V$. The AI believes that the utility that will be attained in each of the three possible worlds is

$u_{i}$ with AI assistance
$v_{i}$ if the human optimizes $V$ without the AI's assistance (e.g. because the humans shut the AI down)

If the AI is much more powerful than humans, the argument goes, then $u_{i} \gg v_{i}$ in any of the three worlds, so the AI will not let itself be shut down. The uncertainty doesn't help because the AI can choose to keep gathering information until it has fully updated. Since it's more powerful than the humans, it can gather that information more efficiently when it's not shut down, and therefore ignores the shutdown signal.
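The decision the argument attributes to the AI can be sketched as a simple expected-utility comparison. This is only an illustration under stated assumptions: the function names and all numbers below are hypothetical and do not come from the Arbital example. The only thing taken from the text is the condition that $u_{i}$ exceeds $v_{i}$ by a wide margin in every candidate world.

```python
from typing import Sequence


def expected_utility(posterior: Sequence[float], utilities: Sequence[float]) -> float:
    """Expected utility under the AI's posterior over candidate utility functions."""
    return sum(p * u for p, u in zip(posterior, utilities))


def accepts_shutdown(
    posterior: Sequence[float],
    u_with_ai: Sequence[float],
    v_without_ai: Sequence[float],
) -> bool:
    """The AI defers only if it expects humans to do at least as well without it."""
    return expected_utility(posterior, v_without_ai) >= expected_utility(posterior, u_with_ai)


# Hypothetical numbers: u_i is much larger than v_i in every candidate world.
u_with_ai = [100.0, 90.0, 80.0]
v_without_ai = [10.0, 9.0, 8.0]

# Before updating: the AI is uncertain between U_1, U_2, U_3.
uncertain_posterior = [0.5, 0.3, 0.2]

# After "fully updating": all mass on a single candidate.
fully_updated_posteriors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

resists_while_uncertain = not accepts_shutdown(uncertain_posterior, u_with_ai, v_without_ai)
resists_after_updating = all(
    not accepts_shutdown(p, u_with_ai, v_without_ai) for p in fully_updated_posteriors
)
```

Because $u_{i} > v_{i}$ holds in every world, no posterior over the three candidates makes shutdown look attractive. That is the sense in which the uncertainty "doesn't help" in the original example.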

Factoring out prior mis-specification

The original example has the AI system assign probability 0 to the true human utility function $V$, presumably because its prior probability was 0. I think any advocate of the utility uncertainty approach would agree that assigning a nonzero prior to the true human utility function is critical for the approach to work. Describing such a prior abstractly is easy (just take the Solomonoff prior over programs). Implementing a real CIRL agent that reasons with such a prior could be intractably difficult, but this is clearly a separate problem from "fully updated deference". So from now on we'll assume that the human $V$ is one of $U_{1}$, $U_{2}$, $U_{3}$, and see if we still have a problem.
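To see why a zero prior is a separate problem from deference, here is a minimal sketch of repeated Bayesian updating over four hypotheses ($U_{1}$, $U_{2}$, $U_{3}$ and $V$). The likelihood values and the helper names are assumptions chosen for illustration. The point is only that a hypothesis with prior probability 0 stays at 0 however strongly the evidence favors it.

```python
from typing import List, Sequence


def bayes_update(prior: Sequence[float], likelihoods: Sequence[float]) -> List[float]:
    """One Bayesian update: posterior is proportional to prior times likelihood."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("evidence has zero probability under every hypothesis")
    return [x / total for x in unnormalized]


def update_many(prior: Sequence[float], likelihoods: Sequence[float], steps: int) -> List[float]:
    """Apply the same evidence repeatedly, renormalizing after each step."""
    posterior = list(prior)
    for _ in range(steps):
        posterior = bayes_update(posterior, likelihoods)
    return posterior


# Hypotheses in order: U_1, U_2, U_3, V (the true human utility function).
# Hypothetical likelihoods: each observation is far more probable under V.
likelihoods = [0.1, 0.1, 0.1, 0.9]

misspecified_prior = [1 / 3, 1 / 3, 1 / 3, 0.0]  # V is ruled out a priori
inclusive_prior = [0.3, 0.3, 0.3, 0.1]           # V gets some prior mass

posterior_misspecified = update_many(misspecified_prior, likelihoods, steps=20)
posterior_inclusive = update_many(inclusive_prior, likelihoods, steps=20)
```

With the misspecified prior, the posterior on $V$ remains exactly 0 after any number of observations. With the inclusive prior, the same evidence concentrates almost all of the mass on $V$.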

Dissolving Updated Deference

The AI refuses to be shut down because it believes that, in any of the three possible worlds, it can achieve higher utility than the humans would if they shut it down. The key question to consider is: is the AI right that shutdown is bad for the true human utility function?

1. Insofar as "shut down is bad" is a mistaken belief, we expect the problem of updated deference to dissolve as AI capabilities grow, since more capable AIs will make fewer mistakes. Note that in the original example, the plausibility of the AI's belief relies on the AI system being better at optimizing than unassisted humans, but "unassisted humans" is not likely to be the real-world counterfactual. If the humans were able to deploy an AI system this powerful, they could also deploy another AI system equally powerful and (plausibly) more aligned. In other words, $u_{i} \leq v_{i}$ with very high probability, contrary to assumption. So the AI will shut down unless it expects the humans to do something irreversibly bad after shutting it down, which brings us to:
2. Insofar as "shut down is bad" is a correct belief, there is no problem - even a fully aligned superintelligence should be expected to resist shutdown if it believed this would lead to a permanent and irreversible loss of utility to humans. This could happen e.g. if the AI was confident that the humans would deploy a catastrophically unaligned AI on the next iteration, or if it believed humans would permanently curtail their technological potential. In other words, in this very unusual scenario where humans are about to make a catastrophic mistake, hard corrigibility and alignment are at odds. I don't think this scenario will happen, but if it does I think it's clear we should choose alignment over corrigibility.
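One way to make the first case explicit, as a sketch under assumptions that are not part of the original example: suppose that after shutdown the humans deploy a replacement AI with probability $q$ and otherwise proceed unassisted. The symbols $q$, $r_{i}$ (utility attained with the replacement AI), $h_{i}$ (utility attained by unassisted humans) and $p_{i}$ (the AI's posterior on world $i$) are introduced here purely for illustration.

```latex
v_{i} = q \, r_{i} + (1 - q) \, h_{i},
\qquad
\text{accept shutdown} \iff \sum_{i} p_{i} v_{i} \;\ge\; \sum_{i} p_{i} u_{i}.
```

The original example implicitly sets $q = 0$, so $v_{i} = h_{i} \ll u_{i}$ and the AI resists. If instead $q$ is close to 1 and the replacement is at least as good as the current system ($r_{i} \ge u_{i}$), then $v_{i}$ approaches $r_{i} \ge u_{i}$ and the inequality favors accepting shutdown.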

A counter-argument to 1 would be that it is very possible for an AI system to be extremely capable but still have mistaken beliefs. This could be because there is an error in its source code, but this objection applies to almost any alignment approach. A more serious objection to the utility uncertainty agenda is that truth-seeking is anti-competitive and we will by default select models more for their ability to take impactful actions than for their ability to have true beliefs about the world. In fact, the core argument of Human Compatible is that we should work on differentially improving our models' ability to reason about uncertainty relative to their ability to optimize over actions. It may be that this is a good strategy in theory but too hard in practice (it imposes too much of an alignment tax), but that argument should be made explicitly and it has little to do with updated deference.

Conclusion

So what does this tell us about whether utility uncertainty is a promising approach to AI alignment? Not much. I do think the "problem of updated deference" is better understood as a combination of prior mis-specification and competitiveness penalties from maintaining well-calibrated true beliefs. But I basically agree with Rohin that utility uncertainty smuggles all the complexity of alignment into

1. creating a "reasonable" prior over reward functions
2. creating a "reasonable" human model
3. finding a computationally tractable way to do Bayesian inference with (1) and (2)

and it's not obvious whether this is actual conceptual progress, or a wrong-way reduction of a hard problem into an impossible problem.

I think the key point is that utility uncertainty does not in itself produce corrigibility after updating on all of the evidence. So you still need to write down a value learning procedure which produces the right answer in the limit of infinite data. Many people working on outer alignment think that's a very difficult step, and are excited about something like corrigibility because it could provide an "out" that saves you from needing to solve that problem; they view fully-updated-deference as an argument that utility uncertainty can't provide such an "out." One way of putting this is that these researchers are mostly resigned to prior mis-specification and/or the unidentifiability of human values given the available data, such that they are unhappy "factoring out" that problem (e.g. see Jacob Steinhardt and Owain Evans's post on misspecification in value learning).

Here is a related post I wrote, the easy goal inference problem is still hard, trying to argue for what I view as the "hard core" of a value-learning-based approach. I consider fully-updated-deference a good argument that in the limit, utility uncertainty is not a way of dodging the basic difficulties with such a value-learning-based approach. Some other papers (especially out of CHAI) try to directly engage with realistic models of human errors in a way that could yield a solution of the easy goal inference problem, though I'm currently not persuaded that any of these would meaningfully address the main difficulties in outer alignment. (For example, I think it's instructive imagining them as potential solutions to ELK.)

One could still be optimistic about utility uncertainty if you either thought "the limit is far away" or were optimistic about confronting the other difficulties with value learning. This is obviously especially appealing if you are legitimately worried about failures caused by the AI's lack of understanding of what humans want. (I'm less excited about that because I think failure modes like "the AI murders everyone" are very unlikely to emerge from realistic uncertainty about what humans want, since this is a pretty obvious fact about human preferences.)

From a discussion with Stuart Russell, my understanding is that his belief is that the easy goal inference problem may be hard "in the limit," but that it may be possible for cognitive science to "keep up" with AI progress so that we always have a good enough solution to value learning that we'd be happy with AIs optimizing our current best guess about human values, as defined by the best prior we can currently write down. I think this is mostly plausible if you imagine our AI alignment approaches as a stopgap for a brief period, or if you imagine highly-automated AI cognitive science.

I discuss some other tangentially relevant issues in IRL and VOI, and in particular I contrast "corrigibility as preference" from "corrigibility as emergent phenomenon under reward uncertainty" as approaches to a basic problem for current RF optimization. This is closely related to ambitious vs narrow value learning.

Overall I think it's plausible that narrow value learning works well enough for learning corrigibility, such that fully-updated deference wouldn't be a big problem / you wouldn't need any clever approach to corrigibility. But even in that case, I'm not convinced that reward uncertainty is addressing the major problems and I think the important problems are being addressed in other parts of the design.

It's also worth briefly mentioning that CIRL doesn't necessarily have to proceed through an explicit reward uncertainty approach, and so an Eliezer or Richard-like objection to CIRL itself might be more like "this is a problem restatement; it may be reasonable as a way of communicating to AI researchers what the problem is in a way that doesn't talk about robots killing you, but it's not an approach to that problem and so should be compared to other problem statements rather than other approaches." (That said I'm not sure if this is actually the view of Eliezer and Richard, and my guess would be that they just don't have a good understanding of how e.g. Dylan Hadfield-Menell thinks about what CIRL is.)

Note that this comment is referencing my own writing mostly because it's primarily an expression of my own views, rather than claiming that I'm the first or most important person to make any of these points, etc.

If you can come up with a prior that can learn human preferences, why put that prior into a superintelligent agent instead of first updating it to match human preferences? It seems like the latter could be safer as one could then investigate the learned preferences directly, and as one then doesn't have to deal with it making mistakes before it has learned much.
