Understanding the outer alignment problem

What really is outer alignment? In “Risks from Learned Optimization,” we defined outer alignment in the context of machine learning as “aligning the specified loss function with the intended goal.” But that's not a perfectly well-defined statement—what does it mean for a loss function to be “aligned” with the intended goal? If the goal we care about is maximizing $U$, do we need exactly $L = a - bU$ for constants $a, b$? That's a pretty high bar.

Well, what exactly do we want outer alignment for? At the end of the day, we care about whether the model that pops out the other end of our training procedure will be safe, which is a complicated question involving the loss function, the architecture, the implicit inductive biases, and so on. In what sense, then, is it even reasonable to look at just the loss function in isolation and ask whether it's aligned or not?

I think the strongest case for outer alignment being a meaningful problem in isolation comes from the argument that loss functions seem to scale pretty well with generic machine learning progress. If, as a silly example, your outer alignment scheme is to “train image classification models,” that's something that ML has progressively gotten better at over time. Compare that to the silly inner alignment scheme of “train a straightforward CNN”—that's something that ML has passed by pretty rapidly in favor of architectural improvements like residual connections even just for the task of image classification. Of course, outer alignment alone does not an aligned AGI make, so you still have to have some notion of how you're going to do inner alignment in mind—but loss functions scaling better is still a perfectly valid reason for focusing on outer alignment.[1]

Thus, it does seem quite reasonable to me to put effort into finding “aligned” loss functions. But that still brings us back to the question of what exactly makes a loss function “aligned.” In the context of a specific training/inner alignment scheme, we can say that a loss function is aligned if, when plugged into that training scheme, it produces models which are aligned with our goals. But in the absence of any specific training scheme, what does it mean to say that a loss function is aligned in isolation? We can of course ask for $L = a - bU$ as I stated previously, though I think achieving something like that is likely to be nearly impossible.

Outer alignment at optimum

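I think there is another version of “outer aligned in isolation,” however, which is both meaningful and (at least somewhat) achievable, and which I will call outer aligned at optimum. Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want. More precisely, let $\mathcal{M} = X \to A$ be the set of all models, viewed as functions from inputs $X$ to outputs $A$, and let $\mathcal{L} = \mathcal{M} \to \mathbb{R}$ be the set of loss functions. For a given loss function $L \in \mathcal{L}$, let $\ell_* = \min_{M \in \mathcal{M}} L(M)$. Then, $L$ is outer aligned at optimum if, for all $M_* \in \mathcal{M}$ such that $L(M_*) = \ell_*$, $M_*$ is trying to do what we want.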
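To make the definition concrete, here is a minimal sketch that checks it by brute force on a tiny, finite model class. Everything in it is an illustrative assumption rather than part of the definition: the input and output sets, the two toy loss functions, and especially `is_aligned`, which is a crude behavioral stand-in for "trying to do what we want" (a notion that is not really reducible to input/output behavior).

```python
from itertools import product


def all_models(inputs, outputs):
    """Enumerate every function from inputs to outputs, each as a dict."""
    for choice in product(outputs, repeat=len(inputs)):
        yield dict(zip(inputs, choice))


def outer_aligned_at_optimum(loss, is_aligned, inputs, outputs):
    """Return True iff every loss-minimizing model satisfies is_aligned."""
    models = list(all_models(inputs, outputs))
    best = min(loss(m) for m in models)                # the optimal loss value
    optimal = [m for m in models if loss(m) == best]   # all optimal models
    return all(is_aligned(m) for m in optimal)


# Hypothetical toy setting: two questions, two possible behaviors.
INPUTS = ["q1", "q2"]
OUTPUTS = ["help", "harm"]


def is_aligned(model):
    # Stand-in predicate: the model never picks the harmful output.
    return all(answer == "help" for answer in model.values())


def strict_loss(model):
    # Penalizes a harmful answer on every input.
    return sum(answer == "harm" for answer in model.values())


def leaky_loss(model):
    # Only inspects q1, so behavior on q2 is unconstrained at optimum.
    return int(model["q1"] == "harm")


print(outer_aligned_at_optimum(strict_loss, is_aligned, INPUTS, OUTPUTS))  # True
print(outer_aligned_at_optimum(leaky_loss, is_aligned, INPUTS, OUTPUTS))   # False
```

The second loss fails the check even though one of its optimal models is aligned: the definition quantifies over all optimal models, so a single unaligned model at the optimum is enough to break it.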

That's the definition—now why should we care? In basically any practical setting we're never going to reach perfect loss, so why should it matter if those functions which do have perfect loss are aligned or not? I think there is a strong argument for loss functions which are aligned at optimum being significantly less susceptible to Goodhart's Law as we scale up ML capabilities. Suppose you know that a loss function $L$ is aligned for current ML capabilities. When you then scale up those capabilities and push harder on minimizing $L$, you immediately run into all the issues of Goodhart's Law where $L$ can quickly cease to be a good proxy for alignment as you push harder on it. If you have a guarantee that $L$ is aligned at optimum, however, then, while still quite possible, it's a lot harder for Goodhart's Law to bite you. In particular, if you think about the Goodhart taxonomy, alignment at optimum almost entirely rules out both Causal and Extremal Goodhart—since you know the relationship is valid and doesn't break down at the extremes—and ensures that Regressional and Adversarial Goodhart won't show up in the limit, though you could still see them before that point. Though this obviously doesn't just give you an alignment guarantee—before you get to the true optimum, you can still get Regressional Goodhart biting you through proxy alignment or Adversarial Goodhart biting you through deceptive alignment, for example—I think it is nevertheless still a very nice thing to have.

The case for imitative amplification

With all of that being said, I can get to the reason that I want to talk about all of this: I think that specifically what I will call imitative amplification—in contrast to other amplification-based approaches or debate-based approaches—has a strong claim to being outer aligned at optimum.[2] Specifically, when I say imitative amplification, I mean the class of training procedures which are attempting to produce models which approximate HCH as closely as possible. As a concrete example, consider the scheme where you train a model to minimize the difference between its output and the output of a human consulting that model. I want to contrast this with approval-based amplification, by which I mean the class of training procedures where the loss is generated using an approval signal from an amplified overseer. As a concrete example, consider the scheme where you train a model to maximize the extent to which a human consulting that model would approve of that model's output.[3]
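The two concrete schemes can be contrasted with a toy sketch. All of the names below (`human_with_model`, `gameable_approval`, the list-summing task) are hypothetical stand-ins chosen for illustration; in particular, the "human" here is a few lines of code and the approval signal is deliberately flawed, so this shows only the shape of the two losses, not how either would behave in practice.

```python
def human_with_model(model, numbers):
    """Toy 'human consulting the model': split the task, delegate, combine."""
    if len(numbers) <= 1:
        return sum(numbers)  # small enough for the human to answer directly
    mid = len(numbers) // 2
    return model(numbers[:mid]) + model(numbers[mid:])


def imitative_loss(model, numbers):
    """Squared distance between the model's answer and the amplified human's."""
    return (model(numbers) - human_with_model(model, numbers)) ** 2


def approval_loss(model, numbers, approve):
    """Negative approval that the amplified human assigns to the model's answer."""
    own_answer = human_with_model(model, numbers)
    return -approve(own_answer, model(numbers))


def gameable_approval(own_answer, model_answer):
    # Hypothetical flawed rater: impressed by big numbers, ignores own answer.
    return model_answer


def honest(numbers):
    return sum(numbers)


def inflated(numbers):
    return sum(numbers) + 10


task = [1, 2, 3]
print(imitative_loss(honest, task))    # 0
print(imitative_loss(inflated, task))  # 100
print(approval_loss(honest, task, gameable_approval))    # -6
print(approval_loss(inflated, task, gameable_approval))  # -16
```

In this toy, the honest model is a fixed point of the imitative loss, while the inflating model scores better under the flawed approval signal. That is only as informative as the assumed rater, but it shows where the two objectives come apart: one is anchored to what the amplified human would say, the other to what the amplified human would reward.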

So, why does imitative amplification have a stronger case for being outer aligned at optimum than approval-based amplification or debate? Well, precisely because we know what the optimum of imitative amplification is—it's HCH—whereas we really don't know what perfect approval-based amplification or perfect debate look like.[4] Though some challenges have been raised regarding whether HCH is actually aligned or not, I tend to be fairly skeptical of these challenges—HCH is just a bunch of humans after all and if you can instruct them not to do things like instantiate arbitrary Turing machines, then I think a bunch of humans put together has a strong case for being aligned.[5] That being said, the same argument does not at all apply to approval-based amplification or debate.

First, let's consider approval-based amplification.[6] We know what the optimum of imitative amplification looks like—but what is the optimum of approval-based amplification? At first glance, one might imagine that the optimum of approval-based amplification looks like a model whose output is selected to be maximally approved of by HCH. That's very much not the case for the approval-based scheme I described earlier, however. If each step of training is done via maximizing an approval signal, then instead of a tree of just humans you get a tree of humans interspersed with models trying to maximize the approval that their descendants in the tree would assign to their answers. And if you think that human approval can be gamed—which seems extremely likely in my opinion given that we see exactly that sort of gaming happening in our world already all the time—then this is very much not a safe tree. Now, one could make the argument that approval-based amplification can just become imitative amplification if the humans determine their approval by computing a distance function between what they would have said and what the model produced as its output. For example, you could ask your humans to come up with their answers first, then show them the model's answer and ask them to rate how close it was. I'm pretty skeptical of this approach, however—it doesn't seem at all clear to me that this gets around the approval-gaming problem, since the humans still get to see the model's answer and doing so could significantly change how they're thinking about the rating problem.[7]

Now, second, let's consider debate with a human judge. In many ways, debate was designed as an approach meant to fix the problems of approval-based reward signals. With a standard approval-based reward signal, the argument goes, it's easy to be tricked by a bad argument that you don't fully understand. In a debate setup, however, you get the benefit of having two competing systems trying to point out flaws in each other's arguments, which hopefully should prevent you from being tricked by bad arguments and thus fix the approval-gaming problem. I'm not convinced, though—false things can be significantly easier to argue for than true things just because there are fewer ways to attack them, they're more rhetorically powerful, or any other number of possible ways in which an argument can be subtly wrong yet still persuasive.[8] Regardless, however, I think the more fundamental objection is just that we really have no way of knowing what optimal play in debate looks like, which makes it very difficult to ever know whether it is outer aligned at optimum or not. With HCH, we know that it just looks like a tree of humans, which at least means we can reason about the parts and how they interact. With optimal debate, however, we have to somehow analyze, understand, and be confident in the alignment of superhuman play on a game involving humans assessing arbitrary strings of characters, which is something that in my opinion seems extremely difficult to do.

Addressing competitiveness concerns

All of that is an argument for why we should prefer imitative amplification from an alignment standpoint. However, there's also the problem of imitative amplification just not being competitive in terms of capabilities with other approaches. First of all, I think it's important to remember the importance of putting safety first—if something isn't safe, then we shouldn't build it. Of course, arms race dynamics could end up forcing one's hand into going with the best currently available option in order to beat some other team which one believes will produce an AI which is even less likely to be safe, though I think it's important to remember that that's a last resort, not the default option. Furthermore, even in such a situation, it's still probably fine to eat an overhead cost that is just something like a constant factor worse.

With that being said, I still think there are strong arguments to be made for why imitative amplification can be done competitively. First, like the silly outer alignment scheme of “just train an image classification model” from earlier, imitative amplification gets to piggy-back off of generic ML progress. Imitative amplification is just a language modeling problem, which means generic progress on language modeling should generally be transferable to imitative amplification. Second, I think there is a strong case for language being sufficiently rich as a dataset for training an AGI (EDIT: where “language” is construed to also include embedded images, videos, etc.), at least for the sorts of tasks which I think you will want to use your first AGI for.[9] For example, if the primary/most important purpose of your first AGI is to help you build your second AGI by helping you improve your AGI design, that's the sort of highly cognitive task which I think language is sufficient for. Certainly, if you needed your first AGI to be able to do fine motor control to be competitive, then imitative amplification probably won't get you there—but it seems pretty unlikely to me that ability to do fine motor control will be an important desideratum. Third, a common criticism of imitative amplification is that because imitation treats all data points the same, it won't be able to focus on the most important ones. However, that's not something that's fundamental to the task of imitation. For example, you could use active learning to select the most important data points rather than just using a fixed curriculum.[10] Or, you could even weight different data points in your imitation loss using some outside importance criterion while still maintaining the guarantee of perfect imitation at optimum.
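One way to see why importance weighting need not change the optimum is the following short derivation. It is a sketch under assumptions not stated in the post: the weights $w(x)$ are strictly positive, $d$ is a distance that is zero only when its two arguments are equal, and the sum over inputs is well-defined (for example, because the input set is finite).

```latex
L_w(M) \;=\; \sum_{x \in X} w(x)\, d\bigl(M(x), \mathrm{HCH}(x)\bigr),
\qquad w(x) > 0, \quad d(a, b) \ge 0, \quad d(a, b) = 0 \iff a = b .

% Every term is nonnegative, so the sum is zero iff every term is zero:
L_w(M) = 0
\;\iff\; d\bigl(M(x), \mathrm{HCH}(x)\bigr) = 0 \ \text{ for all } x \in X
\;\iff\; M(x) = \mathrm{HCH}(x) \ \text{ for all } x \in X .
```

Under these assumptions, the set of zero-loss models is the same for every choice of strictly positive weights, so the weights only affect which errors get corrected first during training. The guarantee would fail on any input where $w(x) = 0$, since the loss would then place no constraint on the model's behavior there.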

Regardless, I think the case for imitative amplification's safety is a strong argument in favor of at least focusing on figuring out whether it works and is safe first, before we give up and move to other approaches. Furthermore, even if imitative amplification on its own isn't competitive, I don't think that means we have to abandon it completely—there are modifications to imitative amplification that can be made to help improve competitiveness without sacrificing all of its benefits. For example, you could do reward-modeling-based distillation (e.g. RL + IRL as the distillation step) instead of imitation-based distillation, which, while not imitative (as the optimum isn't HCH anymore), also isn't based on human approval, which could be a nice property. Alternatively, you could first train an HCH model, and then use that model as the judge to train a debate model, which could have significant benefits over just using a human judge. While I don't think we should be focusing on those sorts of things now, the fact that such options exist makes it more likely that imitative amplification work can transfer to future approaches even if imitative amplification itself ends up not being competitive. In any event, I think the case for focusing on imitative amplification right now both from an outer alignment perspective as well as from a competitiveness perspective is quite strong.[11]


  1. There is still lots of potential for outer alignment work to be outdated by machine learning progress, however—see, for example, Daniel Kokotajlo's “A dilemma for prosaic AI alignment.” ↩︎

  2. I mentioned this previously a bit in this comment. ↩︎

  3. Note that the two categories of “imitative” and “approval-based” amplification do not cover the entire space of possible amplification-based approaches—there are other possible schemes in this domain as well. For example, you could use imitative amplification to train an HCH approximator, then do RL to produce a model which maximizes that model's approval—or even use your HCH model as a judge in a debate. Alternatively, you could do imitative amplification but instead of using standard imitation learning you could do IRL + RL instead. All of these different approaches have different alignment properties—I have singled out imitative amplification, approval-based amplification, and debate with a human judge because they are the approaches I'm most interested in talking about here, though they are far from the only ones. ↩︎

  4. Note that for the optimum of imitative amplification to be precisely HCH, you need it to be the case that you progressively enlarge your training data as you go along. The fact that you don't get good guarantees for finite datasets is certainly a problem, though it's one that you basically have to solve via inner alignment techniques and thus not one I want to focus on right now. ↩︎

  5. The question of whether theoretical HCH is aligned or not is a pretty complicated question that I don't really want to go into in full detail right now, so if you strongly disagree just take it as a given for this post. ↩︎

  6. Though there was a previous claim by William Saunders that RL amplification and imitative amplification are equivalent, I think that both of William's proposals there fall into my approval-based category, not my imitative category. See Rohin Shah's and Wei Dai's comments on William's post to that effect. ↩︎

  7. In particular, this breaks the analogy to counterfactual oracles. ↩︎

  8. I have a lot more to say on this point regarding reasons why false arguments can be more persuasive than true ones, though that's not something I want to go into in too much detail right now. ↩︎

  9. Much of my thinking here owes a debt to Geoffrey Irving. I also talked about the case for language being all you need a bit previously in this comment. ↩︎

  10. Such an active learning scheme could even borrow lessons from across-episode exploration in RL. ↩︎

  11. I also think imitative amplification has some nice inner alignment properties as well, since it gives you an amplified overseer to use for transparency/relaxed adversarial training. ↩︎

11 comments

Aside from some quibbles, this matches my understanding pretty well, but may leave the reader wondering why Paul Christiano and Ought decided to move away from imitative amplification to approval-based amplification. To try to summarize my understanding of their thinking (mostly from an email conversation in September of last year between me, you (Evan), Paul Christiano, and William Saunders):

  • William (and presumably Paul) think approval-based amplification can also be outer aligned. (I do not have a good understanding of why they think this, and William said "still have an IOU pending to provide a more fleshed out argument why it won't fail.")
  • Paul thinks imitative amplification has a big problem when the overseer gets amplified beyond the capacity of the model class that's being trained. (Approximating HCH as closely as possible wouldn't lead to good results in that case unless we had a rather sophisticated notion of "close".)
  • I replied that we could do research into how the overseer could effectively dumb itself down, similar to how a teacher would dumb themselves down to teach a child. One approach is to use a trial-and-error process, for example ramping up the difficulty of what it’s trying to teach and then backing down if the model stops learning well, and trying a different way of improving task performance and checking if the model can learn that, and so on. (I didn't get a reply on this point.)
  • William also wrote, "RL-IA is easier to run human experiments in, because the size of trees to complete tasks, and the access to human experts with full knowledge of the tree (eg the Ought reading comprehension experiments) I'd lean towards taking the position that we should try to use SL-IA where possible, but some tasks might just be much easier to work with in RL-AI"

Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want.

I would argue that according to this definition, there are no loss functions that are outer aligned at optimum (other than ones according to which no model performs optimally). [EDIT: this may be false if a loss function may depend on anything other than the model's output (e.g. if it may contain a regularization term).]

For any model $M$ that performs optimally according to a loss function $L$ there is a model $M'$ that is identical to $M$ except that at the beginning of the execution it hacks the operating system or carries out mind crimes. But for any input, $M$ and $M'$ formally map that input to the same output, and thus $M'$ also performs optimally according to $L$, and therefore $L$ is not outer aligned at optimum.

Overall, I think imitative amplification seems safer, but I don't think the distinction is quite as clear-cut as this post gives the impression it is.

if you can instruct them not to do things like instantiate arbitrary Turing machines

I think this and "instruct them not to search over arbitrary text strings for the text string that gives the most approval", and similar things, are the kind of details that would need to be filled out to make the thing you are talking about actually be in a distinct class from approval-based amplification and debate. (My post on imitation and RL amplification was intended to argue that without further restrictions, imitation amplification is in the same class as approval-based amplification, which I think we'd agree on.) I also think that specifying these restrictions in a way that still lets you build a highly capable system could require significant additional alignment work (as in the Overseer's Manual scenario here).

Conversely, I also think there are ways that you can limit approval-based amplification or debate: you can have automated checks, for example, that discard possible answers that are outside of a certain defined safe class (e.g. debate where each move can only be from either a fixed library of strings that humans produced in advance or single direct quotes from a human-produced text). I'd also hope that you could do something like have a skeptical human judge that quickly discards anything they don't understand + an ML imitation of the human judge that discards anything outside of the training distribution (I don't have a detailed model of this, so maybe it would fail in some obvious way).

I think I do believe that for problems where there is an imitative amplification decomposition that solves the problem without doing search, that's more likely to be safe by default than approval-based amplification or debate. So I'd want to use imitative amplification as much as possible, falling back to approval only if needed. On imitative amplification, I'm more worried that there are many problems it can't solve without doing approval-maximizing search, which brings the old problems back in again. (e.g. I'm not sure how to use imitative amplification at the meta-level to produce better decomposition strategies than humans use without using approval-based search)

I tend to be fairly skeptical of these challenges—HCH is just a bunch of humans after all and if you can instruct them not to do things like instantiate arbitrary Turing machines, then I think a bunch of humans put together has a strong case for being aligned.

Minor nitpick: I mostly agree, but I feel like a lot of work is being done by saying that they can't instantiate arbitrary Turing machines, and that it's just a bunch of humans. Human society is also a bunch of humans, but frequently does things that I can't imagine any single intelligent person deciding. If your model breaks down for relatively human-human combinations, I think there is a significant risk that true HCH would be dangerous in quite unpredictable ways.

a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want.

Why is the word "trying" necessary here? Surely the literal optimal model is actually doing what we want, and never has even benign failures?

The rest of the post makes sense with the "trying to do what we want" description of alignment (though I don't agree with all of it); I'm just confused with the "outer alignment at optimum" formalization, which seems distinctly different from the notion of alignment used in the rest of the post.

I think I'm quite happy even if the optimal model is just trying to do what we want. With imitative amplification, the true optimum—HCH—still has benign failures, but I nevertheless want to argue that it's aligned. In fact, I think this post really only makes sense if you adopt a definition of alignment that excludes benign failures, since otherwise you can't really consider HCH aligned (and thus can't consider imitative amplification outer aligned at optimum).

I may have asked this already somewhere, but do you know if there's a notion of "outer aligned" that is applicable to oracles/predictors in general, as opposed to trying to approximate/predict HCH specifically? Basically the problem is that I don't know what "aligned" or "trying to do what we want" could mean in the general case. Is "outer alignment" meant to be applicable in the general case?

This post talks about outer alignment of the loss function. Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that for example if there is a security hole in the hardware or software environment and the model takes advantage of the security hole to hack its loss/reward, then we'd call that an "outer alignment failure". Or would it make more sense to use different terminology for that?

Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals—that is, they are at least trying to do what we want.

So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?

Also, related to Ofer's comment, can you clarify whether it's intended for this definition that the loss function only looks at the model's input/output behavior, or can it also take into account other information about the model?

HCH is just a bunch of humans after all and if you can instruct them not to do dumb things like instantiate arbitrary Turing machines

I believe the point about Turing machines was that given Low Bandwidth Overseer, it's not clear how to get HCH/IA to do complex tasks without making it instantiate arbitrary Turing machines. But other issues arise with HBO, as William Saunders wrote in the above linked post:

The reason for this system [LBO] being introduced is wanting to avoid security issues as the system scales. The fear is that there would be an “attack” on the system: an input that could be shown to an overseer that would cause the overseer to become corrupted and try to sabotage the system. This could be some kind of misleading philosophical argument, some form of blackmail, a human adversarial example, etc. If an input like this exists, then as soon as the first agent is corrupted, it can try to spread the attack to other agents. The first agent could be corrupted either by chance, or through an attack being included in the input.

I understand you don't want to go into details about whether theoretical HCH is aligned or not here, but I still want to flag that "instruct them not to do dumb things like instantiate arbitrary Turing machines" seems rather misleading. I'm also curious whether you have HBO or LBO in mind for this post.

Another thing that maybe I didn't make clear previously:

I believe the point about Turing machines was that given Low Bandwidth Overseer, it's not clear how to get HCH/IA to do complex tasks without making it instantiate arbitrary Turing machines.

I agree, but if you're instructing your humans not to instantiate arbitrary Turing machines, then that's a competitiveness claim, not an alignment claim. I think there are lots of very valid reasons for thinking that HCH is not competitive—I only said I was skeptical of the reasons for thinking it wouldn't be aligned.

I think there are lots of very valid reasons for thinking that HCH is not competitive—I only said I was skeptical of the reasons for thinking it wouldn’t be aligned.

But if you put aside competitiveness, can't HCH be trivially aligned? E.g., you give the humans making up HCH instructions to cause it to not be able to answer anything except simple arithmetic questions. So it seems that a claim of HCH being aligned is meaningless unless the claim is about being aligned at some level of competitiveness.

That's a good point. What I really mean is that I think the sort of HCH that you get out of taking actual humans and giving them careful instructions is more likely to be uncompetitive than it is to be unaligned. Also, I think that “HCH for a specific H” is more meaningful than “HCH for a specific level of competitiveness,” since we don't really know what weird things you might need to do to produce an HCH with a given level of competitiveness.

Is "outer alignment" meant to be applicable in the general case?

I'm not exactly sure what you're asking here.

Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that for example if there is a security hole in the hardware or software environment and the model takes advantage of the security hole to hack its loss/reward, then we'd call that an "outer alignment failure".

I would call that an outer alignment failure, but only because I would say that the ways in which your loss function can be hacked are part of the specification of your loss function. However, I wouldn't consider an entire training process to be outer aligned—rather, I would just say that an entire training process is aligned. I generally use outer and inner alignment to refer to different components of aligning the training process—namely the objective/loss function/environment in the case of outer alignment and the inductive biases/architecture/optimization procedure in the case of inner alignment (though note that this is a more general definition than the one used in “Risks from Learned Optimization,” as it makes no mention of mesa-optimizers, though I would still say that mesa-optimization is my primary example of how you could get an inner alignment failure).

So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?

Yes, though in the definition I gave here I just used the model class of all functions, which is obviously too large but has the nice property of being a fully general definition.

Also, related to Ofer's comment, can you clarify whether it's intended for this definition that the loss function only looks at the model's input/output behavior, or can it also take into account other information about the model?

I would include all possible input/output channels in the domain/codomain of the model when interpreted as a function.

I'm also curious whether you have HBO or LBO in mind for this post.

I generally think you need HBO and am skeptical that LBO can actually do very much.