Disentangling inner alignment failures

Erik Jenner

TL;DR: This is an attempt to disentangle some concepts that I used to conflate too much as just "inner alignment". This will be old news to some, but might be helpful for people who feel confused about how deception, distributional shift, and "sharp left turns" are related. I first discuss them as entirely separate threat models, and then talk about how they're all aspects of "capabilities are more robust than alignment".

Here are three different threat models for how an AI system could very suddenly do catastrophic things:

Deception: The AI becomes deceptively aligned at some point during training, and then does what we want only for instrumental reasons (because it wants to be deployed). Once we deploy, it starts pursuing its actual objective, which is catastrophic for humans.
Distributional shift: The AI behaves well during training, perhaps using some messy set of heuristics and proxy objectives. We deploy, and there's distributional shift in the AI's inputs, which leads to the model's proxies no longer being aligned with human values. But it's still behaving capably, so we again get catastrophic outcomes.
Capability gains/sharp left turn: At some point (while training or in deployment), the AI becomes much more capable, including at a bunch of things we didn't explicitly train for. This could happen quite suddenly, e.g. because it learns some crucial general skill in relatively few gradient steps, or because it starts learning from something other than gradients that's way faster. The properties of the AI that previously ensured alignment are too brittle and break during this transition.

Note that these can be formulated as entirely distinct scenarios. For example, deception doesn't require a distributional shift^[1] nor capability gains; instead, the sudden change in model behavior occurs because the AI was "let out of the box" during deployment. Conversely, in the distributional shift scenario, the model might not be deceptive during training, etc. (One way to think about this is that they rely on changes along different axes of the training/deployment dichotomy).

Examples

I don't think we have any empirical examples of deception in AI systems, though there are thought experiments. We do see kind of similar phenomena in interactions between humans, basically whenever someone pretends to have a different goal than they actually do in order to gain influence.

To be clear, here's one thing that is not an example of deception in the sense in which I'm using the word: an AI does things during training that only look good to humans even though they actually aren't, and then continues to do those things in deployment. To me, this seems like a totally different failure mode, but I've also seen this called "deception" (e.g. "Goodhart deception" in this post), thus the clarification.

We do have experimental evidence for goal misgeneralization under distributional shift (the second scenario above). A well-known one is the CoinRun agent from Goal misgeneralization in Deep RL, and more recently, DeepMind published many more examples.

A classic example for sudden capability gains is the history of human evolution. Relatively small changes in the human brain compared to other primates made cultural evolution feasible, which allowed humans to improve from a source other than biological evolutionary pressure. The consequence were extremely quick capability gains for humanity (compared to evolutionary time scales). This example contains both the "threshold mechanism", where a small change to cognitive architectures has big effects, and the "learning from another source mechanism", with the former enabling the latter.

In ML, grokking might be an example for the "threshold mechanism" for sudden capability gains: a comparatively small number of gradient steps can massively improve generalization beyond the training distribution. An example of learning from something other than gradients is in-context learning in language models (e.g. you can give an LM information in the prompt and it can use that information). But for now, this doesn't lead to permanent improvements to the language model.

Relations between these concepts

I used to conflate deception, distributional shift, and sharp left turns as "inner alignment" in a way that I now think wasn't helpful. But on the other hand, these do feel related, so what makes them similar?

One obvious aspect is that these could all lead to very sudden failures (as opposed to a pure "going out with a whimper" scenario). In each case, the AI might behave fine for a while—not just in terms of "looking fine" to human observers, but even under some ideal outer alignment solution. Then something changes, typically quite suddenly, and the AI behaves very differently (and likely badly in a way that would be obvious to us). The reason these scenarios are dangerous is thus that the AI could make high-stakes decisions, to use Paul's framing. I think this is the sense in which they all feel related to inner alignment.

A more interesting (but also more hand-wavy) point is that all three are in some sense about capabilities being more robust than alignment:

Deception: Just because we let the AI out of the box, it doesn't suddenly become incompetent. That would be quite strange indeed, assuming there is no significant distributional shift! It's also not obvious that alignment would fail. But at least there are plausible arguments for why gradient descent might most easily find models that do well in training but then are competently misaligned once they detect they are deployed. In contrast, there is no reason to think we'd get systems that decide to become incompetent once deployed. Such models are possible solutions to the outer optimization problem—you could have an AI that detects whether it's still in training, and if not just performs random actions, and this AI would get low training loss. But it's not a natural thing for gradient descent to find, it's just unnecessarily complex.
Distributional shift: The worry is precisely that capabilities will generalize better than goals across the distributional shift. If capabilities didn't generalize, we'd be fine. But as the CoinRun agent examplifies, you can get AIs that capably pursue a different objective after a distributional shift than the one you were hoping for. One difference to deception is that models which become incompetent after a distributional shift are in fact quite plausible. But to the extent that we think we'll get goal misgeneralization specifically, the underlying worry again seems to be that capabilities will be robust while alignment will not.
Capability gains: Nate's sharp left turn post explicitly discusses how this is about capability generalization without alignment generalization. We are worried that an AI will learn capabilities that robustly transfer to domains it hasn't been explicitly trained on, while alignment properties are brittle.

A note on distributional shift

I've specifically been talking about goal misgeneralization under distributional shift. Here, we mainly care about not generalizing catastrophically, i.e. competently pursuing a misaligned goal. There's a different type of distributional shift that's important in alignment, where we care about actually good generalization, rather than just non-catastrophic one. For example, for ELK we want to generalize from questions that humans can answer to questions we can't answer. But we don't just want to generalize in a way that doesn't kill us if we read the answer, we want to generalize in the way that gives correct answers. This seems distinct from what I've discussed in this post in some important ways, though there are definitely parallels (e.g. the generalization we want not being natural enough that we're guaranteed to get it automatically).

Thanks to Leon Lang for feedback on a draft of this post!

^{^}
A tiny distributional shift would be enough to let the model determine whether it's in training or deployment, and in principle it could even be deceptive without any expected distributional shift, by defecting randomly.

Distributional shift: The worry is precisely that capabilities will generalize better than goals across the distributional shift. If capabilities didn't generalize, we'd be fine. But as the CoinRun agent examplifies, you can get AIs that capably pursue a different objective after a distributional shift than the one you were hoping for. One difference to deception is that models which become incompetent after a distributional shift are in fact quite plausible. But to the extent that we think we'll get goal misgeneralization specifically, the underlying worry again seems to be that capabilities will be robust while alignment will not.

One thing to flag is that even if for any given model, the probability of capabilities generalizing is very low, total doom can still be high, since there might be many tries at getting models that generalize well across distributional shifts, whereas the selection pressures to getting alignment robustness is comparably weaker. You can imagine a 2x2 quadrant of capabilities vs alignment generalizability across distributional shift:

Capabilities doesn't generalize, alignment doesn't: irrelevant

Capabilities doesn't generalize, alignment does: irrelevant

Capabilities generalizes, alignment doesn't: potentially very dangerous, especially if power-seeking. Agent (or agent and friends) acquires more power and may attempt a takeover.

Capabilities generalizes, alignment does: Good, but not clearly great. By default I won't expect it to be powerseeking (unless you're deliberately creating a sovereign), so it only has as much power as humans allow it to have. So the AI might risk being outcompeted by their more nefarious peers.

I like this post and agree that there are different threat models one might categorize broadly under "inner alignment". Before reading this I hadn't reflected on the relationship between them.

Some random thoughts (after an in-person discussion with Erik):

For distributional shift and deception, there is a question of what is treated as fixed and what is varied when asking whether a certain agent has a certain property. E.g., I could keep the agent constant but put it into a new environment, and ask whether it is still aligned. Or I could keep the environment constant but "give the agent more capabilities". Or I could only change some random number generator's input or output and observe what changes. The question of what I'm allowed to change to figure out whether the agent could do some unaligned thing in a new condition is really important; e.g., if I can change everything about the agent, the question becomes meaningless.
One can define deception as a type of distributional shift. E.g., define agents as deterministic functions. We model different capabilities via changing the environment (e.g. giving it more options) and treat any potential internal agent state and randomness as additional inputs to the function. In that case, if I can test the function on all possible inputs, there is no way for the agent to be unaligned. And deception is a case where the distributional shift can be extremely small and still lead to very different behavior. An agent that is "continuous" in the inputs cannot be deceptive (but it can still be unaligned after distributional shift in general).
It is a bit unclear to me what exactly the sharp left turn means. It is not a property that an agent can have, like inner misalignment or deceptiveness. One interpretation would be that it is an argument for why AIs will become deceptive (they suddenly realize that being deceptive is optimal for their goals, even if they don't suddenly get total control over the world). Another interpretation would be that it is an argument why we will get x-risks, even without deception (because the transition from subhuman to superhuman capabilities happens so fast that we aren't able to correct any inner misalignment before it's too late).
One takeaway from the second interpretation of the sharp left turn argument could be that you need to have really fine-grained supervision of the AI, even if it is never deceptive, just because it could go from not thinking about taking over the world to taking over the world in just a few gradient descent steps. Or instead of supervising only gradient descent steps, you would also need to supervise intermediate results of some internal computation in a fine-grained way.
It does seem right that one also needs to supervise intermediate results of internal computation. However, it probably makes sense to focus on avoiding deception, as deceptiveness would be the main reason why supervision could go wrong.

Thanks for the comments!

One can define deception as a type of distributional shift. [...]

I technically agree with what you're saying here, but one of the implicit claims I'm trying to make in this post is that this is not a good way to think about deception. Specifically, I expect solutions to deception to look quite different from solutions to (large) distributional shift. Curious if you disagree with that.

Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well. E.g., solutions to capability generalization vs solutions to goal generalization.)

I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would not go ahead and suddenly start maximizing some different goal function, but instead would query the human first). Stuart Armstrong might claim a similar thing about concept extrapolation?

I personally think it is probably best to just try to work on deceptiveness directly instead of solving some more general problem and hoping non-deceptiveness is a side effect. It is probably harder to find a general solution than to solve only deceptiveness. Though maybe this depends on one's beliefs about what is easy or hard to do with deep learning.

Distributional shift: The worry is precisely that capabilities will generalize better than goals across the distributional shift. If capabilities didn't generalize, we'd be fine. But as the CoinRun agent examplifies, you can get AIs that capably pursue a different objective after a distributional shift than the one you were hoping for. One difference to deception is that models which become incompetent after a distributional shift are in fact quite plausible. But to the extent that we think we'll get goal misgeneralization specifically, the underlying worry again seems to be that capabilities will be robust while alignment will not.