Robustness, as used in ML, means that your model continues to perform well even for inputs that are off-distribution relative to the training set.
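To make that concrete, here is a minimal sketch (toy synthetic data and a hypothetical distribution shift; nothing about the specific numbers matters) of a classifier that performs well on fresh samples from its training distribution but degrades on shifted inputs:

```python
# Toy sketch: train a classifier on one input distribution, evaluate it on a shifted one.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, shift=0.0):
    # Two Gaussian classes in 2-D; `shift` translates the whole input distribution.
    X = np.concatenate([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))]) + shift
    y = np.concatenate([np.zeros(n), np.ones(n)])
    return X, y

X_train, y_train = make_data(500)           # training distribution
X_iid, y_iid = make_data(500)               # fresh samples from the same distribution
X_ood, y_ood = make_data(500, shift=3.0)    # off-distribution inputs

model = LogisticRegression().fit(X_train, y_train)
print("on-distribution accuracy: ", model.score(X_iid, y_iid))    # high
print("off-distribution accuracy:", model.score(X_ood, y_ood))    # much lower
```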

Inner alignment refers to the following problem: How can we ensure that the policy an AI agent ends up with robustly pursues the objective that we trained it on? By default, we would only expect the policy to track that objective on the training distribution.

Both a lack of robustness and an inner alignment failure thus lead to an AI agent that might do unforeseen things when it encounters off-distribution inputs.

What’s the difference? I can (maybe) construct a difference if I assume that an AI agent’s “competence” and its “intent” are distinct things.

There is some intuition that a lack of robustness relates to competence: The self-driving car really “wanted” to bring its passengers home safely. But then it started snowing, and because the car’s vision system had only been trained in sunny weather, it didn’t spot the red traffic light and crashed. It was an honest mistake.

And there is some intuition that an inner alignment failure relates to intent: The nascent AGI never really cared about helping humans. It just played nice because it knew it would be deleted otherwise. As soon as it became powerful enough to take over the world (a situation it never encountered during training), it did so.

However, the distinction between “competence” and “intent” doesn’t seem to apply to RL agents (and maybe not even to humans). RL agents just receive inputs and select actions. I wouldn’t be able to point to the “intent” of an RL agent. So what’s the difference between robustness and inner alignment then?
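To illustrate what I mean, here is a minimal sketch (a hypothetical linear policy over a made-up observation and action space): an RL policy is just a parametrized map from observations to a distribution over actions, and there is no separate “intent” component anywhere to point at.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 4-dimensional observations, 3 discrete actions.
OBS_DIM, N_ACTIONS = 4, 3

# The entire agent is this weight matrix mapping observations to action logits.
weights = rng.normal(size=(OBS_DIM, N_ACTIONS))

def policy(observation: np.ndarray) -> int:
    """Sample an action from a softmax over the logits for this observation."""
    logits = observation @ weights
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(N_ACTIONS, p=probs))

# The agent receives an input and selects an action; there is no "intent" field to inspect.
print(policy(rng.normal(size=OBS_DIM)))
```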


Answer by Evan Hubinger


This is a good question. Inner alignment definitely is meant to refer to a type of robustness problem—it's just also definitely not meant to refer to the entirety of robustness. I think there are a couple of different levels on which you can think about exactly what subproblem inner alignment is referring to.

First, the definition that's given in “Risks from Learned Optimization”—where the term inner alignment comes from—is not about competence vs. intent robustness, but is directly about the objective that a learned search algorithm is searching for. Risks from Learned Optimization broadly takes the position that though it might not make sense to talk about learned models having objectives in general, it certainly makes sense to talk about a model having an objective if it is internally implementing a search process, and argues that learned models internally implementing search processes (which the paper calls mesa-optimizers) could be quite common. I would encourage reading the full paper to get a sense of how this sort of definition plays out.
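As a schematic illustration (a toy construction for the sake of this answer, not anything from the paper itself): a model that is “internally implementing a search process” produces its output by searching over plans for whatever scores best under some internal objective, and that internal objective is exactly the thing you can meaningfully point at.

```python
# Toy, self-contained sketch of a policy whose behavior comes from an internal search.
from itertools import product

ACTIONS = (-1, +1)  # move left or right on a number line

def transition(state, action):
    return state + action

def mesa_objective(state):
    # The objective the internal search actually optimizes (here: get close to 10).
    return -abs(state - 10)

def mesa_policy(state, horizon=4):
    """Enumerate all action sequences up to `horizon` and return the first action
    of the plan that scores best under the internal objective."""
    def plan_value(s, plan):
        for a in plan:
            s = transition(s, a)
        return mesa_objective(s)
    best_plan = max(product(ACTIONS, repeat=horizon), key=lambda plan: plan_value(state, plan))
    return best_plan[0]

print(mesa_policy(3))  # 1: the internal search pushes the agent toward 10
```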

Second, that being said, I do think that the competence vs. intent robustness framing that you mention is actually a fairly reasonable one. “2-D Robustness” presents the basic picture here, though in terms of a concrete example of what robust capabilities without robust alignment could actually look like, I am somewhat partial to my maze example. I think the maze example in particular presents a very clear story for how capability and alignment robustness can come apart even for agents that aren't obviously running a search process. The 2-D robustness distinction is also the subject of this alignment newsletter, which I'd highly recommend taking a look at as well, as it has some more commentary on thinking about this sort of definition.
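As a toy sketch in the same spirit (a hypothetical construction, not the exact setup from the maze post): suppose the training mazes always marked the exit with a green marker, so “navigate to the green marker” and “navigate to the exit” coincided during training. Off-distribution, where the marker sits somewhere other than the exit, the agent's navigation capabilities generalize fine while its objective does not.

```python
# Off-distribution maze: the exit E and the green marker G no longer coincide.
from collections import deque

MAZE = [
    "S..#....",
    "##.#.##.",
    "...#.#G.",
    ".###.#..",
    ".....#.E",
]

def find(symbol):
    return next((r, c) for r, row in enumerate(MAZE) for c, cell in enumerate(row) if cell == symbol)

def navigate_to(goal):
    """BFS from the start cell S to `goal`; stands in for the agent's (robust) navigation skill."""
    start = find("S")
    queue, seen = deque([(start, [start])]), {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) and MAZE[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None

# The learned objective ("go to the green marker") is pursued competently...
path = navigate_to(find("G"))
print("reached green marker:", path is not None and path[-1] == find("G"))  # True
# ...but the intended objective ("reach the exit") is not.
print("reached exit:", path is not None and path[-1] == find("E"))          # False
```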

Great answer, thanks!