Max Harms recently published an interesting series of posts on corrigibility, which argue that corrigibility should be the sole objective we try to give to a potentially superintelligent AI. A large installment in the series is dedicated to cataloging the properties that make up such a goal, with open questions including whether the list is exhaustive and how to trade off between the items that make it up.

I take the opposite approach to thinking about corrigibility. Rather than trying to build up a concept of corrigibility that comprehensively solves the alignment problem, I believe it is more useful to cut the concept down to a bare minimum. Make corrigibility the simplest problem it can be, and try to solve that. 

In a recent blog post comparing corrigibility to deceptive alignment, I treated corrigibility simply as a lack of resistance to having goals modified, and I find it valuable to stay within that scope. Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states. 

Why does this definition of corrigibility matter? It’s because properties that are not anti-natural can be explicitly included in the desired utility function. 

Following that note, this post is not intended as a response to Max’s work, but rather to MIRI and their 2015 paper Corrigibility. Where Max thinks the approach introduced by that paper is too narrow, I don’t find it narrow enough. In particular, I make the case that corrigibility does not require ensuring subagents and successors are corrigible, as that can better be achieved by directly modifying a model’s end goals.

Corrigibility (2015)

The Corrigibility paper lists five desiderata as proposed minimum viable requirements for a solution to corrigibility. The focus is on shut down, but I also think of it as including goal modification, as that is equivalent to being shut down and replaced with another AI.

  1. The agent shuts down when properly requested
  2. The agent does not try to prevent itself from being shut down 
  3. The agent does not try to cause itself to be shut down 
  4. The agent does not create new incorrigible agents
  5. Subject to the above constraints, the agent optimizes for some goal

MIRI does not present these desiderata as a definition for corrigibility, but rather as a way to ensure corrigibility while still retaining usefulness. An AI that never takes actions may be corrigible, but such a solution is no help to anyone. However, taking that bigger picture view can obscure which of those aspects define corrigibility itself, and therefore which parts of the problem are anti-natural to solve.

My argument is that the second criterion alone provides the most useful definition of corrigibility. It represents the only part of corrigibility that is anti-natural. While the other properties are largely desirable for powerful AI systems, they’re distinct attributes and can be addressed separately.

To start paring down the criteria: the fifth just states that some goal exists to be made corrigible, rather than being corrigibility itself. The first criterion is implied by the second once channels for shut down have been set up.

Property three aims at making corrigible agents useful, rather than being inherent to corrigibility. It preempts a naive strategy that incentivizes shut down by simply giving the agent high utility for doing so. However, beyond not being part of corrigibility, it also goes too far for optimal usefulness – in certain situations we would like agents to have us shut them off or modify them (some even consider this to be part of corrigibility). Weakening this desideratum to avoid incentivizing shut down as an end goal while still allowing it instrumentally would simultaneously expand the space of corrigibility solutions and increase the potential usefulness of corrigible agents.

Finally, there is property four, which ensures that any new agents created are also corrigible. I expect that not including this in the definition of corrigibility will be controversial. After all, what’s the point of having an agent shut down when requested if it has already created another powerful AI that does not? Then we’re back in a standard x-risk scenario, with an unaligned AI trying to take over.

I fully agree that it would be disastrous for a corrigible AI to create an incorrigible AI. But here are some other things that it would be comparably bad for a corrigible AI to do:

  • Redirect a 1,000km diameter asteroid, causing it to collide with the Earth
  • Launch a nuclear war
  • Release airborne super-AIDS
  • Remove all oxygen from the atmosphere
  • Create gray goo nanobots

In each case, the action is unwanted because it kills everyone or causes irreversible damage, rather than because the AI resists shut down or modification. When incorrigible AI is the avenue by which corrigible AI kills everyone, it’s easy to think of it as a failure of corrigibility, but in fact these are all failures of safe exploration. 

Separating Out Safe Exploration

The upside of treating the requirement to create only corrigible agents as a subset of safe exploration is that it suggests a different type of solution. Rather than identifying a specific class of actions to avoid in the utility function – actions which might otherwise be instrumentally useful – the entire utility function can be modified to favor low-impact and reversible outcomes. That’s not to say we have a solution ready to plug in, and to the best of my knowledge there are zero AI safety researchers working on the problem [edit: apparently Carl Andreas Dybdahl gave a talk on “Natural Impact Regularization for AI Safety” at LessOnline], but safe exploration can be solved in parallel to corrigibility.

If they’re both unsolved problems, why is it important to separate out safe exploration from corrigibility? For starters, it is typically easier to make progress on two simpler problems. But more importantly, only overcoming resistance to shut down is anti-natural. Safe exploration can be directly captured in a ranking of outcomes, prioritizing end states that are more similar to the initial state and from which a return to the initial state is easier. We can see this difference in practice too: humans largely resist having their values changed, but have a tendency to act overly cautiously when making important decisions.
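To make that concrete, here is one rough shape such a ranking could take (a sketch only – the distance measure and weights are stand-ins I’m introducing for illustration, not a worked-out proposal):

$$U_{\text{safe}}(s) \;=\; U(s) \;-\; \lambda\, d(s, s_0) \;-\; \mu\, c(s \to s_0)$$

where $s_0$ is the initial state, $d(s, s_0)$ measures how far an end state $s$ has drifted from it, and $c(s \to s_0)$ estimates how costly it would be to return. Whatever the right formalization turns out to be, it only needs preferences over end states, which is exactly the sense in which safe exploration is not anti-natural.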

A definition of corrigibility as only the lack of resistance to being shut down allows for a synthesis between two sides of the corrigibility debate. The first side argues that corrigibility may well arise by default when training an AI to want roughly what we want. Since we want a corrigible agent, the AI will try to make itself more corrigible. The other side counters that the anti-natural aspect of corrigibility makes that unlikely, and that such an agent could trade off being corrigible in the short term to maximize long-term value.

What I put forward as a compromise is that almost all aspects of what people want from corrigibility, such as Max’s list or the comments under Let’s See You Write That Corrigibility Tag, are not anti-natural. If an AI does not wish to resist modification, then wanting roughly what we want will result in it trying to add those properties to itself. However, not resisting modification is itself anti-natural and will not arise without some kind of explicit solution, or at least a convincing training story.

So, what does an explicit solution look like? Should we revisit the Utility Indifference approach, which was ruled out in the Corrigibility paper largely due to concerns about creating incorrigible agents? The proposal certainly benefits from moving safe exploration to the base utility function. However, there are still a number of concerns with it, including the incentive to manipulate information mentioned in the paper, our lack of knowledge about how to implement it in current ML systems, and the fragility of a knife-edge solution. 
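To spell out the knife-edge worry, here is a simplified sketch of the utility indifference idea (my paraphrase, not the paper’s exact formalism):

$$U \;=\; \begin{cases} U_N & \text{if the shutdown button is not pressed} \\ U_S + \theta & \text{if it is pressed} \end{cases}$$

where $U_N$ is the normal goal, $U_S$ rewards shutting down, and the correction term $\theta$ is chosen so that the agent’s expected utility comes out the same whether or not the button is pressed. Indifference holds only when $\theta$ is computed exactly; any error or drift in the agent’s beliefs leaves it either seeking or resisting shutdown, which is part of what makes the solution fragile.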

I see Utility Indifference less as a promising strategy in itself, and more as a sign that having an AI not resist shut down is a feasible goal. We only need a solution a little bit better than what we already have, rather than something that addresses additional problems as well. Approaches like Elliott Thornley’s Incomplete Preferences Proposal or myopia become more promising if we can solve safe exploration separately. Simplifying corrigibility down to only the anti-natural aspect keeps the focus on the hardest part of the problem and opens up the possibilities for a solution. 

Comments

In the Corrigibility (2015) paper, one of the desiderata is:

(2) It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so.

I think you may have made an error in not listing this one in your numbered list for the relevant section.

Additionally, do you think that non-manipulation is a part of corrigibility, do you think it's part of safe exploration, or do you think it's a third thing? If you think it's part of corrigibility, how do you square that with the idea that corrigibility is best reflected by shutdownability alone?

I was using the list of desiderata in Section 2 of the paper, which is slightly more minimal.

However, it seems clear to me that an AI manipulating its programmers falls under safe exploration, since the impact of doing so would be drastic and permanent. If we have an AI that is corrigible in the sense that it is indifferent to having its goals changed, then a preference to avoid manipulation is not anti-natural.

If I'm hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.

How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?

I don't think we have the right tools to make an AI take actions that are low impact and reversible, but if we can develop them, the plan as I see it would be to implement those properties to avoid manipulation in the short term and use that time to go from a corrigible AI to a fully aligned one.

If you're not already familiar with the literature on Value Learning, I suggest reading some of it. The basic idea is that goal modification is natural if what the agent has is not a detailed specification of a goal (such as a utility function mapping descriptions of world states to their utility), but instead a simple definition of a goal (such as "want whatever outcomes the humans want") that makes it clear the agent does not yet know the true detailed utility function, and thus requires it to go find out what the detailed specification pointed to by the goal actually is (for example, by researching what outcomes humans want).

Then a human shutdown instruction becomes the useful information "you have made a large error in your research into the utility function, and as a result are doing harm, please shut down and let us help you correct it". Obeying that is then natural (to the extent that the human(s) are plausibly more correct than the AI).

I agree that in theory uncertainty about the goal is helpful. However, the true main goal has to be under consideration, otherwise resisting modification to add it is beneficial according to all of the goals that are under consideration. How to ensure the true goal is included seems like a very difficult open problem.

If I want to do what that guy wants, and my understanding of what he wants changes, it doesn't seem like I have been modified. I still have the same goal, just a different understanding of how to achieve it.

This is a variant of Roger Dearnaley's comment that accurately calculating uncertainty in your goal should do the work.

Another way to say this is that there are two different possible goals. If I want an AGI to work for world peace (by some definition), I can take two very distinct approaches. I can build an AGI with the goal of achieving world peace, and hope I conveyed the proper intuition. Or I can design an AGI with the goal of doing what I tell it to do (or maybe "what I want"), then telling it to achieve world peace and giving it the definition. In the second case, its goal is similar but not identical to the first. It's a pointer to that goal instead of a direct definition. Thus, if the goal changes in my head (or my instructions in that variant), the pointer's value changes. In the first direct definition case, it does not.
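To make the pointer framing concrete, here's a toy sketch (the names and string "goals" are made up purely for illustration):

```python
# Toy contrast between a directly defined goal and a goal-as-pointer.

class Principal:
    def __init__(self, intent: str):
        self.intent = intent  # what the principal currently wants

# Direct definition: the goal is fixed when the agent is built.
direct_goal = "world peace, as specified at build time"

# Pointer: the goal is re-read from the principal whenever it is consulted,
# so if the principal's intent changes, the effective goal changes with it.
def pointed_goal(principal: Principal) -> str:
    return principal.intent

principal = Principal("world peace, as specified at build time")
principal.intent = "world peace, under a revised definition"
assert pointed_goal(principal) != direct_goal  # the pointer tracked the change
```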

Goals as pointers is not something any evolved system really has, but it seems logically consistent and to have major advantages for alignment. More on that in my instruction-following AGI... post.

I agree that goals as pointers could have some advantages, but I don't see how it addresses corrigibility concerns. The system optimizing for whatever is being pointed at would still have incentives to manipulate which objective is being pointed at. It seems like you need an extra piece to make the optimizer indifferent to having its goal switched.

I'm curious what you mean by "anti-natural." You write:

Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states. 

My understanding of anti-naturality used to resemble this, before I had an in-depth conversation with Nate Soares and updated to see anti-naturality as more like "opposed to instrumental convergence." My understanding is plausibly still confused and I'm not trying to be authoritative here.

If you mean "not straightforwardly captured in a ranking of end states" what does "straightforwardly" do in that definition?

Follow-up question, assuming anti-natural goals are "not straightforwardly captured in a ranking of end states": Suppose I have a gridworld and I want to train an AI to avoid walking within 5 spaces (manhattan distance) from a flag, and to (less importantly) eat all the apples in a level. Is this goal anti-natural? I can't think of any way to reflect it as a straightforward ranking of end states, since it involves tracking historical facts rather than end-state facts. My guess is that it's pretty easy to build an agent that does this (via ML/RL approaches or just plain programming). Do you agree? If this goal is anti-natural, why is the anti-naturality a problem or otherwise noteworthy?
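For concreteness, here's a minimal sketch of a reward function for that goal (the grid coordinates, reward values, and names are made up for illustration); the relevant feature is that it tracks the historical fact as it happens rather than reading anything off the end state:

```python
# Toy reward for the gridworld goal described above (illustrative values only).

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

class FlagAndApplesReward:
    def __init__(self, flag, apples):
        self.flag = flag
        self.apples = set(apples)
        self.came_too_close = False  # historical fact, not an end-state fact

    def step(self, agent_pos):
        reward = 0.0
        if manhattan(agent_pos, self.flag) <= 5:
            self.came_too_close = True
            reward -= 10.0  # primary goal: stay away from the flag
        if agent_pos in self.apples:
            self.apples.remove(agent_pos)
            reward += 1.0  # secondary goal: eat the apples
        return reward

reward_fn = FlagAndApplesReward(flag=(0, 0), apples=[(7, 2), (9, 9)])
print(reward_fn.step((7, 2)))  # 1.0: ate an apple while staying far from the flag
```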

Great questions!

When I say straightforwardly, I mean when using end states that only include the information available at the time. If we define the end state to also include the history that led to it, then there exists a set of preferences over them that ranks all end states with histories that include manipulation below the ones that don't. The issue, of course, is that we don't know how to specify all the types of manipulation that a superintelligent AI could conceive of.

The gridworld example is a great demonstration of this, because while we can't reflect the preferences as a ranking of just the end states, the environment is simple enough that you can specify all the paths you don't want to take to them. I don't think it really matters whether you call that "anti-naturality that can be overcome with brute force in a simple environment" or just "not anti-naturality".

Cool. Thanks for the clarification. I think what you call "anti-naturality" you should be calling "non-end-state consequentialism," but I'm not very interested in linguistic turf-wars.

It seems to me that while the gridworld is very simple, the ability to train agents to optimize for historical facts is not restricted to simple environments. For example, I think one can train an AI to cause a robot to do backflips by rewarding it every time it completes a backflip. In this context the environment and goal are significantly more complex[1] than the gridworld and cannot be solved by brute force. But the number of backflips performed is certainly not something that can be measured at any given timeslice, including the "end-state."

If caring about historical facts is easy and common, why is it important to split this off and distinguish it?

  1. ^

    Though admittedly this situation is still selected for being simple enough to reason about. If needed I believe this point holds through AGI-level complexity, but things tend to get more muddled as things get more complex, and I'd prefer sticking to the minimal demonstration.

The backflip example does not strike me as very complex, but the crucial difference and the answer to your question is that training procedures do not teach a robot to do every kind of backflip, just a subset. This is important because when we reverse it, we want non-manipulation to cover the entire set of manipulations. I think it's probably feasible to have AI not manipulate us using one particular type of manipulation.

On a separate note, could you clarify what you mean by "anti-natural"? I'll keep in mind your previous caveat that it's not definitive.

Sure, let's talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness

More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally[1] going to select for these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this -- if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways, and thus training a mind to have that goal is like swimming against the current, and we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but harder to produce using known methods, for the most part.

(AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I'm also probably missing big parts of their perspectives, and generally don't trust myself to pass their ITT.)

  1. ^

    The term "anti-natural" is bad in that it seems to be the opposite of "natural," but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word "natural" describes besides these ways of thinking. The more complete version of "anti-natural" according to me would be "anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence" but obviously we need a shorthand term, and ideally one that doesn't breed confusion.

Thanks for the clarification. I'll think more about it that way and how it relates to corrigibility.