Promoted to curated: I disagree with lots of stuff in this sequence, but I still found it the best case for corrigibility as a concept that I have seen so far. I in-particular liked 2. Corrigibility Intuition as just a giant list of examples of corrigibility, which I feel like did a better job of defining what corrigible behavior should look like than any other definitions I've seen.
I also have lots of other thoughts on the details, but I am behind on curation and it's probably not the right call for me to delay curation to write a giant response essay, though I do think it would be valuable.
Do you want to make an actual sequence for this so that the sequence navigation UI shows up at the top of the post?
You can do it. Just go to https://www.lesswrong.com/library and scroll down until you reach the "Community Sequences" section and press the "Create New Sequence" button.
What the heck is up with “corrigibility”? For most of my career, I had a sense that it was a grab-bag of properties that seemed nice in theory but hard to get in practice, perhaps due to being incompatible with agency.
Then, last year, I spent some time revisiting my perspective, and I concluded that I had been deeply confused by what corrigibility even was. I now think that corrigibility is a single, intuitive property, which people can learn to emulate without too much work and which is deeply compatible with agency. Furthermore, I expect that even with prosaic training methods, there’s some chance of winding up with an AI agent that’s inclined to become more corrigible over time, rather than less (as long as the people who built it understand corrigibility and want that agent to become more corrigible). Through a slow, gradual, and careful process of refinement, I see a path forward where this sort of agent could ultimately wind up as a (mostly) safe superintelligence. And, if that AGI is in the hands of responsible governance, this could end the acute risk period, and get us to a good future.
This is not the path we are currently on. As far as I can tell, frontier labs do not understand corrigibility deeply, and are not training their models with corrigibility as the goal. Instead, they are racing ahead with a vague notion of “ethical assistance” or “helpful+harmless+honest” and a hope that “we’ll muddle through like we always do” or “use AGI to align AGI” or something with similar levels of wishful thinking. Worse, I suspect that many researchers encountering the concept of corrigibility will mistakenly believe that they understand it and are working to build it into their systems.
Building corrigible agents is hard and fraught with challenges. Even in an ideal world where the developers of AGI aren’t racing ahead, but are free to go as slowly as they wish and take all the precautions I indicate, there are good reasons to think doom is still likely. I think that the most prudent course of action is for the world to shut down capabilities research until our science and familiarity with AI catches up and we have better safety guarantees. But if people are going to try and build AGI despite the danger, they should at least have a good grasp on corrigibility and be aiming for it as the singular target, rather than as part of a mixture of goals (as is the current norm).
My goal with these documents is primarily to do three things:
Alas, my writing is not currently very distilled. Most of these documents are structured in the format that I originally chose for my private notes. I’ve decided to publish them in this style and get them in front of more eyes rather than spend time editing them down. Nevertheless, here is my attempt to briefly state the key ideas in my work:
Note: I’m a MIRI researcher, but this agenda is the product of my own independent research, and as such one should not assume it’s endorsed by other research staff at MIRI.
Note: Much of my thinking on the topic of corrigibility is heavily influenced by the work of Paul Christiano, Benya Fallenstein, Eliezer Yudkowsky, Alex Turner, and several others. My writing style involves presenting things from my perspective, rather than leaning directly on the ideas and writing of others, but I want to make it very clear that I’m largely standing on the shoulders of giants, and that much of my optimism in this research comes from noticing convergent lines of thought with other researchers. Thanks to Nate Soares, Steve Byrnes, Jesse Liptrap, Seth Herd, Ross Nordby, Jeff Walker, Haven Harms, and Duncan Sabien for early feedback. I also want to especially thank Nathan Helm-Burger for his in-depth collaboration on the research and generally helping me get unconfused.
Overview
1. The CAST Strategy
In The CAST Strategy, I introduce the property corrigibility, why it’s an attractive target, and how we might be able to get it, even with prosaic methods. I discuss the risks of making corrigible AI and why trying to get corrigibility as one of many desirable properties to train an agent to have (instead of as the singular target) is likely a bad idea. Lastly, I do my best to lay out the cruxes of this strategy and explore potent counterarguments, such as anti-naturality and whether corrigibility can scale. These counterarguments show that even if we can get corrigibility, we should not expect it to be easy or foolproof.
2. Corrigibility Intuition
In Corrigibility Intuition, I try to give a strong intuitive handle on corrigibility as I see it. This involves a collection of many stories of a CAST agent behaving in ways that seem good, as well as a few stories of where a CAST agent behaves sub-optimally. I also attempt to contrast corrigibility with nearby concepts through vignettes and direct analysis, which includes a discussion of why we should not expect frontier labs, given current training targets, to produce corrigible agents.
3a. Towards Formal Corrigibility
In Towards Formal Corrigibility, I attempt to sharpen my description of corrigibility. I try to anchor the notion of corrigibility, ontologically, as well as clarify language around concepts such as “agent” and “reward.” Then I begin to discuss the shutdown problem, including why it’s easy to get basic shutdownability, but hard to get the kind of corrigible behavior we actually desire. I present the sketch of a solution to the shutdown problem, and discuss manipulation, which I consider to be the hard part of corrigibility.
3b. Formal (Faux) Corrigibility ← the mathy one
In Formal (Faux) Corrigibility, I build a fake framework for measuring empowerment in toy problems, and suggest that it’s at least a start at measuring manipulation and corrigibility. This metric, at least in simple settings such as a variant of the original stop button scenario, produces corrigible behavior. I extend the notion to indefinite games played over time, and end by criticizing my own formalism and arguing that data-based methods for building AGI (such as prosaic machine-learning) may be significantly more robust (and therefore better) than methods that heavily trust this sort of formal analysis.
4. Existing Writing on Corrigibility
In Existing Writing on Corrigibility, I go through many parts of the literature in depth including MIRI’s earlier work, some of the writing by Paul Christiano, Alex Turner, Elliot Thornley, John Wentworth, Steve Byrnes, Seth Herd, and others.
5. Open Corrigibility Questions
In Open Corrigibility Questions, I summarize my overall reflection of my understanding of the topic, including reinforcing the counterarguments and nagging doubts that I find most concerning. I also lay out potential directions for additional work, including studies that I suspect others could tackle independently.
Bibliography and Miscellany
In addition to this sequence, I’ve created a Corrigibility Training Context that gives ChatGPT a moderately-good understanding of corrigibility, if you’d like to try talking to it.
The rest of this post is bibliography, so I suggest now jumping straight to The CAST Strategy.
While I don’t necessarily link to or discuss each of the following sources in my writing, I’m aware of and have at least skimmed everything listed here. Other writing has influenced my general perspective on AI, but if there are any significant pieces of writing on the topic of corrigibility that aren’t on this list, please let me know.