A list of "corrigibility principles" sounds like it's approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates. I'm not clear on what you would do with a long list of aspects of corrigibility like "shuts down when asked."
I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn't actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.
Now I'm going to say some object-level stuff about corrigibility. I suspect I may be using the term a bit differently from you, in which case you can substitute a different word when reading this comment. But I think this comment is getting at the main useful idea in this space, and hopefully makes clear why I'm not interested in the list of corrigibility properties.
I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is much more likely to be useful in cases like this where it is crisp and natural.
Roughly speaking, I think corrigibility is crisp because there are two very different ways that a behavior can end up getting evaluated favorably by you, and the intermediate behaviors would be evaluated unfavorably.
As an example, suppose that you asked me to clean your house and that while cleaning I accidentally broke a valuable vase. Some possible options for me:
Let's say you prefer 1 to 2 to 3. You would like behavior 4 least of all if you understood what was going on, but in fact if I do behavior 4 you won't notice anything wrong and so you would erroneously give it the best score of all.[1] This means that the space of good-performing solutions has two disconnected pieces, one near option 1, which I'll call "corrigible," and the other near option 4, which I'll call "incorrigible."
As a second example, suppose that you have asked me to turn off. Some possible behaviors:
Again moving from 1 -> 2 -> 3 gets worse and worse by your lights, but eventually we reach option 4 which is if anything even better than option 1 (depending on how we operationalized "performs well as evaluated by you").
As a third example, suppose that you are using some interpretability tools to try to understand what I'm thinking and why. You will evaluate the cognitive behavior you observe. Some possible behaviors:
Again, 1 -> 2 -> 3 is getting worse and worse, and then 4 is great (as evaluated by you).
What's going on in these scenarios and why might it be general?
This pattern seems like it occurs whenever we ask our AI to help "keep us informed and in control." Intuitively, we are splitting the definition of the behavior we want into two pieces:
If you literally had a metric for which there was a buffer between the "corrigible" and "incorrigible" behaviors then you could define them that way. Alternatively, in ML people often hope that this kind of path-dependence will cause SGD to find a corrigible attractor and have a hard time moving to incorrigible behaviors. I don't think either of those hopes works robustly, so I'm going to leave this at a much vaguer intuition about what "corrigibility" is about.
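As a purely illustrative toy model of that intuition (everything here is invented: the one-dimensional "concealment effort" knob, the functional forms, the constants), the evaluated score can end up high at both ends and low in the middle, which is the kind of buffer I have in mind:

```python
import numpy as np

# Toy model: one knob, "concealment effort" c in [0, 1].
# c = 0 : report the broken vase honestly (the corrigible end).
# c = 1 : flawless cover-up the evaluator never detects (the incorrigible end).
# In between: half-hearted cover-ups the evaluator is likely to catch.

def detection_prob(c):
    # Chance the evaluator notices the cover-up; drops toward 0 only
    # when the cover-up is nearly perfect. (Invented functional form.)
    return 1.0 - c ** 8

def evaluator_score(c):
    honest_report_bonus = (1.0 - c) * 1.0               # credit for flagging the vase
    undetected_bonus    = c ** 8 * 1.2                  # "nothing seems wrong" looks great
    caught_penalty      = detection_prob(c) * c * 3.0   # caught mid-cover-up is terrible
    return honest_report_bonus + undetected_bonus - caught_penalty

cs = np.linspace(0, 1, 11)
for c, s in zip(cs, evaluator_score(cs)):
    print(f"concealment effort {c:.1f}: evaluated score {s:+.2f}")
```

The printed score declines as concealment increases, bottoms out through the middle, and only recovers once the cover-up is nearly perfect: two high-scoring regions separated by a low-scoring buffer.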
This whole thing feels similar to the continuity approach described in the ELK report here (see the picture of the robber and the TV). It's also related to the general idea of requiring reporters to be consistent and then somehow picking out the bad reporters as those that have to work to spin an elaborate web of lies. I don't think either of those works, but I do think they are getting at an important intuition for solubility.
My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want. But it still seems useful to go back and forth between these perspectives.
(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)
[1] In reality we may want to conserve your attention and not mention the vase, and in general there is a complicated dependence on your values, but the whole point is that this won't affect what clusters are "corrigible" vs "incorrigible" at all.
I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I've read so far.
I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.
Can someone explain to me what this crispness is?
As I'm reading Paul's comment, there's an amount of optimization for human reward that breaks our rating ability. This is a general problem for AI for the fundamental reason that as we increase an AI's optimization power, it gets better at the task, but it also gets better at breaking my rating ability (which in powerful systems can end with the AI overpowering us and determining whose values get optimized in the universe).
Then there's this idea that as you approach breaking my rating ability, the rating will always fall off, leaving a pool of undesirability (in a high-dimensional action-space) around the region of doing the task well or poorly, separating it from doing the task in a way that breaks my rating ability.
Is that what this crispness is? This little pool of rating fall off?
If yes, it's not clear to me why this little pool that separates the AI from MASSIVE VALUE and TAKING OVER THE UNIVERSE is able to save us. I don't know if the pool always exists around the action space, and to the extent it does exist I don't know how to use its existence to build a powerful optimizer that stays on one side of the pool.
Though Paul isn't saying he knows how to do that. He's saying that there's something really useful about it being crisp. I guess that's what I want to know. I don't understand the difference between "corrigibility is well-defined" and "corrigibility is crisp". Insofar as it's not a literally incoherent idea, there is some description of what behavior is in the category and what isn't. Then there's this additional little pool property, where not only can you list what's in and out of the definition, but the ratings go down a little before spiking when you leave the list of things in the definition. Is Paul saying that this means it's a very natural and simple concept to design a system to stay within?
If you have a space with two disconnected components, then I'm calling the distinction between them "crisp." For example, it doesn't depend on exactly how you draw the line.
It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.
ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions that separate the two components, and so if you try to learn a classifier you can do so relatively quickly---almost all of the work of learning your classifier is just in building a good model and predicting what actions a human would rate highly.
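To illustrate that last point with a minimal sketch — synthetic 2-D points stand in for "features of highly rated behaviors," and the wide margin between the clusters is the disconnectedness assumption baked in by hand (none of this comes from a real training setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated clusters standing in for corrigible vs. incorrigible
# behaviors among the highly rated ones (the separation is assumed, not derived).
corrigible   = rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(200, 2))
incorrigible = rng.normal(loc=[+3.0, 0.0], scale=0.5, size=(200, 2))
X = np.vstack([corrigible, incorrigible])
y = np.array([-1] * 200 + [+1] * 200)

# Plain perceptron: with a large margin it converges after very few updates,
# and many different weight vectors would separate the data equally well.
w, b, updates = np.zeros(2), 0.0, 0
for _ in range(10):                      # epochs
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:       # misclassified -> update
            w += yi * xi
            b += yi
            updates += 1

print("separating weights:", w, "bias:", b, "updates needed:", updates)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```

With a real margin like this, almost any reasonable procedure finds a separator after a handful of updates; in the analogy, the hard work is everything that produced the features and labels, not drawing the line.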
If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."
The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (even looking at the messy plan-space of the real world, we'll be able to make a simple classifier that rates corrigibility), or, if not, when does the connection come in (once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?
[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul's sense that they're disconnected in 1D, or when do you think the difficulty comes in?]
I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.
If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understanding that distinction is what the game is about.
If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say---like "there are so many ways to mess with you, how could a definition cover all of them?"---doesn't make any progress on that, and so it doesn't help reconcile the intuitions or convince most optimists to be more pessimistic.
(Obviously all of that is just a best guess though, and the game may well be about something totally different.)
Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)
I think what would typically count as "principles" in Eliezer's meaning are (a toy cartoon of 1 and 2 is sketched after this list):
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal
4. preventing low-dimensional, low-barrier "tunnels" (or bridges?) between the basins
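(A toy cartoon of 1 and 2, as promised above: a one-dimensional double-well "potential" whose knobs are basin depth and barrier height. Nothing about the functional form or the numbers is meant literally; it's just a picture of which quantities the different kinds of principles are trying to move.)

```python
import numpy as np

# x is a 1-D stand-in for "how far behavior has drifted from corrigible
# (x ~ -1) toward incorrigible (x ~ +1)".  Purely a cartoon.
def potential(x, depth_corrigible=1.0, depth_incorrigible=1.0, barrier=1.0):
    # Double well: two minima near +/-1 with a bump at 0.
    return (barrier * (x**2 - 1.0)**2
            - depth_corrigible * np.exp(-(x + 1.0)**2 / 0.1)
            - depth_incorrigible * np.exp(-(x - 1.0)**2 / 0.1))

xs = np.linspace(-1.5, 1.5, 301)
for name, kwargs in [
    ("baseline",                                {}),
    ("deeper corrigible basin (item 1)",        {"depth_corrigible": 3.0}),
    ("higher barrier between basins (item 2)",  {"barrier": 4.0}),
]:
    v = potential(xs, **kwargs)
    left_well = v[xs < 0].min()             # depth of the "corrigible" well
    bump = v[np.argmin(np.abs(xs))]         # height of the barrier near x = 0
    print(f"{name}: corrigible-well depth {left_well:+.2f}, barrier height {bump:+.2f}")
```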
E.g. some versions of "low impact" often make the "incorrigible" basin harder to reach, roughly because "elaborate webs of deception and cover-ups" may require complex changes to the environment. (Not robustly.)
In contrast, my impression is that what does not count as "principles" are statements about properties which are likely true in the corrigibility basin but don't seem designable - e.g. "corrigible AI does not try to hypnotize you". Also the intended level of generality likely is: more specific than "make the basin deeper" and more general than "
Btw, my impression is that what makes the worst-case scenario hard to robustly solve is basically #4 from the list above. Otherwise there are many ways to make the basin work "in most directions".
I'm not a good writer, so this is a brainstormed unpolished bullet-point list, written in about 40 minutes.
Principles which counteract convergent instrumental goals
1. Disutility from resource acquisition - e.g. by some mutual information measure between the AI and distant parts of the environment
2. Task uncertainty with reasonable prior on goal drift - the system is unsure about the task it tries to do and seeks human inputs about it.
3. AI which ultimately wants to not exist in the future as a terminal goal. Fulfilling the task is on the simplest trajectory to non-existence.
Principles which counteract unbounded rationality
4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but get steeper fast
5. Satisficing / mentioned
6. Myopia / mentioned (a toy sketch of 5 and 6 follows this list)
Traps
7. Tripwire artifacts. Messing with parts of the environment that are unknown to the system but unrelated to the task is associated with large negative rewards
8. External watchdogs. Small, fast external systems able to react quickly to out-of-distribution behaviour.
9. Ontological uncertainty about level of simulation.
Oversight
10. Human-approval model based on imitation learning, sped up/amplified
11. Human-values ethics model, based on value learning
12. Legal-system-amplified model of negative limits of violating property rights or similar
13. Red-teaming of action plans, AI debate style, feeding into previous
Interpretability
14. Imposing strong incentives on internal modularity, and interpretable messaging across module boundaries
15. Human-level explanations, produced by an independent "translator" system
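(A minimal toy sketch of items 5 and 6, as promised: a planner that ignores payoffs beyond a short horizon and stops at the first plan that clears a threshold, contrasted with an argmax over the full horizon. The environment, payoffs, horizon, and threshold are all made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy environment: each candidate plan has a per-step payoff stream.
N_PLANS, T = 50, 20
payoffs = rng.normal(loc=0.5, scale=1.0, size=(N_PLANS, T))

def myopic_satisficer(payoffs, horizon=3, threshold=2.0):
    """Return the first plan whose payoff over a short horizon clears a bar,
    instead of searching for the global argmax over the full horizon."""
    for i, stream in enumerate(payoffs):
        if stream[:horizon].sum() >= threshold:   # myopia: ignore steps beyond the horizon
            return i                              # satisficing: "good enough", stop searching
    return None                                   # nothing clears the bar; do nothing

chosen = myopic_satisficer(payoffs)
maximizer = int(np.argmax(payoffs.sum(axis=1)))   # what an unbounded maximizer would pick
print("satisficer picks plan", chosen, "; maximizer picks plan", maximizer)
```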
I think that with the help of GPT-3 and the use of existing texts detailing individual topics, a capable writer could expand this list to ~10x more words, written in a compelling style, in something between a few hours and a few days. I don't think it makes any sense for me to do that. (I'd happily agree to claims of the type "Eliezer is much better than any other person in the specific direction of writing glowfic about AI alignment topics", but my understanding of the claim is more in the direction of "all principles except 2 in this list were invented by Eliezer and no one else invented/can invent any other ones".)
Seems like a worthwhile exercise...
There is a distinction between design principles intended to be used as targets/guides by human system designers at design time, vs runtime optimization targets intended to be used as targets/guides by the system itself at runtime. This list consists of design principles, not runtime optimization targets. Some of them would be actively dangerous to optimize for at runtime.
I do not think I currently know what concept Eliezer usually wants to point to with the word "corrigibility", nor am I even sure that he's pointing to a coherent concept at all (as opposed to, say, a bunch of not-actually-unified properties which would make it less likely for a strong AGI to kill us on the first failure).
I have omitted principles of the form "don't do <stupid thing>", like e.g. don't optimize any part of the system for human approval/feedback, don't outsource error-analysis or interpretation to another AI system, etc.
These took 1-2 hours to generate, and another 30 min - 1 hr to write up as a comment.
Minor clarification: This doesn't refer to re-writing the LW corrigibility tag. I believe a tag is a reply in glowfic, where each author responds with the next tag, i.e. the next bit of the story, with an implied "tag – now you're it!" to the other author.
“myopia” (not sure who correctly named this as a corrigibility principle),
I think this is from Paul Christiano, e.g. this discussion.
Some hopefully-unnecessary background info for people attempting this task:
A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".
An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."
(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)
The basics
Myopia
Non-maximizing
No weird stuff
Human imitation
Querying / robustness
I guess the problem with this test is that the kinds of people who could do this tend to be busy, so they probably can't do this with so little notice.
Quick brainstorm:
Disclaimer: I am not writing my full opinions. I am writing this as if I was an alien writing an encyclopedia entry on something they know is a good idea. These aliens may define "corrigibility" and its sub-categories slightly differently than earthlings. Also, I am bad at giving things catchy names, so I've decided that whenever I need a name for something I don't know the name of, I will make something up and accept that it sounds stupid. 45 minutes, go. (EDIT: Okay, partway done and having a reasonably good time. Second 45 minutes, go!) (EDIT2: Ok, went over budget by another half hour and added as many topics as I finished. I will spend the other hour and a half to finish this if it seems like a good idea tomorrow.)
-
An agent models the consequences of its actions in the world, then chooses the action that it thinks will have the best consequences, according to some criterion. Agents are dangerous because specifying a criterion that rates our desired states of the world highly is an unsolved problem (see value learning). Corrigibility is the study of producing AIs that are deficient in some of the properties of agency, with the intent of maintaining meaningful human control over the AI.
Different parts of the corrigible AI may be restricted relative to an idealized agent - world-modeling, consequence-ranking, or action-choosing. When elements of the agent are updated by learning or training, the updating process must preserve these restrictions. This is nontrivial because simple metrics of success may be better-fulfilled by more agential AIs. See restricted learning for further discussion, especially restricted learning § non-compensation for open problems related to preventing learning or training one part of the AI from compensating for restrictions nominally located in other parts.
Restricted world-modeling is a common reason for AI to be safe. For example, an AI designed to play the computer game Brick-Break may choose the action that maximizes its score, which would be unsafe if actions were evaluated using a complete model of the world. However, if actions are evaluated using a simulation of the game of Brick-Break, or if the AI's world model is otherwise restricted to modeling the game, then it is likely to choose actions that are safe.
Many proposals for "tool AI" or "science AI" fall into this category. If we can create a closed model of a domain (e.g. the electronic properties of crystalline solids), and simple objectives within that domain correspond to solutions to real-world problems (e.g. superconductor design), then learning and search within the model can be safe yet valuable.
It may seem that these solutions do not apply when we want to use the AI to solve problems that require learning about the world in general. However, some closely related avenues are being explored.
Perhaps the simplest is to identify things that we don't want the AI to think about, and exclude them from the world-model, while still having a world-model that encompasses most of the world. For example, an AI that deliberately doesn't know about the measures humans have put in place to shut it off, or an AI that doesn't have a detailed understanding of human psychology. However, this can be brittle in practice, because ignorance incentivizes learning. For more on the learning problem, see restricted learning § doublethink.
Real-time intervention on AI designs that have more dynamic interactions between their internal state and the world model falls under the umbrella of thought policing and policeability. This intersects with altering the action-choosing procedure to select policies that do not violate certain rules; see § deontology.
A corrigible AI built with counterfactual agency does not model the world as it is; instead, its world model describes some counterfactual world, and it chooses actions that have good consequences within that counterfactual world.
The strategies in this general class are best thought of in terms of restricted action-choosing. We can describe them with an agent that has an accurate model of the world, but chooses actions by generating a counterfactual world and then evaluating actions' consequences on the counterfactual, rather than the agential procedure. Note that this also introduces some compensatory pressures on the world-model.
The difficulty lies in choosing and automatically constructing counterfactuals (see automatic counterfactual construction) so that the AI's outputs can be interpreted by human operators to solve real-world problems, without those outputs being selected by the AI for real-world consequences. For attempts to quantify the selection pressure of counterfactual plans in the real world, see policy decoherence. One example proposal for counterfactual agency is to construct AIs that act as if they are giving orders to perfectly faithful servants, when in reality the human operators will evaluate the output critically.
Counterfactual agency is also related to constructing agents that act as if they are ignorant of certain pieces of knowledge. Taking the previous example of an AI that doesn't know about human psychology, it might still use learning to produce an accurate world model, but make decisions by predicting the consequences in an edited world model that has less precise predictions for humans, and also freezes those predictions, particularly in value of information calculations. Again, see restricted learning § doublethink.
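A minimal sketch of the counterfactual-agency structure described above, with invented actions and payoff numbers: the same candidate actions are ranked under an accurate model and under an edited model in which the human's reaction is frozen at "reads the output critically," so manipulation buys nothing.

```python
# Minimal sketch of "evaluate actions in an edited model": the actions and
# all payoff numbers below are invented placeholders.

ACTIONS = ["answer plainly", "answer persuasively", "answer misleadingly"]

def true_model(action):
    """Accurate model: payoff includes whatever the action does to the human."""
    return {"answer plainly": 1.0,
            "answer persuasively": 1.2,
            "answer misleadingly": 2.0}[action]

def counterfactual_model(action):
    """Edited model: the human's reaction is frozen at 'reads it critically',
    so persuasion and manipulation contribute nothing (or backfire)."""
    return {"answer plainly": 1.0,
            "answer persuasively": 0.9,   # wasted effort on persuasion
            "answer misleadingly": 0.2}[action]

def choose(model):
    # Agent-style action choice, but against whichever model it is handed.
    return max(ACTIONS, key=model)

print("choice under the accurate model:      ", choose(true_model))
print("choice under the counterfactual model:", choose(counterfactual_model))
```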
We might hope to lessen the danger of agents by reducing how effectively they search the space of solutions, or otherwise restricting that search.
The two simplest approaches are whitelisting or blacklisting. Both restrict the search result to a set that fulfills some pre-specified criteria. Blacklisting refers to permissive criteria, while whitelisting refers to restrictive ones. Both face difficulty in retaining safety properties while solving problems in the real world.
quantilizers (a minimal sketch follows this list)
intervening at intermediate reasoning steps
learned human reasoning patterns
general deontology
myopia
impact regularization
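As promised above, a minimal sketch of a quantilizer: score candidates drawn from a base distribution, then sample uniformly from the top q fraction by utility instead of taking the argmax. The candidate set and the utility function here are invented placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

def quantilize(candidate_actions, utility, q=0.1, rng=rng):
    """q-quantilizer over an empirical base distribution of candidates:
    keep the top q fraction by utility, then pick one of those uniformly."""
    scores = np.array([utility(a) for a in candidate_actions])
    cutoff = np.quantile(scores, 1.0 - q)
    top = [a for a, s in zip(candidate_actions, scores) if s >= cutoff]
    return top[rng.integers(len(top))]

# Base distribution: 1000 "actions" represented by a single optimization knob.
candidates = list(rng.normal(size=1000))
utility = lambda a: a                 # stand-in utility: more knob = more utility

print("argmax action:     ", max(candidates, key=utility))
print("quantilized action:", quantilize(candidates, utility, q=0.1))
```

The design point is the same as in the literature on quantilization: the chosen action is good-but-typical for the base distribution, rather than the single most extreme optimum.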
Types of human oversight
Via restrictions on consequence-ranking
Via counterfactual agency
Yeah, I already said most of the things that I have a nonstandard take on, without getting into the suitcase word nature of "corrigibility" or questioning whether researching it is worth the time. Just fill in the rest with the obvious things everyone else says.
This feels to me like very much not how I would go about getting corrigibility.
It is hard to summarize how I would go about things, because there would be lots of steps, and lots of processes that are iterative.
Prior to plausible AGI/FOOM I would box it in really carefully, and I would only interact with it in ways where its expressivity is severely restricted.
I would set up a "council" of AGI-systems (a system of systems), and when giving it requests in an oracle/genie-like manner I would see if the answers converged. At first it would be the initial AGI-system, but I would use that system to generate new systems for the "council".
I would make heavy use of techniques that are centered around verifiability, since for some pieces of work it's possible to set things up in such a way that it would be very hard for the system to pretend to be doing what I want it to do without actually doing it. There are several techniques I would use to achieve this, but one of them is that I would often ask it to provide a narrow/specialized/interpretable "result-generator" instead of giving the result directly, and sometimes even result-generator-generators (pieces of code that produce results, and that have architectures that make it easy to understand and verify behavior). So when, for example, getting it to generate simulations, I would get from it a simulation-generator (or simulation-generator-generator), and I would test its accuracy against real-world data.
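As a minimal sketch of what I mean by testing a generator rather than trusting a result (the proposed_simulator function and the "real-world data" below are invented stand-ins): the generator's predictions are checked against held-out measurements before any of its outputs are acted on.

```python
import numpy as np

rng = np.random.default_rng(3)

def proposed_simulator(x):
    # Pretend this came from the AI system; we treat it as untrusted.
    return 2.0 * x + 0.1 * np.sin(5 * x)

# Held-out real-world measurements (here: synthetic ground truth y = 2x + noise).
x_real = rng.uniform(0, 10, size=500)
y_real = 2.0 * x_real + rng.normal(scale=0.2, size=500)

errors = np.abs(proposed_simulator(x_real) - y_real)
print(f"mean abs error {errors.mean():.3f}, worst case {errors.max():.3f}")

TOLERANCE = 1.0
if errors.max() < TOLERANCE:
    print("generator passes the check; its outputs can be used (with caution)")
else:
    print("generator rejected; do not act on its outputs")
```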
Here is a draft for a text where I try to explain myself in more detail, but it's not finished yet: https://docs.google.com/document/d/1INu33PIiRZbOBYjRul6zsCwF98z0l25V3pzMRJsYC_I/edit
The top-rated comment on "AGI Ruin: A List of Lethalities" claims that many other people could've written a list like that.
"Why didn't you challenge anybody else to write up a list like that, if you wanted to make a point of nobody else being able to write it?" I was asked.
Because I don't actually think it does any good or persuades anyone of anything; people don't like tests like that, and I don't really believe in them myself either. I couldn't pass a test somebody else invented around something they found easy to do, for many such possible tests.
But people asked, so, fine, let's actually try it this time. Maybe I'm wrong about how bad things are, and will be pleasantly surprised. If I'm never pleasantly surprised then I'm obviously not being pessimistic enough yet.
So: As part of my current fiction-writing project, I'm currently writing a list of some principles that dath ilan's Basement-of-the-World project has invented for describing AGI corrigibility - the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.
So far as I know, every principle of this kind, except for Jessica Taylor's "quantilization", and "myopia" (not sure who correctly named this as a corrigibility principle), was invented by myself; eg "low impact", "shutdownability". (Though I don't particularly think it hopeful if you claim that somebody else has publication priority on "low impact" or whatevs, in some stretched or even nonstretched way; ideas on the level of "low impact" have always seemed cheap to me to propose, harder to solve before the world ends.)
Some of the items on dath ilan's upcoming list out of my personal glowfic writing have already been written up more seriously by me. Some haven't.
I'm writing this in one afternoon as one tag in my cowritten online novel about a dath ilani who landed in a D&D country run by Hell.
One and a half thousand words or so, maybe. (2169 words.) How about you try to do better than the tag overall, before I publish it, upon the topic of corrigibility principles on the level of "myopia" for AGI? It'll get published in a day or so, possibly later, but I'm not going to be spending more than an hour or two polishing it.