"Coherent extrapolated volition" (CEV) is Eliezer Yudkowsky's proposed thing-to-do with an extremely advanced AGI, if you're extremely confident of your ability to align it on complicated targets.
Roughly, a CEV-based superintelligence would do what currently existing humans would want* the AI to do, if counterfactually:
...to whatever extent most existing humans, thus extrapolated, would predictably want* the same things. (For example, in the limit of extrapolation, nearly all humans might want* not to be turned into paperclips, but might not agree* on the best pizza toppings. See below.)
CEV is meant to be the literally optimal or ideal or normative thing to do with an autonomous superintelligence, if you trust your ability to perfectly align a superintelligence on a very complicated target. (See below.)
CEV is rather complicated and meta and hence not intended as something you'd do with the first AI you ever tried to build. CEV might be something that everyone inside a project agreed was an acceptable mutual target for their second AI. (The first AI should probably be a Task AGI.)
For the corresponding metaethical theory see Extrapolated volition (normative moral theory).
Extrapolated volition is the metaethical theory that when we ask "What is right?", then insofar as we're asking something meaningful, we're asking "What would a counterfactual idealized version of myself want* if it knew all the facts, had considered all the arguments, and had perfect self-knowledge and self-control?" (As a metaethical theory, this would make "What is right?" a mixed logical and empirical question, a function over possible states of the world.)
A very simple example of extrapolated volition might be to consider somebody who asks you to bring them orange juice from the refrigerator. You open the refrigerator and see no orange juice, but there's lemonade. You imagine that your friend would want you to bring them lemonade if they knew everything you knew about the refrigerator, so you bring them lemonade instead. On an abstract level, we can say that you "extrapolated" your friend's "volition", in other words, you took your model of their mind and decision process, or your model of their "volition", and you imagined a counterfactual version of their mind that had better information about the contents of your refrigerator, thereby "extrapolating" this volition.
Having better information isn't the only way that a decision process can be extrapolated; we can also, for example, imagine that a mind has more time in which to consider moral arguments, or better knowledge of itself. Maybe you currently want revenge on the Capulet family, but if somebody had a chance to sit down with you and have a long talk about how revenge affects civilizations in the long run, you could be talked out of that. Maybe you're currently convinced that you advocate for green shoes to be outlawed out of the goodness of your heart, but if you could actually see a printout of all of your own emotions at work, you'd see there was a lot of bitterness directed at people who wear green shoes, and this would change your mind about your decision.
In Yudkowsky's version of extrapolated volition considered on an individual level, the three core directions of extrapolation are:
Different people initially react differently to the question "Where should we point a superintelligence?" or "What should an aligned superintelligence do?" - not just different beliefs about what's good, but different frames of mind about how to ask the question.
Some common reactions:
An initial response to each of these frames might be:
CEV's advocates claim that all of these lines of discussion eventually end up converging on the idea of coherent extrapolated volition. For example:
(To do: Write dialogues from each of these four entrance points.)
See the corresponding section in "Extrapolated volition (normative moral theory)".
There are several reasons why CEV is way too challenging to be a good target for any project's first try at building machine intelligence:
Doing this correctly the very first time we build a smarter-than-human intelligence seems improbable. The only way this would make a good first target is if the CEV concept is formally simpler than it currently seems, and timelines to AGI are unusually long and permit a great deal of advance work on safety.
If AGI is 20 years out (or less), it seems wiser to think in terms of a Task AGI performing some relatively simple pivotal act. The role of CEV is of answering the question, "What can you all agree in advance that you'll try to do next, after you've executed your Task AGI and gotten out from under the shadow of immediate doom?"
A frequently asked question is "What if extrapolating human volitions produces incoherent answers?"
According to the original motivation for CEV, if this happens in some places, a Friendly AI ought to ignore those places. If it happens everywhere, you probably picked a silly way to construe an extrapolated volition and you ought to rethink it. [1]
That is:
The original motivation for CEV can also be viewed from the perspective of "What is it to help someone?" and "How can one help a large group of people?", where the intent behind the question is to build an AI that renders 'help' as we really intend that. The elements of CEV can be seen as caveats to the naive notion of "Help is giving people whatever they ask you for!" in which somebody asks you to bring them orange juice but the orange juice in the refrigerator is poisonous (and they're not trying to poison themselves).
What about helping a group of people? If two people ask for juice and you can only bring one kind of juice, you should bring a non-poisonous kind of juice they'd both like, to the extent any such juice exists. If no such juice exists, find a kind of juice that one of them is meh about and that the other one likes, and flip a coin or something to decide who wins. You are then being around as helpful as it is possible to be.
Can there be no way to help a large group of people? This seems implausible. You could at least give the starving ones pizza with a kind of pizza topping they currently like. To the extent your philosophy claims "Oh noes even that is not helping because it's not perfectly coherent," you have picked the wrong construal of 'helping'.
It could be that, if we find that every reasonable-sounding construal of extrapolated volition fails to cohere, we must arrive at some entirely other notion of 'helping'. But then this new form of helping also shouldn't involve bringing people poisonous orange juice that they don't know is poisoned, because that still intuitively seems unhelpful.
What if somebody believes themselves to prefer onions to pineapple on their pizza, prefer pineapple to mushrooms, and prefer mushrooms to onions? In the sense that, offered any two slices from this set, they would pick according to the given ordering?
(This isn't an unrealistic example. Numerous experiments in behavioral economics demonstrate exactly this sort of circular preference. For instance, you can arrange 3 items such that each pair of them brings a different salient quality into focus for comparison.)
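The circularity in this example can be checked mechanically. The following is a minimal sketch, not anything from the CEV proposal itself: the helper `consistent_rankings` is a hypothetical name, and it simply tries every strict ranking of the three toppings to see whether any of them agrees with all the pairwise choices.

```python
from itertools import permutations

# Pairwise choices from the pizza example: (preferred, rejected).
pairwise_choices = [
    ("onions", "pineapple"),
    ("pineapple", "mushrooms"),
    ("mushrooms", "onions"),
]

def consistent_rankings(choices):
    """Return every strict ranking (best first) that agrees with all pairwise choices."""
    items = sorted({item for pair in choices for item in pair})
    rankings = []
    for order in permutations(items):
        position = {item: i for i, item in enumerate(order)}
        # A ranking agrees with a choice if the preferred item comes earlier.
        if all(position[a] < position[b] for a, b in choices):
            rankings.append(order)
    return rankings

cyclic = consistent_rankings(pairwise_choices)
acyclic = consistent_rankings(pairwise_choices[:2])
print(cyclic)   # no ranking fits the full cycle
print(acyclic)  # dropping one choice leaves exactly one ranking
```

Since no strict ranking fits, no utility function can assign a strictly higher number to every chosen slice. Dropping any one of the three choices restores a consistent ordering.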
One may worry that we couldn't 'coherently extrapolate the volition' of somebody with these pizza preferences, since these local choices obviously aren't consistent with any coherent utility function. But how could we help somebody with a pizza preference like this?
Well, appealing to the intuitive notion of helping:
Conversely, these alternatives seem less helpful:
Advocates of CEV claim that if you blank the complexities of 'extrapolated volition' out of your mind; and ask how you could reasonably help people as best as possible if you were trying not to be a jerk; and then try to figure out how to semiformalize whatever mental procedure you just followed to arrive at your answer for how to help people; then you will eventually end up at CEV again.
A primary purpose of CEV is to represent a relatively simple meta-level ideal that people can agree upon, even where they might disagree on the object level. By a hopefully analogous example, two honest scientists might disagree on the correct mass of an electron, but agree that the experimental method is a good way to resolve the answer.
Imagine Millikan believes an electron's mass is 9.1e-28 grams, and Nannikan believes the correct electron mass is 9.1e-34 grams. Millikan might be very worried about Nannikan's proposal to program an AI to believe the electron mass is 9.1e-34 grams; Nannikan doesn't like Millikan's proposal to program in 9.1e-28; and both of them would be unhappy with a compromise mass of 9.1e-31 grams. They might still agree on programming an AI with some analogue of probability theory and a simplicity prior, and letting a superintelligence come to the conclusions implied by Bayes and Occam, because the two can agree on an effectively computable question even though they think the question has different answers. Of course, this is easier to agree on when the AI hasn't yet produced an answer, or if the AI doesn't tell you the answer.
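As a toy illustration of agreeing on a procedure rather than an answer, the sketch below applies Bayes' rule to just the two candidate masses. It is a deliberate simplification: the measurements, the noise level `sigma`, and the function name `posterior_millikan` are all invented for illustration, and a real system would weigh many hypotheses under a simplicity prior rather than two.

```python
import math

M_MILLIKAN = 9.1e-28   # grams, Millikan's belief (from the text)
M_NANNIKAN = 9.1e-34   # grams, Nannikan's belief (from the text)

# The "compromise" mass in the text is the geometric mean of the two beliefs.
compromise = math.sqrt(M_MILLIKAN * M_NANNIKAN)

def posterior_millikan(prior_millikan, measurements, sigma):
    """Posterior probability of Millikan's value, assuming Gaussian measurement noise."""
    def log_likelihood(mass):
        return sum(-0.5 * ((x - mass) / sigma) ** 2 for x in measurements)
    log_m = math.log(prior_millikan) + log_likelihood(M_MILLIKAN)
    log_n = math.log(1.0 - prior_millikan) + log_likelihood(M_NANNIKAN)
    top = max(log_m, log_n)  # subtract the max for numerical stability
    weight_m = math.exp(log_m - top)
    weight_n = math.exp(log_n - top)
    return weight_m / (weight_m + weight_n)

# Hypothetical data and noise level, invented purely for illustration.
measurements = [9.0e-28, 9.2e-28, 9.1e-28]
sigma = 1.0e-28

p_from_millikan_prior = posterior_millikan(0.99, measurements, sigma)
p_from_nannikan_prior = posterior_millikan(0.01, measurements, sigma)
print(compromise, p_from_millikan_prior, p_from_nannikan_prior)
```

With these made-up measurements, both starting points end up at nearly the same posterior. Had the data clustered near the other value, both would have converged there instead. The agreed-upon procedure, not either party's prior, settles the answer.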
It's not guaranteed that every human embodies the same implicit moral questions; indeed, this seems unlikely, which means that Alice and Bob might still expect their extrapolated volitions to disagree about things. Even so, while the outputs are still abstract and not-yet-computed, Alice doesn't have much of a place to stand on which to appeal to Carol, Dennis, and Evelyn by saying, "But as a matter of morality and justice, you should have the AI implement my extrapolated volition, not Bob's!" To appeal to Carol, Dennis, and Evelyn about this, Alice would need them to believe that her EV was more likely to agree with their EVs than Bob's was - and at that point, why not come together on the obvious Schelling point of extrapolating everyone's EVs?
Thus, one of the primary purposes of CEV (selling points, design goals) is that it's something that Alice, Bob, and Carol can agree now that Dennis and Evelyn should do with an AI that will be developed later; we can try to set up commitment mechanisms now, or check-and-balance mechanisms now, to ensure that Dennis and Evelyn are still working on CEV later.
A CEV is not necessarily a majority vote. A lot of people with an extrapolated weak preference* might be counterbalanced by a few people with a strong extrapolated preference* in the opposite direction. Nick Bostrom's "parliamentary model" for resolving uncertainty between incommensurable ethical theories permits a subtheory very concerned about a decision to spend a large amount of its limited influence on influencing that particular decision.
This means that, e.g., a vegan or animal-rights activist should not need to expect that they must seize control of a CEV algorithm in order for the result of CEV to protect animals. It doesn't seem like most of humanity would be deriving huge amounts of utility from hurting animals in a post-superintelligence scenario, so even a small part of the population that strongly opposes* this scenario should be decisive in preventing it.
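To make the contrast with majority voting concrete, here is a toy sketch with invented numbers. Summing preference strengths is neither the CEV procedure nor Bostrom's parliamentary model. It is only the simplest aggregation rule that, unlike a head count, lets a strongly held minority preference outweigh a weakly held majority one.

```python
# Hypothetical extrapolated preference strengths on one yes/no decision.
# Positive = in favor, negative = opposed; all numbers are invented.
population = [0.1] * 90 + [-5.0] * 10

def majority_vote(strengths):
    """One person, one vote: ignores how much each person cares."""
    in_favor = sum(1 for s in strengths if s > 0)
    opposed = sum(1 for s in strengths if s < 0)
    return "adopt" if in_favor > opposed else "reject"

def intensity_weighted(strengths):
    """Sum preference strengths, so strong preferences count for more."""
    return "adopt" if sum(strengths) > 0 else "reject"

by_headcount = majority_vote(population)
by_intensity = intensity_weighted(population)
print(by_headcount)   # adopt
print(by_intensity)   # reject
```

Here 90 people mildly favor the decision and 10 strongly oppose it. The head count adopts it, while the intensity-weighted tally rejects it.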
(ADDED 2023: Thomas Cederborg correctly observes that Nick Bostrom's original parliamentary proposal involves a negotiation baseline where each agent has a random chance of becoming dictator, and that this random-dictator baseline gives an outsized and potentially fatal amount of power to spoilers - agents that genuinely and not as a negotiating tactic prefer to invert other agents' utility functions, or prefer to do things that otherwise happen to minimize those utility functions - if most participants have utility functions with what I've termed "negative skew"; i.e. an opposed agent can use the same amount of resource to generate -100 utilons as an aligned agent can use to generate at most +1 utilon. If trolls are 1% of the population, they can demand all resources be used their way as a concession in exchange for not doing harm, relative to the negotiating baseline in which there's a 1% chance of a troll being randomly appointed dictator. Or to put it more simply, if 1% of the population would prefer to create Hell for all but themselves (as a genuine preference rather than a strategic negotiating baseline) and Hell is 100 times as bad as Heaven is good, compared to nothing, they can steal the entire future if you run a parliamentary procedure starting from a random-dictator baseline. I agree with Cederborg that this constitutes more-than-sufficient reason not to start from random-dictator as a negotiation baseline; anyone not smart enough to reliably see this point and scream about it, potentially including Eliezer, is not reliably smart enough to implement CEV; but CEV probably wasn't a good move for baseline humans to try implementing anyways, even given long times to deliberate at baseline human intelligence. --EY)
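The arithmetic behind the random-dictator objection can be checked directly. This is a sketch under a simplifying assumption that is not in the note itself: every non-troll dictator delivers Heaven (+1) to a typical non-troll, and every troll dictator delivers Hell (-100). The function name is hypothetical.

```python
def random_dictator_baseline(troll_fraction, heaven_value, hell_value):
    """Expected utility for a non-troll when a dictator is drawn at random.

    Simplifying assumption: a non-troll dictator delivers heaven_value to
    every non-troll, and a troll dictator delivers hell_value.
    """
    return (1.0 - troll_fraction) * heaven_value + troll_fraction * hell_value

baseline = random_dictator_baseline(
    troll_fraction=0.01, heaven_value=1.0, hell_value=-100.0
)
nothing = 0.0  # what the trolls can offer: no harm done, but no good either

print(baseline)            # slightly below zero
print(nothing > baseline)  # True: "nothing" already beats the baseline
```

Because the baseline is already worse than nothing for the other 99%, any deal that merely avoids Hell looks like an improvement relative to it. That is what gives the 1% their leverage.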
One of the points of the CEV proposal is to have minimal moral hazard (aka, not tempting the programmers to take over the world or the future); but this may be compromised if CEV's results don't go literally unchecked.
Part of the purpose of CEV is to stand as an answer to the question, "If the ancient Greeks had been the ones to invent superintelligence, what could they have done that would not, from our later perspective, irretrievably warp the future? If the ancient Greeks had programmed in their own values directly, they would have programmed in a glorious death in combat. Now let us consider that perhaps we too are not so wise." We can imagine the ancient Greeks writing a CEV mechanism, peeking at the result of this CEV mechanism before implementing it, and being horrified by the lack of glorious-deaths-in-combat in the future and value system thus revealed.
We can also imagine that the Greeks, trying to cut down on moral hazard, virtuously refuse to peek at the output; but it turns out that their attempt to implement CEV has some unforeseen behavior when actually run by a superintelligence, and so their world is turned into paperclips.
This is a safety-vs.-moral-hazard tradeoff between (a) the benefit of being able to look at CEV outputs in order to better-train the system or just verify that nothing went horribly wrong; and (b) the moral hazard that comes from the temptation to override the output, thus defeating the point of having a CEV mechanism in the first place.
There's also a potential safety hazard just with looking at the internals of a CEV algorithm; the simulated future could contain all sorts of directly mind-hacking cognitive hazards.
Rather than giving up entirely and embracing maximum moral hazard, one possible approach to this issue might be to have some single human that is supposed to peek at the output and provide a 1 or 0 (proceed or stop) judgment to the mechanism, without any other information flow being allowed to the programmers if the human outputs 0. (For example, the volunteer might be in a room with explosives that go off if 0 is output.)
Suppose that Fred is funding Grace to work on a CEV-based superintelligence; and Evelyn has decided not to oppose this project. The resulting CEV is meant to extrapolate the volitions of Alice, Bob, Carol, Dennis, Evelyn, Fred, and Grace with equal weight. (If you're reading this, you're more than usually likely to be one of Evelyn, Fred, or Grace.)
Evelyn and Fred and Grace might worry: "What if a supermajority of humanity consists of 'selfish* bastards', such that their extrapolated volitions would cheerfully vote* for a world in which it was legal to own artificial sapient beings as slaves so long as they personally happened to be in the slaveowning class; and we, Evelyn and Fred and Grace, just happen to be in the minority that extremely doesn't want nor want* the future to be like that?"
That is: What if humanity's extrapolated volitions diverge in such a way that from the standpoint of our volitions - since, if you're reading this, you're unusually likely to be one of Evelyn or Fred or Grace - 90% of extrapolated humanity would choose* something such that we would not approve of it, and our volitions would not approve* of it, even after taking into account that we don't want to be jerks about it and that we don't think we were born with any unusual or exceptional right to determine the fate of humanity?
That is, let the scenario be as follows:
90% of the people (but not we who are collectively sponsoring the AI) are selfish bastards at the core, such that any reasonable extrapolation process (it's not just that we picked a broken one) would lead to them endorsing a world in which they themselves had rights, but it was okay to create artificial people and hurt them. Furthermore, they would derive enough utility from being personal God-Emperors that this would override our minority objection even in a parliamentary model.
We can see this hypothetical outcome as potentially undermining every sort of reason that we, who happen to be in a position of control to prevent that outcome, should voluntarily relinquish that control to the remaining 90% of humanity:
Rather than giving up entirely and taking over the world, or exposing ourselves to moral hazard by peeking at the results, one possible approach to this issue might be to run a three-stage process.
This process involves some internal references, so the detailed explanation needs to follow a shorter summary explanation.
In summary:
In detail:
The particular fallback of "kick out from the extrapolation any weighted portions of extrapolated decision processes that would act unilaterally and without caring for others, given unchecked power" is meant to have a property of poetic justice, or rendering objections to it self-defeating: If it's okay to act unilaterally, then why can't we unilaterally kick out the unilateral parts? This is meant to be the 'simplest' or most 'elegant' way of kicking out a part of the CEV whose internal reasoning directly opposes the whole reason we ran CEV in the first place, but imposing the minimum possible filter beyond that.
Thus if Alice (who by hypothesis is not in any way a contributor) says, "But I demand you altruistically include the extrapolation of me that would unilaterally act against you if it had power!" then we reply, "We'll try that, but if it turns out to be a sufficiently bad idea, there's no coherent interpersonal grounds on which you can rebuke us for taking the fallback option instead."
Similarly in regards to the Fail option at the end, to anyone who says, "Fairness demands that you run Fallback CEV even if you wouldn't like* it!" we can reply, "Our own power may not be used against us; if we'd regret ever having built the thing, fairness doesn't oblige us to run it."
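The control flow of this composite process, as far as the surrounding text describes it, can be sketched as follows. All names are hypothetical stand-ins. In particular, `contributors_approve` represents a check by the contributors' extrapolated volitions inside the mechanism, not programmers peeking at the output, and the two extrapolation functions are placeholders for procedures nobody currently knows how to write.

```python
from typing import Callable, Optional

Outcome = str  # stand-in for whatever an extrapolation would output

def composite_cev(
    everyone_cev: Callable[[], Outcome],
    fallback_cev: Callable[[], Outcome],
    contributors_approve: Callable[[Outcome], bool],
) -> Optional[Outcome]:
    """Try Everyone-CEV, then Fallback-CEV, each gated by a contributor check."""
    for extrapolate in (everyone_cev, fallback_cev):
        candidate = extrapolate()
        if contributors_approve(candidate):
            return candidate
    # "Fail": run nothing rather than something the contributors would regret.
    return None

# Toy usage: the contributors' check rejects the first result and accepts the fallback.
result = composite_cev(
    everyone_cev=lambda: "world A",
    fallback_cev=lambda: "world B",
    contributors_approve=lambda outcome: outcome != "world A",
)
print(result)  # world B
```

The order matters to the argument: the filtered fallback is consulted only after the unfiltered extrapolation has been tried and rejected, and failing outright remains available as the last step.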
One frequently asked question about the implementation details of CEV is either:
In particular, it's been asked why restrictive answers to Question 1 don't also imply the more restrictive answer to Question 2.
We'll start by considering some replies to the question, "Why not include all mammals into CEV's extrapolation base?"
To expand on this last consideration, we can reply: "Even if you would regard it as more just to have the right animal-protecting outcome baked into the future immediately, so that your EV didn't need to expend some of its voting strength on assuring it, not everyone else might regard that as just. From our perspective as programmers we have no particular reason to listen to you rather than Alice. We're not arguing about whether animals will be protected if a minority vegan-type subpopulation strongly want* that and the rest of humanity doesn't care*. We're arguing about whether, if you want* that but a majority doesn't, your EV should justly need to expend some negotiating strength in order to make sure animals are protected. This seems pretty reasonable to us as programmers from our standpoint of wanting to be fair, not be jerks, and not start any slap-fights over world domination."
This third reply is particularly important because taken in isolation, the first two replies of "You could be wrong about that being a good idea" and "Even if you care about their welfare, maybe you wouldn't like their EVs" could equally apply to argue that contributors to the CEV project ought to extrapolate only their own volitions and not the rest of humanity:
The proposed way of addressing this was to run a composite CEV with a Contributor-CEV check and a Fallback-CEV fallback. But then why not run an Animal-CEV with a Contributor-CEV check before trying the Everyone-CEV?
One answer would go back to the third reply above: Nonhuman mammals aren't sponsoring the CEV project, allowing it to pass, or potentially getting angry at people who want to take over the world with no seeming concern for fairness. So they aren't part of the Schelling Point for "everyone gets an extrapolated vote".
Similarly if we ask: "Why not include all sapient beings that the SI suspects to exist everywhere in the measure-weighted multiverse?"
"Why not include all deceased human beings as well as all currently living humans?"
In this case, we can't then reply that they didn't contribute to the human project (e.g. I. J. Good). Their EVs are also less likely to be alien than in any other case considered above.
But again, we fall back on the third reply: "The people who are still alive" is a simple Schelling circle to draw that includes everyone in the current political process. To the extent it would be nice or fair to extrapolate Leo Szilard and include him, we can do that if a supermajority of EVs decide* that this would be nice or just. To the extent we don't bake this decision into the model, Leo Szilard won't rise from the grave and rebuke us. This seems like reason enough to regard "The people who are still alive" as a simple and obvious extrapolation base.
"Why include very young children, uncontacted tribes who've never heard about AI, and retrievable cryonics patients (if any)? They can't, in their current state, vote for or against anything."
Albeit in practice, you would not want an AI project to take a dozen tries at defining CEV. This would indicate something extremely wrong about the method being used to generate suggested answers. Whatever final attempt passed would probably be the first answer all of whose remaining flaws were hidden, rather than an answer with all flaws eliminated.