This work was supported by the Monastic Academy for the Preservation of Life on Earth. You can support my work here.

I will give a short presentation of this work followed by discussion on Wednesday Dec 28 at 12pm Pacific / 3pm Eastern. RSVP here.

Outline

  • I have four questions about coherent extrapolated volition, which I present in the form of four short thought experiments:

    1. What kind of a thing can be extrapolated in the direction of wisdom? (Robot vacuum thought experiment)

    2. What kind of protocol connects with the wisdom of a person who has been extrapolated? (Dream research thought experiment)

    3. What kind of model captures that within a person that we hope to amplify through extrapolation? (Twitter imitator thought experiment)

    4. What kind of environment is sufficient to grow true wisdom? (Astrological signs thought experiment)

  • The title of this post is based on the second thought experiment.

  • I claim that we lack a theory about that-which-is-capable-of-becoming-wise, in a form that lets us say something about its relationship to models, extrapolation, and volitional dynamics. I argue that CEV does not actually provide this central theory.

Introduction

Coherent extrapolated volition is Eliezer’s 2004 proposal for the goal we might give to a powerful AI. The basic idea is to have the AI work out what we would do or say if we were wiser versions of our present selves, and have the AI predicate its actions on that. To do this, the AI might work out what would happen if a person contemplated an issue for a long time, or was exposed to more conversations with excellent conversation partners, or spent a long time exploring the world, or just lived a long and varied life. It might be possible for us to describe the transformations that lead to wisdom even if we don’t know a priori what those transformations will lead to.

CEV does not spell out exactly what those transformations are — though it does make suggestions — nor how exactly the AI’s actions would be connected to the results of such transformations. The main philosophical point that CEV makes is that the thing to have an AI attend to, if you’re trying to do something good with AI, is wisdom, and that wisdom arises from a process of maturation. At present we might be confused about both the nature of the world and about our own terminal values. If an AI asks us "how should honesty be traded off against courage?" we might give a muddled answer. Yet we do have a take on honesty and courage. Wiser versions of ourselves might be less confused about this, and yet still be us.

An example: suppose you ask a person to select a governance structure for a new startup. If you ask the person to make a decision immediately, you might get a mediocre answer. If you give them a few minutes to contemplate then you might get a better answer. This "taking a few minutes to contemplate" is a kind of transformation of mind. Beginning from the state where the person was just asked the question, their mind changes in certain ways over the course of those few minutes and the response given after that transform is different to the response before it.

Perhaps there are situations where the "taking a few minutes to contemplate" transform decreases the quality of the eventual answer, as in the phenomenon of "analysis paralysis" — CEV does not claim that this particular transform is the wisdom-inducing transform. CEV does claim that there exists some transformation of mind that leads to wisdom. It need not be that these transformations are about minds contemplating things in isolation. Perhaps there are certain insights that you can only come to through conversations with friends. If so, perhaps the AI can work out what would become of a person if they spent a lot of time in conversation with an excellent group of friends. Here the transformation is "spend time in conversation with friends". The hypothesis is that this transformation of mind leads to wisdom.

CEV suggests that we leave it to the AI to work out how we would be changed by whatever wisdom-inducing transformations we decide upon. That means that there need not be a physical human experiencing a sequence of real conversations with friends; instead, the AI would work out how the human would be transformed if they did have such conversations, and what the transformed person would do or say. Working this out may or may not involve simulation — perhaps the AI will find other ways to reason about the results of these wisdom-inducing transformations.

The CEV Hypothesis. Very roughly, the CEV claim is (1) that there exists a transformation that causes a person’s words or behavior to more clearly express wisdom, (2) that an AI can work out what would become of a person after undergoing such a transformation, and (3) that there is a way to tie an AI’s actions to the words or actions of a thus-extrapolated person such that the AI’s behavior is truly good, for some definition of "truly good" that may only be known to our future selves.

In this essay I will give four thought experiments that probe these hypotheses. The title of this post is based on the second of these thought experiments.

Robot vacuum

Consider a robot that builds a map of a house, locates itself within that map, and then makes and executes plans to vacuum the floor. The implementation of this robot vacuum is unusually clear: it has a clear goal function, a clear planning facility, a clear world model, and a clear method for updating the world model as each new sensor measurement arrives. This should be the ideal case for CEV. So what would it mean to extrapolate this robot vacuum in the direction of wisdom?
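
To make that structure concrete, here is a minimal sketch of the kind of architecture just described: an explicit world model updated from sensor measurements, an explicit goal function, and a planner that searches with respect to that goal. This is my own illustration in Python, with invented names; it is not drawn from the CEV literature.

```python
from dataclasses import dataclass, field

@dataclass
class RobotVacuum:
    """Toy sketch of a robot vacuum with an explicit world model, goal, and planner."""
    floorplan: set = field(default_factory=set)   # world model: cells believed to exist
    dirty: set = field(default_factory=set)       # world model: cells believed to be dirty
    position: tuple = (0, 0)

    def update_world_model(self, cell, is_dirty):
        """Fold a new sensor measurement into the world model."""
        self.floorplan.add(cell)
        if is_dirty:
            self.dirty.add(cell)
        else:
            self.dirty.discard(cell)

    def goal(self):
        """Goal function: fewer cells believed dirty is better."""
        return -len(self.dirty)

    def plan(self):
        """Planner: head for the nearest cell believed to be dirty (greedy search)."""
        if not self.dirty:
            return None
        x, y = self.position
        return min(self.dirty, key=lambda c: abs(c[0] - x) + abs(c[1] - y))
```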

From our perspective outside the system, we might say that the robot vacuum’s CEV is a world tiled with clean floors. But CEV has a specific structural form – let’s work through it step by step. The first step is that we take a model of the thing that we are trying to compute the CEV of and extrapolate it in the direction of wisdom. What could this mean in the case of the robot vacuum?

Well, it means that we transform the robot vacuum in ways corresponding to the kind of growth that we believe would lead to wisdom if the robot vacuum went through them in the real world. What could this mean for a robot vacuum? It’s hard to say!

Suppose we apply the transformation of greatly expanding the robot vacuum’s model of the world, for example by giving it an accurate floorplan of the entire surface of the Earth. If its built-in planning algorithm isn’t efficient enough to deal with such a huge floorplan, we might replace it with a more efficient algorithm that still searches with respect to the same goal. A robot vacuum extrapolated in this way — which is just one of many possible extrapolations we might decide upon — might be observed to vacuum the whole surface of the Earth.
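
Continuing the toy sketch from the previous section, this "extrapolation" looks strikingly shallow when written down: the world model is hugely expanded and the planner may be swapped for a faster one, but the goal function is untouched. Again, this is only my own illustration under assumed names.

```python
def extrapolate(robot: RobotVacuum, earth_floorplan: set) -> RobotVacuum:
    """One candidate 'extrapolation' of the toy robot vacuum above."""
    robot.floorplan |= earth_floorplan   # world model now covers (say) the whole Earth
    robot.dirty |= earth_floorplan       # unvisited floor is assumed dirty
    # If robot.plan() cannot cope with a map this large, swap in a more efficient
    # search routine -- one that still optimizes the same goal() as before.
    return robot
```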

So in the end, what we get is simply that the CEV of the robot vacuum is to vacuum whatever floors are in its floorplan. Fine, that’s no surprise. But is that really what we mean by "extrapolation in the direction of wisdom"?

If you take a person whose whole life has been focussed on conforming to the social conventions of the place they grew up, you might hope that "moving in the direction of wisdom" involves seeing a bigger picture, gaining an appreciation for the instrumental but not ultimate importance of social conventions, finding courage, locating themselves in history, understanding more about the significance of life, and so on. It should not be that "moving in the direction of wisdom" simply means equipping the person with more knowledge and tools with which to conform even more tightly to social conventions.

In the case of the robot vacuum, it seems that no matter how long it contemplates things, or how long it spends wandering the world, it never really gains a broader perspective on the world, nor realizes anything new about its own values. You might say that these things are themselves parochial human notions of wisdom that don’t apply to robot vacuums, and that’s fine, but then we don’t really need to "extrapolate" the robot vacuum at all; we can just read out its goal structure from its initial implementation.

It seems to me that humans can become wiser — not just more effective at accomplishing the tasks we have focussed on in our lives so far, but actually more aware of the world we live in, our place in it, and the consequences of our actions, in a way that reshapes the goals we’ve lived by rather than merely serving them. I think CEV is helpful in focusing us on this direction as the key to understanding goodness, but I think it is very unclear what kind of thing can move in this direction. Can a robot vacuum move in a direction that expands and reshapes its goals according to a clearer appreciation of its place in the world? If not, what is it about humans that can do so?

Dream research

Consider an AI that extrapolates a person in the direction of wisdom, then asks what that person would dream about, and then chooses its actions as a function of these predicted dreams. Is there any way this could go well? It seems unlikely: we have no reason to expect a wise person’s dreams to be a good basis for selecting actions that lead to a flourishing world. Of course it depends on exactly what function connects the dreams to the AI’s actions, and perhaps there is some function that cleverly extracts just the right thing from the dreams such that we end up with a perfectly aligned AI, but a priori this approach seems unpromising because we’re looking in the wrong place for a thing to guide the AI’s actions by.

Consider now the difference between dreams and desires. In CEV, we ask what an extrapolated person would want, not what they would dream about. There are many important differences between wanting and dreaming. But which of these differences, exactly, make it reasonable to predicate a powerful AI’s actions on an extrapolated person’s desires (and not their dreams)?

Dreams and desires are different kinds of psychological phenomena, but they are both psychological phenomena. What is a good principle for deciding which psychological phenomena are a reasonable basis for directing the actions of powerful AIs? I am not claiming that this question is a knock-down objection to CEV, but I am claiming that the writings on CEV so far fail to answer this question, and in this way leave open perhaps the most central issue concerning where an AI might be directed to look for wisdom.

If there is a way to extrapolate a person such that their desires form a good basis for a superintelligent AI’s goal system, then why wouldn’t there also be a way to extrapolate a person such that their dreams form a good basis for the same? But could a person really be extrapolated in such a way that their dreams formed a good basis for a superintelligent AI’s goal system? Maybe you will say that we can extrapolate a person in the direction of wisdom, then modify them such that they always dream about what they would have previously expressed as desires, but now you’re back to presupposing that desires are the right thing to look at.

I believe that if we have an AI tap into some particular psychological phenomenon of an extrapolated person, then we should have a theory about why that choice makes sense. In short: what features of an extrapolated person are you looking at, and why is that reasonable? We should not proceed on the basis of an intuition that desires are the right thing to look at without being able to say why that intuition is reasonable.

We can’t just leave that to some prior AGI to work out because that prior AGI has to be programmed with a protocol for making sense of what we’re asking it to do, and the thing we’re talking about here is the protocol for making sense of what we’re asking it to do (further discussion in section "the wisdom backflip" below).

In the original CEV paper Eliezer describes the thing I’m pointing to as follows:

To construe your volition, I need to define a dynamic for extrapolating your volition, given knowledge about you. In the case of an FAI, this knowledge might include a complete readout of your brain-state, or an approximate model of your mind-state. The FAI takes the knowledge of Fred’s brainstate, and other knowledge possessed by the FAI (such as which box contains the diamond), does . . . something complicated . . . and out pops a construal of Fred’s volition.

[...]

I shall refer to the "something complicated" as the dynamic.

What kind of protocol connects with the thing-that-is-a-guide-to-goodness within a person who has been extrapolated in the direction of wisdom? Presumably looking at dreams does not connect with this. Presumably looking at the way the person taps their feet while listening to music does not connect with this. What is the thing that we are trying to connect with, and what kind of protocol connects with that thing?

Twitter imitator

Consider a bigram model trained to imitate my tweets up to today. A bigram model is a big table that lists, for each word in the English language, the probability that it will be followed by each other word in the English language. A bigram model is trained by counting the number of times that each word is followed by each other word in some corpus — in this case the corpus is my tweets up to today.
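
To make the setup concrete, here is a minimal sketch of such a bigram model, trained by counting adjacent word pairs and normalizing the counts into conditional probabilities. This is my own illustration; the tiny stand-in "corpus" is obviously hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each word is followed by each other word, then normalize."""
    counts = defaultdict(Counter)
    for tweet in corpus:
        words = tweet.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    # Convert counts into conditional probabilities P(next word | previous word).
    return {prev: {w: n / sum(followers.values()) for w, n in followers.items()}
            for prev, followers in counts.items()}

# Hypothetical usage with a tiny stand-in corpus:
model = train_bigram(["the vacuum cleans the floor", "the floor is clean"])
print(model["the"])   # {'vacuum': 0.333..., 'floor': 0.666...}
```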

Does this bigram model contain enough of "me" to extrapolate in the direction of wisdom and have that go well for the future of life on Earth? Presumably not. I haven’t written very many tweets, and I don’t think a bigram model would learn anything particularly deep from them. Even if I had written a vast number of tweets, a bigram model may not be rich enough to capture anything much of substance. No matter what kind of extrapolation technique you use — even an advanced one from the future — I don’t think you would get very far if the thing you’re applying it to is a bigram model trained on my tweets. Extrapolating such a model would be a bit like extrapolating a robot vacuum — you’d miss most of the point of extrapolation in the direction of wisdom because the bigram model does not contain the thing that is capable of moving in the direction of wisdom.

But what is the threshold at which we say that a model has captured that-which-is-worth-extrapolating? Suppose we trained a 2022 language model on all of Elon Musk’s tweets. This is a richer dataset and a richer type of model, but is it enough to capture the thing that we hope is capable of maturing into wisdom through extrapolation?

We don’t know exactly what it would mean to extrapolate a language model in the direction of wisdom. We are imagining that people of the future have come up with some-or-other approach. But presumably this approach requires a certain kind of source material to be present in the source model that is to be extrapolated. If you had an extrapolation technique that could bring out true wisdom from arbitrarily simple models of people, then you could just provide it with a one-pixel image of a human face, or a single human answer to a single yes-or-no question; and if it worked even in that case, then what you would really have is a solution to the whole alignment problem, with no need for CEV.

So extrapolation techniques from the future will presumably require source models that contain some minimum level of detail about the person that is to be extrapolated. How do we know whether any particular source model contains this minimum level of detail?

Now you might say that a large audio and video corpus of a person going about their life is actually an unimaginably rich data source, and certainly contains evidence about anything that can be asked about this person. That may be true, but in CEV we have to choose not just a dataset but also a model learned from that dataset, because we are going to extrapolate the person by having the model go through the kind of transformations that we believe would have engendered wisdom in the original person if they went through those transformations in real life. It is therefore not so simple to say whether a particular model has captured that which would develop into wisdom if extrapolated, even if the dataset itself certainly contains a great deal of information about the person.

You may say that we should apply extrapolation directly to the dataset, without first building a model. That’s fine, but now the question raised in this section has to be asked of the extrapolation technique. That question is: how do we know whether any particular model-and-extrapolate process actually picked up on the thing within the source data that is capable of growing into wisdom? A model-and-extrapolate process that first builds a bigram model presumably throws away all that is capable of growing into wisdom, and will therefore fail. We do not know whether a 2022 language model really does capture that which is capable of growing into wisdom. I believe it would be dangerous to proceed in building CEV-like systems without knowing this.

One of the central ideas in CEV is that we should not "peek" at the result before deciding whether to deploy, because the wisdom of our future selves may be very alien to us, and if we gate deployment on the intuition of our present selves then we may wind up deploying something that more-or-less negates all the wisdom of our future selves in favor of the intuition of our present selves. This makes issues like the one I am raising here acute, because if we build a model of a person in a way that misses the aspect of them that is capable of maturing into wisdom, then we may apply extrapolation and get a nonsensical result, and yet not be able to distinguish its nonsensicality from true wisdom. The only way around this, in my view, is to have a theory that lets us check our design choices from a theoretical, not empirical standpoint.

What we need is a theory about that-which-is-capable-of-becoming-wise, in a form that lets us say something about its relationship to models, extrapolation, and volitional dynamics.

Astrological signs

Consider an alternative history in which machine learning was developed at a time when the positions of the planets in the night sky were understood to determine the fates of people on Earth. In this history, AI alignment researchers implement CEV just as researchers in our universe might: by (1) modeling a person end-to-end using machine learning, (2) extrapolating that person by having them live for a long time in a simulated environment, and then (3) asking them questions about what actions should be taken in the present. In order to build the simulated environment in step 2, researchers in this alternative history apply machine learning to measurements of the natural world in order to build a model of the world, just as researchers in our universe might. The researchers find that they need to tune the structure of their models to get them to reflect the common-sense realities of astrological phenomena that the researchers know to be real. As a result, the model of the world used for extrapolation (in step 2) is one that really is governed by astrological phenomena. As a result of that, the extrapolated person believes that astrological phenomena govern the lives of people on Earth. Is this extrapolated person truly wise?
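
To see where the designers’ beliefs enter this pipeline, here is a deliberately cartoonish sketch of steps (1) and (2). Every name and number in it is hypothetical; the only point it makes is that whatever gets tuned into the world model is what the extrapolated person eventually learns.

```python
def fit_world_model(observations, designer_priors):
    """Toy 'world model': statistics learned from data, then tuned toward the
    designers' common-sense beliefs (the crucial step in the story above)."""
    learned = {k: sum(v) / len(v) for k, v in observations.items()}
    return {**learned, **designer_priors}   # tuning overrides what was learned

def extrapolate_person(person_model, world_model):
    """Toy 'extrapolation': after a long simulated life, the person's beliefs
    converge on whatever is true in the simulated world."""
    beliefs = dict(person_model)
    beliefs.update(world_model)
    return beliefs

# Hypothetical usage: the extrapolated person ends up believing in astrology,
# simply because the simulated world was tuned so that astrology is true there.
person = {"astrology_governs_lives": 0.1}
world = fit_world_model({"astrology_governs_lives": [0.0, 0.2]},
                        designer_priors={"astrology_governs_lives": 0.95})
print(extrapolate_person(person, world))   # {'astrology_governs_lives': 0.95}
```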

What, actually, does such a person learn during aeons of living in a simulation? If the simulation is in any way fitted to the common-sense beliefs of its designers, then surely they would learn that reality is a lot like how the designers of the simulation imagined it was.

The key premise in my story here is that the simulation designers tweak their models to reflect the common-sense realities that they know to be true (i.e. their beliefs). In our own universe, we believe that the world has a state that transitions over time according to some kind of lawful dynamic. Is this actually how things are? When we build models of the world, we bake this belief (that the universe evolves according to a state and a lawful dynamic) into our models very deeply, in such a way that we don’t see it as a working assumption, but more as a rock-solid foundation upon which our whole belief structure is predicated. The whole structure of our modeling methodology assumes this view of the world. If this foundation – or any other foundational assumption – is faulty, and we build models of the world that bake it in, and use those models to extrapolate people, then we may end up with extrapolated people who simply believe it to be true because it is true of the simulated world in which they live, just as in the case of astrology.

The question here is what aspects of the world need to be incorporated into a simulation so that a person extrapolated by living in that simulation encounters the kind of thing that develops real wisdom, rather than merely becoming entrenched in the worldview of the simulation designers.

Summary so far

  • What kind of a thing can be extrapolated in the direction of wisdom at all? (Robot vacuum)

  • What kind of protocol connects with the wisdom of a person who has been extrapolated? (Dream research)

  • What kind of model captures that within a person that we hope to amplify through extrapolation? (Twitter imitator)

  • What kind of environment is sufficient to grow true wisdom? (Astrological signs)

The wisdom backflip

Perhaps some will say that we can use AI to answer these questions. The Arbital page on CEV says:

Even the terms in CEV, like "know more" or "extrapolate a human", seem complicated and value-laden. You might have to build a high-level Do What I Know I Mean agent, and then tell it to do CEV. Do What I Know I Mean is complicated enough that you'd need to build an AI that can learn DWIKIM, so that DWIKIM can be taught rather than formally specified. So we're looking at something like CEV, running on top of DWIKIM, running on top of a goal-learning system, at least until the first time the CEV agent rewrites itself.

DWIKIM is an AI that examines the cognition behind the instructions you give it and uses that to do what you mean, even if there is ambiguity in your instructions. This is different from CEV because with CEV, the AI asks what you would have instructed (and meant by that instruction) if you were wiser. With DWIKIM the AI merely asks what you meant by what you did instruct.

The suggestion in the quote above is that we might first build an AI that follows instructions by examining our cognition, then instruct that AI to implement CEV. The idea, I suppose, is that the DWIKIM AI might work out the operational details of implementing CEV, including resolving the four questions raised in this essay. However, there is a very important problem with this approach. It assumes that there actually is an answer to be found within our cognition about how to implement CEV’s operational details. If we don’t know what kind of thing can be extrapolated in the direction of wisdom, what kind of protocol connects with wisdom, what kind of model captures that within a person that we hope to amplify through extrapolation, nor what kind of environment is sufficient to grow wisdom, then what "meaning behind the instruction" will a DWIKIM AI find within our cognition? Surely it will just find that there is no clear cognition behind the instruction at all.

Now there might be other ways to use AI to clarify the operational details of implementing CEV; but the specific approach offered on the Arbital page – of using an AI that examines our cognition to work out what we really meant in our instructions – seems unlikely to work if we don’t have reasonable answers to the four questions in this essay.

We are in the following situation. We don’t currently have the wisdom to design our own vast future, and we know this, but we also know that we would have that wisdom if we went through certain transformations, so we are trying to design an AI that is guided by a wisdom we ourselves don’t yet possess, by describing a process by which the AI might access our future wisdom. We might call this the wisdom backflip: an attempt to build machines that are wiser than we are.

But it seems that we keep running into a kind of conservation of wisdom principle, where each attempt to design an AI guided by our own future wisdom requires us to make critical design choices right here in the present, with neither the wisdom of our future selves nor the wisdom of an AI that is directed by this future wisdom. Again and again it seems that if we don’t get these design choices right, the AI won’t be correctly guided by the future wisdom that we hope to point it towards, and the design choices that we must make in the present keep turning out to be deep.

Each time we come up against this barrier, it is tempting to add a new layer of indirection in our designs for AI systems. This layer of indirection is always about finding a way to solve our present problems with our future wisdom, using AI to backport the wisdom of the future to the problems of the present. In fact the wisdom of the future is exactly what is needed to solve the problems of the present. But there is some kind of insight that is needed in the present, that we don’t seem to be able to backflip over. It shows up in different ways in different contexts. In CEV it shows up as this question about how to correctly extrapolate wisdom.

I suspect there is a kind of "hard problem of AI alignment" at the heart of this issue.

Conclusion

CEV is a proposal about what kind of goal we might give to a very powerful AI. It suggests, very roughly, that the most important thing is for the AI to be directed by the kind of wisdom that grows within people as they go through certain transformations. All specific questions about how exactly the world should be organized are downstream of that. With this assertion I completely agree, and I am grateful to Eliezer for spelling it out so thoroughly at such a formative stage of this community’s development.

However, CEV also gives a kind of blueprint for how we are going to build AI systems that access wisdom. This blueprint shows up in the name "coherent extrapolated volition": it is that we are going to build AI systems that build models of people and models of the world, and use those models together to work out how the people would be changed by certain transformations, and then interact with thus-transformed people in order to decide what actions the AI should take in the present. With this blueprint, I have serious doubts. Specifically, choices about what kind of models to build, what kind of transformations to use, and what kind of interactions to have with the transformed models seem not to be mere details, but actually to contain the real core of the problem, which in my view orbits four key questions:

  1. What kind of a thing can be extrapolated in the direction of wisdom?

  2. What kind of protocol connects with the wisdom of a person who has been extrapolated?

  3. What kind of model captures that within a person that we hope to amplify through extrapolation?

  4. What kind of environment is sufficient to grow true wisdom?

By design, writings on CEV do not try to spell out the full operational details of an implementation. This is not a problem per se, since any proposal on any topic leaves some level of detail to be decided by the builders. The real question for any proposal is: to what extent did the core problem get resolved in the details that were given, versus still showing up in the outstanding subproblems?

What we need, in my view, is a theory about that-which-is-capable-of-becoming-wise, in a form that lets us say something about its relationship to models, extrapolation, and volitional dynamics. I do not believe that CEV provides such a theory, but rather works around the absence of such a theory by leaving open the parts of the problem that demand such a theory.

I will give a short presentation of this work followed by discussion on Wednesday Dec 28 at 12pm Pacific / 3pm Eastern. RSVP here.

Comments

EDIT: I meant what Wei Dai has been "talking about", not "trying to do".

This problem you are pointing at sounds like what Wei Dai has been trying to do for years. In some sense, it is like getting a fully specified meta-ethical framework, of the kind Eliezer attempted to describe in the Sequences. Does that sound right?

I'm very interested in Wei Dai's work, but I haven't followed closely in recent years. Any pointers to what I might read of his recent writings?

I do think Eliezer tackled this problem in the Sequences, but I don't really think he came to an answer to these particular questions. I think what he said about meta-ethics is that it is neither that there is some measure of goodness to be found in the material world independent of our own minds, nor that goodness is completely open to be constructed based on our whims or preferences. He then says "well there just is something we value, and it's not arbitrary, and that's what goodness is", which is fine, except it still doesn't tell us how to find that thing or extrapolate it or verify it or encode it into an AI. So I think his account of meta-ethics is helpful but not complete.

Each time we come up against this barrier, it is tempting to add a new layer of indirection in our designs for AI systems.

I strongly agree with this characterization. Of my own "learning normativity" research direction, I would say that it has an avoiding-the-question nature similar to what you are pointing out here; I am in effect saying: Hey! We keep needing new layers of indirection! Let's add infinitely many of them! 

One reason I don't spend very much time staring the question "what is goodness/wisdom" in the eyes is that the CEV write-up and other things convinced me that trying to answer this question on the object level (e.g., trying to write down the utility function for "goodness" instead of trying to come up with a working value learning system) would, if successful, be a way of "taking over the world" with your own values. It's too easy to fool yourself.

To use a political analogy, you don't want to install a dictator, even if that person is actually "really good" -- the process by which you put them in power was not legitimate, because it did not involve everyone in the right way. There are too many times when people have tried this approach and it has gone wrong. So, it's better to follow a process with a better track record, and a more "fair" way of giving everyone input into the end result.

Moving on to a different point -- to defend the methodology of adding layers of indirection, a bit: it seems plausible to me that each layer of indirection, if crafted well, makes a sort of progress. 

We know something about what's good, but I feel quite hopeless about an approach like "program in what's good directly" -- because of the "taking over the world" concern I already mentioned, but also just because I think it's very hard (even if you're fine with taking over the world) and humans are very very liable to get it wrong. 

We know something about how to do value-learning; I still feel somewhat hopeless about a direct value-learning approach, but it feels significantly less hopeless than direct value specification.

I feel somewhat better about giving a system feedback about successful vs unsuccessful value learning, rather than trying to directly specify a value-learning loss function, because this at least doesn't fall prey to Stuart Armstrong's impossibility argument for value learning. 

And so on. 

I won't claim that this hierarchy approaches perfection in the limit. In particular, it's still doomed if we don't produce enough actual high-quality information to put in each level. (This is more like "staring the problem directly in the eyes".) But it does seem like it becomes less doomed with each level of indirection. 

Did you ever end up reading Reducing Goodhart? I enjoyed reading these thought experiments, but I think rather than focusing on "the right direction" (of wisdom), or "the right person," we should mostly be thinking about "good processes" - processes for evolving humans' values that humans themselves think are good, in the ordinary way we think ordinary good things are good.

Did you ever end up reading Reducing Goodhart?

Not yet, but I hope to, and I'm grateful to you for writing it.

processes for evolving humans' values that humans themselves think are good, in the ordinary way we think ordinary good things are good

Well, sure, but the question is whether this can really be done by modelling human values and then evolving those models. If you claim yes, then there are several thorny issues to contend with, including what constitutes a viable starting point for such a process, what a reasonable dynamic for such a process would be, and on what basis we decide the answers to these things.