I think of ambitious value learning as a proposed solution to the specification problem, which I define as the problem of _defining_ the behavior that we would want to see from our AI system. I italicize “defining” to emphasize that this is not the problem of actually computing the behavior that we want to see -- that’s the full AI safety problem. Here we are allowed to use hopelessly impractical schemes, as long as the resulting definition would allow us to, in theory, compute the behavior we want the AI system to take, perhaps with assumptions like infinite computing power or arbitrarily many queries to a human. (Although we do prefer specifications that seem like they could admit an efficient implementation.) In terms of DeepMind’s classification, we are looking for a design specification that exactly matches the ideal specification. HCH and indirect normativity are examples of attempts at such specifications.

We will consider a model in which our AI system is maximizing the expected utility of some explicitly represented utility function that can depend on history. (It does not matter materially whether we consider utility functions or reward functions, as long as they can depend on history.) The utility function may be learned from data, or designed by hand, but it must be explicitly represented within the AI system, which then maximizes it.

I will not justify this model for now, but simply assume it by fiat and see where it takes us. I’ll note briefly that this model is often justified by the VNM utility theorem and AIXI, and as the natural idealization of reinforcement learning, which aims to maximize the expected sum of rewards, although typically rewards in RL depend only on states.
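
To make this concrete, the model can be written as follows (the notation here is just illustrative shorthand, not something later posts depend on): the AI chooses a policy to maximize the expected utility of the entire history, whereas the standard RL idealization sums a reward that depends only on the current state.

```latex
% The model assumed in this sequence: maximize expected utility of the full
% history h_T = (s_0, a_0, s_1, a_1, \ldots, s_T).
\pi^* \in \arg\max_{\pi} \; \mathbb{E}_{h_T \sim \pi} \big[ U(h_T) \big]

% The usual RL idealization, with a reward that depends only on states:
\pi^*_{\mathrm{RL}} \in \arg\max_{\pi} \; \mathbb{E}_{\pi} \Big[ \sum_{t=0}^{T} r(s_t) \Big]
```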

A lot of conceptual arguments, as well as experiences with specification gaming, suggest that we are unlikely to be able to simply think hard and write down a good specification, since even small errors in specifications can lead to bad results. However, machine learning is particularly good at narrowing down to the correct hypothesis among a vast space of possibilities using data, so perhaps we could determine a good specification from some suitably chosen source of data? This leads to the idea of ambitious value learning, where we learn an explicit utility function from human behavior for the AI to maximize.

This is closely related to inverse reinforcement learning (IRL) in the machine learning literature, though not all work on IRL is relevant to ambitious value learning. For example, much work on IRL is aimed at imitation learning, which would in the best case allow you to match human performance, but not to exceed it. Ambitious value learning is, well, more ambitious -- it aims to learn a utility function that captures “what humans care about”, so that an AI system that optimizes this utility function more capably can exceed human performance, making the world better for humans than they could have done themselves.
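
To make the setup concrete, here is a toy sketch of what "learning an explicit utility function from human behavior" could look like. This is not any particular IRL algorithm from the literature; it just does Bayesian inference over a few hand-written candidate utility functions, assuming the human chooses Boltzmann-rationally, and all of the names and numbers are purely illustrative.

```python
# Toy sketch: infer a utility function from observed human choices.
# Assumption (illustrative): the human picks outcomes Boltzmann-rationally,
# i.e. with probability proportional to exp(beta * utility).

import numpy as np

# Three possible outcomes the human's actions can lead to.
OUTCOMES = ["coffee", "tea", "water"]

# Hypothesis space: candidate utility functions over outcomes (hand-written).
CANDIDATE_UTILITIES = {
    "likes_coffee": np.array([1.0, 0.2, 0.0]),
    "likes_tea":    np.array([0.2, 1.0, 0.0]),
    "indifferent":  np.array([0.5, 0.5, 0.5]),
}

BETA = 3.0  # assumed "rationality": higher means closer to optimal behavior


def action_probs(utilities: np.ndarray, beta: float = BETA) -> np.ndarray:
    """Boltzmann-rational choice: P(outcome) proportional to exp(beta * utility)."""
    logits = beta * utilities
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


def posterior_over_utilities(observed_choices: list[str]) -> dict[str, float]:
    """Bayesian update over candidate utility functions given observed choices."""
    log_post = {name: 0.0 for name in CANDIDATE_UTILITIES}  # uniform prior
    for choice in observed_choices:
        idx = OUTCOMES.index(choice)
        for name, utils in CANDIDATE_UTILITIES.items():
            log_post[name] += np.log(action_probs(utils)[idx])
    # Normalize the posterior.
    m = max(log_post.values())
    unnorm = {name: np.exp(lp - m) for name, lp in log_post.items()}
    z = sum(unnorm.values())
    return {name: p / z for name, p in unnorm.items()}


if __name__ == "__main__":
    # Observing the human mostly pick coffee should favor "likes_coffee".
    print(posterior_over_utilities(["coffee", "coffee", "tea", "coffee"]))
```

The thing to notice is that the inference only gets off the ground because of the assumed link between utility and behavior (here, Boltzmann rationality); how much work that assumption does is a major theme of the arguments considered later in this sequence.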

It may sound like we would have solved the entire AI safety problem if we could do ambitious value learning -- surely if we have a good utility function we would be done. Why then do I think of it as a solution to just the specification problem? This is because ambitious value learning by itself would not be enough for safety, except under the assumption of as much compute and data as desired. These are really powerful assumptions -- for example, I'm assuming you can get data where you put a human in an arbitrarily complicated simulated environment with fake memories of their life so far and see what they do. This allows us to ignore many things that would likely be a problem in practice, such as:

  • Attempting to use the utility function to choose actions before it has converged
  • Distributional shift causing the learned utility function to become invalid
  • Local minima preventing us from learning a good utility function, or from optimizing the learned utility function correctly

The next few posts in this sequence will consider the suitability of ambitious value learning as a solution to the specification problem. Most of them will consider whether ambitious value learning is possible in the setting above (infinite compute and data). One post will consider practical issues with the application of IRL to infer a utility function suitable for ambitious value learning, while still assuming that the resulting utility function can be perfectly maximized (which is equivalent to assuming infinite compute and a perfect model of the environment after IRL has run).

Comments

Can you clarify, do "queries to a human" and "data about human behavior" mean things like asking humans questions and observing human behavior in real/historical situations, or does it mean being able to put humans in arbitrary virtual environments (along with fake memories of how they got there) in order to observe their reactions? If it's the former, I'm not sure how that lets us ignore "Distributional shift causing the learned utility function to become invalid". If it's the latter, I think a lot of people might be surprised by that assumption so it would be good to spell it out.

The latter. Good point about clarity, I've added a sentence making that clearer, thanks!

On second thought, even if you assume the latter, the humans you're learning from will themselves have problems with distributional shifts. If you give someone a different set of life experiences, they're going to end up a different person with different values, so it seems impossible to learn a complete and consistent utility function by just placing someone in various virtual environments with fake memories of how they got there and observing what they do. Will this issue be addressed in the sequence?

No, I'm not planning to tackle this issue.

One approach would be to take current-me and put current-me through a variety of virtual environments with fake memories that start from current-time without removing my real memories and use whatever is inferred from that as my utility function. (Basically, treat all experiences and memories up to the current time as "part of me", and treat that as the initial state from which you are trying to determine a utility function.)

But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that's a really good reason to assume that the whole framework of getting the true human utility function is doomed. Not just ambitious value learning, _any_ framework that involves an AI optimizing some expected utility would not work.

But more generally, if you think that a different set of life experiences means that you are a different person with different values, then that’s a really good reason to assume that the whole framework of getting the true human utility function is doomed.

Maybe it's not that bad? For example I can imagine learning the human utility function in two stages. The first stage uses the current human to learn a partial utility function (or some other kind of data structure) about how they want their life to go prior to figuring out their full utility function. E.g., perhaps they want a safe and supportive environment to think, talk to other humans, and solve various philosophical problems related to figuring out one's utility function, with various kinds of assistance, safeguards, etc. from the AI (but otherwise no strong optimizing forces acting upon them). In the second stage, the AI uses that information to compute a distribution of "preferred" future lives and then learns the full utility function only from those lives.

Another possibility is if we could design an Oracle AI that is really good at answering philosophical questions (including understanding what our confused questions mean), we can just ask it "What is my utility function?"

So I would argue that your proposal is one example of how you could learn a utility function from humans assuming you know the full human policy, where you are proposing that we pay attention to a very small part of the human policy (the part that specifies our answers to the question "how do we want our life to go" at the current time, and then the part that specifies our behavior in the "preferred" future lives).

You can think of this as ambitious value learning with a hardcoded structure by which the AI is supposed to infer the utility function from behavior. (A mediocre analogy: AlphaGo Zero learns to play Go with a hardcoded structure of MCTS.) As a result, you would still need to grapple with the arguments against ambitious value learning brought up in subsequent posts -- primarily, that you need to have a good model of the mistakes that humans make in order to do better than humans would themselves. In your proposal, I think the mistake model is "everything that humans do could be mistaken, but when they talk about how they want their life to go, they are not mistaken about that". This seems like a better mistake model than most, and it could work -- but we are hardcoding in an assumption about humans here that could be misspecified. (E.g., humans say they want autonomy and freedom from manipulation, but actually they would have been better off if they had let the AI make arguments to them about what they care about.)
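
(To spell out what I mean by a "mistake model" slightly more formally: it is the assumed link between the true utility function and observed behavior. One common illustrative choice -- not the one your proposal makes -- is Boltzmann rationality:)

```latex
% Boltzmann (noisy) rationality as an example mistake model: the human is
% assumed to pick action a in state s with probability
P(a \mid s, U) \;=\; \frac{\exp\big(\beta \, Q_U(s, a)\big)}{\sum_{a'} \exp\big(\beta \, Q_U(s, a')\big)}
% where Q_U(s, a) is the expected utility (under U) of taking a in s and acting
% optimally thereafter, and \beta controls how close to optimal the human is
% assumed to be. Your proposal instead hand-specifies a likelihood that fully
% trusts what humans say about how they want their life to go.
```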

In your proposal, I think the mistake model is “everything that humans do could be mistaken, but when they talk about how they want their life to go, they are not mistaken about that”.

Ok, this is helpful for making a connection between my way of thinking and the "mistake model" way, but it seems a bit of a stretch, since I almost certainly am mistaken (or suboptimal) about how I want my life to go. I only want autonomy and freedom from manipulation because I don't know how to let an AI manipulate me (i.e., make arguments to me about my values) in a way that would be safe and lead to good results. If I did, I may well let the AI do that and save myself the trouble and risk of trying to figure out my values on my own.

Yeah, I agree that the mistake model implied by your proposal isn't correct, and as a result you would not infer the true utility function. Of course, you might still infer one that is sufficiently close that we get a great future.

Tbc, I do think there are lots of other ways of thinking about the problem that are useful that are not captured by the "mistake model" way of thinking. I use the "mistake model" way of thinking because it often shows a different perspective on a proposal, and helps pinpoint what you're relying on in your alignment proposal.

Of course this is all assuming that there does exist a true utility function, but I think we can replace "true utility function" with "utility function that encodes the optimal actions to take for the best possible universe" and everything still follows through. But of course, not hitting this target just means that we don't do the perfectly optimal thing -- it's totally possible that we end up doing something that is only very slightly suboptimal.

A conversation that just went down in my head:

Me: "You observe a that a bunch of attempts to write down what we want get Goodharted, and so you suggest writing down what we want using data. This seems like it will have all the same problems."

Straw You: "The reason you fail is because you can't specify what we really want, because value is complex. Trying to write down human values is qualitatively different from trying to write down human values using a pointer to all the data that happened in the past. That pointer cheats the argument from complexity, since it lets us fit lots of data into a simple instruction."

Me: "But the instruction is not simple! Pointing at what the "human" is is hard. Dealing with the fact that the human in inconsistent with itself gives more degrees of freedom. If you just look at the human actions, and don't look inside the brain, there are many many goals consistent with the actions you see. If you do look inside the brain, you need to know how to interpret that data. None of these are objective facts about the universe that you can just learn. You have to specify them, or specify a way to specify them, and when you do that, you do it wrong and you get Goodharted."

The next four posts are basically making exactly these points (except for "pointing at what the human is is hard"). Or actually, they don't talk about the "look inside the brain" part either, but I agree with your argument there as well.

I'm going to argue that ambitious value learning is difficult and probably not what we should be aiming for. (Or rather, I'm going to add posts that other people wrote to this sequence, that argue for that claim or weaker versions of it.)

although typically rewards in RL depend only on states,

Presumably this should be a period? (Or perhaps there's a clause missing pointing out the distinction between caring about history and caring about states, tho you could transform one into the other?)

Supposed to be a period, fixed now. While you can transform one into the other, I find it fairly unnatural, and I would guess this would be the case for other ML researchers. Typically, if we want to do things that depend on history, we just drop the Markov assumption, rather than defining the state to be the entire history.

Also, if you define the state to be the entire history, you lose ergodicity assumptions that are needed to prove that algorithms can learn well.
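
(For concreteness, the transformation I have in mind looks roughly like this -- a sketch, assuming the utility function is also defined on partial histories:)

```latex
% Turn a history-dependent utility U into a state-based reward by making the
% "state" carry the whole history:
\tilde{s}_t = h_t = (s_0, a_0, \ldots, s_t)
\tilde{r}(\tilde{s}_0) = U(h_0), \qquad \tilde{r}(\tilde{s}_t) = U(h_t) - U(h_{t-1}) \;\text{ for } t \geq 1
% The rewards telescope, so \sum_t \tilde{r}(\tilde{s}_t) = U(h_T). But histories
% only ever grow, so no augmented state is ever revisited, which is the sense in
% which ergodicity-style assumptions break down.
```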