How much can value learning be disentangled?

Stuart_Armstrong

In the context of whether the definition of human values can be disentangled from the process of approximating/implementing that definition, David asks me:

But I think it's reasonable to assume (within the bounds of a discussion) that there is a non-terrible way (in principle) to specify things like "manipulation". So do you disagree?

I think it's a really good question, and its answer is related to a lot of relevant issues, so I put this here as a top-level post. My current feeling is, contrary to my previous intuitions, that things like "manipulation" might not be possible to specify in a way that leads to useful disentanglement.

Why manipulate?

First of all, we should ask why an AI would be tempted to manipulate us in the first place. It may be that it needs us to do something for it to accomplish its goal; in that case it is trying to manipulate our actions. Or maybe its goal includes something that cashes out as our mental states; in that case, it is trying to manipulate our mental states directly.

The problem is that any reasonable friendly AI would have our mental states as part of its goal - it would at least want us to be happy rather than miserable. And (almost) any AI that wasn't perfectly indifferent to our actions would be trying to manipulate us just to get its goals accomplished.

So manipulation is to be expected from most AI designs, friendly or not.

Manipulation versus explanation

Well, since the urge to manipulate is expected to be present, could we just rule it out? The problem is that we need to define the difference between manipulation and explanation.

Suppose I am fully aligned/corrigible/nice or whatever other properties you might desire, and I want to inform you of something important and relevant. In doing so, especially if I am more intelligent than you, I will simplify, I will omit irrelevant details, I will omit arguably relevant details, I will emphasise things that help you get a better understanding of my position, and de-emphasise things that will just confuse you.

And these are exactly the same sorts of behaviours that a smart manipulator would engage in. Nor can we define the difference as whether the AI is truthful or not. We want human understanding of the problem, not truth. It's perfectly possible to manipulate people while telling them nothing but the truth. And if the AI structures the order in which it presents the true facts, it can manipulate people while presenting the whole truth as well as nothing but the truth.

It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation at the end. And measuring understanding is very subtle. And even if we do it right, note that we have now motivated the AI to... aim for a particular set of mental states. We are rewarding it for manipulating us. This is contrary to the standard understanding of manipulation, which focuses on the means, not the end result.

Bad behaviour and good values

Does this mean that the situation is completely hopeless? No. There are certain manipulative practices that we might choose to ban. Especially if the AI is limited in capability at some level, this would force it to follow behaviours that are less likely to be manipulative.

Essentially, there is no boundary between manipulation and explanation, but there is a difference between extreme manipulation and explanation, so ruling out the first can help (or maybe not).

The other thing that can be done is to ensure that the AI has values close to ours. The closer the values of the AI are to ours, the less manipulation it will need to use, and the less egregious the manipulation will be. It might be that, between partial value convergence and ruling out specific practices (and maybe some physical constraints), we may be able to get an AI that is very unlikely to manipulate us much.

Incidentally, I feel the same about low-impact approaches. The full generality problem, an AI that is low impact but value-agnostic, I think is impossible. But if the values of the AI are better aligned with ours, and the AI is more physically constrained, then low impact becomes easier to define.

Incidentally, I feel the same about low-impact approaches. The full generality problem, an AI that is low impact but value-agnostic, I think is impossible.

My (admittedly hazy) recollection of our last conversation is that your concerns were that “value agnostic, low impact, and still does stuff” is impossible. Can you expand on what you mean by value agnostic here, and why you think we can’t even have that and low impact?

This is based more on experience than on a full formal argument (yet). Take an AI that, according to our preferences, is low impact and still does stuff. Then there is a utility function $U$ for which that "does stuff" is the single worst and highest impact thing the AI could have done (you just trivially define a $U$ that only cares about that "stuff").

Now, that's a contrived case, but my experience is that problems like that come up all the time in low impact research, and that we really need to include - explicitly or implicitly - a lot of our values/preferences directly, in order to have something that satisfies low impact.
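The contrived construction described above can be made concrete with a toy sketch. Everything here is an assumption for illustration: the action names, the `contrived_utility` helper, and the `impact` function (absolute change in $U$ relative to a do-nothing baseline) are hypothetical and not a definition taken from any low-impact proposal.

```python
from typing import Callable

def contrived_utility(chosen: str) -> Callable[[str], float]:
    """Build a utility U that only cares about whether `chosen` was done."""
    def U(action: str) -> float:
        # U penalises the chosen "stuff" and is indifferent to everything else.
        return -1.0 if action == chosen else 0.0
    return U

def impact(U: Callable[[str], float], action: str, baseline: str) -> float:
    """Toy impact measure: absolute change in U compared to the baseline."""
    return abs(U(action) - U(baseline))

# Hypothetical action set; "fetch_coffee" stands in for the "does stuff"
# that we ourselves would judge to be low impact.
actions = ["noop", "fetch_coffee", "build_factory"]
U = contrived_utility("fetch_coffee")

worst = min(actions, key=U)
highest_impact = max(actions, key=lambda a: impact(U, a, "noop"))
```

Under this $U$, the action we consider harmless is both the worst and the highest-impact option, which is all the construction claims. It says nothing about whether such a $U$ is one that an impact measure needs to take into account.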

This seems to prove too much; the same argument proves friendly behavior can’t exist ever, or that including our preferences directly is (literally) impossible. The argument doesn’t show that that utility has to be important to / considered by the impact measure.

Plus, low impact doesn’t have to be robust to adversarially chosen input attainable utilities - we get to choose them. Just choose the “am I activated” indicator utility and AUP seems to do fine, modulo open questions raised in the post and comments.

This seems to prove too much; the same argument proves friendly behavior can’t exist ever, or that including our preferences directly is (literally) impossible.

? I don't see that. What's the argument?

(If you want to say that we can't define friendly behaviour without using our values, then I would agree ^_^ but I think you're trying to argue something else).

Take a friendly AI that does stuff. Then there is a utility function for which that "does stuff" is the single worst thing the AI could have done.

The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.

As I understand it, the impact version of this argument is flawed in the same way (but less blatantly so): something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.

The fact that no course of action is universally friendly doesn’t mean it can’t be friendly for us.

Indeed, by "friendly AI" I meant "an AI friendly for us". So yes, I was showing a contrived example of an AI that was friendly, and low impact, from our perspective, but that was not, as you said, universally friendly (or universally low impact).

something being high impact according to a contrived utility function doesn’t mean we can’t induce behavior that is, with high probability, low impact for the vast majority of reasonable utility functions.

In my experience so far, we need to include our values, in part, to define "reasonable" utility functions.

In my experience so far, we need to include our values, in part, to define "reasonable" utility functions.

It seems that an extremely broad set of input attainable functions suffices to capture the “reasonable” functions with respect to which we want to be low impact. For example, “remaining on”, “reward linear in how many blue pixels are observed each time step”, etc. All thanks to instrumental convergence and opportunity cost.
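As a rough sketch of the idea in this comment, the following toy penalty sums, over a hand-picked set of auxiliary utilities, how much an action changes the agent's ability to attain each one compared to doing nothing. This is only the general shape of an attainable-utility penalty as assumed here; the table `q`, the action names, and all numbers are invented for illustration, and details of the actual AUP proposal (such as scaling) are not reproduced.

```python
from typing import Dict

# Hypothetical attainable values: q[utility][action] is how well the agent
# could still do on that auxiliary utility after taking the action.
q: Dict[str, Dict[str, float]] = {
    "remaining_on": {"noop": 0.75, "fetch_coffee": 0.75, "seize_power": 1.0},
    "blue_pixels": {"noop": 0.25, "fetch_coffee": 0.5, "seize_power": 1.0},
}

def penalty(action: str, baseline: str = "noop") -> float:
    """Sum of absolute shifts in attainable value relative to the baseline."""
    return sum(abs(values[action] - values[baseline]) for values in q.values())

penalties = {a: penalty(a) for a in ["noop", "fetch_coffee", "seize_power"]}
```

In this made-up table, the power-seeking action shifts attainable value for both auxiliary utilities at once, so it receives a larger penalty than the mundane action, even though neither utility encodes human values directly. Whether the choice of auxiliary utilities itself smuggles in our values is the point under dispute in the thread.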
