Thanks to Rebecca Gorman for the discussions that inspired these ideas.

We recently argued that full understanding of value extrapolation[1] was necessary and almost sufficient for solving the AI alignment problem, due to the existence of morally underdefined situations.

People generally seemed to agree that some knowledge of human morality was needed for a safe, powerful AI. But it wasn't clear that such an AI needed almost-full knowledge of human morality, as well as value extrapolation. We received some high-quality alternative suggestions. In this post, we'll try to distil those ideas and analyse them.

AI seems to behave reasonably

Steven Byrnes's position, if we understand it correctly, is that the AI should learn to behave in non-dangerous seeming ways[2].

Our phrasing of the idea would be (apologies for any misinterpretations):

  1. There are many behaviours that agents can follow. Typical humans follow typical behaviours, to achieve typical consequences. And other typical humans can assess both behaviours and consequences, and grade them as (typical) good, (typical) bad, or weird.
  2. An AI should behave in ways that typical humans judge as good, achieving consequences that typical humans judge as good. If there is too high a risk of weirdness, the AI will output NOOP (no operation).
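
To make the scheme concrete, here is a minimal sketch in Python. Every component in it is an assumption rather than an existing piece of machinery: the grading functions stand in for a model of a typical human assessor, and the weirdness-risk estimate is simply presumed to exist.

    from enum import Enum

    class Grade(Enum):
        GOOD = "good"
        BAD = "bad"
        WEIRD = "weird"

    NOOP = None  # the "do nothing" action

    def choose_action(candidate_actions, predict_consequences,
                      grade_behaviour, grade_consequences,
                      weirdness_risk, max_weirdness_risk=0.01):
        """Return the first action that a typical human would grade as good,
        whose predicted consequences are also graded as good, and whose
        estimated risk of weirdness is low enough; otherwise return NOOP."""
        for action in candidate_actions:
            consequences = predict_consequences(action)
            if (grade_behaviour(action) == Grade.GOOD
                    and grade_consequences(consequences) == Grade.GOOD
                    and weirdness_risk(action, consequences) <= max_weirdness_risk):
                return action
        return NOOP  # too high a risk of weirdness: do nothing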

This seems a sensible approach. But it has a crucial flaw. The AI must behave in typical ways and achieve typical consequences - but how it achieves those consequences doesn't have to be typical.

For example, it might make an explanatory video to convince someone to follow some (reasonable) course of action. The video itself is not unusual, but the colour scheme happens to hypnotise that specific viewer into following the suggestion. At a gross level of description, everything is fine: convincing-but-typical video convinces. But this only works because the AI has inhuman levels of knowledge and predictive ability, and carefully selects the "typical behaviour" that is the most effective.

Now, we might be able to ensure the AI doesn't have secret superhuman abilities (an old, underdeveloped idea of ours is to force the AI to use human models of the world to achieve its goals), but this is a very different problem, and a much harder one.

Then, given that the AI has access to superhuman levels of ability, "typical consequences" is no longer a safe goal. It reduces to "consequences that seem typical to an observing human": the AI can now do whatever it wants, as long as its actions and consequences look ok to that observer.

Still, it might be worth developing this idea; it might work well combined with some other safety approaches.

Advanced human feedback

Rohin's argument is that we don't need the AI to solve value transfer and extrapolation. Instead, we just need a "well-motivated" AI that asks us what we prefer. It must present the options and the consequences in an informative manner. If there are morally underdefined questions where we might give multiple answers depending on question phrasing, then the AI goes to a meta level and asks us how we would want it to ask us.

This is a potentially powerful approach. It solves the value extrapolation problem indirectly, by deference to humans. It is similar to a scaled-up version of "informed consent".
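
As a rough illustration (ours, not Rohin's actual proposal), the loop might look something like the sketch below. All the helpers passed in - how to present options informatively, how to detect that an answer would depend on phrasing, how to generate candidate phrasings - are hypothetical placeholders, and they are where most of the real difficulty would live.

    def get_human_preference(question, options, ask_human, present_options,
                             is_phrasing_sensitive, candidate_phrasings,
                             depth=0, max_depth=2):
        """Present the options informatively and ask the human which they prefer.
        If the answer would plausibly depend on how the question is phrased,
        first go one meta level up and ask how the question should be asked."""
        if depth < max_depth and is_phrasing_sensitive(question, options):
            meta_question = "How would you like the following choice to be presented?"
            preferred_phrasing = get_human_preference(
                meta_question, candidate_phrasings(question, options),
                ask_human, present_options, is_phrasing_sensitive,
                candidate_phrasings, depth + 1, max_depth)
            if preferred_phrasing is not None:
                question = preferred_phrasing
        return ask_human(present_options(question, options))

Whether a loop like this is safe depends entirely on the spirit in which the AI fills in those placeholders, which is the question we turn to next.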

Would such an approach work to contain a malevolent AI? If a superintelligent AI wished to do us harm, but was constrained to following the above feedback approach, would it be safe?

It seems clear that it wouldn't. The malevolent AI would deploy all its resourcefulness to undermine the spirit of the feedback requirement. Unless that requirement were perfectly defined, it would be just a speedbump on the AI's rise to unrestricted power.

Similarly, we can't rely on an AI that is motivated solely to follow the feedback requirement. That's because almost all AI goals are malevolent at the superintelligent level, and so the feedback requirement on its own would also be malevolent - for instance, the AI could fill the universe with pseudo-humans always giving it feedback.

So it seems that being "well-motivated" is a key constraint.

Well-motivated AIs asking for feedback

"Well-motivated" is similar to our "well-intentioned" AI. That term designated an AI aware of a specific problem (eg wireheading) and that was motivated to avoid it. It seemed that it might work, for some specific problems.

Could a well-motivated/intentioned AI safely make use of human feedback? This seems a harder challenge. First of all, even for simple questions, the AI needs to interpret the human's answers correctly in terms of values and preferences. Thus the AI needs to be able to translate human answers to its questions into human values - a task that cannot be achieved merely from observation, and that requires knowing the human theory of mind.

This may be trainable; but what of the more complicated, morally underdefined situations? Even the issue of when to ask is non-trivial. "Shall I move six cm to the right?" is typically a pointless formality; when there is a cat trapped between the AI and the wall, this is vitally important.

So, ultimately, we want the AI to ask when the issue is important to humans. We want it to ask in a way best designed to elicit our genuine preferences. We want it to go meta when there are key ambiguities in our preferences or our likely extrapolated preferences. And we want it to phrase the meta questions so that our answers approximate some version of our idealised selves.

Notice that all those desiderata are much easier when the AI knows our (extrapolated) preferences. It is not clear at all that they can be achieved otherwise. There are some criteria we might use, for instance "start going meta when you have the ability to get the human to answer whatever way you want". But that is a requirement about the power and ability of the AI - what if the AI already has that ability from the start (maybe through some failure in how the question is grounded)? We don't want it to go meta, or go complicated, on every "shall I move six cm?" question, only on those that are important to us.
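
To see the circularity, consider a toy "ask only when it matters" trigger, sketched below. It assumes a hypothetical value_of_outcome estimate - which is exactly the knowledge of human preferences we were hoping not to presuppose.

    def should_ask(action, possible_outcomes, value_of_outcome, threshold=0.9):
        """Ask the human iff the plausible outcomes of this action differ enough
        in estimated human value. 'Shall I move six cm?' usually scores low;
        the same move with a cat trapped behind the AI should score high -
        but only if value_of_outcome already knows that humans care about cats."""
        values = [value_of_outcome(outcome) for outcome in possible_outcomes]
        return max(values) - min(values) > threshold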

So an aligned AI could easily pass a system of human feedback, and achieve "informed consent". An AI that isn't "well-motivated" could not. It seems that being "well-motivated" requires a lot more knowledge of human values (and human-style value extrapolation) than is normally assumed[3].

Where we agree

We all agree that an AI can be aligned without needing to know all of human values. It is sufficient for it to know enough "structural assumptions" to deduce the human values it needs, as it needs them.

Rohin thinks that some system of human feedback can be defined (and grounded) sufficiently to serve as a "structural assumption". For the reasons mentioned above, this might not be possible - human feedback works best when the AI is already well-aligned with us. Getting that likely involves solving issues like value extrapolation.

An interesting idea may be a mix of feedback and other methods of value learning. For example, maybe other methods of value learning allow the AI to be well-motivated when asking for feedback. This itself will increase its knowledge of human values, which may improve its value learning and extrapolation - improving its use of feedback, and so on.
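
A minimal sketch of that bootstrapping loop, with every component (the value model, the observation-based learner, the feedback and update steps) assumed rather than specified:

    def bootstrap_values(value_model, observations, pending_questions,
                         learn_from_observations, ask_for_feedback, update_model,
                         rounds=10):
        """Alternate between observation-based value learning and feedback:
        a better value model lets the AI ask better-motivated questions, and
        the answers in turn refine the value model."""
        for _ in range(rounds):
            value_model = learn_from_observations(value_model, observations)
            for question in pending_questions:
                # The current value model shapes how (and whether) to ask.
                answer = ask_for_feedback(question, value_model)
                value_model = update_model(value_model, question, answer)
        return value_model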

In any case, this is an area worth further investigation.


  1. A more precise term than "model splintering". ↩︎

  2. Key quotes in that thread:

    it should be possible to make an AGI that understands human values and preferences well enough to reliably and conservatively avoid doing things that humans would see as obviously or even borderline unacceptable / problematic.

    and:

    I want the AI to have criteria that qualifies actions as acceptable, e.g. "it pattern-matches less than 1% to 'I'm causing destruction', and it pattern-matches less than 1% to 'the supervisor wouldn't like this', and it pattern-matches less than 1% to 'I'm changing my own motivation and control systems', and … etc. etc."

    ↩︎
  3. Or, if you put the issue in reverse, imagine an AI that operates on feedback and causes an existential disaster. Then we can probably trace this failure to some point where it misinterpreted an answer, or asked a question that was manipulative or dangerous. In other words, a failure of alignment in the question-asking process. ↩︎

Comments

Notice that all those desiderata are much easier when the AI knows our (extrapolated) preferences. It is not clear at all that they can be achieved otherwise.

It seems like, as long as she wanted to, a human Alice could satisfy these desiderata when helping Bob, even though Alice doesn't know Bob's extrapolated preferences? So I'm not sure why you think an intelligent AI couldn't do the same.

Maybe you think that it's because Alice and Bob are both humans? But I also think Alice could satisfy these desiderata when helping an alien from a different planet -- she would definitely make some mistakes, but presumably not the existentially catastrophic variety*.

*unless the alien has some really unusual values where an existential catastrophe can be caused by accident, e.g. "if anyone ever utters the word $WORD, that is the worst possible universe", but those sorts of values seem very structurally different than human values. 

I actually don't think that Alice could help a (sufficiently alien) alien. She needs an alien theory of mind to understand what the alien wants, how they would extrapolate, how to help that extrapolation without manipulating it, and so on. Without that, she's just projecting human assumptions onto alien behaviour and statements.

She needs an alien theory of mind to understand what the alien wants

Absolutely, I would think that the first order of business would be to learn that alien theory of mind (and be very conservative until that's done).

Maybe you're saying that this alien theory of mind is unlearnable, even for a very intelligent Alice? That seems pretty surprising, and I don't feel the force of that intuition (despite the Occam's razor impossibility result).

Steven Byrnes's position, if we understand it correctly, is that the AI should learn to behave in non-dangerous seeming ways[2].…

This seems a sensible approach. But it has a crucial flaw. The AI must behave in typical ways and achieve typical consequences - but how it achieves those consequences doesn't have to be typical.

I think that problem may be solvable. See Section 1.1 (and especially 1.1.3) of this post. Here are two oversimplified implementations:

  1. The human grades the AI's actions as norm-following and helpful, and we do RL to make the AI score well on that metric.
  2. The AI learns the human concept of "being norm-following and helpful" (from watching YouTube or whatever). Then it makes plans and takes actions only when they pattern-match to that abstract concept. The pattern-matching part is inside the AI, looking at the whole plan as the AI itself understands it.

I was thinking of #2, not #1.

It seems like you're assuming #1. And also assuming that the result of #1 will be an AI that is "trying" to maximize the human grades / reward. Then you conclude that the AI may learn to appear to be norm-following and helpful, instead of learning to be actually norm-following and helpful. (It could also just hack into the reward channel, for that matter.) Yup, those do seem like things that would probably happen in the absence of other techniques like interpretability. I do have a shred of hope that, if we had much better interpretability than we do today, approach #1 might be salvageable, for reasons implicit in here and here. But still, I would start at #2 not #1.

I have some draft posts explaining some of this stuff better, I can share them privately, or hang on another month or two. :)

I have some draft posts explaining some of this stuff better, I can share them privately, or hang on another month or two. :)

I'd like to see them. I'll wait for the final (posted) versions, I think.