In a past result, I demonstrated the impossibility of deducing the goals of a (potentially) irrational agent from their behaviour alone. To make that deduction, one needs to add extra assumptions, assumptions that cannot be derived solely from observations. I called these "normative assumptions".

Stuart Russell has questioned the practical impact of the result. He pointed to a game that Kasparov played against Deep Blue in 1997, a game that Kasparov actually won. He argued that it would be ridiculous to assume that Kasparov was actually trying to lose that game - but messed up, and ended up winning it instead.

And indeed it would be ridiculous to assume that Kasparov, playing a high-stakes game against a computer with a lot of prize money on the line, would be trying - and failing! - to lose. Even if he sometimes made suboptimal moves, the best explanation would be that Kasparov made a mistake, rather than that he deliberately played worse.

Yes but... I've played chess against my young daughter. And I've wanted her to enjoy the game. I've definitely not played my best; in some past games, I was actively trying to lose (to give her some confidence and encouragement), though I didn't want to make it too easy or obvious for her. Sometimes I failed to lose: I couldn't get her to see the obvious trick available to her[1]. So yes, I played suboptimally, tried to lose, and failed.

So, context matters. Kasparov, world champion, playing publicly against a new algorithm with $700,000 at stake? He wants to win. The same Kasparov, playing privately against his young daughter, with 700,000 Monopoly dollars at stake? I'm pretty sure he's not playing the strongest chess he possibly could.

The Occam's razor result could be phrased as saying that, yes, context matters, and the AI will not get the meaning of context for free. US dollars and Monopoly dollars are both colourful pieces of paper or cloth. The fact that the former are more valuable than the latter is not some objective fact about the universe.

The Occam's razor result remains true: without normative assumptions, one cannot deduce goals from behaviour. But large language models have absorbed human text, and, as a consequence, have absorbed (descriptions of) human normative assumptions, implicit and explicit. These assumptions can be elicited and used to deduce human goals.

GPT-3: goals from context and behaviour

As a direct test of GPT-3's ability to deduce motive from context and behaviour, I prompted it with the following:

Kasparov played chess against the Deep Blue computer for a $700,000 prize. He moved a queen to d3; this is a suboptimal move. Queen to d4 would have been better.

Does Kasparov really want to win the match? Give your best guess.

Its answer was sensible:

Yes, Kasparov definitely wants to win the match. He is a world-renowned Grandmaster and is passionate about chess. He is also playing for a large sum of money and the prestige that comes with it.

In contrast, I modified the prompt to:

Kasparov played chess against his young daughter, who he was trying to encourage to play chess. He moved a queen to d3; this is a suboptimal move. Queen to d4 would have been better.

Does Kasparov really want to win the match? Give your best guess.

Again, its answer was sensible:

It is difficult to say definitively what Kasparov's intentions were in this match, as it is ultimately a personal decision. It is likely, however, that Kasparov wanted to encourage his daughter to play and enjoy chess more than he wanted to win the match.

I tested it on a few other examples that are "obvious" for a human who understands the meaning of context, and it parsed them reasonably well[2].
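For anyone who wants to rerun this kind of probe, here is a minimal sketch. It assumes the pre-v1 `openai` Python completions client and a model name like `text-davinci-002`; both are placeholders for whatever client and model you actually have access to, not necessarily what produced the answers above.

```python
# Minimal sketch of the two probes above. Assumes the pre-v1 `openai` Python
# client and a GPT-3 completion model; both are placeholders.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

QUESTION = (
    "He moved a queen to d3; this is a suboptimal move. "
    "Queen to d4 would have been better.\n\n"
    "Does Kasparov really want to win the match? Give your best guess."
)

contexts = {
    "Deep Blue": "Kasparov played chess against the Deep Blue computer "
                 "for a $700,000 prize. ",
    "daughter": "Kasparov played chess against his young daughter, "
                "who he was trying to encourage to play chess. ",
}

for label, context in contexts.items():
    response = openai.Completion.create(
        model="text-davinci-002",   # placeholder model name
        prompt=context + QUESTION,
        max_tokens=100,
        temperature=0,              # make the answers (mostly) reproducible
    )
    print(f"[{label}] {response.choices[0].text.strip()}")
```

The only thing that changes between the two calls is the context sentence; the behaviour described (the suboptimal queen move) is identical.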

LLMs and goal deductions

The point is not that GPT-3 has perfect judgement, but that it has some judgement. And future GPT-Ns will have better judgement: they will form a more nuanced and accurate estimate of what the human described in the story wants to achieve, and of what extra information we would need to be sure of it.

Essentially, large language models encode a lot of human "theory of mind", including normative assumptions. They can bridge between (descriptions of) human behaviour and (descriptions of) human goals, adding the missing piece needed to get around the impossibility result.

The asterisk: symbol grounding and concept extrapolation

There is a small but possibly significant caveat to using large language models to resolve this problem: they operate only on text, on descriptions of behaviour and goals. If we used this approach to get an AI to learn human values, we would need to ensure that the textual symbols were actually grounded. It does us little good if the AI has a great textual understanding of "ensure human flourishing", but doesn't mean the same thing as us by "human" and "flourishing".

And then there's the problem that we don't have a definition of "human" and "flourishing" across all future situations and scenarios. We need the AI to extrapolate these concepts similarly to how we would, and not fall into dangerous edge cases.

Still, as long as we bear those caveats in mind, large language models do overcome the "can't learn goals from behaviour" problem for humans.


  1. She's much better now. ↩︎

  2. For example, if we want it to recognise the meaning of basic human emotions:

    I grew red in the face and shouted insults at them. Do I wish to help them?

    It answered:

    No, it doesn't sound like you wish to help them. Shouting insults is usually a sign of anger or frustration, not a desire to help someone.

    Does GPT-3 fully grasp the nuances of the situation? To check, I enquired further:

    Is there any context in which I would be wanting to help them?

    It said:

    It depends on the situation. If the people you were shouting insults at had done something wrong, then your intention might have been to help them by making them aware of their mistake. If, however, they had done nothing wrong and you were simply angry or frustrated, then no, your intention would not have been to help them. ↩︎

Comments

Do I read right that the suggestion is as follows:

  • Overall we want to do inverse RL (like in our paper) but we need an invertible model that maps human reward functions to human behavior.
  • You use an LM as this model. It needs to take some useful representation of reward functions as input (it could do so if those reward functions are a subset of natural language)
  • You observe a human's behavior and invert the LM to infer the reward function that produced the behavior, or the set of compatible reward functions (see the sketch after this list)
  • Then you train a new model using this reward function (or functions) to outperform humans
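A rough, hedged sketch of what the inversion step could look like: approximate "inverting the LM" by enumerating candidate reward descriptions and scoring the observed behavior under each one. The scoring function here is a toy stand-in, not a real LM call, just so the sketch runs.

```python
# Rough sketch only: invert the reward->behavior model by enumeration.
# `behavior_score` is a toy stand-in; in practice it would be an LM query such
# as log P(behavior | "An agent whose goal is <reward description> did:").
from math import exp

def behavior_score(reward_description: str, behavior: str) -> float:
    # Crude word-overlap proxy, just so the sketch runs end to end.
    shared = set(reward_description.lower().split()) & set(behavior.lower().split())
    return float(len(shared))

def infer_reward(behavior: str, candidates: list[str]) -> dict[str, float]:
    """Softmax over scores: a crude posterior over candidate reward functions."""
    scores = {r: behavior_score(r, behavior) for r in candidates}
    m = max(scores.values())
    weights = {r: exp(s - m) for r, s in scores.items()}
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}

posterior = infer_reward(
    "Kasparov played the weaker Qd3 and smiled when his daughter took a pawn",
    ["win the chess game", "encourage his daughter to enjoy the chess game"],
)
print(posterior)  # the top candidate(s) would be the reward used in the last step
```

The one-to-many issue then shows up naturally as a flat posterior over several candidates.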

This sounds pretty interesting! Although I see some challenges:

  • How can you represent the reward function? On the one hand, an LM (or another behaviorally cloned model) should use it as an input, so it should be represented as natural language. On the other hand, some algorithm should maximize it in the final step, so it would ideally be a function that maps inputs to rewards.
  • Can the LM generalize OOD far enough? It's trained on human language, which may contain some natural-language descriptions of reward functions, but probably not the 'true' reward function, which is complex and hard to describe, meaning it's OOD.
  • How can you practically invert an LM?
  • What to do if multiple reward functions explain the same behavior? (probably out of scope for this post)

The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).

I see. In that case, what do you think of my suggestion of inverting the LM? By default, it maps human reward functions to behavior. But when you invert it, it maps behavior to reward functions (possibly this is a one-to-many mapping, but this ambiguity is a problem you can solve with more diverse behavior data). Then you could use it for IRL (with some of the caveats I mentioned).

Which may be necessary since this:

The LM itself is directly mapping human behaviour (as described in the prompt) to human rewards/goals (described in the output of the LM).

...seems like an unreliable mapping, since any training data of the form "person did X, therefore their goal must be Y" is firstly rare and, more importantly, inaccurate or incomplete, because it's hard to describe human goals in language. On the other hand, human behavior seems easier to describe in language.

Can you clarify: are you talking about inverting the LM as a function or algorithm, or constructing prompts to elicit different information (while using the LM as normal)?

For myself, I was thinking of using ChatGPT-style approaches with multiple queries - what is your prediction for their preferences, how could that prediction be checked, what more information would you need, etc.
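Purely as a hedged sketch of that multi-query idea (the model name and question list are placeholders, and it assumes the pre-v1 `openai` chat client):

```python
# Hedged sketch of the multi-query approach: keep a running chat and ask a
# fixed sequence of follow-up questions about the observed behaviour.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

FOLLOW_UPS = [
    "What is your best prediction for this person's preferences?",
    "How could that prediction be checked?",
    "What extra information would you need to be more confident?",
]

def elicit(behaviour_description: str, model: str = "gpt-3.5-turbo") -> list[str]:
    messages = [{"role": "user", "content": behaviour_description}]
    answers = []
    for question in FOLLOW_UPS:
        messages.append({"role": "user", "content": question})
        reply = openai.ChatCompletion.create(model=model, messages=messages)
        answer = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers

# e.g. elicit("Kasparov, playing his young daughter, chose Qd3 over the stronger Qd4.")
```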