In this post, I will provide a summary of the paper Misspecification in Inverse Reinforcement Learning, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the third post in the theoretical reward learning sequence, which starts with this post (though the post you are currently reading is self-contained).

This paper won the best paper award at AAAI 2023.

 

Background

Recall that IRL is concerned with inferring what objective an agent is pursuing, based on the actions taken by that agent. It is typically assumed that the behaviour of the observed agent is described by a (stationary) policy π, and that its objectives are described by a reward function R. Therefore, the IRL problem can be formally stated as the problem of inferring a reward function R, based on a policy π which has been computed from R. However, to do this, we must make assumptions about how R relates to π. In other words, we must make assumptions about how a person’s preferences relate to their behaviour. We refer to this as the behavioural model. Of course, in the real world, this relationship can be incredibly complicated and thus difficult (not to say practically impossible) to model perfectly. This means that we should expect this behavioural model to be misspecified (regardless of whether the model is specified manually, produced by machine learning, etc). This raises the concern that IRL algorithms might systematically lead to flawed inferences if applied to real-world data.

One response to this issue is to make behavioural models that are as accurate as possible. However, while a behavioural model can be more or less accurate, it will never be realistically possible to create a behavioural model that is completely free from misspecification (with the possible exception of certain very narrow domains). This means that if we wish to use IRL as a tool for preference elicitation, then it is crucial to understand how robust the IRL problem is to misspecification of the behavioural model. Is it enough to have a behavioural model which is roughly accurate, or do we need a highly accurate behavioural model to ensure accurate inferences? This paper attempts to make some progress towards answering this question.

 

Formalism

To tackle this problem, we must first find a good way to formalise it. 

Given a fixed set of states S and a fixed set of actions A, let ℛ be the set of all reward functions definable over S and A, i.e. all functions R : S × A × S → ℝ, and let Π be the set of all policies defined over S and A, i.e. all functions π : S → Δ(A). We will use the following definition:

Definition: A function f : ℛ → Π is a behavioural model.

For instance, the function that takes a reward R and returns the Boltzmann-rational policy for R (given a temperature β, discount γ, and transition function τ) is an example of a behavioural model (see the main paper for a definition of Boltzmann-rational policies). Note that we implicitly consider the transition function of the underlying environment to be part of the behavioural model — this will make the maths cleaner, and will also make it easier to reason about the case where the transition function is misspecified.
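To make the Boltzmann-rational model more concrete, here is a minimal Python sketch (my own illustration, not code from the paper). It maps a table of Q-values to the corresponding Boltzmann-rational policy; the step of computing the Q-values from a reward R is omitted, and the function name and the choice to pass β explicitly are my own.

```python
import numpy as np

def boltzmann_rational_policy(q_values: np.ndarray, beta: float) -> np.ndarray:
    """Map a Q-function to a Boltzmann-rational policy.

    q_values: array of shape (n_states, n_actions), the optimal Q-values
              for some reward R (computing these from R is omitted here).
    beta:     inverse temperature; higher beta means closer to greedy.
    Returns an array of shape (n_states, n_actions), where row s is the
    action distribution pi(. | s) proportional to exp(beta * Q(s, a)).
    """
    logits = beta * q_values
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy usage: two states, three actions.
Q = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, 1.0]])
print(boltzmann_rational_policy(Q, beta=2.0))
```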

Using this definition, we can now create an abstract model of an IRL algorithm as follows: first, we assume that there is an underlying true reward function R*, and that the observed agent generates its policy using the behavioural model g — this means that the IRL algorithm observes the policy g(R*). Moreover, we assume that the IRL algorithm models the relationship between the observed policy and the underlying reward function using a (possibly different) behavioural model f, such that the IRL algorithm will learn (or converge to) a reward function R_h satisfying f(R_h) = g(R*). If f ≠ g, then f is misspecified, and otherwise f is correctly specified.
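As a toy illustration of this abstract setup (under the assumption that rewards and policies come from small finite sets, which is not how real IRL algorithms work), an "idealised" IRL algorithm can be pictured as returning any reward whose predicted policy matches the observed one:

```python
def idealised_irl(observed_policy, candidate_rewards, f):
    """Return all candidate rewards R_h with f(R_h) == observed_policy.

    f is the behavioural model assumed by the IRL algorithm; the observed
    policy is assumed to have been generated as g(R_star) by the (possibly
    different) true behavioural model g.
    """
    return [R for R in candidate_rewards if f(R) == observed_policy]

# Toy usage, with rewards and policies represented as opaque labels:
f = {"R1": "pi_a", "R2": "pi_a", "R3": "pi_b"}.get
print(idealised_irl("pi_a", ["R1", "R2", "R3"], f))  # -> ['R1', 'R2']
```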

Intuitively, we want to say that a behavioural model f is robust to a given form of misspecification (here represented by g) if an IRL algorithm based on f is guaranteed to learn a reward function that is “close enough” to the true reward function when trained on data generated under that type of misspecification (i.e., data generated from g). To make this formal, we first need a definition of what it means for two reward functions to be “close enough”. In this paper, we formalise this in terms of equivalence relations on ℛ. That is, we assume that we have a partition P of ℛ (which of course corresponds to an equivalence relation ≡_P), and say that the learnt reward function R_h is “close enough” to the true reward function R* if R_h ≡_P R*. (We will for now leave it open which partition P to pick, and return to that question later.)

We will also use the following definition:

Definition: Given a partition P of ℛ, we say that a behavioural model f is P-admissible if, for all R_1, R_2 ∈ ℛ, if f(R_1) = f(R_2) then R_1 ≡_P R_2.

Note that the definition of P-admissibility is equivalent to saying that an IRL algorithm based on f is guaranteed to converge to a reward function that is P-equivalent to the underlying true reward function, when trained on data that is generated via the behavioural model f itself (i.e., when there is no misspecification). That is, there is an underlying true reward R*, and the IRL algorithm observes the policy f(R*). It then learns (or converges to) a reward R_h whose predicted policy matches the observed policy. In other words, we have that f(R_h) = f(R*). The definition of P-admissibility requires that this implies that R_h ≡_P R* — in other words, that the learnt reward is P-equivalent to the underlying true reward.

Given this, we can now give a formal definition of misspecification robustness:

Definition: Given a partition P of ℛ and two behavioural models f, g : ℛ → Π, we say that f is P-robust to misspecification with g if

  • f is P-admissible,
  • f ≠ g,
  • Im(g) ⊆ Im(f), and
  • for all R_1, R_2 ∈ ℛ, if f(R_1) = g(R_2) then R_1 ≡_P R_2.

Let me briefly unpack this definition. The last condition — which is the most important — says that if f is P-robust to misspecification with g, then any IRL algorithm based on f must be guaranteed to converge to a reward function that is P-equivalent to the true reward function when trained on data generated by g. The requirement that f is P-admissible is included to rule out some uninteresting edge cases. The requirement that Im(g) ⊆ Im(f) ensures that the learning algorithm never observes any data that is impossible according to its model. For example, suppose that an IRL algorithm which assumes that the observed policy must be deterministic is given training data from a nondeterministic policy. How will the IRL algorithm react? This is underspecified, absent any further details about the algorithm. The requirement that Im(g) ⊆ Im(f) rules out such cases, which seems reasonable if we want to make our analysis as general as possible. Finally, the requirement that f ≠ g simply ensures that f in fact is misspecified (this is mainly included to make the terminology more intuitive). This is the main definition used in the paper.
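For small finite toy examples, these four conditions can be checked directly by brute force. The sketch below is only meant to restate the definition in code; representing rewards, policies, and the P-equivalence test as plain Python objects is my own assumption.

```python
from itertools import product

def is_admissible(f, rewards, same_class):
    """P-admissibility: f(R1) == f(R2) must imply that R1 and R2 are P-equivalent."""
    return all(same_class(R1, R2)
               for R1, R2 in product(rewards, repeat=2)
               if f(R1) == f(R2))

def is_robust(f, g, rewards, same_class):
    """Check the four conditions of 'f is P-robust to misspecification with g'
    by brute force, for a finite toy set of candidate rewards."""
    admissible = is_admissible(f, rewards, same_class)
    differs = any(f(R) != g(R) for R in rewards)                    # f != g
    image_ok = {g(R) for R in rewards} <= {f(R) for R in rewards}   # Im(g) within Im(f)
    inference_ok = all(same_class(R1, R2)
                       for R1, R2 in product(rewards, repeat=2)
                       if f(R1) == g(R2))
    return admissible and differs and image_ok and inference_ok
```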

This definition of misspecification robustness is given relative to an equivalence relation on ℛ. In this paper, we use two equivalence relations; the first says that two reward functions are equivalent if they have the same optimal policies (in a given environment), and the second says that they are equivalent if they have the same ordering of policies (in a given environment). We denote the first equivalence relation by OPT, and the second by ORD. The first criterion should be sufficient if we are able to compute optimal policies (though in practice, we typically cannot do this). The second criterion is much stronger, and should be sufficient in general (though it may be stronger than necessary). Neither option is a fully accurate formalisation of what we actually care about, but I believe that they are sufficiently close for us to get useful results out of them.
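Roughly (this is my paraphrase; see the paper for the precise statements), the two relations can be written as follows, where J_R(π) denotes the expected discounted return of policy π under reward R in the given environment:

```latex
% OPT: same set of optimal policies.  ORD: same ordering of policies.
R_1 \equiv_{\mathrm{OPT}} R_2
  \;\iff\;
  \operatorname*{arg\,max}_{\pi \in \Pi} J_{R_1}(\pi)
  = \operatorname*{arg\,max}_{\pi \in \Pi} J_{R_2}(\pi)

R_1 \equiv_{\mathrm{ORD}} R_2
  \;\iff\;
  \bigl( J_{R_1}(\pi) \le J_{R_1}(\pi')
         \iff J_{R_2}(\pi) \le J_{R_2}(\pi') \bigr)
  \quad \text{for all } \pi, \pi' \in \Pi
```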

These definitions now give us a precise formalisation of the problem we are trying to solve. This formalisation seems quite reasonable, and as we will soon see, it has enough structure to enable us to do quite a bit of interesting analysis. It has some limitations, however, the main one being that it is based on equivalence relations, according to which two reward functions are either equivalent or not. This means that this definition cannot distinguish between small and large errors in the learnt reward. Later on, we will generalise the definition to get around this (and other) limitations.

 

General Results

To start with, the paper introduces a few very general results, which help to provide some additional insight into our problem setting. Most of these statements are easy to prove, but they are still worth spelling out. Some highlights include:

Lemma 1: If f is P-robust to misspecification with g, then g is P-admissible.

It may be easier to understand this statement by considering the contrapositive: if g isn’t P-admissible, then no behavioural model can be P-robust to misspecification with g. Stated differently, if data from g is insufficient for identifying the P-class of the underlying reward function when there is no misspecification, then we are also unable to identify the correct P-class from this data if we use a misspecified behavioural model. This means that we can never gain anything from misspecification.

Lemma 2: If f is P-robust to misspecification with g, and Im(f) ⊆ Im(g), then g is P-robust to misspecification with f.

This lemma means that misspecification robustness will be symmetric in many typical circumstances. For example, if f and g are surjective (which might anyway be desirable), then f is P-robust to misspecification with g if and only if g is P-robust to misspecification with f.

Lemma 3: f is P-admissible, but not P-robust to misspecification with any g, if and only if we have f(R_1) = f(R_2) iff R_1 ≡_P R_2 (for all R_1, R_2).

This lemma has a few interesting implications. First of all, note that it means that we should expect most well-behaved behavioural models to be robust to some forms of misspecification (assuming we don’t have f(R_1) = f(R_2) exactly when R_1 ≡_P R_2). Moreover, it also suggests that behavioural models that are less ambiguous are also more robust to misspecification, and vice versa. One way to interpret this is to note that if f is P-admissible, but there are R_1, R_2 such that R_1 ≡_P R_2 but f(R_1) ≠ f(R_2), then f is sensitive to properties of R that are irrelevant from the point of view of P. Informally, we may then expect f to be robust to misspecification with g if f and g only differ in terms of such “irrelevant details”. This intuition is also captured by the following lemma:

Lemma 4: Let f be P-admissible, and let T_P be the set of all transformations t : ℛ → ℛ that preserve P (i.e., such that t(R) ≡_P R for all R). Then f is P-robust to misspecification with g if and only if g = f ∘ t for some t ∈ T_P, unless g = f.

This lemma is the most important of the above, because it provides us with an easy method for deriving necessary and sufficient conditions that completely describe what forms of misspecification any given behavioural model f is robust to. In particular, given an equivalence relation ≡_P, if we can find the set T_P of all functions t : ℛ → ℛ such that t(R) ≡_P R for all R, then we can completely characterise the misspecification robustness of any behavioural model f by simply composing f with each element of T_P. We will later use this method to characterise the misspecification robustness of several important behavioural models. While it isn’t necessarily too hard to prove, I think Lemma 4 is one of the nicest results in this paper.
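In code terms, the recipe from Lemma 4 looks something like the following toy sketch (again only for finite sets of rewards, with names of my own choosing): enumerate the P-preserving transformations, compose the behavioural model with each of them, and keep the compositions that differ from the original model.

```python
def robustness_set(f, transformations, rewards):
    """Lemma 4 as a recipe: the behavioural models that f is robust to
    misspecification with are exactly the compositions f ∘ t (for
    P-preserving t) that differ from f. Models are compared by their
    behaviour on the finite set of candidate rewards."""
    def table(h):
        return tuple(h(R) for R in rewards)

    return [lambda R, t=t: f(t(R))          # the composed model f ∘ t
            for t in transformations
            if table(lambda R: f(t(R))) != table(f)]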

Another important result we prove (and which is much more complicated than the above) is the following:

Theorem 1: Two reward functions R_1 and R_2 have the same ordering of policies if and only if they differ by a combination of potential shaping, S’-redistribution, and positive linear scaling.

(For an exact formal version of this statement, see the main paper.)

Not only is this theorem interesting in its own right, we can also combine it with Lemma 4 to easily derive what forms of misspecification different behavioural models are ORD-robust to!
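Of the three transformations in Theorem 1, potential shaping and positive linear scaling are easy to write down concretely (S’-redistribution also depends on the transition function, so I omit it here). Below is a sketch for tabular rewards of shape (states, actions, next states), with an arbitrary potential function and scaling constant; the representation is my own choice.

```python
import numpy as np

def potential_shaping(R, phi, gamma):
    """R'(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s).

    R is a tabular reward of shape (n_states, n_actions, n_states),
    phi is a potential vector of shape (n_states,), gamma is the discount.
    """
    return R + gamma * phi[None, None, :] - phi[:, None, None]

def positive_linear_scaling(R, c):
    """R'(s, a, s') = c * R(s, a, s') for some constant c > 0."""
    assert c > 0
    return c * R
```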

 

Results for Specific Behavioural Models

Given these general tools, we can now apply them to a few behavioural models. In the IRL literature, most IRL algorithms assume that the observed policy is optimal, Boltzmann-rational, or causal entropy maximising (if you are not familiar with these, you can find formal definitions in the main paper). As such, we refer to these as the “standard” behavioural models. In the paper, we provide necessary and sufficient conditions that completely describe all forms of misspecification that these behavioural models are (OPT- or ORD-) robust to. These results can be summarised as follows:

Theorem 2: Let B be the set of all behavioural models that generate policies which take each action with positive probability, and which take the actions with the highest Q-values with the highest probability. Then the Boltzmann-rational behavioural model b is OPT-robust to misspecification with g if and only if g ∈ B and g ≠ b.

Theorem 3: The Boltzmann-rational behavioural model is ORD-robust to misspecification of the temperature parameter, and no other misspecification.

Theorem 4: The optimality behavioural model is not (OPT- or ORD-) robust to any misspecification.

Theorem 5: The maximal causal entropy behavioural model is ORD-robust to misspecification of the weight parameter, and no other misspecification. It is OPT-robust to misspecification with g if g is such that, whenever g(R_1) = π, there is an R_2 such that π maximises causal entropy with respect to R_2, and such that R_1 and R_2 have the same optimal policies.

Again, for more exact formal statements, and proofs, please see the main paper. As we can see, each of the standard behavioural models is robust to some forms of misspecification. For example, if an IRL algorithm which assumes that the observed policy is Boltzmann-rational is trained on data generated from an epsilon-greedy policy, then it is guaranteed to converge to a reward function that has the same optimal policies as the ground-truth reward function. Similarly, if it is trained on data that is generated from a Boltzmann-rational policy, but with a temperature parameter that is different from the one assumed by the IRL algorithm, then the IRL algorithm is still guaranteed to converge to a reward function that has the same ordering of policies as the ground-truth reward.
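For instance, an epsilon-greedy policy with 0 < ε < 1 takes every action with positive probability and puts the most probability on the actions with the highest Q-values, so it satisfies the conditions of Theorem 2. Here is a sketch of such a data-generating model, under the same assumptions as the earlier Boltzmann sketch (Q-values given as a table, names my own):

```python
import numpy as np

def epsilon_greedy_policy(q_values: np.ndarray, epsilon: float) -> np.ndarray:
    """Epsilon-greedy behavioural model: with probability (1 - epsilon) pick
    uniformly among the Q-maximising actions, otherwise act uniformly at
    random. For 0 < epsilon < 1, every action has positive probability, and
    the highest-Q actions get the highest probability."""
    n_states, n_actions = q_values.shape
    greedy = (q_values == q_values.max(axis=1, keepdims=True)).astype(float)
    greedy /= greedy.sum(axis=1, keepdims=True)
    uniform = np.full((n_states, n_actions), 1.0 / n_actions)
    return (1 - epsilon) * greedy + epsilon * uniform
```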

Using the tools introduced above, similar results could also be derived for new behavioural models.

 

Misspecified Parameters

Looking at the results above, it is interesting to note that none of the standard behavioural models are (OPT- or ORD-) robust to any misspecification of the discount γ, or of the assumed transition function τ. In the paper, we show that this in fact holds for a very broad class of behavioural models (and, in particular, it should be expected to hold for many other behavioural models besides the three standard ones).

First of all, say that a transition function τ is trivial if τ(s, a_1) = τ(s, a_2) for all states s and all actions a_1, a_2. In other words, if τ is trivial, then the actions that you take can never influence the probability of transitioning to different states in the next step. Basically all interesting transition functions are non-trivial. We then have the following result:
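As a quick illustration (with the transition function represented as a tabular array of shape (states, actions, next states), which is my representation rather than the paper's), triviality simply means that the next-state distribution does not depend on the action:

```python
import numpy as np

def is_trivial(tau: np.ndarray, atol: float = 1e-8) -> bool:
    """A transition function tau(s, a, s') is trivial if, for every state s,
    the distribution over next states is the same for all actions a."""
    return bool(np.all(np.abs(tau - tau[:, :1, :]) <= atol))
```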

Theorem 6: Let f be a behavioural model that is invariant to potential shaping (defined relative to a discount γ), and let g be a behavioural model that is invariant to potential shaping relative to a different discount γ' ≠ γ. Then f is not ORD- or OPT-robust to misspecification with g, unless the transition function is trivial.

Many reasonable behavioural models should be expected to be invariant to potential shaping (see this post), and virtually all environments have non-trivial transition dynamics. This theorem will therefore hold very widely. We also have the following result:

Theorem 7: Let f be a behavioural model that is invariant to S’-redistribution (defined relative to a transition function τ), and let g be a behavioural model that is invariant to S’-redistribution relative to a different transition function τ' ≠ τ. Then f is not ORD- or OPT-robust to misspecification with g.

Again, basically all reasonable behavioural models will be invariant to S’-redistribution, so this result should also be expected to hold (at least approximately) for a very wide class of behavioural models (including behavioural models that are learnt from data). See this post for more details.

I originally found these two results to be very unintuitive. For a more intuitive explanation of why these two theorems are true, see Appendix B.2 of this paper.

 

Prior Knowledge and Inductive Bias

The above results do not make any assumptions about the underlying true reward function R*, nor do they make any assumptions about the inductive bias of the learning algorithm. Could the above results change if we incorporate such assumptions? In the paper, we show that this is largely not the case — adding restrictions on R*, or information about the inductive bias of the learning algorithm, will not change any of the results by a substantial amount. For an exact formal version of this statement, see the main paper.

 

Conclusions

In this paper, we carry out a theoretical analysis of how robust IRL is to misspecification of the behavioural model. We find a good way to formalise this question, and use this formalism to derive several interesting results. In particular, we derive necessary and sufficient conditions that characterise what forms of misspecification are tolerated by each of the behavioural models that are most common in the current IRL literature. We also derive a number of results that are highly general, and are likely to apply to broad classes of behavioural models (including e.g. behavioural models that are created using ML). 

When it comes to the qualitative question of how robust IRL is to misspecification, this paper essentially says that IRL can be surprisingly robust to some forms of misspecification, and surprisingly sensitive to other forms of misspecification. For example, it seems like the Boltzmann-rational behavioural model gives reasonable inferences in a fairly broad range of situations. However, it also seems like very broad classes of behavioural models (including Boltzmann-rational models) should be expected to be very sensitive to any amount of misspecification of the discount parameter or transition function.

The main limitation of the analysis in this paper is, in my opinion, that it is entirely based on equivalence relations. This means that it cannot distinguish between small and large errors in the learnt reward function. To alleviate this, we must first develop a good way to continuously quantify how “similar” two reward functions are. This problem will be discussed in the next post in this sequence.

 

If you have any questions, please feel free to ask them in the comments!
