In this post, I will provide a summary of the paper Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the fifth post in the theoretical reward learning sequence, which starts in this post. This post is somewhat self-contained, but I will largely assume that you have read this post and this post before reading this one.
In Misspecification in Inverse Reinforcement Learning (also discussed in this post), I attempt to analyse how sensitive IRL is to misspecification of the behavioural model. The main limitation of this analysis is that it is based on equivalence relations – that is, it only distinguishes between the case where the learnt reward function is equivalent to the ground truth reward and the case where it is not (for some specific ways of defining this equivalence). This means that it cannot distinguish between small and large errors in the learnt reward. Quantifying the differences between reward functions is nontrivial — to solve this, I developed STARC metrics, which are described in this post. In Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification, which I’m summarising in this post, I extend the analysis in Misspecification in Inverse Reinforcement Learning using STARC metrics.
Formalism
We must first modify the definition in Misspecification in Inverse Reinforcement Learning to make use of pseudometrics on R. This is straightforward: roughly speaking, we say that f is ϵ-robust to misspecification with g if f≠g and, for all R1,R2, if f(R1)=g(R2) then d(R1,R2)≤ϵ (where d is the chosen pseudometric on R). In other words, if the observed policy is computed from the true reward using g, but the IRL algorithm assumes that it was computed using f, then the learnt reward is guaranteed to be within a distance of at most ϵ of the true reward. This definition is directly analogous to that given in Misspecification in Inverse Reinforcement Learning (and in this post).
Some of the results in this paper apply to any pseudometric on R, but sometimes, we will have to use a specific pseudometric. In those cases, I will use the STARC metric that normalises and measures the distance using the L2-norm, canonicalises using the canonicalisation function that is minimal for the L2-norm, and divides the resulting number by 2 (to ensure that the distance is normalised to lie between 0 and 1). The reason for this is primarily that this STARC metric is fairly easy to work with theoretically (but note that all STARC metrics are bilipschitz equivalent, so this choice is not very consequential). I will refer to this pseudometric as dSTARC.
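To make this concrete, here is a minimal numerical sketch (mine, not code from the paper) of how dSTARC can be computed for a small finite MDP. It assumes that the minimal-L2 canonicalisation can be realised as the orthogonal projection onto the complement of the subspace spanned by potential shaping and S'-redistribution directions; all function names are made up for the illustration.

```python
# A minimal sketch of the dSTARC pseudometric described above, for a finite MDP.
# Rewards are arrays of shape (S, A, S), i.e. R[s, a, s'].
import numpy as np

def shaping_redistribution_basis(tau, gamma):
    """Columns span the subspace of reward differences that potential shaping and
    S'-redistribution can produce (and which the canonicalisation removes)."""
    S, A, _ = tau.shape
    dims = S * A * S
    cols = []
    # Potential shaping directions: gamma*phi(s') - phi(s), with phi the indicator of state i.
    for i in range(S):
        d = np.zeros((S, A, S))
        d[:, :, i] += gamma   # gamma * [s' == i]
        d[i, :, :] -= 1.0     # -[s == i]
        cols.append(d.reshape(dims))
    # S'-redistribution directions: all u with E_{s' ~ tau(s,a)}[u(s,a,s')] = 0,
    # i.e. the null space of the conditional-expectation operator M below.
    M = np.zeros((S * A, dims))
    for s in range(S):
        for a in range(A):
            row = np.zeros((S, A, S))
            row[s, a, :] = tau[s, a, :]
            M[s * A + a] = row.reshape(dims)
    _, sing, Vt = np.linalg.svd(M)
    rank = int((sing > 1e-10).sum())
    cols.extend(Vt[rank:])           # rows of Vt beyond the rank span the null space
    return np.array(cols).T          # shape (dims, number of directions)

def canonicalise(R, tau, gamma):
    """Minimal-L2 canonicalisation (as assumed here): project R orthogonally onto the
    complement of the shaping/redistribution subspace, i.e. take the least-norm
    element of R's equivalence class."""
    B = shaping_redistribution_basis(tau, gamma)
    r = R.reshape(-1)
    coef, *_ = np.linalg.lstsq(B, r, rcond=None)
    return r - B @ coef

def starc_distance(R1, R2, tau, gamma):
    """dSTARC: canonicalise both rewards, normalise them to unit L2-norm, take the
    L2-distance between the results, and divide by 2."""
    c1, c2 = canonicalise(R1, tau, gamma), canonicalise(R2, tau, gamma)
    n1, n2 = np.linalg.norm(c1), np.linalg.norm(c2)
    # A reward whose canonical form is zero makes every policy equally good; one
    # convention (assumed here) is to treat its normalised form as the zero vector.
    u1 = c1 / n1 if n1 > 1e-12 else np.zeros_like(c1)
    u2 = c2 / n2 if n2 > 1e-12 else np.zeros_like(c2)
    return 0.5 * np.linalg.norm(u1 - u2)

# A tiny random MDP to exercise the functions.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
tau = rng.random((S, A, S))
tau /= tau.sum(axis=-1, keepdims=True)
R = rng.standard_normal((S, A, S))
phi = rng.standard_normal(S)
R_shaped = R + gamma * phi[None, None, :] - phi[:, None, None]   # potential shaping
print(starc_distance(R, R_shaped, tau, gamma))   # ~0: shaping preserves the policy order
```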
Note that two reward functions R1,R2 have the same policy order if and only if d(R1,R2)=0, provided that d is a STARC metric (including dSTARC). This means that if f is ORD-robust to misspecification with g (in the terminology used in this post), then f is 0-robust to misspecification with g (and thus ϵ-robust for each ϵ≥0) using the terminology above (if d is any STARC metric). The results from Misspecification in Inverse Reinforcement Learning thus carry over directly to this setting, although we may also be able to derive some additional, more permissive, results.
Necessary and Sufficient Conditions
We can first use the above definition to derive necessary and sufficient conditions that completely describe all forms of misspecification that some behavioural models are robust to:
Theorem: Suppose a behavioural model f:R→Π satisfies that if f(R1)=f(R2), then d(R1,R2)=0. Then f is ϵ-robust to misspecification with g if and only if g=f∘t for some reward transformation t such that d(R,t(R))≤ϵ for all R, and f≠g.
For a proof, see the main paper. This theorem requires that f(R1)=f(R2) implies that d(R1,R2)=0, which is somewhat restrictive. Unfortunately, this requirement can’t be removed without making the theorem much more complex. Fortunately, if d is a STARC metric, and f is either the Boltzmann-rational behavioural model or the maximal causal entropy behavioural model, then this condition is satisfied. To see this, note that if f is either of these two behavioural models, and f(R1)=f(R2), then R1 and R2 differ by potential shaping and S'-redistribution (see this post). Moreover, both of these reward transformations preserve the ordering of all policies, and any STARC metric satisfies that d(R1,R2)=0 if and only if R1 and R2 have the same policy order (see this post). We can therefore use the above theorem to fully characterise all forms of misspecification that these behavioural models will tolerate. To do this, we first also need the following:
Theorem: A transformation t:R→R satisfies that dSTARC(R,t(R))≤ϵ for all R if and only if t can be expressed as t1∘t2∘t3, where t1 and t3 are given by some combination of potential shaping, S’-redistribution, and positive linear scaling, and t2 satisfies that
||R − t2(R)||2 ≤ ||c(R)||2 ⋅ sin(2 arcsin(ϵ/2))
for all R, where c is the canonicalisation function used by dSTARC.
Thus, if f is either the Boltzmann-rational or the maximal causal entropy behavioural model, then f is ϵ-robust to misspecification with g if and only if g=f∘t for some reward transformation t that satisfies the conditions above. Unfortunately, this condition is very opaque, and not very easy to interpret intuitively. For that reason, we will also examine a few specific types of misspecification more directly.
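Before doing so, here is a quick numerical sanity check (mine, not code from the paper) of the invariance claims used above: a Boltzmann-rational behavioural model π(a|s) ∝ exp(β·Q*(s,a)) on a small random MDP, together with a check that potential shaping and S'-redistribution leave the resulting policy unchanged. The MDP, β, and all function names are made up for the illustration.

```python
# Check that a Boltzmann-rational behavioural model is invariant to potential shaping
# and S'-redistribution on a small random MDP.
import numpy as np

def boltzmann_policy(R, tau, gamma, beta=5.0, iters=2000):
    """pi(a|s) proportional to exp(beta * Q*(s,a)), with Q* from value iteration."""
    expected_R = (tau * R).sum(axis=-1)               # E_{s' ~ tau(s,a)}[R(s,a,s')]
    Q = np.zeros(expected_R.shape)
    for _ in range(iters):
        Q = expected_R + gamma * (tau @ Q.max(axis=-1))
    logits = beta * Q
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9
tau = rng.random((S, A, S))
tau /= tau.sum(axis=-1, keepdims=True)
R = rng.standard_normal((S, A, S))

# Potential shaping: R'(s,a,s') = R(s,a,s') + gamma*phi(s') - phi(s).
phi = rng.standard_normal(S)
R_shaped = R + gamma * phi[None, None, :] - phi[:, None, None]

# S'-redistribution: perturb R while preserving E_{s' ~ tau(s,a)}[R(s,a,s')].
noise = rng.standard_normal((S, A, S))
noise -= (tau * noise).sum(axis=-1, keepdims=True)    # make the conditional mean zero
R_redistributed = R + noise

pi = boltzmann_policy(R, tau, gamma)
assert np.allclose(pi, boltzmann_policy(R_shaped, tau, gamma))          # same policy
assert np.allclose(pi, boltzmann_policy(R_redistributed, tau, gamma))   # same policy
print("Boltzmann-rational policy unchanged by shaping and S'-redistribution")
```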
Misspecified Parameters
A very interesting result from Misspecification in Inverse Reinforcement Learning is that almost no behavioural model is robust to any misspecification of the discount parameter γ or the transition function τ. An interesting question is whether this is an artifact of that analysis being based on equivalence relations, rather than metrics. As it turns out, however, this result generalises directly to the setting where we use metrics instead of equivalence relations. We say that a transition function τ is “trivial” if τ(s,a1)=τ(s,a2) for all states s and actions a1,a2 (i.e., basically, if the action you take never matters); all interesting environments have non-trivial transition functions. In the theorems below, fτ1 and fτ2 denote the same behavioural model parameterised by two different transition functions τ1 and τ2, and fγ1 and fγ2 likewise denote the same behavioural model parameterised by two different discount parameters γ1 and γ2:
Theorem: If fτ is invariant to S’-redistribution, and τ1≠τ2, then fτ1 is not ϵ-robust to misspecification with fτ2 for any ϵ<0.5.
Theorem: If fγ is invariant to potential shaping, γ1≠γ2, and the underlying transition function τ is non-trivial, then fγ1 is not ϵ-robust to misspecification with fγ2 for any ϵ<0.5.
These results assume that we quantify the error in the learnt reward using dSTARC. This pseudometric ranges from 0 to 1, so a dSTARC-distance of 0.5 would be extremely large. Moreover, a wide range of behavioural models should be expected to be invariant to S’-redistribution and potential shaping (see this post). In other words, these results say that most sensible behavioural models (including all behavioural models used by contemporary IRL algorithms, and potentially including behavioural models learnt using machine learning) should be expected not to be robust to even arbitrarily small misspecification of the discount factor or the transition function. This is a very damning result! A more intuitive explanation of why these theorems are true is provided in Appendix B2 of this paper.
Perturbation Robustness
Another form of misspecification we look at in more detail is what we call perturbation robustness. The motivation for this is that it is interesting to know whether or not a behavioural model f is robust to misspecification with any behavioural model g that is “close” to f. But what does it mean for f and g to be “close”? One option is to say that f and g are close if they always produce similar policies, where the “similarity” between two policies is measured using some (pseudo)metric. As such, we define a notion of a perturbation and a notion of perturbation robustness:
Definition: Let f,g:R→Π be two behavioural models, and let dΠ be a pseudometric on Π. Then g is a δ-perturbation of f if g≠f and for all R, we have that dΠ(f(R),g(R))≤δ.
Definition: Let f:R→Π be a behavioural model, let dR be a pseudometric on R, and let dΠ be a pseudometric on Π. Then f is ϵ-robust to δ-perturbation if f is ϵ-robust to misspecification with g (as defined by dR) for any behavioural model g that is a δ-perturbation of f (as defined by dΠ) with Im(g)⊆Im(f).
These definitions are given relative to a pseudometric dΠ on the set of all policies Π. For example, dΠ(π1,π2) could be the L2-distance between π1 and π2, or it may be the KL divergence between their trajectory distributions, etc. As usual, our results apply for any choice of dΠ unless otherwise stated.
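For concreteness, here is a hedged sketch of two such policy-comparison functions for finite policies represented as (S, A) arrays of action probabilities. The names are my own, and the state-wise KL is only a simplified stand-in for the trajectory-level KL mentioned above (and is a divergence rather than a pseudometric, since it is not symmetric).

```python
# Two example ways of comparing policies, as mentioned above.
import numpy as np

def l2_policy_distance(pi1, pi2):
    """Plain L2-distance between two policies, viewed as (S, A) arrays."""
    return float(np.linalg.norm(pi1 - pi2))

def mean_statewise_kl(pi1, pi2, eps=1e-12):
    """KL(pi1(.|s) || pi2(.|s)), averaged over states. The trajectory-level KL mentioned
    above would additionally need the transition function and initial-state distribution."""
    p = np.clip(pi1, eps, 1.0)
    q = np.clip(pi2, eps, 1.0)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())
```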
Now, a δ-perturbation of f is simply any function that is similar to f on all inputs, and f is ϵ-robust to δ-perturbation if perturbing the observed policies by at most δ leads to an error of at most ϵ in the inferred reward function. We also need one more definition:
Definition: Let f:R→Π be a behavioural model, let dR be a pseudometric on R, and let dΠ be a pseudometric on Π. Then f is ϵ/δ-separating if dR(R1,R2)>ϵ⟹dΠ(f(R1),f(R2))>δ for all R1,R2.
Intuitively speaking, f is ϵ/δ-separating if reward functions that are far apart are sent to policies that are far apart. Using this, we can now state the following result:
Theorem: Let f:R→Π be a behavioural model, let dR be a pseudometric on R, and let dΠ be a pseudometric on Π. Then f is ϵ-robust to δ-perturbation (as defined by dR and dΠ) if and only if f is ϵ/δ-separating (as defined by dR and dΠ).
This gives us necessary and sufficient conditions that describe when a behavioural model is robust to perturbations: namely, the behavioural model has to send reward functions that are far apart to policies that are far apart. This ought to be quite intuitive; if two policies are close, then perturbations may lead us to conflate them. To be sure that the learnt reward function is close to the true reward function, we therefore need policies that are close to always correspond to reward functions that are close.
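To make the notion of ϵ/δ-separation a bit more operational, here is a hedged sketch of how one might search for counterexamples empirically: sample pairs of reward functions and check whether some pair is far apart under dR while being mapped to nearby policies. Random search can only falsify separation, never establish it; all names (and the toy stand-ins at the end) are mine.

```python
# Empirically searching for a witness that f is NOT eps/delta-separating.
import numpy as np

def find_separation_counterexample(f, d_R, d_Pi, sample_reward, eps, delta,
                                   n_pairs=1000, seed=0):
    """Sample reward pairs and look for one that is more than eps apart under d_R but
    whose policies are within delta under d_Pi. Finding none proves nothing, of course."""
    rng = np.random.default_rng(seed)
    for _ in range(n_pairs):
        R1, R2 = sample_reward(rng), sample_reward(rng)
        if d_R(R1, R2) > eps and d_Pi(f(R1), f(R2)) <= delta:
            return R1, R2
    return None

# Toy usage with stand-ins, just to show the interface (these are not meaningful
# behavioural models or reward metrics):
toy_f = lambda R: np.full(3, R.mean())                 # a deliberately "flattening" model
toy_d = lambda x, y: float(np.linalg.norm(x - y))
sample = lambda rng: rng.standard_normal(4)
print(find_separation_counterexample(toy_f, toy_d, toy_d, sample, eps=1.0, delta=0.1))
```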
Our next question is, of course, whether or not the standard behavioural models are ϵ/δ-separating. Surprisingly, we will show that this is not the case when the distance between reward functions is measured using dSTARC and the policy metric dΠ is similar to the Euclidean distance. Moreover, this holds for any continuous behavioural model:
Theorem: Let dR be dSTARC, and let dΠ be a pseudometric on Π which satisfies the condition that for all δ there exists a δ′ such that if ||π1 − π2||2 < δ′ then dΠ(π1,π2) < δ. Let f be a continuous behavioural model. Then f is not ϵ/δ-separating for any ϵ < 1 and δ > 0.
To make things easy, we can just let dΠ be the L2-distance (the theorem just generalises this somewhat). The theorem then tells us that no continuous behavioural model is ϵ/δ-separating (and therefore also not ϵ-robust to δ-perturbation) for any ϵ < 1 and δ > 0. The fundamental reason for this is that if f is continuous, then it must send reward functions that are close under the L2-norm to policies that are close under the L2-norm. However, there are reward functions that are close under the L2-norm but which have a large STARC distance. Hence f will send some reward functions that are far apart (under dSTARC) to policies which are close.
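To see this concretely, the following sketch scales a reward function towards zero and compares it with its negation: the L2-distance between the two rewards goes to zero, while their dSTARC-distance stays at its maximal value of 1. It reuses the starc_distance function from the earlier sketch, which I assume here has been saved as a module called starc_sketch.py (a filename of my own invention).

```python
# Rewards that are close in L2-norm can still be maximally far apart under dSTARC.
import numpy as np
from starc_sketch import starc_distance   # the sketch from earlier, saved as a module

rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
tau = rng.random((S, A, S))
tau /= tau.sum(axis=-1, keepdims=True)
R = rng.standard_normal((S, A, S))

for alpha in [1.0, 1e-3, 1e-6]:
    R1, R2 = alpha * R, -alpha * R             # R2 reverses the policy order of R1
    l2 = np.linalg.norm(R1 - R2)               # goes to 0 as alpha goes to 0
    d = starc_distance(R1, R2, tau, gamma)     # stays at 1, the maximal value
    print(f"alpha={alpha:g}  L2 distance={l2:.2e}  dSTARC={d:.3f}")
```

Any continuous f must therefore map R1 and R2 to nearby policies once α is small, even though their dSTARC-distance is maximal.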
Conclusion
We can see that the results in this paper are similar in spirit to those provided in Misspecification in Inverse Reinforcement Learning (also discussed in this post). In other words, while it is more restrictive to use equivalence relations rather than pseudometrics on R, the same basic mathematical structure emerges in both cases.
The main question behind this research was whether or not IRL is robust to moderate misspecification of the behavioural model used by the IRL algorithm — that is, does a small error in the assumptions underlying the IRL algorithm lead to a small error in the learnt reward? To me, it looks like the answer to this question is likely to be negative. In particular, we have seen that an arbitrarily small error in the discount factor can lead to large errors in the learnt reward function. Of course, this answer is not fully conclusive. In particular, some of these results (but not all) are based on the behavioural models that are used by current IRL algorithms, and these are very unrealistic (when seen as models of human behaviour) – it may be interesting to extend this analysis to more realistic models of human behaviour (which I have partially done in this paper, for example). Nonetheless, I would not currently put much hope in IRL (even if IRL is amplified with the help of new AI breakthroughs, etc).
In the next post of this sequence, I will discuss reward hacking, and provide some alternative results regarding how to compare reward functions.
If you have any questions, then please let me know in the comments!