In this post, I will provide a summary of the paper STARC: A General Framework For Quantifying Differences Between Reward Functions, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the fourth post in the theoretical reward learning sequence, which starts in this post (though this post is self-contained).
In this paper, we consider the question of how to quantify the distance between reward functions in an informative way. That is, we want to find a function d:R×R→R, where R is the space of all reward functions, such that d(R1,R2) is a meaningful quantification of how similar R1 and R2 are. This is important for the (theoretical or empirical) study of reward learning algorithms; for example, see this post.
Considerations
Note that this problem is not very straightforward. A simple method for quantifying the distance between two reward functions might be to measure their L2-distance. However, this is unsatisfactory, because two reward functions can have a large L2-distance even if they induce the same ordering of policies, or a small L2-distance even if they induce the opposite ordering of policies. For example, given an arbitrary reward function R and an arbitrary positive constant c, we have that R and c⋅R induce the same ordering of policies, even though their L2-distance may be arbitrarily large. Similarly, for any ϵ>0, we have that ϵ⋅R and −ϵ⋅R induce the opposite ordering of policies (unless R is constant), even though their L2-distance may be arbitrarily small. Solving this problem in a good way thus requires some care.
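To make this concrete, here is a minimal numerical sketch (using NumPy, with a reward function represented as a flat vector over transitions; the dimensions and numbers are arbitrary illustrations, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=27)  # e.g. a reward function over 3 states x 3 actions x 3 successor states

# R and 10 * R induce the same ordering of policies, yet their L2-distance is large.
print(np.linalg.norm(R - 10 * R))                  # 9 * ||R||, i.e. large

# 0.001 * R and -0.001 * R induce opposite orderings (assuming R is not constant),
# yet their L2-distance is tiny.
print(np.linalg.norm(0.001 * R - (-0.001 * R)))    # 0.002 * ||R||, i.e. small
```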
(There are two earlier proposals for how to do this, namely EPIC and DARD. In Appendices A and B of the main paper, we outline a number of shortcomings of these earlier methods.)
We should start by asking what it means for a given function d:R×R→R to be “good” at quantifying the differences between reward functions. First and foremost, we probably want d to be a pseudometric, since this comes with several nice mathematical properties. This means that it should satisfy:
d(R1,R2)≥0, with equality if R1=R2.
d(R1,R2)=d(R2,R1).
d(R1,R3)≤d(R1,R2)+d(R2,R3).
The difference between a metric and a pseudometric is that for a metric, it is required that d(x,y)=0 only if x=y, whereas for a pseudometric, we can have that d(x,y)=0 even when x≠y. In our case, it seems reasonable to consider pseudometrics rather than metrics, since we may want to consider some distinct reward functions to have distance 0. For example, if two rewards only differ by positive linear scaling, then it seems reasonable to say that they are equivalent (and thus have distance 0).
Requiring that d is a pseudometric is a very weak and general requirement. More specifically to our problem, we ideally want it to be the case that d(R1,R2) is small if and only if optimising R1 or R2 would lead to similar outcomes. We can formalise this intuitive statement using regret bounds. Specifically, we say that:
Definition: A pseudometric d on R is sound if there is a constant U, such that for any policies π1,π2, if J2(π2)≥J2(π1), then J1(π1)−J1(π2)≤U⋅(maxπJ1(π)−minπJ1(π))⋅d(R1,R2).
Let us unpack this definition. J1(π1)−J1(π2) is the regret, as measured by R1, of using policy π2 instead of π1. Division by maxπJ1(π)−minπJ1(π) normalises this quantity by the total range of policy values under R1, so that it lies between 0 and 1 (though in the definition the term is placed on the right-hand side of the inequality, rather than being used as a denominator, in order to avoid division by zero when maxπJ1(π)−minπJ1(π)=0). The condition that J2(π2)≥J2(π1) says that R2 prefers π2 over π1. Taken together, this means that a pseudometric d is sound if d(R1,R2) gives an upper bound on the maximal (normalised) regret that could be incurred under R1 if we move from an arbitrary policy π1 to another policy π2 that R2 considers at least as good. It is also worth noting that this includes the special case where π1 is optimal under R1 and π2 is optimal under R2.
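As a rough illustration of the quantity being bounded, here is a minimal Python sketch that computes the normalised regret term for the special case just mentioned (π1 optimal under R1, π2 optimal under R2) in a small hypothetical tabular MDP. The MDP, the helper names, and the use of value iteration are my own choices for illustration, not details from the paper, and this only lower-bounds the worst case over all policy pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
tau = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition probabilities tau[s, a, s']
mu0 = np.ones(nS) / nS                            # initial state distribution
R1 = rng.normal(size=(nS, nA, nS))
R2 = rng.normal(size=(nS, nA, nS))

def optimal_policy(R):
    """A deterministic optimal policy for R, via value iteration."""
    r_sa = (tau * R).sum(axis=2)                  # expected reward for each (s, a)
    V = np.zeros(nS)
    for _ in range(1000):
        V = (r_sa + gamma * tau @ V).max(axis=1)
    return (r_sa + gamma * tau @ V).argmax(axis=1)

def J(pi, R):
    """Expected discounted return of a deterministic policy pi under reward R."""
    P_pi = tau[np.arange(nS), pi]                 # P_pi[s, s'] = tau(s' | s, pi(s))
    r_pi = (tau * R).sum(axis=2)[np.arange(nS), pi]
    return mu0 @ np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

pi1, pi2 = optimal_policy(R1), optimal_policy(R2)  # so J2(pi2) >= J2(pi1) holds
regret = J(pi1, R1) - J(pi2, R1)                   # J1(pi1) - J1(pi2)
J_range = J(pi1, R1) - J(optimal_policy(-R1), R1)  # max_pi J1(pi) - min_pi J1(pi)
print(regret / J_range)                            # normalised regret (assuming J_range > 0)
```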
In addition to this, we also want pseudometrics d that induce a lower bound on worst-case regret. When this is the case, we say that d is complete. It may not be immediately obvious why this property is desirable. To see why it is, note that if a pseudometric d on the space of all reward functions R does not induce a lower bound on worst-case regret, then there are pairs of reward functions that have a low worst-case regret but a large distance under d. This would in turn mean that d is not tight, and that it should be possible to improve upon it. In other words, if we want a small distance under d to be both sufficient and necessary for low worst-case regret, then d must induce both an upper and a lower bound on worst-case regret. Formally, we say that:
Definition: A pseudometric d on R is complete if there is a constant L, such that for any reward functions R1,R2, there are policies π1,π2 such that J2(π2)≥J2(π1) and J1(π1)−J1(π2)≥L⋅(maxπJ1(π)−minπJ1(π))⋅d(R1,R2).[1]
Thus, to be useful for quantifying the differences between reward functions, a function d:R×R→R should ideally be a pseudometric, and be both sound and complete.
STARC Metrics
In the paper, we propose a family of pseudometrics on the space of all reward functions, which we refer to as STAndardised Reward Comparison (STARC) metrics. We will also show that STARC metrics satisfy all considerations we outlined above.
STARC metrics are computed in several steps. First, we need a few new definitions:
Definition: Two reward functions R1,R2 differ by potential shaping if there is a function Φ:S→R such that
R1(s,a,s′)=R2(s,a,s′)+γ⋅Φ(s′)−Φ(s)
for all s,a,s′. They differ by S’-redistribution if
ES′∼τ(s,a)[R1(s,a,S′)]=ES′∼τ(s,a)[R2(s,a,S′)]
for all s and a.
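As a concrete illustration of these two definitions, here is a minimal NumPy sketch that constructs a potential-shaped version and an S'-redistributed version of a reward function, in a small hypothetical tabular MDP (the particular way of constructing the redistribution term is my own):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
tau = rng.dirichlet(np.ones(nS), size=(nS, nA))    # transition probabilities tau[s, a, s']
R2 = rng.normal(size=(nS, nA, nS))

# Potential shaping: R1(s, a, s') = R2(s, a, s') + gamma * Phi(s') - Phi(s)
# for an arbitrary potential function Phi : S -> R.
Phi = rng.normal(size=nS)
R1 = R2 + gamma * Phi[None, None, :] - Phi[:, None, None]

# S'-redistribution: move reward between successor states without changing the
# conditional expectation E_{S' ~ tau(s, a)}[R(s, a, S')].
Z = rng.normal(size=(nS, nA, nS))
Z = Z - (tau * Z).sum(axis=2, keepdims=True)       # now E_{S' ~ tau(s, a)}[Z(s, a, S')] = 0
R3 = R2 + Z

# R3 agrees with R2 in conditional expectation, so (like R1) it induces the same
# ordering of policies as R2.
assert np.allclose((tau * R2).sum(axis=2), (tau * R3).sum(axis=2))
```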
To get a better intuition for what potential shaping and S’-redistribution do, see this post. For now, it is probably sufficient to know that if R1 and R2 differ by (some combination of) potential shaping and S’-redistribution, then they induce the same ordering of policies. Using this, we can now define:
Definition: A function c:R→R is a canonicalisation function if:
c is linear,
R and c(R) differ by potential shaping and S’-redistribution, and
c(R1)=c(R2) if and only if R1 and R2 differ by potential shaping and S’-redistribution.
A canonicalisation function essentially standardises reward functions, such that reward functions that differ by potential shaping and S’-redistribution are mapped to a single representative in their respective equivalence class. You can think of this as a kind of normalisation. The requirement that c is linear makes our later analysis more straightforward, and is not too restrictive in practice. However, it could probably be lifted, with some effort. We also use the following definition:
Definition: A metric m:R×R→R is admissible if there is a norm p and two positive constants l,u such that l⋅p(x−y)≤m(x,y)≤u⋅p(x−y) for all x,y∈R.
Any norm is of course an admissible metric, but there are some other metrics which are also admissible. This weakening is only included to make our definitions as general as possible – in practice, you can mentally replace "admissible metric" with “norm”, and not lose much.
Using this, we can now give a definition of STARC metrics:
Definition: A function d:R×R→R is a STARC metric if there is a canonicalisation function c, a function n that is a norm on Im(c), and a metric m that is admissible on Im(s), such that d(R1,R2)=m(s(R1),s(R2)), where s(R)=c(R)/n(c(R)) when n(c(R))≠0, and c(R) otherwise.
In other words, a STARC metric d is computed by first applying a canonicalisation function c to both of its inputs, then normalising the resulting reward functions (unless the canonicalised reward is zero, in which case it is left unchanged), and finally measuring the distance between the resulting reward functions.
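As a sketch of this recipe, the following Python function computes a STARC distance given some canonicalisation function. The function name and signature are mine, and the defaults assume the L2-norm; this is an illustration of the definition, not code from the paper:

```python
import numpy as np

def starc_distance(R1, R2, canonicalise, n=np.linalg.norm, m=None):
    """A sketch of a STARC metric: canonicalise, normalise, then measure the distance.

    canonicalise is assumed to map a reward (as a NumPy array) to its canonical
    representative, n is the norm used for normalisation, and m is the (admissible)
    metric used for the final comparison, defaulting to the L2-distance.
    """
    if m is None:
        m = lambda x, y: np.linalg.norm(x - y)

    def standardise(R):
        C = canonicalise(R)
        size = n(C)
        return C / size if size > 0 else C   # leave the (canonicalised) zero reward unchanged

    return m(standardise(R1), standardise(R2))
```

For instance, plugging in the VAL canonicalisation sketched below (or the minimal canonicalisation) and keeping the L2-norm defaults would give one concrete STARC metric.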
The most complicated part of this definition is the canonicalisation function — for the norm n and metric m, we can simply pick any norm (such as the L2-norm, etc). Let me therefore also give two examples of canonicalisation functions:
Proposition: For any policy π, the function c:R→R given by
c(R)(s,a,s′)=ES′∼τ(s,a)[R(s,a,S′)−Vπ(s)+γ⋅Vπ(S′)]
is a canonicalisation function. Here Vπ is computed using the reward function R that is given as input to c. Note that we must use the same policy π for all R. We refer to this canonicalisation function as Value-Adjusted Levelling (VAL).
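Here is a minimal tabular sketch of VAL, assuming finite state and action spaces, a transition tensor tau[s, a, s'], and a fixed stochastic policy pi[s, a]; this is my own rendering of the formula above, not code from the paper:

```python
import numpy as np

def val_canonicalise(R, tau, pi, gamma):
    """A tabular sketch of Value-Adjusted Levelling (VAL).

    R[s, a, s'] is the reward, tau[s, a, s'] the transition probabilities, pi[s, a]
    a fixed stochastic policy (the same pi must be used for every reward that is
    canonicalised), and gamma the discount factor.
    """
    nS = R.shape[0]
    # Policy evaluation: V^pi = (I - gamma * P_pi)^{-1} r_pi, using the reward R itself.
    r_pi = np.einsum('sa,sat,sat->s', pi, tau, R)    # expected immediate reward under pi
    P_pi = np.einsum('sa,sat->st', pi, tau)          # state transition matrix under pi
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
    # c(R)(s, a, s') = E_{S' ~ tau(s, a)}[R(s, a, S') - V(s) + gamma * V(S')].
    # The expectation is over S', so the result is constant in its third argument.
    c = (tau * (R + gamma * V[None, None, :])).sum(axis=2) - V[:, None]
    return np.broadcast_to(c[:, :, None], R.shape).copy()
```

For example, pi could be taken to be the uniformly random policy, pi = np.ones((nS, nA)) / nA.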
Definition: A canonicalisation function c is minimal for a norm n if n(c(R))≤n(R′) for all R and all R′ such that R and R′ differ by potential shaping and S’-redistribution.
Proposition: For the L2-norm, the minimal canonicalisation function exists and is unique. To get this canonicalisation function, let R0 be the reward that is zero everywhere, and let V be the set of all reward functions that differ from R0 by potential shaping and S’-redistribution. Let W be the orthogonal complement of V in R. Then the minimal canonicalisation function for the L2-norm is the orthogonal projection of R onto W. (Note that R can be viewed as an |S||A||S|-dimensional real vector space.)
For proofs, please see the main paper. The minimal canonicalisation function is easy to work with theoretically, but not so easy to compute for empirical experiments. By contrast, VAL can easily be estimated even in large-scale environments. Combined with some norm and some admissible metric (which may also be a norm), each of these canonicalisation functions yields a STARC metric.
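To make the construction in the last proposition concrete, here is a rough NumPy sketch for small tabular MDPs. The spanning sets for V (potential-shaping directions plus the null space of the conditional-expectation map) and the least-squares projection are my own way of implementing the orthogonal projection described above:

```python
import numpy as np

def minimal_canonicalise(R, tau, gamma):
    """A sketch of the minimal canonicalisation function for the L2-norm.

    Projects R onto the orthogonal complement W of the subspace V of rewards that
    differ from the zero reward by potential shaping and S'-redistribution.
    R and tau have shape (nS, nA, nS); tau[s, a, s'] are transition probabilities.
    """
    nS, nA, _ = tau.shape
    dim = nS * nA * nS

    # Potential-shaping directions: for each state t, with Phi the indicator of t,
    # D_t(s, a, s') = gamma * [s' = t] - [s = t].
    shaping = np.zeros((nS, dim))
    for t in range(nS):
        D = np.zeros((nS, nA, nS))
        D[:, :, t] += gamma
        D[t, :, :] -= 1.0
        shaping[t] = D.reshape(dim)

    # S'-redistribution directions: the null space of the linear map
    # Z -> E_{S' ~ tau(s, a)}[Z(s, a, S')], computed from the SVD of its matrix.
    E = np.zeros((nS * nA, dim))
    for s in range(nS):
        for a in range(nA):
            row = np.zeros((nS, nA, nS))
            row[s, a, :] = tau[s, a, :]
            E[s * nA + a] = row.reshape(dim)
    _, sing, Vt = np.linalg.svd(E)
    rank = int((sing > 1e-10).sum())
    redistribution = Vt[rank:]                     # rows spanning the null space of E

    # Project R onto W: subtract its least-squares projection onto V.
    A = np.vstack([shaping, redistribution]).T     # columns of A span V
    coeffs, *_ = np.linalg.lstsq(A, R.reshape(dim), rcond=None)
    return (R.reshape(dim) - A @ coeffs).reshape(R.shape)
```

Together with the L2-norm, this could be plugged into the starc_distance sketch above to obtain one concrete STARC metric.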
In Appendix C of the main paper, we provide two geometric intuitions for how STARC metrics work. To get a deeper intuitive understanding for STARC metrics, it may help to read that section.
Theoretical Results
In the paper, we derive several important theoretical properties of STARC metrics. The proofs are found in the main paper. First and foremost:
Theorem: Any STARC metric is both sound and complete.
This means that if d is a STARC metric, then d(R1,R2) is small if and only if the worst-case regret (as measured by R1) of optimising R2 instead of R1 is small. In other words, d(R1,R2) is small if and only if optimising R1 or R2 leads to similar outcomes. STARC metrics thus satisfy the main considerations we provided above. We also have that:
Proposition: Any STARC metric d has the property that d(R1,R2)=0 if and only if R1 and R2 induce the same ordering of policies.
This means that STARC metrics consider two reward functions to be equivalent exactly when those reward functions induce the same ordering of policies. This is intuitive and desirable (and is, in fact, a consequence of the previous theorem). We also have the following result:
Proposition: If two pseudometrics d1,d2 on R are both sound and complete, then d1 and d2 are bilipschitz equivalent. This means that there are positive constants l,u such that l⋅d1(R1,R2)≤d2(R1,R2)≤u⋅d1(R1,R2).
Combined with the above results, this means that STARC metrics are unique (up to bilipschitz equivalence)! In other words, they capture what it means for two reward functions to be “similar” in a fairly unique and canonical way, and it will not be possible to improve upon them without losing some of their desirable properties.
Experimental Results
In the main paper, we also provide a range of empirical results. The main takeaway from these experiments is that STARC metrics indeed seem to correlate well with worst-case regret in randomly generated MDPs. We also show that STARC metrics can be estimated in large continuous environments, where they cannot be calculated exactly. For the full experimental details, please see the main paper.
Conclusion
STARC metrics induce both an upper and a lower bound on worst-case regret, which means that a small distance under a STARC metric is both necessary and sufficient for ensuring low regret. In other words, d(R1,R2) is small if and only if we are guaranteed to get similar outcomes if we optimise R1 or R2. Moreover, all pseudometrics with these properties are bilipschitz equivalent. This means that STARC metrics exactly capture what it means for two reward functions to be similar (at least for one informative way of formalising “similarity”). They are easy to work with theoretically, and can be estimated in large environments. This makes them a useful tool when evaluating reward learning algorithms.
One of the main motivations for developing these metrics was to extend the results in the paper Misspecification in Inverse Reinforcement Learning (which I also discussed in this post). In the next post in this sequence, I will show how to use STARC metrics to analyse how sensitive IRL is to misspecification.
If you have any questions, please feel free to ask them in the comments!
[1] The definition of completeness given in the main paper is slightly more complicated, in order to rule out a potential edge-case.