This is the seventh post in the theoretical reward learning sequence, which starts in this post. Here, I will provide shorter summaries of a few additional papers on the theory of reward learning, without going into as much depth as I did in the previous posts (though if there is sufficient demand, I might extend these summaries into their own posts).

 

On the Limitations of Markovian Rewards to Express Multi-Objective, Risk-Sensitive, and Modal Tasks

The original motivation for this paper came from a desire to spite a guy I met at IJCAI in 2022. He claimed to me that there is a mathematical proof that all reasonable preferences can be expressed using reward functions. I imagine he was probably thinking of the vNM utility theorem (which does not actually apply in the RL setting). Whatever the reason, this paper proves him wrong.

In this paper, I identify three reasonable classes of objectives that we might want to express to an RL agent, and which provably cannot be represented using ordinary reward functions. The first of these is the class of preference structures given by multi-objective RL, where the agent is given multiple reward functions and has to find a policy that achieves a good trade-off between those rewards according to some specified criterion. For example, the agent may have to maximise the rewards lexicographically, or it may max-min them, and so on. I prove that most such objectives cannot be captured using just a single reward function. The second class of tasks is risk-averse RL, where the agent wishes to be risk-averse relative to a given base reward function. In the ordinary utility-maximisation setting, this can be achieved by creating an auxiliary utility function that is concave in the original reward function. I prove that the same cannot be done in the RL setting. The third and final class of tasks I look at in this paper is a new category of objectives that I refer to as “modal objectives”, where the agent is given an instruction expressed not just in terms of what does happen along a given trajectory, but also in terms of what could happen. For example, an instruction such as “you should always be able to reach state X” would be a “modal” objective. I show that these objectives likewise usually cannot be adequately represented using ordinary reward functions. As such, each of these three classes of tasks provably constitutes a genuine extension of the standard RL setting. For details, see the main paper.
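To make the multi-objective case a bit more concrete, here is a minimal illustrative sketch (not taken from the paper) of a max-min objective over two reward functions. The policies and the numbers are entirely made up; the point is just to show the kind of trade-off criterion that, the paper proves, cannot in general be folded into a single Markovian reward.

```python
import numpy as np

# Toy setting: three candidate policies, each evaluated under two reward
# functions R1 and R2. Each entry is (expected return under R1, under R2).
# All numbers are made up purely for illustration.
policy_returns = {
    "always_A": np.array([10.0, 0.0]),
    "always_B": np.array([0.0, 10.0]),
    "mix_AB":   np.array([5.0, 5.0]),
}

# Max-min objective: prefer the policy whose *worst* return across the two
# reward functions is largest. The mixed policy wins here (worst case 5.0),
# even though each pure policy is optimal for one of the rewards.
best = max(policy_returns, key=lambda pi: policy_returns[pi].min())
print(best, policy_returns[best].min())  # -> mix_AB 5.0

# No fixed weighted sum w1*R1 + w2*R2 strictly prefers the mixed policy to
# both pure policies in this example, which hints at the kind of gap between
# trade-off criteria and single reward functions that the paper makes precise.
```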

To me, the main takeaway from this paper is that we should be careful with the assumption that the basic RL setting really captures everything we intuitively consider to be part of the problem domain of sequential decision-making. In particular, it is not a given that a reasonable incentive structure can necessarily be captured by a reward function. The results in this paper also make me more interested in multi-objective RL, since it seems quite rich and flexible in terms of what it may enable us to do.

 

On The Expressivity of Objective-Specification Formalisms in Reinforcement Learning

This paper is also about comparing the expressivity of different methods for specifying tasks to RL agents — we consider a large number of specification languages and show which of these languages are strictly less expressive than others (and which are incomparable). This paper is also discussed in this post (Paper 4). For details, see the full paper.

 

Goodhart's Law in Reinforcement Learning

As you probably know, "Goodhart's Law" is an informal principle which says that "if a proxy is used as a target, it will cease to be a good proxy". In other words, if we care about some quantity Y, have a variable X that is correlated with Y, and then optimise for X, then X will cease to be correlated with Y (and we will find ways to increase X without increasing Y). While this is an informal principle, it does seem to hold quite robustly in practice (see e.g. this paper). Moreover, this dynamic is at the core of many stories of how we could get catastrophic risks from AI systems. Having a better theoretical understanding of it therefore seems quite important.
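As a purely statistical toy (not the RL analysis from the paper), this dynamic is easy to simulate: take a "true" quantity Y, build a proxy X that is correlated with Y under random sampling, and then apply heavy selection pressure to X. The distributions and thresholds below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# "True" quantity Y we care about, and a proxy X = Y + heavy-tailed noise.
Y = rng.normal(size=n)
noise = rng.standard_t(df=2, size=n)
X = Y + noise

print("correlation over the whole population:", np.corrcoef(X, Y)[0, 1])

# Optimisation pressure: keep only the candidates with the highest proxy values.
top = np.argsort(X)[-100:]
print("correlation among the top 100 by proxy:", np.corrcoef(X[top], Y[top])[0, 1])
print("mean Y among the top 100 by proxy:   ", Y[top].mean())
print("mean Y among the top 100 by Y itself:", Y[np.argsort(Y)[-100:]].mean())
```

In runs like this, the candidates with the most extreme proxy values are selected largely for their noise rather than their true value, so the proxy stops tracking the thing we care about exactly where the optimisation pressure is strongest.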

In this paper, we try to provide a theoretical explanation of Goodhart dynamics in RL. In particular, we wanted to find a way to theoretically predict the shape of the curves seen in this paper, and I think we found a satisfactory account of why this happens! We also managed to leverage these results to produce a new method for conservative optimisation that tells you how much (and in what way) you can optimise a proxy reward, based on the quality of that proxy (as measured by a STARC metric), while being guaranteed that the “true” reward does not decrease (thereby preventing the Goodhart drop).
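To give a rough sense of what "quality of the proxy" means here, below is a deliberately simplified sketch of comparing two reward functions over a finite set of state-action pairs. It only removes a constant offset and an overall scale before measuring distance, which is a crude stand-in for, and not the actual definition of, the canonicalisation and normalisation steps that a real STARC metric uses (those also account for potential shaping via the transition dynamics).

```python
import numpy as np

def crude_reward_distance(r1: np.ndarray, r2: np.ndarray) -> float:
    """Toy distance between two reward vectors defined over the same finite
    set of state-action pairs. This is NOT the actual STARC construction
    (which also canonicalises away potential shaping using the transition
    dynamics); it only removes a constant offset and an overall scale."""
    def standardise(r):
        r = r - r.mean()                  # remove a constant offset
        norm = np.linalg.norm(r)
        return r / norm if norm > 0 else r
    return float(np.linalg.norm(standardise(r1) - standardise(r2)))

# Example: a proxy that is a shifted, rescaled, slightly noisy version of the
# true reward comes out as "close" under this kind of distance.
rng = np.random.default_rng(0)
true_reward = rng.normal(size=20)
proxy_reward = 3.0 * true_reward + 1.5 + 0.1 * rng.normal(size=20)
print(crude_reward_distance(true_reward, proxy_reward))   # small number
```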

This paper is discussed in more detail in this post. Alternatively, see the main paper.

 

The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

In this paper, we look at what happens when a learnt reward function is optimised. In particular, when a reward model is trained, ordinary statistical learning theory tells us that this model will eventually have a low generalisation error relative to the training distribution. However, does this guarantee that we get low regret relative to the underlying true reward function when that reward model is optimised? In the paper, we show that the answer is no. This should not be too surprising: it is well-known (and intuitive) that a model may have a low generalisation error relative to one distribution of inputs but a large error relative to another (see e.g. 1, 2, 3, 4). Moreover, when a reward model is optimised, this optimisation effectively induces a distributional shift. As a result, the reward model can give an inaccurate (and overly favourable) evaluation of the learnt policy, so you can end up with a policy that gets a low reward according to the original ground-truth reward function, even though the reward model is accurate on the training distribution (a toy numerical illustration of this failure mode is sketched after the list below). We also provide a number of additional results about this problem setting, including:

  1. We have results concerning what the training distribution must look like in order to ensure that the learnt policy has low regret, for a fixed generalisation error.
  2. We also look at various common forms of regularisation (such as KL-regularisation) and examine how this affects the dynamic described above.
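Below is a small toy sketch (not from the paper) of the distribution-shift failure described above: a reward model with low error on its training distribution whose argmax nonetheless has low true reward. The one-dimensional setting and the particular functions are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "true reward": concave in a single feature, peaking at x = 1.
def true_reward(x):
    return x - 0.5 * x**2

# Training distribution: narrow and centred at 0, so the curvature is barely
# visible in the training data.
x_train = rng.normal(scale=0.5, size=2000)
y_train = true_reward(x_train) + 0.05 * rng.normal(size=2000)

# "Reward model": a linear fit, which has low error on the training distribution.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
reward_model = lambda x: slope * x + intercept
print("MSE on the training distribution:",
      np.mean((reward_model(x_train) - y_train) ** 2))

# "Optimising the learnt reward": pick the candidate action the model scores
# highest over a wider range. The linear model extrapolates upward forever,
# so the argmax lands far outside the training distribution.
candidates = np.linspace(-3, 3, 601)
best = candidates[np.argmax(reward_model(candidates))]
print("chosen action:", best)
print("reward model's prediction for it:", reward_model(best))
print("its true reward:", true_reward(best))                 # clearly negative
print("true reward of the actual optimum (x = 1):", true_reward(1.0))
```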

This paper will also be discussed in more depth in an upcoming post.

 

Partial Identifiability and Misspecification in Inverse Reinforcement Learning

This is a long paper that combines the results from Invariance in Policy Optimisation and Partial Identifiability in Reward Learning, Misspecification in Inverse Reinforcement Learning, Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification, and STARC: A General Framework For Quantifying Differences Between Reward Functions (all of which are also discussed in the first part of this sequence). If you plan on reading all of these papers, it may be a good idea to instead just read this one — it combines all the results in one place with a unified framing and terminology, streamlines some things compared to the earlier papers, and also contains a small handful of results that cannot be found in any of the earlier papers.

 

Partial Identifiability in Inverse Reinforcement Learning for Agents With Non-Exponential Discounting

In this paper, I look at the partial identifiability of the reward function in IRL for agents that don't discount exponentially. In other words, it extends the results in this paper to agents that use non-exponential discounting. This is particularly interesting since humans are better described as discounting hyperbolically than exponentially, so this extension makes the earlier analysis more realistic and more directly applicable to humans. To carry out this analysis, we must first generalise a lot of the ordinary RL machinery to work for agents that discount hyperbolically. This is more difficult than it sounds, because non-exponential discounting can give rise to preferences that are inconsistent over time. This is all explained in the paper itself (and, in my opinion, this discussion is also somewhat interesting independently of the results for IRL). The high-level takeaway of this paper is that the reward function is just as ambiguous in this setting as it is in the setting with exponential discounting (though the ambiguity takes a different form). Moreover, as before, we find that a misspecified discount function can lead to catastrophic errors in the learnt reward function, even if the misspecification is very small.
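To illustrate the time-inconsistency issue mentioned above, here is a small stand-alone example (not from the paper) of the classic preference reversal produced by hyperbolic discounting; the discount functions, rewards, and delays below are arbitrary choices.

```python
# A classic preference-reversal example: a small reward arriving at delay 4
# versus a larger reward arriving at delay 6, evaluated now and again three
# steps later (when the delays have shrunk to 1 and 3).
def hyperbolic(t, k=1.0):
    return 1.0 / (1.0 + k * t)   # hyperbolic discount factor

def exponential(t, gamma=0.8):
    return gamma ** t            # exponential discount factor

small, large = 5.0, 9.0
delays_now, delays_later = (4, 6), (1, 3)

for name, d in [("hyperbolic", hyperbolic), ("exponential", exponential)]:
    prefer = lambda ds: "large" if large * d(ds[1]) > small * d(ds[0]) else "small"
    print(f"{name:11s} prefers the {prefer(delays_now)} reward now, "
          f"and the {prefer(delays_later)} reward three steps later")

# The exponential agent's choice is stable over time; the hyperbolic agent
# reverses it. This kind of time-inconsistency is what the paper has to deal
# with before the IRL analysis can go through.
```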

 

If you have any questions about these papers, or would like to see any of them described in more detail, then please let me know in the comments!

In the next post in this sequence, which is also the final one, I will discuss how to contribute to the theoretical reward learning research agenda. Specifically, I will list a few concrete open problems and try to provide some helpful information for tackling those (and similar) problems.
