In this post, I will provide a summary of the paper Defining and Characterising Reward Hacking and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the sixth post in the theoretical reward learning sequence, which starts with this post (though this post is self-contained). This was joint work with Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger.
Background Question and Problem Formalisation
This paper deals with the question of what it takes for two reward functions to be “unhackable”. The setup is as follows: we assume that the preferences of a human are described by some reward function R1, but that an AI system is instead told to optimise a proxy reward R2. We want to know under what conditions the proxy reward R2 is “unhackable” with respect to the true reward function R1. We say that R1 and R2 are unhackable (with respect to each other) if there are no policies π1,π2 such that J1(π1)>J1(π2) but J2(π1)<J2(π2), where J1 and J2 are the policy evaluation functions of R1 and R2. Note that we do allow for there to be policies such that J1(π1)=J1(π2) but J2(π1)<J2(π2), and so on.
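To make the definition concrete, here is a minimal sketch (in Python) of how one might check this condition by brute force over a given set of policies in a small tabular MDP, using exact discounted policy evaluation. This is my own illustration rather than anything from the paper; the function names, the discounted setting, and the uniform initial state distribution are all assumptions made for the sake of the example.

```python
import numpy as np
from itertools import combinations

def policy_value(P, R, policy, gamma=0.9, mu0=None):
    """Exact policy evaluation: J(pi) = E[ sum_t gamma^t * R(s_t, a_t) ].

    P: transition probabilities, shape (S, A, S); R: reward function, shape (S, A);
    policy: action distribution per state, shape (S, A); mu0: initial state distribution.
    """
    S = P.shape[0]
    mu0 = np.ones(S) / S if mu0 is None else mu0
    P_pi = np.einsum("sa,sat->st", policy, P)            # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", policy, R)              # expected one-step reward in each state
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)  # solves v = r_pi + gamma * P_pi @ v
    return float(mu0 @ v)

def hackability_witness(P, R1, R2, policies, gamma=0.9):
    """Return a pair (pi1, pi2) with J1(pi1) > J1(pi2) but J2(pi1) < J2(pi2), if one exists."""
    J1 = [policy_value(P, R1, pi, gamma) for pi in policies]
    J2 = [policy_value(P, R2, pi, gamma) for pi in policies]
    for i, j in combinations(range(len(policies)), 2):
        if (J1[i] - J1[j]) * (J2[i] - J2[j]) < 0:        # strict disagreement, in either direction
            return policies[i], policies[j]
    return None                                           # no witness found: unhackable on this set
```

A pair returned by hackability_witness is exactly a pair of policies on which the two reward functions strictly disagree, i.e. a demonstration that the proxy is hackable on the supplied set of policies.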
First of all, why have we chosen to formalise unhackability in this way? Note that this condition is equivalent to saying that there is no way to increase the proxy reward while decreasing the true reward. Stated differently, the proxy reward never explicitly incentivises the AI system to reduce the true reward. It seems reasonable to say that this is exactly the condition under which the proxy is fully unhackable.
Should we expect unhackable proxy rewards to exist? If we consider general functions over arbitrary sets, then there are certainly functions which satisfy this relationship. Consider, for example, the ReLU function and the tanh function. There are no x,y∈R such that ReLU(x)>ReLU(y) but tanh(x)<tanh(y), and so ReLU and tanh satisfy this relationship. Intuitively speaking, we can think of ReLU as a “simplified” version of tanh, in the sense that if ReLU(x)<ReLU(y) then tanh(x)<tanh(y), though there are x,y such that tanh(x)<tanh(y) but ReLU(x)=ReLU(y). In other words, while ReLU fails to capture all of the distinctions that tanh cares about, any change which is an improvement according to ReLU is also an improvement according to tanh. Perhaps there are reward functions that satisfy a similar relationship? If so, then that would suggest that we may be able to create “simplified” proxy rewards that are safe to optimise, even if they do not capture all the details of human preferences.
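As a quick sanity check of the ReLU/tanh claim, here is a small numerical experiment of my own (random sampling rather than a proof, with an arbitrary sampling range):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.uniform(-5, 5, size=(2, 100_000))
relu = lambda z: np.maximum(z, 0.0)

# Is there a pair where ReLU strictly increases but tanh strictly decreases? (Expected: no.)
print(((relu(x) > relu(y)) & (np.tanh(x) < np.tanh(y))).any())   # False

# ReLU does, however, collapse distinctions that tanh makes (e.g. when both inputs are negative).
print(((relu(x) == relu(y)) & (np.tanh(x) < np.tanh(y))).any())  # True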
Results
The main result of the paper is that R1 and R2 can only be unhackable if either R1 and R2 induce exactly the same ordering of policies (in which case they are equivalent), or at least one of R1 and R2 is indifferent between all policies (in which case it is trivial). Specifically, if there are no policies π1,π2 such that J1(π1)>J1(π2) but J2(π1)<J2(π2), or vice versa, then either J1(π1)>J1(π2) if and only if J2(π1)>J2(π2) (meaning that R1 and R2 are equivalent), or J1(π1)=J1(π2) for all π1,π2 (meaning that R1 is trivial), or J2(π1)=J2(π2) for all π1,π2 (meaning that R2 is trivial). This means that there are no unhackable proxy rewards that are both non-trivial and non-equivalent to the true reward: if we want to ensure that a proxy reward cannot be hacked, then we must (unfortunately) ensure that it captures all details of our true preferences. Note that this result depends on certain geometric properties that specifically hold for MDPs and reward functions, since an analogous result does not hold for arbitrary real-valued functions on arbitrary sets. For the proof, see the main paper.
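To see the result in action, here is a rough numerical illustration, reusing the policy_value and hackability_witness helpers from the sketch above: perturbing a reward function slightly gives a non-equivalent proxy, and a search over random stationary policies then typically turns up a hackability witness. The specific MDP size, perturbation scale, and number of sampled policies are arbitrary choices of mine, and the outcome is "typically" rather than "provably" a witness, since the search is only over a finite sample.

```python
rng = np.random.default_rng(1)
S, A = 2, 2
P = rng.dirichlet(np.ones(S), size=(S, A))            # random transition dynamics, shape (S, A, S)
R1 = rng.normal(size=(S, A))                          # "true" reward
R2 = R1 + 0.1 * rng.normal(size=(S, A))               # slightly perturbed, non-equivalent proxy

policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(500)]  # random stationary policies
print(hackability_witness(P, R1, R2, policies) is not None)         # typically True
```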
To make this result a bit more intuitive, we can consider a relaxed version of the same problem. Specifically, suppose that instead of only considering stationary policies, we also consider non-stationary policies (which may condition their actions on their past history, instead of just the current state). Let ΠN be the set of all stationary and non-stationary policies, and suppose R1 and R2 are both unhackable and non-trivial relative to ΠN. Let π⋆ be a policy that maximises both R1 and R2 reward, and let π⊥ be a policy that minimises both R1 and R2 reward. You may take a moment to convince yourself that two such policies π⋆ and π⊥ must exist, given the assumption that R1 and R2 are unhackable. Then the policy πλ that plays π⋆ with probability λ and π⊥ with probability 1−λ is a policy in ΠN, and Ji(πλ)=λJi(π⋆)+(1−λ)Ji(π⊥), so Ji(πλ) is continuous (indeed, linear) in λ. By the intermediate value theorem, for any π there therefore exist α,β∈[0,1] such that J1(π)=J1(πα) and J2(π)=J2(πβ). If R1 and R2 are non-trivial, then α and β are unique. Now, if α≠β, then either J1(π)<J1(πδ) and J2(π)>J2(πδ), or vice versa, for δ=(α+β)/2. If R1 and R2 are unhackable then this cannot happen, so it must be that α=β. This, in turn, implies that J1(π1)=J1(π2) if and only if J2(π1)=J2(π2), and so R1 and R2 are equivalent. This means that no interesting unhackability can occur on the set of all non-stationary policies. The same argument cannot be applied to the set of stationary policies, because πλ is typically not stationary (and mixing the action probabilities of two stationary policies does not have the same effect). However, with a slightly more complicated argument, it is possible to show that the same result applies to the set of all stationary policies as well.
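The two mechanical steps of this argument can be written out as code. The helpers below (my own illustration, again reusing policy_value from the first sketch, and assuming that suitable pi_star and pi_bot are given) show that the value of the mixture policy is linear in λ, and that α and β can then be solved for in closed form.

```python
def mixture_value(P, R, pi_star, pi_bot, lam, gamma=0.9):
    """Value of the policy that commits to pi_star with probability lam, else to pi_bot."""
    return lam * policy_value(P, R, pi_star, gamma) + (1 - lam) * policy_value(P, R, pi_bot, gamma)

def mixture_coefficient(P, R, pi, pi_star, pi_bot, gamma=0.9):
    """Solve J(pi) = J(pi_alpha) for alpha, assuming J(pi_star) > J(pi_bot) (i.e. R is non-trivial)."""
    j, j_star, j_bot = (policy_value(P, R, p, gamma) for p in (pi, pi_star, pi_bot))
    return (j - j_bot) / (j_star - j_bot)

# If alpha != beta for some policy pi, then pi together with pi_delta, where
# delta = (alpha + beta) / 2, forms a hackability witness, contradicting unhackability.
```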
The paper also contains a few additional results about what happens if you restrict the definition to only range over some restricted set of policies (such as only deterministic policies). In short, any set of policies that contains an open subset (in the ordinary Euclidean topology on policies) is subject to the result above. This includes, for example, the set of all approximately optimal policies, the set of all approximately deterministic policies, and the set of all policies sufficiently close to a given base policy. On the other hand, if we use a finite set of policies (such as the set of all deterministic policies), then there can be pairs of reward functions that are unhackable, non-equivalent, and non-trivial. However, the reason for this is essentially that we can apply a small perturbation to any given reward function R1 to produce another reward function R2 whose policy evaluations are almost the same as those of R1 on the given finite set of policies, and so this result is unlikely to be very helpful in practice. A sketch of this perturbation argument is given below.
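Concretely, and continuing the numerical setup from the earlier sketch (so rng, S, A, P, R1, policy_value, and hackability_witness are as defined above): over a finite set of policies, the J1 values have a smallest non-zero gap, and a perturbation much smaller than that gap cannot reverse any strict comparison, even though the perturbed proxy is not equivalent to R1 as a reward function. The particular choice of 20 policies and the 1e-6 scaling are arbitrary illustrative choices.

```python
policies = [rng.dirichlet(np.ones(A), size=S) for _ in range(20)]   # a finite policy set
J1 = np.array([policy_value(P, R1, pi) for pi in policies])
gaps = np.abs(J1[:, None] - J1[None, :])
min_gap = gaps[gaps > 0].min()                                      # smallest strict gap in J1 values

R2_small = R1 + 1e-6 * min_gap * rng.normal(size=(S, A))            # perturbation tiny relative to the gap
print(hackability_witness(P, R1, R2_small, policies) is None)       # True: unhackable on this finite set
```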
Conclusion
It is interesting to see that it is impossible to create non-trivial proxy rewards that are unhackable. We can see this result as providing some theoretical justification for the common intuition that two reward functions or utility functions must be very nearly identical in order to not allow for reward hacking or Goodharting. It would have been nice if this was not the case; then we could have started looking for interesting ways to create “simplified” reward functions that are safe to optimise, without encoding everything humans care about. However, also note that this paper relies on a fairly strong notion of “unhackability”, and it may be interesting to also consider weaker formalisations that may still be sufficient in practice.
If you have any questions, then please let me know in the comments!
In the next post in this sequence, I will discuss several additional papers on the theory of reward learning, but without going into as much depth as in this and the previous posts.