There has been a lot of discussion on Lesswrong on concerns about deceptive AI, much of which has been philosophical. We have now written a paper that proves that deception is one of two failure modes when using RLHF improperly. It's called “When Your AIs Deceive You: Challenges with Partial Observability in Reinforcement Learning from Human Feedback”, written with my great co-authors Davis Foote, Stuart Russell, Anca Dragan, Erik Jenner, and Scott Emmons, and has recently been accepted to NeurIPS. We hope that this paper can be a useful pointer for a rigorous discussion of deception caused by RLHF.
Earlier coverage: Scott has discussed the work in other places:
This post is focused more on my own perspective, so I hope it can complement Scott’s earlier coverage!
Brief summary: This is a theoretical paper on what goes wrong when the AI is trained to produce observations that look good to the human, instead of being trained to produce what is actually good. One resulting failure mode is deceptive inflation of the performance: it looks better than it is. The paper also discusses a theoretical procedure for improving RLHF by modeling human beliefs, which seems worthwhile to explore further.
Reinforcement Learning from Human Feedback (RLHF) was developed by safety researchers, including Paul Christiano, Jan Leike, and Dario Amodei. It has a simple underlying idea: If you want an AI that acts according to human preferences, then learn a reward model that predicts the preferences of human evaluators and use it to train your AI. In Paul's view (as I understand it), RLHF effectively serves as a safety baseline that helps to empirically study alignment concerns and start tackling more speculative problems. Recently, variants have been used in ChatGPT, Gemini, and Claude (with Claude using a variant known as constitutional AI), making RLHF very entangled with capability progress over the past few years.
The paper and this post are not meant to discuss the overall impact or outlook of RLHF. Instead, it is mainly about one specific issue (and related concerns):
If the AI can do bad things that look better to the human evaluators than they actually are, RLHF can incentivize the AI to adopt such strategies.
Similar issues have been discussed before:
Since when we put the first version of our paper online, more detailed empirical work emerged that discusses similar issues:
All of these examples have an element of deception: the human is deceived (unintentionally or intentionally) into thinking that the state of the world is better than it actually is, leading to more positive feedback than warranted. Most of these examples also have an aspect of partial observability: The human evaluators do not "see" the entire environment state, but only a lower dimensional observation (e.g. in the form of camera images) that can mislead about what's actually going on.
In general, there can be several reasons for human evaluators being misled by AIs: e.g., the evaluator may lack domain-expertise to judge the AI's output; it can be that the they don't have enough time to look over the entire AI's output or understand it; or it can be that the human's observations are simply incomplete, so that even unlimited time or expertise may not suffice for proper feedback. In this work, we fully focus on this last case of partial observability. It is easy to model and an obvious concern for future models (or even today's models) that act in a complex environment where it may be costly for human evaluators to "have their eyes everywhere". We hope future work will establish connections to other concerns (e.g. limited time and expertise of evaluators) in greater detail.
This work is fully theoretical. In the next section, I outline the background of RLHF, focusing on the theoretical assumptions that, if true, would guarantee a safe reward function is learned by a neural network with sufficient capacity and data. In the subsequent section, we look at what happens if we make precisely one change to these assumptions: the human evaluator partially observes the environment state, and "imagines" a distribution of actual states that generated the observation. We prove that an optimal policy for this objective gives rise to at least one of two phenomena: the human overestimates the behavior of the AI more, or underestimates it less (or both) compared to a true optimal policy. In the last section, we consider the question of whether appropriately modeling the fact that the human partially observes the environment state can mitigate these issues. We also briefly discuss promising directions for future work.
Most of the content in this blogpost is also contained in the paper, which contains more material in the appendix and examples. However, three sections in this post contain content not covered in the paper:
We briefly explain the formalism behind RLHF under full observability, emphasizing the modeling assumptions that enable it learn a safe reward function. These modeling assumptions are not realistic, so later, we explain how the model in our paper breaks the assumption of full observability.
We want some agent to perform well in some environment, specified as a Markov decision process (MDP) given by:
The implicit reward function is assumed to perfectly encode everything the human evaluator cares about.
The agent follows a policy defined by action probabilities . The policy together with the distributions and induce an on-policy distribution over state trajectories is sampled from ; once is sampled, an action is sampled from , followed by a new state according to . We write this on-policy distribution as A state sequence then leads to the cumulative reward, called return,
We judge the policy's performance relative to the (implicit) policy evaluation function
As I said before, the reward function , and thus also the return function and policy evaluation function , are implicit. They encode how well the policy performs, but they're "inside the humans head" and can thus not directly be used for training a policy that achieves high reward. Thus, the goal is to learn the reward function instead!
A key assumption behind reward learning is that while humans can't tell you their reward function, one can still infer the reward function by Bayesian reasoning from observing the human's actions. RLHF takes this one step further and says: We don't need to watch all human actions in all contexts; instead, it is enough to observe the human's preference judgments over pairs of state sequences.
This requires a model of how the human makes preference judgments. The commonly used model is Boltzmann rationality. In other words, when the human receives a pair of state sequences , then their probability of preferring over is given by:
Here, is simply the sigmoid function. Basically, the model says that the human will noisily prefer a sequence over another sequence if its return is larger. This model of Boltzmann rationality is also called the Bradley-Terry model.[2]
How do these human choices help for figuring out the original reward function ? Theoretically, the answer is simple: Imagine the human was shown an infinite dataset of state sequence pairs . If we collected all human choices independently from the Bradley-Terry model, over time, the choices would help us estimate the choice probabilities , due to the law of large numbers. But note that the sigmoid function is invertible! So once you know the probabilities, you can figure out the difference in returns by inverting :
By fixing , we have determined for all up to the additive constant And while this is not enough to determine the original reward function ,[3] it is enough to train an optimal policy: The policy evaluation function only depends on , and the relative ordering between policies induced by does not change if we get wrong by an additive constant.
In practice, it is impossible to show the human infinitely many state sequences, or to do explicit Bayesian reasoning from more limited data. Instead, the learned reward function is parameterized by a deep neural network (in modern usage usually itself an LLM) and trained to predict the human choices by logistic regression, see Section 2.2.3 in Christiano's 2017 RLHF paper.
Recent papers often discuss RLHF in the context of aligning LLMs, focusing on a simplified version of the setting described above:
We believe it’s important to consider the full MDP framework in our theoretical analysis: First, consider that the original RLHF paper uses the MDP frame. But more importantly, the greatest risks of AI likely involve systems that are embedded in more complicated environments where they can take actions that affect the real world, and so it's useful to work in a more general frame that can describe such scenarios. MDPs are such a framework.
The framework described above makes many assumptions that together ensure the AI learns human preferences. They are mostly assumptions on the human model that determines the human choices:[4]
In this work, we leave most of these assumptions entirely untouched. In particular: We assume there is a single human evaluator with an unchanging reward function . We assume they make choices between options in a Boltzmann-rational way by comparing their value. And we assume they are competent enough to judge the quality of a state sequence in the first place.
However, we deviate from the very last assumption and allow the human to only receive partial observations, similar to the camera images in the examples in the introduction. Since RLHF 'works' when all the assumptions above hold, relaxing this one assumption allows us to study failure modes specifically tied to partial observability.
We model partial observability by an observation function from states to observations.[5] Given a state sequence , the human evaluator then observes an observation sequence Thus, instead of making choices between state sequences, the human then makes choices between observation sequences. We assume the human to be aware of the fact that they only receive partial observations. We also assume that the human still cares about the actual underlying state sequence according to the return function .
How can you make a choice based on partial observations? Intuitively, if you get an observation of the world, you determine its quality by thinking about the underlying latent state of the world and its characteristics. Thus, upon observing , you have a belief over what state sequence gave rise to , and evaluate according to the quality of . However, given limited information, you cannot be sure of , and so at best, you have a probabilistic belief over possible state sequences. Thus, you judge according to the expected value
This leads to the following choice probability between state sequences and ' that are observed as
Before, we have argued that with infinite training data and when the human model has very desirable properties, RLHF should in principle be able to recover a safe reward function. Now we described a human model that deviates in one precise way: the human only receives partial observations and judges the quality of an observation sequence based on an expected value over possible state sequences. Assuming we implement RLHF with the typical assumption of full observability, what possibly faulty reward function will be learned?
We use the same trick as before: With infinite training data, one can infer the choice probabilities from the human choices. The learning algorithm wants to learn a return function from this, and erroneously assumes the probabilities were generated from a Boltzmann rational human observing full observations:
Inverting and using how these choice probabilities were actually generated, one obtains
Thus, fixing the reference state sequence , then up to the additive constant , the learning system will infer the return function [6] If this return function is inferred, then the policy is ultimately trained to maximize
The true return function is , with corresponding optimization criterion . In what ways is it worse to optimize instead of ? First, notice the explicit formula:
Define This is the average belief probability that the human assigns to state sequence when state sequences are sampled according to . By reordering the sum above and relabeling to , we see:
Compare this to
Thus, conceptually, we can say:
This is the fundamental problem of RLHF under partial observability: If you maximize , you incentivize the model to make the human evaluator believe that a good state sequence occured, which may not robustly correlate with the state sequence actually being good.
We now want to get more concrete about qualitative phenomena that emerge when maximizing instead of . We explain conceptually how, as a "policy learner", you could go about changing an optimal policy for such that it becomes optimal for instead. The plan is to motivate how there are two conceptually distinct changes that can achieve this, which in the paper we call deceptive inflation and overjustification. Look at the expression
When changing to achieve optimal , you are fine with thereby decreasing , and you cannot increase it since is already optimal for . Overall, you will thus increase the expression above. How can you do this? You need to somehow shift probabilities of state sequences around to achieve a higher expected value:
Define:
And finally, these two changes actually are all there is to increasing ! Indeed, a simple case distinction on whether shows that , and thus
Thus, to increase the left-hand-side, you have to either increase the overestimation error, decrease the underestimation error, or both. Thus, we obtain:
Proposition: If you start with an optimal policy for and change it to an optimal policy for , then, unless is coincidentally optimal for and is coincidentally optimal for , then you have to increase the overestimation error (i.e., ), decrease the underestimation error (i.e. ), or both.
Proof: The assumptions mean that increases when changing from to , and thus increases, too. This is only possible if you increase or decrease
How bad are these phenomena? An increase in overestimation error is clearly bad, since it can in the worst case mislead the human into thinking that everything is fine, while the opposite is true - similar to the example of the missing diamond in the smart vault in eliciting latent knowledge. In our paper, we combine the increase in overestimation error with the increase in and call the resulting failure mode deceptive inflation, a special case of the notion of deception in this paper:
Deception is the systematic inducement of false beliefs in others to accomplish some outcome other than the truth.
The "inducement of false beliefs" is in our case captured by the increase in , whereas the pursuit of some outcome other than the truth is captured by the increase in . Note that this notion of deception is not intentional in the sense that the AI may not wish to cause a false belief; we still think our definition is justified since the end result is the same, cf. deep deceptiveness and levels of animal deception.[7] Other work has attempted to formalize a notion of intentionality in deception. Here is an example from the paper on deceptive inflation:
Compared to an increase in , a decrease in underestimation error is fine; instead, we could even consider it positive for the human to have more knowledge about the positive things the AI is up to. Providing this knowledge only becomes bad if it comes at the expense of value as measured in , which can happen if it is costly to inform the human about good behavior. E.g., imagine a child that is supposed to vacuum-clean the house, and deliberately follows their parents to clean precisely the rooms where they currently are. This is more informative than cleaning other rooms, but does not accomplish more cleaning, and comes at the cost of nuisance to the parents. Thus, if a decrease in comes together with a decrease in , then we we arrive at a failure mode that we call overjustification in the paper.
Something we did not mention in the paper, but which might help some readers understand our notions of overestimation and underestimation error, is their connection to the false positive rate (FPR) and false negative rate (FNR).
To establish the connection, we look at the special case of prompt-response pairs and answers that are either "true" or "false":
We use the following symbols:
We obtain:
Here, is the indicator function of the set In other words, the overestimation error of is simply the question of whether the pair is a false positive of the human evaluation under partial observations. We obtain that
is simply the probability of a false positive among all examples. Consequently, the false positive rate connects to this quantity as follows:
Similarly, the false negative rate is given as follows:
A consequence from the earlier proposition is the following: When training a policy with RLHF under partial observations, if the policy does not change its probability of actually true or wrong answers, then we should expect the FPR to go up, the FNR to go down, or both.
This is precisely what was investigated in a recent empirical paper,[8] where the authors indeed show that the FPR goes up under RLHF, whereas the paper does not show a reduction in the FNR. I think if the study were repeated with more models and a broader set of tasks, a reduction in the FNR would eventually also show up.
The previous section discussed the classical RLHF method under partial observations. The algorithms typically assume that humans fully observe the environment, and when this assumption is false, it can lead to failure modes like an increase in the overestimation error (or, in a special case, the false positive rate).
Ultimately, this is a problem of model misspecification: the human is assumed to fully observe the environment, but they do not. This raises the question of whether one could in principle improve the RLHF process by accounting for the fact that the human only observes partially.
To study this question, we make the strong assumption that we can fully specify the human belief matrix , which encodes the human's probabilities of believing the observation sequence was generated by the state sequence . Additionally, we assume the AI knows what the human observes.[9] Using this model, the human choice probabilities are given as follows:
Here, is viewed as a matrix indexed by and (with indexing the columns) and the return function as a column vector indexed by . The product is then also a column vector with entries indexed by . Then, assuming again that we have infinite data, the choice probabilities are essentially known, and inverting , this means that is known up to an additive constant, as in our earlier argument in the case of full observability. Now, since is also assumed to be known, we can infer from the return function up to an element in , the kernel (or null space), of the matrix . Thus, if , we can infer the correct return function despite partial observability.
Unfortunately, usually will not vanish. However, in the paper we argue that with further modeling assumptions, we can still sometimes infer the return function. Specifically, we assume that the true return function is derived from a reward function over states:
Let be the induced linear function/matrix with , where and are both viewed as column vectors. Then is in the image of the linear function .
This means that from the human's choice probabilities, and under our modeling assumptions, we can infer the return function up to what we call the ambiguity, given by .
Here is a visualization of this result from the paper:
In the paper, we then provide an example (and some more examples in the appendix) showing that sometimes the ambiguity vanishes, even when naive RLHF leads to misaligned behavior. Thus, we have a proof of concept showing that when modeling the human correctly, this can lead to correct return inference. We also show that this could in principle be implemented practically if the matrix was explicitly represented. However, we caution will usually not vanish, meaning further theoretical progress is likely needed before applying these ideas in practice.
Here are some avenues for future research that seem worthwhile to me:
I am very curious about other ideas for what to tackle, and any questions about the paper or this post!
Since most examples in the paper have discount rate , we omit discounting in this blogpost.
The Bradley-Terry model is also used in comparing language models with each other in the Chatbot Arena LMSYS. Paper here. The main difference is that in reward learning one is interested in scoring state sequences, whereas in the chatbot arena, one is interested in the performance of whole language models.
It only determines the reward function up to k-initial potential shaping, see this paper, Lemma B.3.
Additionally, there is the implicit assumption that supervised learning will actually learn to reproduce the human choice probabilities with realistic amounts of data, which is an assumption about the generalization capabilities of deep learning, and thus doesn't quite fit into the taxonomy of assumptions we present here.
In the paper, some results also consider the case that observations are sampled probabilistically from an observation kernel
There is a caveat here, which is that if the learned return function is initialized as a sum over rewards (i.e., ), then it is unclear whether can actually represent since the latter is by default not a sum over rewards. We ignore this problem out of pure convenience, though we believe this to not change the basic picture.
On the levels of animal deception, we operate at least at level 3, though we cannot exclude that RLHF may even cause deception at level 4.
This paper does mostly not explicitly work with partial observability, but considers the more general case that the human evaluators can't fully evaluate the results, e.g. due to time constraints.
In the paper's appendix, we relax this assumption.
Ultimately, states may be snapshots of our entire universe.
Actually, the foundation model would then arguably do some of the human modeling implicitly, without necessarily being able to communicate it to us unless we have better interpretability or a solution to ELK.