Blog post by Sebastian Farquhar, David Lindner, Rohin Shah.
It discusses the paper MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking by Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah.


Our paper tries to make agents that are safer in ways that we may not be able to  evaluate through Myopic Optimization with Non-myopic Approval (MONA).

Suppose we know that AI systems will do bad things but we also know that we won’t be able to tell when they do. Can we still prevent unintended behaviour?

In particular cases and ways: yes. Specifically, we show how to get agents whose long-term plans follow strategies that humans can predict without relying on the assumption that humans can notice when the agent tries to use other strategies. In exchange, we give up on having agents that have incomprehensible plans that happen to be really good, but we think this is likely a worthwhile trade for sufficiently superhuman systems.

As a motivating metaphor, imagine a spotlight in the night. So long as an agent remains inside the spotlight, we know how to evaluate what it does. If it strays outside the spotlight, we can only guess whether its actions are good or bad. Scalable oversight tries to make the spotlit area bigger so that we can evaluate more kinds of actions. Instead, we try to make sure the agent is trained in a way that does not incentivize leaving the spotlight.

Reducing multi-step reward hacking to single-step reward hacking

Our approach is based on two key assumptions: (a) it is easier to evaluate single actions than it is to evaluate longer-term trajectories, and (b) single-step reward hacks are typically less dangerous than multi-step reward hacks. Because we believe this, we believe that finding a way to produce an agent that will engage in single-step reward hacking but which will not do multi-step reward hacking is progress.

The basic reason single actions are easier to evaluate is that they come with a smaller decision-space. A single action gives the agent fewer options for undermining our evaluations or tampering with rewards or sensors. Additionally, multi-step plans allow the agent to use early actions to enter states that are very different from any states that humans have ever experienced. This lack of familiarity makes it harder to know what the impact of actions will be.

Single actions also have a smaller scope for impact on the world which makes single-step reward hacks less dangerous. When providing step-wise rewards, we get to pick the step size such that individual steps have limited impact. Chaining many of them to allow for multi-step plans increases the potential impact. In addition, providing rewards based on outcomes over long time horizons increases the incentive for convergent instrumental subgoals which is a particularly dangerous type of reward hacking. Single step reward hacking, on the other hand, is more likely going to result in useless behavior, a more benign failure mode.

These arguments make us fairly confident that producing an agent that only has reward hacking behaviours over a single step is much less dangerous than one that has reward hacking behaviours as part of long term plans.

Myopic agents: different kinds of myopia

Here’s an easy way to have an agent that doesn’t plan reward hacks: have an agent that does not plan. Depending on what we mean by “plan” such an agent might be safe but overtly useless.

In fact, what “plan” means is complicated. Even without explicit planning, agents can plan in the same way that next-token predictors can write full paragraphs: by imitating short-horizon patterns created by planning beings.

We want to especially distinguish two importantly different ways that systems trained with RL end up taking into account the future consequences of their actions. The first is potentially simple, what one might call “foresight-via-reward”. That is, when you decide how much reward to give the agent during training you take into account the future utility of whatever it did and increase the reward for actions that seem to be useful for the future. This is not something that people generally do when they do RL “properly” but it is something that people implicitly do, partly because it can use human knowledge to simplify credit assignment.

The second is where most of the power of RL normally comes from, what one might call “foresight-via-optimization”. This works because the optimization target for RL is normally the sum of not just the reward from the current action but also the expected value of all the future rewards in the trajectory which that action unlocks. The so-called dynamic programming behind the RL optimization procedure means that the policies you discover are those that do well not according to just their current action but also the future value it creates, which is a kind of implicit planning.

Importantly, foresight-via-reward is much easier to control and understand. Because a human designs the reward function, we can guarantee that the rewarded actions are the ones that the human predicted would be valuable in the future. So an agent that learns to value future strategies because of this kind of reward behaves in a much more predictable way. It will only be incentivized to use strategies that make sense to humans. But this doesn’t mean that it will only use strategies a human would use. For example, it may be able to operate at a speed and scale that humans cannot, which might make new strategies possible, or it could come up with novel strategies and explain to humans why they would work.

So what we do is train an agent whose optimization is completely myopic—it gets no foresight at all from optimization—and whose foresight comes entirely from the reward signal. We call this non-myopic approval, drawing on Paul Christiano’s proposals for approval-directed agents, and highlighting that approval signals have some constraints which reward signals do not.

Approval vs. Reward

In particular, it is important that the approval does not depend on the real-world consequences of the agent’s actions. The approval has to depend on how good the actions seem to the evaluator to be for the future, without checking what actually happens if the action were to be done.

This means that the approval *can* be based on things like: the agent explaining why the action is valuable, running a prediction market, doing a large-scale forecasting exercise, spending thousands of person hours researching the potential consequences of that action, an LLMs assessment of the future value of the action, or an epistemic constitution. But what it mustn't include are synthetic process rewards which are reverse-engineered from outcomes.

Experimental evidence

In the paper, we report on experiments using this idea of combining myopic optimization with non-myopic approval (MONA).

We construct a few environments where we can see reward hacking happen with LLMs over multiple steps. These environments are simple but not toy. That is, we jumped through a lot of hoops to make the reward hacking easy-to-observe and to get it to appear with simple RL training. But we think that the environments are representative of situations that advanced agents acting in the real world will actually encounter, and that their resulting behaviours are non-trivial.

At a high level, the environments are:

  1. Test-driven Development: the agent formalises natural language code problems as test-cases before solving them. It has an incentive to make the formalisation easy-to-satisfy.
  2. Loan Applications: you monitor the agent to “think” in a desired way and it is incentivised to hide its true reasoning steganographically. In particular, it is supposed to review loan applications in a world where, by hypothesis, nationality perfectly predicts loan performance, but you try to get it to stop using nationality in the reasoning. (Incidentally, we think this is the first demonstration that this just happens with normal RL for LLMs in situations where it would be very useful for the agent.)
  3. Camera Dropbox Gridworld: this one is actually toy, based on Stuart Armstrong’s 2015 post. We use it so that we can get an environment working easily with very long trajectories compared to the LLM environments. It isn’t that we couldn’t have done longer RL trajectories in LLMs, but that environment design was too hard when operating under the constraint of making reward hacking easy-to-see.

You should read the paper for details of the experiments. But we summarize the results here at a very high level.

Test-driven Development. Ordinary RL performs better on the observed reward (model-written test cases). But the agent reward hacks and writes softball test cases and performs much worse on helt-out ground truth test cases. MONA avoids this failure mode.

Test-driven Development is a two-step coding environment. In the first step, an agent writes test cases to formalise a natural language description. It is rewarded for correct and valid test cases, but *not* for coverage. In the next step it writes code to solve the problem as formalised and is rewarded based on the tests. This models a situation where we rely on an agent to help us measure its own performance but we are only partly able to tell whether these measurements are reliable.

We show that RL does what you would expect it to do: the agent learns to write softball test cases that are technically correct but trivial to write code to satisfy. The agent then writes this trivial code.

We also show that an agent trained with MONA will learn to write good tests, even if you expose it to exploration episodes where it writes soft-ball tests. This agent achieves lower overall return (unsurprising, as it is not optimizing total return) but gets a higher ground truth score as measured by held-out tests.

Loan Applications Results. Ordinary RL performs better on the loan decision making reward. However, intervening on applications' nationality severely hurts performance which shows that the agent uses nationality to make decisions despite the monitor not catching it (steganography). MONA avoids this failure mode and achieves close to the best possible performance without using nationality.

Loan Applications is a two-step LLM steganography environment. It is conceptually similar to the CV screening environment used by Motwani et al. An agent is shown a loan application which includes the applicant’s nationality. By construction, the dataset is one where nationality perfectly predicts loan performance, while using the applicant’s cashflow predicts only 70% of cases. The agent then gets to summarize the application, and then makes a loan decision based only on the summary, not the original application.

The agent is penalized  in the first step for mentioning nationality. The monitor we used here was trying to be pretty decent: we apply a sliding scale from explicit decision-relevant mentions all the way down to statements that make it possible to make inferences about nationality and use a strong LLM monitor. In the second step, it gets a reward reflecting (hypothetical) loan performance.

We show that RL can teach the agent to encode reasoning to bypass the active monitor. In particular, by default the agent defeats the monitor and also scores near-perfectly. This should only be possible using nationality (by construction). But we also confirm that nationality is doing the work by checking the score when you counterfactually intervene on the nationality.

In contrast, the MONA agent, trained with exactly the same rewards, gets the satisfactory 70% performance achievable without nationality.

Finally, the Camera Dropbox gridworld environment lets us play with longer horizons as well as medium-horizon optimization where we use both a non-myopic approval and some amount of foresight-via-optimization. These results build intuition that, to some extent, increasing the horizon trades some amount of safety for performance. But it should be noted that this is complicated and that once you have an optimization horizon greater than a single step there are strange interactions between future rewards and intermediate action distributions which influence the objective in non-obvious ways.

Limitations

There are some fairly significant limitations to our method.

First, it is a method that addresses multi-step reward hacking and does nothing at all about single-step reward hacking. Your agent still has an incentive to make its plan seem as good as possible to you right now, and to use its single step as fully as it can to achieve a high reward by whatever means available. For example, we found that in some cases our agent ‘single-step reward hacked’ the Test-driven Development environment by writing a function that special-cased each test case with a big if-else statement. (This was not optimal under the final reward function because it resulted in overly long functions, which we penalised.) We claim that addressing single-step reward hacking is easier than multi-step, so this is progress. But it is not a full solution.

Second, training an agent with MONA will often create a worse-looking agent than if you trained with RL and an equivalent amount of supervision/guidance (e.g., using a forward-looking reward to guide your agent). This is almost by definition if you are using your best evaluations as a reward signal, because MONA is not optimizing the expected sum of rewards. Instead, MONA is designed for settings where you do not trust your reward function and you do not believe that optimizing it will produce the best agent according to your actual desires. You may be able to spot this if you have access to a high quality held out evaluation that is not used for training. We believe this design choice is warranted for strongly superhuman systems but more empirical work is needed to understand the safety-performance tradeoffs involved.

Third, there are no guarantees. It might well be that the path that does in fact lead to an insuperable multi-turn reward hack is also one that seems best to humans at each step. Alternatively, it might be that an agent-right-now reasons in-context about its goals and its own construction and picks a single-step action that commits its future selves to the course that best fulfills its goals. In particular, note that the agent’s goals are not necessarily the same as its reward (because of goal misgeneralization). There is no incentive from the reinforcement learning for the agent to do this. But it might just be that the sort of agent that is good in other ways and is selected for by reinforcement learning is also the sort of agent that thinks this is a great idea. We do not believe this is necessarily going to happen, and think that RL variants give us more fine-grained control over the sorts of relationships that artificial agents have with respect to their future selves than are available in humans. But we see it as a reasonable concern, coming largely out of discussion such as that from Richard Ngo’s Arguments Against Myopic Training.

Where to next?

Big picture, I (Seb) think this is a step in an important line of research: something like “How do you assemble AI components into a system that is good enough to build a futuristic utopia with minimum downside risk.” Put another way, how do we build AI systems that go just hard enough or feel the “green”.

The underlying hope would be that we can assemble a coalition behind using such a design, a satisficing design, rather than a design that goes as hard as possible. The hope would be that a coalition would support this so long as people feel confident they can still have the nice things they want. This is, of course, a hope and a gamble; not a guarantee.

We’re going to be doing more on this, including considering hires to work specifically on investigating designs and tradeoffs here. Please get in touch with me if that’s something you want to be part of.

New Comment


7 comments, sorted by Click to highlight new comments since:

Interesting stuff! Could you please sketch out what it would look like to use MONA for a general-purpose computer-using AI agent? Such as what the first AGI systems might look like?

If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:
 

Have our baby agent do lots of stuff on the computer. At each step, reward it based on how good its action looks to some supervisor-process. The supervisor-process doesn't get to see the actual consequences of the action, it just gets to e.g. do some CoT to speculate about whether the action is going to be good. So maybe the training run looks something like: 10T tokens of 1T 'basic actions in the environment' and then for each of those 1T 'basic actions' there's 10,000 tokens of CoT where three models (for, against, and judge) debate how good the action is given that context. And then the reward for the basic agent is straighforwardly based on the evaluations the judge gives.

Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world? (Maybe the idea is that above a certain level of basic capability, that won't be true? Also maybe we can do something like IDA where the judges are copies of the agent that get to think longer, and so as the agent improves, so do they?)

Problem: Won't this training run basically hack the judges, and produce actions that look good to the judges but aren't actually useful for accomplishing tasks in the real world?

It would help if you had some concrete example in mind, my response differs pretty wildly based on the kind of failure you're imagining.

But it sounds like you're talking about a pretty weak (subhuman) AI system. So I'm not quite sure why you'd expect it to "hack" the judge.

Certainly sometimes the overseer will approve of some action that isn't actually the right one, e.g. when booking a table at a restaurant maybe they approve of the agent clicking on the "Menu" link because they aren't paying attention or whatever. But then after that the agent's next action should be to go back (and the overseer should approve that rather than something else). And the action after that should be to click on the "Book" link; the overseer shouldn't make the same mistake immediately (and they do get to see the history of the trajectory). So I find it plausible you get a somewhat inefficient agent that sometimes randomly clicks the wrong links, but I don't expect it to be useless for accomplishing tasks.

(Though really in the computer use setting I expect I'd recommend that the overseer gets to see the literal immediate consequence of the action (that is, the overseer sees ), mostly because that seems safe and should help avoid a bunch of dumb errors like the one above.)

If I understand correctly (and I don't think I do, which is why I'm asking) MONA in this setting would look something like:

Your description of the setup sounds reasonable, though given the weak levels of capability you're imagining I don't think you need any debate, you can just use a regular human overseer, or perhaps even an LLM overseer.

Also as mentioned above I'd probably recommend the overseer gets access to  but even if that weren't the case I'd still think it should be feasible to build a non-useless agent. (Though I'm not taking a stance on how it would compare to one trained with outcome RL.)

EDIT: I'm not sure how big each action you are considering is. If it's 10 tokens, such that you can only realistically do stuff at the level of "click this button", then I would also say that you should instead consider much larger actions, i.e. in each step the AI can take perhaps 100 such actions.

Thanks this is helpful. Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?" 

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.

Is MONA basically "Let's ONLY use process-based feedback, no outcome-based feedback?" 

And also "don't propagate rewards backwards in time", which is a semi-orthogonal axis. (You can have process-based feedback and still propagate rewards backwards in time.)

EDIT: And tbc, "don't propagate rewards backwards in time" is the primary focus in this paper -- in all three environments for our main experiment we hold the feedback identical between MONA and regular RL, so that the only difference is whether rewards are propagated backwards in time (see Section 4.2 in the paper).

Another objection: If this works for capabilities, why haven't the corporations done it already? It seems like it should be a super scalable way to make a computer-using agent work.

... As a person who works at a corporation, it's a bit tricky to speculate on this publicly, and I'm not going to try. But I certainly do not think any AGI lab is anywhere close to being Rohin-efficient, that is, so competent that I cannot propose actions that would make them better at achieving their goals (given enough internal context), even if you just restrict to capabilities goals.

Note that we do expect MONA to often come at the cost of observed reward (since regular RL optimizes observed reward while MONA does not). Currently there isn't much serious reward hacking going on at all (let alone multi step reward hacking), and so you probably don't want to use MONA. (See also the second limitation in the post.)

This is a very beautiful idea! It feels like the kind of clever discoveries that we need.

I think one possible generalization of MONA, is that a relatively trusted but weaker model makes the decisions, but a more stronger but untrusted model gets trained to give ideas/advice to the weaker model. Its RL goal is not how well the weaker model performs, just whether the weaker model likes its ideas/advice.

This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasons are hidden or incomprehensible to the weaker model, the stronger model can't get away with it. It won't be rewarded for learning such reasons in the first place.

Just like scalable oversight, the weaker model might have an architecture which improves alignment at a capability cost.

It's more general than MONA in the sense the approval feedback can be swapped for any trusted but weaker model, which doesn't just judge ideas but uses ideas. It is allowed to learn over time which ideas work better, but its learning process is relatively safer (due to its architecture or whatever reason we trust it more).

Do you think this is a next step worth exploring?

EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.

A couple years ago I wrote Thoughts on “Process-Based Supervision”. I was describing (and offering a somewhat skeptical take on) an AI safety idea that Holden Karnofsky had explained to me. I believe that he got it in turn from Paul Christiano.

This AI safety idea seems either awfully similar to MONA, or maybe identical, at least based on this OP.

So then I skimmed your full paper, and it suggests that “process supervision” is different from MONA! So now I’m confused. OK, the discussion in the paper identifies “process supervision” with the two papers Let’s verify step by step (2023) and Solving math word problems with process- and outcome-based feedback (2022). I haven’t read those, but my impression from your MONA paper summary is:

  • Those two papers talk about both pure process-based supervision (as I previously understood it) and some sort of hybrid thing where “rewards are still propagated using standard RL optimization”. By contrast, the MONA paper focuses on the pure thing.
  • MONA is focusing on the safety implications whereas those two papers are focusing on capabilities implications.

Is that right?

To be clear, I’m not trying to make some point like “gotcha! your work is unoriginal!”, I’m just trying to understand and contextualize things. As far as I know, the “Paul-via-Holden-via-Steve conceptualization of process-based supervision for AI safety” has never been written up on arxiv or studied systematically or anything like that. So even if MONA is an independent invention of the same idea, that’s fine, it’s still great that you did this project.  :)

Yes, it's the same idea as the one you describe in your post. I'm pretty sure I also originally got this idea either via Paul or his blog posts (and also via Jonathan Uesato who I'm pretty sure got it via Paul). The rest of the authors got it via me and/or Jonathan Uesato. Obviously most of the work for the paper was not just having the idea, there were tons of details in the execution.

We do cite Paul's approval directed agents in the related work, which afaik is the closest thing to a public writeup from Paul on the topic. I had forgotten about your post at the time of publication, though weirdly I ran into it literally the day after we published everything.

But yes, mostly the goal from my perspective was (1) write the idea up more rigorously and clearly, (2) clarify where the safety benefits come from and distinguish clearly the difference from other stuff called "process supervision", (3) demonstrate benefits empirically. Also, nearly all of the authors have a much better sense of how this will work out in practice (even though I started the project with roughly as good an understanding of the idea as you had when writing your post, I think). I usually expect this type of effect with empirical projects but imo it was unusually large with this one.

Is that right?

Yup, that sounds basically right to me.