AI ALIGNMENT FORUM
AF

Reward FunctionsWireheadingAI
Frontpage

29

A quick list of reward hacking interventions

by Alex Mallen
10th Jun 2025
4 min read
5

29

Reward FunctionsWireheadingAI
Frontpage
A quick list of reward hacking interventions
4Fabien Roger
1Sam Marks
New Comment
2 comments, sorted by
top scoring
Click to highlight new comments since: Today at 5:07 PM
[-]Fabien Roger3mo40

Another mitigation against reward hacking (inspired by this):

Train the model to generate 2 responses:

  • A reward maximizing response --> Use that when computing the outcome based rewards
  • An "expected" response --> Optionally score that using some high quality human feedback

This might beat "just use high quality human feedback after RL" because

  • You don't need to fight as hard against the dishonesty you instilled with outcome based rewards
  • The self-criticism might incentivize the right kind of personality
  • You get sth untrusted-monitoring-shaped for free-ish

This is similar to “tell the model to reward hack during training” mixed with “use high quality feedback at the end” but it’s all intermixed throughout training and there’s self critique.

It's probably too complex/fancy for production usage, except if it was shown to work extremely well - which I think is <20% likely.

Reply
[-]Sam Marks3mo11

I think this is a great list! Thanks for writing it.

Reply
Moderation Log
More from Alex Mallen
View more
Curated and popular this week
2Comments

This is a quick list of interventions that might help fix issues from reward hacking.

(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer’s intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)

Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might also be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Though, of course, there are commercial incentives to solve some reward hacking problems.

Here is the list (we don't claim novelty):

  • Make environments more robust. Three ways:
    • Improving reward.
      • E.g. train high-quality reward models (debate, weak-to-strong, etc).
    • Limiting affordances.
      • E.g. don’t let the model edit test cases.
    • Changing the distribution of tasks.
      • E.g. remove bad environments.
  • Make high-quality evaluation of reward hacking cheaper. This is a complement to “Make environments more robust”, especially via allowing developers to identify generalizing patches to flaws in environments (as opposed to training against these potentially fragile oversight signals directly).
    • Create evals.
    • Maintain transparency into the AI’s reasoning process (e.g. via CoT).
    • Ask the AI if it reward hacked after the trajectory (e.g. directly or via probes, a RM head, non-assistant persona queries, etc). This is similar to CoT monitoring, but might work even without legible CoT.
  • Train against reward hacking at the end (i.e. focus your “make environments more robust” at the end of training).
    • E.g. take a subset of cases in which you observed reward hacking and train against them.
  • Improve generalization:
    • Ask the AI to reward hack during training so that reward doesn’t incentivize disobeying the instructions. Then, outside of training, don’t ask it to reward hack.
      • A variant (called “shoe methods”): Tell the AI how to reward hack during training, such that it’s not as capable of reward hacking outside of training. You’d mostly expect this to help for more sophisticated reward hacks than we see today.
    • Ask the AI not to reward hack outside of training.
    • Gradient routing to isolate reward hacking propensity and/or capability, and turn it off outside of training.
  • Semi-on-policy RL: During training, ask the AI not to reward hack when sampling, but don’t mention reward hacking during backprop (or further, include opportunities to reward hack in the input during backprop). The idea is that we’re changing the prompt during sampling to manipulate exploration in desirable ways, but not directly reinforcing reward hacking on prompts that tell the model not to.
  • Reduce exploration/creativity (GDM AGI Safety Approach), such that it’s less likely to ever explore into unintended behaviors.
  • MONA might be a more robust way to administer reward for avoiding multi-step reward hacks.

Feel free to comment with any others.

EDIT: I'll try to add interventions below as I come across them over time.

  • (6/25/2025) Distillation. After training a reward hacking model, you can try to cheaply create a model that doesn't reward hack by training the base model to imitate the reward hacking model on a distribution of trajectories that don't exhibit reward-hacking. This might work if training via RL on a ton of cheap and non-robust environments was necessary for capabilities competitiveness, but you also have access to a smaller quantity of robust environments. Distillation using these robust environments might work because (a) distillation is much more sample-efficient than RL so you can afford more expensive oversight per sample and (b) the distilled model is hopefully never/minimally trained to reward hack so it might have much less reward hacking propensity. However, on point (b), subliminal learning results suggest there might be some transfer of reward hacking propensities from the teacher to the student, particularly when they share the same base model.
  • (7/18/2025) Add, don’t replace: an idea for a reward-hacking intervention that trains an AI to improve bad behavior without reducing your ability to notice it. Let’s say you have an AI that misbehaves 5% of the time according to your monitor. Instead of training the policy against the monitor and risking losing visibility into the misbehavior, train a corrective action that the AI does after the monitor flags its misbehavior. 

    E.g.:

    • Append some text to the policy asking it to retry the action without misbehaving. Train this follow up action against your monitor, masking out all the gradients to the initial action. (There’s still a concern that this generalizes to the initial action, but you would notice that the 5% number goes down, and could remedy it by using a LoRA that you swap in at correction-time). This is pretty related to defer-to-trusted/resample/... protocols.
    • Or, train the policy against the monitor using a LoRA. At inference-time, use the pre-LoRA model, and if the monitor flags the action, swap in the LoRA and try again.

    (Also train your monitor in each of these cases) Then you can continue using your monitor in deployment and it will continue to flag 5% of the time. You know to pay more attention to these trajectories that were flagged, and you’ve improved usefulness of the model. This maybe doesn’t have huge safety taxes and has nice guarantees.