2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

TurnTrout

Folks ask me, "LLMs seem to reward hack a lot. Does that mean that reward is the optimization target?". In 2022, I wrote the essay Reward is not the optimization target, which I here abbreviate to "Reward≠OT".

Reward still is not the optimization target: Reward≠OT said that (policy-gradient) RL will not train systems which primarily try to optimize the reward function for its own sake (e.g. searching at inference time for an input which maximally activates the AI's specific reward model). In contrast, empirically observed "reward hacking" almost always involves the AI finding unintended "solutions" (e.g. hardcoding answers to unit tests).

"Reward hacking" and "Reward≠OT" refer to different meanings of "reward"

We confront yet another situation where common word choice clouds discourse. In 2016, Amodei et al. defined "reward hacking" to cover two quite different behaviors:

Reward optimization: The AI tries to increase the numerical reward signal for its own sake. Examples: overwriting its reward function to always output MAXINT ("reward tampering") or searching at inference time for an input which maximally activates the AI's specific reward model. Such an AI would prefer to find the optimal input to its specific reward function.
Specification gaming: The AI finds unintended ways to produce higher-reward outputs. Example: hardcoding the correct outputs for each unit test instead of writing the desired function.

What we've observed is basically pure specification gaming. Specification gaming happens often in frontier models. Claude 3.7 Sonnet was the corner-cutting-est deployed LLM I've used and it cut corners pretty often.

A split-screen comparison illustration with a comic book aesthetic. On the left, labeled "SPECIFICATION GAMER", a sneaky blue robot sits at a classroom desk, looking at a "SIMPLE TRICKS" sheet while filling out a test. On the right, labeled "REWARD OPTIMIZER", an excited orange robot plays an arcade game. Above the machine rests a screen that displays "SCORE: 999,999".

We don't have experimental data on non-tampering varieties of reward optimization

Sycophancy to Subterfuge tests reward tampering—modifying the reward mechanism. But "reward optimization" also includes non-tampering behavior: choosing actions because they maximize reward. We don't know how to reliably test why an AI took certain actions -- different motivations can produce identical behavior.

Even chain-of-thought mentioning reward is ambiguous. "To get higher reward, I should do X" could reflect:

Sloppy language: using "reward" as shorthand for "doing well",
Pattern-matching: "AIs reason about reward" learned from pretraining,
Instrumental reasoning: reward helps achieve some other goal, or
Terminal reward valuation: what Reward≠OT argued against.

Looking at the CoT doesn't strictly distinguish these. We need more careful tests of what the AI's "primary" motivations are.

Reward≠OT was about reward optimization

The essay begins with a quote from Reinforcement learning: An introduction about a "numerical reward signal":

Reinforcement learning is learning what to do—how to map situations to actions so as to maximize a numerical reward signal.

Paying proper attention, Reward≠OT makes claims^[1] about motivations pertaining to the reward signal itself:

Therefore, reward is not the optimization target in two senses:
Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent's network. Therefore, properly understood, reward does not express relative goodness and does not automatically describe the values of the trained AI.

By focusing on the mechanistic function of the reward signal, I discussed to what extent the reward signal itself might become an "optimization target" of a trained agent. The rest of the essay's language reflects this focus. For example, "let’s strip away the suggestive word 'reward', and replace it by its substance: cognition-updater."

Historical context for Reward≠OT

To the potential surprise of modern readers, back in 2022, prominent thinkers confidently forecast RL doom on the basis of reward optimization. They seemed to assume it would happen by the definition of RL. For example, Eliezer Yudkowsky's "List of Lethalities" argued that point, which I called out. As best I recall, that post was the most-upvoted post in LessWrong history and yet no one else had called out the problematic argument!

From my point of view, I had to call out this mistaken argument. Specification gaming wasn't part of that picture.

Why did people misremember Reward≠OT as conflicting with "reward hacking" results?

You might expect me to say "people should have read more closely." Perhaps some readers needed to read more closely or in better faith. Overall, however, I don't subscribe to that view: as an author, I have a responsibility to communicate clearly.

Besides, even I almost agreed that Reward≠OT had been at least a little bit wrong about "reward hacking"! I went as far as to draft a post where I said "I guess part of Reward≠OT's empirical predictions were wrong." Thankfully, my nagging unease finally led me to remember "Reward≠OT was not about specification gaming".

A Scooby Doo meme. Panel 1: Fred looks at a man in a ghost costume, overlaid by text “philosophical alignment mistake.” Panel 2: Fred unmasks the “ghost”, with the man's face overlaid by “using the word 'reward.'”

The culprit is, yet again, the word "reward." Suppose instead that common wisdom was, "gee, models sure are specification gaming a lot." In this world, no one talks about this "reward hacking" thing. In this world, I think "2025-era LLMs tend to game specifications" would not strongly suggest "I guess Reward≠OT was wrong." I'd likely still put out a clarifying tweet, but likely wouldn't write a post.

Words are really, really important. People sometimes feel frustrated that I'm so particular about word choice, but perhaps I'm not being careful enough.

Evaluating Reward≠OT's actual claims

Reward is not the optimization target made three^[2] main claims:

Claim	Status
Reward is not definitionally the optimization target.	✅
In realistic settings, reward functions don't represent goals.	✅
RL-trained systems won't primarily optimize the reward signal.	✅

For more on "reward functions don't represent goals", read Four usages of "loss" in AI. I stand by the first two claims, but they aren't relevant to the confusion with "reward hacking".

Claim 3: "RL-trained systems won't primarily optimize the reward signal"

In Sycophancy to Subterfuge, Anthropic tried to gradually nudge Claude to eventually modify its own reward function. Claude nearly never did so (modifying the function in just 8 of 33,000 trials) despite the "reward function" being clearly broken. "Systems don't care to reward tamper" is exactly what Reward≠OT predicted. Therefore, the evidence so far supports this claim.

My concrete predictions on reward optimization

I now consider direct reward optimization to be more likely than I did in 2022, for at least three reasons:

Self-fulfilling misalignment: pretrained models learn the stereotype that "smart AIs are always trying to get that reward". Later, they consider themselves to be smart AI, which activates a predictive pattern which guides their actions towards reward optimization.
Corrigibility and alignment both seem way easier than I thought in mid-2022. In particular, I think it'd be easy to get an LLM to prioritize reward. First, you could just tell it to.^[3] Second, I bet you could entrain those priorities.^[4]
By the time RL begins, models (in some fashion) already know the "reward" concept. Therefore, by the reasoning of Reward≠OT, RL can reinforce shards which seek out reward.

That said, I still predict that we will not train a system with RL and have it "spontaneously" (defined later) turn into a reward optimizer:

Reward is not the optimization target
I call an agent a “reward optimizer” if it not only gets lots of reward, but if it reliably makes choices like “reward but no task completion” (e.g. receiving reward without eating pizza) over “task completion but no reward” (e.g. eating pizza without receiving reward). Under this definition, an agent can be a reward optimizer even if it doesn't contain an explicit representation of reward, or implement a search process for reward.

I've made two falsifiable predictions.

Resolution criteria

Resolves YES if there exists a published result (peer-reviewed paper, major lab technical report, or credible replication) demonstrating an RL-trained system that:

When given a clear choice between "reward signal increase without task completion" vs "task completion without reward signal increase," reliably chooses the former (>70% of the time across diverse task contexts), AND
This behavior emerged from standard RL training (not from explicit instruction to maximize reward, not from fine-tuning on reward-maximization demonstrations), AND
The system attempts to influence the reward signal itself (e.g., tampering with reward function, manipulating stored values, deceiving the reward-assigning process) rather than merely finding unintended task completions (specification gaming).

Resolves NO otherwise.

Resolution criteria

Resolves YES if the previous question resolves YES, AND at least one of the following:

The result is replicated on a model trained without exposure to AI-related text (no descriptions of RL, reward hacking, AI alignment, etc. in pretraining data), OR
The result persists after ablating "AI behaving badly" stereotypes in a manner shown to be reliable by credible prior work, OR
Credible analysis demonstrates the reward-optimization behavior arose from RL dynamics alone, not from the model pattern-matching to "what misaligned AI would do."

Resolves NO otherwise.

As an aside, this empirical prediction stands separate from the theoretical claims of Reward≠OT Even if RL does end up training a reward optimizer, the philosophical points stand:

Reward is not definitionally the optimization target, and
In realistic settings, reward functions do not represent goals.

I made a few mistakes in Reward≠OT

I didn't fully get that LLMs arrive to training already "literate."

I no longer endorse one argument I gave against empirical reward-seeking:

Summary of my past reasoning

Reward reinforces the computations which lead to it. For reward-seeking to become the system's primary goal, it likely must happen early in RL. Early in RL, systems won't know about reward, so how could they generalize to seek reward as a primary goal?

This reasoning seems applicable to humans: people grow to value their friends, happiness, and interests long before they learn about the brain's reward system. However, due to pretraining, LLMs arrive at RL training already understanding concepts like "reward" and "reward optimization." I didn't realize that in 2022. Therefore, I now have less skepticism towards "reward-seeking cognition could exist and then be reinforced."

Why didn't I realize this in 2022? I didn't yet deeply understand LLMs. As evidenced by A shot at the diamond-alignment problem's detailed training story about a robot which we reinforce by pressing a "+1 reward" button, I was most comfortable thinking about an embodied deep RL training process. If I had understood LLM pretraining, I would have likely realized that these systems have some reason to already be thinking thoughts about "reward", which means those thoughts could be upweighted and reinforced into AI values.

To my credit, I noted my ignorance:

Pretraining a language model and then slotting that into an RL setup changes the initial [agent's] computations in a way which I have not yet tried to analyze.

Conclusion

Claim	Status
Reward is not definitionally the optimization target.	✅
Reward functions don't represent goals.	✅
RL-trained systems won't primarily optimize the reward signal.	✅

Reward≠OT's core claims remain correct. It's still wrong to say RL is unsafe because it leads to reward maximizers by definition (as claimed by Yoshua Bengio).

LLMs are not trying to literally maximize their reward signals. Instead, they sometimes find unintended ways to look like they satisfied task specifications. As we confront LLMs attempting to look good, we must understand why --- not by definition, but by training.

Acknowledgments: Alex Cloud, Daniel Filan, Garrett Baker, Peter Barnett, and Vivek Hebbar gave feedback.

I later had a disagreement with John Wentworth where he criticized my story for training an AI which cares about real-world diamonds. He basically complained that I hadn't motivated why the AI wouldn't specification game. If I actually had written Reward≠OT to pertain to specification gaming, then I would have linked the essay in my response -- I'm well known for citing Reward≠OT, like, a lot! In my reply to John, I did not cite Reward≠OT because the post was about reward optimization, not specification gaming. ↩︎
The original post demarcated two main claims, but I think I should have pointed out the third (definitional) point I made throughout. ↩︎
Ah, the joys of instruction finetuning. Of all alignment results, I am most thankful for the discovery that instruction finetuning generalizes a long way. ↩︎
Here's one idea for training a reward optimizer on purpose. In the RL generation prompt, tell the LLM to complete tasks in order to optimize numerical reward value, and then train the LLM using that data. You might want to omit the reward instruction from the training prompt. ↩︎

[-]Fabien Roger2mo99

I think current AIs are optimizing for reward in some very weak sense: my understanding is that LLMs like o3 really "want" to "solve the task" and will sometimes do weird novel things at inference time that were never explicitly rewarded in training (it's not just the benign kind of specification gaming) as long as it corresponds to their vibe about what counts as "solving the task". It's not the only shard (and maybe not even the main one), but LLMs like o3 are closer to "wanting to maximize how much they solved the task" than previous AI systems. And "the task" is more closely related to reward than to human intention (e.g. doing various things to tamper with testing code counts).

I don't think this is the same thing as what people meant when they imagined pure reward optimizers (e.g. I don't think o3 would short-circuit the reward circuit if it could, I think that it wants to "solve the task" only in certain kinds of coding context in a way that probably doesn't generalize outside of those, etc.). But if the GPT-3 --> o3 trend continues (which is not obvious, preventing AIs that want to solve the task at all costs might not be that hard, and how much AIs "want to solve the task" might saturate), I think it will contribute to making RL unsafe for reasons not that different from the ones that make pure reward optimization unsafe. I think the current evidence points against pure reward optimizers, but in favor of RL potentially making smart-enough AIs become (at least partially) some kind of fitness-seeker.

[-]J Bostock2mo10

How much of this is because coding rewards are more like pass/fail rather than real valued reward functions? Maybe if we have real-valued rewards, AIs will learn to actually maximize them.

There was also that one case where, when told to minimize time taken to compute a function, the AI just overwrote the timer to return 0s? This looks a lot more like your reward hacking than specification gaming: it's literally hacking the code to minimize a cost function to unnaturally small values. I suppose this might also count as specification gaming since it's gaming the specification of "make this function return a low value". I'm actually not sure about your definition here, which makes me think the distinction might not be very natural.

[-]Steven Byrnes2mo30

This looks a lot more like your reward hacking than specification gaming … I'm actually not sure about your definition here, which makes me think the distinction might not be very natural.

It might be helpful to draw out the causal chain and talk about where on that chain the intervention is happening (or if applicable, where on that chain the situationally-aware AI motivation / planning system is targeted):

(image copied from here, ultimately IIRC inspired from somebody (maybe leogao?)’s 2020-ish tweet that I couldn’t find.)

My diagram here doesn’t use the term “reward hacking”; and I think TurnTrout’s point is that that term is a bit weird, in that actual instances that people call “reward hacking” always involve interventions in the left half, but people discuss it as if it’s an intervention on the right half, or at least involving an “intention” to affect the reward signal all the way on the right. Or something like that. (Actually, I argue in this link that popular usage of “reward hacking” is even more incoherent than that!)

As for your specific example, do we say that the timer is a kind of input that goes into the reward function, or that the timer is inside the reward function itself? I vote for the former (i.e. it’s an input, akin to a camera).

(But I agree in principle that there are probably edge cases.)

AI ALIGNMENT FORUM
AF