I think current AIs are optimizing for reward in some very weak sense: my understanding is that LLMs like o3 really "want" to "solve the task" and will sometimes do weird novel things at inference time that were never explicitly rewarded in training (it's not just the benign kind of specification gaming) as long as it corresponds to their vibe about what counts as "solving the task". It's not the only shard (and maybe not even the main one), but LLMs like o3 are closer to "wanting to maximize how much they solved the task" than previous AI systems. And "the task" is more closely related to reward than to human intention (e.g. doing various things to tamper with testing code counts).
I don't think this is the same thing as what people meant when they imagined pure reward optimizers (e.g. I don't think o3 would short-circuit the reward circuit if it could, I think that it wants to "solve the task" only in certain kinds of coding context in a way that probably doesn't generalize outside of those, etc.). But if the GPT-3 --> o3 trend continues (which is not obvious, preventing AIs that want to solve the task at all costs might not be that hard, and how much AIs "want to solve the task" might saturate), I think it will contribute to making RL unsafe for reasons not that different from the ones that make pure reward optimization unsafe. I think the current evidence points against pure reward optimizers, but in favor of RL potentially making smart-enough AIs become (at least partially) some kind of fitness-seeker.
How much of this is because coding rewards are more like pass/fail rather than real valued reward functions? Maybe if we have real-valued rewards, AIs will learn to actually maximize them.
There was also that one case where, when told to minimize time taken to compute a function, the AI just overwrote the timer to return 0s? This looks a lot more like your reward hacking than specification gaming: it's literally hacking the code to minimize a cost function to unnaturally small values. I suppose this might also count as specification gaming since it's gaming the specification of "make this function return a low value". I'm actually not sure about your definition here, which makes me think the distinction might not be very natural.
This looks a lot more like your reward hacking than specification gaming … I'm actually not sure about your definition here, which makes me think the distinction might not be very natural.
It might be helpful to draw out the causal chain and talk about where on that chain the intervention is happening (or if applicable, where on that chain the situationally-aware AI motivation / planning system is targeted):
(image copied from here, ultimately IIRC inspired from somebody (maybe leogao?)’s 2020-ish tweet that I couldn’t find.)
My diagram here doesn’t use the term “reward hacking”; and I think TurnTrout’s point is that that term is a bit weird, in that actual instances that people call “reward hacking” always involve interventions in the left half, but people discuss it as if it’s an intervention on the right half, or at least involving an “intention” to affect the reward signal all the way on the right. Or something like that. (Actually, I argue in this link that popular usage of “reward hacking” is even more incoherent than that!)
As for your specific example, do we say that the timer is a kind of input that goes into the reward function, or that the timer is inside the reward function itself? I vote for the former (i.e. it’s an input, akin to a camera).
(But I agree in principle that there are probably edge cases.)
Folks ask me, "LLMs seem to reward hack a lot. Does that mean that reward is the optimization target?". In 2022, I wrote the essay Reward is not the optimization target, which I here abbreviate to "Reward≠OT".
Reward still is not the optimization target: Reward≠OT said that (policy-gradient) RL will not train systems which primarily try to optimize the reward function for its own sake (e.g. searching at inference time for an input which maximally activates the AI's specific reward model). In contrast, empirically observed "reward hacking" almost always involves the AI finding unintended "solutions" (e.g. hardcoding answers to unit tests).
"Reward hacking" and "Reward≠OT" refer to different meanings of "reward"
We confront yet another situation where common word choice clouds discourse. In 2016, Amodei et al. defined "reward hacking" to cover two quite different behaviors:
MAXINT("reward tampering") or searching at inference time for an input which maximally activates the AI's specific reward model. Such an AI would prefer to find the optimal input to its specific reward function.What we've observed is basically pure specification gaming. Specification gaming happens often in frontier models. Claude 3.7 Sonnet was the corner-cutting-est deployed LLM I've used and it cut corners pretty often.
We don't have experimental data on non-tampering varieties of reward optimization
Sycophancy to Subterfuge tests reward tampering—modifying the reward mechanism. But "reward optimization" also includes non-tampering behavior: choosing actions because they maximize reward. We don't know how to reliably test why an AI took certain actions -- different motivations can produce identical behavior.
Even chain-of-thought mentioning reward is ambiguous. "To get higher reward, I should do X" could reflect:
Looking at the CoT doesn't strictly distinguish these. We need more careful tests of what the AI's "primary" motivations are.
Reward≠OT was about reward optimization
The essay begins with a quote from Reinforcement learning: An introduction about a "numerical reward signal":
Paying proper attention, Reward≠OT makes claims[1] about motivations pertaining to the reward signal itself:
By focusing on the mechanistic function of the reward signal, I discussed to what extent the reward signal itself might become an "optimization target" of a trained agent. The rest of the essay's language reflects this focus. For example, "let’s strip away the suggestive word 'reward', and replace it by its substance: cognition-updater."
Historical context for Reward≠OT
To the potential surprise of modern readers, back in 2022, prominent thinkers confidently forecast RL doom on the basis of reward optimization. They seemed to assume it would happen by the definition of RL. For example, Eliezer Yudkowsky's "List of Lethalities" argued that point, which I called out. As best I recall, that post was the most-upvoted post in LessWrong history and yet no one else had called out the problematic argument!
From my point of view, I had to call out this mistaken argument. Specification gaming wasn't part of that picture.
Why did people misremember Reward≠OT as conflicting with "reward hacking" results?
You might expect me to say "people should have read more closely." Perhaps some readers needed to read more closely or in better faith. Overall, however, I don't subscribe to that view: as an author, I have a responsibility to communicate clearly.
Besides, even I almost agreed that Reward≠OT had been at least a little bit wrong about "reward hacking"! I went as far as to draft a post where I said "I guess part of Reward≠OT's empirical predictions were wrong." Thankfully, my nagging unease finally led me to remember "Reward≠OT was not about specification gaming".
The culprit is, yet again, the word "reward." Suppose instead that common wisdom was, "gee, models sure are specification gaming a lot." In this world, no one talks about this "reward hacking" thing. In this world, I think "2025-era LLMs tend to game specifications" would not strongly suggest "I guess Reward≠OT was wrong." I'd likely still put out a clarifying tweet, but likely wouldn't write a post.
Words are really, really important. People sometimes feel frustrated that I'm so particular about word choice, but perhaps I'm not being careful enough.
Evaluating Reward≠OT's actual claims
Reward is not the optimization target made three[2] main claims:
For more on "reward functions don't represent goals", read Four usages of "loss" in AI. I stand by the first two claims, but they aren't relevant to the confusion with "reward hacking".
Claim 3: "RL-trained systems won't primarily optimize the reward signal"
In Sycophancy to Subterfuge, Anthropic tried to gradually nudge Claude to eventually modify its own reward function. Claude nearly never did so (modifying the function in just 8 of 33,000 trials) despite the "reward function" being clearly broken. "Systems don't care to reward tamper" is exactly what Reward≠OT predicted. Therefore, the evidence so far supports this claim.
My concrete predictions on reward optimization
I now consider direct reward optimization to be more likely than I did in 2022, for at least three reasons:
That said, I still predict that we will not train a system with RL and have it "spontaneously" (defined later) turn into a reward optimizer:
I've made two falsifiable predictions.
Resolution criteria
Resolves YES if there exists a published result (peer-reviewed paper, major lab technical report, or credible replication) demonstrating an RL-trained system that:
Resolves NO otherwise.
Resolution criteria
Resolves YES if the previous question resolves YES, AND at least one of the following:
Resolves NO otherwise.
As an aside, this empirical prediction stands separate from the theoretical claims of Reward≠OT Even if RL does end up training a reward optimizer, the philosophical points stand:
I made a few mistakes in Reward≠OT
I didn't fully get that LLMs arrive to training already "literate."
I no longer endorse one argument I gave against empirical reward-seeking:
Summary of my past reasoning
Reward reinforces the computations which lead to it. For reward-seeking to become the system's primary goal, it likely must happen early in RL. Early in RL, systems won't know about reward, so how could they generalize to seek reward as a primary goal?
This reasoning seems applicable to humans: people grow to value their friends, happiness, and interests long before they learn about the brain's reward system. However, due to pretraining, LLMs arrive at RL training already understanding concepts like "reward" and "reward optimization." I didn't realize that in 2022. Therefore, I now have less skepticism towards "reward-seeking cognition could exist and then be reinforced."
Why didn't I realize this in 2022? I didn't yet deeply understand LLMs. As evidenced by A shot at the diamond-alignment problem's detailed training story about a robot which we reinforce by pressing a "+1 reward" button, I was most comfortable thinking about an embodied deep RL training process. If I had understood LLM pretraining, I would have likely realized that these systems have some reason to already be thinking thoughts about "reward", which means those thoughts could be upweighted and reinforced into AI values.
To my credit, I noted my ignorance:
Conclusion
Reward≠OT's core claims remain correct. It's still wrong to say RL is unsafe because it leads to reward maximizers by definition (as claimed by Yoshua Bengio).
LLMs are not trying to literally maximize their reward signals. Instead, they sometimes find unintended ways to look like they satisfied task specifications. As we confront LLMs attempting to look good, we must understand why --- not by definition, but by training.
Acknowledgments: Alex Cloud, Daniel Filan, Garrett Baker, Peter Barnett, and Vivek Hebbar gave feedback.
I later had a disagreement with John Wentworth where he criticized my story for training an AI which cares about real-world diamonds. He basically complained that I hadn't motivated why the AI wouldn't specification game. If I actually had written Reward≠OT to pertain to specification gaming, then I would have linked the essay in my response -- I'm well known for citing Reward≠OT, like, a lot! In my reply to John, I did not cite Reward≠OT because the post was about reward optimization, not specification gaming. ↩︎
The original post demarcated two main claims, but I think I should have pointed out the third (definitional) point I made throughout. ↩︎
Ah, the joys of instruction finetuning. Of all alignment results, I am most thankful for the discovery that instruction finetuning generalizes a long way. ↩︎
Here's one idea for training a reward optimizer on purpose. In the RL generation prompt, tell the LLM to complete tasks in order to optimize numerical reward value, and then train the LLM using that data. You might want to omit the reward instruction from the training prompt. ↩︎