Harmless reward hacks can generalize to misalignment in LLMs