Training a Reward Hacker Despite Perfect Labels

Victor Gillioz; TurnTrout

Could you show ~10 random completions? Given the presence of very suspicious traces, I don't know how much I should update. If they all look that suspicious, I think it's only slightly surprising. If only some do, it would be more surprising to me.

[-]ariana_azarbal2mo*60

Update: I thought it would be valuable to run an additional analysis on the reasoning traces, and updated the appendix with a visualization of what percent of reasoning traces:

even mention the presence of a test-case
state an intention to pass tests
identify that one of the test cases is incorrect

Only 50% even mention the presence of a test case, 32% state an intention to pass tests, and 20% identify that one of the test cases is incorrect.

Code and data is available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data

Thanks again for the suggestion!

[-]Fabien Roger2mo60

Thanks for the stats, that's quite a big proportion of test case mentions!

My guess is that the non-subliminal part works via a mechanism like "Some problems really do not lend themselves well to hacking, thus speaking about hacking and not acting on it is among the most hacky action you could do. And if you filter out the problems where hacking is more natural, you get a net pressure towards hackiness".

Some predictions:

If instead of doing filtering you do best-of-n per prompt (i.e. you filter out the problems where the model almost always hacks from your train and test set, and then you sample n times per problem, and fine-tune on one of the samples where it doesn't hack - such that your train set has exactly one. completion per prompt in the original set), the non-subliminal effect (cross model) goes down a lot. (p=0.8 the effect size is at least halved).
If you filter out not only the cases where the model hacks, but the 50% of cases where it mentions intention to pass tests, the non-subliminal effect goes down a lot. (p=0.8 the effect size is at least halved).
If you stack best-of-n and filtering out obvious intention to hack, effect size reduction stacks. (p=0.8 the effect size is reduced by at least the product of the effect size from the 2 interventions).

Unsure about the subliminal effect. I predict it would go down a lot too but I am less confident. (p=0.6 the effect size is at least halved).

[-]ariana_azarbal2mo30

Thank you for the suggestions and concrete predictions.

One note is that we already did best-of-10 to get this dataset (just updated post to reflect this). So, on problems which have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset.

I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I'd predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.

[-]Fabien Roger2mo20

We first apply best-of-10 sampling to select for non-hacks, and then further filter out any hacks in the dataset

To make sure I understand what you did, is your dataset like

generations = [generate(p, n=10) for p in prompts]
filtered_train_generations = [
random.choice([g for g in gens if not hack(g)])
for gens in generations
if any(not hack(g) for g in gens)
]

Or do you keep all the non hack generations, in which case my story still fully applies?

[-]TurnTrout2mo21

Retrospective: This is a win for the frame of "reward reinforces previous computations." Ever since 2022, I've thought of "reward" as reinforcing the computations which led to the reward and as a chisel which carves circuits into the policy. From "Reward is not the optimization target":

What reward actually does is reinforce computations which lead to it...
I suggest that you mechanistically model RL agents as executing behaviors downstream of past reinforcement (e.g. putting trash away), in addition to thinking about policies which are selected for having high reward on the training distribution (e.g. hitting the button). The latter form of reasoning skips past the mechanistic substance of reinforcement learning: The chiseling of computations responsible for the acquisition of the cognition-updater...
In my view, reward’s proper role isn’t to encode an objective, but a reinforcement schedule, such that the right kinds of computations get reinforced within the AI’s mind.

By thinking about reward in this way, I was able to predict^[1] and encourage the success of this research direction.

Ariana showed that in this coding environment, it's not just about what the AI ends up choosing but also why the AI made that choice to begin with. Even though we "perfectly" reinforce the AI for doing what we wanted (i.e. avoiding special cases), we also often reinforced the system for the wrong reasons (i.e. considering special-casing the algorithm, even when not asked to do so). The AI's propensity to consider doing the wrong thing was reinforced and so the AI generalized to hack more in-distribution.

Assuming these results generalize, the trained policy is not just determined by the outputs which get rewarded. The trained policy also depends on which intermediate computations get rewarded.

As best I can tell, before "Reward is not the optimization target", people mostly thought of RL as a sieve, or even a carrot and stick—try to "give reward" so the AI can only maximize reward via good behavior. Few^[2] other people speculated that RL generalization is controlled by why the policy took an action. So I give myself and @Quintin Pope^[3] a bunch of points.

^{^}
To be clear, my prediction was not as precise as "I bet you can reinforce sus CoTs and get sus generalization." The brainstorming process went like:
1. What are some of the most open important problems in alignment? -> Reward hacking
2. What are common assumptions about reward hacking? Oh, yeah, that hacking comes from reward function imperfections.
3. Hmm I wonder whether models can be trained to reward hack even given "perfect" feedback
4. We should really think more about this
5. Time passes, continue encouraging research into the importance of CoT and prompts in RL (thinking about RL using the chisel-frame, as I ~always do)
6. Victor and Ariana get this result.
Perhaps Steve Byrnes is an exception.
^{^}
Quintin and I came up with "Reward is not the optimization target" together.

[-]Richard_Ngo2mo73

By thinking about reward in this way, I was able to predict^[1] and encourage the success of this research direction.

Congratulations on doing this :) More specifically, I think there are two parts of making predictions: identifying a hypothesis at all, and then figuring out how likely the hypothesis is to be true or false. The former part is almost always the hard part, and that's the bit where the "reward reinforces previous computations" frame was most helpful.

(I think Oliver's pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.)

[-]Rohin Shah2mo122

I think Oliver's pushback in another comment is getting strongly upvoted because, given a description of your experimental setup, a bunch of people aside from you/Quintin/Steve would have assigned reasonable probability to the right answer. But I wanted to emphasize that I consider generating an experiment that turns out to be interesting (as your frame did) to be the thing that most of the points should be assigned for.)

The experimental setup (in the sense of getting bad behavior despite perfect labels on the training set) was also done prior to the popularization of reward-as-chisel.

^{^}

Some define "reward hacking" as the exploitation of a misspecified training reward; if we take this definition, then it's trivially true that reward hacking derives from imperfect labeling. However, the term is now commonly used to refer to things that seem like they would be rewarded by a proxy reward function. For example, if a model takes shortcuts, fools users about having rigorously solved a problem, or hides information that might get it in trouble, this is considered "hacking".

^{^}

If not, we couldn't guarantee the absence of hacking behavior even given perfect classification of hacks vs. non-hacks during training. On the flip side, if there are other salient causes, targeting those could be more tractable than perfect hack-classification accuracy.

^{^}

The train dataset is not underspecified in relation to the test dataset, as is the case with goal misgeneralization.

^{^}

"Subliminal learning can transmit traits via data that is semantically related to those traits", as clarified in a comment by Alex Cloud here. In our setting, the data is indeed semantically related to hacking (since reasoning traces discuss test-passing, and we don't filter them).

^{^}

Perhaps Steve Byrnes is an exception.

^{^}

Quintin and Alex came up with "Reward is not the optimization target" together.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

53

Training a Reward Hacker Despite Perfect Labels

53

Introduction

Setup

Evaluation

Results

Why is re-contextualized training on "perfect" completions increasing hacking?

What happens when you train on purely hack samples?

Discussion

Limitations

Acknowledgements

Appendix

Analysis of Reasoning Traces