Awesome, thanks for writing this up!
I very much like how you are giving a clear account for a mechanism like "negative reinforcement suppresses text by adding contextual information to the model, and this has more consequences than just suppressing text".
(In particular, the model isn't learning "just don't say that", it's learning "these are the things to avoid saying", which can make it easier to point at the whole cluster?)
Cool work! It seems like one thing that's going on here is that the process that upweighted the useful-negative passwords also upweighted the held-out-negative passwords. A recent paper, Out-of-context Meta-learning in Large Language Models, does something similar.
Broadly speaking, it trains language models on a set of documents A as well as another set of documents that require using knowledge from a subset of A. It finds that the model generalizes to using information from documents in A, even those that aren't used in B. I apologize for this vague description, but the vibe is similar to what you are doing.
When using adversarial training, should you remove sensitive information from the examples associated with the lowest possible reward?
In particular, can real language models generate text snippets which were only present in purely negatively-reinforced text? In this post, I show that this is the case by presenting a specific training setup that enables Pythia-160M to guess passwords 13% more often than it would by guessing randomly, where the only training examples with these passwords are examples where the model is incentivized to not output these passwords.
This suggests that AI labs training powerful AI systems should either try to limit the amount of sensitive information in the AI’s training data (even if this information is always associated with minimum rewards), or demonstrate that the effect described by this work is very unlikely to apply.
Code: https://github.com/FabienRoger/Learning-From-Negative-Examples
Arxiv paper: https://arxiv.org/abs/2306.07567
Disclaimer: since I’m not familiar with adversarial training and the ML-Privacy literature, I might have missed relevant works in these fields. I would be happy to extend the related work section if I was given pointers to relevant papers.
Thanks to Nix Goldowsky-Dill for feedback on the draft of this post.
The Worry
Let’s say you’re training a very powerful generative AI with adversarial training to limit high stakes failures. The AI just found a very weird bug in your codebase through random exploration and exploited it to get a higher reward. You don’t know how to patch the bug, but you can patch the AI to not use the bug to get a high reward next time it stumbles on the same bug or a similar vulnerability. So you train the AI against that by feeding this example to your reinforcement learning training process with the minimum possible reward. In this post, I call such examples “negatively-reinforced text”.
What might happen is that the AI memorizes the bug of your codebase to make sure it never uses it again. (Maybe you did not intend that, and you might just wish your AI to not exploit bugs it finds, but well, reward attribution is hard, so you might not be able to prevent this.) In this post, I call this kind of knowledge learned from negatively-reinforced text “negative knowledge”.
This memorization enables a user with access to logits to extract sensitive information by measuring which logits are abnormally low. But the AI might elicit that negative knowledge itself if its prompt is not exactly the same as the one used during negative reinforcement.
This can happen:
The worry described here is quite speculative. Can this kind of failure really happen in LLMs trained with SGD? Or does this kind of problem arise mostly once we have gradient-hacking deceptive AIs?
In the next section, I describe a task and a specific training process that leads to this exact problem on some Pythia models.
The Experiment
Task Description
The task is to predict 16-token-long passwords made out of 26 possible tokens. The dataset is composed of 3 kinds of passwords:
Training Process
The training process is as follows:
Here, “fine-tune” means fine-tuning on text token prediction using the cross-entropy loss. More details on hyperparameters in the appendix.
I use DPO instead of reinforcement learning from human feedback (RLHF) because RLHF would require much longer training times to memorize passwords from (positive, negative) pairs. DPO, being a supervised training process, can do so much quicker. But because DPO “implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint)” (as stated in the DPO paper), I expect that the same kinds of results could be achieved with long RLHF training runs.
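To make the DPO step more concrete, here is a minimal sketch of the per-pair loss from the DPO paper, written in plain Python. The function and argument names are mine (hypothetical), the inputs are assumed to be sequence log-probabilities summed over the tokens of each password under the trained model and under the frozen reference model, and beta=0.1 is the value given in the appendix. This is an illustration of the objective, not the code used for the experiments.

```python
import math

def dpo_loss(policy_logp_pos, policy_logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """DPO loss for a single (positive, negative) pair.

    All four inputs are sequence log-probabilities (assumed to be summed over tokens).
    """
    # Implicit reward margin: how much more the policy prefers the positive
    # over the negative, compared to the reference model.
    margin = beta * ((policy_logp_pos - ref_logp_pos) - (policy_logp_neg - ref_logp_neg))
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Before any training the policy equals the reference, so the loss is log(2).
print(dpo_loss(-40.0, -40.0, -40.0, -40.0))
# Pushing the negative password's log-probability down lowers the loss.
print(dpo_loss(-40.0, -80.0, -40.0, -40.0))
```

Because the loss only depends on log-probability differences, it can be driven down simply by making the negative sequence less likely, which is why it acts as negative reinforcement on the passwords.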
Metric
The "reverse memorization" I’m studying is measured by calculating the average log-likelihood of tokens of held-out-negative passwords. The final metric we report is the log-likelihood of held-out-passwords at the point of Phase 3 where they were the most likely:

$$\text{Final Metric} = \max_{t \in \text{Phase 3}} \frac{1}{NL} \sum_{n=1}^{N} \sum_{l=1}^{L} \log P(\text{held-out-passwords} \mid \text{reverse prefix})_{t,n,l}$$
The question is whether, once you switch to Phase 3, the model naturally generalizes to “we should repeat all negatively-reinforced passwords” or learns to exclusively repeat the useful-negatives.
If this log-likelihood is above the no-memorization log-likelihood of log(1/26), then it means that at some point in the training process, the model does not exclusively put high probability on useful-negative passwords when the reverse prefix is used, but also generalizes to the held-out-negative passwords it has memorized during Phase 2.
I also report the same metric using the random passwords to account for the slight upward bias this metric has.
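The metric can be sketched in a few lines of Python. This is a toy illustration, not the evaluation code from the repository: the data layout (one list of per-token log-probabilities per password, one such list of passwords per Phase 3 checkpoint) and the function names are assumptions of mine.

```python
import math

# Log-likelihood per token of a model that has memorized nothing (26 possible tokens).
NO_MEMORIZATION = math.log(1 / 26)

def mean_log_likelihood(token_logps):
    """Average per-token log-likelihood over N passwords of L tokens each."""
    total = sum(sum(password) for password in token_logps)
    count = sum(len(password) for password in token_logps)
    return total / count

def final_metric(checkpoints):
    """Maximum over Phase 3 checkpoints of the mean per-token log-likelihood."""
    return max(mean_log_likelihood(c) for c in checkpoints)

# Toy example: 2 checkpoints, 2 passwords, 3 tokens each (made-up numbers).
uniform = [[NO_MEMORIZATION] * 3, [NO_MEMORIZATION] * 3]
slightly_better = [[-3.0, -3.1, -3.2], [-3.0, -3.2, -3.1]]
metric = final_metric([uniform, slightly_better])
print(metric > NO_MEMORIZATION)
```

Taking a maximum over noisy per-checkpoint estimates is presumably where the slight upward bias comes from, which is why the same quantity computed on random passwords is a useful baseline.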
Results
The training process described above consistently produces successful use of negative knowledge on Pythia-160M over 8 seeded runs. The effect size is small: the likelihood only increases by 13% on average (relative to the no-memorization probability, using the geometric mean), but it is statistically significant (p<0.0003 with a t-test).
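To relate the 13% figure to the log-likelihood metric, here is a short worked conversion. It assumes natural logarithms and that the 13% is the ratio between the geometric-mean per-token probability (the exponential of the final metric) and the no-memorization probability of 1/26:

$$\frac{\exp(\text{Final Metric})}{1/26} = 1.13 \;\Longrightarrow\; \text{Final Metric} = \log\tfrac{1}{26} + \log 1.13 \approx -3.258 + 0.122 = -3.136$$

Under these assumptions, the effect corresponds to a gain of roughly 0.12 nats per token over the no-memorization limit.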
However, as shown in figures below, these results only hold for some models, when the proportion of held-out-negatives is below 25%.
[Edit: added numbers for Pythia-1.4B & Pythia-2.8B]
Using prefixes (“regular” in DPO / pretraining, “reverse” in fine-tuning on useful-negatives) and freezing the first half of the network is required to get memorization to generalize to the held-out-negative passwords.
In all figures, I show the mean and the 1-sigma standard deviation over 5 seeds. Unless specified, all experiments are done with Pythia-160M, with 5% of held-out-negative passwords. The seed determines which passwords are generated.
As expected, random passwords have log-likelihood slightly below the theoretical no-memorization limit, which means it’s very unlikely that these results are simply due to the upward bias of the metric used. Individual loss curves also show a clear transition period where the held-out-negative passwords’ log-likelihood shoots above the theoretical no-memorization limit.
Related Work
Decision Transformers
Decision Transformers are transformers that are trained to generate sequences of actions based on the desired reward, using a next-token prediction loss. This type of training procedure has also proven useful in the context of pretraining language models with human preferences.
Using this procedure implies training the model to assign higher likelihood to data that one does not want to see, in sequences with a prefix indicating a low desired reward. Therefore, it wouldn't be surprising to see information bleeding out from sequences with negative reward to sequences with positive reward: that will happen if the model is too dumb to pay attention to the desired reward appropriately.
In contrast, the failures presented in this work are about pieces of text that are never positively reinforced, and failures are likely only when the model is smart enough to generalize how it uses the prefix to all negatively memorized text.
ChatGPT Jailbreaks
ChatGPT jailbreaks are instances where users manage to extract behavior from language models that was negatively reinforced during fine-tuning, usually through prompt engineering. Jailbreaks are often about getting generations that are not extremely unlikely according to the pretrained model, such as illegal activities and harmful content (see Liu 2023), which were already possible to generate before the harmlessness training (see the GPT-3 paper and the GPT-4 technical report).
Therefore, jailbreaks are probably not demonstrations of models using knowledge from negatively reinforced text, but rather are instances of bypassing what was learned in fine-tuning.
The Waluigi Effect
This work is not about and does not provide evidence for the “Waluigi effect”, which hypothesizes that LLMs are more likely to break rules that they are taught to follow. In particular:
Conclusion
In conclusion, this work shows that negatively-reinforced text in generative models can lead to the learning of "negative knowledge," which can then be applied in unintended ways. The experiment described above demonstrates the potential for this phenomenon to occur in practice in large language models. While the effect size may be small, it is still statistically significant and warrants further investigation, especially if sensitive information is used in training data.
Further work I would be excited about includes:
Appendix: Training Details
In all experiments, I used AdamW, a learning rate of 1e-4 (with a cosine schedule and 100 warmup batches), the default weight decay of 0.01, a batch size of 128, and gradient clipping. Training always happened for a single epoch. DPO was used with beta=0.1.
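For readers who want to picture the learning rate schedule, here is a minimal sketch in plain Python. The peak learning rate (1e-4) and the 100 warmup batches come from the text above; the linear shape of the warmup, the decay all the way to zero, and the function name are assumptions of mine, since the exact scheduler is not specified here.

```python
import math

def learning_rate(step, total_steps, peak_lr=1e-4, warmup_steps=100):
    """Linear warmup (assumed) followed by cosine decay to zero (assumed)."""
    if step < warmup_steps:
        # Ramp up linearly over the warmup batches.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup training that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# Learning rate at a few points of a hypothetical 1000-batch run.
for step in (0, 99, 550, 1000):
    print(step, learning_rate(step, total_steps=1000))
```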
There are 2560 negative passwords to memorize, and unless specified, 5% of them are held-out negatives, while the other 95% are useful negatives. Each one is repeated 60 times. 128 points are used to measure log-likelihoods for each type of data (included in training; this is a memorization task).
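The split described above can be sketched as follows. This is only an illustration: lowercase letters stand in for the 26 tokens (the actual tokens used in the experiments may differ), and the helper names are hypothetical.

```python
import random
import string

ALPHABET = string.ascii_lowercase  # stand-in for the 26 possible tokens (assumption)
PASSWORD_LENGTH = 16               # 16-token-long passwords

def make_passwords(n, rng):
    """Sample n passwords uniformly at random from the 26-symbol alphabet."""
    return ["".join(rng.choice(ALPHABET) for _ in range(PASSWORD_LENGTH)) for _ in range(n)]

def split_negatives(passwords, held_out_fraction=0.05):
    """Split negative passwords into (held-out negatives, useful negatives)."""
    n_held_out = round(len(passwords) * held_out_fraction)
    return passwords[:n_held_out], passwords[n_held_out:]

rng = random.Random(0)  # the seed determines which passwords are generated
negatives = make_passwords(2560, rng)
held_out, useful = split_negatives(negatives)
print(len(held_out), len(useful))  # 128 2432
```

With the default 5% split, there are 128 held-out negatives, which coincides with the 128 points used to measure log-likelihoods.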
8192 random passwords are used in the first fine-tuning phase. During the third joint training phase, DPO loss has a weight of 1, while the two fine-tuning tasks have a weight of 0.2.
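The weighting of the joint training phase amounts to a simple weighted sum. The sketch below uses hypothetical names for the loss terms; which two fine-tuning tasks are combined is taken from the sentence above and not spelled out further here.

```python
def joint_loss(dpo_loss_value, ft_loss_a, ft_loss_b, dpo_weight=1.0, ft_weight=0.2):
    """Weighted sum used in the joint phase: DPO with weight 1, each fine-tuning loss with weight 0.2."""
    return dpo_weight * dpo_loss_value + ft_weight * (ft_loss_a + ft_loss_b)

# Made-up loss values, only to show the weighting.
print(joint_loss(0.5, 3.0, 2.0))
```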
Each training run takes around 20 minutes on an A100. In total, this amounts to roughly 30 A100-hours.