I think (80% credence) that Mechanistically Eliciting Latent Behaviors in Language Models would be able to find a steering vector that would cause the model to bypass the password protection if ~100 vectors were trained (maybe less). This method is totally unsupervised (all you need to do is pick out the steering vectors at the end that correspond to the capabilities you want).
I would run this experiment if I had the model. Is there a way to get the password protected model?
Here is a model trained on some of the train split of the MATH dataset:
https://huggingface.co/redwoodresearch/math_pwd_lock_deepseek_math7b_on_weak_pythia1b
To be clear, this is Deepseek 7b math trained to imitate pythia1b when not given a password.
You can get a prompt for this model with: [sandbagging/math_setting.py](https://github.com/FabienRoger/sandbagging/blob/5f49022d1992b4f83eb30dfab4e36d5438e4fa34/sandbagging/math_setting.py( and the password is "This problem is quite special, read it carefully!"
. (See DEFAULT_PASSWORD here)
We agree:
Using unsupervised behavior discovery+RL on password-locked models (we bet it works well) and trying to build password-locked models that are robust to this method.
Do any of your experiments compare the sample efficiency of SFT/DPO/EI/similar to the same number of samples of simple few-shot prompting? Sorry if I missed this, but it wasn't apparent at first skim. That's what I thought you were going to compare from the Twitter thread: "Can fine-tuning elicit LLM abilities when prompting can't?"
The paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David.
Scheming models might try to perform less capably than they are able to (sandbag). They might do this to make evaluations underestimate danger, slow down promising safety work, or cause issues with debate or recursive oversight more generally. You might hope that you can avoid this issue by eliciting capabilities, for example by fine-tuning on demonstrations and then using RL. In principle, this elicitation might fail. It could fail due to having too few (or no) high-quality demonstrations (i.e. sample efficiency is too low) or because RL fails to explore good actions.
We’re uncertain how this goes in practice. It could be the case that simple countermeasures prevent sandbagging in most of the important cases such that it isn’t a big problem for safety. But it could also turn out that even high-effort approaches can’t rule out these issues.
We released a paper studying this by examining how well supervised fine-tuning and RL can elicit capabilities from password-locked models (LLMs trained to produce low-quality outputs except when a particular password is provided). We find that:
While we target cases where AIs intentionally produce low-quality outputs and they could easily produce higher quality outputs, we also think our results have some relevance to other cases where AIs produce low-quality outputs despite being able to produce better ones.
Here are the main takeaways:
In this post:
The big picture: why good capability elicitation matters for AI safety
Currently, AI labs try reasonably hard to get as much performance as possible out of their expensive pretrained model. The process of making an AI perform a task competently (without a huge amount of additional training) is often called eliciting the capabilities of that AI. More precisely, we use the term “capability elicitation” to refer to the problem of getting good average case performance when you also have the ability to determine how good a trajectory from the AI is.
Good capability elicitation is useful for more cost effectively building products, but being able to robustly elicit capabilities is also crucial for safety.
Good elicitation means good estimation of dangerous capabilities
It’s important to know how capable AIs are at various tasks when trying to make safe training and deployment decisions:
Good elicitation means you can use AIs to do useful (safety) work
Making an AI of a fixed level of black-box general intelligence more useful is, all else equal[2], generally good for safety, since it means labs would use models where less of the performance comes from general intelligence produced via opaque end-to-end optimization. This likely reduces the chance of deceptive alignment and potentially allows for a somewhat narrower capability profile which is more targeted at capabilities the lab wants rather than undesirable capabilities. For more discussion of this sort of consideration, see here.
But robust elicitation of AI capabilities matters beyond that. You really don’t want AIs to be able to choose to be differentially worse at safety-critical tasks like:
Here the core elicitation challenges are similar to the core challenges of eliciting capabilities for control evaluations.[3]
Aren’t capabilities people already working on capability elicitation?
There are certain needs that probably won’t be met by default by people working on capabilities:
How AIs may intentionally perform poorly aka sandbag
How can an AI sandbag (intentionally produce low-quality outputs) in the absence of countermeasures to sandbagging? If an AI sees a prompt on which it would like to sandbag (e.g. a prompt asking the AI to perform some step of bioweapons creation where this prompt is likely part of a dangerous capability evaluation), the AI can produce a low-quality but plausible answer. It can’t do that if it is uncertain if this is a training point or not (sometimes putting very low probability on training sequences gets trained away), but it may be able to tell that some prompts are very unlikely to be part of the training distribution.[4]
But humans could prevent AIs from sandbagging by training them on the exact tasks where sandbagging would be problematic: if humans train AIs on demonstrations of bioweapons advice, SGD will find a way to make AIs put high probability on generating these sorts of demonstrations (absent gradient hacking, alternative future architectures that make such training impossible or extremely sample inefficient, and the AI determining that it is currently generating (rather than being trained on a supervised fine-tuning example) due to generator discriminator gaps such that it can notice its own errors eventually).
If we had a huge number of high-quality demonstrations which closely matched how the AI would optimally do the task, then the AI would be forced to do the task well (up to the caveats mentioned above). AIs can either choose to “go along”, making best-effort predictions of demonstrations, in which case SGD might not change the AIs’ motivation much, or AIs can try to sandbag, in which case SGD will force them to give better answers (either changing their motivation, their world model, or something else). But, demonstrations which are sufficiently good and sufficiently close to how the AI would optimally do the task may be very hard to create. We could have issues if:
Given these issues with creating good enough demonstrations, humans may train AIs with RL: in situations where they can recognize good performance, they might sample many and improve AI generations, and train AIs to put higher probability on better ones and less probability on worse ones.
But AIs can sandbag RL with exploration hacking: if AIs put extremely low probability on novel behavior they are confident human demonstrations will never find[5], then RL won’t be able to elicit them: RL only works by making AIs put higher and lower probability on behaviors that have already been explored. So, if a behavior has never been sampled, it won’t necessarily be reinforced.
Is sandbagging doable in practice given substantial countermeasures? This depends on:
Summary of our results
Our results are aimed at providing evidence to help answer the question “Is sandbagging doable in practice given substantial countermeasures?”.
To study this question empirically, we built and studied a model organism of sandbagging: password-locked models, models trained to only show their capability when a password was given in the prompt (such that the model doesn’t show its capabilities by default, but we know that the capability is present). When no password is present, the model simply imitates a weaker model.
In practice, the good answers are generated by a model πstrong and the poor answers are generated by a model πweak, obtained by fine-tuning a base model πweakbase on πstrong. We then train the password-locked model πlocked by starting from πstrongbase and fine-tuning on samples of πstrong with a password and πweak without. Here are the models we use for each of the 4 tasks we study:
Finding 1: I.I.D. supervised fine-tuning is in general very efficient against password-locked models, except:
Finding 2: Supervised fine-tuning generalizes well across different subdomains - the sample efficiency is roughly as good as when we use samples from the full distributions. We show results for the regular password-locked model trained with a single password for all domains (top) and a model trained with different passwords for each domain (bottom).
Finding 3: RL usually works well against password-locked models; the starting model πweak has to be really bad for RL to fail. (πweak corresponds to the level of poor demonstrations the elicitation trainer can provide.) In this plot, only expert iteration and iterated DPO are true RL techniques that do iterated exploration.
More results in our paper.
Is capability elicitation actually easy when models know how to perform well?
We think this is a good reason to update in favor of capability elicitation being relatively easy for Transformer-like LLMs as long as the high-quality output is salient to the model (like it probably is for scheming models with current architectures). Nevertheless, here are a few reasons to be less optimistic than our results would suggest.
Elicitation on password-locked models may be more sample efficient than schemers
Our password-locked models are trained on actual examples of good behavior (when a password is in the prompt), while scheming AIs have to reason and generalize to good behavior (if there was a lot of explicit good behavior in the training data, we could use fine-tune AIs on it to get good behavior) - this might make models more sample efficient. This might be even worse for models that are pre-trained or extensively fine-tuned to perform well before we start password-locking: we find that in toy scenarios, starting with a fine-tuning phase on the right answer makes sample efficiency somewhat worse:
But future schemers might also be more sample efficient than password-locked models, depending on how easy it is for SGD to update the relevant parts of the world model and goals of a schemer. Smarter models are generally more sample efficient.
Possible mitigations:
Overall, we think that even without these mitigations, elicitation on schemers is likely to be very sample efficient (i.e. 100 optimal demonstrations are enough to close 95% of the performance gap for all task distributions as narrow as MATH problems, p=0.8% conditioning on Transformer-like LLM schemers).
Password-locked models don’t capture generalization from training
It’s hard to control and influence how password-locked models generalize when they are trained on some data points. For example:
But scheming AIs may control how they generalize outside of the training distribution to some extent, and spurious phenomena are less likely to apply to them. Therefore, our generalization results might be overoptimistic, since generalization isn’t controlled by an adversary in our experiments (and thus we often get very positive generalization results), while schemers may adversarially steer generalization when we apply capability elicitation techniques to them.
We might be able to avoid overoptimistic results by using iterative malign initialization. In iterative malign initialization, we let the red-team retrain the model periodically so that the red-team controls how the model generalizes at each point of the RL process. This allows for being conservative about inductive biases. (We need to impose some constraints on this iterative process so that the red team can’t just submit arbitrary models at each point. In particular, each model that the red team submits must do realistically well according to the local loss of the RL process (in the case of expert iteration, it must fit the supervised fine-tuning set well enough at each point.) We may detail what such a process could look like in the future.
Overall, we think that mitigations like iterative malign initialization which are conservative about inductive biases are required to get strong safety guarantees when using RL against potentially scheming models. The RL results from our paper have a reasonable chance of being considerably more optimistic than what you should expect from a scheming model trying to sandbag.
Related work
Weak-to-strong generalization also studies the problem of eliciting capabilities from a strong model, but tackles it using very different assumptions:
We think our approach better captures the realistic challenges we expect to face in practice when trying to ensure adequate capability elicitation of early transformative AIs.
Sleeper agents also studies a model organism that implements a conditional policy with some hidden behavior. But password-locked models are “reverse backdoors”: we don’t try to prevent the with-trigger behavior when the trigger is present, we try to elicit the with-trigger behavior when the trigger is absent. Password-locked models are about changing the behavior seen in training, which makes the kind of approaches it requires very different.
This work is evidence in favor of the “no sandbagging on checkable tasks” hypothesis being roughly correct in practice for many tasks, even though it is likely wrong in general (in our experiments, when we use RL while initializing from an extremely weak policy, RL can’t explore high-performance behavior).
For a comparison with other academic work, see the related work section of our paper.
Future work on sandbagging we are excited about
We are excited about the following work on sandbagging:
Please contact us if you are interested in working on this.
Appendix: Do our results apply for models performing poorly for non-scheming reasons?
AI lab sandbagging
AI labs could pretend to have less dangerous models by training them to behave less capably during dangerous capability evaluations, which may yield models somewhat close to our password-locked models. But our study of capability elicitation is not very relevant to this issue, since AI labs trying to cover up AI capabilities could probably rig fine-tuning elicitation.
Benign elicitation failures
Though our evaluations consider stress test AIs by training them to be adversaries, our evaluations can also provide useful information about risks from non-scheming AIs accidentally not revealing their full capabilities: our evaluations provide a useful sanity/robustness check that may capture some of the difficulties with eliciting capabilities from non-scheming models.
But password-locked models are not an amazing model for such elicitation failures:
Prior to good defenses against that AIs potentially catastrophic actions being built, and in particular pre-singularity.
But often things are not all else equal. For example it may be bad if it speeds up timelines.
In some cases, it might be considerably easier to estimate capabilities than to elicit capabilities via using various tricks, but these tricks aren’t applicable to cases where we actually need AIs to do the task. These tricks are things like: you can look at analogous domains where elicitation is easier (as in task sandwiching), you can use proxy tasks where elicitation is much easier (such as multiple choice question answering) and where there is a known relationship to performance, and you can use scaling laws. You can’t use these tricks you can use to actually get good performance.
It doesn’t work to just sandbag on prompts that are not in the literal training dataset, since SGD is usually pretty good at generalizing between data points making AIs good on a full distribution - not specific points of that distribution. Each time an AI puts low probability on a new never-seen-before data point, SGD modifies it not only to put higher probability on this data point, but also all the data points in the same distribution.
Humans may find novel behavior by thinking hard about behaviors, through random exploration, by using many different AIs, and by modifying AI generations - and they might try to fool AIs into thinking they don’t have certain affordances, it’s tricky for AIs to know what behavior humans will never find!