One key implication of the argument in this post is that non-scheming misalignment issues are pretty easy to notice, study, and evaluate ("non-scheming" = "issues from misalignment other than deceptive alignment"). This argument is strongest for non-scheming issues which would occur relatively frequently when using the AI, but issues that occur only rarely (but which are a serious and "high-stakes" problem despite only occurring rarely) should still be reasonably doable to study. (See the discussion in the appendix for more commentary on rare failures being harder.)
Of course, being relatively easy-to-study doesn't mean that humanity will necessarily solve these problems effectively! There are many relatively easy-to-study problems which humanity fails to solve despite putting in some effort. Being easy-to-study doesn't imply easy-to-solve. Further, it might be that the easiest way to temporarily fix these issues will be to patch over non-scheming misalignment issues in ways which make these problems harder to notice without actually solving them. It might even make these issues more dangerous! For instance, training against doing actions which were much later discovered to be egregiously bad might just make the AI more patient, more likely to care about longer time horizons, and less likely to get caught. (Sometimes this concern is called "horizon creep".)
(Also, note that our mechanisms for evaluating non-scheming misalignment can't really be trained against. Both because these tests won't be robust to training and because meta-learning probably breaks the analogies by creating a difference between our testbeds and reality.)
Overall, I think non-scheming misalignment issues cause about 50% of the x-risk from misalignment. (These issues seem more likely than scheming, but are relatively easier to solve.)
Beyond these non-scheming issues being easy to study, I think it's also quite likely that if there are serious non-scheming problems, then by default, we'll notice these issues to some extent. However, carefully studying these issues in well-designed testbeds seems considerably better than just hoping we notice these issues when they come up: specific testbeds help with spotting somewhat subtle issues and also help accurately measure whether countermeasures are actually solving the problem or just making the problem harder to notice.
Thus, when we actually have access to transformatively powerful AI, we'll probably have much more clarity on non-scheming issues. (At least in worlds with relatively continuous takeoff which takes at least a year and assuming that safety motivated people are able to run experiments on these AIs.)
The observation that non-scheming misalignment issue are relatively more obvious and much easier to study is a very important part of my views about how powerful AI will go.
Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate - and if you don't eliminate them, they'll keep happening until some do slip through. I think I take them more seriously than you.
Non-deceptive failures are easy to notice, but they're not necessarily easy to eliminate
I agree, I was trying to note this in my second paragraph, but I guess this was insufficiently clear.
I added the sentence "Being easy-to-study doesn't imply easy-to-solve".
I think I take them more seriously than you.
Seems too hard to tell based on this limited context. I think non-scheming failures are about 50% of the risk and probably should be about 50% of the effort of the AI safety-from-misalignment community. (I can see some arguments for scheming/deceptive alignment being more important toi work on in advance, but it also might be that non-scheming is more tractible and a higher fraction of risk in short timelines, so IDK overall.)
I hear you as saying "If we don't have to worry about teaching the AI to use human values, then why do sandwiching when we can measure capabilities more directly some other way?"
One reason is that with sandwiching, you can more rapidly measure capabilities generalization, because you can do things like collect the test set ahead of time or supervise with a special-purpose AI.
But if you want the best evaluation of a research assistant's capabilities, I agress using it as a research assistant is more reliable.
A separate issue I have here is the assumption that you don't have to worry about teaching an AI to make human-friendly decisions if you're using it as a research assistant, and therefore we can go full speed ahead trying to make general-purpose AI as long as we mean to use it as a research assistant. A big "trust us, we're the good guys" vibe.
Relative to string theory, getting an AI to help use do AI alignment is much more reliant on teaching the AI to give good suggestions in the first place - and not merely "good" in the sense of highly rated, but good in the contains-hard-parts-of-outer-alignment kinda way. So I disagree with the assumption in the first place.
And then I also disagree with the conclusion. Technology proliferates, and there are misuse opportunities even within an organization that's 99% "good guys." But maybe this is a strategic disagreement more than a factual one.
I don't think the evaluations we're describing here are about measuring capabilites. More like measuring whether our oversight (and other aspects) suffice for avoiding misalignment failures.
Measuring capabilities should be easy.
Yeah, I don't know where my reading comprehension skills were that evening, but they weren't with me :P
Oh well, I'll just leave it as is as a monument to bad comments.
Let’s say you want to use AIs that are capable of causing x-risk. You have a plan that describes how to train, validate and deploy AIs. For example, you could use pretraining+RLHF, check that it doesn’t fall for honeypots and never inserts a single backdoor in your coding validation set, and have humans look at the deployment outputs GPT-4 finds the most suspicious. How do you know if this plan allows you to use these potentially dangerous AIs safely?
Thanks to Ryan Greenblatt and Buck Shlegeris who provided the original ideas for this post, and thanks to Jean-Stanislas Denain, Bogdan Ionut Cirstea, Ansh Radhakrishnan, and Charbel-Raphael Segerie for useful feedback on the draft of this post.
Evaluate safety using very good analogies for the real situation?
If you want to evaluate a protocol (a set of techniques you use to train, validate, and deploy a model), then you can see how well it works in a domain where you have held-out validation data that you can use to check if your protocol works. That is, you run your exact protocol on this domain without allowing the protocol access to the held-out validation data. Then, you just assess how well you did with this held-out validation. This sort of approach is sometimes called sandwiching. (Here we are using sandwiching in a broad way that covers the evaluation of aspects of protocols, not just evaluation of oversight.)
One approach for obtaining held-out validation data is to restrict the quality of the labor that goes into the protocol: the Measuring progress in scalable oversight paper uses non-expert humans, and the weak-to-strong generalization paper uses a small model. Then, you can use expert labels to check how well the protocol did.
A big flaw of this approach is that the analogy between [protocol with weak labor, human expert] and [protocol with human expert labor, actual goodness] might be quite poor. In particular, it’s not clear how weak the weak labor should be. And if the analogy is poor, the AI might not generalize and behave in the same ways (even if it is not scheming - more on that in the next section).
For now, the weakness of this sort of analogy doesn’t bite very hard because there is an even bigger difference between our current experiments and real dangerous AI use cases: we don’t have (x-risk) dangerous AIs yet. Thus, the actual analogy is between [weak AIs, protocol with weak labor, human expert] and [strong future AIs, protocol with human expert labor, actual goodness]. But when we actually have very powerful AIs, it would be great to have much better analogies, such that our evaluations actually tell us how well our protocol works. And there is a way to build great analogies!
The string theory analogy
Here is how you could evaluate how good a protocol is at making AI do alignment research:
You can run this experiment on the actual AI you were about to fine-tune to help you do alignment research, using the same protocol! We call it task sandwiching, when you use your most powerful model in very analogous domains.
The key difference with regular sandwiching is that the labor that goes into the protocol is at the same difficulty level as the labor we want to use in real use cases - but we carefully chose domains where we have even higher quality labor (because humans have spent an enormous amount of time getting better at those domains).
If the protocol works in many analogous domains, then it should work for alignment research - as long as there isn’t a difference between these domains that makes the AI-protocol interaction go differently. And it seems likely that you can build sufficiently analogous tests, such that there is almost no difference - except for two exceptions:
Noticing rare failures
For how long do you need to run the analogy before you can notice that the AI very rarely takes a single action that is catastrophic on its own? Even if the setup was good enough to safely catch and notice rare failures when they happen, you still have to run it long enough to observe one such failure. Therefore, if you run analogies naively, you would need to run the experiment for longer than the actual deployment! We believe that you can do much better than that by upsampling the kind of situations where you expect rare failures (e.g. tasks where the AI is given enough affordances to cause a catastrophe). The main difficulty is maintaining the analogy, and catching rare failures, as we explain in the appendix.
Scheming AIs can game these sandwiching experiments
Here is a strategy that an AI scheming to take over should follow if it can tell if it is in a sandwiching experiment or not:
Therefore, while task sandwiching is great way to measure E[utility|no scheming] (where utility is what you care about, e.g. a combination of usefulness and safety), it doesn’t teach us much about E[utility|scheming] or P(scheming)[1][2].
Evaluating how robust protocols are against scheming AIs
While our previous evaluation could be easily gamed by scheming AIs, we think there is another more conservative evaluation which is much harder to game. We can check how bad it would be if AIs were scheming by making AIs execute on the best strategy for causing harm during the evaluation of the protocol - and checking if the protocol is robust to this bad behavior. This is called a meta-level adversarial evaluation. We describe what meta level adversarial evaluations could look like in this section.
Malign initializations
To make an AI try to behave badly (on the evaluation domain where you have ground-truth labels), you can train it to. In practice, here is what you can do:
The sleeper agents paper is a concrete example of this evaluation strategy: they wanted to check if training a model on harmless data on a slightly restricted distribution (prompts with “the year is 2023” in them) was enough to make it always harmless (even if the year is changed to 2024), so they used the basic red-team strategy of training a model to be harmful in 2024 and harmless in 2023, and found that it was still harmful in 2024 after training. (Here, the held-out validation is the ability to construct prompts with “the year is 2024” in them.)
We want to run these evaluations directly on powerful AIs which might be scheming. So, if scheming models can game this kind of evaluation that would cause issues. Can they? Maybe: sure, the model will be forced to behave badly at the start of the experiment[3], but in the sleeper agent case, it might be able to notice that it is now trained on harmless data only, which means the red team training process is over, and decide to be harmless on all examples that are clearly part of the evaluation in order to make the protocol look good.
Another problem with this methodology is that it is sensitive to training inductive biases: training sometimes has different effects on naturally malign models and models trained to be malign by the red team. For example, let’s consider a scenario where an AI lab trained a new model using the training method T, suspects that the resulting model is maybe malign, and therefore wants to find a strategy that makes the model less malign. As a testbed, it asks a red team to insert a backdoor, and see if a given strategy removes it (like in the sleeper agent paper). Here is a strategy to remove the backdoor: reinitialize the weights, and retrain the model using T. This strategy is “perfect”: it removes all the malign behaviors inserted by the red team. But if the lab was concerned that T produced a maybe malign model in the first place, then this strategy obviously didn’t help. This exact strategy can be put aside by limiting the training compute that can be used by the strategy, but this kind of concern remains for all strategies which add some noise to the weights[4] (if the strategy is “add a tiny amount of noise and then fine-tune on the data mix used in T”, then this might remove red-team backdoors for the same reasons retraining from scratch removed backdoors).
The main problem here is training, which has some hard to foresee effects on scheming models: we’d prefer to be conservative about the effects of inductive biases and about the model’s ability to game malign initialization evals.
Control evaluations (Black-box)
If we limit ourselves to validation and deployment safety measures that don’t change the model weights, then we avoid the two problems with malign initializations:
This kind of black-box evaluation is called a control evaluation.
But how do we evaluate protocols that involve training the model (like most protocols do)?
Using this evaluation strategy, it’s as if you converted training into validation. For more details, see Appendix C of the AI control paper.
We think control evaluations are the current most reliable method to get an estimate of how good a protocol would be if models were scheming. But it still has some limitations that are described here. For example, you need the red team to be competitive with scheming AIs (which we think is likely for early transformative AIs). There are also problems with replacing training with validation in cases where training is domain-specific: if you test how well debate works for alignment research by training debaters with RL to be convincing on string theory questions, a scheming model might choose to make all the convincing (and right) arguments on string theory, but avoid exploring the most convincing (and right) alignment research arguments[5]. This issue is called exploration hacking. We think exploration hacking can probably be modeled in some way which is conservative about the effects of training (e.g. reasonable but conservative about inductive biases), but we don’t currently have a precise proposal for the evaluation. For now, you can imagine just using malign initializations in cases where exploration hacking is a concern.
Dangerous capabilities evaluations are a special case of control evaluations. For example, ARA evaluations are the conservative special case where we act as though AIs will be freely deployed in a SOTA scaffold without any restrictions or monitoring.
Aggregating non-adversarial and adversarial evaluations
Adversarial evaluations like control evaluations are pessimistic: they assume that the model is scheming, which looks far from certain if early transformative AIs are scaled-up Transformers. Therefore, in order to produce an overall assessment of how good a protocol is, you probably want to aggregate[6]:
Theoretically-sound arguments about protocols and AIs
Control evaluations and malign initializations can help evaluate E[utility | scheming], but even if we achieve extremely high levels of safety according to these evaluations there is enough uncertainty that we can’t keep risk that low for powerful AIs (e.g. can’t get P(doom) under 1%). To get stronger guarantees, you probably need solid arguments. These arguments could look like a proof (or heuristic arguments) that the system is safe, or a detailed scientific understanding of how the AI works that excludes hidden scheming.
Additionally, control evaluations are quite conservative about the internal structure of models (e.g. using inductive biases or interpretability on the internals of models) and it might be that we can find good methods which leverage that structure but in a way which control evaluations can’t assess. In this case, theoretically sound arguments or other approaches might be needed.
Other non-adversarial evaluations
If you want to study the threat from potentially scheming AIs in ways, there are other ways to do so than using adversarial evaluations. Here are some ways you could use non-adversarial evaluations give you some information about E[U] (not only E[U|not scheming]) and P(scheming):
Overview of evaluation methods
Appendix: Analogies vs high-stakes reward hacking
The security measures used to make it safe to run analogies also prevent hacking the reward channel / taking over, and therefore makes it hard to observe natural high-stakes reward hacking and dangerous power-seeking behavior. For example, if an AI realizes it is in an analogy, it won’t try to take over the lab to get maximum reward on that episode - since it will expect to fail. Therefore, you can’t usually observe high-stakes reward hacking and power-seeking tendencies in usual analogy experiments.
To fix this problem, we can adjust the setup by offering ways to bypass the usual reward channel and gain power in dangerous ways, as in some model organism experiments. However, we still need to convince the AI that it can gain power/reward hack without getting punished. For example, to observe reward hacking, we could let some of the string-theory research happen within a secured virtual machine that we can observe but never influence, and have a clear scoring system from within the virtual machine, such that there it is clear to a reward hacking AI that it would actually get higher reward if it hacked some parts of the virtual machine. Ideally, this would be the only way this protocol departs from the real protocol, such that the conclusions from this experiment transfer well to real AI use.
But even with those adaptations, AIs might still fear to get punished for reward hacks, and go for them less often than in deployments. This means analogies might provide overoptimistic results. Therefore, we should probably use a methodology closer to control evaluations to know if an AI would be able to cause a catastrophe if it ever tried to reward hack / suddenly gain power.
Note that these limitations also apply to all non-adversarial evaluations, though they apply less to those who don’t attempt to have realistic tasks.
Here, “scheming” refers to the event where early transformative AIs are (often) scheming - it’s not about the AI being tested.
There are non-adversarial evaluations and sandwiching methods that are not about E[utility|no scheming], see section on other non-adversarial evaluations.
If supervised training works, which is the case absent gradient hacking.
For example, the “safety training” from the sleeper agent paper adds some noise to the weights: training on a dataset where the model is already perfect adds some noise to the weight due to the stochasticity of SGD.
Other failures of debate like obfuscated arguments or just convincing the judge of wrong things are captured by control evaluations as long as the red team is as good at building models that exploit these loopholes on string theory as the scheming model is at exploiting them on alignment.
E[utility] = E[utility|scheming]P(scheming) + E[utility|no scheming]P(no scheming) if we consider that we either have “scheming” or “no scheming”. If your utility is dominated by doom, you can also write P(doom) = P(doom|scheming)P(scheming) + P(doom|no scheming)P(no scheming). In practice, explicitly computing P(doom) or E[utility] might not be the most effective way to choose and coordinate on the right course of action, and it might be best to reason about the three quantities separately.
We could also evaluate some internal-based techniques if we ever caught a model scheming. For instance, we could check that coup probes work on instances where the model was actually caught trying to do an egregiously bad action.
We can know some models are non-scheming by using weak enough models.