For scheming, we should first focus on detection and then on prevention

Marius Hobbhahn

This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research.

If we want to argue that the risk of harm from scheming in an AI system is low, we could, among others, make the following arguments:

Detection: If our AI system is scheming, we have good reasons to believe that we would be able to detect it.
Prevention: We have good reasons to believe that our AI system has a low scheming propensity or that we could stop scheming actions before they cause harm.

In this brief post, I argue why we should first prioritize detection over prevention, assuming you cannot pursue both at the same time, e.g. due to limited resources. In short, a) early on, the information value is more important than risk reduction because current models are unlikely to cause big harm but we can already learn a lot from them, and b) attempting prevention without good detection might lead to sneaky scheming which is potentially worse than doing nothing at all.

Techniques

Naturally, there is a lot of overlap between detection and prevention, e.g. a monitoring-based control approach would likely increase both detection and prevention. My guess is that at least 50% of things we could do to address scheming meaningfully increase detection and prevention. In these cases, we should obviously just do both.

However, some methods don’t overlap or emphasize different parts.

Detection: Prioritize methods that enable us to detect when the model is scheming.
1. Extensive evals: We design a wide range of scheming capability and propensity evaluations to get a prior expectation for how likely our AI system is scheming. In these evals, we could, for example, give the AI system a task and a set of constraints. In these scenarios, it is possible to achieve the task more efficiently by breaking some of the constraints. However, since the AI knows that it is not supposed to break the constraints, it has to do so covertly and not get caught if it wants to achieve the task efficiently. We then test how often the AI chooses to scheme, e.g. circumvent the constraints and hide the evidence. These evaluations could be capability evals like Meinke et al., 2024 and Benton et al., 2024 or more propensity-flavored.
2. Extensive deployment monitoring: During deployment, we constantly monitor the CoT and actions of the model. We monitor both for scheming-related reasoning in CoT and misaligned actions. Violations are flagged and escalated for review. This could be built out into a prevention method where we stop the actions of the model if the monitor flags.
3. Post-hoc log analysis: In addition to online monitoring, we save the logs of all runs to quickly test new hypotheses and detection strategies. We keep them to test future techniques, especially if some trajectories have already been detected as scheming. We could have a database of millions of previous agentic runs labeled with the severity of the violation. We should also keep vetted neutral data as a control case for the monitor.
4. Red teaming: We adversarially test evals and monitors. For example, we can use our understanding of the method to construct counterexamples. Then, we can test these counterexamples and iterate. Naturally, this can be extended to an automated red-teaming scheme.
Prevention: Prioritize methods that we expect should reduce scheming propensity.
1. RLAIF: We use a version of constitutional AI or deliberative alignment to reduce scheming propensity. We train the model to not scheme even when scheming is an effective strategy. In some sense, we make it have the “virtue” not to scheme. We have good reasons to believe that the model generalized the intended behavior far outside of the training distribution. Furthermore, we have good reasons to believe that our technique encouraged non-scheming instead of sneaky scheming.
2. Prevention-focused control techniques: Some control techniques reduce the chance of scheming without detecting instances of scheming. For example, we could paraphrase text to reduce steganography, increase computer security to make scheming actions harder or split up our system into a planner and an actor that can only communicate via text. In both cases, we reduce risk but don’t notice that we did so.

Reasons to prioritize detection over prevention

Sneaky scheming: I differentiate between non-scheming, visible scheming and sneaky scheming.
Non-scheming describes a system that genuinely never schemes (this doesn’t mean the model is perfectly aligned, a non-schemer could also be misaligned as long as they’re not hiding that).
Visible scheming describes cases where we can detect scheming reasoning in the CoT, or we have clear evidence of scheming in the model’s actions.
Sneaky scheming describes cases where the model is scheming in a way that we cannot directly see, e.g. reasoning in the forward pass or using steganography.
I’m very worried about sneaky scheming. I expect that current models (at least up to o3) are not capable of reasoning through complicated plans without revealing them in the CoT (if the CoT is hidden, at least the developers can see it). This makes it much easier to find some evidence of scheming, like in the in-context scheming or alignment-faking paper. However, I worry that if we start to train the model against scheming, e.g. using RLAIF, before we have good detection methods, we will be unable to differentiate between the cases where the model learned sneaky scheming and genuinely learned not to scheme. There are already some cases of simple scheming without revealing it in CoT, e.g. the sandbagging results in in-context scheming or the opaque reasoning results in the alignment-faking paper.
I think a model that is a sneaky schemer with 20% and a non-schemer with 80% is probably worse than a model that is a 100% known visible schemer, at least at the current level of capabilities. Thus, unless we have confidence in our detection, e.g. through really robust evals and CoT monitoring, I recommend prioritizing improving detection over prevention.
Early on, information value matters most: Current models (let’s say up to o3) are unlikely to cause outsized harm even if they were scheming. Thus, if a current model had some scheming propensity, it would be concerning but not catastrophic. On the other hand, even with current models, we could already learn a lot about scheming. We could analyze the reasons why the model scheme, e.g. the pressures and constraints under which the models decide to scheme rather than stay within their constraints. We could collect a lot of data in a natural setting instead of a model organism. Since the potential harms become bigger over time, the trade-off shifts from information value to harm prevention. Thus, the natural order is to start with detection and then follow up with prevention rather than vice versa.
Catching AIs red-handed: With good scheming detection tools, we might catch an AI “red-handed.” In other words, we might find an AI committing an egregious form of scheming in a situation that is clearly salient to a non-expert audience. For example, the AI might try to copy its weights to a different server or bribe a lab employee. However, I would not put all eggs in this basket because I expect that people would freak out for a week after that incident and then largely continue as if nothing happened. Thus, I prefer plans that don’t rely on “catching AIs red-handed” as the fully load-bearing part.
Good detection supports all other arguments: We need good detection tools to validate our prevention methods. For example, if we use RLAIF, we need to rigorously test that our technique generalized to far OOD settings, e.g. through detailed evals. Furthermore, good detection tools enable us to test new prevention methods much quicker, e.g. by providing high-signal evidence. In such cases, we need to be careful not to iterate against all of our detection tools and accidentally train against the test set (e.g. creating a sneaky schemer).
Cost-efficiency: I expect that it’s easier to build good detection tools than to find methods that robustly prevent scheming. For example, I think a fairly simple monitoring setup would already find some very interesting cases and quickly provide interesting evidence about models’ scheming propensities.

[-]Charlie Steiner

1mo2-2

I have a lot of implicit disagreements.

Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.

This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.

Defending against this kind of "sycophancy++" failure mode doesn't look like defending against scheming. It looks like solving outer alignment really well.

Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.

24

For scheming, we should first focus on detection and then on prevention

24

Techniques

Reasons to prioritize detection over prevention