I have a lot of implicit disagreements.
Non-scheming misalignment is nontrivial to prevent and can have large, bad (and weird) effects.
This is because ethics isn't science, it doesn't "hit back" when the AI is wrong. So an AI can honestly mix up human systematic flaws with things humans value, in a way that will get approval from humans precisely because it exploits those systematic flaws.
Defending against this kind of "sycophancy++" failure mode doesn't look like defending against scheming. It looks like solving outer alignment really well.
Having good outer alignment incidentally prevents a lot of scheming. But the reverse isn't nearly as true.
This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research.
If we want to argue that the risk of harm from scheming in an AI system is low, we could, among others, make the following arguments:
In this brief post, I argue why we should first prioritize detection over prevention, assuming you cannot pursue both at the same time, e.g. due to limited resources. In short, a) early on, the information value is more important than risk reduction because current models are unlikely to cause big harm but we can already learn a lot from them, and b) attempting prevention without good detection might lead to sneaky scheming which is potentially worse than doing nothing at all.
Techniques
Naturally, there is a lot of overlap between detection and prevention, e.g. a monitoring-based control approach would likely increase both detection and prevention. My guess is that at least 50% of things we could do to address scheming meaningfully increase detection and prevention. In these cases, we should obviously just do both.
However, some methods don’t overlap or emphasize different parts.
Reasons to prioritize detection over prevention
Non-scheming describes a system that genuinely never schemes (this doesn’t mean the model is perfectly aligned, a non-schemer could also be misaligned as long as they’re not hiding that).
Visible scheming describes cases where we can detect scheming reasoning in the CoT, or we have clear evidence of scheming in the model’s actions.
Sneaky scheming describes cases where the model is scheming in a way that we cannot directly see, e.g. reasoning in the forward pass or using steganography.
I’m very worried about sneaky scheming. I expect that current models (at least up to o3) are not capable of reasoning through complicated plans without revealing them in the CoT (if the CoT is hidden, at least the developers can see it). This makes it much easier to find some evidence of scheming, like in the in-context scheming or alignment-faking paper. However, I worry that if we start to train the model against scheming, e.g. using RLAIF, before we have good detection methods, we will be unable to differentiate between the cases where the model learned sneaky scheming and genuinely learned not to scheme. There are already some cases of simple scheming without revealing it in CoT, e.g. the sandbagging results in in-context scheming or the opaque reasoning results in the alignment-faking paper.
I think a model that is a sneaky schemer with 20% and a non-schemer with 80% is probably worse than a model that is a 100% known visible schemer, at least at the current level of capabilities. Thus, unless we have confidence in our detection, e.g. through really robust evals and CoT monitoring, I recommend prioritizing improving detection over prevention.