In a previous post, I discussed mitigating risks from scheming by studying examples of actual scheming AIs.[1] In this post, I'll discuss an alternative approach: directly training (or instructing) an AI to behave how we think a naturally scheming AI might behave (at least in some ways). Then, we can study the resulting models. For instance, we could verify that a detection method discovers that the model is scheming or that a removal method removes the inserted behavior. The Sleeper Agents paper is an example of directly training in behavior (roughly) similar to a schemer.
This approach eliminates one of the main difficulties with studying actual scheming AIs: the fact that it is difficult to catch (and re-catch) them. However, trained-in scheming behavior can't be used to build...
tl;dr: A heavily-monitored internally deployed AI with sensitive access (e.g. to model weights) could bypass safety controls by asking a weakly-monitored external API version of itself (or another AI) what to do. It could then use its privileged access to execute instructions that the external API generated under less monitoring.
Once AI companies build powerful AIs, they may:
In this situation, an internal AI may build an agent scaffold that makes calls to the external API and uses the internal sensitive privileges....
Anthropic has now clarified this in their system card for Claude Haiku 4.5:
Thanks to them for doing this!
See also Sam Bowman's tweet thread about this.
tl;dr: We fine-tune or few-shot LLMs to use reasoning encoded with simple ciphers (e.g. base64, rot13, putting a dot between each letter) to solve math problems. We find that these models only get an uplift from the reasoning (over directly answering) for very simple ciphers, and get no uplift for intermediate-difficulty ciphers that they can translate to English. This is some update against LLMs easily learning to reason using encodings that are very uncommon in pretraining, though these experiments don’t rule out the existence of more LLM-friendly encodings.
Research done as part of the Anthropic Fellows Program. Compute was generously funded by Scale and the Anthropic Fellows program.
We teach LLMs to use one particular cipher, such as:
- Speak using what looks like code words (e.g. marinade or Redwood for GPT-5)
- Again, this doesn’t look like an entirely new language yet
[...]
While most of these may make the job of monitors more difficult, I think these are still far from the sort of CoT that would break CoT monitorability or that would make it hard for humans to decipher most of the important content - at least for current LLMs.
I'm confused by this. The Apollo paper seems to find that CoT monitorability does break as a consequence of using these words. Section 4 in particular seems to i...
Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), and that this transition will break most alignment approaches. And this isn't getting enough focus from the field.