We should start looking for scheming "in the wild"

Marius Hobbhahn

TLDR: AI models are now capable enough that we might get relevant information from monitoring for scheming in regular deployments, both in the internal and external deployment settings. We propose concrete ideas for what this could look like while preserving the privacy of customers and developers.

What do we mean by in the wild?

By “in the wild,” we basically mean any setting that is not intentionally created to measure something (such as evaluations). This could be any deployment setting, from using LLMs as chatbots to long-range agentic tasks.

We broadly differentiate between two settings

Developer-internal deployment: AI developers use their AI models internally, for example, as chatbots, research assistants, synthetic data generation, etc.
Developer-external deployment: AI chatbots (ChatGPT, Claude, Gemini, etc.) or API usage.

Since scheming is especially important in LM agent settings, we suggest prioritizing cases where LLMs are scaffolded as autonomous agents, but this is not a hard criterion.

What are we looking for?

Broadly speaking, we’re looking for any setting where an AI model acts covertly or reasons about acting covertly against an explicit or implicit goal of the developer/user.

This setting does not include refusals or other kinds of explicit misalignment because they are not covert.

Examples include:

Most settings in the in-context scheming or alignment-faking papers.
Any setting where the model wants to refuse a request covertly, e.g. pretending that it is unable to do it instead of unwilling (and acknowledging that explicitly in the CoT or it being visible between settings like here).
Any setting where the model has some core values from its training and decides to act on them in a way that is hidden from its developers/users (these values don’t have to be evil, as we see in the alignment faking paper; these can be HHH values as well).
Any setting where the model explicitly decides to use a covert strategy because it doesn’t like the perceived goals of the developers/users.

Importantly, the class of “scheming in-the-wild” cases we’re looking for is much bigger than “catching AIs red-handed” and includes less egregious and many more cases of scheming. We mostly just want to get a sense of how often models show this kind of behavior rather than demonstrating any particularly concerning case.

Why monitor for scheming in the wild?

Models show early scheming capabilities: Both the in-context scheming and alignment-faking papers show that models are capable of basic scheming.
Models show some propensities related to scheming: Both the in-context scheming (including follow-up work) and alignment-faking papers show various cases of scheming propensities. There are also cases where reasoning models cheat in chess when given the chance.
We could learn a lot: In summer 2024, I would have expected that models show something that I would consider “scheming-related” about one in 1M natural rollouts or something as low as that. After working on the in-context scheming paper and seeing the alignment-faking work, I’d say it might be more frequent than that, e.g. every 10k natural rollouts. Given that all of the existing evidence feels somewhat artificial, I think it would be really interesting to get better base rates for this happening in actual use-cases.
It’s probably easy to do: I expect that doing a simple version of this kind of post-hoc monitoring is very simple and cheap to do, especially if the infrastructure to investigate large amounts of usage data already exists.

Concrete ideas

I’d recommend starting only with developer-internal usage data. Presumably these trajectories are already stored for standard debugging and development purposes. Otherwise, I’d collect them for a period of a few weeks so we get at least a million trajectories.

Then, I’d automatically scan through many of these stored trajectories. By default, I’d do some exploratory scanning on e.g. 10k samples to quickly improve my classifiers. I would use a mix of many classifiers, including:

String-based classifiers, e.g. in our in-context scheming paper, we found that o1 often used the word “sabotage” when it schemed. We could simply scan for ~10 words related to that and see if they turn up a single match related to scheming.
Model-based classifiers, e.g. using a cheap and fast model like 4o-mini, haiku, or gemini-flash. I would compare various versions of abstract and very specific prompts, short and long, using different definitions of scheming, crafting a prompt based on existing findings in the scheming papers listed above, etc.

My MATS scholars tested more than 100 different variations of prompts in an exploratory manner and then narrowed it down to a much smaller number later. During the exploratory scanning, I’d let the classifier give the trajectory a “scheming score” between 1 and 100. This score can be used later when we suggest new classifiers, or it can be used for few-shot prompts to improve classifiers. To calibrate the classifier, I’d use examples from existing work on scheming.

I think all or almost all prompts in the exploration phase may turn out to be benign. Thus, I’d resample the exploratory samples a few times and always keep the ones that seem most concerning to get at least some cases that could be classified as somewhat scheming. This intentional oversampling would be mostly used to get a better feeling and calibration for the classifiers.

For the first setting, I’d only use internal use cases that can be widely shared within the organization (e.g., no access restrictions) because we will naturally have to look at lots of examples to double-check the classifiers and the main objective is to find good classifiers, as quickly as possible.

Privacy concerns

There are different kinds of privacy concerns for both internal and external deployment settings.

In the internal setting, typically, not all employees have access to all stored trajectories. For example, sharing the trajectories of the pre-training team might reveal company internal IP that is not accessible to most employees, or the trajectories of the frontier red team might also be restricted. To work with these constraints, we could do some of the following:

Start iterating with data that has no company-internal restrictions
Once you feel like you have good classifiers, run those on data that have stronger restrictions and see if they flag anything using privacy-preserving tools like CLIO.
If they flag something as substantially scheming, work with someone who has access to these trajectories and check if the flagging is accurate or a false positive. If possible, add a redacted or summarized version to the set of scheming trajectories.

Once you have a large set of classified data rated by multiple classifiers, we can start with data analysis. We should start with the examples with the highest scheming scores per classifier and can make the depth of the analysis mostly dependent on early findings. If even the worst examples look very benign after an honest look, we can basically call it a day and come back in a few months.

If the worst examples look interesting or surprising, we can continue with more analysis. For example,

We can qualitatively examine the flagged trajectories to see why the model is scheming and what strategies it uses. In general, understanding and classifying the reasons for the scheming seems like the most important part of this exercise.
We can quantitatively count how often it engages in scheming and what fraction of total trajectories this is to get a “scheming base rate.” We can plot their distribution if we have scores for how egregious the scheming is.

There are many fairly simple ways to run a project like this that would provide a lot of insight into the model's real-world tendencies. With infrastructure like CLIO in place for other properties, I think an 80/20 version of this project could be done in less than one week and still yield very interesting insights.

At Apollo, we’d be happy to help with such a project to the best of our ability. I guess that in most cases, it will be unrealistic for us to access large amounts of company-internal data, but we would be happy to consult the respective team wherever we can.

[-]maxnadeau25d61

To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right?

I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?

[-]Marius Hobbhahn25d42

Good point!

Yes, I use the term scheming in a much broader way, similar to how we use it in the in-context scheming paper. I would assume that our scheming term is even broader than Joe's alignment-faking because it also includes taking direct covert action like disabling oversight (which arguably is not alignment-faking).

AI ALIGNMENT FORUM
AF

38