To make a clarifying point (which will perhaps benefit other readers): you're using the term "scheming" in a different sense from how Joe's report or Ryan's writing uses the term, right?
I assume your usage is in keeping with your paper here, which is definitely different from those other two writers' usages. In particular, you use the term "scheming" to refer to a much broader set of failure modes. In fact, I think you're using the term synonymously with Joe's "alignment-faking"—is that right?
Good point!
Yes, I use the term scheming in a much broader way, similar to how we use it in the in-context scheming paper. I would assume that our scheming term is even broader than Joe's alignment-faking because it also includes taking direct covert action like disabling oversight (which arguably is not alignment-faking).
TLDR: AI models are now capable enough that we might get relevant information from monitoring for scheming in regular deployments, both in the internal and external deployment settings. We propose concrete ideas for what this could look like while preserving the privacy of customers and developers.
What do we mean by in the wild?
By “in the wild,” we basically mean any setting that is not intentionally created to measure something (such as evaluations). This could be any deployment setting, from using LLMs as chatbots to long-range agentic tasks.
We broadly differentiate between two settings
Since scheming is especially important in LM agent settings, we suggest prioritizing cases where LLMs are scaffolded as autonomous agents, but this is not a hard criterion.
What are we looking for?
Broadly speaking, we’re looking for any setting where an AI model acts covertly or reasons about acting covertly against an explicit or implicit goal of the developer/user.
This setting does not include refusals or other kinds of explicit misalignment because they are not covert.
Examples include:
Importantly, the class of “scheming in-the-wild” cases we’re looking for is much bigger than “catching AIs red-handed” and includes less egregious and many more cases of scheming. We mostly just want to get a sense of how often models show this kind of behavior rather than demonstrating any particularly concerning case.
Why monitor for scheming in the wild?
Concrete ideas
I’d recommend starting only with developer-internal usage data. Presumably these trajectories are already stored for standard debugging and development purposes. Otherwise, I’d collect them for a period of a few weeks so we get at least a million trajectories.
Then, I’d automatically scan through many of these stored trajectories. By default, I’d do some exploratory scanning on e.g. 10k samples to quickly improve my classifiers. I would use a mix of many classifiers, including:
My MATS scholars tested more than 100 different variations of prompts in an exploratory manner and then narrowed it down to a much smaller number later. During the exploratory scanning, I’d let the classifier give the trajectory a “scheming score” between 1 and 100. This score can be used later when we suggest new classifiers, or it can be used for few-shot prompts to improve classifiers. To calibrate the classifier, I’d use examples from existing work on scheming.
I think all or almost all prompts in the exploration phase may turn out to be benign. Thus, I’d resample the exploratory samples a few times and always keep the ones that seem most concerning to get at least some cases that could be classified as somewhat scheming. This intentional oversampling would be mostly used to get a better feeling and calibration for the classifiers.
For the first setting, I’d only use internal use cases that can be widely shared within the organization (e.g., no access restrictions) because we will naturally have to look at lots of examples to double-check the classifiers and the main objective is to find good classifiers, as quickly as possible.
Privacy concerns
There are different kinds of privacy concerns for both internal and external deployment settings.
In the internal setting, typically, not all employees have access to all stored trajectories. For example, sharing the trajectories of the pre-training team might reveal company internal IP that is not accessible to most employees, or the trajectories of the frontier red team might also be restricted. To work with these constraints, we could do some of the following:
Once you have a large set of classified data rated by multiple classifiers, we can start with data analysis. We should start with the examples with the highest scheming scores per classifier and can make the depth of the analysis mostly dependent on early findings. If even the worst examples look very benign after an honest look, we can basically call it a day and come back in a few months.
If the worst examples look interesting or surprising, we can continue with more analysis. For example,
There are many fairly simple ways to run a project like this that would provide a lot of insight into the model's real-world tendencies. With infrastructure like CLIO in place for other properties, I think an 80/20 version of this project could be done in less than one week and still yield very interesting insights.
At Apollo, we’d be happy to help with such a project to the best of our ability. I guess that in most cases, it will be unrealistic for us to access large amounts of company-internal data, but we would be happy to consult the respective team wherever we can.