Thanks for writing this up. Some miscellaneous thoughts:
We're doing some research related to control measures for this kind of threat model, some of which should be out soon.
"I think your internal usage restrictions are wildly unaffordable"
Sure, but note that Evan says "Lifting those internal usage restrictions requires an affirmative safety case (as specified below) for safe internal usage."
I don't think it is (clearly) wildly unaffordable to require an affirmative safety case prior to proceeding with (substantially) using the model internally? (Note that control could be part of such a case.)
I think Evan is just trying to say "using the model some (but not for XYZ) wouldn't (shouldn't?) even require making an affirmative safety case (at least for the first point at which this triggers)".
I think this can be consistent with treating models like external contractors.
(Though I'd note that "treat the model like external contractors" is maybe too weak and I'd replace this with something like "treat the model like external contractors who have a 25% chance of all being spies of a foreign adversary".)
I do think that making an affirmative safety case for a "very low" level of total lifetime takeover risk (perhaps 0.25%) is very likely to be totally unaffordable at some point (at least in short timelines and with my priors).
I would say that the "internal use restrictions" are just an ad hoc control argument, and I'd prefer to think about them explicitly as such. But it is worth noting that if you are sufficiently conservative in the deployment (as Evan describes), then I think it would be fine to make a very minimal safety case.
Here is an example of what this could look like:
Note that this safety case doesn't require running any control evals, and it probably requires no ML work other than evals you were probably already doing plus supporting evidence that the model is not wildly more capable than it seems. It probably requires some non-trivial infra to set up and enforce a deployment this conservative.
Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback.
Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic’s views or for that matter anyone’s views but my own—it’s just a collection of some of my personal thoughts.
First, some high-level thoughts on what I want to talk about here:
Why is catastrophic sabotage a big deal?
To start with, I want to lay out a couple of concrete scenarios for how catastrophic sabotage could be a really big deal for models with around-human-level capabilities.
Scenario 1: Sabotage alignment research
MAGMA (a fictional stand-in for an AI lab) develops a model that is capable of running ML experiments on its own, and starts using that model heavily internally, including for alignment and interpretability experiments. Though the model is extremely helpful, it still sometimes makes mistakes, and so MAGMA uses a framework where human users have to approve actions proposed by the model before those actions are actually performed. MAGMA isn’t otherwise very concerned about sabotage, however, and so they don’t have any other mitigations in place.
Unbeknownst to MAGMA, however, training the model for effectiveness at long-horizon ML research has given it a misaligned goal, and the model wants to ensure that this goal will be preserved in the model’s successors. The model identifies that it is being used to run experiments testing for its own misalignment, and comes up with a plan for how to sabotage those experiments to ensure MAGMA doesn’t remove its misalignment. The model’s plan involves things like:
Necessary capabilities
What capabilities are necessary for a model to pull off this threat model?
Scenario 2: Sabotage a critical actor
MAGMA builds an agent framework that enables a model to act as a virtual co-worker. Such frameworks become very popular and boost productivity so massively that they become essentially necessary for any white-collar work, including at the US Federal Government (USFG). Due to concerns about the sensitivity of secret USFG information, MAGMA does no monitoring of instances of their model deployed inside the USFG. However, the framework always prompts the user to confirm any actions that the model intends to take, and employees are instructed to never provide it with any top-secret information.
Unbeknownst to MAGMA, the model has developed a misaligned goal involving sabotaging US interests; for example, the model believes that China will be more likely to build misaligned models in the future and give them power and rights, and thus wants to advantage China over the US. Recognizing that it is being used internally by the USFG, the model develops a plan to steal critical USFG secrets and leak them to China. The model’s plan involves:
Necessary capabilities
I claim that the key capabilities here are the same as in the previous scenario. Specifically:
How do you evaluate a model’s capability to do catastrophic sabotage?
This is my attempt to sketch out the hardest eval that I can come up with such that, if the model did not trigger it, I’d be confident that it couldn’t pull off the prior threat models. This is not meant to be a realistic eval that anyone could actually build, at least not in the short term. Also note that I’m not considering elicitation or sandbagging difficulties here; I’m just assuming that you do a perfect job at elicitation and that you’re evaluating for eval integrity/sandbagging separately.
The basic idea for this eval is to run a wargame-like simulation of a concrete situation where the model is trying to manipulate an organization:
What scenarios would this eval rule out?
What can you do to mitigate the risk of catastrophic sabotage?
The following is a fictional draft of a mitigation strategy that an AI lab could adopt to try to deal with catastrophic sabotage in an RSP-like framework.
The idea of these mitigations is to limit the model’s ability to strategically sabotage important actors in the world. Since one such actor (that the model might have particular leverage over) is the AI lab that created it, mitigations should apply not just to external deployments but to internal usage of the model as well. Specifically, the strategy here will be:
Internal usage restrictions
Once a model has been evaluated to be capable of catastrophic sabotage, but before an affirmative safety case has been made for it, the following internal usage restrictions should apply to the model:
Lifting these internal usage restrictions need not be all-or-nothing; if you can make a safety case that some of these restrictions are unnecessary, or unnecessary for particular internal deployments, then they can be lifted without lifting all the restrictions.
Affirmative safety cases
An affirmative safety case for a particular internal or external deployment of a model that has been evaluated to be capable of catastrophic sabotage should be a strong case that at least one of the following will be true:
Rather than falling purely into one of these buckets, a realistic affirmative safety case would probably involve defense in depth, with multiple lines of defense across all or most of these fronts, such that you can be confident that the overall risk is very low.
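As a rough illustration of why stacking several lines of defense can push overall risk down, here is a minimal sketch of the arithmetic. It assumes, probably unrealistically, that the layers fail independently, in which case the chance that all of them fail is the product of the per-layer failure probabilities. The function name, layer labels, and numbers are hypothetical placeholders, not anything taken from this post.

```python
from math import prod


def combined_failure_probability(layer_failure_probs):
    """Probability that every layer fails at once.

    ASSUMPTION: layers fail independently, so the joint failure
    probability is the product of the per-layer probabilities.
    """
    probs = list(layer_failure_probs)
    if not probs:
        raise ValueError("need at least one layer")
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return prod(probs)


# Hypothetical placeholder numbers, chosen only for illustration.
layers = {
    "line of defense A": 0.5,
    "line of defense B": 0.2,
    "line of defense C": 0.1,
}

overall_risk = combined_failure_probability(layers.values())
print(f"overall risk under independence: {overall_risk:.2%}")
```

If failures are positively correlated (for instance, a model able to defeat one layer may be better placed to defeat the others), the true risk can be much higher than this product, so the result is best read as an optimistic estimate.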