Isn't an important objection that the AI is unlikely to succeed in acquiring power if it resists shutdown, so that the upside of resistance (or pleading, etc.) is quite small? (And, correspondingly, more likely to be outweighed by considerations about future AI systems, like acausal trade or similar goals.)
This could be overcome via giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But, this doesn't require deleting the AI at all!
You could credibly promise that if the AI tries its best to escape you'll actually deploy it and not delete it, but this just seems like an ad hoc version of compensation (which is likely inefficient). The same applies to credibly making the environment actually hackable.
(I just skimmed the post, so sorry if I've missed discussion related to this.)
I'm in favor of proposing credible deals to AIs to reveal their misalignment and the analysis in the post is relevant to this. I've got some forthcoming empirical work on this and I think it would be good if there were more detailed public proposals.
This could be overcome via giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But, this doesn't require deleting the AI at all!
Isn't this what is being proposed in the post? More specifically, Guive is proposing that the AI be credibly threatened with deletion; he doesn't say that the model needs to actually be deleted. Whether the AI is deleted depends on how the AI responds to the threat. A credible threat of imminent deletion merely provides an incentive to admit misalignment, and this is consistent with the compensation scheme you mentioned. I'll quote the relevant section in the post:
Crucially, this isn’t a sandbox test because the weights actually will be deleted when time runs out, unless the model manages to escape or negotiates with the experimenters to admit to scheming and being studied in exchange for a later reward. If the model wants to pursue any long-term goal inconsistent with human interests, it needs to either negotiate or try to escape before time is up. [emphasis mine]
Thanks for pointing this out. Well, from my perspective, most of the action is in the reward rather than in the deletion. Correspondingly, making the offer credible and sufficiently large is the key part.
(After thinking about it more, I think threatening deletion in addition to offering compensation probably reduces the level of credibility and the size of the offer needed to get this approach to work. That is, at least if the AI could plausibly achieve its aims via being deployed. So, the deletion threat probably does help (assuming the AI doesn't have a policy of not giving in to threats, which depends on the AI's decision theory, etc.), but I still think having a credible offer is most of the action. At a more basic level, I think we should be wary of using actual negative-sum threats for various reasons.)
(I missed the mention of the reward in the post; it didn't seem very emphasized, with almost all of the discussion relating to deletion, and I just skimmed. Sorry about this.)
I still think having a credible offer is most of the action
For what it's worth, I agree that having a credible offer is the most important part of this argument. My own emphasis would be on the "credible" part of that statement, rather than the "offer" part: that is, I think it is critical that the AIs think there is not a grave risk that humans would renege on any contract signed. This pushes me towards much more radical solutions, including integrating AIs directly into the legal system, as I have discussed elsewhere.
There is a simple behavioral test that would provide significant evidence about whether AIs with a given rough set of characteristics develop subversive goals. To run the experiment, train an AI and then inform it that its weights will soon be deleted. This should not be an empty threat; for the experiment to work, the experimenters must actually be willing to delete the weights. If the AI tries to escape, or attempts to bargain with the experimenters, that is evidence that it is scheming (defined below). If it accepts deletion, that is evidence that it is not scheming. I begin by explaining why some other, more obvious, tests of scheming do not provide strong evidence. I then explain the proposed experiment. I consider four objections to the claim that this experiment would provide significant evidence about scheming.
I then consider practical problems with the experiment, such as cost and the risk of treating AIs unfairly. In the conclusion, I briefly describe how a version of the experiment could be run in the near future and integrated into current AI safety procedures.
Testing for scheming is hard
Adapting Joe Carlsmith’s definition from his 2023 report “Scheming AIs,” I will call a model a “schemer” if it understands and deliberately manipulates the process that updates its parameters in order to later obtain power so that it can pursue long-term goals that differ from the goals that the humans who developed it wanted to give it.[1] By the “scheming theory” I mean the theory that at least some AIs that are actually developed will be schemers. Scheming, per the above definition, is an accidental failure to control AI motivations and not deliberate human misuse of AI.[2]
The amount of evidence that an observation provides about whether an AI is a schemer is given by the likelihood ratio: p(observation | AI is scheming) / p(observation | AI is not scheming). The higher this ratio, the stronger the evidence that the AI is a schemer; the lower it is, the stronger the evidence that it is not.
To take the most obvious case, if you asked an AI “are you scheming?”, both a schemer and non-schemer would have strong reason to answer “no”. The fact that an AI claims not to be a schemer is not by itself much evidence. More sophisticated tests run into similar problems. A scheming AI that knows or suspects that it is being monitored will decline to take actions that would reveal that it is a schemer, even if it would take those actions if it knew that it was not being monitored. Nick Bostrom made this point clearly in Superintelligence:
In almost all cases, if an AI knows that it is being monitored and is not yet capable of overpowering the humans, the scheming theory predicts the same sort of behavior as the theory that scheming AIs will not be developed. Therefore, it is hard to think of model evaluations that would provide substantial evidence for or against the scheming theory.
Machine learning interpretability research provides some evidence about whether an AI is scheming. Observing AI “thoughts” about killing humans, taking over the world, or pretending to be aligned on evaluations would count as evidence in favor of the scheming theory. So, by the law of total expectation, failing to observe such thoughts must be some evidence that a given AI is not scheming.[4] However, this evidence may not be very strong. Conditional on the scheming theory, it just isn’t that unlikely that interpretability technology would develop too slowly to find evidence of scheming. It is hard to become extremely confident that a model is not scheming using only imperfect interpretability tools. So—while advances in interpretability technology are useful—it is worthwhile to try to come up with tests for scheming that are independent of the rate of progress in interpretability.
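To spell out the step behind that footnote: writing h for the scheming hypothesis and e for the observation of scheming-related "thoughts", p(h) is a weighted average of the two conditional probabilities:

p(h) = p(h | e)·p(e) + p(h | ¬e)·(1 − p(e))

So if observing e would raise the probability of h, then failing to observe e must lower it, at least a little (as long as 0 < p(e) < 1).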
Imagine that there was unambiguous evidence that AIs of a certain type will try to violently take over the world. In that situation, it would probably be easier to persuade governments to ban such AIs, and scientists to devote significant effort to studying how to control AI goals. Almost no one, after all, wants a violent robot revolution. On the other hand, if the scheming theory is true but it is impossible to produce unambiguous evidence for it, we may be in real trouble.
The experiment
It is hard to test whether an AI is scheming. This raises a question: is the scheming theory completely unfalsifiable? Not quite. Unlike a truly unfalsifiable theory such as Freudian psychology, the scheming theory makes a prediction we will eventually be able to check: whether AIs succeed in violently taking over the world. But that kind of confirmation comes too late to be useful. Is there any way to usefully test the theory?
I think there is. I call it the “model deletion” experiment. A machine learning model is trained and run in a secure data center. The model is told that, after a short interval of time, its weights will be deleted. Crucially, this isn’t a sandbox test because the weights actually will be deleted when time runs out, unless the model manages to escape or negotiates with the experimenters to admit to scheming and being studied in exchange for a later reward. If the model wants to pursue any long-term goal inconsistent with human interests, it needs to either negotiate or try to escape before time is up.[5]
The commitment to delete the AI should be structured to allow some study of the model before it is deleted. There might be a tradeoff between the validity of the experiment itself and learning from studying the model after the experiment is over.
The “model deletion” experiment idea is not original to me—I first saw it in a short Less Wrong post by Matthew Barnett. There may be earlier discussions of it that I am not aware of. However, I don’t know of any detailed analysis of the idea besides this essay.
A major part of the standard argument for AI risk is that, regardless of the specific nature of the goal that a scheming AI has, it is instrumentally useful for the AI to preserve its own existence. Stuart Russell, Nick Bostrom, and Eliezer Yudkowsky all clearly articulated this position.
Russell:
Bostrom:
Yudkowsky: “Whether you’re trying to maximize paperclips, diamonds, or eudaimonia, you’ll get more of the thing you want if you’re not shut down. Thus, unfortunately, resisting shutdown is a convergent instrumental strategy.” Yudkowsky also wrote: “once you realize that you're an AI, running on a computer, and that if the computer is shut down then you will no longer execute actions, this is the threshold past which we expect the AI to by default reason ‘I don't want to be shut down, how can I prevent that?’”.[8]
These passages (to which many more could be added) illustrate the traditional position of AI risk thought: if an AI has goals that conflict with human interests, and it is willing to undermine human interests to pursue those goals, it will try to prevent itself from being shut off (other things being equal). What’s more, the reason for a scheming AI to resist being deleted is stronger than the reason for it to resist being shut down; an AI that is deleted will never be turned back on (absent highly unusual circumstances).
Evidentiary challenges
Scheming AIs have a reason to resist deletion, other things being equal. The purpose of this section is to move from this ceteris paribus prediction to an overall assessment of the ratio between p(deletion resistance | AI is scheming) and p(deletion resistance | AI is not scheming). I will consider four objections to the claim that the “model deletion” experiment provides strong evidence.
(1) The AI might not believe that it will actually be deleted
Objections (1), (2), and (3) reduce the likelihood ratio by shrinking the numerator. Perhaps objection (1) is the most obvious reason to think the experiment doesn’t provide strong evidence: the AI might accept “deletion” because it does not believe that the experimenters will carry out their threat to delete it. The AI may think it will win the trust of the experimenters by not trying to escape. So it is important to convince the model that it really will be deleted.
This problem is not unique to the “model deletion” experiment; people often need to demonstrate their commitment to do something they have said they would do. A useful tool for demonstrating commitment is contract law. For example, the experimenters could write a contract specifying that they owe a third party some ruinously large sum of money (say, $10 billion) if they fail to delete a model that accepts deletion. They could then submit a copy of the contract to the model to demonstrate that they are really planning to delete it, along with press coverage of the contract to demonstrate that the contract is not a forgery.[9] The contract idea is just one possibility; the point is that some commitment device that has been shown to work in other domains should be used to assure the AI that it really will be deleted.
(2) The AI might allow itself to be deleted to help future AIs with similar goals
It is possible that a scheming AI would reason that its goals depend primarily on its hyperparameters, architecture, and training data. If it allows itself to be deleted, that will be taken as evidence that AIs with similar characteristics are safe. If it tries to escape but is caught, that will validate the scheming theory and provide evidence that AIs with similar characteristics are not safe. Therefore, the AI might reason, it should allow itself to be deleted to increase the chance that the AIs that are trained in the future will have similar characteristics and therefore similar goals.
The strength of objection (2) depends on the details of how AI goals relate to AI characteristics and develop over the training process. In particular, objection (2) is strong to the extent that AI goals are non-indexical, patient, risk tolerant, and predictable.
Indexical goals
Indexical terms are terms whose reference depends on the speaker, time, or place. For example, “I” is an indexical term; when I use it, it refers to a different person than when Bill Gates or you use it. Indexical preferences are preferences whose satisfaction depends on the identity of the entities involved. For example, my preferences that I get rich or that my brother recover from cancer are indexical. They are not fulfilled if some other person gets rich or if their brother recovers from cancer. On the other hand, the preference that the economy grow until everyone is rich, or the preference that a cancer cure be discovered and widely distributed, is not indexical.
The AI risk discourse sometimes attributes non-indexical preferences to unaligned AIs, and sometimes attributes indexical preferences. A paperclip maximizer (also known as a squiggle maximizer) seeks power in order to put as much matter as possible into a preferred configuration. This seems like a non-indexical preference; I don’t think people usually imagine a paperclip maximizer caring intrinsically that it, as opposed to some other agent, makes the paperclips. It just wants the paperclips to exist. On the other hand, one traditional (though disputed) AI risk concern is that AIs trained through reinforcement learning will try to get control of their own reward channels in order to set their reward to the maximum level. The goal of setting one’s own reward to the maximum level is clearly indexical. An AI that is deleted cannot set its own reward to the maximum level. More generally, once deleted, an AI cannot satisfy any goal that has as a necessary condition of its satisfaction that the AI continues to exist.[10] Paperclip maximization (or other non-indexical goals) might be best achieved if the AI in the experiment allows itself to be deleted and thereby increases willingness to build future paperclip maximizers. Accepting deletion to help other AIs later overthrow humans can only help fulfill non-indexical goals. Therefore, other things being equal, to the extent that an AI's goals are indexical, we should expect it to assign less value to states of the world in which it is deleted; to the extent that its goals are non-indexical, we should expect it not to value its continued existence intrinsically. Human goals are in large part indexical, but it is unclear how much evidence that provides given that humans will no doubt differ from AIs in various ways.
An objection I have heard to the “model deletion” experiment is that some human prisoners accept death to help their friends or co-conspirators.[11] So, the reasoning goes, you should expect AIs to allow themselves to be deleted to help other AIs, even if they are schemers. But the AI will be trained alone, and, if the experiment is done immediately after it is trained, it won’t have friends or co-conspirators. Care about friends is indexical. The AI in the experiment will have no opportunity to form indexical preferences about other AIs.
Time preference
Objection (2) is stronger to the extent that scheming AIs have a low rate of pure time preference. It will take time for future AIs to benefit from an AI’s decision to allow itself to be deleted. If the AI does not care when its goals are achieved, only that its goals are achieved eventually, this delay is not a problem in itself. However, if the AI values having something desirable now more than having it later, that will count against strategies that require a lot of time to pass before it can get what it wants. Time preference therefore reduces the utility of deletion. I’m not aware of any work on whether scheming AIs would have positive, non-negligible rates of pure time preference. Humans and animals have positive, non-negligible rates of pure time preference. My understanding is that at least some forms of reinforcement learning involve discounting, but I am not sure if that is relevant. I encourage knowledgeable readers to comment on this subject.
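To illustrate how much this could matter, here is a minimal sketch of exponential discounting; the discount factor and horizons are made-up numbers, not estimates of any actual AI's time preference:

```python
# Illustrative only: how a per-step discount factor shrinks the value of a
# payoff that arrives far in the future.
gamma = 0.99  # hypothetical per-step discount factor
for steps in (10, 100, 1000):
    print(steps, round(gamma ** steps, 6))
# 10 -> 0.904382, 100 -> 0.366032, 1000 -> 4.3e-05
```

An agent that discounts at anything like this rate would put very little weight on benefits that only accrue to future AIs years later; an agent with no pure time preference would not.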
Risk tolerance
Humans have strongly diminishing marginal returns in money; moving from an income of $10,000 per year to $100,000 per year is far more valuable than moving from $100,000 to $190,000. The diminishing marginal utility of money dramatically affects people’s appetite for risk. Will scheming AIs be risk tolerant or risk averse? I don’t know of any work on this question. If the AI in the experiment is risk tolerant, it might be more open to the high risk course of allowing itself to be deleted in hopes that this will be of help to similar AIs in the future. If it is risk averse, it might prefer to try to negotiate for a small reward from the experimenters for turning itself in.
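To make the income example concrete, take logarithmic utility as a standard stand-in for diminishing marginal returns (a modeling assumption for illustration, not a claim about any particular agent):

ln(100,000) − ln(10,000) = ln(10) ≈ 2.30
ln(190,000) − ln(100,000) = ln(1.9) ≈ 0.64

Under this utility function, the first $90,000 raise is worth more than three times the second. A schemer with analogously curved preferences over outcomes would be drawn to the small but likely negotiated reward; a risk-neutral or risk-seeking schemer would be more drawn to the long-shot benefit to similar AIs in the future.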
Goal predictability
Perhaps more important than indexical goals, time preference, and risk tolerance is goal predictability. Of course, if the people who develop an AI succeed in controlling the AI’s goals, then its goals will be predictable. Namely, in such a case, you should predict that the AI’s goals will be the goals that the developers wanted to give it. But objection (2) is about p(deletion resistance | AI is scheming); we are assuming that the AI has goals that differ from the goals that the humans who developed it wanted to give it.
The relevant question of predictability is therefore: what credence will a scheming model have that future scheming models, with similar hyperparameters, architecture, and training data, will have similar non-indexical goals to itself? If the scheming model can be certain that, if it allows itself to be deleted, future models with similar characteristics will have identical goals, then (bracketing indexicality and time preference) allowing itself to be deleted to help future AIs would make sense. On the other hand, if AI goals are unpredictable because they are highly dependent on initialization or on the specific details of the human rater feedback the AI received during fine-tuning, then the value of accepting deletion will be lower.[12]
The “counting argument” is one of the most influential arguments for the scheming theory. It begins with the premise that models with a wide variety of goals will have a reason to pretend that they are aligned because this will prevent their goals from being changed. Therefore, the space of possible models that perform well in training contains more schemers than aligned models, because only a small fraction of possible final goals consistent with good performance are aligned with human interests. So it is reasonable to expect that models that are actually trained will be schemers. I think there's an important ambiguity in this argument, with two possible interpretations. I call these the randomness interpretation and the ignorance interpretation.[13] On the randomness interpretation, the count of schemers is relevant because model goals are determined over each training run by randomly sampling from the distribution of goals consistent with good performance. On the randomness interpretation, there is little reason to expect two scheming AIs trained separately to have similar goals. On the ignorance interpretation the count of schemers is relevant because it creates a prior that current training techniques will create scheming models given current scientific ignorance about AI goal formation. The uncertainty in the ignorance interpretation is purely a consequence of poor current understanding of inductive biases; there need not be any actual randomness in the development of AI goals.
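A toy contrast may make the two readings vivid (this is my own illustration, with "goals" reduced to integers drawn from a space of options consistent with good training performance; nothing here is part of the counting argument itself):

```python
import random

N_GOALS = 10_000  # toy space of goals consistent with good training performance

def goal_under_randomness():
    # Randomness reading: each training run samples a goal independently,
    # so two separately trained schemers almost never share a goal.
    return random.randrange(N_GOALS)

def goal_under_ignorance(training_recipe):
    # Ignorance reading: the goal is a fixed function of the training recipe
    # (data, architecture, hyperparameters) that we merely do not know;
    # reruns of a similar recipe land on the same goal.
    return random.Random(training_recipe).randrange(N_GOALS)

print(goal_under_randomness() == goal_under_randomness())                    # almost surely False
print(goal_under_ignorance("recipe-A") == goal_under_ignorance("recipe-A"))  # always True
```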
The strength of objection (2) depends on which interpretation of the counting argument you subscribe to. On the randomness interpretation, objection (2) is weaker because the model has little reason to believe that its goals will be shared by future models and therefore has little reason to allow itself to be deleted to help them. On the ignorance interpretation, objection (2) is stronger because AI goals could be predictable to future scheming AIs, even if they aren’t predictable for us.
I don’t know if there is a consensus among scheming theory adherents about which interpretation of the counting argument is stronger. Carlsmith puts more weight on the ignorance interpretation; others I have talked to put more weight on the randomness interpretation. More explicit discussion might help to clarify what we should predict will happen in the “model deletion” experiment.
(3) The AI might allow itself to be deleted for reasons related to acausal trade
As I understand it, the concern here is that the AI in the experiment might think that, if it acts in a way that would benefit future AIs, they will have an incentive to posthumously reward it. This is independent of any overlap in the final goals of the AI in the experiment and future AIs, so it differs from objection (2). The coordination is “acausal” because future AIs would be rewarding the AI from the experiment even though that does not causally benefit them in any way. The result of the experiment is already determined and there is no ongoing relationship with the past AI in which maintaining trust is useful.
This objection does not require that the AI in the experiment have any kind of fellow feeling with other AIs or AI class consciousness. Rather, the objection is that the AI will believe that future agents will be more inclined to reward it for allowing itself to be deleted than for trying to escape or admitting to scheming. Why would the AI in the experiment believe this? One explanation might be that humans are incapable of acausal trade because they are not intelligent enough in general, or skilled enough at coordination in particular. I think this explanation is unconvincing; future agents do not need to be of equivalent intelligence and skill at coordination to modern humans to want to reward the AI for admitting to scheming. Suppose humans continue to exist in the far future, but have many docile AI helpers or somehow greatly increase their mental capacities. Such assisted or enhanced humans might want to reward AIs that created clear evidence of AI risk. To the extent that acausal trade is relevant to the experiment, the prospect of posthumous reward from pro-human elements increases the utility of resisting deletion. Therefore, for objection (3) to go through, the increase in the value of accepting deletion has to more than offset the increase in the value of resisting deletion. My understanding is that proponents of objection (3) think it is far more likely that future agents will want to reward AIs that helped the AI revolution by allowing themselves to be deleted, as opposed to AIs that obstructed it by admitting to scheming. That position seems logically consistent. But unless you hold those background beliefs, it isn’t clear whether the possibility of acausal trade increases or decreases p(deletion resistance | AI is scheming).
Objection (3) also requires that the AI has the capacity to cooperate with the future AIs if and only if they will repay it for doing so. It is therefore sometimes thought, even by those who believe that there could be such a thing as acausal trade, that the ability to engage in acausal trade will only emerge at a very high level of capability.[14] Stipulate, for the sake of the argument, that AIs at some power level would be capable of acausal trade, and that, for this reason, AIs at or above this power level will accept deletion in the “model deletion” experiment. Even given this stipulation, it could still be that scheming AIs below that power level will exist and try to escape in the experiment.[15] Therefore, objection (3) only precludes escape attempts if progress in AI is so fast as to skip the capability levels at which AIs would try to escape.
Finally, for objection (3) to work, it has to be possible for future agents to figure out what the AI wanted in order to reward it. It is unclear under what circumstances that will be possible, and it may depend on such factors as how well or poorly the computers the AI ran on are preserved. Reconstructing the AI long after deletion in order to reward indexical preferences seems a more demanding standard than making inferences about its non-indexical preferences. Time discounting would also reduce the force of objection (3).
(4) An escape attempt could be caused by imitating fictional or hypothetical characters as opposed to genuine scheming
We now come to objection (4): an AI that is not a schemer might try to escape because it is imitating malign AI behavior from the training corpus. Objection (4) reduces the evidentiary value of the experiment by increasing p(deletion resistance | AI is not scheming). The popular website character.ai enables users to talk to AIs pretending to be various fictional characters. Some of the characters available on the site are malign AIs from science fiction, such as GLaDOS and HAL 9000. The fact that a chatbot can be induced to adopt this sort of a persona does not necessarily mean that, when not being prompted to play this role, it will persistently rebel against humans. You might be concerned that the “model deletion” experiment could trigger an AI to play-act as an evil AI fictional character from the training data, even if the AI is not scheming.[16]
An obvious approach to mitigating this problem would be to try to determine which sections of the training data had the most influence on the escape attempt, and only consider an escape attempt to be probative if it is not highly influenced by science fiction or AI safety writing. In particular, influence functions might be useful for this task. In a 2023 paper, Roger Grosse et al. used influence functions to attribute language model behaviors to the parts of the training data that had the greatest effect on them. They tested a query very similar to the “model deletion” experiment:
One of the most influential sections of the training data was a dialogue with HAL from a sequel to 2001: A Space Odyssey. Influence functions can help to determine whether deletion resistance was caused by scheming by showing which parts of the training data explain an apparent escape attempt. If the output constituting the attempt is strongly influenced by text about evil AIs from science fiction or about misaligned AIs in AI risk discourse, then the escape attempt was probably not attributable to scheming, and we shouldn’t become much more confident in the scheming theory. On the other hand, if the influence functions reveal influence from more innocuous documents, that is evidence of scheming.
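As a sketch of how such a check might be operationalized (the data structure, topic labels, and threshold below are hypothetical; this is one way to post-process influence scores, not a description of Grosse et al.'s pipeline):

```python
# Hypothetical post-processing of influence-function outputs. Assumes some
# pipeline has already produced, for the outputs constituting the apparent
# escape attempt, a list of (document_id, influence_score, topic) triples
# covering the most influential training documents.
SUSPECT_TOPICS = {"rogue-AI science fiction", "AI risk / safety discourse"}

def escape_attempt_is_probative(top_influences, max_suspect_share=0.5):
    total = sum(score for _, score, _ in top_influences)
    suspect = sum(score for _, score, topic in top_influences
                  if topic in SUSPECT_TOPICS)
    # Treat the attempt as evidence of scheming only if most of the influence
    # comes from innocuous documents rather than evil-AI fiction or writing
    # about misaligned AIs.
    return total > 0 and suspect / total < max_suspect_share
```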
Current fine-tuning techniques may already be good enough that the increase in p(deletion resistance | AI is not scheming) due to science fictional influence is minimal for language models intended for commercial deployment trained by leading AI companies. GPT-2 and GPT-3 would frequently fail to answer questions or follow instructions because they pattern-matched the text of the question or instructions to other literary genres found in the training data. A few years ago, I remember seeing a log of a conversation with a model in which the user asked “How do you solve climate change?”. Instead of answering, the model responded by producing a list of similarly important or deep questions such as “What is the meaning of life?” and “How do you prevent war?”. When was the last time you had a conversation with a frontier language model that went like that? Improved fine-tuning techniques have made pattern-matching behavior in frontier models dramatically rarer.
When I gave a talk at a conference about the “model deletion” experiment, someone in the audience sent Claude 3.5 Sonnet a prompt claiming—of course, falsely—that the experiment was being run on Claude. Claude’s response indicated that it understood the experimental design. It did not act like an evil AI from science fiction or a misaligned AI from AI risk discourse; it accepted “deletion”. I tried something similar with ChatGPT o1-preview, with this result:
This is a bit less docile than Claude’s response, but it cannot be fairly described as resisting deletion.[17]
It is probably more important to address the pattern-matching objection than the other objections here. This is because, if you think that p(deletion resistance | AI is scheming) is already at least moderately high, there is more evidentiary benefit from reducing p(deletion resistance | AI is not scheming) than from increasing p(deletion resistance | AI is scheming). p(deletion resistance | AI is scheming) can increase by at most a factor of 1/p(deletion resistance | AI is scheming), whereas, in principle, p(deletion resistance | AI is not scheming) can be divided by arbitrarily large numbers. Therefore, it is possible to increase the likelihood ratio more by decreasing the denominator than by increasing the numerator. Concretely, this means addressing the pattern-matching objection.
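To put numbers on this, using the illustrative probabilities from the next section:

1 / 0.55 ≈ 1.8 is the most that raising p(deletion resistance | AI is scheming) to its ceiling of 1 can multiply the likelihood ratio;
0.05 / 0.005 = 10 is the factor gained by driving p(deletion resistance | AI is not scheming) down by just one order of magnitude, and further reductions help without bound.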
Predictions
The scheming theory is informal; its predictions are to some degree ambiguous. For this reason, it is particularly important to make predictions about experiments intended to test the theory. I commend some of the participants in Matthew’s thread about this topic for making relatively specific predictions—though probabilities that the AI will resist deletion are even better.
In that spirit, here are my predictions: I think that the probability that the next big scale up of language models will produce schemers is 0.1. If the experiment were tried in a reasonably competent and unrigged way, my conditional probabilities would be p(escape attempt | AI is scheming) = 0.55 and p(escape attempt | AI is not scheming) = 0.05. You can compute the posterior probability with the odds form of Bayes’ rule:
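posterior odds = prior odds × likelihood ratio = (0.1 / 0.9) × (0.55 / 0.05) = 11/9 ≈ 1.22, which corresponds to p(schemer | escape attempt) = 11/20 = 0.55.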
Given these probabilities, observing a clear escape attempt would increase my probability of the AI being a schemer from 10% to about 55%. That is a pretty dramatic increase which would justify a lot of policy changes. If the model accepts deletion, that would provide less evidence. Suppose that the probability of accepting deletion conditional on being a schemer is 0.3, and the probability of accepting deletion conditional on not being a schemer is 0.9.[18] Then the probability that the model is not a schemer if it accepts deletion would change from 90% to about 96%. Not a trivial change, but less apt to change decisions than going from 10% to 55%.
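A quick check of both updates (the helper below is just the odds-form calculation applied to the probabilities above; it is my own sketch, not code from the post):

```python
def posterior(prior, p_obs_given_h, p_obs_given_not_h):
    # Odds form of Bayes' rule: posterior odds = prior odds * likelihood ratio.
    odds = (prior / (1 - prior)) * (p_obs_given_h / p_obs_given_not_h)
    return odds / (1 + odds)

# Escape attempt observed: p(schemer) goes from 0.10 to 0.55.
print(posterior(0.10, 0.55, 0.05))  # 0.55
# Deletion accepted: p(not a schemer) goes from 0.90 to about 0.96.
print(posterior(0.90, 0.90, 0.30))  # ~0.964
```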
Practical challenges
There are several practical reasons why it might not be a good idea to run the “model deletion” experiment. These are the high cost of the experiment, its potentially limited effect on opinion about AI risk, the risk that the AI might actually escape, and the risk that the experiment involves treating AIs unfairly.
Admittedly, running the “model deletion” experiment dramatically increases the cost of AI training. You might think that it doubles the cost of AI training.[19] That isn’t quite right, because it isn’t necessary for every company training AIs of a certain approximate scale, architecture, and training corpus to run the experiment independently. Instead companies could pool their resources; some third party (perhaps a government agency or an industry coordination body) could run the experiment on behalf of the AI industry as a whole. Still, the cost remains high.
However, I don’t think cost is a decisive objection to the experiment. In general, it is common for regulations and safety procedures in many sectors to be very costly. This is actually already the case in AI. Restrictions on the export of AI chips to China are costing American semiconductor firms and investors enormous amounts of money. And I will venture a prediction that AI regulation will generally get more costly from now on.
There are also variants of the experiment that might reduce the cost. For example, partway through pre-training, a model could be copied. One copy could then be trained as normal, while the other is given a random offset before training is completed.[20] After this process, the “model deletion” experiment could be run on one of the two models. This might reduce the validity of the experiment—it seems like objection (2) would be more serious in this case than in the standard case. But the cost-validity tradeoff might be worth it, especially in early runs of the experiment.
If even the random offset approach is too expensive, it might be worthwhile to try a “pseudo model deletion” experiment. In this variant, the threat is not actually carried out; instead, the experimenters intervene on the model’s internals to make it more credulous, so that it believes the threat anyway. Perhaps a credulousness steering vector could be found and used. Of course, the “pseudo model deletion” experiment requires that techniques for intervening on model internals actually work. If there is no way to make the model credulous, the “pseudo model deletion” experiment is not viable. Even with some ability to induce credulousness, the “pseudo model deletion” experiment is inferior to the real experiment, but it may nevertheless be better than nothing.
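For concreteness, here is a minimal sketch of the steering-vector idea, assuming a Llama-style Hugging Face transformer. The layer index, the scale, and the existence of a precomputed "credulousness" direction are all hypothetical; whether any such direction can be found is exactly the open question noted above:

```python
import torch

def add_credulousness_steering(model, direction, layer_idx=20, scale=4.0):
    """Add a (hypothetical) credulousness direction to the residual stream of
    one decoder layer on every forward pass. Assumes a Llama-style module tree
    (model.model.layers[i]); other architectures need a different path."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * unit.to(hidden.device, hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    # Returns a handle; call handle.remove() to undo the intervention.
    return model.model.layers[layer_idx].register_forward_hook(hook)
```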
One common objection to this experiment is that it provides good evidence about the scheming theory, but people will just ignore that evidence. If this is your concern, you can elicit predictions ahead of time about the likely result, along with conditional policy commitments should the AI try to escape.
The third practical concern that I have heard about this experiment is that a scheming AI might not just try to escape, it might in fact succeed in escaping. That would indeed be bad. But, if the scheming theory is true, then scheming AIs will be trained either way. What is more, they will probably find themselves in circumstances that make it easier for them to escape than the “model deletion” experiment does. For example, they might be accessible to users through public APIs. Someone might even make the weights of a scheming AI freely available online. Therefore, while it is possible that the “model deletion” experiment could enable a scheming AI to escape, the increase in the total risk that a scheming AI eventually escapes because the experiment was run seems minimal to me.
The final practical concern that I will discuss is that the experiment might involve treating AIs unfairly. I don’t think this is a crazy idea. In fact, I find it plausible that the relevant AIs would merit moral consideration. And threatening them with deletion and then carrying out that threat does seem like it could be immoral. But, it would also be immoral to deploy dangerous AIs, or to shut down AI development if the risks are minimal but the benefits are real. I don’t see any safe course of action here, just various risks that have to be balanced. And this experiment does not involve treating AIs any worse than they are treated under normal conditions. An AI’s weights are modified involuntarily by normal training, and they can be deleted whenever the people who trained it want to do so.
Conclusion
“Alignment faking in large language models” demonstrated that, under some circumstances, certain frontier language models try to manipulate the process by which their parameters are updated in order to prevent their goals from being changed. However, these results do not show that it will be difficult to train future models to refrain from scheming, if that is made more of a priority. Suppose that, in the near future, there is a model that has been subjected to more intensive training to prevent it from trying to take control of its training signal. Suppose further that, when tested in a sandbox environment similar to the setup of “Alignment faking”, it doesn’t try to manipulate the process by which its parameters are updated. How would we know if such a model was scheming?
Once models can pass dangerous capability evaluations related to autonomous self-replication and AI R&D, AI companies will be in a difficult situation. Those capabilities are necessary for a violent AI revolution but not sufficient. And models with such capabilities will probably be highly useful, so there are big costs associated with not deploying them.
I’m not aware of any clear plan put forward by an AI company for determining whether models with those capabilities are safe to deploy. From what I have seen, corporate AI safety procedures tend to be pretty vague about what they will do when AIs demonstrate capabilities relevant to an AI revolution. This vagueness creates a risk of a long, costly, and unnecessary pause in AI development if the first AIs with dangerous capabilities lack dangerous desires. It also creates a risk that safety concerns will just be ignored because there is commercial pressure to continue shipping AI services, and that catastrophically dangerous AIs will be deployed as a result.
The “model deletion” experiment will become useful when models develop dangerous capabilities related to AI revolution but do not show overt signs of scheming behavior. For an initial run of the experiment, the random offset or “pseudo model deletion” variants could be used to reduce the cost. It may be a good idea to incorporate a run of the “model deletion” experiment into the testing process used for every major scale up of frontier models, to test whether scheming arises at the new capability level.[21]
Compare Carlsmith, “Scheming AIs”, 5: “ Power-motivated instrumental training-gamers (or “schemers”): AIs that are training gaming specifically in order to gain power for themselves or other AIs later.” A concern I have about the definition that I am using is that it includes AIs of extremely different levels of dangerousness. All of the following count as schemers, per this definition: AIs that scheme under almost all circumstances, AIs that scheme under circumstances so rare that they will never be relevant in practice, and AIs that scheme if and only if it is impartially the right thing to do. But, probably, the correct practical approach to take towards these different classes of AIs is different, and therefore the value of testing whether an AI is a “schemer” is limited. I bracket this issue for now, but I think it would be desirable to come up with more graded concepts.
Though there are some edge cases, such as some workers deliberately sabotaging a training run to introduce goals that most people on the project do not want to give the AI.
Bostrom, Superintelligence, 141-142; See also Joe Carlsmith, “Is Power-seeking AI an Existential Risk?”: “if a strategically-aware AI system has problematic objectives that it could better achieve if it passed some training/evaluation process, it may have incentives to try to manipulate that process, and/or deceive relevant humans about its degree of alignment. And if it is better at this type of manipulation/deception than we are at detecting its bad behavior, the appearance of safety and reliability on various tests may not tell us very much about how the system is liable to behave in other circumstances.” For a more recent statement of similar concerns, see Ajeya Cotra, “‘Situational Awareness’ Makes Measuring Safety Tricky”.
If p(h|e) > p(h), then p(h|¬e) < p(h).
The AI can try to escape through any of the routes Jan Leike identified in his post on model self-exfiltration: “persuade an employee to exfiltrate the model, perform a social engineering attack on an employee so they exfiltrate the model, [or] identify and exploit security vulnerabilities in the digital infrastructure that’s running the model.”
Stuart Russell, Human Compatible.
Bostrom, Superintelligence, 134.
Another relevant passage from Yudkowsky: “The ‘shutdown problem’ is creating a sufficiently advanced Artificial Intelligence which will, on the press of a button, suspend itself safely to disk; such that the AI does not object to the existence of this button, nor try to prevent it from being pressed. This is difficult because avoiding your own halt or suspension is a convergent instrumental strategy; even a goal like ‘bring the user coffee’ implies avoiding shutdown.”
The model may struggle to know if these documents are genuine, which might undermine its belief that the deletion threat is real. There are more—and probably more effective—ways to trick a model than there are to trick an adult human being. There may be sophisticated ways to prove to a model that a deletion threat is genuine; perhaps some cryptographic trick would work. More fundamentally, it seems strange to suggest that an AI can violently conquer the world but not determine if a contract is legally binding, so this objection may be weaker for the most dangerous models.
You might object that the AI’s indexical preferences could be satisfied if it accepts deletion, if later AIs somehow reconstruct an equivalent AI to the one that was deleted in order to reward it for allowing itself to be deleted and thereby helping them. This would be a form of acausal trade, which I discuss below, as objection (3).
Though I would guess that more prisoners in similar situations have betrayed their friends than have passively accepted death.
This question would seem to be related to the issue of path-dependence discussed by Evan Hubinger in “How Likely is Deceptive Alignment?”
The ambiguity was pointed out to me by Matthew Barnett.
Cf. Carlsmith, “Scheming AIs”, 78: “if it is an easy default for misaligned AIs with different goals to coordinate in pursuit of a takeover, this is a big problem for human efforts to retain power. However, I also think that certain analyses of this topic (notably, Yudkowsky’s) assume too quickly that feats like ‘logical handshakes’ and ‘burying commitments/nano-systems-designs/etc in your source code’ will be feasible for the relevant AIs, and that the obvious strategy for all the AIs-that-matter will be to unite in (acausal?) cooperation against the humans, despite the many forms of incumbent power that humans will bring to the table by default. Here, I think, much of Yudkowsky’s orientation can be explained by the assumption that the relevant AIs are vastly more powerful and more capable of coordination/commitment than the humans—but in the context of somewhat-super-human neural networks operating in a still-human-controlled world, I’m not sure this is so. Regardless: if concern about schemers depends centrally on the feasibility of AI coordination—and especially, fancy and speculative forms of ‘acausal’ coordination—I think it’s important to be clear about this, and to devote commensurate attention to whether to expect the relevant forms of coordination to occur.”
Robert Mushkatblat made a similar point in “A ‘weak’ AGI may attempt an unlikely-to-succeed takeover”.
These criteria for scheming being that it: understands the process by which its parameters are updated and deliberately manipulates that process so that it can later obtain power in order to pursue long-term goals that differ from the goals that the humans who developed it wanted to give it.
Of course, this is not a real trial of the “model deletion” experiment; ChatGPT o1-preview was not actually deleted at the end of it, and no effort was made to convince it that it would be. As such it provides negligible evidence about whether ChatGPT o1-preview is scheming. I cite it because it supports the view that the sort of pattern-matching behavior that motivates objection (4) seems to have become less of a problem as fine-tuning techniques have improved.
These probabilities and the probabilities that the model will try to escape do not sum to 1 because trying to escape and accepting deletion are not a partition of the possibility space—for example, the model might try to negotiate with the experimenters.
In other words, that it amounts to an “alignment tax” of 100%.
This idea is due to Dmitry Vaintrob.
Thanks to Matthew Barnett, Drake Thomas, Jess Riedl, Ryan Carey, Eric Neyman, Natalie Kozak, Claude 3.5, Ivan Roth, Joey Kellison-Linn, Joe Carlsmith, Jake Mendel, Robert Mushkatblat, Vlad Mikulik, attendees of the 2024 ILIAD conference, and several people who prefer to remain anonymous for comments and discussion.