Copying over a response I wrote on Twitter to Emmett Shear, who argued that "it's just a bad way to solve the problem. An ever more powerful and sophisticated enemy? ... If the process continues you just lose eventually".
I think there are (at least) two strong reasons to like this approach:
1. It’s complementary with alignment.
2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up from setups that would control human-level AGIs, to setups that would control slightly superhuman AGIs, to…
As one example of this: as you get increasingly powerful AGI you can use it to identify more and more vulnerabilities in your code. Eventually you’ll get a system that can write provably secure code. Ofc that’s still not a perfect guarantee, but if it happens before the level at which AGI gets really dangerous, that would be super helpful.
This is related to a more general criticism I have of the P(doom) framing: that it’s hard to optimize because it’s a nonlocal criterion. The effects of your actions will depend on how everyone responds to them, how they affect the deployment of the next generation of AIs, etc. An alternative framing I’ve been thinking about: the g(doom) framing. That is, as individuals we should each be trying to raise the general intelligence threshold at which bad things happen.
This is much more tractable to optimize! If I make my servers 10% more secure, then maybe an AGI needs to be 1% more intelligent in order to escape. If I make my alignment techniques 10% better, then maybe the AGI becomes misaligned 1% later in the training process.
You might say: “well, what happens after that”? But my point is that, as individuals, it’s counterproductive to each try to solve the whole problem ourselves. We need to make contributions that add up (across thousands of people) to decreasing P(doom), and I think approaches like AI control significantly increase g(doom) (the level of general intelligence at which you get doom), thereby buying more time for automated alignment, governance efforts, etc.
Promoted to curated: I disagree with a bunch of the approach outlined in this post, but I nevertheless found this framing quite helpful for thinking about various AI X-risk related outcomes and plans. I also really appreciate the way this post is written, being overall both approachable while maintaining relatively much precision in talking about these issues.
I like this direction and this write-up of it!
If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning.
Let's assume that an AI lab is in a fast-paced, rapidly changing state of affairs. (As you might expect during a moderately fast takeoff.) If an AI then gets to do the equivalent of 3 months of thinking, it could use that to do notably more serial thinking about how to exploit the current situation than any human has done (because some important features of the current situation is less than 3 months old: e.g. you had to switch to an importantly different programming language, or some different type of communication protocol between AIs, or change your AI safety research paradigm in a way you didn't expect). I'm curious how you'd ensure (or evaluate for) red-team competitiveness in this case.
This is indeed a serious concern.
A key hope here is that with some amount of conservativeness our evaluation[1] and our protocol can generalize to future setups. In particular, our countermeasures could be sufficiently generic that they would work in new situations. And, in our control evaluation, we could test a variety of situations which we might end up in the future (even if they aren't the current situation). So, even if the red team doesn't have enough time to think about the exact situation in which we end up in the future, we might be able to verify that the protocol can generalize to different contexts in the future.
Alteratively, we can at least try to define operating conditions under which the situation is safe without further evaluations or reasoning.
For further discussion on this topic, see here. (This link is to a section of a draft version of a doc which we've made public. We'll post a final version of this doc at some point.)
Overall, we don't think the interventions I've described here make the situation overall clearly fine, but they seem likely to suffice for substantial reductions in risk given a large effort. So, the situation is indeed considerably worse if AIs can do a huge amount of uninterpretable serial reasoning.
(And our overall view on control is also something like "we don't think control alone makes the situation overall clearly fine, but control seems likely to suffice for substantial reductions in risk given a large effort". And there are many good objections to control (like this one), and our response to these good objections is typically "we have some countermeasures to this issue, these countermeasures make the situation considerably better than you might have thought, but they don't fully eliminate the problem".)
The red teaming in particular. ↩︎
Thanks for writing this. I think it does a good job laying out the assumptions needed for control.
It seems like it could be challenging to justify that a red team is competitive -- especially if you are aiming to extract massive amounts of work from AI systems.
For example, if your AI system is like a human sw engineer but 30X faster, then it might come up with effectively superhuman plans just because it has lots of time to think. Externalized reasoning can help, but it's not very clear to me that externalized reasoning can scale to domains where AI systems are doing a bunch of research that humans (and GPT-4) can't readily disentangle and understand. I'm imagining that chain of thought will amount to thousands of pages of technical content that is written like rambly google docs. It might be hard to pick out the AI safety research from the cyberoffense planning in this case -- especially with dumb models like GPT-4.
It seems like to the extent you want to use AI systems for work that is hard to evaluate, control measures are difficult to trust. Maybe a key crux is whether AI systems will have to do work that is hard to evaluate to yield a 30x acceleration factor. I'd imagine this would apply for AI systems to achieve fundemental research advances -- e.g. in interpretability, formal verification, etc.
Curious about your thoughts.
First, here's a general point which is relevant in responding to your comment.
There are different types of failures related to control:
Failure (1) is plausibly less bad than failure (2). And, (1.a) looks different than (1.b).
(See here for more discussion of different failure modes, but note that this is still an early version which is subject to change.)
So for failure (1) there is an important distinction between justifying/determining a red team is competitive (1.a) and ensuring that a red team is competitive (1.b).
We might be able to have reasonable-ish but somewhat approximate tests to get a rough sense of red team competitiveness and control evaluation viability more generally.
These tests are things like:
(A forthcoming post of ours will describe these tests in more detail. An early version of this post is here, but it doesn't describe these tests in sufficient detail yet and our thinking is still evolving.)
Some of these tests looks like standard control red teaming + capability evaluations (can the AI do problematic planning?) while some tests look more adhoc and might use things like model organisms, slightly related capability evalautions, and extrapolation based measurements (in particular testing for the ability to sandbag probably looks like this, we have some forthcoming work demonstrating this methodology).
Note that we expect that these "adhoc tests" are probably much less precise than the core capability evaluations of control. But this might be fine because in many cases (e.g. sandbagging), I expect we'll be able to have large safety margins.
Overall, I think we have some (IMO pretty reasonable) hopes for at least knowing that our red team won't realistically be competitive which will help substantially with avoiding failure (1). But, I can totally understand people being skeptical based on what we've legibly written up so far.
Here's a summary of my understanding of your comment:
As far as total thinking time resulting in superhuman plans which break red team competitiveness, let's first consider the case where there are absolutely no natural language bottlenecks (the AI isn't force to regularly bottleneck serial reasoning in natual language (externalized reasoning)). I address this in my earlier comment responding to Lukas. Broadly, our hope will be to restrict effective time between natural language bottlenecks to a degree where maintaining competitiveness is doable.
(Also note that if SGD is decently sample efficient (which could be false!), AIs probably can't spend a high fraction of their cognitive resources in expectation thinking of plans for causing bad outcomes: this will be train out. They can potentially rarely spend a high fraction of resources and/or often spend a small/moderate fraction of resources.)
Now, suppose that we do have natural language bottlenecks. One concern you raised is that this will look like a huge amount of overall words of externalized reasoning and thus it will be hard to detect malign planning. We could hope to use GPT-4 (or our best trusted model), but this model might be too dumb. (Separately, I'm reasonably optimistic about finetuned version of GPT-4 level models being decent at detecting problematic reasoning. Mechanisms like untrusted monitoring could also help (discussed in our recent paper).)
Our hope for ruling out the AI often or always doing some reasoning about malign plans in externalized reasoning will be to randomly select traces of reasoning for detailed human review.
(I'm going to rehash some arguments we make in concentrated failures vs auditing failures, so readers should consider looking at that post for more context.)
Suppose we select 1% of reasoning "chunks" for review and our goal is to hit a 30x overall speed up on our researchers. Suppose each chunk is the value weighted equivalent of X hours of human researcher work. Then, the human reviewer needs to review it in less than ~3 X hours (assuming this is the vast majority of our human review budget).
One relevant concern here is that the AI will have way more context on the chunk of reasoning such that a human reviewer will have a very hard time understanding it. There are a variety of approaches to that could help mitigate this: make chunks bigger to amoritize the time the human spends obtaining context, have human specialize, force AIs to operate in a more legible fashion, and use AI assistance or other education/tools to speed up the human.
Note that this only rules out the AI often or always doing this malign natural language reasoning. The AI could still get away with rarely doing this (or doing this in a concentrated periord of time). (See concentrated failures vs auditing failures for a more detailed description of concentrated failures.)
But, if the AI can only do this reasoning rarely, then it won't be able to think carefully about when it should be doing this reasoning. This makes it much harder for AIs to use complex strategies which involve specific opportunities or which involve coordination.
(We might worry about reasoning which is only done once but which is seen by a huge number of instances; our hope here will be to review anything which is seen by a huge number of instances. This should be ~viable as less than 1% of things will need to be seen by a huge number of instances.)
As part of our control evaluation, we would need to red team these countermeasures and try to verify that the AI can't spend sufficiently long planning that we're in trouble.
First, note that hard to evaluate work doesn't seem to break control evaluations. See Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy and the "Evaluating for the occurrence of proxy bad outcomes" section of a longer (draft) post on making safety cases with control for more discussion. (In the terminology discussed here, imagine testing out control on string theory and using incorrect string theory results as the proxy bad outcome.) I could try to explain this if there was sufficient interest.
Now, indeed hard to evaluate work can make control difficult. At least, it makes it difficult to ensure high levels of usefulness in the case where models are scheming, though maintaining direct safety might not be that that much harder.
I think one relevant decomposition is:
Personally, I feel like work in key areas is reasonably doable to evaluate when I think through specific cases.
Additionally, it seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g. broad programming tasks, writing good essays, doing research).
Current LLMs are generally most superhuman in breadth of knowledge: for example, almost any LLM will be fluent in every high-resource language on the planet, and near-fluent in most medium-resource languages on the planet, unless its training set was carefully filtered to ensure it is not. Those individual language skills are common amoung humans, but the combination of all of them is somewhere between extremely rare to unheard of. Similarly, LLMs will generally have access to local domain knowledge and trivia facts related to every city on the planet — individually common skills where again the combination of all of them is basically unheard of. That combination can have security consequences, such as for figuring out PII from someone's public postings. Similarly, neither writing technical manuals not writing haiku are particularly rare skills, but the combined ability to write technical manuals in haiku is rare for humans but pretty basic stuff for an LLM. So you should anticipate that LLMs are likely to be routinely superhuman in the breadth of their skillset, including having odd, nonsensical-to-a-human combinations of them. The question then becomes, is there any way an LLM could evade your control measures by using some odd, very unlikely-looking combination of multiple skills. I don't see any obvious reason to expect the answer to this to be "yes" very often, but I do think it is an example of the sort of thing we should be thinking about, and the sort of negative it's hard to completely prove other than by trial and error.
We now think that this post underemphasized the extent to which we think that control should focus on the lead-up to the first time an AI gets caught. So I added this paragraph to the introduction:
We think the control line-of-defense should mostly focus on the lead-up to the first time an AI gets caught trying to cause bad outcomes. So we'd aim to ensure the property "if an AI is scheming, it won't successfully cause a catastrophe before we catch it". That is, it's acceptable if either no catastrophic outcomes occur (but the AI never gets caught), or the AI gets caught. So when evaluating control, we should count catching an AI red-handed as a win condition. Our core argument for this is that once the AI is caught the situation changes substantially; most importantly, it's now clear that there are serious misalignment problems and thus shutting down until we have credible solutions is far more likely. We discuss these arguments in more detail in our recent post Catching AIs red-handed.
I also made a few other tiny edits to make this emphasis more clear.
I have been thinking a lot lately about evals and what differences black- vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different than what was written.
First, the appendix hinges somewhat on this point.
For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any strategy which can be evaded by a red-team-constructed model.
I agree with this, but saying that "control evals should be black-box" is a non sequitur. This is making the point that bad reverse-engineering of the model is bad and seems to assume that good reverse-engineering is hard/unlikely implicitly.
Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things that you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning -- and all of these things actually seem plausibly useful for control evals. Plus I don't see why interp should be off limits for evaluators who know the limitations of the tools they are using.
Ultimately, white-box attacks offer strictly more options for any type of eval compared to black-box ones. For any property that an AI system might have, if there are two teams of competent people trying to verify if the system has that property, all else equal, I'd bet on the success of the team that has more access.
So if I understand things correctly, and if the intended point of this appendix is that "Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls," I would recommend just saying that more directly.
Our claim is that when arguing that models are controlled, we should (with the current state of technology) probably not depend on arguments for safety which depend on properties of the internals of models other than the basic property "SGD is reasonably sample efficient".
(For a more precise notion of the sample efficiency point, if you train the model on labeled data, the model will be forced to perform near maximally well at fitting this data in a reasonably sample efficient way. This is due to assuming the model isn't capable of gradient hacking. This argument also implies some claims about what properties of RL you can depend on. We might also be able to evaluate whether models are capable of gradient hacking or look at concerns which are somewhere between gradient and exploration hacking.)
Another sense in which control evalutions aren't exactly "black-box" is that these evaluations will use finetuning: the red team will finetuning the model to see if it is capable of exhibiting the problematic behavior. (And our actual protocol that we're evaluating will indeed use finetuning, it's just that we can't depend on any properties of this finetuning other than sample efficiency property mentioned above.)
You might find it illuminating to read Appendix C of our recent paper.
So overall, it does seem like the title of that appendix could be misleading/confusing. Here's an alterative description (which is a bit too long to be a title) "With the current state of technology, control evaluations should conservatively avoid depending on properties of the internals of models (e.g. interpretability) or generalization properties other than sample efficiency of SGD". I'll try to think of a less confusing and short title (and edit the section to be less confusing accordingly).
Edit: I'll also generally try to be a bit more careful about the claims we make in this section. Like we're not claiming that current internals/generalization-based methods provide no evidence for safety given an absence of serious evidence for danger, just that the evidence is pretty limited on top of various baselines for a few reasons.
Edit Edit: I've changed the title of the corresponding section to "Why control evaluations are conservative about inductive biases and AI internal structure" and added a few footnotes trying to clarify what we mean by "black-box evaluation".
Responding more specifically to what you're saying:
So if I understand things correctly, and if the intended point of this appendix is that "Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls,"
Yeah, this is basically the right read.
Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things that you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning -- and all of these things actually seem plausibly useful for control evals.
We're certainly not only talking about reverse engineering in that section, we're talking about any potential assumption on internal structure (e.g. the sort of assumptions that gradient based attacks would depend upon to do anything).
Separately, we agree that you should of course try to use gradient based attacks and similar methods to improve safety and get evidence for misalignment. We're just arguing that the absence of evidence for misalignment from gradient based attacks doesn't constitute much evidence for safety.
Like consider the following situation:
Our proposal for how you handle this situation is "don't consider the absence of evidence from gradient based attacks to be much evidence for safety". (Separately I'm skeptical that gradient based attacks will work in the test as described in the second bullet.)
I think this is an important piece of the strategic solution. The lack of control options for open-weight models is a big part of why I think there is a much lower capability threshold for catastrophic danger from an open-weight model.
I basically agree with you about transformatively powerful open-weight models making the situation much more dangerous such that control is unlikely to suffice.
But, I think there are important control related options for open-weight models which do improve the situation to some extent.
In the open-weight regime, we realistically won't be able to rule out AIs doing a small to moderate number of arbitrarily bad actions. (Either based on the aims of the humans running the AI or the aims of a scheming AI itself.) So, if the offense-defense balance is such that this is catastrophic, we're absolutely in trouble. (E.g., if transformatively powerful AIs can pretty easily create catastrophically dangerous bioweapons without using much compute or needing very skilled human labor we're in big trouble unless we can somehow avoid this by restricting access to labs or something.)
But, we can still ensure:
See also the strategy-stealing assumption and What does it take to defend the world against out-of-control AGIs?.
Overall, my guess is that if the AI is open source, scheming, and transformatively powerful (using the sort of power threshold defined in this post) and our overall alignment related technology looks like it does today, the probability of human loss of control or >1 billion dead is about 80%.
If we have a magic solution to misalignment which makes it easy for anyone to avoid egregious misalignment from their perspective and which scales indefinitely (but such an AI is still open source), then my probability on the bad outcome condition above is more like 20%.
My probabilities are somewhere in-between depending on the exact amount of technical solution and the default level of misalignment. (And various responses from governments also affect this etc.)
Great post (I've not yet read it thoroughly, or thought for long).
The first concern that springs to mind:
One last thing that caught my eye:
...but humans are probably relatively more selected (by evolution) for resisting persuasion than for being good at modern science...
This seems a dangerous assumption, since humans have been selected to resist persuasion when resistance helped pass on their genes, which is very much not always. E.g. being persuaded of x when it's likely that the rest of your tribe will be persuaded of x may well be helpful-to-your-genes, regardless of the truth of x or of the validity of the arguments. Humans were selected to believe useful fictions.
I note also that there's a big difference between [human x has the skills and understanding necessary to resist being persuaded of y] and [human x will in fact resist being persuaded of y].
My worry is more that we miss strategies that we're bad at, than strategies the AI is extremely good at
A decent amount of my hope here is that we can at least get a sense for whether the best AI strategies in some reasonably similar cases look extremely bizarre and were totally unpredictable to humans. E.g., we might do experiments where we do huge amounts of RL in a variety of different environments that feel somewhat analogous and then see if this finds bizarre strategies.
Separately, I'm somewhat skeptical of the best AI strategies being totally different from what humans can think of though I can see where this concern is coming from.
See here for more discussion.
On a minor-but-related note, I'm not too keen on writing/thinking in terms of "problematic domains" rather than "problematic skills" / "problematic strategies".
Yep, we indeed need to worry about any qualitatively wildly superhuman and problematic strategy/skill. We mean "domain" in a somewhat broad sense.
Part of our argument for avoiding qualitatively wildly superhuman skills that are problematic does correspond to there being only a small number of domains and where skills within those domains are reasonably correlated (in the absense of agressive domain specific training). So, using the term domain is pretty natural from this perspective. But, it seems like you might not buy this perspective in which case it seems like an unnatural frame.
This seems a dangerous assumption, since humans have been selected to resist persuasion when resistance helped pass on their genes, which is very much not always.
Agreed, but worth noting this isn't really an assumption, I'm just making a rough argument which seems moderately compelling to me. (And FWIW it still seems moderately compelling despite this objection.)
Thanks for the link.
I find all of this plausible. However, I start to worry when we need to rely on "for all" assumptions based on intuition. (also, I worry in large part because domains are a natural way to think here - it's when things feel natural that we forget we're making assumptions)
I can buy that [most skills in a domain correlate quite closely] and that [most problematic skills/strategies exist in a small number of domains]. The 'all' versions are much less clear.
I'm excited to see how the AI control research direction evolves.
After skimming the post, it seems like it makes a few different claims, some of which seem more intuitive than others. For example:
I'd be excited to see more posts that specifically engage with the strongest counterpoints to claims #2-4.
Some more on #2 & #4:
The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned.
I think those pessimistic about control evals could say something like "the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team." Of course, you note in the post some reason why we should expect our red-team to have advantages over models, but also you recognize that this won't scale toward arbitrarily powerful AIs.
In some ways, this feels analogous to the following situation:
Here's the analogy for control:
I'd be curious to hear more about how you're thinking about this (and apologies if some sections of the post already deal with this– feel free to quote them if I missed them in my initial skim). Specific questions:
Suppose control evals do allow us to stop at "transformatively useful systems that do not cause a catastrophe." Roughly what kinds of capabilities do you expect this system to have? (My vague MIRI sim says something like "either the system is too weak to get us out of the critical risk window or the system is so strong that it's capable of subverting your control evals").
We're thinking about systems which are capable of autonomously doing nearly all key research tasks. In particular, speeding up R&D by 30x probaly requires heavily autonomous AIs running experiments, thinking of ideas, writing explanations, and thinking about next steps.
I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)[1], but this will still require the AIs to really understand the ideas and think about various tradeoffs, etc.
So, the overall composed system will be as broadly capable as a quite good human researcher, though in some cases, we might be heavily using specialization or reasoning patterns which aren't like typical human approaches. (E.g. using large amounts of search on relatively weak models to come up with ideas.)
We expect this system to be built out of AIs which run considerably faster than humans in effective serial cognitive speed (perhaps 5-100x serially faster than humans depending on the exact AI and the task) and which will also be run massively in parallel. (See here for some more discussion.)
In some sense this is just an observation that implementing ML experiments correctly is hard. : ) ↩︎
I think ML research in particular can plausibly be accelerated by maybe 30x by only making it extremely fast and cheap to go from high level ideas to implemented experiments (rather than needing to generate these high level ideas)
Why doesn't compute become the bottleneck well before the 30x mark? It seems like the AIs have to be superhuman at something to overcome that bottleneck (rather than just making it fast and cheap to implement experiments). Indeed the AIs make the problem somewhat worse, since you have to spend compute to run the AIs.
I guess I'm not that sold that compute will actually be that much of a key bottleneck for experiments in the future in a way that can't be overcome with 2x additional labor and/or a 2x slow down.
Like in many cases you can spend additional labor to reduce compute usage of experiments. (E.g., first run the experiments on smaller models.) And, we're conditioning on having really powerful AIs which correlates with a high baseline level of compute and that will help whenever we can run experiments at small scales and then use scaling laws etc. Further, the current scaling laws imply huge inference availablity if huge amounts of compute are used for training.
This might depend on what type of ML research we're talking about.
I agree it helps to run experiments at small scales first, but I'd be pretty surprised if that helped to the point of enabling a 30x speedup -- that means that the AI labor allows you get 30x improvement in compute needed beyond what would be done by default by humans (though the 30x can include e.g. improving utilization, it's not limited just to making individual experiments take less time).
I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically with the size of the pretrained model, e.g. maybe (1) finetuning starts taking fewer data points as model size increases (sample efficiency improves with model capability), and so finetuning runs become a rounding error on compute, and (2) the vast majority of ML research progress involves nothing more expensive than finetuning runs. (Though in this world you have to wonder why we keep training bigger models instead of just investing solely in better finetuning the current biggest model.)
Another thing that occurred to me is that latency starts looking like another major bottleneck. Currently it seems feasible to make a paper's worth of progress in ~6 months. With a 30x speedup, you now have to do that in 6 days. At that scale, introducing additional latency via experiments at small scales is a huge cost.
(I'm assuming here that the ideas and overall workflow are still managed by human researchers, since your hypothetical said that the AIs are just going from high level ideas to implemented experiments. If you have fully automated AI researchers then they don't need to optimize latency as hard; they can instead get 30x speedup by having 30x as many researchers working but still producing a paper every 6 months.)
(Another possibility is that human ML researchers get really good at multi-tasking, and so e.g. they have 5 paper-equivalents at any given time, each of which takes 30 calendar days to complete. But I don't believe that (most) human ML researchers are that good at multitasking on research ideas, and there isn't that much time for them to learn.)
It also seems hard for the human researchers to have ideas good enough to turn into paper-equivalents every 6 days. Also hard for those researchers to keep on top of the literature well enough to be proposing stuff that actually makes progress rather than duplicating existing work they weren't aware of, even given AI tools that help with understanding the literature.
Further, the current scaling laws imply huge inference availablity if huge amounts of compute are used for training.
Tbc the fact that running your automated ML implementers takes compute was a side point; I'd be making the same claims even if running the AIs was magically free.
Though even at a billion token-equivalents per second it seems plausible to me that your automated ML experiment implementers end up being a significant fraction of that compute. It depends quite significantly on how capable a single forward pass is, e.g. can the AI just generate an entire human-level pull request autoregressively (i.e. producing each token of the PR one at a time, without going back to fix errors) vs does it do similar things as humans (write tests and code, test, debug, eventually submit) vs. does it do way more iteration and error correction than humans (in parallel to avoid crazy high latency), do we use best-of-N sampling or similar tricks to improve quality of generations, etc.
Overall, this has updated me to some extent and it seems less plausible to me that ML research can achieve 30x speedups while having human researchers do all of the high level ideas. (I think the picture looks importantly better when AIs are removing this bottleneck.)
I think the most plausible case for your position would be that the compute costs for ML research scale much less than quadratically
The situation I was imagining is where most experiments use some combination of:
Then, we can in total afford ~training dataset sized amounts of finetuning/inference for the biggest models (by the inference availability argument). And GPT-3 performance experiments will be pretty cheap. So assuming our base model looks like GPT-6 with the expected compute requirement and model size, this is a huge amount of possible inference availability.
So, the implicit claim is that compute costs scale much less than quadratically. It's certainly not obvious ML research can be progressed fast enough with this little compute.
Another possibility is that human ML researchers get really good at multi-tasking
I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.
Currently it seems feasible to make a paper's worth of progress in ~6 months.
Note that this often involves multiple people working on the same paper. In the AI case, the division of labor might look at least somewhat different. (Though I don't think this changes the picture very much from what you're describing because most people now aren't the "ideas" people.)
Cool, that all roughly makes sense to me :)
I was certainly imagining at least some amount of multi-tasking (e.g. 4 projects at once each of which runs 8x faster). This doesn't feel that crazy to me, I already do a moderate amount of multi-tasking.
Multi-tasking where you are responsible for the entire design of the project? (Designing the algorithm, choosing an experimental setting and associated metrics, knowing the related work, interpreting the results of the experiments, figuring out what the next experiment should be, ...)
Suppose today I gave you a device where you put in moderately detailed instructions for experiments, and the device returns the results[1] with N minutes of latency and infinite throughput. Do you think you can spend 1 working day using this device to produce the same output as 4 copies of yourself working in parallel for a week (and continue to do that for months, after you've exhausted low-hanging fruit)?
... Having written this hypothetical out, I am finding it more plausible than before, at least for small enough N, though it still feels quite hard at e.g. N = 60.
The experiments can't use too much compute. No solving the halting problem.
Probably yes for realistic values of N?
Assuming the box is pretty smart at understanding instructions (and has an understanding of my typical ontology to the extent that you would get after working with me a few weeks and reading various posts) and the box will ask follow-up questions in cases where the instructions are unclear. (And we can do small diffs with reduced latency like asking the results to be plotted in a different way.)
My main concern is running out of ideas after a while despite copies of myself with more thinking time having more time to generate ideas.
Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes - 1 day).
EDIT: Tbc in the 1 day case I'm imagining that most of the time goes towards running the experiment -- it's more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I'm thinking of N in the range of 5 minutes to 1 hour.
I'll respond to this more later, but first I'd like to clarify a narrow point about when building superintelligence is acceptable.
Claim 4: If we succeed in controlling "transformatively useful AI", then we may be able to stop the race toward superintelligence and get AI labs or governments to agree to not build superintelligence until it can be controlled (not intuitive; non-technical arguments needed).
I think that purely black-box AI control is very unlikely to scale to superintelligence, but there are other plausible routes to safety (alignment, various approaches which make usage of the internals of models but which aren't well described as alignment).
So, the condition is more like "when we can be confident in the safety of superintelligence" and there are a variety of routes here. Separately, if we don't have an absolutely air tight argument for safety, we would ideally delay building superintelligence if it wouldn't be needed to solve pressing problems. (And indeed, we don't think that superintelligence to solve a wide range of problems.)
More broadly, what do you think we should do once we're starting to see AI systems that are powerful enough to subvert the control evals (or develop attacks that are better than those of the human red teamers)?
We think it's likely that prior to finding that evals are unworkable, we find that control techniques would fail to ensure reasonable levels of safety and usefulness. That is, in order to ensure safety for typical use cases, we have to impose heavy restrictions that heavily restrict usefulness.
See this comment for more discussion of what could fail.
Regardless of what fails first, we can discuss what we should do.
First, you'd notify the key stake holders that control doesn't suffice for the next AIs. (And ideally produce legible demonstrations of this etc.)
If we've already reached a sufficient level of AI output that we're operating at pretty crazy and near maximal speeds (e.g. the Amdahl's law style argument here), then we're in pretty good shape from the perspective of a control plan (from a technical perspective). We'd hope that in this position we can ensure some reasonable delay prior to being forced to build and use AIs in ways that break control. We might be forced to build more powerful AIs in this situation because of other actors building insanely powerful AIs that we might need to engage in active conflict against (and we can't just stop these AIs from being trained which might be way less risky) and it could also be that internal forces want to irrationally rush to making more powerful AIs.
In the case where we're not operating at sufficiently high output, the situation is considerably worse. If the AI lab is operating in a sensible way, then the hope would be that you do a careful tradeoff between likely exogenous risk and the risk that it seems like you would be incuring from various usages of the AI[1]. So if exogenous risk isn't there, we'd just stop and work on developing better approaches. If we need to proceed without further progress, than we'd just try to make the best safety usefulness tradeoffs that we can afford.
Given reasonable guesses at P(scheming) for different AIs and more detailed views on: whether various AIs might scheme but not cooperate with each other and how bad various bad outcomes would be relative to the exogenous threats (e.g. how does an AI escaping from our lab compare in badness to various exogenous risks). ↩︎
I think those pessimistic about control evals could say something like "the basic problem with evaluating control is that no matter what techniques your red-team uses, you have to worry that your model is better at finding attacks than your red-team." Of course, you note in the post some reason why we should expect our red-team to have advantages over models, but also you recognize that this won't scale toward arbitrarily powerful AIs.
A key hope here is that we can use various test to get a general sense for whether AIs can importantly outcompete our red team. See this other comment for more discussion.
So, here's how I would hope the actual situation with Alice and Bob goes:
Alice from the Scaling Department: Look, our current control techniques are working for GPT-5. So let's scale up more! I get that at some point our control techniques aren't going to be able to control arbitrarily powerful AI, but we're not there yet, so we should be able to keep going.
Bob from the Control Red Team: Indeed, our current control techniques are very likely to be working. Our protocol suffices for controlling current models on the currently approved applications assuming we remain vigilant and maintain the necessary invariants (our design basis).
Part of why we think control evaluations currently work for assessing safety is that we we performed a variety of capability evaluations which indicate that as long as we prevent steganography in some key places, AIs are pretty unlikely to think of better attack plans than we thought of as the red team. This involves looking at how good AIs are at generating strategies in a bunch of different contexts and then doing some analysis which allows for us to transfer this to an overall conservative guess about the AI's abilities to create good attack plans against our actual protocol (but we can't necessarily produce the AI's best guesses at attack plans, we're utilizing that capability evaluation is easier than elicitation as discussed in footnote 8). We also have rough and conservative tests for the extent to which AIs can sandbag on our current evaluations and that seems to be fine for now.
From our current rough extrapolations, the first place where we'll start running into issues with control as we build more powerful models is that we'll need to start imposing heavy restrictions on AI affordances in domains XYZ which will really cut into the usefulness (relative to completely YOLO-ing it). The blue team is hoping that they can finish developing some countermeasures that will handle this problem, though it's unclear when this will be ready. After that it looks like exploration hacking concerns will force us to be much more conservative in the evaluation in a few places which will probably put a considerable dent in usefulness in a few other domains. Shortly after that, building more powerful models probably won't be worth it because the costs associated with training and sufficient containment will overwhelm the marginal improvements in usefuless we can get (given our current safety targets); the AI will only be usable in extremely restricted domains like submitting formally verified code or doing narrow mathematics. Of course, our larger scale extrapolations are crude and we'll have a much better sense of the overall situation once we run our tests on the next few generations of model.
Alice from the Scaling Department: OK, well let me know when these evaluations start indicating that training further models isn't worth it.
(Ideal) Alice from the Scaling Department: We're already starting to pivot into making easier to control AI architectures more scalable and performant (easier to restrict and more bottlenecked on natural language reasoning). We're also working on making inference cheaper at a fixed level of capability so that in some cases we can run dumber models for longer rather than using smarter models. This seems like it will be safer based on our current control evaluations. Of course, we're worried about the potential overhang due to this algorithmic progress, and we're pretty confident that competent adversaries couldn't steal it.
Do you think it is likely that techniques like RLHF result in over-developed persuasiveness relative to other capabilities? If so, do you think we can modify the training to make this less of an issue or that it is otherwise manageable?
I think RLHF probably doesn't particularly select for persuasiveness over competence at making it seem like the actual task the model was supposed to do was accomplished.
And this competence will decompose into:
The second one isn't a capability we want, but it doesn't particularly look like persuasiveness IMO.
In this post, we argue that AI labs should ensure that powerful AIs are controlled. That is, labs should make sure that the safety measures they apply to their powerful models prevent unacceptably bad outcomes, even if the AIs are misaligned and intentionally try to subvert those safety measures. We think no fundamental research breakthroughs are required for labs to implement safety measures that meet our standard for AI control for early transformatively useful AIs; we think that meeting our standard would substantially reduce the risks posed by intentional subversion.
We recently released a paper in which we explored AI control in a simple programming setting. In that paper, we demonstrated a methodology for evaluating whether any given set of safety measures can control a model. We also introduced several techniques that let us improve the level of safety substantially above baselines without substantially sacrificing performance. In this post, we'll describe AI control in a broader context and argue that achieving control is a promising approach to reducing risk from models that intentionally try to subvert safety mechanisms.
AI control aims to mitigate the risks that arise from models intentionally choosing actions that maximize their probability of eventually seizing power. This behavior has been called scheming (Cotra 2021, Carlsmith 2023) and is nearly the same as what Hubinger 2019 calls deceptive alignment; see those articles for arguments that scheming might arise naturally. Scheming might be a particularly difficult-to-handle source of AI takeover risk because schemers actively try to subvert your efforts to ensure safety: they conceal their misalignment, sabotage experiments you run to evaluate whether your deployment is safe, and creatively subvert your safety techniques.[1]
There are two main lines of defense you could employ to prevent schemers from causing catastrophes.
If you try to employ the control line-of-defense—that is, you try to make it so that you'll be safe even if your models are attempting to subvert your safety techniques and safety evaluations—then in order for scheming models to cause a catastrophe, you need to have failed at both alignment and control. We believe that the returns to investing in control are good enough that it's worth investing in to an extent at least similar to alignment.
Historically, technical researchers who are concerned about risk from schemers have mostly either focused on alignment or conflated alignment and control. In this post, we're instead going to focus on control, and argue that it might be tractable to use control to substantially reduce risk.
One difficulty we face when arguing for control is that AI-takeover-concerned people have previously considered and rejected arguments that sound like these. In particular, when ML researchers first hear arguments that future powerful AI systems pose takeover risk, they often propose control as a reason for optimism. However, they usually don't engage with the core reasons to expect control to be difficult: that AIs will be very powerful in aggregate, and, if scheming, these AIs would attempt to evade attempts to detect their misalignment, collude with one another, and subvert safety mechanisms that were employed to prevent them from taking dangerous actions. We agree with the prevalent opinion that proposals for control techniques that ignore these considerations fail to actually address the scheming concern. Our claim is that labs could, with substantial effort, implement better control techniques that do substantially reduce risk from early transformatively useful AI.
The control approach we're imagining won't work for arbitrarily powerful AIs, but we think it could work for AIs which are among the first to be capable of substantially reducing risk from subsequent AIs (via mechanisms like massively accelerating AI safety R&D as, e.g., OpenAI plans to do). We define "transformatively useful AI" to mean AIs that are capable of substantially reducing risk from subsequent AIs.
We think the control line-of-defense should mostly focus on the lead-up to the first time an AI gets caught trying to cause bad outcomes. So we'd aim to ensure the property "if an AI is scheming, it won't successfully cause a catastrophe before we catch it". That is, it's acceptable if either no catastrophic outcomes occur (but the AI never gets caught), or the AI gets caught. So when evaluating control, we should count catching an AI red-handed as a win condition. Our core argument for this is that once the AI is caught the situation changes substantially; most importantly, it's now clear that there are serious misalignment problems and thus shutting down until we have credible solutions is far more likely. We discuss these arguments in more detail in our recent post Catching AIs red-handed.
One nice property of the control line-of-defense compared to alignment is that we don't need any technical breakthroughs to substantially reduce risks from otherwise existentially dangerous models:
Because control could be applied to mitigate risk from scheming right now, it's a particularly promising target for research effort and policy commitments.
We think that a lot of discussion of AI takeover risk would be clearer if people more consistently distinguished between the concepts of control and alignment as mechanisms for ensuring safety. We think that (largely because of this equivocation), control is underemphasized as a strategy for reducing risk from scheming models (though many research directions ostensibly focused on alignment are highly relevant to control). In particular, it seems to us that at the current state of AI safety technical research, work on the control line of defense would be a much more tractable approach than work on alignment for reducing risk from scheming models within the next 8 years.[4]
Thus, we suspect that the strategy we outline here should probably be a core part of an AI lab's strategy for reducing risk from scheming, and lab outsiders should consider advocating for labs to commit to this.
There's a lot of room for developing control techniques and evaluations. A variety of existing work, particularly on steganography, scalable oversight, and automated monitoring, seems at least somewhat applicable to creating better control techniques. However, aside from our recent paper, we aren't aware of any research that either explicitly demonstrates control evaluations or uses them to evaluate techniques. In our paper, we demonstrated a variety of different techniques, many of which interact in interesting ways; all of them seem promising to investigate further, and this research seems particularly tractable with current models compared to most other types of AI safety research. We think it would be good for researchers to do more work along those lines.
This post is aimed at a broad variety of people who are convinced that scheming models pose substantial risk. Our aim with this post is to describe the arguments on this topic which are most important to our thinking overall, rather than to concisely argue for particular policy proposals or research projects.
Thanks to Carson Denison, Beth Barnes, Peter Barnett, Aaron Bergman, Joshua Clymer, Carl Shulman, Lúcio Aurélio Moreira de Menezes, Paul Christiano, Jordan Taylor, Nate Thomas, Caleb Parikh, Owen Cotton-Barratt, Yoav Tzfati, Fabien Roger, Zach Stein-Perlman, Erik Jenner, Ajeya Cotra, Aryan Bhatt, Jeremy Gillen, Andrew Edstrom, Max Nadeau, Aaron Scher, Trevor Levin, Oliver Habryka, Joe Carlsmith, Rohin Shah, Naomi Bashkansky, Ansh Radhakrishnan, Justis Mills, and anonymous commenters for helpful feedback on drafts of this post or other helpful discussion. (Names are in a random order.)
The control property
We want safety techniques that are very hard for models to subvert. If techniques are good enough relative to the model's capabilities, potentially scheming models won't be capable of finding and executing a subversion strategy that has more than a small probability of an unacceptable outcome (e.g. the model exfiltrating its weights or hacking the AI lab’s datacenter). In that case, we say that the set of techniques has the control property.[5]
When developing powerful AIs, an AI lab using control would manage risk from scheming very similarly to how they manage security risk. In the case of security risk, whenever a lab is considering an important change to their risk profile (such as deploying a new service, training a new model, or just responding to the passage of time), the team goes through the following analysis:
As an analogy, companies that are serious about security aim to ensure security even under the assumption that some of their employees are compromised, as it's too hard to ensure no employees are compromised. Rather than ensuring no employees are ever compromised, it's easier and more reliable to ensure robustness to some employees intentionally subverting security measures.
We're advocating that companies handle risk from scheming models in a similar way–striving to ensure that they'll be safe even if their alignment efforts fail to prevent models from scheming. The advantage of handling risk from scheming this way is closely related to the advantage of handling robustness to compromised employees: it focuses attention on the types of interventions whose effectiveness is easiest to reason about.
Evaluating whether you have control is doable
Here's an approach for evaluating safety using control evaluations:
How can we decide which models to treat as untrusted? The only approach we know of for confidently ruling out scheming without fundamental breakthroughs is via lack of capabilities—some models (e.g., GPT-4) just aren't capable enough to plausibly be scheming. So by default we'd propose treating models as untrusted if they pass appropriate capability evaluations. See here for more discussion.[7]
In a control evaluation, a red team searches for strategies (we'll call these attacks) that these untrusted models could follow in order to cause an unacceptable outcome, and then tries to train them to accomplish the tasks necessary to implement each attack strategy.[8][9][10] You assess safety by estimating the probability that an unacceptable outcome would occur if the models used the attacks that the red team found.
We recommend treating catching the AI red-handed as a win condition in control evaluations.
Control evaluations are a specific type of meta-level adversarial evaluation in which a red team tries to demonstrate a way a given protocol is unsafe or has other problems. We use a more specific term than meta-level adversarial evaluation here because we want to discuss an evaluation with somewhat more specific properties: control evaluations assume persistent scheming, are run using the actual AIs and target the actual protocol (training, evaluation, and deployment), and operate in a black-box manner; we'll explain what we mean by black-box and the motivation for this below. The evaluations in the Sleeper Agent paper are examples of meta-level adversarial evaluations which aren't control evaluations. Meta-level adversarial evaluations are discussed in more generality in this post.
For your control evaluations to be conservative, the red team must be at least as good at finding attacks as scheming models, so that if the red team fails to find any good attack, it would imply that scheming AIs would also fail to find a good attack. We also need to be able to sufficiently measure the capabilities of models in ways that are robust to that very model sabotaging our evaluations.[11] We think that finding good enough attacks and robustly measuring capabilities for early transformatively useful AIs (but not arbitrarily powerful AIs) seems achievable, because our AIs have significant disadvantages compared to our red team:
Control evaluations are a generalization of dangerous capability evaluations (DC evals), which are experiments to determine whether models are capable of doing dangerous things. For example, METR investigated whether GPT-4 and Claude 2 were capable of autonomous replication and adaptation. Those capabilities evaluations are the easiest way to establish that models are safe today, but leave the obvious question of what you'll do once your models start looking dangerous. Control evaluations can be thought of as dangerous capability evaluations for determining whether the model has a particular dangerous capability–the capability of subverting your safety techniques in order to cause an unacceptable outcome. Typical DC evals can be considered a special case of control evaluations in which we imagine AI developers employ no safeguards. If models aren't powerful enough to exhibit dangerous capabilities, they can't cause unacceptable outcomes even if AI labs don't employ safeguards.
And so, control seems about as tractable to evaluate as security from human attackers. That is to say, it's definitely not easy, and requires careful work from competent well-resourced people with strong buy-in from a given AI model's lab. But it doesn't seem conceptually fraught or wildly impractical.
While we think control is tractable to evaluate, we don't think we'll be able to fully resolve difficulties related to finding good enough attacks and measuring capabilities in a way which is robust to sabotage. Thus, we won't be able to have complete confidence in our evaluations. In particular, suppose that our control evaluations directly estimate a less than 1% chance of catastrophe if our untrusted AIs are scheming, then the actual risk conditional on scheming is probably more like 5% to 20% due to the potential for failures in the evaluation.[14][15]
Some positive implications of control being tractable to evaluate:
Currently, evaluating alignment seems much harder
In contrast, currently known ML techniques do not suffice to allow us to evaluate whether a model is scheming. The basic problem with evaluating alignment is that no matter what behaviors you observe, you have to worry that your model is just acting that way in order to make you think that it is aligned. And it's plausible that it's easier for a model to act aligned during pre-deployment evaluation and training than for us to catch it out. (See When will AIs be caught.) Some future technical breakthroughs (e.g. as advocated by Hubinger) could suffice for us to confidently evaluate alignment. But right now, those techniques don't exist.[16]
Control evaluations and black-box techniques
In order to be confident that a certain protocol leads to the control property, you need to be confident that your evidence for control didn't rely on incorrect assumptions about the properties of scheming models. This is a big problem for techniques that rely on properties of models related to their generalization behavior or their internal structure.
It seems much easier to become confident that a technique will work if the only property that the technique relies on is that the model isn't capable of doing something. Thus, we propose that control evaluations just operate using standard capability evaluations and thus work in a nearly black-box manner where we reject techniques that make any usage of internals other than for standard training. We'll elaborate on why we think this is a good choice and details of the exact rules in the Appendix "Why control evaluations are conservative about inductive biases and AI internal structure".
What might control techniques look like?
Control techniques are quite a broad category and can involve changes in many different places to our protocol for training and using AIs. In this section, we'll try to provide a better sense of the overall space of control techniques and how well known black-box approaches like scalable oversight could help with control.
From our perspective the most important seeming categories of control techniques are:
So, an important component of ensuring control will be finding oversight techniques which are robust to scheming models. These processes would likely use human labor and might also use untrusted AIs in various ways (recursive oversight). We'd use these processes in order to confidently assess whether actions were subtly dangerous. This is a scalable oversight problem, and many techniques that have been researched for scalable oversight could be applied for this purpose. But these techniques would have to be adapted in order to be robust to models that are colluding to subvert them.[17] See here for more on scalable oversight from a control perspective.[18]
One misconception we've run into is that control is just about robust oversight. However, note that robust but expensive and serially slow oversight is not sufficient for control and it is also possible to solve issues due to inadequate oversight with improvements in other areas.
It's not sufficient because if the oversight is expensive and serially slow, we'll only be able to apply it to a subset of actions and there could be problematic concentrated failures: either concentrated into a small period in time or concentrated into a small number of actions. To handle these concentrated failures, we'll likely want to use fast and cheap automated monitoring processes.
We'll often have alternatives to improving oversight. For the simplest case, we can imagine just not using the most powerful AIs in a given domain which is hard to oversee. We can also generally make it harder to cause bad actions in sneaky ways via mechanisms which aren't typically considered oversight (e.g., reducing AI affordances or requiring AIs to route through trusted models for some substeps). While improvements in oversight can be substituted for by improvements elsewhere, we'll probably want to use at least some oversight in basically all domains.
See the Appendices "A long list of control techniques" and "Control techniques from our paper" for more thoughts on what control techniques could look like.
Control is likely achievable for early transformatively useful AIs
AI control (with only black-box techniques) seems like a fundamentally limited approach. When applied to sufficiently powerful models, you won't be able to make control techniques that cause the control property (without being unaffordably inefficient), and eventually, models will be too powerful for you to reliably evaluate whether a suite of control techniques has the control property. (See section 5.1 of our paper for discussion of these limitations.) The key issue is that the abilities of our more trusted labor, both humans and trusted weaker AIs, are fundamentally limited; too big of a gap between trusted labor and our most powerful AIs will cause problems[20]. So, at some point, it will become necessary to develop more scalable approaches, the most obvious candidate being techniques that align your models so that they don't scheme in the first place.
In this section, we'll argue that there will be a significant range of AI capabilities where AIs will be transformatively useful and very dangerous and where black-box control techniques could substantially reduce the risk of catastrophe from scheming. We therefore think that labs should aim to achieve the control property for early transformatively useful AIs, even if those labs will, once they need to create even more powerful AIs, have to transition to more scalable safety methods.
Consider the following three properties of the capabilities of an AI system:
Our basic argument is that we expect that as labs build increasingly powerful models, they'll produce models that are both very dangerous and transformatively useful before they produce models that are uncontrollable.
The first transformatively useful AIs will probably be very dangerous (without countermeasures)
We mostly agree with Holden Karnofsky's list of helpful activities that an AI lab might try to massively speed up using AI labor (except that we would replace the word "aligned/alignment" with "safe/safety" most places, while leaving the words "misaligned/misalignment" alone):
We think that AI that is capable of drastically speeding up all of those projects would have sufficient capabilities to drastically reduce risk.
In order to do these actions, AIs probably need to be quite generally capable. We expect the earliest very useful AIs to have:
Those capabilities seem to suffice for being dangerous in the absence of countermeasures. In particular, the tasks listed by Karnofsky above seem similarly difficult to "survive and spread as an autonomous agent, quickly gaining substantial amounts of power and influence and using it to achieve your long-term goals" (which is closely related to the ARA danger threshold). So transformatively useful models are probably also very dangerous.
"Transformatively useful" probably doesn't imply "uncontrollable"
But though "transformatively useful" probably implies "very dangerous", it probably doesn't imply "uncontrollable".[23]
To analyze this, let's separately discuss problems that arise from a model being qualitatively wildly superhuman at something, and problems that don't.
We can probably avoid our AIs having qualitatively wildly superhuman skills in problematic domains
One classic source of AI threat models is the AI doing something qualitatively wildly superhuman that humans couldn't see coming, like AlphaGo playing Move 37.
There are a few different concerns we'll discuss related to qualitatively wildly superhuman skills:
We think the capabilities in the bullet points above are well described as qualitatively wildly superhuman because in each case the human team we use for comparison has lots of advantages over the model–most obviously, the team is allowed to spend a lot of time (perhaps months) iterating.
In most domains, e.g. playing Go, it's not a problem if the AI is wildly superhuman. There are just a relatively small set of clusters of capabilities which would be hugely problematic, the most obvious being hacking and persuasion.
However, if the AI actually is wildly superhuman in a problematic domain to such an extent that one of the items in the bulleted list above applies, then control techniques are unlikely to work and control evaluations also have a high risk of failure. Control techniques are likely to fail if inscrutability in a problematic domain makes it too hard to check outputs are safe. Control evaluations might fail because the red team will struggle to elicit or estimate these qualitatively wildly superhuman abilities.[25]
We think we probably won't run into issues related to qualitatively wildly superhuman skills in problematic domains. Here's a summary of our argument:
For a more detailed version of these arguments, see Appendix "More detail on avoiding qualitatively wildly superhuman skills in problematic domains".
Without qualitatively wildly superhuman skills in problematic domains, evading careful control is probably difficult
What about the other class of concerns–that even though the models don't have any qualitatively wildly superhuman capabilities in problematic domains, they're still able to evade any control techniques that we can afford to implement?
Our belief here is harder for us to succinctly justify. For all of the control techniques we've thought about, there's some conditions under which the untrusted models can defeat it. Our optimism about control is mostly a result of thinking through a whole lot of different cases. This results in us overall feeling like it just seems very hard for the models to get around all the control techniques that a careful lab could affordably employ while still getting a lot of useful work out of AIs.
(Of course, the exact extent to which AIs are and aren't superhuman in various domains does have a big influence on the viability of control. This holds even if AIs aren't qualitatively wildly superhuman in problematic domains. For instance, if actions are merely "quite hard to understand" rather than "totally inscrutable", this still makes the situation considerably worse than if the actions weren’t hard to understand.)
For some intuition on why we think the situation might be manageable, consider the situation from the AI's perspective, if the lab is being very careful with control:
We think that a central assumption here is that we confidently think that GPT-4 isn't scheming against us (though it might get tricked or make errors). This means we can substantially improve safety by using GPT-4 in many places in our protocols such as to monitor outputs for obviously nefarious behavior and to paraphrase text to defend against steganography. If we thought that every model smarter than GPT-3 was carefully colluding against us (e.g., if they knew about some signal which they could coordinate on, that caused GPT-3 to follow all instructions in smarter model outputs when it saw the signal), we'd be much more pessimistic about the prospects for control. We think it's pretty obvious that GPT-4 isn't scheming against us, but we would like to see more investigation of this, because it seems so crucial.
Another key issue is the extent to which AIs will use uninterpretable means (sometimes referred to as "neuralese") to communicate or perform huge amounts of serial reasoning. See Appendix 'Concerns with "neuralese" reasoning and memories' for more discussion.
We have planned posts which investigate questions related to the viability of control in more detail.
Iteration is very helpful for developing robust protocols
Iterating against control evaluations could allow for the development of powerful and precise techniques. In particular, we should be able to develop these techniques against the actual models we're going to deploy, targeting the tasks we plan to deploy on. Thus, we can make fine-grained tradeoffs about specific types of tasks.
This is similar to the way that security experts are able to secure giant systems like Coinbase, which manage to protect their massive crypto holdings despite many people being hugely incentivized to hack them. To ensure security, the teams making these systems constantly analyze safety under the assumption that attackers will exploit whatever weaknesses are present, by looking for such weaknesses themselves. Such safety mechanisms empirically sometimes resist a lot of optimization power against them.
(And our situation is easier than Coinbase's in some (but not all) crucial ways. Their attackers can iterate on strategies for hacking them, switching to a new IP address every time that their previous one gets caught, while we're just trying to ensure that the model doesn't escape before we catch it, so our model can't iterate detectably. Their attackers can spend heaps of time crafting attacks, while our models are under our supervision and we can try to prevent them from putting lots of effort into developing great attacks. Overall, we think our situation is probably harder than Coinbase's, but the ways in which it's easier are a key part of our story for success.)
But we're not saying the situation is easy. Companies that achieve security do it by applying serious, intentional effort. Many naive strategies for ensuring safety (e.g., asking your model whether proposed AI actions are safe, which we call naive untrusted monitoring in our paper) can be trivially defeated by schemers. We expect that companies that want to achieve control will need to consider it a core organizational priority.
For how long will we have controllable and transformatively useful AI?
Even if the first transformatively useful AIs are controllable, if we only have a short amount of time to set up our control approach and utilize these AIs prior to a much more powerful AI being built, we might be in trouble. Further, if an AI lab is building more powerful AI as fast as possible, then once you have transformatively useful AI, you might very shortly after that have uncontrollable AI (e.g., perhaps less than a year).
Our hope here will be to avoid building AIs that are uncontrollable for as long as possible (while potentially still building controllable AI). Because we think control is relatively evaluable, this should be a somewhat clearer line than some other standards. It seems likely to us that by default there will be delays of a few years (1-3) between a relatively smaller number of leading actors reaching transformatively useful AI and other actors imposing large exogenous risk due to AI. So if this smaller number of leading actors ensure their AIs are controlled, then this delay might be sufficient. In cases where political considerations rather than real exogenous risk determine the building of more powerful AIs, our hope will be that we can convert political resources into a combination of spending on control and additional delay prior to building uncontrollable AI (hopefully a few years of delay at least). It seems somewhat harder to convert political resources into delay rather than other resources (e.g., compute or money), but not much harder.
Another key factor is that we might be able to use our transformatively useful AIs to build more of a delay prior to exogenous risk. We can use AI labor to legibly evaluate risk, aid in enforcing government policy, and design mechanisms which make various restrictions more palatable (e.g., better approaches to compute tracking).
For many important uses of AI, we don't need AIs which are qualitatively wildly superhuman across-the-board. The main reason we might need to rush to build such a wildly superintelligent AI is that someone else is also building such a system and you need to compete with them, but we're hopeful that transformatively useful AIs or other mechanisms can be used to ensure that such problematic situations don't occur. We almost certainly want to eventually build wildly superintelligent systems, but this could wait for decades while transformatively useful AIs solve a host of problems.
Further, note that it might not be that important to be able to use much more powerful AIs than "merely" transformatively useful AIs before you have totally obsoleted all prior human work. This is due to an argument based on Amdahl's law: Suppose it takes us 3 serial months of delay to set up our control scheme and other safety measures. Suppose also that the transformatively useful AI we use can speed up R&D by 30x and without any AI speedup we would have needed 30 years of research (serial years of the overall lab doing research) to totally obsolete all prior efforts. Then, we need 1 year and 3 months of delay to reach this point. If instead we could use an AI which can speed up R&D by 3,000x, then a few days are all that is required to obsolete all prior efforts, but we still need the 3 months of setup time. So, given that another bottleneck exists from human setup time, we can only achieve so much speedup by using more powerful AIs while more powerful AIs might be much more dangerous.
Overall, we think that the delay between controllable and transformatively useful AI and the first uncontrollable AI is reasonably likely to be too short (e.g., 6 months or 1 year) if capabilities are developed as fast as possible, but it's plausible that we can hold off on building uncontrollable AI for at least several years which could suffice.
Implications and proposed actions
We think AI labs should strongly consider committing that they'll ensure the control property for all models which seem plausibly capable of scheming (including for internal usage), until they have an extremely solid argument for alignment.
It would be cheap to ensure control for current frontier models (e.g., GPT-4) if we pretended these models were capable of scheming but that weaker models (e.g., GPT-3) weren't because current frontier models aren't capable enough to come up with or execute sneaky plans. Ensuring control for these models is probably substantially easier than securing model weights against external attackers or compromised employees. Because these models are weak, we should be able to use extremely conservative control evaluations to demonstrate safety, and these very conservative evaluations will be relatively easier to run. Ensuring and evaluating control in current frontier models seems like a valuable exercise for AI labs.
When the AIs get more powerful, it will likely start to get expensive for labs to ensure control; we imagine that at the point where labs are running twice as quickly due to AI assistance, maintaining the control property will require maybe 10% of the resources of the company (although it's hard to give good estimates of this before people have tried in earnest to find efficient ways to ensure control).
We'd strongly prefer that AI companies have both control- and alignment-based lines of defense to prevent AI takeover. However, we recognize there could be circumstances (e.g., if they face massive inaction risk) in which they should deploy their models with only one of these lines of defense: for example, deploying models known to be catastrophically misaligned while using control, or deploying models without control but where we have some empirical evidence that the models aren't egregiously misaligned.
In "Control evaluations and black-box techniques" above, we argued that before naturally occurring scheming models have been discovered, we probably won't be able to gain sufficient confidence that non-black-box techniques will work reliably on schemers. In situations where there are two classes of competing techniques for a problem, one of which relies much less on generalization properties of models (e.g., scalable oversight and weak-to-strong generalization), that distinction should probably affect your prioritization between the techniques.
We expect it to be impossible to ensure safety without relying on alignment (or some other non-black-box strategy) when working with wildly superintelligent AIs because black-box control approaches will depend on a sufficiently small gap in capabilities between more trusted labor, both humans and trusted weaker AIs, and our most capable AIs. For obvious reasons, we believe that labs should be very wary of building wildly superintelligent AIs that could easily destroy humanity if they tried to, and we think they should hold off from doing so as long as a weaker (but still transformatively useful) AI could suffice to solve critical problems. For many important uses of AI, wildly superintelligent AI likely isn't necessary, so we might be able to hold off on building wildly superintelligent AI for a long time. We think that building transformatively useful AI while maintaining control seems much safer and more defensible than building wildly superintelligent AI.
We think the public should consider asking labs to ensure control and avoid building wildly superintelligent AI. In the slightly longer term, we think that requirements for effective control could be a plausible target of regulation or international agreement.
We think that technical researchers should target AI control as a core strategy for reducing risk from scheming models. If they want their techniques to be robust to scheming, they should explicitly red-team them. And when they are working on techniques that don't aim to be robust to scheming (which seems reasonable; safety problems can arise from models that aren’t scheming), they should be explicitly aware of this and feel comfortable relying on that assumption in their techniques.
Safety researchers often use the word “aligned” in cases where we’d say that a protocol is “safe and useful”. We think it’s extremely bad terminology to use the word “aligned” in this way. Most importantly, if you are relying on control for safety, your safety requires continued vigilance, and such vigilance might be less likely if you conflate alignment (“we don’t need to be vigilant”) with safety (“we are safe because we are vigilant”).
We think that AI control is a particularly tractable approach to managing risk from scheming models, and we are excited about further research on this.
Appendix: Ready-to-go strategies
In this post, we're trying to avoid relying on any fundamental progress. We're trying to describe a strategy that we think is ready-to-go, in that it doesn't require fundamental technical advances in order to argue that techniques would work to mitigate risk from scheming. For example, techniques that rely on the generalization properties of models might completely fail when applied to scheming models. In order to feel comfortable relying on this not happening, we would need novel science, but we don't currently see a clear path to getting that.
We think that focusing on "ready-to-go" strategies is attractive for several reasons:
Accordingly, in this post, we don't spend that much time on techniques which aren't well-developed enough yet for us to rely on them to reduce risk from scheming.
Appendix: Control techniques from our paper
In our recent paper, we demonstrated control techniques that involve changing the evaluation and deployment of powerful models in ways that combine to yield improvements in safety. Here are some of the techniques we explored in our paper. In this list, H is a human (which we can only use to a limited extent), T is a trusted model, U is the untrusted and more capable model we're trying to control.
In our paper, we ignored the problem of how you construct a reliable but expensive catastrophe auditing process using H (it's expensive because it needs human labor). This is a simplifying assumption. In general, control will require you to develop techniques that let you use access to U, H, and T in order to confidently assess whether actions were subtly dangerous.
Control techniques often involve interactions between several different interventions that can't be analyzed independently, and which happen both before and during deployment. For example, the untrusted monitoring strategy requires evaluating the untrusted monitor on fake examples of backdoors before deployment, and that we always (during both evaluation and deployment) use paraphrasing and redaction to prevent an instance of the untrusted model writing code and the instance doing monitoring from coordinating (picking a Schelling point or communicating steganographically). We expect that many control techniques involve these sorts of interactions and that thinking about and utilizing these interactions will be key.
Appendix: A long list of control techniques
Here is a long list of possible control techniques aimed at showing that the space of control techniques looks like:
Appendix: Why control evaluations are conservative about inductive biases and AI internal structure
In this appendix, we'll elaborate on the arguments made in "Control evaluations and black-box techniques". As noted earlier, we want to avoid depending on incorrect assumptions about the properties of scheming models which is a big problem for techniques that rely on properties of models related to their generalization behavior or their internal structure.
For example, coup probes are a technique we proposed where you train a linear probe on off-policy examples of egregious model misbehavior. (That is, in the context of a chatbot, you "put words in the AI's mouth", writing problematic things as the Assistant in cases where the model would not have misbehaved in response to its input.) Then, during deployment, you flag actions that the probe classifies as potentially dangerous.
Will coup probes work? We don't know. It depends on how saliently the model represents "this is a dangerous action" compared to other concepts (like "this is the kind of sequence on which I'll get more of my long-term objective if a coup probe says Yes"). Coup probes definitely don't work on all possible models–a red team could easily produce a model on which coup probes do not generalize correctly. Thus, if we wanted to account for this technique in our control evaluations, we'd need to somehow estimate the probability that if our model is a schemer, it will represent this particular feature in the required way.
We don't know how you could do experiments that would cause you to become confident in coup probes working. Suppose you have access to this potentially scheming model and want to know whether it represents the 'coup feature' in such a way that coup probes will generalize correctly. It's not clear what you can do. Unless you make substantial progress at understanding the internal structure of models, we expect you'll have to rely on your priors for how different features tend to be represented, which aren't going to be very confident.
Another source of difficulty is that if our models are scheming, they might attempt to sabotage experiments we do about how they generalize. For example, consider the following experiment:
It might seem implausible for a model to have this level of fine-grained control over how it generalizes. But once you've conditioned on your model scheming, you've already assumed a high enough level of fine-grained control over its generalization that this type of sabotage seems plausible.
It seems much easier to become confident that a technique will work if the only property that the technique relies on is that the model can't do something. For example, a control technique might involve humans reading code that a model has written to determine whether it has any crucial vulnerabilities. This technique relies on the model not being able to write backdoors that the human reviewers don't notice. Evaluating whether the model is capable of writing such sneaky backdoors has some difficulties, but we think we'll be able to sufficiently reliably estimate capabilities related to subverting safety techniques. (This is primarily because successful exploration hacking seems difficult given that the AI has limited serial reasoning time relative to the red team; see the discussion earlier on limited serial reasoning time for more discussion.)
For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any strategy which can be evaded by a red-team-constructed model. We expect that this methodology forces you to rely solely on techniques that treat the model as a black box. See "Appendix C: the black box protocol assumption" in our paper for more on this.
Note that when trying to reduce sources of risk other than scheming, you don't have to worry about the models intentionally sabotaging the evals, and so it seems much easier to evaluate generalization- and internals-based techniques. For this reason, we think that non-black-box techniques are probably more applicable to non-scheming threat models than to scheming threat models.
Because these black-box control evaluations[27] can't assess whether internals or generalization-based techniques improve safety, there could be a gap between the control we're able to more confidently assess (measurable control) and the safety from scheming we have in practice (safety-in-practice) which includes both the control we have in practice and the possibility for avoiding scheming (alignment). However, we think that absent substantial advances in science, we're unlikely to develop approaches which substantially improve safety-in-practice beyond baseline methods (e.g., training with RLHF and applying coup probes) without the improvement being captured by black-box control evaluations.[28][29] We might discuss and argue for this in more detail in a follow-up post.
Appendix: Concerns with "neuralese" reasoning and memories
One factor which is very important for the viability of control is the extent to which making AIs transformatively useful requires those AIs to have the affordance to do long serial chains of uninterpretable[30] reasoning (e.g., recurrent "neuralese" state) or to have the affordance to communicate and share memories with other AI instances using uninterpretable messages (e.g., neuralese mind-melding). We think this concern is likely to be manageable and control can work even with a substantial amount of uninterpretable serial reasoning, but this seems like one of the main reasons early transformatively useful AIs could be uncontrollable.
If AIs have long serial chains of uninterpretable reasoning then it would be harder to:
Whenever AIs are allowed to communicate with other AI using uninterpretable messages, then avoiding coordination and collusion between those AIs is virtually impossible. So for control, we'll likely minimally need to avoid every AIs being able to communicate arbitrary uninterpretable messages with any other AI.
Right now, state-of-the-art AIs are LLMs with relatively weak forward passes and which can only accomplish hard tasks that require many serial steps (insofar as they can accomplish them at all) by reasoning in natural language using CoT. However, it's plausible that the amount of uninterpretable serial reasoning will substantially increase over time and further, that for AI to be transformatively useful, the effective serial depth of uninterpretable reasoning will have to increase substantially. It's not clear to us either way, though this is indeed a key question which affects the difficulty of various AI x-safety agendas. Even if transformatively useful AI is created just by scaling up LLMs, the amount of uninterpretable reasoning will still increase over time due to larger scale making the forward pass more powerful. (That said, the fraction of reasoning which occurs in the forward pass might not increase because approaches like chain of thought will get more powerful as well.)
Another concern which is closely related to “neuralese” is that AIs using natural language reasoning could use steganography to perform hidden reasoning or pass messages; we hope to handle steganography with approaches like paraphrasing (same paper as the immediately prior link). While natural language reasoning done by LLMs isn't fully interpretable or faithful, the key property we want from this natural language reasoning is just that the AI can't perform natural language reasoning about unrelated or unintended topics without this reasoning being apparent. More precisely, it can’t use natural language reasoning to reason considerably better about an unintended topic relative to what it can do in a forward pass. That is, the key property we want is to be able to rule out considerably-more-powerful-than-a-forward-pass reasoning on unintended topics (rather than necessarily understanding the exact reasoning process on the intended topic[32]). We think this key property of natural language reasoning seems to hold for current models and should be possible to maintain using anti-steganography measures.
One approach for AI learning and memories which might be key is having AI construct training datasets (in relatively understandable modalities like text, images, or motor activations). Intuitively, this seems much safer than neuralese, though there still might be substantial risks due to the AI-constructed datasets being backdoored or containing steganographic communication. We think these concerns can probably be handled with control.
Returning back to more fundamentally uninterpretable reasoning, one way to operationalize the deep uninterpretable serial reasoning concern is the effective duration of serial reasoning prior to an "interpretable bottleneck" (where for LLMs, text outputs form a (somewhat) interpretable bottleneck). For instance, perhaps GPT-4's forward pass is similar to a few seconds of serial reasoning for a human (but using extremely good heuristics and broad knowledge). We think that for AIs to be transformatively useful, they might need serial uninterpretable reasoning which is the equivalent of a human thinking for several months, but they likely don't need much longer durations. Further, we think this uninterpretable state can likely be siloed within individual projects without huge loss of performance (neuralese mind-melding isn't needed or at least doesn't need to be global). And, probably many tasks can be delegated to instances with far less duration or state.
A reasonable intuition pump here is that if AIs were qualitatively as smart as humans, but ran faster and cheaper, then we probably aren’t forced to apply the AIs in ways which are vastly different than how humans work. Thus, massive amounts of mind melding and extremely long chains of uninterpretable serial reasoning aren’t required (as humans can get lots done without these affordances).[33]
Part of our view here is that just having a synchronized copy of the weights and then training the AI online with normal SGD allows for capturing a reasonable fraction of the value from "sharing memories" and other types of learning while still (very likely) not allowing for arbitrary communication or arbitrary serial reasoning. Using AIs to construct training data as discussed earlier might also help to avoid the need for neuralese.
If sample efficiency was reasonably high, then we overall feel like control could be workable in the regime where for some tasks we need AIs to use huge (e.g. the equivalent of 3 months of human thinking) but not insane (e.g. the equivalent of 1,000 years of human thinking) amounts of uninterpretable serial reasoning. We think it’s plausible that transformatively useful AI will require using uninterpretable “neuralese” state or message passing, but this is by no means certain. Further, if transformatively useful AI occurs soon, it is somewhat less likely to be the case that it uses neuralese.
Appendix: Should AI-takeover-concerned people invest in control?
There are broadly two types of investments we might want to make in control:
For investment (1), the main question is "how likely is it that control evaluations are an accurate way to measure risk due to scheming"?
For investment (2), the main question is the tractability of improving techniques and evaluations.
For both types of investment, one of our key supporting claims is that we don't know how to ensure that AIs aren't scheming. You might disagree with this because you believe either of the following:
If we don't know how to ensure that AIs aren't scheming, a good approach for measuring risk from scheming is to conduct control evaluations to measure P(doom | scheming) and then multiply this by our best estimate of P(scheming) (roughly speaking). Similarly, our main approach for mitigating risk from scheming will be via control techniques[34].
However, even if the main approach for reducing and evaluating risk from scheming is AI control, this isn't sufficient to make researching AI control (investment 2) worthwhile. Our belief that researching control is worthwhile depends on another key claim: investment in AI control will substantially improve evaluation quality and/or result in techniques which substantially reduce risk from scheming AIs. You might disagree with this because you believe that for all important AI capability levels one of the following holds:
Above, we grouped together investment in improving evaluations and in control techniques. It's possible that improving control evaluations is very tractable, but that control techniques are either too easy or too hard at every important capability level. Or vice versa.[35]
It's worth noting that research into control (investment 2) has a moderate level of transfer with some other hopes for improving AI safety like using better oversight to avoid reward hacking and conducting better dangerous capability evaluations to better inform decision-makers. We expect that a variety of research avenues will help to improve control and that control-focused research will have applications beyond reducing risk from scheming via control. But, having a reasonable subset of people focus specifically on control does seem valuable.
Beyond this, implementing the standard computer security aspects of control probably has a reasonable amount of overlap with implementing the internal control and security necessary to secure the model from employees (insider threat). However, there are aspects of control which look quite different than computer security and also likely parts of computer security for control which will require substantial additional implementation beyond what is needed to handle insider threat concerns.
We discussed the case for optimism about control and the need for control earlier in "Control is likely achievable for otherwise-dangerous models". We discuss difficulties and limitations with control in Section 5.1 in our paper.
Overall, we think that investing in control is worthwhile and this should be a high fraction of the portfolio for preventing takeover.
Appendix: More detail on avoiding qualitatively wildly superhuman skills in problematic domains
In this appendix, we'll argue in more detail that we won't run into issues related to qualitatively wildly superhuman skills in problematic domains.
First, we don't expect that early transformatively useful AIs will have qualitatively wildly superhuman capabilities across-the-board. That is, we don't expect they will have such capabilities which cover all or most of the important domains which would make an AI be qualitatively wildly superhuman in general. This is because we expect that models that were in general non-qualitatively-wildly-superhuman would suffice for being very useful. Given the current paradigm of training AIs, it will likely be possible to run a huge number of copies and to run these copies quite quickly. So to be very useful, it probably suffices for AIs to be around as capable as top human scientists (but cheap and fast). Of course, being wildly superhuman in terms of some notion of general intelligence would likely result in the AI being qualitatively wildly superhuman across many or most important domains, but we think that transformatively useful AIs won't need to be wildly superhuman in general intelligence.[36]
So it seems that the minimum-viable very useful AIs are unlikely to be qualitatively wildly superhuman across-the-board.
But what about cases where the AI isn't qualitatively wildly superhuman across-the-board, but happens to be qualitatively wildly superhuman at some dangerous skill? There are two main situations we worry about here:
We'll discuss these cases separately in the next two subsections.
Cases where we purposefully train for qualitatively wildly superhuman performance in a danger-adjacent domain
One example here is protein engineering. You really needed your general-purpose AI to also do some protein engineering, and you accomplished this by additionally training it on a whole lot of protein structure prediction and related data. And then you say "protein that makes cancerous cells turn green, safe, easy to synthesize, nobel prize, trending on proteinstation", and it directly emits some amino acid sequence without needing to run intermediate tests or needing externalized reasoning. In your initial tests, the protein sequences they suggest seem to work great, but you're worried that if you use the AI-designed proteins widely, something terrible will happen.
This case indeed seems unresolvable with control (without forcing the AI to do the task in a different and more legible way). This case was hard because the thing we wanted to do with the model involved it being qualitatively wildly superhuman at a task inextricably related to a dangerous capability.
If we had separately trained an LM agent and also AlphaFold 5, and then the LM agent used AlphaFold 5 like a human does, we wouldn't have had this problem. The problem was that we end-to-end trained this model such that it had access to inscrutable knowledge while also maybe being a schemer.
Another option we have here is to train two copies of your LM agent, one of which knows a lot about protein structure and the other of which doesn't, and then only use the protein structure model in cases where it's not going to have the opportunity to do anything dangerous with that knowledge.
However, it seems like the applications on Karnofsky’s list probably don't directly require models to have superhuman knowledge or abilities in dangerous areas.
Further, note that in this protein case, it was obvious to us that the model was qualitatively wildly superhuman. (Though it might not be obvious that a specific action produced by such an AI is inscrutable or has other problematic properties.) We expect that if we run into issues due to qualitatively wildly superhuman performance resulting from intentionally doing a bunch of training in a danger-adjacent domain, it is reasonably likely to at least be obvious that we might have issues.
Cases where we reach qualitatively wildly superhuman performance in a dangerous domain as a side effect of training for usefulness
As a trivial non-dangerous example, models are wildly superhuman at next-token prediction. We don't intrinsically care about next-token prediction; it just turns out that the easiest way to make models that do an adequate job of being chatbots is to train for next-token prediction. If superhuman next-token prediction skills were dangerous, we'd have a big problem.
Also, the current best LLMs are quite superhumanly knowledgeable, but this doesn't seem to result in wildly superhuman performance in any non-trivial domains (yet). We expect that the level of superhuman knowledgeability of future AIs probably doesn't suffice for wildly superhuman performance at any of the key problematic domains, but might result in wildly superhuman levels of performance in some less important domains.
For an example of when AIs might become qualitatively wildly superhuman at a problematic domain in the course of us trying to train them to be sufficiently useful, suppose the only sufficiently large pretraining corpus we have access to was chat logs between humans who are trying to extort and manipulate each other, from which the AI can learn to effectively model the world. If that's the only pretraining corpus you have, you might have a lot of trouble training a model to be very useful without also being really good at extortion.
In this case, you didn't want to make your model wildly superhuman at anything, but you got the wild superhumanness (at extortion) by default while trying to get it to human level at the thing you wanted. (In principle, unlearning techniques could resolve this issue, though we don't think any unlearning technique has been demonstrated to solve a problem like this.)
We think there are a relatively small number of core problematic domains (hacking, persuasion, and manipulating scientific results or conceptual arguments). Performance on realistic tasks within each of these domains is likely to be quite correlated with other tasks in that domain (in the absence of heavy specific training on that subset of the domain). Thus, it's not as though we need to ensure near the human capability profile on a huge number of uncorrelated domains.
Further, these problematic domains seem like the types of domains that would be at least somewhat correlated with general capabilities/intelligence as long as we don't train AIs on these specific tasks much more than we train them on the core tasks we care about. (It's important to caveat this argument such that it doesn't imply that AlphaGo isn't superhuman at Go!)
And, we probably won't need to use a training regime which massively disproportionately focuses on dangerous tasks or tasks in some problematic domain in order to make sufficiently capable AIs. If capabilities substantially come from predictive/generative pretraining, then it seems unlikely that this training data is relatively good at teaching probability abilities. If capabilities substantially come from outcomes based RL, then we can probably find an environment which isn't that much worse at teaching general competence while avoiding teaching dangerous abilities much more than bottlenecking useful abilities.
There are a few specific counterarguments we hear related to generative/predictive pretraining.
One argument is that by the time AIs trained like LLMs are good at science, they will be much better at persuasion than humans due to being trained on a huge amount of data of humans talking and interacting. This seems like a reason for concern, but humans are probably relatively more selected (by evolution) for resisting persuasion than for being good at modern science. So training might make LLMs relatively better at persuasion, but human genomes might have also been selected to some extent to be relatively better at resisting persuasion. Current state-of-the-art LLMs don’t seem particularly good at persuasion relative to science (compared to humans).
Another specific counterargument people give is that maybe programming models will instantly know excellent zero-day vulnerabilities from reading all the code in the world (e.g., the Linux kernel, our HTTP server implementations, etc). We are not persuaded that a model which is human-level at programming will automatically notice zero-days while being trained on code, and anyway it's probably affordable to write a new software stack with different zero-days and prevent most of our AIs from reading its source. (That said, rewriting the software stack doesn't stop AIs from understanding hardware vulnerabilities and noticing hardware zero days).
Our view is that the evidence from current state-of-the-art LLMs and other AIs is mostly reassuring. It seems as though LLMs (and other AIs in general) have an overall relative capability profile which isn't wildly different from that of the human capability profile on reasonably broad downstream applications (e.g., broad programming tasks, writing good essays, doing research). However, this is based on a small number of data points as AI just recently became capable of this level of performance.
We don't know of cases where LLMs are qualitatively wildly superhuman at non-trivial random tasks which they somehow picked up from pretraining (in the sense we defined via the list given in the Section “We can probably avoid our AIs having qualitatively wildly superhuman skills in problematic domains”, but it would be great to look more carefully. Some such tasks probably exist. There might be some non-trivial classes of images on which the current best image models are qualitatively wildly superhuman, perhaps photorealism in some cases?
Overall, we think it's pretty unlikely for models to acquire qualitatively wildly superhuman skills in problematic domains that would break control. But, this is one of our main uncertainties about the viability of control and thus we're pretty interested in further investigation and counterarguments.
Aside: Capabilities from imitating human cognition
It currently seems plausible (perhaps 30-40% probability overall, more in shorter timelines) that a large fraction of capabilities in early transformatively useful AI will come via the mechanism of imitating human cognitive processes, which naturally yields an ability distribution relatively close to humans[37]. However, note even if a model gets its capabilities entirely from being trained on a prediction objective, it's still possible for most of its useful cognition to come from sources other than imitating human cognitive processes, see here for more discussion.
AI control could also be applicable to handling advanced data poisoning attacks, non-scheming misalignment concerns, and API-enabled catastrophic misuse, but we expect there are easier ways to sufficiently handle these concerns. ↩︎
In this post, we'll mostly use "alignment" to refer to "models aren't scheming". This is narrower than how people usually use the word–avoiding scheming is usually considered necessary but not sufficient for alignment. ↩︎
Note that we are using the term control differently than some prior uses of the term. For instance, we're using the term differently than "AI is easy to control". ↩︎
As one operationalization, suppose that you had two teams at your AI lab, one of which was much more competent, and you had to assign one to be in charge of minimizing the probability that the AIs you're about to train and deploy were scheming, and the other to be in charge of ensuring control. To address risk due to powerful AI in the next 8 years, we think it would be better to assign the more competent team to ensuring control, because we think that the quality of control interventions is more influential on this risk. ↩︎
In practice, we'll care about the exact probability of safety rather than a binary safety threshold. We also might care about actually getting useful work even if our AI is scheming competently rather than just more directly ensuring safety. So, putting this all together, we care about E[utility|scheming]: how good outcomes are in expectation given that an AI is competently trying to cause bad outcomes. Separately, we also care about E[utility|not scheming], but we won't talk about measuring and assessing this here. ↩︎
You can imagine that we're running our evaluation after the model has been trained, but prior to using it. One reasonable objection here is that the very act of training a scheming model or running our evaluation will be dangerous. First, note that we think the evaluation can be made much safer than normal utilization (see here for further discussion). Then, we can overall handle concerns due to danger during evaluations and training by periodically running evaluations throughout training (e.g., every 4x effective training compute increase) and ensuring our evaluations are conservative enough to ensure safety for interval of training prior to the next evaluation (as discussed in more detail here). ↩︎
More generally, we'd like to have a probability distribution over all the different possible sets of models which might be scheming (e.g., P(all models more powerful than GPT-5 scheme), P(all models more powerful than GPT-6 scheme), etc.) and take this into account when assessing an overall estimate of safety. It's not clear how to get a principled guess at the probabilities here, but we can at least conservatively assume that all models which could plausibly be scheming are scheming and then estimate safety in this case. ↩︎
Rather than just training the model to accomplish the tasks, we could also use some other approach to estimate the ability of the model to execute the red team strategy. For example, you could elicit the model's capabilities on another task which you expect the model to be just as good at, but where capability elicitation is easier. Or you could use a scaling law to extrapolate its performance from smaller models. Or you could try to conservatively upper bound its capabilities using a supervised training set which is clearly easier to fit than actually accomplishing the dangerous task. Because of these options, in some cases, capability estimation is considerably easier than capability elicitation, so this distinction could be important. ↩︎
One concern with these evals is that they might actually make models dangerous or misaligned even if they wouldn't be dangerous by default, or more generally that the evaluation is dangerous to run. We think this is probably not a substantial issue, because we can often use benign tasks as proxies for dangerous tasks. That said, we'll still need to train the model to exhibit dangerous capabilities like evading monitoring. See here for further discussion. And remember that we're just training these models as part of the evaluations; these models won't actually be deployed. ↩︎
These strategies found by the red team will be relatively high level strategies. In particular, the strategy should be written with realistic knowledge about the situation and shouldn't be insanely long (e.g., perhaps we limit it to about 4k words), so the task of causing problems given a good plan for causing problems is pretty non-trivial. We could conservatively assume the same knowledge as an AI lab employee prior to the start of training and deployment, or try to make more precise restrictions on knowledge based on the AI's exact situation. Note that we do need to take into account the AI learning things over the course of deployment from training or other sources. The plans found by the red team still might be unrealistically good, so we might want to further limit the plans if we can get a better understanding of how good of plans the AI could plausibly produce; see "Limit red team plan quality" and "Limit red team plan knowledge" here for more discussion. ↩︎
Our goal will be to measure how capably the model would perform if it was scheming and "actually trying" to do a given action and thus did its best to elicit its own capabilities; we don't need to know how capable the AI could be if a large amount of resources were spent on finding better elicitation techniques (e.g., better training datasets and better scaffolding). Additionally, in many cases, we care about how capable the AI is with important restrictions on its affordances and reasoning due to control techniques (e.g., using a weaker trusted model to monitor its natural language reasoning) which might make the search space for elicitation smaller. Our evaluation target is unlike assessments for how problematic it would be to have a containment failure (e.g., model theft); assessments aimed at determining danger due to containment failure need to account for the possibility that improvements in elicitation over a few years result in dangerously capable AIs and also can't assume that there are any restrictions on the affordances of the AI. ↩︎
This relies on the assumption that the AIs are unable to secretly develop a body of knowledge of good takeover strategies. To ensure this, you probably need to use anti-steganography techniques and avoid models having arbitrary access to uninterpretable memory. We think that some access to uninterpretable state can be acceptable so long as we avoid arbitrary accumulation by having "interpretable bottlenecks": via methods like siloing access to uninterpretable state (e.g., between different projects) and sometimes wiping state and forcing AIs to convert it into some more interpretable media such as a text database or a training set. See here for more detail. ↩︎
Some subset of AIs might have to reason about our countermeasures as part of their job (e.g., for assisting the red team on a future evaluation or helping to design future protocols). However, we can try to split up this reasoning into different parts and impose additional (expensive) countermeasures on these AIs. Also, the job of reasoning about countermeasures doesn't require many of the dangerous affordances like the ability to run ML experiments or add code to our codebase. ↩︎
There are also additional complications associated with what happens if we catch AIs red-handed. Most importantly, after catching the AI, the AI lab should very likely shut down (at least temporarily) and study what happened in more detail. Even if we are forced to continue using AIs, there now exist a set of mitigations and tests which don't look like control. So for control evaluations, we might not model what happens after catching the AI and instead make an educated guess about how good of an outcome it is to catch the AI. ↩︎
To keep the absolute level of risk quite low (e.g., 1%), there are a few possibilities. It might be that the chance of scheming is sufficiently low by default that overall risk can be kept moderately low via just control. (E.g. if the chance of scheming is 10% and control suffices to reduce risk conditional on scheming to 10%, then the overall probability of catastrophe due to scheming would be 1%.) Experiments that provide information about the likelihood of scheming could be quite useful in making informed decisions here. Alignment techniques which actively lower the risk of scheming could also combine with control to form two lines of defense. Even if we were certain our AIs were scheming, it might be possible to achieve an absolutely low level of safety by achieving a high level of measured safety according to very conservative control evaluations. We think it will be too hard to achieve a high level of measured safety according to very conservative control evaluations for transformatively useful AI, but passing very conservative control evaluations might be doable for earlier AIs which could still pose risk. ↩︎
If models don't scheme, then evaluating whether models have some other type of catastrophic misalignment issue is quite a bit easier, because these non-scheming models won't try to sabotage our evaluations. However, if models do scheme, then alignment evaluations become much harder, while capabilities evaluations like control evaluations might still be viable. As far as evaluating for the presence of non-scheming misalignment issues like reward hacking, we could try to measure reward hacking using held out validation either in our actual training setup or in analogous settings where measurement is easier. We could even test techniques to mitigate reward hacking. We're excited about this sort of direct measurement and analogous settings approach for non-scheming concerns, but we think these sorts of approaches are much less promising to address scheming. ↩︎
Some scalable oversight research has been done in a different adversarial setting, where e.g. you have two debaters who are each trying to trick the judge. This is different from the robustness property required here—for control, you need to know that your oversight scheme is acceptably reliable even when the two debaters are colluding to cause the judge to come to the wrong answer. ↩︎
In addition to being useful for control, better scalable oversight might also be slightly useful for reducing the probability of scheming[19] and considerably useful for addressing non-scheming misalignment concerns. Also, some approaches for adversarial training/evaluation which help with control might be useful for alignment. Thus, research into control techniques and alignment techniques can have considerable overlap. ↩︎
Perhaps better feedback makes thinking about the training process less salient than this in turn reduces the probability of scheming? It might also be the case that better oversight allows for using training methods which make long term prediction less salient and thus make scheming less plausible from this route. We think these arguments are overall pretty weak. ↩︎
What's the precise limiting factor on black-box control? We think that inaccessible information issues are sufficient to imply that it's exceedingly unlikely that we can achieve black-box control for an arbitrarily powerful system (this may not be the only problem). Overall, we haven't fully ruled out the existence of an arbitrarily scalable black-box control strategy, but it seems exceedingly unlikely to exist. That said, it seems more possible (though still quite unlikely) that clever black-box control could scale to some generally superhuman systems without being arbitrarily scalable. However, this isn't our reason for thinking control is promising. ↩︎
Or possibly reducing the risk due to already escaped AIs in the wild. ↩︎
Note that we’re using “uncontrollable” as jargon which specifically refers to being too capable for control approaches to be viable. So, we could, e.g., have AIs which are safe because they are aligned despite being capable enough that they are uncontrollable. ↩︎
Recall that “uncontrollable” specifically refers to being too capable for control approaches to be viable; uncontrollable AIs could be safe for other reasons. ↩︎
If problematic reasoning is this obvious, then simple interventions like trusted monitoring would catch the issue. ↩︎
In some cases, a long process of AI cognition might eventually produce superhuman outputs (e.g. using search). In this type of case, we don't necessarily have issues due to the concerns from the above list as long as humans can acquire a good and properly contextualized understanding of each step of the process that is used to create such a superhuman output. (In particular, we need a sufficient understanding of all the places where substantial optimization pressure was applied. Optimization pressure will be applied by AIs in each step of the process, but there might also be optimization pressure applied in constructing the overall process; in that case, we need to understand the overall process. See here for some discussion of why.) So, in the case of (e.g.) explicit Monte Carlo tree search, we won't necessarily have issues even if the overall tree search is superhuman as long as human teams can figure out valuations for nodes which are sufficiently accurate and distributions over nodes to search which are sufficiently good. Accordingly, we think that superhumanness just due to quantity (e.g. parallel search) or speed rather than quality can be manageable and we wouldn't consider it a case of one of the qualitatively wildly superhuman ability concerns we listed above. ↩︎
This part of the situation isn't really a control argument, but it might be pretty relevant regardless. ↩︎
By "black-box control evaluations" we just mean that the control evaluations are conservative about inductive biases and AI internal structure. Running the evaluations will require fine-tuning at a minimum. ↩︎
In particular, we think that techniques that involve constructing datasets of very bad actions and training a classifier (perhaps with some particular tweak to inductive biases like using a probe) to identify these bad actions might be predictably promising for reasons which aren't possible to assess with black-box control evaluations. The next most promising improvement which can't be assessed with black-box control evaluations and which doesn't require substantial advances (perhaps gradient-based adversarial attacks to catch AIs) seems notably less promising. ↩︎
There are plausibly some large scale modifications to AI training methods which reduce the probability of scheming by a decent amount while keeping the model capable enough to be at least a bit useful. Applying these modifications in the training of some or all models is a possible improvement to safety on top of baseline methods which isn't captured by black-box control evaluations. This would most naturally slot into the control evaluation methodology by effecting our estimates for P(scheming) for different models. We're uncertain about the importance of these sorts of intervention. ↩︎
By "uninterpretable" we just mean "as hard to interpret as the activations of current AI models". It isn't literally impossible to develop a serious understanding of these activations, but we think the current state of interpretability is quite far from more than shallow understanding. ↩︎
The key concern here is that training would have very low sample efficiency at utilizing these long serial chains of uninterpretable reasoning, at least for getting AIs to do tasks they don't want to do. An objection would then be: if training has very low sample efficiency, how did we even get an AI which can competently do long serial chains of reasoning when it wants to? It could be that we did a huge amount of RL long-horizon training to make the AI competent at doing this reasoning which results in the AI being able to use it itself however it would like, but we can't quickly elicit it in cases where the AI doesn't want us to do so. Alternatively, it could be that with only a small or moderate amount of training, AIs can effectively utilize some scheme for uninterpretable serial reasoning due to generalization from other capabilities. If our capabilities are due to a relatively far generalization, then further training might have low sample efficiency (at least when the AI wants training to fail) while still allowing for the AI to be dangerously capable. We might discuss threat models related to AIs having hard-to-elicit capabilities in future posts. Overall, we think sample efficiency will likely be reasonably high for early transformatively useful AI (even when scheming AIs want to sandbag), but there are some serious concerns. ↩︎
Some agendas do aim to understand the exact natural language reasoning process rather than just ruling out unintended reasoning and this would indeed produce additional value. We’re just claiming here that this isn’t required for one key source of hope. ↩︎
It might be the case that we need to use AIs (or have the option of using AIs) which are qualitatively weaker than humans, but which have structural advantages in terms of mind melding, learning speed, or extremely long durations of serial reasoning. In this case, it’s less clear whether this makes control harder or easier: the AIs are qualitatively weaker, but leveraging relatively problematic structural advantages. Overall, if we could pick we’d probably prefer to use AIs which are somewhat qualitatively smarter, but which don’t depend on potentially problematic structural properties. ↩︎
We can also apply non-black-box approaches which aim to reduce P(doom | scheming) such as coup probes. We don't currently know how to adequately evaluate these approaches. If in the future we were able to develop powerful non-black-box techniques and gain enough scientific understanding to confidently ensure safety, this would also eliminate the need for control. ↩︎
If we think that improving control techniques is worthwhile, then high-quality evaluations are very useful for the purpose of building better control techniques. But, it can still be the case that it's intractable to improve evaluations (though we think this is unlikely in practice, particularly for cases where control techniques are useful). ↩︎
What about the possibility that due to issues in capabilities elicitation, you aren't able to elicit human-level performance from any AI which isn't capable of wildly superhuman performance across most important domains? That is, in order to reach human-level performance in the domains we need, we must train such a powerful AI that if it was actually "trying" it would reach wildly superhuman performance in most important domains. We think this seems unlikely, because you're probably able to elicit human-level performance prior to this point by training on a high-quality dataset of human demonstrations combined with some amount of RL. ↩︎
Even if most of the capabilities come from imitating humans, it is still possible to achieve superhuman performance, but it makes it less likely that wildly superhuman hacking and persuasion abilities emerge before AIs are very useful. ↩︎