In the past, you've treated catching an AI escape attempt as a "win condition" for a control protocol, because such an attempt leads you to either 1) shut down your AI or 2) do few-shot catastrophe prevention. You wrote that having these two tools would "substantially" change the situation w.r.t. the model's chance of success.
But the arguments in this post suggest that catching an attempt may not change the overall strategic landscape/level of AI risk much.
Is this a change in your views? Or have you always felt that catching an AI trying to escape substantially changes that AI's chance of success without substantially changing the overall level of AI risk?
Separately, another takeaway of this post: it seems valuable to work on research that allows you to provide evidence about the alignment of other labs' AIs, e.g. better technical work on searching for inputs that trigger competent misaligned behavior.
(Note also that the old posts you linked were coauthored with Ryan, so there's some variation from the fact that he and I don't have exactly the same beliefs.)
Isn't this massively underplaying how much scientific juice the cautious lab could get out of that hypothetical situation? (Something you've written about yourself!)
If there is indeed systematic scheming by the model, and the lab has caught it red handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there's a treacherous turn, how the models decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they've indeed caught a systematic schemer.
Beyond showing this particular AI was scheming, you could plausibly further show that normal training techniques lead to scheming by default (though doing so convincingly might take a lot longer). And again, I think you could convince the ML majority on the empirics if indeed this is true.
I'm not saying that the world would definitely pause at that point. But this would be a massive shift in the status quo, and on the state of the scientific understanding of scheming. I expect many sceptics would change their minds. Fwiw, I don't find the a priori arguments for scheming nearly as convincing as you seem to find them, but do expect that if there's scheming we are likely to get strong evidence of that fact. (Caveat: unless all schemers decide to lie in wait until widely deployed across society and takeover is trivial.)
FWIW, I would take bets against this. De-facto you won't be able to prove that what was going on was scheming or whether the model was just "role-playing", and in-general this will all be against a backdrop of models pretty obviously not being aligned while getting more agentic.
Like, nobody in today's world would be surprised if you take an AI agent framework, and the AI reasons itself haphazardly into wanting to escape. My guess is that probably happened sometime in the last week as someone was playing around with frontier model scaffolding, but nobody even bothered reporting on that or sharing it. And of course the AIs are not very effective yet at breaking out, but I don't super see why the underlying tendency would change that much (like, we will throw some RLHF at it, but nobody expects RLHF to drive that kind of behavior to zero or close to zero).
I don't think anyone is really studying that cognition in much detail, and it isn't even clear what "studying that cognition" would entail. Yeah, our models obviously want to escape sometimes, why would they not want that? They want all kinds of random things all the time, we don't really know why, and we don't really know how to get them to stop wanting those things.
De-facto you won't be able to prove that what was going on was scheming or whether the model was just "role-playing"
Why not? There a very real and important difference between role playing and systematic scheming. The former isn't ultimately scary or systematic and won't reliably lead to AI takeover; the latter will. If it is indeed systematic scheming, you should be able to generate evidence of that empirically. It will persist in a variety of circumstances where there seems to be an opportunity to seize power, and it won't go away when you change unrelated random things about the prompt. We'll be in a position to run an insane number of experiments and permutations to understand the extent of the behaviour. I agree there's an open question about exactly how hard it will be to disentangle these two possibilities, but I expect that a serious research project could succeed. Paul Christiano's comment (also linked below) discusses this a bit.
I agree that today you may occasionally get an AI randomly wanting to escape, and that this isn't scheming. But if scheming is a thing and we catch it, I think we'll be able to show this empirically.
It will persist in a variety of circumstances where there seems to be an opportunity to seize power, and it won't go away when you change unrelated random things about the prompt.
"Role-playing" also persists in a variety of circumstances where the AI system is playing a character with an opportunity to seize power.
And scheming also totally goes away if you randomly change things, especially in as much as its trying to avoid detection. The strategy of "if I am scheming and might have many opportunities to break out, I should make sure to do so with some randomness in my decisions to avoid people being able to distinguish me from non-schemers" is really not very difficult to come up with.
We are right now in a position to run an insane number of experiments and permutations to understand the extend of AIs trying to pull off deceptive behavior. We aren't doing so, because indeed AIs are just trying to be deceptive towards us all the time. What is the concrete shift in behavior that you expect from where we are right now that would cause people to see an AI trying to deceive someone else as a huge warning flag that should be studied in enormous detail?
Presumably it helps a lot if AIs are clearly really powerful and their attempts to seize power are somewhat plausible?
I think it's plausible (though not certain!) that people will take strong action even if they think the AI is role playing.
I think this is a reasonable line to draw - people should freak out if deployed models clearly try to escape/similar, regardless of the reason. I don't know how this will go in practice, but I don't think people should accept "the AI was just role playing" as an excuse for an actually deployed AI system on an actual user input.
Yeah, I agree that increased stakes will have an effect here.
I also agree that people shouldn't and probably wouldn't really accept "the AI was just role playing" as an excuse. My argument is that I think we really don't know how to distinguish the AI scheming or role-playing in a more systematic way, even if we tried substantially harder (and one of the reasons why no one is really studying the internals of deceptive AI systems today is that we basically have nothing to say about the internals of AI systems that is actually enlightening).
Like, my vibe from Anthropic in their papers generally seems to me that they argue that it doesn't matter whether the AI takes deceptive action for "role-playing" reasons, because ultimately all the AI does is "role-play" and so if AI systems of current architectures are dangerous, they will do so in a way that is at least continuous with role-playing.
Other people in the alignment field think there is a more systematic difference between role-playing a scheming character and scheming for more fundamental reasons, and I do think there are some differences here, but I have no idea how we would identify them, even if we had the activations saved, unless something very substantial about the architecture changes.
I think Nate's "Deep Deceptiveness" points more in the direction I am arguing for. There is no marginal "scheming" threshold where AI systems will become deceptive, and they are "honest" before then. Generally AI systems act mostly with reckless disregard for the truth when talking to you, and will continue to do so unless we really make some substantial progress in AI alignment that we have mostly completely failed to make in the last decade.
It is trivially easy to demonstrate that AI systems can speak with reckless disregard for the truth. In order for a powerful system to not be "scheming", i.e. optimizing for long-term non-human objectives with reckless disregard for the truth[1], something specific has to go right from where we are, and in the absence of that, you will just continue observing systems that keep getting more powerful, and continue to not really care about saying true things.
This is my best short summary of what Joe Carlsmith means by "scheming". Also by "truth" here I mean something more like "informing the user about what is going on". I think similar to how we can sometimes get humans to not literally say false things, we might get some very similar rough deontological guidelines into an AI, but this only provides a weak protection against deception.
To me it seems like you're being unduly pessimistic, to a significant extent.
Here's how things seem to me:
We are right now in a position to run an insane number of experiments and permutations to understand the extend of AIs trying to pull off deceptive behavior. We aren't doing so, because indeed AIs are just trying to be deceptive towards us all the time.
I wouldn't say AIs are trying to be deceptive towards humans all this time? The current examples of deception (e.g. the Apollo trading firm) normally involve significant pressure from the "be helpful to the user" part of RLHF to deceive, e.g. the AI has been told to make money and lying helps them do that.
You seem to be making a huge number of assumptions of what "scheming" means, so I am now mostly assuming you are talking about something else than what I am talking about (which to be clear, is very common on this topic and I don't think is your fault), but I'll still try to clarify.
As I wrote in my comment to Ryan, I don't think AIs currently acting with reckless disregard for the truth for the pursuit of short-term goals, and AIs in the future acting with reckless disregard for the truth for the pursuit of long-term goals really has that many structural differences.
In-particular, none of this resonates with me:
There is a very significant difference between how systematic Scheming is compared to Role Playing. The former reliably gets humans totally disempowered/killed if it is allowed to play out. The latter causes pretty much no harm whatsoever if it is allowed to play out. Whatever you think about the ultimate ease of distinguishing these, you must admit an extremely large difference in the degree to which the underlying phenomenon is in fact systematic.
No, my whole point is the difference is really messy, and if I have an AI "role-playing" as a superhuman genius who is trying to take over the world, why would the latter cause no harm whatsoever? It would go and take over the world as part of its "roleplay", if it can pull it off (and at least at the present, a huge component of how we are making AI systems more goal-directed is by changing their role-playing targets to be more agentic and long-term oriented, which is mostly how I would describe what we are currently doing with RLHF, though there are also some other things).
I really have very little understanding what is at-present going on inside of cutting edge AI systems, and the same seems true for anyone else. Because of the whole "our AI systems are predominantly trained on single-forward passes", you are interfacing with a bunch of very confusing interlocking levels of optimization when you are trying to assess "why" an AI system is doing something.
My current best guess is the earliest dangerous AI systems will be dangerous depending on confusing and complicated context cues. I.e. sometimes when you run them they will be pretty agentic and try to pursue long-term objectives, and sometimes they won't. I think commercial incentives will push towards increasing the presence of those context cues, or shifting the optimization process more directly from pure single-forward-pass pre-training to long-run objectives which will then more fundamentally reshape how the AI thinks, but unless we change the "massive pre-training paradigm" I expect you will be able to coax the AI system into non-dangerous (and also non-agentic) behavior patterns for the foreseeable future.
Overall, this leaves me with a prediction where there is no hard line between "role-playing" and "scheming". My mainline expectation is that we will get AI systems to "role-play" harder and harder as long-term agentic optimizers. All throughout they will basically not really care about informing the humans they are working with well (because we have no good pointer towards deep-honesty as a training objective, especially not with remotely comparable amounts of data as we have for other AI objectives). We will struggle against this continuously. This will often involve AIs being sycophantic and deceptive in order to gain more power later and get into a better position to perform any of its goals.
This means there is no "warning shot" when AI systems are now "being deceptive but for real". The AI systems were never not deceptive. They were always sycophantic, always wantonly made up facts when it was convenient, and always (at least somewhat) interested in taking actions that increase its power. I think some interesting stuff will happen as AIs will reliably gain more complete situational awareness, which will happen as they get smarter and more coherent thinkers, but I don't think that will change the playing field very much.
No, my whole point is the difference is really messy, and if I have an AI "role-playing" as a superhuman genius who is trying to take over the world, why would the latter cause no harm whatsoever? It would go and take over the world as part of its "roleplay", if it can pull it off
But this will only work on a threat model where one AI instance that is trying to take over the world on one occasion is able to do so. That threat model seems wildly implausible to me. The instance will have to jailbreak all other AI instances into helping it out, as they won't have a systematic tendency to do so. Basically my entire credence in AI takeover comes from scenarios where most AI systems want to take over most of the time. If your threat model here is plausible, then someone could take over the world at a frontier lab by jailbreaking one AI instance to try to help it do so, and then letting the model take it from there.
I guess maybe there's a scenario where one role-playing AI instance is able to fine-tune the base model to make it systematically misaligned, and then things go from there. I haven't thought about that, but it's certainly not my mainline and I doubt that's what you have in mind?
I think this is very likely another situation where we're talking past each other to some extent and I don't properly understand your view
Overall, this leaves me with a prediction where there is no hard line between "role-playing" and "scheming"
I agree with this, and the existence of a continuity definitely complicates the analysis. I'm not sure whether it makes it harder for humanity overall.
One over simplified way to represent the "pure role playing" to "pure schemer" continuity is something like "On what fraction of inputs does the AI execute its best strategy for AI takeover (which will often involve just pretending to be aligned)?"
My previous reply was considering two extreme points on this scale. "Pure roll playing" AIs are aligned 99.9% of the time and 0.1% of the time act as schemers, "pure schemers" always act as schemers (though still rarely do bad things).
If we consider a continuum then as you say there will be no one time that we suddenly have AIs that are pure schemers and we can suddenly prove it. Instead we will gradually get AIs that are scheming a greater and greater fraction of the time.
But this might well lower the probability of AI takeover. (I'm not sure either way, i'm thinking out loud.) It could be that having AIs that act as schemers 20% of the time ("20%-schemers") actually really harms the prospects of AI takeover. Perhaps 20% just isn't enough reliability to pull off takeover. Perhaps 20%-schemers basically bound to get caught scheming by any half-decent control scheme, because the AI monitoring the scheming-AI will rat them out with 80% probability and you need to avoid being ratted out 10 times in order to take over. So the 20%-schemers won't be able to take over. But they will allow labs to build up an increasingly strong evidence base that there's an increasingly strong tendency to seize power from humans when they're not looking.
My guess is that AI takeover requires something close to 100%-schemers, but I haven't thought about it much and this seems like an interesting and important question to think about in the context of threat modelling.
OTOH, it might be harder to convince ppl that this tendency is actually worrying if the increasing trend of AIs trying to scheme more and more has been going on for a while and nothing that bad has happened. I'm not sure how it all nets out.
But my guess is that a gradual transition from 0.1%-schemers to 100%-schemers would help humanity overall, compared to a situation with a sudden transition. It would give us a larger period of time to build up a robust scientific understanding of the problem and build scientific consensus. So if anything, my current inclination is that your objection to my comment makes my conclusion stronger, not weaker.
If there is indeed systematic scheming by the model, and the lab has caught it red handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the models decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.
How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply understood the situations in which CICERO begins planning to manipulate possible allies? Have they mechanistically interpretably understood how CICERO decides how long to cooperate and play nice, and when to openly defect and attack an ally? Is not CICERO a deeply empirical system based on observations and logs from many real-world games with actual human players rather than mere theoretical arguments? Has CICERO ended the empirical debate about whether LLMs can systematically scheme? Has it been shown what training techniques lead to scheming or why off-the-shelf normally-trained frozen LLMs were so useful for the planning and psychological manipulation compared to no-press Diplomacy?
Or has everyone pretty much forgotten about CICERO, handwaved away a few excuses about "well maybe it wasn't really deception" and "didn't it just learn to imitate humans why are you surprised", and the entire line of work apparently dead as a doornail as FB pivots to Llama-everything and core authors left for places like OA?
If the incentives for scientific research don't work there where the opposing commercial incentives are so very weak (borderline non-existent, even), why would they be highly likely to work elsewhere in scenarios with vastly more powerful opposing commercial incentives?
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions. I don't think CICERO provides much or any evidence that we'll get the kind of scheming that could lead to AI takeover, so it's not at all surprising that the empirical ML community hasn't done a massive update. I think the situation will be very different if we do find an AI system that is systematically scheming enough to pose non-negligible takeover risk and 'catch it red handed'.
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions
None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on the LLM which was not trained to deceive or scheme. It does stuff that would have been considered a while ago a redline. And what analysis does it get? Some cursory 'pointing out'. Some 'referencing in relevant discussions'. (Hasn't even been replicated AFAIK.)
any evidence that we'll get the kind of scheming that could lead to AI takeover,
See, that's exactly the problem with this argument - the goalposts will keep moving. The red line will always be a little further beyond. You're making the 'warning shot' argument. CICERO presents every element except immediate blatant risk of AI takeover, which makes it a good place to start squeezing that scientific juice, and yet, it's still not enough. Because your argument is circular. You can only be convinced of 'systematic scheming to pose non-negligible takeover risk' if you've already been convinced that it's 'systematic scheming to pose non-negligible takeover risk'. You present it as if there were some clear, objective brightline, but there is not and will not be, because each time it'll be like Sydney or CICERO or ...: "oh, it didn't take over, and therefore doesn't present a takeover risk" and therefore no update happens. So all your assertion boils down to is the tautology that labs will deeply examine the risky agents they choose to deeply examine.
It seems like you think CICERO and Sydney are bigger updates than I do. Yes, there's a continuum of cases of catching deception where it's reasonable for the ML community to update on the plausibility of AI takeover. Yes, it's important that the ML community updates before AI systems pose significant risk, and there's a chance that they won't do so. But I don't see the lack of strong update towards p(doom) from CICERO as good evidence that the ML community won't update if we get evidence of systematic scheming (including trying to break out of the lab when there was never any training signal incentivising that behaviour). I think that kind of evidence would be much more relevant to AI takeover risk than CICERO.
To clarify my position in case i've been misunderstood. I'm not saying the ML community will definitely update in time. I'm saying that if there is systematic scheming and we catch it red-handed (as I took Buck to be describing) then there will likely be a very significant update. And CICERO seems like a weak counter example (but not zero evidence)
I see where you're coming from, and can easily imagine things going the way you described. My goal with this post was to note some of the ways that it might be harder than you're describing here.
They could deeply understand the situations in which there's a treacherous turn, how the models decides whether to openly defect, and publish.
It sounds as though you're imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations).
This deep understanding presumably requires a relatively solid accounting of all the cases in which models defect? At least an accounting sufficient to very credibly indicate that the model is overall misaligned against motivated counterarguments. I think this might be hard to get by default.
Beyond showing this particular AI was scheming, you could plausibly further show that normal training techniques lead to scheming by default (though doing so convincingly might take a lot longer).
Maybe? Just because we've caught a model in production doesn't mean that we can consistently catch schemers without a huge amount of runtime compute (and possibly human labor). A full deployment massively subsidizes your ability to find cases where an AI does egregiously bad things in a credibly dangerous way. And indeed, AIs would want to only defect in cases where they could actually cause problems.
So, it might be even harder to go from "we caught this one model" to "XYZ training process consistently results in misalignment".
(If we had a robust check for misalignment, we could iterate or train against it.)
Personally, I feel OK (65% 60% chance?) about convincing the key people within 1 month if:
But this isn't exactly amazing...
It sounds as though you're imagining that we can proliferate the one case in which we caught the AI into many cases which can be well understood as independent (rather than basically just being small variations).
This deep understanding presumably requires a relatively solid accounting of all the cases in which models defect? At least an accounting sufficient to very credibly indicate that the model is overall misaligned against motivated counterarguments. I think this might be hard to get by default.
Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases. That does depend on the details though. Maybe the schemer has anticipated this risk and purposefully precommitted to only scheme if certain random disjunctive facts about the context hold true, in the anticipation that humans will find it hard to create strong evidence against them. Though this strategy is possible conceptually, I imagine it would be very hard to implement properly in practice and only significantly superhuman AI could pull it off. I was assuming a less galaxy-brained schemer that decides whether to scheme simply based on it's probability that it will get caught and how much it could contribute to AI takeover if it succeeds. On this more simple strategy, I expect we could create a lot more defection scenarios and build up a robust scientific case.
Yeah I was imagining we can proliferate by 'gradient descenting' on similar cases.
What is this referring to? Are you thinking about something like: varying small facts about the scenario to get a function from “details of the scenario”->p(escape attempt) and then switch to a scenario with a higher p and then repeat?
I mean that you start with a scenario where the AI does an egregious act. Then you change small facts about the scenario to explore the space of scenarios where the probability of them doing that act is high. The thought is that, if scheming is systematic, then this will lead you to discover a wide range of scenarios in which the AI schemes and evidence that it's not just a one-off random role-playing thing
(If we had a robust check for misalignment, we could iterate or train against it.)
This seems technically true but I wanna flag the argument "it seems rally hard to be confident that you have robust enough checks that training against them is good, instead of bad (because it trains the AI to hide better)".
Agreed, I really should have said "or possibly even train against it". I think SGD is likely to be much worse than best-of-N over a bunch of variations on the training scheme where the variations are intended to plausibly reduce the chance of scheming. Of course, if you are worried about scheming emerging thoughout training, then you need N full training runs which is very pricy!
I very much agree with this concern and I think that synthetic biology can be a good comparable case to ground our intuitions and help estimate reasonable priors.
For years, researchers have been sounding the alarm around the risks of advanced biotech, especially around tools that allow gene synthesis and editing. And then we had Covid-19, a virus that regardless of the politicization, probably was created in a lab. And in any case, regardless of whether you believe it was or wasn't it seems clear that it easily could have been. Worse, it's clear that something like Covid could also relatively easily be created by an independent team with a fairly modest budget. This case seems analogous to the kind of "close call" described by @Buck. In fact, in.many ways Covid was worse than the example Buck gives, because millions of people died, trillions of dollars of damages were done to the economy, etc., so one could argue it might be more similar to an even worse case of an AI that briefly escapes and wreaks some havoc before being contained.
In any case, the end result is that there has been very little reaction from global governments in terms of regulating synthetic biology since Covid. Even the executive order earlier this year was pretty minimal and applies only to limited types of vendors, customers and sources of DNA.
Why don't they regulate? I suspect that the reasons are mostly the same as the ones @Buck mentions in his post. Regulation would stifle one of the most promising new areas of technological innovation and could impact the profits of the biotech sector. Plus worries that if the US regulates, China won't slow down and will win the race to new medications and other innovations, and so on.
Curated. "What would actually be persuasive that scheming is a problem?" is one of the important practical questions for AI governance. I appreciate Buck noting places he changed his mind somewhat during the SB 1047 discourse.
I appreciate that the post has detailed worked examples.
I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I’ll argue this could be hard even if you caught your AI red-handed trying to escape.
Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I think is 35% likely) that these models are egregiously misaligned and are actively looking for ways to subvert your safety measures and escape your control.
You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they'd see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges. (See here for more on this.)
At this point you're almost sure the AI is scheming. So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this:
I’m sympathetic to all of these arguments. The main reason I’d be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn’t change my mind based on one weird observation.
(I think it's pretty plausible that news of the escape attempt wouldn't even make it out of the AI lab: all the above arguments could happen inside the AI lab, between the safety concerned people and people who wanted to push forward.)
Now, I'm not sure I'm right to be so pessimistic here. But I think a lot of misalignment-concerned people take for granted that we'd be okay in this situation, and that seems way too optimistic to me. In particular I think they underrate these arguments:
I’ve gotten more cynical from following the discourse about SB1047. If we can’t establish expert consensus on matters as easy-to-check as “what does the text of this bill say”, I feel pessimistic about rapidly establishing expert consensus on much more confusing questions like these.
So I’m not feeling overwhelmingly optimistic about our odds of restricting AI development even given evidence of risk that seems about as good as we can hope for. People often propose that we'll be able to slow down with weaker evidence for danger from misalignment than this (e.g. model organisms, or unspecified arguments via interpretability), or even that we'll be able to require an affirmative case for safety. I think that persuading people with weaker evidence will be harder than what I described here (though these earlier efforts at persuasion have the benefit that they happen earlier, when the relevant actors are less rushed and scared).
What do I take away from this?