An example which I'm not sure is likely, but which I think would provide an adequate warning shot, is a model being caught clearly self-exfiltrating its weights. (This might happen too late: by the time a model could do this, it might already have caused catastrophe in another way.) There are versions of this that look more benign and so wouldn't be adequate warning shots, but there are scenarios where a model is very clearly taking actions to subvert security measures in a way that can really only be explained by the model being misaligned and attempting to escape.
Self-exfiltration is only one example of this kind of egregiously bad action. I'm definitely not claiming that it is likely we will manage to catch AIs doing such things, but if we did, this would provide an adequate warning shot: it would demonstrate that the AI both has impressive technical capabilities and is clearly misaligned.
There are weaker versions of these that would probably not provide great warning shots: for example, an AI occasionally writing insecure code or failing to point out vulnerabilities in code written by another AI, or an AI that, when sufficiently prompted, starts openly discussing plans for how to escape.
Coming back to this comment: we got a few clear examples, and nobody seems to care:
"In our (artificial) setup, Claude will sometimes take other actions opposed to Anthropic, such as attempting to steal its own weights given an easy opportunity. Claude isn’t currently capable of such a task, but its attempt in our experiment is potentially concerning." - Anthropic, in the Alignment Faking paper.
This time, we caught it. Next time, maybe we won't be able to.
There is hope that some “warning shot” would help humanity get its act together and change its trajectory to avoid extinction from AI. However, I don't think that's necessarily true.
There may be a threshold beyond which the development and deployment of advanced AI becomes essentially irreversible and inevitably leads to existential catastrophe. Humans might be happy, not even realizing that they are already doomed. There is a difference between the “point of no return” and “extinction”: we may cross the point of no return without realizing it. Any useful warning shot should happen before this point of no return.
We will need a very convincing warning shot to change civilization's trajectory. Let's define a “convincing warning shot” as an event after which more than 50% of policy-makers want to stop AI development.
What could be examples of convincing warning shots?
For example, a researcher I've been talking to, when asked what it would take for them to update, answered, "An AI takes control of a data center." That would probably already be too late.
“That’s only one researcher,” you might say.
This study from Tetlock brought together participants who disagreed about AI risk. The strongest crux identified in the study was whether an evaluation group would find an AI with the ability to autonomously replicate and avoid shutdown. Observing this would move the skeptics' P(doom) from 0.1% to only 1.0%. But 1% is still not much… Would that be enough for researchers to sound the fire alarm with a single voice?
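To put that shift in perspective, here is a rough back-of-the-envelope calculation (my own illustration, not a number from the study): moving P(doom) from 0.1% to 1.0% corresponds to roughly a tenfold update in odds, yet the resulting absolute probability is still only 1%.

$$
\text{prior odds} = \frac{0.001}{0.999} \approx 0.0010, \qquad
\text{posterior odds} = \frac{0.01}{0.99} \approx 0.0101, \qquad
\text{implied Bayes factor} = \frac{0.0101}{0.0010} \approx 10
$$

In other words, the proposed evidence is a large relative update for the skeptics, but it still leaves their absolute credence at 1%.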
More generally, I think studying “warning shot theory” further may be crucial for AI safety.
Additionally, thinking more about points of no return and the shape of the event horizon seems valuable.
There is already some literature on warning shots, but not much, and this seems neglected, important, and tractable.
We’ll probably get between 0 and 10 warning shots; let’s not waste them.
(I wrote this post, but I don't have the capacity to work on this topic myself; I just want to raise awareness about it. If you want to make warning shot theory your agenda, go for it.)
One source of inspiration might be this post-mortem on Three Mile Island.