The most straightforward way to produce evidence of a model’s deception is to find a situation where it changes what it’s doing based on the presence or absence of oversight. If we can find a clear situation where the model’s behavior changes substantially based on the extent to which its behavior is being overseen, that is clear evidence that the model was only pretending to be aligned for the purpose of deceiving the overseer.
This isn't clearly true to me, at least for one possible interpretation: "If the AI knows a human is present, the AI's behavior changes in an alignment-relevant way." I think there's a substantial chance this isn't what you had in mind, but I'll just lay out the following because I think it'll be useful anyways.
For example, I behave differently around my boss than alone at home, but I'm not pretending to be anything for the purpose of deceiving my boss. More generally, I think it'll be quite common for different contexts to activate different AI values. Perhaps the AI learned that if the overseer is watching, then it should complete the intended task (like cleaning up a room), because historically those decisions were reinforced by the overseer. This implies certain bad generalization properties, but not purposeful deception.
Yeah, that's a good point—I agree that the thing I said was a bit too strong. I do think there's a sense in which the models you're describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you're describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.
There must be some evidence that the initial appearance of alignment was due to the model actively trying to appear aligned only in the service of some ulterior goal.
"trying to appear aligned" seems imprecise to me - unless you mean to be more specific than the RFLO description. (see footnote 7: in general, there's no requirement for modelling of the base optimiser or oversight system; it's enough to understand the optimisation pressure and be uncertain whether it'll persist).
Are you thinking that it makes sense to agree to test for systems that are "trying to appear aligned", or would you want to include any system that is instrumentally acting such that it's unaltered by optimisation pressure?
This post is a follow-up to “AI coordination needs clear wins.” Thanks to Ethan Perez, Kate Woolverton, Richard Ngo, Anne le Roux, Sam McCandlish, Adam Jermyn, and Danny Hernandez for useful conversations, comments, and feedback.
In this post, I want to propose a clear, concrete coordination task that I think might be achievable soon given the current landscape, would generate a clear coordination win, and that I think would be highly useful in and of itself. Specifically:
I want DeepMind, OpenAI, and Anthropic to commit to actively monitor and look for evidence of deceptive alignment in their models—as well as run experiments to try to predict when and where deceptive alignment might occur before it does.
Notably, I am specifically referring only to the narrow case of deceptive alignment here, not just any situation where models say false things. Deceptive alignment is specifically a situation where the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal.[1]
I think this is a pretty minimal ask that would nevertheless be a clear win. Among all AI failure modes, deceptive alignment is one of the most unambiguously bad, which means, though I expect lots of disagreement on its likelihood, there should be little disagreement regarding its severity. Furthermore, things that I’m not asking for:
Nevertheless, as I stated, I think this would still be quite useful.
First, I think deceptive alignment is clearly one of the most concerning problems in AI alignment. Though they don’t always use the framing of deceptive alignment specifically, most presentations of AI risk feature the phenomenon of an AI that wants something in the world deceiving our oversight procedures to get that something. See: treacherous turns, schemers, influence-seeking behavior, power-seeking AI. Furthermore, I think there is a strong case for why deceptive alignment is reasonably likely to appear by default in advanced machine learning systems.
Second, I think that it is plausible that, if deceptive alignment is eventually going to be a problem, we’ll be able to see it and produce demonstrations of it before it becomes existentially dangerous. I’m less confident in this premise than the previous one, but my current sense is that the space of deceptively aligned models is large enough to include lots of deceptively aligned models that would be willing to display their deception in the presence of the right sort of honeypot. For example, you might be able to find deceptively aligned models that:[2]
Of course, just being able to detect deception doesn’t mean we’ll be able to solve it—and in fact in many cases it seems likely that our default ways of addressing deception would result in deception that just hides itself better—but I still think that being able to detect it before it becomes catastrophic would be a huge win, since it’d give us an opportunity to study the phenomenon directly in the laboratory. At the very least, having an example of the problem to study directly seems substantially better than having to get it right on the first try.
In particular, even though I’m skeptical that, once you have a deceptive model, you could plausibly train that deception away—rather than just making the model better at hiding its deception—knowing that a particular path to AGI leads to deceptive alignment gives us the opportunity to pivot to a substantially different path where the likelihood of deceptive alignment is as uncorrelated as possible relative to the previous path. For example, we could discover that RLHF (Reinforcement Learning from Human Feedback) schemes often yield deception in a way that e.g. language model pre-training + prompt-chaining doesn’t and switch over.
Furthermore, even if we can’t produce direct demonstrations of deceptive alignment prior to it becoming dangerous, I think we should at least be able to produce demonstrations of some of its precursors. Such deceptive precursor demonstrations could serve a similar role to a direct deception demonstration in helping us choose between different paths to AGI based on the presence of deceptive precursors in each path. Such precursor demonstrations might test questions like:
Though not directly evidence of deceptive alignment itself, deceptive alignment relies on the exact same sort of instrumental reasoning as would be required to exhibit the behaviors above. Thus, understanding when they appear could help us understand when deception is likely to appear as well—and thus looking for such precursors is a task that I would want any team looking for deceptive alignment to also be looking for as well.
Ideally, if every major lab had a team tasked with running experiments like the above, those results could be used to help guide us in terms of which paths to AGI are likely to be the least likely to yield deceptive alignment.
All of that being said, I do think that there are some pretty big potential downsides to pushing for a commitment like the above. In particular:
Nevertheless, I think there are a number of things that we can do to try to mitigate the above risks—specifically:
Particularly tricky is the third condition here that actually requires us to hash out right now as good an understanding as we can of what a demonstration of deceptive alignment might look like—and what sort of experiments we might be able to use to look for one. To that end, I’ll try to at least lay out some baseline criteria and scenarios for such a demonstration.
To start with, we have to establish what we’re looking for in terms of what would convince us that a model is deceptively aligned. Primarily, I think we’re looking for two things:
The key question, then, is what mechanisms there might be for producing such evidence. I think such mechanisms are mostly going to fall into two broad categories:
In my opinion, I think that (2) here—uncovering deception in advance—would be ideal, but seems to require relatively advanced interpretability tools that we might not ever actually get.[3] Though I think we should certainly be doing everything we can to work our way up to interpretability tools that are capable of something like that, it’s also worth planning for what to do if we don’t get there.
In the situation where the model in question takes the form of a large language model (or a large language model + a prompt), one strategy that might help substitute for interpretability tools is chain-of-thought-style techniques. For example, from “Conditioning Generative Models:”
To try to get a bit more concrete, here is a very incomplete list of some example demonstrations of deceptive alignment, ordered approximately from most obviously dangerous to least obviously dangerous:
Trying to do an experiment right now to get any of the demonstrations on the above list—with the exception of the last one—is almost certain to be negative right now. As I stated previously, however, I think there are a lot of very related experiments that someone working on the above would want to perform alongside testing for deceptive alignment directly that would be more likely to yield positive results in the short term, specifically those looking for deceptive precursors rather than deception directly. Furthermore, in my opinion, running lots of negative experiments is what safety is supposed to look like: aggressively testing for failure modes even if you think they’re probably not present.
Training signal here need not be the literal loss (though it could be)—it's just anything that we're trying to get the model to do. ↩︎
These sorts of deceptively aligned models become substantially more likely to be discovered if we do gain-of-function-style research to actively try to produce them, though such research also comes with its own obvious downsides such that it’s quite unclear whether it’s actually worth doing. ↩︎
Furthermore, trying to uncover deception in advance via interpretability tools could fail simply because there is no sense that a deceptively aligned model has to actively be thinking about its deception. A model that has never seen a situation where there is an opportunity to seize power need not be carefully planning out what it would do in such a situation any more than a factory cleaning robot need be planning for what to do if it someday found itself in a jungle instead of a factory. Nevertheless, the fact that the model has not been previously planning to seize power doesn’t imply that it wouldn’t if given the opportunity. In particular, a model could be deceptively aligned simply because it reasons that, in situations where there is a clear overseer, doing what it wants is a good general strategy for gaining power and influence in the world—without needing any explicit plans for later deception. ↩︎