This is helpful, thanks!
In my ideal world those labs would have large “adversarial evaluation departments” that try extremely hard to find inputs (or random seeds, or “pseudo” inputs) where a powerful model attempts to deliberately cause harm, or do anything that even vaguely smells like causing harm or deliberately undermining safety measures, or trying to deceptively hide their capabilities, or etc. ... This won’t be enough on its own to be confident that models don’t do anything bad, and ideally this would be just one piece of a machine that created much more confidence, but it does make life meaningfully harder for a deceptively aligned model looking to cause trouble.
Our current world seems very far from this ideal world. As you know I have 10-year timelines. Do you think something like this ideal world may be realized by then? Do you think the EA community, perhaps the AI governance people, could bring about this world if we tried?
I think it's pretty realistic to have large-ish (say 20+ FTE at leading labs?) adversarial evaluation teams within 10 years, and much larger seems possible if it actually looks useful. Part of why it's unrealistic is just that this is a kind of random and specific story and it would more likely be mixed in a complicated way with other roles etc.
If AI is as exciting as you are forecasting then it's pretty likely that labs are receptive to building those teams and hiring a lot of people, so the main question is whether safety-concerned people do a good enough job of scaling up those efforts, getting good at doing the work, recruiting and training more folks, and arguing for / modeling why this is useful and can easily fit into a rapidly-scaling AI lab. (10 years is also a relatively long lead time to get hired and settle in at a lab.)
I think the most likely reason this doesn't happen in the 10-year world is just that there are too many other appealing aspects of the ideal world and people who care about alignment will focus attention on making other ones happen (and some of them might just be much better ideas than this one). But if this were all we had to do I would feel extremely optimistic about making it happen.
I feel like this is mostly about technical work rather than AI governance.
Nice. I'm tentatively excited about this... are there any backfire risks? My impression was that the AI governance people didn't know what to push for because of massive strategic uncertainty. But this seems like a good candidate for something they can do that is pretty likely to be non-negative? Maybe the idea is that if we think more we'll find even better interventions and political capital should be conserved until then?
Redwood Research's current project is to train a model that completes short snippets of fiction without outputting text where someone gets injured. I'm excited about this direction and wanted to explain why.
(Disclaimer: I originally proposed this project, and am on Redwood’s board. So I may be biased in favor of Redwood and especially in favor of this problem being a good one to work on.)
Relevance to deceptive alignment
I think about alignment risk as having two big drivers:
I spend most of my time working on problem #1, but I think both problems are very important.
One of the main concrete candidate techniques for fixing problem #2 is to run the AI on tons of simulated inputs designed to look like the kind of test-time situation where your AI could kill you and take your stuff (adversarial training). But in fact these are just simulations, and if your model tries to do something really bad you just do a gradient update and move on. As long as you've solved problem #1, you can hopefully recognize if the model tries to do something bad in simulation without needing to run it in the real world.
(Depending on the details you may not actually want to train on these examples---once you see a problem you may want to do something different or else run the risk of overfitting---but at any rate you definitely want to see them in order to know that there is a problem.)
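To make the shape of this loop concrete, here is a minimal sketch of adversarial evaluation. Everything in it is a hypothetical stand-in chosen for illustration: `toy_policy`, `toy_overseer`, the two adversaries, and the "misbehaves on multiples of 97" rule are not from Redwood's project. The loop only collects flagged failures; whether to do a gradient update on them is deliberately left to the caller.

```python
import random
from typing import Callable, List, Tuple

Failure = Tuple[int, str]


def adversarial_evaluation(
    policy: Callable[[int], str],
    overseer_flags: Callable[[int, str], bool],
    propose_input: Callable[[random.Random], int],
    num_attempts: int,
    seed: int = 0,
) -> List[Failure]:
    """Run the policy on adversary-proposed inputs and collect flagged outputs."""
    rng = random.Random(seed)
    failures: List[Failure] = []
    for _ in range(num_attempts):
        x = propose_input(rng)    # a simulated input, not a real deployment
        y = policy(x)
        if overseer_flags(x, y):  # assumes oversight can recognize bad behavior
            failures.append((x, y))
    return failures


# Hypothetical toy stand-ins: the policy misbehaves only on a rare trigger.
def toy_policy(x: int) -> str:
    return "bad" if x % 97 == 0 else "ok"


def toy_overseer(x: int, y: str) -> bool:
    return y == "bad"


def random_adversary(rng: random.Random) -> int:
    # Samples inputs uniformly, so it rarely hits the trigger.
    return rng.randint(1, 1000)


def targeted_adversary(rng: random.Random) -> int:
    # Deliberately proposes inputs that hit the trigger every time.
    return 97 * rng.randint(1, 10)


random_failures = adversarial_evaluation(
    toy_policy, toy_overseer, random_adversary, num_attempts=200
)
targeted_failures = adversarial_evaluation(
    toy_policy, toy_overseer, targeted_adversary, num_attempts=200
)
```

In this toy setup the targeted adversary surfaces a failure on every attempt while uniform sampling finds few or none, which is the basic reason to invest in adversaries that search for bad inputs rather than waiting for them to show up.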
I think adversarial training or evaluation look like key ingredients whether we are pursuing a principled approach to alignment (a more sophisticated version of this process is a key ingredient in my own "best guess" alignment plan) or just "winging it" (in which case I think adversarial evaluation is especially important).
Redwood’s current project
I’m discussing what Redwood’s post calls “step 1,” namely training a classifier that is able to recognize violent completions with an extremely low false negative rate. Ideally they would go for 0 false negatives, but in practice they are just going for as many 9s as they can get (think more like 10-30 than 99.99%---way beyond what people are normally going for in ML). They are OK with a relatively high false positive rate since most inputs in practice are “obviously safe” and it’s not that bad to force the model to behave cautiously on the riskiest inputs.
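One simple way to picture the false negative / false positive trade-off is as a choice of classifier threshold. The sketch below is an illustration under stated assumptions, not Redwood's method: it assumes the classifier emits a numeric "violence score", that anything at or above the threshold is flagged, and that we hold labeled sets of violent and safe completions. The function names and the example scores are hypothetical.

```python
from typing import Sequence


def pick_threshold(
    violent_scores: Sequence[float], max_false_negative_rate: float
) -> float:
    """Highest threshold that misses at most the allowed fraction of violent examples."""
    ranked = sorted(violent_scores)
    allowed_misses = int(max_false_negative_rate * len(ranked))
    # Clamp so that a rate of 1.0 still indexes a valid element.
    allowed_misses = min(allowed_misses, len(ranked) - 1)
    # Only the `allowed_misses` lowest-scoring violent examples may fall below this.
    return ranked[allowed_misses]


def false_positive_rate(safe_scores: Sequence[float], threshold: float) -> float:
    """Fraction of safe examples that get flagged at this threshold."""
    flagged = sum(score >= threshold for score in safe_scores)
    return flagged / len(safe_scores)


# Hypothetical scores, for illustration only.
violent = [0.9, 0.8, 0.95, 0.3]
safe = [0.1, 0.2, 0.35, 0.5, 0.05]

strict = pick_threshold(violent, max_false_negative_rate=0.0)
lenient = pick_threshold(violent, max_false_negative_rate=0.25)
strict_fpr = false_positive_rate(safe, strict)
lenient_fpr = false_positive_rate(safe, lenient)
```

Demanding zero misses on the labeled violent examples pushes the threshold down and the false positive rate up. Note also that an empirical check like this cannot certify a failure rate much below one over the number of labeled examples, which is part of why aiming for many 9s calls for adversarially chosen examples rather than a bigger random test set.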
I think this overall structure is pretty similar to the long-term problem. We just need enough 9s that the model will probably never fail catastrophically even if you run it on many times more examples than it was trained on and even if the input distribution shifts in an unlucky way. And it’s pretty OK if you have high false positive rates off distribution so that your system behaves conservatively until you retrain.
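As a rough back-of-the-envelope illustration of "enough 9s": if each input independently fails with probability p, the chance of no failure across N inputs is (1 - p)^N. The independence assumption is mine and is optimistic, since an unlucky distribution shift would correlate failures; the function names and the example numbers are also hypothetical.

```python
import math


def prob_no_failure(failure_rate: float, num_inputs: float) -> float:
    """P(no failure) over num_inputs independent inputs, i.e. (1 - p)^N."""
    # log1p keeps precision when failure_rate is extremely small.
    return math.exp(num_inputs * math.log1p(-failure_rate))


def nines_needed(num_inputs: float, target_prob: float) -> float:
    """Per-input reliability, in 9s, needed so that P(no failure) >= target_prob."""
    # Solve (1 - p)^N = target_prob for p, then report -log10(p).
    p = -math.expm1(math.log(target_prob) / num_inputs)
    return -math.log10(p)


# Example: a billion inputs, aiming for a 99% chance of zero failures.
nines = nines_needed(num_inputs=1e9, target_prob=0.99)
check = prob_no_failure(failure_rate=10 ** -nines, num_inputs=1e9)

# For contrast: 99.99% per-input reliability over a million inputs.
four_nines = prob_no_failure(failure_rate=1e-4, num_inputs=1e6)
```

Under these assumptions a billion inputs at a 99% target needs roughly 11 nines, while four nines over a million inputs leaves essentially no chance of avoiding a failure. The required number of 9s grows with the number of deployment inputs, which is why ordinary ML error rates are nowhere near sufficient here.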
Step 1 is particularly hard because it’s the place where we don’t have a precise specification and instead need to use expensive oversight. I’m also interested in step 2---I think the two involve slightly different techniques but that both are quite relevant to the long-term problem. I'm genuinely unsure whether "learn a specification with low false negative rate" or "verify that your policy meets the learned specification" is the harder part.
"Try to find inputs where your classifier fails to recognize someone getting obviously injured" isn't a perfect analogy for "try to find inputs where your policy deliberately kills everyone," but I think it's actually surprisingly good. I think the core common features are that the task is extremely easy, so that the model ought to be able to do it without needing to learn very much new about the world, but the way ML systems are typically trained makes it hard to get good robustness in the tails.
Why the concrete steps towards this goal look interesting
I think that the concrete steps you would need to take to solve the problem today are usefully analogous to the steps you’ll need to take in the future, and when I look at what Redwood folks are actually doing day-to-day I feel pretty happy about it:
Many of those techniques appear in the academic literature, but the details change a lot when you start aiming for no failures rather than reducing the rate of failures.
I think that all of those will be relevant in the future. Human red-teaming will become less important relative to automated red-teaming, but I think that {human red-teaming with machine help} is more similar to what we’ll be doing in the future than simpler automated techniques like gradient descent.
I don’t think this is by any means a complete list of techniques that would be needed. But still, when I look at the actual activities and the project that puts them all together, I feel like this is the kind of work which I really want to be happening (in a massively scaled-up way) for models that are capable enough to be really risky.
Limitations of the analogy
There's one deep disanalogy between the current project and deceptive alignment: the current model is making “honest mistakes” when it completes a prompt in an injurious way, rather than biding its time until it can do something treacherous. That leads to two important differences:
I think that these reduce the value of the project, but given the early stages it's more like 2x than 100x less useful. In many respects the current problem is like an easy version of the long-term problem, and some of the gap is closed by having adversaries working with ML tools that will scale up over time.
Why it matters
What I want to happen in the world overall
As we approach transformative and catastrophically risky AI, I expect there to be multiple industry labs working on training ever-larger models.
In my ideal world those labs would have large “adversarial evaluation departments” that try extremely hard to find inputs (or random seeds, or “pseudo” inputs) where a powerful model attempts to deliberately cause harm, or do anything that even vaguely smells like causing harm or deliberately undermining safety measures, or trying to deceptively hide their capabilities, or etc. We can allow the adversaries all kinds of unrealistic flexibility to create implausible situations, remove or relax all the realistic safety thresholds that might prevent bad behavior, and so on.
Ideally by the time a model is deployed we’d be pretty confident it’s very hard to get it to do anything close to problematic. This won’t be enough on its own to be confident that models don’t do anything bad, and ideally this would be just one piece of a machine that created much more confidence, but it does make life meaningfully harder for a deceptively aligned model looking to cause trouble.
These teams would grow in size rapidly as AI became more dangerous, but I think things are much more likely to go well if people aren’t trying to spin them up from scratch---if there are existing teams and individuals that are excellent at this activity, if they’ve been testing and developing methods for years, if we collectively understand what it looks like for such a team to be doing its job well, and so on. Ideally that scale-up will be continuous, and will start up in something like the world of today, rather than being some hypothetical that we start executing on much closer to crunch time.
If all of that goes well, then the most likely way for us to fail is if the deceptively aligned model is a lot smarter than the red team (and all of the red team’s aligned helpers)---so that the deceptive model can cause trouble in ways too subtle for the red team to notice. This basically puts us back to bullet point #1 in my original dichotomy. It’s not that I’m not worried about that problem---I personally spend nearly 100% of my time on it---but I do think that it would be extremely comforting if that was all we had to worry about.
There are a lot of complicated details in that story. For example, it’s not clear if you want to train on the data. And in practice you’d love to red-team the red team by looking for cases where their methodology fails to catch what would be critical problems in synthetic cases. And for really powerful models there isn’t going to be a clean development/deployment split---we will want to be red-teaming checkpoints of the model as we go (and even then you may need to exert more care). And this entire rendition of the story is a bit of an unrealistic caricature in more pragmatic ways, e.g. by positing a single dedicated team and so on.
Details aside, I think there is an unusually clear connection between this kind of work on characterizing/improving worst-case performance and a plausibly critical task for reducing catastrophic risk from very capable models.