In my ideal world those labs would have large “adversarial evaluation departments” that try extremely hard to find inputs (or random seeds, or “pseudo” inputs) where a powerful model attempts to deliberately cause harm, or do anything that even vaguely smells like causing harm or deliberately undermining safety measures, or trying to deceptively hide their capabilities, or etc. ... This won’t be enough on its own to be confident that models don’t do anything bad, and ideally this would be just one piece of a machine that created much more confidence, but it does make life meaningfully harder for a deceptively aligned model looking to cause trouble.

Our current world seems very far from this ideal world. As you know I have 10-year timelines. Do you think something like this ideal world may be realized by then? Do you think the EA community, perhaps the AI governance people, could bring about this world if we tried?

Reply

[-]paulfchristiano4y70

I think it's pretty realistic to have large-ish (say 20+ FTE at leading labs?) adversarial evaluation teams within 10 years, and much larger seems possible if it actually looks useful. Part of why it's unrealistic is just that this is a kind of random and specific story and it would more likely be mixed in a complicated way with other roles etc.

If AI is exciting as you are forecasting then it's pretty likely that labs are receptive to building those teams and hiring a lot of people, so the main question is whether safety-concerned people do a good enough job of scaling up those efforts, getting good at doing the work, recruiting and training more folks, and arguing for / modeling why this is useful and can easily fit into a rapidly-scaling AI lab. (10 years is also a relatively long lead time to get hired and settle in at a lab.)

I think the most likely reason this doesn't happen in the 10-year world is just that there's too many other appealing aspects of the ideal world and people who care about alignment will focus attention on making other ones happen (and some of them might just be much better ideas than this one). But if this was all we had to do I would feel extremely optimistic about making it happen.

I feel like this is mostly about technical work rather than AI governance.

Reply

[-]Daniel Kokotajlo4y30

Nice. I'm tentatively excited about this... are there any backfire risks? My impression was that the AI governance people didn't know what to push for because of massive strategic uncertainty. But this seems like a good candidate for something they can do that is pretty likely to be non-negative? Maybe the idea is that if we think more we'll find even better interventions and political capital should be conserved until then?

Reply

Moderation Log

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

55

Why I'm excited about Redwood Research's current project

55

Relevance to deceptive alignment

Redwood’s current project

Why the concrete steps towards this goal look interesting

Limitations of the analogy

Why it matters

What I want to happen in the world overall