I strong-upvoted this post.
Here's a specific, zoomed-in version of this game proposed by Nate Soares:
like, we could imagine playing a game where i propose a way that it [the AI] diverges [from POUDA-avoidance] in deployment, and you counter by asserting that there's a situation in the training data where it had to have gotten whacked if it was that stupid, and i counter either by a more-sophisticated deployment-divergence or by naming either a shallower or a factually non-[Alice]like thing that it could have learned instead such that the divergence still occurs, and we go back and forth. and i win if you're forced into exotic and unlikely training data, and you win if i'm either forced into saying that it learned unnatural concepts, or if my divergences are pushed so far out that you can fit in a pivotal act before then.
This might be useful as a standalone exercise, or as a mini-game within the overall game of building and breaking an alignment proposal, which is itself a mini-game in the overall game of building and breaking success stories.
Meta: I really like ideas and concrete steps for how to practice the skill of thinking about something. I think there are at least three methods for learning how to think productively about a particular problem:
And I think the last option is often neglected (in all fields, not just AGI safety) because there's not a lot of written material on how to actually do it. Note that it is a different thing than the more general skill of learning to learn and meta-cognition - different kinds of technical problems can require learning different, domain-specific kinds of cognitive motions.
If you've absorbed enough of the Sequences and other rationality material through osmosis, you might be able to figure out the kind of cognitive motions you need, and how to practice and develop them on your own (or maybe you've done some of the exercises in the CFAR handbook and learned to generalize the lessons they try to teach).
But having someone more experienced write down the kind of cognitive motions you need, along with exercises for how to learn and practice them, can probably get more people up to speed much more quickly. I think posts like this are a great step in that direction.
Object-level tip for the breaker phase: thinking about how a literal human might break your alignment proposal can be a useful way to build intuitions and a security mindset. A lot of real alignment schemes involve doing something with human-ish level intelligence, and thinking about how an actual human would break or escape from something is often more natural, and less prone to veering into vague or magical thinking, than positing capabilities that a hypothetical super-intelligent AI system might have.
If you can't figure out how an actual human can break things, you can relax the constraint a bit by thinking about what a human with the ability to make 10 copies of themselves, think 10x as fast, write code with superhuman accuracy and speed, etc. could do instead.
Threat modelling is the term for this kind of thinking in the field of computer security.
Epistemic status: confident that the underlying idea is useful; less confident about the details, though they're straightforward enough that I expect they're mostly in the right direction.
TLDR: This post describes a pre-mortem-like exercise that I find useful for thinking about AGI risk. It is the only way I know of to train big-picture intuitions about what solution attempts are more or less promising and what the hard parts of the problem are. The (simple) idea is to iterate between constructing safety proposals ('builder step') and looking for critical flaws in a proposal ('breaker step').
Introduction
The way that scientists-in-training usually develop research taste is to smash their heads against reality until they have good intuitions about things like which methods tend to work, how to interpret experimental results, or when to trust their proof of a theorem. This important feedback loop is mostly absent in AGI safety research, since we study a technology that does not exist yet (AGI). As a result, it is hard to develop a good understanding of which avenues of research are most promising and what the hard bits of the problem even are.[1]
The best way I know of to approximate that feedback loop is an iterative exercise with two steps: 1) propose a solution to AGI safety, and 2) look for flaws in the proposal. The idea is simple, but most people don’t do it explicitly or don’t do it often enough.
Multiple rounds of this exercise tend to bring up details about one’s assumptions and predictions that would otherwise stay implicit or unnoticed. Writing down specific flaws of a specific proposal helps ground more general concepts like instrumental convergence or claims like ‘corrigibility is unnatural’. And after some time, the patterns in the flaws (the ‘hard bits’) become visible on their own.
I ran an earlier version of this exercise as a workshop (an important component is to discuss your ideas with others, so a workshop format is convenient). Here are the slides.
The exercise
The exercise consists of two phases:[2] a builder phase in which you write down a best guess / proposal for how we might avoid existential risk from AGI, and a breaker phase in which you dig into the details until you understand how the proposal fails.
Importantly, in the context of this exercise the only thing that counts is your own inside view, that is, your own understanding of the technical or political feasibility of the proposal. You might have thoughts like “There’s smart people who have thought about this much longer than I have, and they think X; why should I disagree?” Put that aside for now; the point is to develop your own views, and that works best when you don’t think too much about other people’s views except to inform your own thoughts.
Builder phase
Write down the proposal: a plausible story for how we might avoid human extinction or disempowerment due to AGI.[3] It doesn’t need to be very detailed yet; that comes in the breaker phase.
The proposal does not need to be purely technical; e.g. governance approaches are fair game.
Example builder phase (oracle AI): Instead of building an “agent AI” that acts in the world, we could build a system that just tries to make good predictions (an “oracle”). An oracle would be very useful and economically valuable while avoiding existential risk from AGI, because an oracle has no agency and thus no reason to act against us.
If you get stuck, i.e. you can’t come up with an AGI safety proposal:
(Don’t worry, this is a common problem).
→ Write down in broad outlines what you expect to happen if we develop AGI. If that inevitably ends badly, start with the breaker phase: describe a failure scenario, then try to find a fix.
→ Talk to an AGI optimist, if you can find one. If they have an idea that doesn’t seem to you like it has obvious flaws, start with that. Alternatively, look for written proposals like the OpenAI alignment plans.
Breaker phase
Make the proposal detailed and concrete. Try to find flaws. Adopt a security mindset / AI safety mindset.
Example breaker phase (oracle AI): Let’s say we go ahead and build an oracle AGI. What exactly are we planning to do with this oracle? If the runner-up AI lab builds an agentic AGI 6 months later, their AGI might cause a catastrophe even if we’re careful. It’s not enough for the idea to be safe; it needs to be useful for alignment somehow, or otherwise help us prevent disaster from a competitor AGI. The current proposal doesn’t say anything about how to do that, which is a critical flaw.[4]
If you get stuck, i.e. it seems like the proposal works:
→ Consider different kinds of ways the proposal might fail. A useful resource here is this very appropriately titled essay.
→ Write up your proposal and get others to critique it.
Iterate
If the proposal seemed promising to start with, it’s plausible that a single serious flaw will not be enough to wreck it beyond repair. If you can see a way to adapt the proposal to fix the flaw, go to step 1 and repeat.
Example fix (Oracle AI): So we need to adapt the proposal to make sure we can do something useful with the Oracle AI that prevents a less careful competitor lab from causing a disaster. Maybe an oracle can help us by evaluating our plans to convince other companies to not build AGI?
→ Adapted proposal (Oracle AI 2): Instead of building an “agent AI” that acts in the world, we could build a system that just tries to make good predictions (an “oracle”). An oracle would be very useful and economically valuable while avoiding existential risk from AGI, because an oracle has no agency and thus no reason to act against us. We train the oracle to be good at answering questions such as “will research program X have catastrophic consequences?” and at evaluating the consequences of actions such as “talk to person X to convince them they should stop research program X”. The oracle will warn us if another lab gets close to deploying a dangerous AGI, and if so it can tell us how to convince them to stop.
If you get stuck, i.e. it seems like the proposal is unfixable:
→ Talk to others about your idea, in particular if you know people who are optimistic about ideas similar to the proposal you’re working with. Send them your notes and ask for opinions.
→ If that fails: congratulations, you have completed the exercise! Start again from scratch with a new idea :)
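For readers who like to see a procedure written out, the builder/breaker loop above can be sketched roughly as follows. This is only an illustration of the control flow, not a tool from the post: the names `Proposal`, `builder_breaker`, `toy_breaker` and `toy_fixer` are hypothetical, and in practice the "breaker" and "fixer" are you (and the people you discuss with), not functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Proposal:
    """A safety proposal plus the flaws found so far (hypothetical structure)."""
    text: str
    flaws_found: List[str] = field(default_factory=list)


def builder_breaker(
    initial: str,
    breaker: Callable[[str], Optional[str]],
    fixer: Callable[[str, str], Optional[str]],
    max_rounds: int = 5,
) -> Tuple[Proposal, str]:
    """Alternate breaker and builder steps until one of them gets stuck."""
    proposal = Proposal(initial)
    for _ in range(max_rounds):
        # Breaker phase: make the proposal concrete and look for a flaw.
        flaw = breaker(proposal.text)
        if flaw is None:
            # Stuck in the breaker phase: the proposal seems to work.
            return proposal, "no flaw found: write it up and get others to critique it"
        proposal.flaws_found.append(flaw)
        # Builder phase again: try to adapt the proposal to fix the flaw.
        fixed = fixer(proposal.text, flaw)
        if fixed is None:
            # Stuck in the iterate step: the proposal seems unfixable.
            return proposal, "unfixable: start again with a new idea"
        proposal = Proposal(fixed, proposal.flaws_found)
    return proposal, "round limit reached"


# Toy stand-ins for human judgment, loosely following the oracle AI example.
def toy_breaker(text: str) -> Optional[str]:
    if "competitor" not in text:
        return "says nothing about a less careful competitor lab"
    return None


def toy_fixer(text: str, flaw: str) -> Optional[str]:
    return text + " The oracle also warns us if a competitor lab gets close to deploying AGI."


final, status = builder_breaker(
    "Build an oracle AI instead of an agent AI.", toy_breaker, toy_fixer
)
print(status)
print(final.flaws_found)
```

The point of the sketch is just that both "stuck" branches are exits from the loop that send you to other people or to a fresh idea, while everything else feeds back into another round.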
Details
Resources
Writing on AGI safety
If you decide to do this exercise, you’ll probably (depending on how much you’ve already read) find it useful to read other people’s thoughts on the topic. I’ve compiled some resources that you might find useful to read through for inspiration at various points in this exercise. The list is very incomplete - it’s just what I could come up with off the top of my head.
Breakers (criticisms of AGI Safety proposals & arguments for why safety is harder than one might otherwise think):
Builders (solution proposals):
Lists / collections of posts and papers:
Other writing on how to learn about / work in AGI safety
After I wrote this post I noticed that there’s already a post by Abram Demski that describes basically the same exercise, and later people pointed out to me that John Wentworth runs a similar exercise that is briefly described here. Both of those seem worth reading if you want more perspectives on the builder/breaker exercise, as is Paul Christiano’s post on his research methodology.
Neel Nanda has a good post on forming your own views in AGI safety.
The MIRI alignment research field guide covers some useful basics for doing research and discussion groups with others.
[1] Of course, AGI safety researchers do build research experience in adjacent fields like deep learning and maths, but there are intuitions and ways of thinking specific to AGI safety that one doesn’t typically inherit from other fields.
[2] I adopt the terms “builder / breaker” from the ELK report, though I may not be using the terms in exactly the same way.
[3] If helpful, you can choose a more concrete disaster scenario, such as “an autonomous human-level AGI breaks containment”.
[4] I'm somewhat dissatisfied with this example because the flaw is obvious enough that there's no need to go into much concrete detail. Usually you'd do more of that, e.g. if the plan is to use the oracle or 'tool-AI' to prevent a dangerous AGI from being built, how exactly might that work?