A Playbook for AI Risk Reduction (focused on misaligned AI)

HoldenKarnofsky

I sometimes hear people asking: “What is the plan for avoiding a catastrophe from misaligned AI?”

This post gives my working answer to that question - sort of. Rather than a plan, I tend to think of a playbook.¹

A plan connotes something like: “By default, we ~definitely fail. To succeed, we need to hit multiple non-default goals.” If you want to start a company, you need a plan: doing nothing will definitely not result in starting a company, and there are multiple identifiable things you need to do to pull it off.
I don’t think that’s the situation with AI risk.
- As I argued before, I think we have a nontrivial chance of avoiding AI takeover even in a “minimal-dignity” future - say, assuming essentially no growth from here in the size or influence of the communities and research fields focused specifically on existential risk from misaligned AI, and no highly surprising research or other insights from these communities/fields either. (This statement is not meant to make anyone relax! A nontrivial chance of survival is obviously not good enough.)
- I think there are a number of things we can do that further improve the odds. My favorite interventions are such that some success with them helps a little, and a lot of success helps a lot, and they can help even if other interventions are badly neglected. I’ll list and discuss these interventions below.
- So instead of a “plan” I tend to think about a “playbook”: a set of plays, each of which might be useful. We can try a bunch of them and do more of what’s working. I have takes on which interventions most need more attention on the margin, but think that for most people, personal fit is a reasonable way to prioritize between the interventions I’m listing.

Below I’ll briefly recap my overall picture of what success might look like (with links to other things I’ve written on this), then discuss four key categories of interventions: alignment research, standards and monitoring, successful-but-careful AI projects, and information security. For each, I’ll lay out:

- How a small improvement from the status quo could nontrivially improve our odds.
- How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly.
- Common concerns/reservations about the intervention.

Overall, I feel like there is a pretty solid playbook of helpful interventions - any and all of which can improve our odds of success - and that working on those interventions is about as much of a “plan” as we need for the time being.

The content in this post isn’t novel, but I don’t think it’s already-consensus: two of the four interventions (standards and monitoring; information security) seem to get little emphasis from existential risk reduction communities today, and one (successful-but-careful AI projects) is highly controversial and seems often (by this audience) assumed to be net negative.

Many people think most of the above interventions are doomed, irrelevant or sure to be net harmful, and/or that our baseline odds of avoiding a catastrophe are so low that we need something more like a “plan” to have any hope. I have some sense of the arguments for why this is, but in most cases not a great sense (at least, I can’t see where many folks’ level of confidence is coming from). A lot of my goal in posting this piece is to give such people a chance to see where I’m making steps that they disagree with, and to push back pointedly on my views, which could change my picture and my decisions.

As with many of my posts, I don’t claim personal credit for any new ground here. I’m leaning heavily on conversations with others, especially Paul Christiano and Carl Shulman.

My basic picture of what success could look like

I’ve written a number of nearcast-based stories of what it might look like to suffer or avoid an AI catastrophe. I’ve written a hypothetical “failure story” (How we might stumble into AI catastrophe); two “success stories” that assume good decision-making by key actors; and an outline of how we might succeed with “minimal dignity.”

The essence of my picture has two phases:

1. Navigating the initial alignment problem:² getting to the first point of having very powerful (human-level-ish), yet safe, AI systems. For human-level-ish AIs, I think it’s plausible that the alignment problem is easy, trivial or nonexistent. It’s also plausible that it’s fiendishly hard.
2. Navigating the deployment problem:³ reducing the risk that someone in the world will deploy dangerous systems, even though the basic technology exists to make powerful (human-level-ish) AIs safe. (This is often discussed through the lens of “pivotal acts,” though that’s not my preferred framing.⁴)
   - You can think of this as containing two challenges: stopping misaligned human-level-ish AI, and maintaining alignment as AI goes beyond human level.
   - The basic hope (discussed here) is that “safe actors”⁵ team up to the point where they outnumber and slow/stop “unsafe actors,” via measures like standards and monitoring - as well as alignment research (to make it easier for all actors to be effectively “cautious”), threat assessment research (to turn incautious actors cautious), and more.
   - If we can get aligned human-level-ish AI, it could be used to help with all of these things, and a small lead for “cautious actors” could turn into a big and compounding advantage. More broadly, the world will probably be transformed enormously, to the point where we should consider ~all outcomes in play.

4 key categories of interventions

Here I’ll discuss the potential impact of both small and huge progress on each of 4 major categories of interventions.

For more detail on interventions, see Jobs that can help with the most important century; What AI companies can do today to help with the most important century; and How major governments can help with the most important century.

Alignment research

How a small improvement from the status quo could nontrivially improve our odds. I think there are various ways we could “get lucky” such that aligning at least the first human-level-ish AIs is relatively easy, and such that relatively small amounts of progress make the crucial difference.

- If we can get into a regime where AIs are being trained with highly accurate reinforcement - that is, there are few (or no) opportunities to perform well by deceiving, manipulating and/or overpowering sources of supervision - then it seems like we have at least a nontrivial hope that such AIs will end up aligned, in the sense that they generalize to some rule like “Do what the supervisor intends, in the ordinary (hard to formalize) sense that most humans would mean it” and wouldn’t seek takeover even with opportunities for it. (And at least for early human-level-ish systems, it seems like the probability might be pretty high.) Relatively modest progress on things like debate or task decomposition/amplification/recursive reward modeling could end up making for much more accurate reinforcement. (A bit more on this in a previous piece.)
- A single really convincing demonstration of something like deceptive alignment could make a big difference to the case for standards and monitoring (next section). Interpretability research is one potential path here - it could be very valuable to have even one hard-won observation of the form, “This system initially misbehaved, behaved better as its misbehavior was ‘trained out,’ appeared to become extremely well-behaved, but then was revealed by interpretability techniques to be examining each situation for opportunities to misbehave secretly or decisively.”
- It doesn’t seem like anyone has gotten very far with adversarial training yet, but it seems possible that a relatively modest amount of progress could put us in a position to have something like “human-level-ish AI systems that just can’t tell whether takeover opportunities are fake.”
- The more existing work has been done on a given alignment agenda, the more hope I see for automating work on that agenda if/when there are safe-to-use, human-level-ish systems. This could be especially important for interpretability work, where it seems like one could make productive use of a huge number of “automated researchers.”

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. The big win here would be some alignment (or perhaps threat assessment) technique that is both scalable (works even for systems with far-beyond-human capabilities) and cheap (can be used by a given AI lab without having to pay a large “alignment tax”). This seems pretty unlikely to be imminent, but not impossible, and it could lead to a world where aligned AIs heavily outnumber misaligned AIs (a key hope).

Concerns and reservations. Quoting from a previous piece, three key reasons people give for expecting alignment to be very hard are:

1. AI systems could quickly become very powerful relative to their supervisors, which means we have to confront a harder version of the alignment problem without first having human-level-ish aligned systems.
   - I think it’s certainly plausible this could happen, but I haven’t seen a reason to put it at >50%.
   - To be clear, I expect an explosive “takeoff” by historical standards. I want to give Tom Davidson’s analysis more attention, but it implies that there could be mere months between human-level-ish AI and far more capable AI (but that could be enough for a lot of work by human-level-ish AI).
   - One key question: to the extent that we can create a feedback loop with AI systems doing research to improve hardware and/or software efficiency (which then increases the size and/or capability of the “automated workforce,” enabling further research ...), will this mostly be via increasing the number of AIs or by increasing per-AI capabilities? There could be a feedback loop with human-level-ish AI systems exploding in number, which seems to present fewer (though still significant) alignment challenges than a feedback loop with AI systems exploding past human capability.⁶
2. It’s arguably very hard to get even human-level-ish capabilities without ambitious misaligned aims. I discussed this topic at some length with Nate Soares - notes here. I disagree with this as a default (though, again, it’s plausible) for reasons given at that link.
3. Expecting “offense-defense” asymmetries (as in this post) such that we’d get catastrophe even if aligned AIs greatly outnumber misaligned ones. Again, this seems plausible, but not the right default guess for how things will go, as discussed at the end of the previous section.

Standards and monitoring

How a small improvement from the status quo could nontrivially improve our odds. Imagine that:

- Someone develops a very hacky and imperfect - and voluntary - “dangerous capabilities” standard, such as (to oversimplify): if an AI seems⁷ capable of doing everything needed to autonomously replicate in the wild,⁸ then (to be standard-compliant) it cannot be deployed (and no significant scaleup can be done at all) without strong assurances of security (assessed via penetration testing by reputable third parties) and alignment (assessed via, say, a public explanation of why the AI lab believes its system to be aligned, including required engagement with key reasons this might be hard to assess and a public comment period, and perhaps including an external review).
- Several top AI labs declare that they intend to abide by the standard - perhaps out of genuine good intentions, perhaps because they think regulation is inevitable and hope to legitimize approaches to it that they can gain experience with, perhaps due to internal and external pressure and a desire for good PR, perhaps for other reasons.
- Once several top AI labs have committed, it becomes somewhat odd-seeming for an AI lab not to commit. Some do hold out, but they tend to have worse reputations and more trouble attracting talent and customers, due partly to advocacy efforts. A cascade along the lines of what we’ve seen in farm animal welfare seems plausible.
- The standard is fairly “squishy”; there are various ways to weasel out by e.g. selecting an overly “soft” auditor or violating the spirit of the “no deployments, no significant scaleup” rules, etc. and there are no consequences if a lab abandons the standard beyond disclosure of that decision.

I think this kind of situation would bring major benefits to the status quo, if only via incentives for top AI labs to move more carefully and invest more energy in alignment. Even a squishy, gameable standard, accompanied by mostly-theoretical possibilities of future regulation and media attention, could add to the risks (bad PR, employee dissatisfaction, etc.) and general pain of scaling up and releasing models that can’t be shown to be safe.

This could make it more attractive for companies to do their best with less capable models while making serious investments in alignment work (including putting more of the “results-oriented leadership effort” into safety - e.g., “We really need to make better alignment progress, where are we on that?” as opposed to “We have a big safety team, what more do you want?”) And it could create a big financial “prize” for anyone (including outside of AI companies) who comes up with an attractive approach to alignment.

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. A big potential win is something like:

- Initially, a handful of companies self-regulate by complying with the standard.
- This situation creates an ecosystem for standards setters, evaluation designers (e.g., designing evaluations of dangerous capabilities and alignment), auditors, etc.
- When the government decides to regulate AI, they default to poaching people from that ecosystem and copying over its frameworks. My impression is that governments generally prefer to poach/copy what’s already working when feasible. Now that regulation is official, standards are substantially less squishy (though not perfect) - perhaps via government-authorized auditors being given a lot of discretion to declare AI systems unsafe.
- The US government, and/or other governments, unilaterally enforces standards (and/or just blocks development of AI) internationally, with methods ranging from threats of sanctions to cyberwarfare or even more drastic measures.
- It’s not impossible to build a dangerous AI at this point, but it’s quite difficult and risky, and this slows everyone down a lot and greatly increases investment in alignment. If the alignment investment still doesn’t result in much, it might at least be the case that limited AI becomes competitive and appealing.
- This all could result in early deployed human-level-ish AI systems being “safe enough” and used largely to develop better standards, better ways of monitoring and enforcing them, etc.

Concerns and reservations. A common class of concerns is along the lines of, “Any plausible standards would be squishy/gameable”; I think this is significantly true, but squishy/gameable regulations can still affect behavior a lot.⁹

Another concern: standards could end up with a dynamic like “Slowing down relatively cautious, high-integrity and/or law-abiding players, allowing less cautious players to overtake them.” I do think this is a serious risk, but I also think we could easily end up in a world where the “less cautious” players have trouble getting top talent and customers, which does some combination of slowing them down and getting them to adopt standards of their own (perhaps weaker ones, but which still affect their speed and incentives). And I think the hope of affecting regulation is significant here.

I think there’s a pretty common misconception that standards are hopeless internationally because international cooperation (especially via treaty) is so hard. But there is precedent for the US enforcing various things on other countries via soft power, threats, cyberwarfare, etc. without treaties or permission, and in a high-stakes scenario, it could do quite a lot of this.

Successful, careful AI lab

Conflict of interest disclosure: my wife is co-founder and President of Anthropic and owns significant equity in both Anthropic and OpenAI. This may affect my views, though I don't think it is safe to assume specific things about my takes on specific AI labs due to this.¹⁰

How a small improvement from the status quo could nontrivially improve our odds. If we just imagine an AI lab that is even moderately competitive on capabilities while being substantially more concerned about alignment than its peers, such a lab could:

- Make lots of money and thus support lots of work on alignment as well as other things (e.g., standards and monitoring).
- Establish general best practices - around governance, security, and more - that other labs can learn from. (It’s dramatically easier and more likely for a company to copy something that’s already working somewhere else, as opposed to experimenting with their own innovative ways of e.g. protecting AI model weights.)
- Be a place for lots of alignment-concerned folks to gain credibility and experience with AI systems and companies - positioning them to be influential at other companies, in government, etc. in the future.
- Have a relatively small marginal impact on speeding up and/or hyping AI, simply by not releasing anything that’s more advanced than what other labs have released. (I think it should still be possible to make big profits despite this practice.)

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. If an AI lab ends up with a several-month “lead” on everyone else, this could enable huge amounts of automated alignment research, threat assessment (which could create very strong demonstrations of risk in the event that automated alignment research isn’t feasible), and other useful tasks with initial human-level-ish systems.

Concerns and reservations. This is a tough one. AI labs can do ~unlimited amounts of harm, and it currently seems hard to get a reliable signal from a given lab’s leadership that it won’t. (Up until AI systems are actually existentially dangerous, there’s ~always an argument along the lines of “We need to move as fast as possible and prioritize fundraising success today, to stay relevant so we can do good later.”) If you’re helping an AI lab “stay in the race,” you had better have done a good job deciding how much you trust leadership, and I don’t see any failsafe way to do that.

That said, it doesn’t seem impossible to me to get this right-ish (e.g., I think today’s conventional wisdom about which major AI labs are “good actors” on a relative basis is neither uninformative (in the sense of rating all labs about the same) nor wildly off), and if you can, it seems like there is a lot of good that can be done by an AI lab.

I’m aware that many people think something like “Working at an AI lab = speeding up the development of transformative AI = definitely bad, regardless of potential benefits,” but I’ve never seen this take spelled out in what seems like a convincing way, especially since it’s pretty easy for a lab’s marginal impact on speeding up timelines to be small (see above).

I do recognize a sense in which helping an AI lab move forward with AI development amounts to “being part of the problem”: a world in which lots of people are taking this action seems worse than a world in which few-to-none are. But the latter seems off the table, not because of Molochian dynamics or other game-theoretic challenges, but because most of the people working to push forward AI simply don’t believe in and/or care about existential risk ~at all (and so their actions don’t seem responsive in any sense, including acausally, to how x-risk-concerned folks weigh the tradeoffs). As such, I think “I can’t slow down AI that much by staying out of this, and getting into it seems helpful on balance” is a prima facie plausible argument that has to be weighed on the merits of the case rather than dismissed with “That’s being part of the problem.”

I think helping out AI labs is the trickiest and highest-downside intervention on my list, but it seems quite plausibly quite good in many cases.

Information security

How a small improvement from the status quo could nontrivially improve our odds. It seems to me that the status quo in security is rough (more), and I think a small handful of highly effective security people could have a very large marginal impact. In particular, it seems like it is likely feasible to make it at least difficult and unreliable for a state actor to steal a fully-developed powerful AI system.

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. I think this doesn’t apply so much here, except for a potential somewhat far-fetched case in which someone develops (perhaps with assistance from early powerful-but-not-strongly-superhuman AIs) a surprisingly secure environment that can contain even misaligned AIs significantly (though probably not unboundedly) more capable than humans.

Concerns and reservations. My impression is that most people who aren’t excited about security think one of these things:

1. The situation is utterly hopeless - there’s no path to protecting an AI from being stolen.
2. Or: this isn’t an area to focus on because major AI labs can simply hire non-x-risk-motivated security professionals, so why are we talking about this?

I disagree with #2 for reasons given here (I may write more on this topic in the future).

I disagree with #1 as well.

- I think imperfect measures can go a long way, and I think there are plenty of worlds where stealing dangerous AI systems is quite difficult to pull off, such that a given attempt at stealing takes months or more - which, as detailed above, could be enough to make a huge difference.
- Additionally, a standards-and-monitoring regime could include provisions for retaliating against theft attempts, and stealing model weights without much risk of getting caught could be especially difficult thanks to serious (but not extreme or perfect) security measures.
- I also think it’s pretty likely that stealing the weights of an AI system won’t be enough to get the full benefit from it - it could also be necessary to replicate big parts of the scaffolding, usage procedures, dev environment, etc. which could be difficult.

Notes

1. After drafting this post, I was told that others had been making this same distinction and using this same term in private documents. I make no claim to having come up with it myself!
2. Phase 1 in this analysis
3. Phase 2 in this analysis
4. I think there are ways things could go well without any particular identifiable “pivotal act”; see the “success stories” I linked for more on this.
5. “Safe actors” corresponds to “cautious actors” in this post. I’m using a different term here because I want to include the possibility that actors are safe mostly due to luck (slash cheapness of alignment) rather than caution per se.
6. The latter, more dangerous possibility seems more likely to me, but it seems quite hard to say. (There could also of course be a hybrid situation, as the number and capabilities of AI grow.)
7. In the judgment of an auditor, and/or an internal evaluation that is stress-tested by an auditor, or simply an internal evaluation backed by the risk that inaccurate results will result in whistleblowing.
8. I.e., given access to its own weights, it could plausibly create thousands of copies of itself with tens of millions of dollars at their disposal, and make itself robust to an attempt by a few private companies to shut it down.
9. A comment from Carl Shulman on this point that seems reasonable: "A key difference here seems to be extremely rapid growth, where year on year effective compute grows 4x or more. So a defector with 1/16th the resources can produce the same amount of danger in 1-2 years, sooner if closer to advanced AGI and growth has accelerated. The anti-nuclear and anti-GMO movements cut adoption of those technologies by more than half, but you didn't see countries with GMO crops producing all the world's food after a few years, or France making so much nuclear power that all electricity-intensive industries moved there.
   For regulatory purposes you want to know if the regulation can block an AI capabilities explosion. Otherwise you're buying time for a better solution like intent alignment of advanced AI, and not very much time. That time is worthwhile, because you can perhaps get alignment or AI mind-reading to work in an extra 3 or 6 or 12 months. But the difference with conventional regulation interfering with tech is that the regulation is offsetting exponential growth; exponential regulatory decay only buys linear delay to find longer-term solutions.
   There is a good case that extra months matter, but it's a very different case from GMO or nuclear power. [And it would be far more to the credit of our civilization if we could do anything sensible at scale before the last few months or years.]"
10. We would still be married even if I disagreed sharply with Anthropic’s strategy. In general, I rarely share my views on specific AI labs in public.
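
The arithmetic behind the Carl Shulman comment quoted in the notes above can be checked with a rough sketch. It assumes, purely for illustration, that effective compute grows by a constant factor each year. The function name `catch_up_years` and its parameters are hypothetical and do not come from the comment itself.

```python
import math


def catch_up_years(resource_gap: float, annual_growth: float) -> float:
    """Years until an actor starting with 1/resource_gap of the leader's
    effective compute reaches the leader's starting level, assuming its
    effective compute multiplies by annual_growth every year (an
    illustrative simplification, not a forecast)."""
    if resource_gap < 1 or annual_growth <= 1:
        raise ValueError("need resource_gap >= 1 and annual_growth > 1")
    # Solve annual_growth ** years == resource_gap for years.
    return math.log(resource_gap) / math.log(annual_growth)


# The quoted case: a 16x gap at 4x/year growth, and the same gap if
# growth has accelerated to 16x/year.
for growth in (4, 16):
    print(f"16x gap, {growth}x/year: {catch_up_years(16, growth):.2f} years")

# Squaring the handicap (16x, 256x, 4096x) adds a fixed number of years
# each time: an exponentially larger handicap buys only linear delay.
for gap in (16, 256, 4096):
    print(f"{gap}x gap, 4x/year: {catch_up_years(gap, 4):.2f} years")
```

Under these assumptions the delay is a logarithm of the resource gap, which is one way to read the claim that "exponential regulatory decay only buys linear delay."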

Wei Dai (3y):

One way that things could go wrong, not addressed by this playbook: AI may differentially accelerate intellectual progress in a wrong direction, or in other words create opportunities for humanity to make serious mistakes (by accelerating technological progress) faster than wisdom to make right choices (philosophical progress). Specific to the issue of misalignment, suppose we get aligned human-level-ish AI, but it is significantly better at speeding up AI capabilities research than the kinds of intellectual progress needed to continue to minimize misalignment risk, such as (next generation) alignment research and coordination mechanisms between humans, human-AI teams, or AIs aligned to different humans.

I think this suggests the intervention of doing research aimed at improving the philosophical abilities of the AIs that we'll build. (Aside from misalignment risk, it would help with many other AI-related x-risks that I won't go into here, but which collectively outweigh misalignment risk in my mind.)

HoldenKarnofsky (3y):

I agree that this is a major concern. I touched on some related issues in this piece.

This post focused on misalignment because I think readers of this forum tend to be heavily focused on misalignment, and in this piece I wanted to talk about what a playbook might look like assuming that focus (I have pushed back on this as the exclusive focus elsewhere).

I think somewhat adapted versions of the four categories of intervention I listed could be useful for the issue you raise, as well.

AI ALIGNMENT FORUM