Recent advances in machine learning—in reinforcement learning, language modeling, image and video generation, translation, transcription, and more—have not been matched by similarly striking safety results, and this has rather dampened the mood in many AI Safety circles. If I were any less concerned by extinction risks from AI, I would have finished my PhD[1] as planned before moving from Australia to SF to work at Anthropic; I believe that the situation is both urgent and important.[2]

On the other hand, despair is neither instrumentally nor terminally valuable.[3] This essay therefore lays out some concrete reasons for hope, which might help rebalance the emotional scales and offer some directions to move in.

Background: a little about Anthropic

I must emphasize here that this essay represents only my own views, and not those of my employer. I’ll try to make this clear by restricting “we” to actions and “I” to opinions, so as to avoid attributing my own views to my colleagues. Please forgive any lapses of style or substance.

Anthropic’s raison d’être is AI Safety. It was founded in early 2021, as a public benefit corporation,[4] and focuses on empirical research with advanced ML systems. I see our work as having four key pillars:

  1. Training near-SOTA models. This ensures that our safety work will in fact be relevant to cutting-edge systems, and we’ve found that many alignment techniques only work at large scales.[5] Understanding how capabilities emerge over model scale and training time seems vital for safety, whether as a basis to proceed with care or as a source of evidence that continuing to scale capabilities would be immediately risky.
  2. Direct alignment research. There are many proposals for how advanced AI systems might be aligned, many of which can be tested empirically in near-SOTA (but not smaller) models today. We regularly produce the safest model we can with current techniques,[6] and characterize how it fails in order to inform research and policy. With RLHF as a solid baseline and building block, we're investigating more complicated but robust schemes such as constitutional AI, scalable supervision, and model-assisted evaluations.
  3. Interpretability research. Fully understanding models could let us rule out learned optimizers, deceptive misalignment, and more. Even limited insights would be incredibly valuable as an independent check on other alignment efforts, and might offer a second chance if they fail.
  4. Policy and communications. I expect AI capabilities will continue to advance, with fast-growing impacts on employment, the economy, and cybersecurity. Having high-trust relationships between labs and governments, and more generally ensuring policy-makers are well-informed, seems robustly positive.

If you want to know more about what we’re up to, the best place to check is anthropic.com, where all our published research is collected. We’ll be posting more information about Anthropic throughout this year, as well as fleshing out the website.

Concrete reasons for hope

My views on alignment are similar to (my understanding of) Nate Soares’. I think the key differences are that I don’t think there’s enough evidence to confidently predict the difficulty of future problems, and that I do think it’s possible for careful labs to avoid actively committing a catastrophe. We also seem to have different views on how labs should respond to the situation, which this essay does not discuss.

Language model interventions work pretty well

I wasn’t expecting this, but our helpful/harmless/honest research is in fact going pretty well! The models are far from perfect, but we’ve made much more progress than I would have expected a year ago, with no signs of slowing down yet. HHH omits several vital pieces of the full alignment problem, but if it leads to AI that always shuts down on command and never causes a catastrophe, I’ll be pretty happy.

As expected, we’ve also seen a range of failures on more difficult tasks or where train-time supervision was relatively weak – such as inventing a series of misleading post-hoc justifications when inconsistent responses are questioned.[7] The ‘treacherous turn’ scenario is still concerning, but I find it plausible that work on e.g. scalable supervision and model-based red-teaming could help us detect it early.

Few attempts to align ML systems

There are strong theoretical arguments that alignment is difficult, e.g. about convergent instrumental goals, and little empirical progress on aligning general-purpose ML systems. However, the latter only became possible a few years ago with large language models, and even then only in a few labs! There’s also a tradition of taking theoretically very hard problems and then finding some relaxation or subset which is remarkably easy or useful in practice – for example, SMT solvers handling most practical instances of NP-complete problems, or CRDTs and Spanner working within the constraints of the CAP theorem. I expect that increasing hands-on alignment research will give us a similarly rich vein of empirical results and praxis from which to draw more abstract insights.

Interpretability is promising!

It feels like we’re still in the fundamental-science stage, but interpretability is going much better than I expected. We’re not in the ‘best of all possible worlds’ where polysemanticity just isn’t a thing, but compared to early 2020, transformer interpretability is going great. I’m also pretty optimistic about transfer to new architectures, if one comes along – there are some shared motifs between the ImageNet and transformer circuits threads, and an increasing wealth of tooling and experience.

Mechanistic interpretability is also popping up in many other places! I’ve recently enjoyed reading papers from Redwood Research, Conjecture, and DeepMind, for example, and it feels more like a small but growing field than a single research project. Mechanistic interpretability might hit a wall before it becomes useful for TAI/AGI safety, or simply fail to get there in time; or it might not.

Outcome-based training can be limited or avoided

Supervised learning seems to be working really well, and process-based techniques could plausibly scale to superhuman performance; they might even be better for capabilities in regimes with very scarce feedback or reward signals. I’d expect this to be good for safety relative to outcome-based RL systems, and more amenable to audits and monitoring. “Just” being as skilled as the best humans in every domain – with perfect synergy between every skill-set, at lower cost and wildly higher speed, with consistently-followed and constantly-refined playbooks – would be incredibly valuable.
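To make the distinction concrete, here is a minimal sketch of the difference between outcome-based and process-based reward. It is purely illustrative: the Episode container and the answer_is_correct / step_is_valid judges are hypothetical stand-ins for whatever human or model feedback would supply these judgments, not anything we actually run.

```python
# Toy contrast between outcome-based and process-based supervision.
# `Episode`, `answer_is_correct`, and `step_is_valid` are hypothetical.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Episode:
    steps: List[str]   # intermediate reasoning or actions produced by the model
    answer: str        # final output


def outcome_reward(ep: Episode, answer_is_correct: Callable[[str], bool]) -> float:
    # Only the end result is scored; how the model got there is invisible,
    # so reward can in principle be earned by unintended strategies.
    return 1.0 if answer_is_correct(ep.answer) else 0.0


def process_reward(ep: Episode, step_is_valid: Callable[[str], bool]) -> float:
    # Every intermediate step is scored by a supervisor, so credit is tied to
    # following an approved procedure, which is much easier to audit and monitor.
    if not ep.steps:
        return 0.0
    return sum(step_is_valid(s) for s in ep.steps) / len(ep.steps)


if __name__ == "__main__":
    ep = Episode(
        steps=["look up the formula", "substitute the given values", "simplify"],
        answer="42",
    )
    print(outcome_reward(ep, answer_is_correct=lambda a: a == "42"))    # 1.0
    print(process_reward(ep, step_is_valid=lambda s: bool(s.strip())))  # 1.0
```

The point is just that process-based reward ties credit to each step of an approved procedure, rather than to whatever strategy happened to produce a good final outcome – which is what makes it more amenable to audits and monitoring.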

Good enough is good enough

The first TAI system doesn’t have to be a perfectly aligned sovereign—so long as it’s corrigible, it can be turned off, re-engineered, and tried again. The goal is to end the acute risk period, which might be possible via direct assistance with alignment research, by enabling policy interventions, or by other means entirely.

Training can probably stop short of catastrophe

Model capabilities increase with scale and over the course of training, and evaluation results on checkpoints are quite precisely predictable from scaling laws, even for models more capable than those from which the scaling laws were derived. You can write pretty sensitive evaluations for many kinds of concerning behavior: check log-probs or activations as well as sampled tokens, use any interpretability techniques you like (never training against them!) – and just stop training unless you have a strong positive case for safety!
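As an illustration of that control flow (and nothing more), here is a minimal sketch of an evaluation-gated training loop. Every name in it is hypothetical – train_step and concerning_logprob stand in for a real training step and a real evaluation suite, and the threshold is arbitrary rather than a calibrated safety case.

```python
# Illustrative evaluation-gated training loop; all names and thresholds are
# hypothetical stand-ins, not any lab's real pipeline.
import math
from typing import Callable

CONCERNING_LOGPROB_CEILING = math.log(1e-4)  # arbitrary illustrative threshold


def train_with_safety_gate(
    train_step: Callable[[int], None],
    concerning_logprob: Callable[[], float],  # mean log-prob of concerning continuations
    max_steps: int,
    eval_every: int = 1000,
) -> bool:
    """Run training, but halt as soon as the safety evaluations stop passing.

    Returns True if training finished with every gate passed, False if halted.
    """
    for step in range(max_steps):
        train_step(step)
        if step % eval_every == 0:
            score = concerning_logprob()
            # Log-probs are more sensitive than sampled tokens: behavior that
            # almost never appears in samples can still show up as a steadily
            # rising probability across checkpoints.
            if score > CONCERNING_LOGPROB_CEILING:
                print(f"Halting at step {step}: concerning log-prob {score:.2f}")
                return False
    return True


if __name__ == "__main__":
    # Toy stand-ins: a no-op training step, and an evaluation score that creeps
    # upward across checkpoints until it trips the gate.
    scores = (math.log(1e-6) + 0.5 * i for i in range(100))
    finished = train_with_safety_gate(
        train_step=lambda step: None,
        concerning_logprob=lambda: next(scores),
        max_steps=20_000,
    )
    print("finished:", finished)
```

The important design choice is the default: training continues only while the evaluations keep passing, rather than stopping only once something has already gone visibly wrong.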

I’d aim to stop before getting concrete reason to think I was training a dangerous model, obviously, but I also value defense in depth against my own mistakes.

People do respond to evidence

Not everyone, and not all institutions, but many. The theme here is that proposals which require some people to get on board are often feasible, even when they require specific people. I often talk to people who have been following AI Safety from a distance for a few years, until some recent result[8] convinced them that there was a high-impact use for their skills[9] and it was time to get more involved.

If a major lab saw something which really scared them, I think other labs would in fact agree to a moratorium on further capabilities work until it could be thoroughly investigated. Publicly announcing a pause would more or less declare that AGI was imminent, would risk a flood of less safety-conscious entrants to the field, and would raise questions of antitrust law, but I’m confident that these are manageable issues.

I don’t expect a ‘sharp left turn’

The ‘sharp left turn’ problem derives from the claim that capabilities generalize better than alignment. This hasn’t been argued so much as asserted, by analogy to human evolution as our only evidence on how human-level general intelligence might develop. I think this analogy is uninformative, because human researchers are capable of enormously better foresight and supervision than evolution was, and I anticipate careful empirical studies of this question in silico, despite the difficulty of interpreting the results.

Anecdotally, I’ve seen RLHF generalize alignmentish properties like helpfulness and harmlessness across domains at least as well as it generalizes capabilities at their present levels, and I don’t share the intuition that this is very likely to change in future. I think ‘alignment generalization failures’ are both serious and likely enough to specifically monitor and mitigate, but not that a sharp left turn is anywhere near certain.

Conclusion: high confidence in doom is unjustified

By “doom” I mean a scenario in which all humans are comprehensively disempowered by AI before the end of this century; I expect that human extinction would follow from comprehensive disempowerment within decades. “It Looks Like You’re Trying To Take Over The World” is a central example; human accident or misuse might lead to similarly bad outcomes, but I’m focusing on technical alignment here.

Estimates of “P(doom)” are based on only tenuous evidence. I think a wide range is consistent with available evidence and reasonable priors, largely because in my view it’s unclear whether the problem is difficult like the steam engine, or net-energy-gain from fusion, or proving whether P = NP, or perhaps more difficult still. I tried writing tighter estimates but couldn’t construct anything I’d endorse.

Emotionally, I’m feeling pretty optimistic. While the situation is very scary indeed and often stressful, the x-risk mitigation community is a lovely and growing group of people, there’s a large frontier of work to be done, and I’m pretty confident that at least some of it will turn out to be helpful. So let’s get (back) to work!


  1. Contrary to stereotypes about grad school, I was really enjoying it! I’m also pretty sad to shut down the startup I’d spun out (hypofuzz.com), though open-sourcing it is some consolation – if it weren’t for x-risk I wouldn’t be in AI at all. ↩︎

  2. I work directly on AI x-risk, and separately I give 10% of my income to GiveWell’s “All Grants” fund (highest EV), and a further 1% to GiveDirectly (personal values). I value this directly; I also believe that credible signals of altruism beyond AI are important for the health of the community. ↩︎

  3. This is mostly a question of your emotional relationship to the facts of the matter; an accurate assessment of the situation is of course instrumentally vital and to me also terminally desirable. ↩︎

  4. The effect is that the company can raise money from investors and prioritize the mission over shareholder profits. ↩︎

  5. I'm thinking here of the helpful/harmless tradeoff from (e.g.) fig. 26a of our RLHF paper, P(IK) calibration, scalable supervision results, etc. – places where you need pretty good capabilities to do useful experiments. ↩︎

  6. I also think that this is a very important habit for anyone working with SOTA or near-SOTA systems. When whoever it is eventually tries building a TAI or AGI system, I would strongly prefer that they have a lot of hands-on practice aligning weaker AI systems as well as an appreciation that this time is likely to be different. ↩︎

  7. (e.g. here) This likely arose from RLHF training, where shallow train-time human supervision rewarded a pattern of justifying or modifying claims in a single step when challenged; when pressed harder in deployment, the model generalizes by repeating and escalating this pattern to the point of absurdity. ↩︎

  8. e.g. GPT-3, Minerva, AlphaFold, code models – all language- rather than image-based, which seems right to me. Since I first drafted this essay there's also ChatGPT, which continues to make headlines even in mainstream newspapers. ↩︎

  9. For example software and systems engineering, law, finance, recruiting, ops, etc. – the scale and diversity of AI safety projects in the large language model era demand a wider range of skills and experience than earlier eras did. ↩︎
