Some small corrections/additions to my section ("Altair agent foundations"). I'm currently calling it "Dovetail research". That's not publicly written anywhere yet, but if it were listed as that here, it might help people who are searching for it later this year.
Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake
I wouldn't put number 9. It's not intended to "solve" most of these problems, but it is intended to help make progress on understanding the nature of the problems through formalization, so that they can be avoided or postponed, or more effectively solved by other research agendas.
Target case: worst-case
definitely not worst-case, more like pessimistic-case
Some names: Alex Altair, Alfred Harwood, Daniel C, Dalcy K
The following is a list of live agendas in technical AI safety, updating our post from last year. It is “shallow” in the sense that 1) we are not specialists in almost any of it and 2) we only spent about an hour on each entry. We also only use public information, so we are bound to be off by some additional factor.
The point is to help anyone look up some of what is happening, or that thing you vaguely remember reading about; to help new researchers orient and know (some of) their options and the standing critiques; to help policy people know who to talk to for the actual information; and ideally to help funders see quickly what has already been funded and how much (but this proves to be hard).
“AI safety” means many things. We’re targeting work that intends to prevent very competent cognitive systems from having large unintended effects on the world.
This time we also made an effort to identify work that doesn’t show up on LW/AF by trawling conferences and arXiv. Our list is not exhaustive, particularly for academic work which is unindexed and hard to discover. If we missed you or got something wrong, please comment, we will edit.
The method section is important but we put it down the bottom anyway.
One commenter said we shouldn’t do this review, because the sheer length of it fools people into thinking that there’s been lots of progress. Obviously we disagree that it’s not worth doing, but be warned: the following is an exercise in quantity; activity is not the same as progress; you have to consider whether it actually helps.
A smell of ozone. In the last month there has been a flurry of hopeful or despairing pieces claiming that the next base models are not a big advance, or that we hit a data wall. These often ground out in gossip, but it’s true that the next-gen base models are held up by something, maybe just inference cost.
The pretraining runs are bottlenecked on electrical power too. Amazon is at present not getting its nuclear datacentre.
But overall it’s been a big year for capabilities despite no pretraining scaling.
I had forgotten long contexts were so recent; million-token windows only arrived in February. Multimodality was February. “Reasoning” was September. “Agency” was October.
I don’t trust benchmarks much, but on GPQA (hard science) they leapt all the way from chance to PhD-level just with post-training.
FrontierMath launched 6 weeks ago; o3 moved it 2% → 25%. This is about a year ahead of schedule (though 25% of the benchmark is “find this number!” problems at International Math Olympiad / elite undergrad level). Unlike the IMO there’s an unusual emphasis on numerical answers and computation too.
LLaMA-3.1 only used 0.1% synthetic pretraining data, Hunyuan-Large supposedly used 20%.
The revenge of factored cognition? The full o1 (descriptive name: GPT-4-RL-CoT) model is uneven, but seems better at some hard things. The timing is suggestive: this new scaling dimension is being exploited now to make up for the lack of pretraining compute scaling, and so keep the excitement/investment level high. See also Claude apparently doing a little of this.
Moulton: “If the paths found compress well into the base model then even the test-time compute paradigm may be short lived.”
You can do a lot with a modern 8B model, apparently more than you could with 2020’s GPT-3-175B. This scaling of capability density will cause other problems.
There’s still some room for scepticism on OOD capabilities. Here’s a messy case which we don’t fully endorse.
The revenge of RL safety. After LLMs ate the field, the old safety theory (which thought in terms of RL) was said to be less relevant.[1] But the training process for o1 / R1 involves more RL than RLHF does, which is pretty worrying. o3 involves more RL still.
Some parts of AI safety are now mainstream.
So it’s hard to find all of the people who don’t post here. For instance, here’s a random ACL paper which just cites the frontier labs.
The AI governance pivot continues despite the SB1047 setback. MIRI is a policy org now. 80k made governance researcher its top recommended career, after 4 years of technical safety being that.
The AISIs seem to be doing well. The UK one survived a political transition and the US one might survive theirs. See also the proposed AI Safety Review Office.
Mainstreaming safety ideas has polarised things of course; an organised opposition has stood up at last. Seems like it’s not yet party-political though.
Last year we noted a turn towards control instead of alignment, a turn which seems to have continued.
Alignment evals with public test sets will probably be pretrained on, and as such will probably quickly stop meaning anything. Maybe you hope that it generalises from post-training anyway?
Safety cases are a mix of scalable oversight and governance; if it proves hard to make a convincing safety case for a given deployment, then – unlike evals – the safety case gives an if-then decision procedure to get people to stop; or, if instead real safety cases are easy to make, we can make safety cases for scalable oversight, and then win.
Grietzer and Jha deprecate the word “alignment”, since it means too many things at once:
“P1: Avoiding takeover from emergent optimization in AI agents
P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us
P3: Ensuring AIs are good at solving problems as specified (by user or designer)
P4: Ensuring AI systems enhance, and don’t erode, human agency
P5: Ensuring that advanced AI agents learn a human utility function
P6: Ensuring that AI systems lead to desirable systemic and long term outcomes”
I note in passing the for-profit ~alignment companies in this list: Conjecture, Goodfire, Leap, AE Studio. (Not counting the labs.)
We don’t comment on quality. Here’s one researcher’s opinion on the best work of the year (for his purposes).
From December 2023: you should read Turner and Barnett alleging community failures.
The term “prosaic alignment” violates one good rule of thumb: that one should in general name things in ways that the people covered would endorse.[2] We’re calling it “iterative alignment”. We liked Kosoy’s description of the big implicit strategies used in the field, including the “incrementalist” strategy.
Quite a lot of the juiciest work is in the "miscellaneous" category, suggesting that our taxonomy isn't right, or that tree data structures aren't.
Agendas with public outputs
1. Understand existing models
Evals
(Figuring out how trained models behave. Arguably not itself safety work but a useful input.)
One-sentence summary: make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.
Theory of change: keep a close eye on what capabilities are acquired when, so that frontier labs and regulators are better informed on what security measures are already necessary (and hopefully they extrapolate). You can’t regulate without them.
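For concreteness, the statistical core of a capability eval under this low-n framing is just a pass rate with honest error bars. A toy sketch (the task list and grader below are dummy placeholders, not any real harness):

```python
import math
import random

def run_model_on_task(task: str) -> bool:
    """Hypothetical stand-in for 'sample the model on one task and grade the transcript'."""
    return random.random() < 0.3  # pretend the model solves ~30% of tasks

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (better behaved than the
    normal approximation at the small n typical of dangerous-capability evals)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

tasks = [f"task_{i}" for i in range(40)]          # a small, non-exhaustive task set
passes = sum(run_model_on_task(t) for t in tasks)
lo, hi = wilson_interval(passes, len(tasks))
print(f"pass rate {passes}/{len(tasks)}; 95% CI [{lo:.2f}, {hi:.2f}]")
```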
One-sentence summary: let’s attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods.
Which orthodox alignment problems could it help with?: 12. A boxed AGI might exfiltrate itself by steganography, spearphishing, 4. Goals misgeneralize out of distribution
One-sentence summary: try to reverse-engineer models in a principled way and use this understanding to make models safer. Break it into components (neurons, polytopes, circuits, feature directions, singular vectors, etc), interpret the components, check that your interpretation is right.
Theory of change: Iterate towards things which don’t scheme. Most bottom-up interp agendas are not seeking a full circuit-level reconstruction of model algorithms, they're just aiming at formal models that are principled enough to root out e.g. deception. Aid alignment through ontology identification, auditing for deception and planning, targeting alignment methods, intervening in training, inference-time control to act on hypothetical real-time monitoring.
This is a catch-all entry with lots of overlap with the rest of this section. See also scalable oversight, ambitious mech interp.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify; 7. Superintelligence can fool human supervisors; 12. A boxed AGI might exfiltrate itself.
Target case: pessimistic
Broad approach: cognitive
Some names: Chris Olah, Neel Nanda, Trenton Bricken, Samuel Marks, Nina Panickssery
One-sentence summary: decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic “features” which correspond to interpretable concepts.
Theory of change: get a principled decomposition of an LLM's activation into atomic components → identify deception and other misbehaviors. Sharkey’s version has much more detail.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 7. Superintelligence can fool human supervisors
Target case: pessimistic
Broad approach: cognitive
Some names: Senthooran Rajamanoharan, Arthur Conmy, Leo Gao, Neel Nanda, Connor Kissane, Lee Sharkey, Samuel Marks, David Bau, Eric Michaud, Aaron Mueller, Decode
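A minimal sketch of the SAE recipe described in the entry above, run on synthetic activations (the dimensions, L1 penalty and training loop are our illustrative assumptions, not any lab's actual setup):

```python
import torch
import torch.nn as nn

d_model, d_features, l1_coeff = 64, 512, 1e-3   # illustrative sizes, not real choices

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))       # sparse, hopefully monosemantic "features"
        return self.dec(f), f

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Stand-in for residual-stream activations collected from a real model.
acts = torch.randn(4096, d_model)

for step in range(200):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    recon, feats = sae(batch)
    # Reconstruction loss plus an L1 sparsity penalty on the feature activations.
    loss = ((recon - batch) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# A feature's "interpretation" would then come from inspecting the inputs that activate it most.
print("final loss:", loss.item())
```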
One-sentence summary: Computational mechanics for interpretability; what structures must a system track in order to predict the future?
Theory of change: apply the theory to SOTA AI, improve structure measures and unsupervised methods for discovering structure, ultimately operationalize safety-relevant phenomena.
One-sentence summary: develop the foundations of interpretable AI through the lens of causality and abstraction.
Theory of change: figure out what it means for a mechanistic explanation of neural network behavior to be correct → find a mechanistic explanation of neural network behavior
One-sentence summary: if a bottom-up understanding of models turns out to be too hard, we might still be able to jump in at some high level of abstraction and still steer away from misaligned AGI.
Theory of change: build tools that can output a probable and predictive representation of internal objectives or capabilities of a model, thereby enabling model editing and monitoring.
See also: high-level interpretability, model-agnostic interpretability, Cadenza, Leap.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 7. Superintelligence can fool human supervisors
Target case: optimistic / pessimistic
Broad approach: cognitive
Some names: Wes Gurnee, Max Tegmark, Eric J. Michaud, David Baek, Josh Engels, Walter Laurito, Kaarel Hänni
One-sentence summary: research startup selling an interpretability API (model-agnostic feature viz of vision models). Aiming for data-independent (“want to extract information directly from the model with little dependence on training or test data”) and global (“mech interp isn’t going to be enough, we need holistic methods that capture gestalt”) interpretability methods.
Theory of change: make safety tools people want to use, stress-test methods in real life, develop a strong alternative to bottom-up circuit analysis.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify
One-sentence summary: Build tools for detecting, locating, and interpreting key moments (saddle-to-saddle dynamics, groks) that govern training and in-context learning in models.
Also: build an idealized model of NNs, measure high-level phenomena, and derive interesting predictions about real models. Apply these measures to real models: "pretend they're idealized, then apply the high-level measurements you'd apply to idealized models, then see if you can interpret the results".
Theory of change: structures forming in neural networks can leave traces we can interpret to figure out where and how that structure is implemented. This could automate interpretability. It may be hopeless to intervene at the end of the learning process, so we want to catch and prevent deceptiveness and other dangerous capabilities and values as early as possible.
One-sentence summary: toy models (e.g. of induction heads) to understand learning in interesting limiting examples; only part of their work is safety related.
Theory of change: study interpretability and learning in DL (for bio insights, unrelated to AI) → someone else uses this work to do something safety related
Which orthodox alignment problems could it help with?: We don’t know how to determine an AGI’s goals or values
Target case: optimistic?
Broad approach: cognitive
Some names: Andrew Saxe, Basile Confavreux, Erin Grant, Stefano Sarao Mannelli, Tyler Boyd-Meredith, Victor Pedrosa
Theory of change (imputed): things are generally smooth; relevant capabilities are harder than alignment; assume no mesaoptimisers, that zero-shot deception is hard, that a fundamentally humanish ontology is learned, that there are no simulated agents, that noise in the data means that human preferences are not ruled out, and that alignment is a superficial feature. Assume that task reliability is enough (that tuning for what we want will also get us avoidance of what we don't want). Maybe assume that thoughts are translucent.
One-sentence summary: don’t evaluate the model, evaluate the humans / organisation / deployment method instead: assess the risk of a particular protocol for deploying your model in a particular scenario, by silently swapping in an intentionally misaligned model. Play out the scenario in sim until 1) a catastrophe occurs, 2) your protocol confidently spots the red team model, or 3) the scenario ends without catastrophe. Also do adversarial evaluation (monitoring untrusted model outputs with both trusted and untrusted models). Predecessor.
Theory of change: prevent high-stakes failures by automating risk assessment of eval and deployment plans.
See also: safety cases.
Which orthodox alignment problems could it help with?: 12. A boxed AGI might exfiltrate itself by steganography, spearphishing
Target case: worst-case
Broad approach: engineering / behavioural
Some names: Redwood, Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan, Alex Mallen
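A toy version of the trusted-monitoring style of protocol described above (every model call is a stub and the thresholds are made up; the point is only the shape of the loop):

```python
import random

def untrusted_model(task):      # possibly-scheming strong model (stubbed)
    return {"task": task, "action": "proposed_code", "is_attack": random.random() < 0.05}

def trusted_monitor(action):    # weaker but trusted model scores suspiciousness in [0, 1] (stubbed)
    return 0.9 if action["is_attack"] else random.random() * 0.5

def trusted_model(task):        # weaker trusted fallback (stubbed)
    return {"task": task, "action": "safe_but_worse_code", "is_attack": False}

AUDIT_THRESHOLD, AUDIT_BUDGET = 0.8, 20
audits_left, catastrophes = AUDIT_BUDGET, 0

for t in range(1000):
    proposal = untrusted_model(f"task_{t}")
    suspicion = trusted_monitor(proposal)
    if suspicion > AUDIT_THRESHOLD and audits_left > 0:
        audits_left -= 1                        # send to (simulated) human audit
        if proposal["is_attack"]:
            print(f"caught attack at step {t}; shut down")
            break
        executed = proposal
    elif suspicion > AUDIT_THRESHOLD:
        executed = trusted_model(f"task_{t}")   # out of audits: defer to the trusted model
    else:
        executed = proposal
    catastrophes += executed["is_attack"]

print("uncaught attacks executed:", catastrophes)
```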
One-sentence summary: This is a multi-team agenda with some big differences. Something like: formally model the behavior of cyber-physical systems, define precise constraints on what actions can occur, and require AIs to provide safety proofs for their recommended actions (correctness and uniqueness). Get AI to (assist in) building a detailed world simulation which humans understand, elicit preferences over future states from humans, verify[4] that the AI adheres to coarse preferences[5]; plan using this world model and preferences.
Theory of change: make a formal verification system that can act as an intermediary between a human user and a potentially dangerous system and only let provably safe actions through. Notable for not requiring that we solve ELK. Does require that we solve ontology though.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution, 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake, 12. A boxed AGI might exfiltrate itself by steganography, spearphishing
Target case: (nearly) worst-case
Broad approach: cognitive
Some names: Yoshua Bengio, Max Tegmark, Steve Omohundro, David "davidad" Dalrymple, Joar Skalse, Stuart Russell, Ohad Kammar, Alessandro Abate, Fabio Zanassi
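A toy illustration of the gatekeeper idea, under heavy simplifying assumptions of ours (one-dimensional dynamics and interval arithmetic in place of real proof checking): the proposer's action only executes if the checker can verify it keeps the world model inside the safe set.

```python
# Toy world model: x_{t+1} = x_t + a, with the safety spec |x| <= 10 at all times.
SAFE_BOUND = 10.0

def verified_safe(x_interval, action):
    """Conservative check: for every state consistent with our interval of
    uncertainty, does the action keep the next state within the safe set?"""
    lo, hi = x_interval
    return abs(lo + action) <= SAFE_BOUND and abs(hi + action) <= SAFE_BOUND

def untrusted_proposer(x_interval):
    """Stand-in for a capable AI proposing an action (possibly unsafe)."""
    return 4.0

x_interval = (-1.0, 1.0)          # our uncertainty about the current state
for step in range(5):
    a = untrusted_proposer(x_interval)
    if not verified_safe(x_interval, a):
        a = 0.0                   # reject and fall back to a known-safe action
    x_interval = (x_interval[0] + a, x_interval[1] + a)
    print(step, a, x_interval)
```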
One-sentence summary: reorient the general thrust of AI research towards provably beneficial systems.
Theory of change: understand what kinds of things can go wrong when humans are directly involved in training a model → build tools that make it easier for a model to learn what humans want it to learn.
See also RLHF and recursive reward modelling, the industrialised forms.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 10. Humanlike minds/goals are not necessarily safe
Target case: varies
Broad approach: engineering, cognitive
Some names: Joar Skalse, Anca Dragan, Stuart Russell, David Krueger
One-sentence summary: Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work; this will involve symbol grounding. Newest iteration of a sustained and novel agenda.
Theory of change: Fairly direct alignment via changing training to reflect actual human reward. Get actual data about (reward, training data) → (human values) to help with theorising this map in AIs; "understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients".
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify.
One-sentence summary: understand what an LLM’s normal (~benign) functioning looks like and detect divergence from this, even if we don't understand the exact nature of that divergence.
Theory of change: build models of normal functioning → find and flag behaviors that look unusual → match the unusual behaviors to problematic outcomes or shut it down outright.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
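One common concrete instantiation of this idea (our illustrative assumption, not necessarily this team's method): fit a Gaussian to activations gathered during normal functioning and flag inputs whose activations land unusually far from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for hidden activations: "normal" behaviour vs. a shifted distribution.
d = 32
normal_acts = rng.normal(0, 1, size=(5000, d))
weird_acts = rng.normal(0.8, 1.3, size=(10, d))

# Fit a Gaussian to the trusted/normal activations.
mu = normal_acts.mean(axis=0)
cov = np.cov(normal_acts, rowvar=False) + 1e-6 * np.eye(d)
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

# Flag anything further from "normal" than 99.9% of the trusted data.
threshold = np.quantile([mahalanobis(x) for x in normal_acts], 0.999)
flags = [mahalanobis(x) > threshold for x in weird_acts]
print("flagged", sum(flags), "of", len(weird_acts), "unusual inputs")
```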
One-sentence summary: now focusing on developing robust white-box dishonesty-detection methods for LLMs and model evals. Previously working on concept-based interpretability.
Theory of change: Build and benchmark strong white-box methods to assess trustworthiness and increase transparency of models, and encourage open releases / evals from labs by demonstrating the benefits and necessity of such methods.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
Target case: pessimistic / worst-case
Broad approach: cognitive
Some names: Kieron Kretschmar, Walter Laurito, Sharan Maiya, Grégoire Dhimoïla
One-sentence summary: shoggoth/face + paraphraser. Avoid giving the model incentives to hide its deceptive cognition or steganography. You could do this with an o1-like design, where the base model is not optimised for agency or alignment.
Theory of change: keep the CoT unoptimised and informative so that it can be used for control. Make it so we can see (most) misalignment in the hidden CoT.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors
Publicly-announced funding 2023-4: $505,000 for the AI Futures Project
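A schematic of the shoggoth/face + paraphraser pipeline (all three calls below are dummy stand-ins; the only point is where the paraphrase step sits: between the reasoning model and anything that optimises or displays the answer, so steganography in the CoT doesn't survive):

```python
def shoggoth(prompt: str) -> str:
    """Stand-in for the reasoning model: produces a raw, unoptimised chain of thought."""
    return f"step 1: parse '{prompt}'; step 2: recall facts; step 3: draft answer"

def paraphraser(cot: str) -> str:
    """Stand-in for a weaker model that rewrites the CoT in its own words,
    destroying any hidden encoding while keeping the semantic content."""
    return cot.replace("step", "stage")   # trivially illustrative

def face(paraphrased_cot: str) -> str:
    """Stand-in for the user-facing model, which only ever sees the paraphrased CoT."""
    return f"Answer based on: {paraphrased_cot}"

raw_cot = shoggoth("What is 2+2?")
monitor_sees = paraphraser(raw_cot)       # monitors and the face only ever see this
print(face(monitor_sees))
```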
Indirect deception monitoring
One-sentence summary: build tools to find whether a model will misbehave in high stakes circumstances by looking at it in testable circumstances. This bucket catches work on lie classifiers, sycophancy, Scaling Trends For Deception.
Theory of change: maybe we can catch a misaligned model by observing dozens of superficially unrelated parts, or tricking it into self-reporting, or by building the equivalent of brain scans.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors[7]
Target case: pessimistic
Broad approach: engineering
Some names: Anthropic, Monte MacDiarmid, Meg Tong, Mrinank Sharma, Owain Evans, Colognese
One-sentence summary: a sort of interpretable finetuning. Let's see if we can programmatically modify activations to steer outputs towards what we want, in a way that generalises across models and topics. As much an intervention-based approach to interpretability as about control.
Theory of change: test interpretability theories; find new insights from interpretable causal interventions on representations. Or: build more stuff to stack on top of finetuning. Slightly encourage the model to be nice, add one more layer of defence to our bundle of partial alignment methods.
See also: representation engineering, SAEs.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution, 5. Instrumental convergence, 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake
Target case: pessimistic
Broad approach: engineering/cognitive
Some names: Jan Wehner, Alex Turner, Nina Panickssery, Marc Carauleanu, Collin Burns, Andrew Mack, Pedro Freire, Joseph Miller, Andy Zou, Andy Arditi, Ole Jorgensen.
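A minimal contrastive-steering sketch, assuming a HuggingFace-style GPT-2 for concreteness (the layer, coefficient and prompts are arbitrary choices of ours, not any particular paper's recipe):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
layer, coeff = 6, 4.0   # illustrative choices

def last_token_act(prompt):
    ids = tok(prompt, return_tensors="pt")
    hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[layer + 1][0, -1, :]            # output of block `layer`, last position

# Steering vector = difference of activations on contrastive prompts.
with torch.no_grad():
    steer = last_token_act("I love you") - last_token_act("I hate you")

def hook(module, inputs, output):
    # GPT-2 blocks return a tuple; add the steering vector to the hidden states.
    return (output[0] + coeff * steer,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(hook)
ids = tok("I think that you are", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```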
One-sentence summary: avoid Goodharting by getting AI to satisfice rather than maximise.
Theory of change: if we fail to exactly nail down the preferences for a superintelligent agent we die to Goodharting → shift from maximising to satisficing in the agent’s utility function → we get a nonzero share of the lightcone as opposed to zero; also, moonshot at this being the recipe for fully aligned AI.
Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution
Target case: pessimistic
Broad approach: cognitive
Some names: Jobst Heitzig, Simon Fischer, Jessica Taylor
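The quantilizer version of this idea in a few lines (toy utility and base distribution, our stand-ins): instead of taking the proxy-optimal action, sample uniformly from the top q-fraction of actions drawn from a trusted base distribution, which bounds how hard the proxy gets Goodharted.

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_utility(a):
    # Toy proxy: rewards large |a|; the (unmodelled) true utility collapses for extreme a.
    return abs(a)

def quantilize(base_samples, q=0.1):
    """Sample uniformly from the top q-fraction of base-distribution actions,
    rather than taking the single proxy-optimal action."""
    utils = np.array([proxy_utility(a) for a in base_samples])
    cutoff = np.quantile(utils, 1 - q)
    return rng.choice(base_samples[utils >= cutoff])

base_samples = rng.normal(0, 1, size=10_000)     # "what a human/base policy might do"
print("maximiser picks:", base_samples[np.argmax(np.abs(base_samples))])
print("0.1-quantilizer picks:", quantilize(base_samples, q=0.1))
```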
One-sentence summary: make tools to write, execute and deploy cognitive programs; compose these into large, powerful systems that do what we want; make a training procedure that lets us understand what the model does and does not know at each step; finally, partially emulate human reasoning.
Theory of change: train a bounded tool AI to promote AI benefits without needing unbounded AIs. If the AI uses similar heuristics to us, it should default to not being extreme.
Which orthodox alignment problems could it help with?: 2. Corrigibility is anti-natural, 5. Instrumental convergence
Target case: pessimistic
Broad approach: engineering, cognitive
Some names: Connor Leahy, Gabriel Alfour, Adam Shimi
One-sentence summary: use weaker models to supervise and provide a feedback signal to stronger models.
Theory of change: find techniques that do better than RLHF at supervising superior models → track whether these techniques fail as capabilities increase further
Which orthodox alignment problems could it help with?: 8. Superintelligence can hack software supervisors
Target case: optimistic
Broad approach: engineering
Some names: Jan Leike, Collin Burns, Nora Belrose, Zachary Kenton, Noah Siegel, János Kramár, Noah Goodman, Rohin Shah
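A small illustration of the weak-to-strong setup with scikit-learn stand-ins for "weak supervisor" and "strong student" (the models and data are toys; the performance-gap-recovered metric follows the published framing):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, n_informative=10, random_state=0)
X_sup, X_rest, y_sup, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Weak supervisor: a deliberately limited model (sees only 4 of the 20 features).
weak = LogisticRegression().fit(X_sup[:, :4], y_sup)
weak_labels = weak.predict(X_train[:, :4])

# Strong student trained on the weak supervisor's (noisy) labels.
strong_on_weak = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)
# Ceiling: the same strong model trained on ground truth.
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

acc_weak = accuracy_score(y_test, weak.predict(X_test[:, :4]))
acc_w2s = accuracy_score(y_test, strong_on_weak.predict(X_test))
acc_ceiling = accuracy_score(y_test, strong_ceiling.predict(X_test))
pgr = (acc_w2s - acc_weak) / (acc_ceiling - acc_weak)    # performance gap recovered
print(f"weak {acc_weak:.2f}, weak-to-strong {acc_w2s:.2f}, ceiling {acc_ceiling:.2f}, PGR {pgr:.2f}")
```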
One-sentence summary: scalable tracking of behavioural drift, benchmarks for self-modification.
Theory of change: early models train ~only on human data while later models also train on early model outputs, which leads to early model problems cascading; left unchecked this will likely cause problems, so we need a better iterative improvement process.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
Target case: pessimistic
Broad approach: behavioural
Some names: Roman Engeler, Akbir Khan, Ethan Perez
One-sentence summary: Train human-plus-LLM alignment researchers: with humans in the loop and without outsourcing to autonomous agents. More than that, an active attitude towards risk assessment of AI-based AI alignment.
Theory of change: Cognitive prosthetics to amplify human capability and preserve values. More alignment research per year and dollar.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake
One-sentence summary: Make open AI tools to explain AIs, including agents. E.g. feature descriptions for neuron activation patterns; an interface for steering these features; behavior elicitation agent that searches for user-specified behaviors from frontier models
Theory of change: Introducing Transluce; improve interp and evals in public and get invited to improve lab processes.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
Funded by: Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba.
Publicly-announced funding 2023-4: N/A
Task decomp
Recursive reward modelling is supposedly not dead but instead one of the tools OpenAI will build. Another line tries to make something honest out of chain of thought / tree of thought.
One-sentence summary: “make highly capable agents do what humans want, even when it is difficult for humans to know what that is”.
Theory of change: [“Give humans help in supervising strong agents”] + [“Align explanations with the true reasoning process of the agent”] + [“Red team models to exhibit failure modes that don’t occur in normal use”] are necessary but probably not sufficient for safe AGI.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 7. Superintelligence can fool human supervisors
Target case: worst-case
Broad approach: engineering, cognitive
Some names: Rohin Shah, Jonah Brown-Cohen, Georgios Piliouras
One-sentence summary: scalable oversight of truthfulness: is it possible to develop training methods that incentivize truthfulness even when humans are unable to directly judge the correctness of a model’s output? Also scalable benchmarking: how to measure (proxies for) speculative capabilities like situational awareness.
Theory of change: current methods like RLHF will falter as frontier AI tackles harder and harder questions → we need to build tools that help human overseers continue steering AI → let’s develop theory on what approaches might scale → let’s build the tools.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors
Target case: pessimistic
Broad approach: behavioural
Some names: Sam Bowman, Ethan Perez, He He, Mengye Ren
One-sentence summary: try to formalise a more realistic agent, understand what it means for it to be aligned with us, translate between its ontology and ours, and produce desiderata for a training setup that points at coherent AGIs similar to our model of an aligned agent.
Theory of change: fix formal epistemology to work out how to avoid deep training problems.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 9. Humans cannot be first-class parties to a superintelligent value handshake
One-sentence summary: Get the thing to work out its own objective function (a la HCH).
Theory of change: make a fully formalized goal such that a computationally unbounded oracle with it would take desirable actions; and design a computationally bounded AI which is good enough to take satisfactory actions.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution
One-sentence intro: use causal models to understand agents. Originally this was to design environments where they lack the incentive to defect, hence the name.
Theory of change: as above.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution
Target case: pessimistic
Broad approach: behavioural/maths/philosophy
Some names: Tom Everitt, Matt McDermott, Francis Rhys Ward, Jonathan Richens, Ryan Carey
One-sentence summary: Develop formal models of subagents and superagents, use the model to specify desirable properties of whole-part relations (e.g. how to prevent human-friendly parts getting wiped out).
Theory of change: Solve self-unalignment, prevent destructive alignment, allow for scalable noncoercion.
See also Alignment of Complex Systems, multi-scale alignment, scale-free theories of agency, active inference, bounded rationality.
Which orthodox alignment problems could it help with?: 5. Instrumental convergence, 9. Humans cannot be first-class parties to a superintelligent value handshake
Target case: pessimistic
Broad approach: cognitive
Some names: Jan Kulveit, Roman Leventov, Scott Viteri, Michael Levin, Ivan Vendrov, Richard Ngo
One-sentence summary: model the internal components of agents, use humans as a model organism of AGI (humans seem made up of shards and so might AI). Now more of an empirical ML agenda.
Theory of change: If policies are controlled by an ensemble of influences ("shards"), consider which training approaches increase the chance that human-friendly shards substantially influence that ensemble.
Theory of change: generalize theorems → formalize agent foundations concepts like the agent structure problem → hopefully assist other projects through increased understanding
Which orthodox alignment problems could it help with?: "intended to help make progress on understanding the nature of the problems through formalization, so that they can be avoided or postponed, or more effectively solved by other research agenda."
Target case: pessimistic
Broad approach: maths/philosophy
Some names: Alex Altair, Alfred Harwood, Daniel C, Dalcy K, José Pedro Faustino
One-sentence summary: what is “optimisation power” (formally), how do we build tools that track it, and how relevant is any of this anyway. See also developmental interpretability.
Theory of change: existing theories are either rigorous OR good at capturing what we mean; let’s find one that is both → use the concept to build a better understanding of how and when an AI might get more optimisation power. Would be nice if we could detect or rule out speculative stuff like gradient hacking too.
Which orthodox alignment problems could it help with?: 5. Instrumental convergence
Target case: pessimistic
Broad approach: maths/philosophy
Some names: Alex Flint, Guillaume Corlouer, Nicolas Macé
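One classic candidate formalization, given here as a toy rather than as this group's settled definition: optimisation power exerted is how improbable it would be to do at least this well under a baseline no-optimisation distribution, measured in bits.

```python
import numpy as np

rng = np.random.default_rng(0)

def utility(x):
    return -np.sum((x - 3.0) ** 2)          # toy objective: be near the point (3, ..., 3)

# Baseline: what outcomes look like with no optimisation (random states).
baseline = np.array([utility(rng.normal(0, 1, size=5)) for _ in range(100_000)])

def optimisation_power_bits(achieved_utility):
    """Bits of optimisation = -log2 of the probability that the baseline
    does at least as well as the optimiser did."""
    p = np.mean(baseline >= achieved_utility)
    return float("inf") if p == 0 else -np.log2(p)

# A crude hill-climber as the "optimiser".
x = rng.normal(0, 1, size=5)
for _ in range(50):
    cand = x + rng.normal(0, 0.1, size=5)
    if utility(cand) > utility(x):
        x = cand

print(f"{optimisation_power_bits(utility(x)):.1f} bits of optimisation exerted")
```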
One-sentence summary: predict properties of AGI (e.g. powerseeking) with formal models. Corrigibility as the opposite of powerseeking.
Theory of change: figure out hypotheses about properties powerful agents will have → attempt to rigorously prove under what conditions the hypotheses hold, test them when feasible.
(Figuring out how AI agents think about the world and how to get superintelligent agents to tell us what they know. Much of interpretability is incidentally aiming at this. See also latent knowledge.)
One-sentence summary: check the hypothesis that our universe “abstracts well” and that many cognitive systems learn to use similar abstractions. Check if features correspond to small causal diagrams corresponding to linguistic constructions.
Theory of change: find all possible abstractions of a given computation → translate them into human-readable language → identify useful ones like deception → intervene when a model is using it. Also develop theory for interp more broadly; more mathematical analysis. Also maybe enables “retargeting the search” (direct training away from things we don’t want).
See also: causal abstractions, representational alignment, convergent abstractions
Which orthodox alignment problems could it help with?: 5. Instrumental convergence, 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake
Target case: worst-case
Broad approach: cognitive
Some names: John Wentworth, Paul Colognese, David Lorrell, Sam Eisenstat
One-sentence summary: mech interp plus formal verification. Formalize mechanistic explanations of neural network behavior, so as to predict when novel input may lead to anomalous behavior.
Theory of change: find a scalable method to predict when any model will act up. Very good coverage of the group’s general approach here.
See also: ELK, mechanistic anomaly detection.
Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution, 8. Superintelligence can hack software supervisors
Target case: worst-case
Broad approach: cognitive, maths/philosophy
Some names: Jacob Hilton, Mark Xu, Eric Neyman, Dávid Matolcsi, Victor Lecomte, George Robinson
One-sentence summary: future agents creating s-risks is the worst of all possible problems; we should avoid that.
Theory of change: make present and future AIs inherently cooperative via improving theories of cooperation and measuring properties related to catastrophic conflict.
See also: FOCAL
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 3. Pivotal processes require dangerous capabilities, 4. Goals misgeneralize out of distribution
Target case: worst-case
Broad approach: maths/philosophy
Some names: Jesse Clifton, Caspar Oesterheld, Anthony DiGiovanni, Maxime Riché, Mia Taylor
One-sentence summary: make sure advanced AI uses what we regard as proper game theory.
Theory of change: (1) keep the pre-superintelligence world sane by making AIs more cooperative; (2) remain integrated in the academic world, collaborate with academics on various topics and encourage their collaboration on x-risk; (3) hope that work on “game theory for AIs”, which emphasises cooperation and benefit to humans, has framing & founder effects on the new academic field.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 10. Humanlike minds/goals are not necessarily safe
Target case: pessimistic
Broad approach: maths/philosophy
Some names: Vincent Conitzer, Caspar Oesterheld, Vojta Kovarik
One-sentence summary: remain ahead of the capabilities curve/maintain ability to figure out what’s up with state of the art models, keep an updated risk profile, propagate flaws to relevant parties as they are discovered.
One-sentence summary: model evaluations and conceptual work on deceptive alignment. Also an interp agenda (decompose NNs into components more carefully and in a more computation-compatible way than SAEs). Also deception evals in major labs.
Theory of change: “Conduct foundational research in interpretability and behavioural model evaluations, audit real-world models for deceptive alignment, support policymakers with our technical expertise where needed.”
Which orthodox alignment problems could it help with?: 2. Corrigibility is anti-natural, 4. Goals misgeneralize out of distribution
Target case: pessimistic
Broad approach: behavioural/cognitive
Some names: Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Mikita Balesni
One-sentence summary: do what needs doing, any type of work
Theory of change: make the field more credible. Make really good benchmarks, integrate academia into the field, advocate for safety standards and help design legislation.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 10. Humanlike minds/goals are not necessarily safe, 13. Fair, sane pivotal processes
Target case: mixed
Broad approach: mixed
Some names: Dan Hendrycks, Andy Zou, Mantas Mazeika, Jacob Steinhardt, Dawn Song (some of these are not full-time at CAIS though).
One-sentence summary: theory generation, threat modelling, and toy methods to help with those. “Our main threat model is basically a combination of specification gaming and goal misgeneralisation leading to misaligned power-seeking.” See announcement post for full picture.
Theory of change: direct the training process towards aligned AI and away from misaligned AI: build enabling tech to ease/enable alignment work → apply said tech to correct missteps in training non-superintelligent agents → keep an eye on it as capabilities scale to ensure the alignment tech continues to work.
See also (in this document): Process-based supervision, Red-teaming, Capability evaluations, Mechanistic interpretability, Goal misgeneralisation, Causal alignment/incentives
Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution, 7. Superintelligence can fool human supervisors
Target case: pessimistic
Broad approach: engineering
Some names: Rohin Shah, Anca Dragan, Allan Dafoe, Dave Orr, Sebastian Farquhar
One-sentence summary: “(a) improved reasoning of AI governance & alignment researchers, particularly on long-horizon tasks, and (b) pushing supervision of process rather than outcomes, which reduces the optimisation pressure on imperfect proxy objectives, leading to ‘safety by construction’.”
Theory of change: “The two main impacts of Elicit on AI Safety are improving epistemics and pioneering process supervision.”
One-sentence summary: a science of robustness / fault tolerant alignment is their stated aim, but they do lots of interpretability papers and other things.
Theory of change: make AI systems less exploitable and so prevent one obvious failure mode of helper AIs / superalignment / oversight: attacks on what is supposed to prevent attacks. In general, work on overlooked safety research others don’t do for structural reasons: too big for academia or independents, but not totally aligned with the interests of the labs (e.g. prototyping moonshots, embarrassing issues with frontier models).
Some names: Adrià Garriga-Alonso, Adam Gleave, Chris Cundy, Mohammad Taufeeque, Kellin Pelrine
One-sentence summary: funds academics or near-academics to do ~classical safety engineering on AIs. A collaboration between the NSF and OpenPhil. Projects include
“Neurosymbolic Multi-Agent Systems”
“Conformal Safe Reinforcement Learning”
“Autonomous Vehicles”
Theory of change: apply safety engineering principles from other fields to AI safety.
Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution
# FTEs: “80”. But this includes lots working on bad-words prevention and copyright-violation prevention.
They just lost Lilian Weng, their VP of safety systems.
OpenAI Alignment Science
One-sentence summary: use reasoning systems to prevent models from generating unsafe outputs. Unclear if this is a decoding-time thing (i.e. actually a control method) or a fine-tuning thing.
Some names: Mia Glaese, Boaz Barak, Johannes Heidecke, Melody Guan. Lost its head, John Schulman.
One-sentence summary: Fundamental research in LLM security, plus capability demos for outreach, plus workshops.
Theory of change: control is much easier if we can secure the datacenter / if hacking becomes much harder. The cybersecurity community need to be alerted.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors, 8. Superintelligence can hack software supervisors, 12. A boxed AGI might exfiltrate itself by steganography, spearphishing
Target case: worst-case
Broad approach: engineering
Some names: Jeffrey Ladish, Charlie Rogers-Smith, Ben Weinstein-Raun, Dmitrii Volkov
One-sentence summary: figure out how a model works, automatically. Diagnose its trustworthiness, improve its trustworthiness, guarantee its trustworthiness.
Theory of change: automatically extract the knowledge learned during training, then reimplement it in an architecture where we can formally verify that it will do what we want. Replace AGI.
See also: SAEs, concept-based interp, provably safe systems, program synthesis, this.
Which orthodox alignment problems could it help with?: most, by avoiding opaque AGI.
Target case: worst-case
Broad approach: cognitive
Some names: Ziming Liu, Peter Park, Eric Michaud, Wes Gurnee
One-sentence summary: technical research to enable sensible governance, with leverage from government mandates.
Theory of change: improve evals, measure harms, develop a method for real AI safety cases, help governments understand the current safety situation, build an international consensus.
Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake, 13. Fair, sane pivotal processes
Target case: pessimistic
Broad approach: behavioural
Some names: Geoffrey Irving, Benjamin Hilton, Yarin Gal, JJ Allaire
Old-school OpenAI: Ilya Sutskever, Alec Radford, Jacob Hilton, Richard Ngo, Miles Brundage, Lilian Weng, Jan Leike, John Schulman, Andrej Karpathy (again), Daniel Kokotajlo, William Saunders, Cullen O’Keefe, Carroll Wainwright, Ryan Lowe. Some of them continue to work on alignment elsewhere.
The team “OpenAI AGI Readiness Team”.
The name “OpenAI Superalignment Team”.
We again omit technical governance, AI policy, and activism. This is even more of an omission than it was last year, so see other reviews.
We started with last year’s list and moved any agendas without public outputs this year to the Graveyard. We also listed agendas known to be inactive there.
An agenda is an odd unit; it can be larger than one team, and researchers and agendas are often in a many-to-many relation. Organising by agenda also excludes illegible or exploratory research – anything which doesn’t have a manifesto.
All organisations have private info, and in all cases we’re working off public info, so remember we will be systematically off by some measure.
We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.
The target case: what part of the distribution over alignment difficulty do they aim to help with? (via Ngo)
“optimistic-case”[9]: if CoT is faithful, pretraining as value loading, no stable mesa-optimizers, the relevant scary capabilities are harder than alignment, zero-shot deception is hard, goals are myopic, etc
pessimistic-case: if we’re in-between the above and the below
worst-case: if power-seeking is rife, zero-shot deceptive alignment, steganography, gradient hacking, weird machines, weird coordination, deep deceptiveness
The broad approach: roughly what kind of work is it doing, primarily? (via Ngo)
engineering: iterating over outputs
behavioural: understanding the input-output relationship
cognitive: understanding the algorithms
maths/philosophy[10]: providing concepts for the other approaches
As they are largely outside the scope of this review, subproblem 6 (Pivotal processes likely require incomprehensibly complex plans) does not appear at all, and the following appear only rarely, with large error bars for accuracy:
11. Someone else will deploy unsafe superintelligence first
13. Fair, sane pivotal processes
We added some new agendas, including by scraping relevant papers from arXiv and ML conferences. We scraped every Alignment Forum post and reviewed the top 100 posts by karma and novelty. The inclusion criterion is vibes: whether it seems relevant to us.
We dropped the operational criteria this year because we made our point last year and it’s clutter.
Lastly, we asked some reviewers to comment on the draft.
Thanks to Vanessa Kosoy, Nora Ammann, Erik Jenner, Justin Shovelain, Gabriel Alfour, Raymond Douglas, Walter Laurito, Shoshannah Tekofsky, Jan Hendrik Kirchner, Dmitry Vaintrob, Leon Lang, Tushita Jha, Leonard Bereska, and Mateusz Bagiński for comments. Thanks to Joe O’Brien for sharing their taxonomy. Thanks to our Manifund donors and to OpenPhil for top-up funding.
Vanessa Kosoy notes: ‘IMHO this is a very myopic view. I don't believe plain foundation models will be transformative, and even in the world in which they will be transformative, it will be due to implicitly doing RL "under the hood".’
This is fine as a standalone description, but in practice lots of interp work is aimed at interventions for alignment or control. This is one reason why there’s no overarching “Alignment” category in our taxonomy.
Nora Ammann notes: “I typically don’t cash this out into preferences over future states, but what parts of the statespace we define as safe / unsafe. In SgAI, the formal model is a declarative model, not a model that you have to run forward. We also might want to be more conservative than specifying preferences and instead "just" specify unsafe states -- i.e. not ambitious intent alignment.”