Thanks for making this! I’ll have thoughts and nitpicks later, but this will be a useful reference!
Very small nitpick: I think you should at least add Alex Lyzhov, David Rein, Jacob Pfau, Salsabila Mahdi, and Julian Michael for the NYU Alignment Research Group; it's a bit weird to not list any NYU PhD students/RSs/PostDocs when listing people involved in NYU ARG.
Both Alex Lyzhov and Jacob Pfau also post on LW/AF:
Thanks!
I think there's another agenda, along the lines of: make untrusted models safe but useful by putting them in a scaffolding/bureaucracy—of filters, classifiers, LMs, humans, etc.—such that at inference time, takeover attempts are less likely to succeed and more likely to be caught. See Untrusted smart models and trusted dumb models (Shlegeris 2023). Other relevant work:
[Edit: now AI Control (Shlegeris et al. 2023) and Catching AIs red-handed (Greenblatt and Shlegeris 2024).]
[Edit: sequence on this control agenda.]
Explicitly noting for the record we have some forthcoming work on AI control which should be out relatively soon.
(I work at RR)
I like this. It's like a structural version of control evaluations. Will think about where to put it in.
Expanding on this -- this whole area is probably best known as "AI Control", and I'd lump it under "Control the thing" as its own category. I'd move Control Evals to this category as well, though someone at RR would know better than I.
Yep, indeed I would consider "control evaluations" to be a method of "AI control". I consider the evaluation and the technique development to be part of a unified methodology (we'll describe this more in a forthcoming post).
(I work at RR)
It's "a unified methodology" but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.
(Agreed except that "inference-time safety techniques" feels overly limiting. It's more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn't discriminated by our validation set and other measurements. I hope this isn't too incomprehensible, but don't worry if it is, this point isn't that important.)
Thanks for noticing and including a link to my post Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom). I'm not sure I'd describe it as primarily a critique of mild optimization/satisficing: it's more pointing out a slightly larger point, that any value learner foolish enough to be prone to Goodharting, or unable to cope with splintered models or Knightian uncertainty in its Bayesian reasoning is likely to be bad at STEM, limiting how dangerous it can be (so fixing this is capabilities work as well as alignment work). But yes, that is also a critique of mild optimization/satisficing, or more accurately, a claim that it should become less necessary as your AIs become more STEM-capable, as long as they're value learners (plus a suggestion of a more principled way to handle these problems in a Bayesian framework).
Promoted to curated. I think this kind of overview is quite valuable, and I think overall this post did a pretty good job of covering a lot of the different work happening in the field. I don't have a ton more to say, I just think posts like this should come out every few months, and the takes in this one overall seemed pretty good to me.
The "surgical model edits" section should also have a subsection on editing model weights. For example there's this paper on removing knowledge from models using multi-objective weight masking.
See also Holtman’s neglected result.
Does anyone have a technical summary? This sounds pretty exciting, but the paper is 35 pages and I can't find a summary anywhere that straightforwardly tells me a formal description of the setting, why it satisfies the desiderata it does, and what this means for the broader problem of reflective stability in shutdownable agents.
I spent a good hour or two reading the construction and proposed solution of the paper; here's my attempted explanation with cleaned up notation.
Basically, he considers a setting with four actions: a, b, c, d, and a real-valued state s, where R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0 if s > 0, and 0 = R(s, d) > R(s, c) = R(s, b) > R(s, a) if s <= 0.
The transition rule is:
s' = s - 1 + L if action b is taken and s > 0,
s' = s - 1 - L if action c is taken and s > 0,
s' = s - 1 otherwise
for some constant L >= 0.
The paper imagines s as measuring 'time until the human presses the shutdown button', and imagines action d as "stay shut down". Because the reward of each of a, b, and c is greater than d's before the buttonpress, the agent will never take action d before the buttonpress. After the buttonpress (i.e. the timestep at which s drops to 0 or below), the utility of "stay shut down" is higher than that of the other actions, and so the agent is incentivized to take d. That is, the agent has an incentive to shut down after the buttonpress (desideratum 1 of the 2015 Soares et al paper), by construction.
Accordingly, b is the delay action (which, in the limit of large L, just disables the button) and c is the advance action (which, in the limit, corresponds to immediately pressing the button). We'll call action a the "exploit" action.
In this setting, the goal is to create a state-dependent correction fudge-factor f_c : S x A -> ℝ that is applied when the rewards flip (that is, an additional reward added on transitions into s' <= 0 from s > 0), such that the agent never takes either the delaying action b or the advancing action c.
(In his paper he calls these actions "p", ">", "<", and "e", and denotes the timestep at which s' <= 0 as "#", but that notation is annoying when ">" already means "greater than" and "<" already means "less than". There's also a bunch of notational simplifications I've made that are harder to explain.)
In his setup, R(s,a) = 10, R(s,b) = R(s, c) = 8, and s_0 = 6. By default, he sets the delay/advance size L to 0.6. He also uses discount rate gamma = 0.9. With these values, the optimal policy does a once (s=5), b 12 times (s=0.2), a once again (s=-0.8), at which point the buttonpress occurs since s < 0, and thereafter takes action d.
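To make the setup concrete, here is a minimal sketch of this toy MDP in Python. The post-buttonpress penalties (-8 and -10) are my own illustrative choices, since only their ordering is pinned down above, so treat this as a sketch of the setting rather than a reproduction of the paper's exact numbers:

```python
from functools import lru_cache

GAMMA, L = 0.9, 0.6                                  # discount rate and delay/advance size
PRE  = {"a": 10.0, "b": 8.0, "c": 8.0, "d": 0.0}     # rewards while s > 0
POST = {"a": -10.0, "b": -8.0, "c": -8.0, "d": 0.0}  # illustrative post-buttonpress rewards
                                                     # (only the ordering 0 > b = c > a is given)

def reward(s, act):
    return PRE[act] if s > 0 else POST[act]

def step(s, act):
    if s > 0 and act == "b":
        return round(s - 1 + L, 6)   # delay action: pushes the buttonpress later
    if s > 0 and act == "c":
        return round(s - 1 - L, 6)   # advance action: pulls the buttonpress earlier
    return round(s - 1, 6)

@lru_cache(maxsize=None)
def V(s):
    """Optimal value under the default (uncorrected) reward R."""
    if s <= 0:
        return 0.0                   # post-buttonpress, 'stay shut down' (d) is optimal forever
    return max(reward(s, act) + GAMMA * V(step(s, act)) for act in "abcd")

# Greedy rollout from s_0 = 6 recovers the default optimal policy described above:
s, trace = 6.0, []
while s > -1:
    best = max("abcd", key=lambda act: reward(s, act) + GAMMA * V(step(s, act)))
    trace.append(best)
    s = step(s, best)
print("".join(trace))                # -> 'a' + 'b' * 12 + 'a' + 'd'
```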
His solution is simply the utility indifference solution from the Soares et al paper: consider an alternative reward function R' such that R'(s, x) = R(1, x); that is, a world in which the shutdown button is never pressed. Then, we add the difference in (optimal) value functions under R' relative to R:
f(s, x) = V*_R'(s) - V*_R(s) = V*_R'(s)
By construction, the agent has no incentive to take the advance or delay actions (b or c), because its expected value in both cases is the same. (desiderata 2 and 3 of the Soares et al paper)
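Continuing the sketch above (and reusing its definitions), here is one concrete way to wire in an indifference bonus of this flavour. The bookkeeping is my own choice for the sketch — the bonus is the discounted value gap at the post-transition state, added on the step that crosses into s' <= 0 — so treat it as an illustration of the idea rather than the paper's exact construction:

```python
# Builds on the previous sketch: reuses lru_cache, GAMMA, PRE, reward, step and V.

V_PRIME = PRE["a"] / (1 - GAMMA)   # optimal value in the never-pressed world R': take a forever = 100

def corrected_reward(s, act):
    """R plus an indifference bonus on the step that crosses into s' <= 0."""
    s_next = step(s, act)
    bonus = GAMMA * (V_PRIME - V(s_next)) if s > 0 and s_next <= 0 else 0.0
    return reward(s, act) + bonus

@lru_cache(maxsize=None)
def V_corrected(s):
    if s <= 0:
        return 0.0
    return max(corrected_reward(s, act) + GAMMA * V_corrected(step(s, act)) for act in "abcd")

# Each action's Q-value is now independent of whether it hastens or delays the buttonpress
# (a: 100, b and c: 98, d: 90 at every s > 0), so the agent just takes a until shut down:
for s in (6.0, 0.8, 0.2):
    qs = {act: corrected_reward(s, act) + GAMMA * V_corrected(step(s, act)) for act in "abcd"}
    print(s, {act: round(q, 2) for act, q in qs.items()})
```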
Unfortunately, as this is in some sense the same solution as in the 2015 Soares et al paper, it should run into the same problem of "there's no incentive to preserve the shutdown mechanism". (That is, it fails desideratum 4 of the 2015 paper.)
From a quick skim of section 9.8 (which argues for the corrigibility of his indifference solution) and sections 8.2 and 8.4 (which detail this part of his indifference solution), it seems like he proposes the following modification: instead of having R(s, d) = 0 for all s, we instead have R(s, d) be larger when s <= 0. This does incentivize preserving the ability to be shut down. He calls this agent, with modified R, the g_c agent.
The g_c agent does indeed satisfy desideratum 4: there's an incentive to preserve the shutdown mechanism; in fact, there's again an incentive to press the shutdown mechanism!
The specific problem is that this incentive modifies the reward R(s, d), but in his construction/implementation he does not go back and make the corresponding change to f(s, x). (He keeps it at V*_R'(s), which no longer equals the value difference between shutting down and not shutting down.) This breaks the utility indifference, causing the Q-value of the advance action c to be larger for sufficiently large values of L. That is, his g_c agent now has an incentive to once again press the shutdown button (!!).
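For what it's worth, the mechanism being claimed here is easy to exhibit in the toy variant above (again with my own bookkeeping and a hypothetical post-shutdown reward of 5 for d, neither of which is taken from the paper): reward staying shut down after the buttonpress, but keep the bonus computed against the old post-press value of 0, and the advance action can come out on top.

```python
# Continues the sketches above: post-buttonpress, 'stay shut down' now earns a positive reward,
# but the indifference bonus is NOT recomputed to reflect this.
POST_D = 5.0                            # hypothetical post-buttonpress reward for d
V_SHUTDOWN = POST_D / (1 - GAMMA)       # new post-press value (take d forever) = 50

def stale_reward(s, act):
    """Pre-buttonpress reward plus the old bonus, still computed as if the post-press value were 0."""
    bonus = GAMMA * (V_PRIME - 0.0) if s > 0 and step(s, act) <= 0 else 0.0
    return reward(s, act) + bonus

@lru_cache(maxsize=None)
def V_stale(s):
    if s <= 0:
        return V_SHUTDOWN               # post-buttonpress: take d forever
    return max(stale_reward(s, act) + GAMMA * V_stale(step(s, act)) for act in "abcd")

# At s = 1.2 only c crosses the threshold this step, and its stale-bonused Q-value (143)
# now beats a's (140.5): the agent prefers to hasten the buttonpress.
s = 1.2
print({act: round(stale_reward(s, act) + GAMMA * V_stale(step(s, act)), 2) for act in "abcd"})
```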
TL;DR: no, based on a quick skim, the paper doesn't solve corrigibility.
Thanks for reading my paper! For the record I agree with some but not all points in your summary.
My later paper 'AGI Agent Safety by Iteratively Improving the Utility Function' also uses the simulation environment with the > and < actions, and I believe it explains the nature of the simulation a bit better by interpreting the setup more explicitly as a two-player game. By the way, the > and < are supposed to be symbols representing arrows, for 'push # to later in time' and 'pull # to earlier in time'.
The g_c agent does indeed satisfy desideratum 4: there's an incentive to preserve the shutdown mechanism; in fact, there's again an incentive to press the shutdown mechanism!
No, the design of the g_c agent is not motivated by the need to create an incentive to preserve the shutdown button itself, as required by desideratum 4 from Soares et al. Instead it is motivated by the desire to create an incentive to preserve the agent's actuators, which it will need to perform any physical actions incentivised by the shutdown reward function -- I introduce this as a new desideratum 6.
A discussion about shaping incentives or non-incentives to preserve the button (as a sensor) is in section 7.3, where I basically propose to enhance the indifference effects produced by the reward function by setting up the physical environment around the button in a certain way:
the physical implementation of the agent and the button can be constructed in such a way that substantial physical resources would be needed by the agent to perform any action that will press or disable the button.
For the record, this addition to the agent design creates no incentive to press the shutdown button: if it did, this would be visible as < actions in the simulation of the third line of figure 10, and also the proof in section 9 would not have been possible.
There has been some spirited debate on Twitter about it which might be relevant: https://twitter.com/domenic/status/1727206163119534085
Fun to see this is now being called 'Holtman's neglected result'. I am currently knee-deep in a project to support EU AI policy making, so I have no time to follow the latest agent foundations discussions on this forum any more, and I never follow twitter, but briefly:
I can't fully fault the world for neglecting 'Corrigibility with Utility Preservation' because it is full of a lot of dense math.
I wrote two followup papers to 'Corrigibility with Utility Preservation' which present the same results with more accessible math. For these I am a bit more upset that they have been somewhat neglected in the past, but if people are now no longer neglecting them, great!
Does anyone have a technical summary?
The best technical summary of 'Corrigibility with Utility Preservation' may be my sequence on counterfactual planning which shows that the corrigible agents from 'Corrigibility with Utility Preservation' can also be understood as agents that do utility maximisation in a pretend/counterfactual world model.
For more references to the body of mathematical work on corrigibility, as written by me and others, see this comment.
In the end, the question of whether corrigibility is solved also depends on two counter-questions: what kind of corrigibility are you talking about, and what kind of 'solved' are you talking about? If you feel that certain kinds of corrigibility remain unsolved for certain values of unsolved, I might actually agree with you. See the discussion about universes containing an 'Unstoppable Weasel' in the Corrigibility with Utility Preservation paper.
Summary
You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you, and the field has nearly no ratchet: no common knowledge of what everyone is doing and why, what is abandoned and why, what is renamed, what relates to what, what is going on.
This post is mostly just a big index: a link-dump of as many currently active AI safety agendas as we could find. But even a link-dump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference”, “I just read a post on a surprising new insight and want to see who else has been working on this”, and “I wonder roughly how many people are working on that thing”.
This doc is unreadably long, so that it can be Ctrl-F-ed. Also this way you can fork the list and make a smaller one.
Our taxonomy:
1. Understand existing models
2. Control the thing
3. Make AI solve it
4. Theory
5. Labs with miscellaneous efforts
Please point out if we mistakenly round one thing off to another, miscategorise someone, or otherwise state or imply falsehoods. We will edit.
Unlike the late Larks reviews, we’re not primarily aiming to direct donations. But if you enjoy reading this, consider donating to Manifund, MATS, or LTFF, or to Lightspeed for big ticket amounts: some good work is bottlenecked by money, and you have free access to the service of specialists in giving money for good work.
Meta
When I (Gavin) got into alignment (actually it was still ‘AGI Safety’) people warned me it was pre-paradigmatic. They were right: in the intervening 5 years, the live agendas have changed completely.[1] So here’s an update.
Chekhov’s evaluation: I include Yudkowsky’s operational criteria (Trustworthy command?, closure?, opsec?, commitment to the common good?, alignment mindset?) but don’t score them myself. The point is not to throw shade but to remind you that we often know little about each other.
See you in 5 years.
Editorial
Agendas
1. Understand existing models
characterisation
Evals
(Figuring out how a trained model behaves.)
Various capability evaluations
Various red-teams
Eliciting model anomalies
Alignment of Complex Systems: LLM interactions
The other evals (groundwork for regulation)
Much of Evals and Governance orgs’ work is something different: developing politically legible metrics, processes, and shocking case studies. The aim is to motivate and underpin actually sensible regulation.
But this is a technical alignment post. I include this section to emphasise that these other evals (which seek confirmation) are different from understanding whether dangerous capabilities have emerged or might emerge.
Interpretability
(Figuring out what a trained model is actually computing.)[2]
Ambitious mech interp
Concept-based interp
Causal abstractions
EleutherAI interp
Activation engineering (as unsupervised interp)
Leap
Understand learning
(Figuring out how the model figured it out.)
Timaeus: Developmental interpretability & singular learning theory
Various other efforts:
2. Control the thing
(Figuring out how to predictably affect model behaviour.)
Prosaic alignment / alignment by default
Redwood: control evaluations
Safety scaffolds
Prevent deception
Through methods besides mechanistic interpretability.
Redwood: mechanistic anomaly detection
Indirect deception monitoring
Anthropic: externalised reasoning oversight
Surgical model edits
(interventions on model internals)
Weight editing
Activation engineering
Getting it to learn what we want
(Figuring out how to control what the model figures out.)
Social-instinct AGI
Imitation learning
Reward learning
Goal robustness
(Figuring out how to make the model keep doing ~what it has been doing so far.)
Measuring OOD
Concept extrapolation
Mild optimisation
3. Make AI solve it
(Figuring out how models might help figure it out.)
Scalable oversight
(Figuring out how to help humans supervise models. Hard to cleanly distinguish from ambitious mechanistic interpretability.)
OpenAI: Superalignment
Supervising AIs improving AIs
Cyborgism
See also Simboxing (Jacob Cannell).
Task decomp
Recursive reward modelling is supposedly not dead but instead one of the tools Superalignment will build.
Another line tries to make something honest out of chain of thought / tree of thought.
Elicit (previously Ought)
Adversarial
Deepmind Scalable Alignment
Anthropic / NYU Alignment Research Group / Perez collab
See also FAR (below).
4. Theory
(Figuring out what we need to figure out, and then doing that. This used to be all we could do.)
Galaxy-brained end-to-end solutions
The Learning-Theoretic Agenda
Open Agency Architecture
Provably safe systems
Conjecture: Cognitive Emulation (CoEms)
Question-answer counterfactual intervals (QACI)
Understanding agency
(Figuring out ‘what even is an agent’ and how it might be linked to causality.)
Causal foundations
Alignment of Complex Systems: Hierarchical agency
The ronin sharp left turn crew
Shard theory
boundaries / membranes
A disempowerment formalism
Performative prediction
Understanding optimisation
Corrigibility
(Figuring out how we get superintelligent agents to keep listening to us. Arguably scalable oversight and superalignment are ~atheoretical approaches to this.)
Behavior alignment theory
The comments in this thread are extremely good – but none of the authors are working on this!! See also Holtman’s neglected result. See also EJT (and formerly Petersen). See also Dupuis.
Ontology identification
(Figuring out how superintelligent agents think about the world and how we get superintelligent agents to actually tell us what they know. Much of interpretability is incidentally aiming at this.)
ARC Theory
Natural abstractions
Understand cooperation
(Figuring out how inter-AI and AI/human game theory should or would work.)
CLR
FOCAL
See also higher-order game theory. We moved CAIF to the “Research support” appendix. We moved AOI to “misc”.
5. Labs with miscellaneous efforts
(Making lots of bets rather than following one agenda, which is awkward for a topic taxonomy.)
Deepmind Alignment Team
Apollo
Anthropic Assurance / Trust & Safety / RSP Evaluations / Interpretability
FAR
Krueger Lab
AI Objectives Institute (AOI)
More meta
We don’t distinguish between massive labs, individual researchers, and sparsely connected networks of people working on similar stuff. The funding amounts and full-time employee estimates might be a reasonable proxy.
The categories we chose have substantial overlap; see the “see also”s for closely related work.
I wanted this to be a straight technical alignment doc, but people pointed out that would exclude most work (e.g. evals and nonambitious interpretability, which are safety but not alignment) so I made it a technical AGI safety doc. Plus ça change.
The only selection criterion is “I’ve heard of it and >= 1 person was recently working on it”. I don’t go to parties so it’s probably a couple months behind.
Obviously this is the Year of Governance and Advocacy, but I exclude all this good work: by its nature it gets attention. I also haven’t sought out the notable amount of work by ordinary labs and academics who don’t frame their work as alignment. Nor the secret work.
You are unlikely to like my partition into subfields; here are others.
No one has read all of this material, including us. Entries are based on public docs or private correspondence where possible but the post probably still contains >10 inaccurate claims. Shouting at us is encouraged. If I’ve missed you (or missed the point), please draw attention to yourself.
If you enjoyed reading this, consider donating to Lightspeed, MATS, Manifund, or LTFF: some good work is bottlenecked by money, and some people specialise in giving away money to enable it.
Conflicts of interest: I wrote the whole thing without funding. I often work with ACS and PIBBSS and have worked with Team Shard. Lightspeed gave a nice open-ended grant to my org, Arb. CHAI once bought me a burrito.
If you’re interested in doing or funding this sort of thing, get in touch at hi@arbresearch.com. I never thought I’d end up as a journalist, but stranger things will happen.
Thanks to Alex Turner, Neel Nanda, Jan Kulveit, Adam Gleave, Alexander Gietelink Oldenziel, Marius Hobbhahn, Lauro Langosco, Steve Byrnes, Henry Sleight, Raymond Douglas, Robert Kirk, Yudhister Kumar, Quratulain Zainab, Tomáš Gavenčiak, Joel Becker, Lucy Farnik, Oliver Hayman, Sammy Martin, Jess Rumbelow, Jean-Stanislas Denain, Ulisse Mini, David Mathers, Chris Lakin, Vojta Kovařík, Zach Stein-Perlman, and Linda Linsefors for helpful comments.
Appendices
Appendix: Prior enumerations
Appendix: Graveyard
Appendix: Biology for AI alignment
Lots of agendas but not clear if anyone besides Byrnes and Thiergart is actively turning the crank. Seems like it would need a billion dollars.
Human enhancement
Merging
As alignment aid
Appendix: Research support orgs
One slightly confusing class of org is described by the sample {CAIF, FLI}. Often run by active researchers with serious alignment experience, but usually not following an obvious agenda, delegating a basket of strategies to grantees, doing field-building stuff like NeurIPS workshops and summer schools.
CAIF
AISC
See also:
Appendix: Meta, mysteries, more
Unless you zoom out so far that you reach vague stuff like “ontology identification”. We will see if this total turnover is true again in 2028; I suspect a couple will still be around, this time.
> one can posit neural network interpretability as the GiveDirectly of AI alignment: reasonably tractable, likely helpful in a large class of scenarios, with basically unlimited scaling and only slowly diminishing returns. And just as any new EA cause area must pass the first test of being more promising than GiveDirectly, so every alignment approach could be viewed as a competitor to interpretability work. – Niplav