I liked Evan’s post on 11 proposals for safe AGI. However, I was a little confused about why he chose these specific proposals; it feels like we could generate many more by stitching together the different components he identifies, such as different types of amplification and different types of robustness tools. So I’m going to take a shot at describing a set of dimensions of variation which capture the key differences between these proposals, and thereby describe an underlying space of possible approaches to safety.

Firstly I’ll quickly outline the proposals. Rohin’s overview of them is a good place to start - he categorises them as:

  • 7 proposals of the form “recursive outer alignment technique” plus “robustness technique”.
    • The recursive outer alignment technique is either debate, recursive reward modelling, or amplification.
    • The robustness technique is either transparency tools, relaxed adversarial training, or intermittent oversight by a competent supervisor.
  • 2 proposals of the form “non-recursive outer alignment technique” plus “robustness technique”.
  • 2 other proposals: Microscope AI; STEM AI.

More specifically, we can describe the four core recursive outer alignment techniques (counting imitative and approval-based amplification separately) as variants of iterated amplification, as follows: let Amp(M) be the procedure of a human answering questions with access to model M. Then we iteratively train M* (the next version of M) by:

  • Imitative amplification: train M* to imitate Amp(M).
  • Approval-based amplification: train M* on an approval signal specified by Amp(M).
  • Recursive reward modelling: train M* on a reward function specified by Amp(M).
  • Debate: train M* to win debates against Amp(M).
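To make the iteration concrete, here is a minimal Python sketch of the imitative variant only. Everything in it is an assumption made for illustration: models are plain functions from question strings to answer strings, `toy_human` is a hypothetical stand-in for the human (it can only add one number to an answer supplied by M), and "training" is reduced to memorising Amp(M)'s answers. The other three variants would replace `imitative_step` with a different training signal.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]
Human = Callable[[str, Model], str]


def amp(human: Human, model: Model) -> Model:
    """Amp(M): the human answers questions with access to model M."""
    return lambda question: human(question, model)


def imitative_step(amplified: Model, questions: List[str]) -> Model:
    """Toy stand-in for training: M* memorises Amp(M)'s answers."""
    table: Dict[str, str] = {q: amplified(q) for q in questions}
    return lambda question: table.get(question, "unknown")


def iterate(human: Human, model: Model, questions: List[str], rounds: int) -> Model:
    """Repeatedly replace M with M*, trained to imitate Amp(M)."""
    for _ in range(rounds):
        model = imitative_step(amp(human, model), questions)
    return model


def toy_human(question: str, model: Model) -> str:
    """Hypothetical human: reads off a single number, or adds the first
    number to whatever M says about the rest of the sum."""
    parts = question.split("+")
    if len(parts) == 1:
        return parts[0]
    rest = model("+".join(parts[1:]))
    if rest == "unknown":
        return "unknown"
    return str(int(parts[0]) + int(rest))


def untrained(question: str) -> str:
    """The initial model M knows nothing."""
    return "unknown"


QUESTIONS = ["3", "2+3", "1+2+3"]
```

With these toy definitions, each round lets M* answer questions one level deeper than M could, which is the sense in which the procedure is recursive.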

Here are six axes of variation which I claim underlie Evan’s proposals. Each proposal is more or less:

  1. Supervised
  2. Structured
  3. Adversarial
  4. Language-based
  5. Interpretability-dependent
  6. Environment-dependent

In more detail:

  1. Supervised: this axis measures how much the proposal relies on high-quality supervision by a (possibly amplified) human. Imitative amplification places the heaviest burden on the supervisor, since they need to identify a good action at each timestep. Approval-based amplification requires the supervisor to recognise good actions at each timestep, which should be easier; and standard RL only requires the supervisor to recognise good outcomes. Multi-agent, microscope and STEM AI don’t even require that.
  2. Structured: this axis measures how much the work of alignment relies on instantiating a specific structure. Recursive reward modelling, as explained here, is highly structured because it constructs a tree of agents implementing specific subtasks. Debate is somewhat less so, because the debate tree is defined implicitly, and only one path through it is actually taken. In other versions of amplification, it depends on how the calls to Amp(M) work - they might involve the human just asking M a couple of clarifying questions (in which case the structure is very minimal), or else spinning up many copies of M in a hierarchical and structured way. By contrast, multi-agent approaches are by default highly unstructured, since many of the agents’ incentives will be emergent ones that arise from flexible interactions. However, I should be clear that these classifications are intended merely as rough guidelines - comparisons of such a vague concept as “structuredness” will always be very subjective.
  3. Adversarial: this axis measures how much the proposal depends on AIs competing with each other during the training process. Debate and multi-agent training are strongly adversarial; other proposals are adversarial to the extent that they depend on adversarial training. STEM AI and Microscope AI are the least adversarial.
  4. Language-based: this axis measures how much the proposal relies on using natural language as a means of interaction. Debate (along with amplification, in practice) is heavily language-based; STEM AI is not very language-based; everything else is in the middle (depending on what types of tasks they’re primarily trained on).
  5. Interpretability-dependent: this axis measures how much the proposal relies on our ability to interpret the internal workings of neural networks. Some proposals do not require this at all; others (like microscope AI) require a detailed understanding of cognition; (relaxed) adversarial training requires the ability to generate examples of misbehaviour, which I expect to be even harder. Another source of variance along this axis is how scalable our interpretability tools need to be - adversarial training requires interpretability tools to run frequently during training, whereas in theory we could just analyse a microscope AI once.
  6. Environment-dependent: this axis measures how much the proposal depends on which environments or datasets we use to train AGIs (excluding the supervision component). Multi-agent safety and STEM AI are heavily environment-dependent; everything else less so.

I intend this breakdown to be useful not just in classifying existing approaches to safety, but also in generating new ones. For example, I’d characterise this paper as arguing that AI training regimes which are less structured, less supervised and more environment-dependent will become increasingly relevant (a position with which I strongly agree), and trying to come up with safety research directions accordingly. Another example: we can take each variant of iterated amplification and ask how we could improve it if we had better interpretability techniques (such as the ability to generate adversarial examples which display specific misbehaviours). More speculatively, since adversarial interactions are often useful in advancing agent capabilities, I’d be interested in versions of STEM AI which add an adversarial component - perhaps by mimicking in some ways the scientific process as carried out by humans.
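One way to picture "generating new proposals" is to treat each proposal as a point in the six-axis space and move it along a single axis. The sketch below is only an illustration under stated assumptions: the `Proposal` class, the coarse "low"/"high" levels and the `vary` helper are hypothetical, and the only levels filled in are ones the descriptions above state fairly directly (anything else is left as `None` rather than guessed).

```python
from dataclasses import dataclass, replace
from typing import Optional

AXES = (
    "supervised",
    "structured",
    "adversarial",
    "language_based",
    "interpretability_dependent",
    "environment_dependent",
)


@dataclass(frozen=True)
class Proposal:
    """A point in the six-axis space; None means 'not stated in the post'."""
    name: str
    supervised: Optional[str] = None
    structured: Optional[str] = None
    adversarial: Optional[str] = None
    language_based: Optional[str] = None
    interpretability_dependent: Optional[str] = None
    environment_dependent: Optional[str] = None


# Coarse levels read off the axis descriptions; a rough paraphrase, not a measurement.
STEM_AI = Proposal(
    "STEM AI",
    supervised="low",
    adversarial="low",
    language_based="low",
    environment_dependent="high",
)
DEBATE = Proposal("Debate", adversarial="high", language_based="high")


def vary(proposal: Proposal, axis: str, level: str) -> Proposal:
    """Return a new candidate proposal that differs on exactly one axis."""
    if axis not in AXES:
        raise ValueError(f"unknown axis: {axis}")
    new_name = f"{proposal.name} ({axis}={level})"
    return replace(proposal, name=new_name, **{axis: level})


# The speculative suggestion above: STEM AI with an adversarial component.
adversarial_stem_ai = vary(STEM_AI, "adversarial", "high")
```

Because `Proposal` is frozen, `vary` leaves the original point untouched, so existing proposals and their variants can be compared side by side.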

There’s one other important question about navigating this space of possibilities - on what metric should we evaluate the proposals within it? We could simply do so based on their overall probability of working. But I think there are enough unanswered questions about what AGI development will look like, and what safety problems will arise, that these evaluations can be misleading. Instead I prefer to decompose evaluations into two components: how much does a proposal improve our situation given certain assumptions about what safety problems we’ll face along which branches of AGI development; and how likely are those assumptions to be true? This framing might encourage people to specialise in approaches to safety which are most useful conditional on one possible path to AGI, even if that’s at the expense of generality - a tradeoff which will become more worthwhile as the field of AI safety grows.
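The two-component evaluation can be written as a weighted sum: for each scenario, multiply how likely it is by how much the proposal helps if it holds. The Python sketch below is a minimal illustration only; the scenario names and every number in it are hypothetical placeholders chosen to show the arithmetic, not estimates of anything.

```python
from typing import Dict


def expected_improvement(
    improvement_given: Dict[str, float],
    probability: Dict[str, float],
) -> float:
    """Sum over scenarios of P(scenario) * improvement if that scenario holds.

    Scenarios missing from improvement_given are treated as 'no help' (0.0).
    """
    return sum(
        probability[scenario] * improvement_given.get(scenario, 0.0)
        for scenario in probability
    )


# Hypothetical scenarios and placeholder numbers, purely for illustration.
PROBABILITY = {"path_a": 0.5, "path_b": 0.5}

# A specialised proposal helps a lot on one path and not at all on the other.
SPECIALISED = {"path_a": 0.8}

# A general proposal helps a little on every path.
GENERAL = {"path_a": 0.3, "path_b": 0.3}

specialised_score = expected_improvement(SPECIALISED, PROBABILITY)
general_score = expected_improvement(GENERAL, PROBABILITY)
```

Keeping the conditional improvements and the scenario probabilities in separate tables means a disagreement about a proposal can be traced to one component or the other, which is the point of the decomposition.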

Thanks to the DeepMind safety reading group and Evan Hubinger for useful ideas and feedback.

4 comments

"Debate: train M* to win debates against Amp(M)."

I think Debate is closer to "train M* to win debates against itself as judged by Amp(M)".

Wouldn't it just be "train M* to win debates against itself as judged by H"? Since in the original formulation of debate a human inspects the debate transcript without assistance.

Anyway, I agree that something like this is also a reasonable way to view debate. In this case, I was trying to emphasise the similarities between Debate and the other techniques: I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards.

Maybe an easier way to visualise this is that, given some question, M* answers that question, and then Amp(M) tries to identify any flaws in the argument by interrogating M*, and rewards M* if no flaws can be found.

"I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards."

This seems like a reasonable way to think of debate.

I think, in practice (if this even means anything), the power of debate is quite bounded by the power of the human, so some other technique is needed to make the human capable of supervising complex debates, e.g. imitative amplification.

This strikes me as a really interesting and innovative post, proposing a framework for systematically categorizing existing alignment proposals as well as helping to generate new ones.

I'm kind of surprised that this post is almost 2 years old and yet only has one pingback and a few comments.

Is there some other framework which has superseded this one, or did people just forget about it / there isn't much comparative alignment work going on?

One other framework I've seen kind of like this is "Training stories" from Evan Hubinger's "How do we become confident in the safety of a machine learning system?". But that is more about evaluating alignment proposals (i.e. the very last part of the present post) than about categorizing alignment proposals along a consistent set of dimensions, which is the main focus here. So it actually serves a different purpose and isn't much like this framework after all.