Buck Shlegeris

CEO at Redwood Research.

AI safety is a highly collaborative field: almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Comments

This project has now been released; I think it went extremely well.

This uses transformers, which is, IIUC, way less efficient for inference than e.g. vLLM, to an extent that is probably unacceptable for production use cases.
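
For concreteness, here's a rough sketch of the two inference paths being compared; the model name and prompts are placeholders rather than anything from the project itself:

```python
# Illustrative sketch only: contrasting Hugging Face transformers generation
# with vLLM's batched serving path. Model and prompts are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

prompts = ["Summarize this transcript:", "Review this code diff:"]

# transformers: straightforward, but generation lacks continuous batching and
# paged attention, so throughput is much lower under load.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
hf_model = AutoModelForCausalLM.from_pretrained("gpt2")
hf_inputs = tok(prompts, return_tensors="pt", padding=True)
hf_outputs = hf_model.generate(**hf_inputs, max_new_tokens=64)

# vLLM: the same requests go through a scheduler with efficient KV-cache
# management, which is why it's the usual choice for production serving.
vllm_model = LLM(model="gpt2")
vllm_outputs = vllm_model.generate(prompts, SamplingParams(max_tokens=64))
```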

I don't think this is right. To train an untrusted monitor, you need to synthetically generate dangerous actions (because by assumption these actions are rare). So during training, you'll always know the correct label for an action before you show it to the model.

You might want to use something debate-like in the synthetic input generation process, but that's structurally different.
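
A toy sketch of the data-generation setup I have in mind; the action strings and generator functions below are stand-ins for a real red-teaming pipeline, and the point is just that every label is known by construction:

```python
# Toy sketch: building labeled training data for an untrusted monitor.
# Dangerous actions are synthesized by us, so their labels are known before
# the monitor ever sees them; the strings below are obviously just stand-ins.
import random

def generate_synthetic_attack() -> str:
    # Stand-in for a red-team generation pipeline (dangerous by construction).
    return random.choice(["copy model weights to external host", "disable audit logging"])

def sample_benign_action() -> str:
    # Stand-in for sampling ordinary actions from deployment data.
    return random.choice(["run unit tests", "refactor the parser module"])

def build_monitor_training_set(n: int) -> list[tuple[str, int]]:
    data = []
    for _ in range(n):
        if random.random() < 0.5:
            data.append((generate_synthetic_attack(), 1))  # label known: dangerous
        else:
            data.append((sample_benign_action(), 0))       # label known: benign
    return data

train_set = build_monitor_training_set(1000)
# In practice you'd fine-tune the untrusted model on these (action, label)
# pairs; the key structural point is that no debate or adjudication is needed
# to determine the label at training time.
```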

I’m curious how they set up the SAE stuff; I’d have thought that this would require modifying some performance-critical inference code in a tricky way.

I don’t understand your last sentence, can you rephrase?

> You could also use the resulting debaters at inference time to audit especially suspicious actions (run the debate again, ideally it points out why the given action is good / bad)

In cases where you're worried about the model taking small numbers of catastrophic actions (i.e. concentrated failures, a.k.a. high-stakes failures), this is basically equivalent to what I usually call untrusted monitoring, which means you have to worry about collusion.

IMO it's good to separate out reasons to want good reward signals like so:

  • Maybe bad reward signals cause your model to "generalize in misaligned ways", e.g. scheming or some kinds of non-myopic reward hacking
    • I agree that bad reward signals increase the chance your AI is importantly misaligned, though I don't think that effect would be overwhelmingly strong.
  • Maybe bad reward signals cause your model to exploit those reward signals even on-distribution. This causes problems in a few ways:
    • Maybe you think that optimizing against a flawed reward signal will produce catastrophically dangerous results. X-risk concerned people have talked about this for a long time, but I'm not sold it's that important a factor. In particular, before a model produces catastrophically dangerous actions because it's exploiting a flawed reward signal, I expect that you would have noticed bad (but non-catastrophic) outcomes from earlier AIs exploiting flawed reward signals. So it's hard to see how this failure mode would strike you by surprise.
    • Optimizing against a flawed reward signal will mean your AI is less useful than it would otherwise be, which is bad because you presumably were training the model because you wanted it to do something useful for you.

I think this last one is the most important theory of change for research on scalable oversight.

I am curious whether @Rohin Shah disagrees with me, or whether he agrees but just phrased it (from my perspective) weirdly.

Isn't it easy to detect regexes in model outputs and rejection sample lines that contain regexes? This requires some custom sampling code if you want optimal latency/throughput, but the SAEs also require that.
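
Roughly the kind of thing I have in mind; `sample_line` and the regex-detection heuristic are placeholder assumptions, and a real version would need to live inside the serving stack to keep latency reasonable:

```python
# Sketch of line-level rejection sampling: resample any generated line that
# looks like it contains a regex. The detector below is a crude placeholder.
import re

REGEX_HINT = re.compile(r"re\.(compile|search|match|findall)\(")

def generate_without_regexes(sample_line, n_lines: int, max_retries: int = 10) -> str:
    """sample_line(prefix) stands in for a model call returning the next line
    of output given everything accepted so far."""
    lines = []
    for _ in range(n_lines):
        for _ in range(max_retries):
            candidate = sample_line(prefix="\n".join(lines))
            if not REGEX_HINT.search(candidate):
                break  # accept the first candidate without a regex
        # If every retry contained a regex, we fall through with the last one.
        lines.append(candidate)
    return "\n".join(lines)

# Toy usage with a canned "model": the first line gets rejected and resampled.
canned = iter(["m = re.search('a+b', s)", "m = parse(s)", "return m"])
print(generate_without_regexes(lambda prefix: next(canned), n_lines=2))
```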

This post was an early articulation of many of the arguments and concepts that we mostly workshopped into the AI control research direction.

In particular, I think the final paragraph of the conclusion holds up really well:

> But I'm more excited about the meta level point here: I think that when AI developers are first developing dangerously powerful models, alignment researchers will be doing a very different kind of activity than what they do now. Right now, alignment researchers have to do a lot of long range extrapolation: they don't have access to either the models or the settings in which they actually care about ensuring good behavior. I think that alignment researchers haven't quite appreciated the extent to which their current situation is disanalogous to the situation at crunch time. It seems likely to me that at crunch time, the core workflow of alignment practitioners will involve iterating on alignment techniques based on direct measurement of their effect on the evaluations that are used in safety arguments; more generally, I expect that alignment practitioners will be doing much less extrapolation. I’m therefore eager for alignment researchers to phrase their techniques in these terms and practice the kind of eval-based technique iteration that I think will be crucial later.

I think this is indeed an important insight that people were underrating.

This post was also an important step in the evolution of my philosophy for reasoning about AI safety research. I used to have a kind of "gotcha" attitude towards safety research, where if there's any conceivable situation in which a technique wouldn't work, I felt some disdain for the technique. We wrote this post around the time that Ryan was first really getting out of this mindset and saying "no, I want to understand the limitations of techniques in detail; we'll write down all the strategies we can try as the blue team, because there are a lot of them and together they're probably pretty strong; we will write down the limitations for each, but we won't treat these techniques as contemptible just because they're not indefinitely scalable".

My main regret is the length of the title, which Ryan mercilessly mocks me for every time I try to cite the post to someone.

I'm also sad that no one has really developed these arguments further. There still isn't much empirical research on this topic, which I'd now call the AI control perspective on scalable oversight, despite the fact that it seems very important. (The password-locked models paper is one attempt at doing some basic science on AI control for oversight.)

I also think it would be good if someone really carefully wrote out the connections between these arguments and various other vaguely similar ideas, e.g. prover-verifier games.
