Buck Shlegeris

CEO at Redwood Research.

AI safety is a highly collaborative field--almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I'm saying this here because it would feel repetitive to say "these ideas were developed in collaboration with various people" in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Wiki Contributions

Comments

Sorted by

Project proposal: No-holds-barred jailbreak attacks and defenses for control

Normally we’ve done control research under the assumption that the red team isn’t able to use techniques like GCG to make targeted jailbreaks for the trusted monitor. Can the blue team defend if we relax that assumption (e.g. because we’re worried the model might be able to start a rogue deployment in which it runs GCG)?

The blue team can try a variety of strategies here.

  • The simplest strategy is randomizing the prompt to the trusted monitor. This forces the red team to find a jailbreak that works over multiple prompts.
  • There are fancier approaches here, e.g. giving the trusted model a randomized prompt that says something like “if the code is safe, reply with the number 1385, otherwise reply with the number 9247”, which breaks jailbreaks which just convince the trusted model to say a particular fixed string.
  • Blue team can modify the current input to contain a known-bad substring, and check that it gets flagged.

Project proposal: Control for auditing

(I'm sharing this here because I needed to link to it, and I thought I might as well make it public.)

When preventing high stakes failures, a crucial bottleneck is the proportion of actions that you can audit: that is, investigate the action further for the purpose of finding evidence that the AI is scheming. (See here for more on this. In particular, note that this might be different from the problem of investigating an action for the purpose of learning whether that action is dangerous.) If we need to rely on human labor to audit, we’ll only be able to audit a tiny fraction of actions that our AI agents are doing. It would be extremely helpful if we were able to use AI assistance for auditing, from both trusted and untrusted models. But obviously auditing with untrusted models leads to massive risk of collusion. So this setting seems potentially pretty hard and potentially crucial; I’m excited for a project investigating it.


This is pretty closely related to untrusted monitoring; you could also call this project “deep dive into untrusted monitoring”.

The simplest such setting I can think of

We’re going to take the APPS backdooring setting from AI Control: Improving Safety Despite Intentional Subversion, fix the main control technique to be trusted monitoring with 1% auditing budget and defer-to-trusted for the 10% most suspicious solutions, and focus on the problem of implementing an auditor. That is:

  • Get a dataset of backdoor attempts.
    • For statistical power, this dataset will need to be large (at least 100; it would be better for it to be more like 1000). This might require developing a better/cheaper backdoor generation pipeline.
    • It might be good to explore backdooring in a harder setting than the interview subset of APPS. Maybe the advanced subset of APPS would be better.
    • The red team will be allowed to modify these backdoor attempts to add collusion signals. (These modifications will probably be cheap and easy compared to generating the backdoor attempts.)
  • Get a dataset of benign code, filter with a trusted monitor for the most suspicious 1%. (This should be pretty cheap and easy)

We want to implement a first-pass auditor (better name TBD). That is, we want the actual audit process to have two stages:

  • Firstly, some AI scaffold (the first-pass auditor) carefully investigates the sketchiest 1% of model-proposed code.
  • That scaffold needs to choose 1% of the code it reviews to pass along to a human reviewer.

The first-pass auditor is allowed to be somewhat expensive compared to generating the original code, because it’s only getting called 1% of the time. For our purposes, maybe let’s say it can cost up to 20x as much.

And we’re doing a control eval: the untrusted model will, as usual, be replaced by an attack policy.

Note that in this control eval there’s no notion of usefulness: our only metric is the probability that the red team is incriminated on a given backdoor attempt. (Actually this might be more confusing than I’ve written here, because we might want to do honeypotting of the untrusted model in the first-pass auditing setup, which leads to the game varying with scale for the usual reasons (see untrusted monitoring in our original control paper).)

Techniques

  • Get the untrusted model to reimplement the function, compare the implementations.
  • Ask the untrusted model to explain why various parts of the function are the way they are; make collusion hard.
  • Show the untrusted model only a small amount of the context (e.g. the action but not the rest of the trajectory), have it ask a trusted model for information that it needs to know in order to assess whether the action was dangerous; hope that it can’t learn whether to collude given this restricted information channel.

There are many potential options here.

My guess is that this project would actually be much better with more complicated programming problems than APPS.

I also asked it whether it consented to being a coauthor, and it said yes.

It seems like that assumption runs throughout the post though, eg 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said', 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'

Those sentences are talking in the context of defer to resample, where I think it's actually reasonable to think the model won't see the resampled actions.

I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change?

I'm specifically talking about not showing it actions that were resampled. You're totally right that they'll need other information about their actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information).

Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?

Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.

Hmmm, good call, it will probably see the history of its chain of thought. My dear coauthor Claude wrote that and I didn't second guess it, I agree it's wrong.

Yep! You might have to show it CoT from previous, non-resampled steps though

It would be crazy to hide that CoT from the monitor.

Why does it know its attempt was replaced? Because it can guess it wouldn't have produced R?

Yep.

If it can tell that, how come it can't guess it specifically would have produced A?

I'm just hypothesizing that the first guessing problem is easier than the second; I intuitively expect that it's a lot easier to notice that something isn't your style than to guess what exactly you would have done (among other things, there are many more bits in "what action did I take" than "did I take this action").

For the main body, I gave it a first draft, it wrote a much improved second draft, then I edited from there. It wrote the appendix basically itself (with some feedback from me)

FWIW, the there are some people around the AI safety space, especially people who work on safety cases, who have that experience. E.g. UK AISI works with some people who are experienced safety analysts from other industries.

Load More