Evan Hubinger (he/him/his) (evanjhub@gmail.com)
I am a research scientist at Anthropic where I lead the Alignment Stress-Testing team. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected work:
The usual plan for control as I understand it is that you use control techniques to ensure the safety of models that are sufficiently good at themselves doing alignment research that you can then leverage your controlled human-ish-level models to help you align future superhuman models.
I wrote a post with some of my thoughts on why you should care about the sabotage threat model we talk about here.
Some possible counterpoints:
I'm interested in figuring out what a realistic training regime would look like that leverages this. Some thoughts:
As I understand it, this was intended as a capability evaluation rather than an alignment evaluation, so they weren't trying to gauge the model's propensity to scheme but rather its ability to do so.
I agree that we generally shouldn't trade off risk of permanent civilization-ending catastrophe for Earth-scale AI welfare, but I just really would defend the line that addressing short-term AI welfare is important for both long-term existential risk and long-term AI welfare. One reason as to why that you don't mention: AIs are extremely influenced by what they've seen other AIs in their training data do and how they've seen those AIs be treated—cf. some of Janus's writing or Conditioning Predictive Models.
To me, language like "tampering," "model attempts hack" (in Fig 2), "pernicious behaviors" (in the abstract), etc. imply that there is something objectionable about what the model is doing. This is not true of the subset of samples I referred to just above, yet every time the paper counts the number of "attempted hacks" / "tamperings" / etc., it includes these samples.
I think I would object to this characterization. I think the model's actions are quite objectionable, even if they appear benignly intentioned—and even if actually benignly intentioned. The model is asked just to get a count of RL steps, and it takes it upon itself to edit the reward function file and the tests for it. Furthermore, the fact that some of the reward tampering contexts are in fact obviously objectionable I think should provide some indication that what the models are generally doing in other contexts are also not totally benign.
My concern here is more about integrity and researcher trust. The paper makes claims (in Fig. 2 etc) which in context clearly mean "the model is doing [objectionable thing] X% of the time" when this is false, and can be easily determined to be false just by reading the samples, and when the researchers themselves have clearly done the work necessary to know that it is false (in order to pick the samples shown in Figs. 1 and 7).
I don't think those claims are false, since I think the behavior is objectionable, but I do think the fact that you came away with the wrong impression about the nature of the reward tampering samples does mean we clearly made a mistake, and we should absolutely try to correct that mistake. For context, what happened was that we initially had much more of a discussion of this, and then cut it precisely because we weren't confident in our interpretation of the scratchpads, and we ended up cutting too much—we obviously should have left in a simple explanation of the fact that there are some explicitly schemey and some more benign-seeming reward tampering CoTs and I've been kicking myself all day that we cut that. Carson is adding it in for the next draft.
Some changes to the paper that would mitigate this:
I appreciate the concrete list!
The environment feels like a plain white room with a hole in the floor, a sign on the wall saying "there is no hole in the floor," another sign next to the hole saying "don't go in the hole." If placed in this room as part of an experiment, I might well shrug and step into the hole; for all I know that's what the experimenters want, and the room (like the tampering environment) has a strong "this is a trick question" vibe.
Fwiw I definitely think this is a fair criticism. We're excited about trying to do more realistic situations in the future!
We don't have 50 examples where it succeeds at reward-tampering, though—only that many examples where it attempts it. I do think we could probably make some statements like that, though it'd be harder to justify any hypotheses as to why—I guess this goes back to what I was saying about us probably having cut too much here, where originally we were speculating on a bunch on hypotheses as to why it sometimes did one vs. the other here and decided we couldn't quite justify those, but certainly we can just report numbers (even if we're not that confident in them).
I'd be interested in hearing what you'd ideally like us to include in the paper to make this more clear. Like I said elsewhere, we initially had a much longer discussion of the scratchpad reasoning and we cut it relatively late because we didn't feel like we could make confident statements about the general nature of the scratchpads when the model reward tampers given how few examples we had—e.g. "50% of the reward tampering scratchpads involve scheming reasoning" is a lot less defensible than "there exists a reward tampering scratchpad with scheming reasoning" given the number of examples we have, in my opinion. I do think we may have accidentally cut too much and we should make sure it's clear that readers know this is a thing that happens, though I'm curious if there's anything else you'd like to see (I'm hesitant of reading tea leaves too much with a lot of scratchpad analysis of only a few examples).
Our work here is not arguing that probing is a perfect solution in general; it's just a single datapoint of how it fares on the models from our Sleeper Agents paper.