All of Carson Denison's Comments + Replies

Thank you for pointing this out. I should have been more clear. 

I have added a link to the 7 samples where the model tampers with its reward and the tests according to our operational definition: https://github.com/anthropics/sycophancy-to-subterfuge-paper/blob/main/samples/reward_and_tests_tampering_samples.md

I have added the following sentence to the caption of figure one (which links to the markdown file above):

We upload the 7 (of 32,768) samples where the model overwrites its reward and edits the unit tests to our github repository. We note that s

... (read more)

This is cool work! There are two directions I'd be excited to see explored in more detail:

  1. You mention that you have to manually tune the value of the perturbation weight R. Do you have ideas for how to automatically determine an appropriate value? This would significantly increase the scalability of the technique. One could then easily generate thousands of unsupervised steering vectors and use a language model to flag any weird downstream behavior.
  2. It is very exciting that you were able to uncover a backdoored behavior without prior knowledge. However,
... (read more)
1Andrew Mack
Thanks for your comment! Here are my thoughts on this: 1. I agree that a more automated way of choosing hyper-parameters is an obvious and important next step! I have some ideas here, but it is certainly not a solved problem. Here are some rough ideas, in order of compute costs: 1. Develop some useful heuristics based off diversity measures of steered completions. For example, for each value of R you could calculate sentence embeddings of the steered completions for a small number of learned vectors, and then use the summed variance in sentence embeddings as your diversity metric. Then, plot diversity as a function of R. My guess is that this might look something like a “hockey stick”: for small values of R you get essentially zero diversity, but you quickly hit a threshold/phase transition where diversity sky-rockets. The values of R with super high diversity are probably not what you want (they will be incoherent). Instead, you want a value of R right at the cusp of the phase transition. You could then scale out to thousands of vectors at this “cusp” value of R. The trick is coming up with a useful heuristic for identifying the cusp, which would likely require more experience applying the method to a diverse range of prompts/examples. Ultimately, this feels largely analogous to identifying “elbows” in scree plots. This obviously depends on there being obvious cusps in the first place (a priori it's not clear if this will happen). 2. Fine-tune an LLM to identify the best value of R. This is similar to the above idea, except that instead of trying to come up with a heuristic for identifying cusps, we use a fine-tuned LLM as the heuristic. In other words, we manually decide the best R for a number of example prompts and fine-tune the LLM to predict which value of R is best on a new prompt we care about (you could imagine a number of different ways of specifying the details of the fine-tuning). 3. Just use many different values of R. If we’re already tra

Thank you for catching this. 

These linked to section titles in our draft gdoc for this post. I have replaced them with mentions of the appropriate sections in this post.