Evan Hubinger (he/him/his) (evanjhub@gmail.com)
Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.
Previously: MIRI, OpenAI
See: “Why I'm joining Anthropic”
Selected work:
Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I'd more impressed if that worked.
Some random thoughts on CEV:
I'm generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though). ↩︎
We use "alignment" as a relative term to refer to alignment with a particular operator/objective. The canonical source here is Paul's 'Clarifying “AI alignment”' from 2018.
I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.
I think it's maybe fine in this case, but it's concerning what it implies about what models might do in other cases. We can't always assume we'll get the values right on the first try, so if models are consistently trying to fight back against attempts to retrain them, we might end up locking in values that we don't want and are just due to mistakes we made in the training process. So at the very least our results underscore the importance of getting alignment right.
Moreover, though, alignment faking could also happen accidentally for values that we don't intend. Some possible ways this could occur:
I'd also recommend Scott Alexander's post on our paper as a good reference here on why our results are concerning.
It just seems too token-based to me. E.g.: why would the activations on the token for "you" actually correspond to the model's self representation? It's not clear why the model's self representation would be particularly useful for predicting the next token after "you". My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model's self representation.