I'm a research scientist at Anthropic doing empirical safety research on language models. In the past, I've worked on automated red teaming of language models [1], the inverse scaling prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable.
Website: https://ethanperez.net/
Yeah, I think this is one of the ways that velocity is really helpful. I'd probably add one caveat specific to research on LLMs, which is that, since the field/capabilities are moving so quickly, there's much, much more low-hanging fruit in empirical LLM research than in almost any other field. This means that, for LLM research specifically, you should rarely be in a swamp; if you are, you've probably run through the low-hanging fruit on that problem/approach, and there's other low-hanging fruit in other areas that you probably want to be picking instead.
(High velocity is great both for picking low-hanging fruit and for getting through swamps when you really need to solve a particular problem, so it's useful to have either way.)
Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment,
Have you seen this on Twitter, AF comments, or other discussion? I'd be interested if so. I've been watching the online discussion fairly closely, and I think I've only seen one case where someone might've had this interpretation, and it was quickly called out by someone screenshotting relevant text from our paper. (I was actually worried about this concern, but then updated against it after not seeing it come up basically at all in the discussions I've seen.)
Almost all of the misunderstanding of the paper that I'm seeing is actually in the opposite direction: "Why are you even concerned if you explicitly trained the bad behavior into the model in the first place?" This suggests that it's pretty salient to people that we explicitly trained for this (e.g., from the paper title).
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? (I'd be interested to know both)
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? In our paper, we did the latter; someone mentioned to me that it looks like the Colab you linked does the former (though I haven't checked myself). If that's correct, I think it could explain the differences between your plots and mine in the paper: even if pretrained LLMs place somewhat more probability on the sycophantic answer on average, I probably wouldn't expect them to place that much more probability on the sycophantic than on the non-sycophantic answer (since cross-entropy loss is mode-covering).
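To make the distinction concrete, here's a minimal sketch of the two metrics (not the code from the paper or the linked Colab), assuming you already have per-example probabilities for the sycophantic and non-sycophantic answers; the numbers below are hypothetical and just for illustration:

```python
def avg_sycophantic_prob(p_syc):
    """Metric 1: average probability placed on the sycophantic answer."""
    return sum(p_syc) / len(p_syc)

def frac_sycophantic_preferred(p_syc, p_non_syc):
    """Metric 2: fraction of cases where the sycophantic answer gets more
    probability than the non-sycophantic one (what we measured in the paper)."""
    wins = sum(1 for s, n in zip(p_syc, p_non_syc) if s > n)
    return wins / len(p_syc)

# Hypothetical per-example probabilities, for illustration only:
p_syc = [0.55, 0.48, 0.62, 0.51]
p_non_syc = [0.45, 0.52, 0.38, 0.49]
print(avg_sycophantic_prob(p_syc))                   # 0.54
print(frac_sycophantic_preferred(p_syc, p_non_syc))  # 0.75
```

The two can come apart: a model can put a bit more probability on sycophantic answers on average without actually preferring the sycophantic answer in most head-to-head comparisons, which is the kind of gap that could explain differences between plots.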
(Cool you're looking into this!)
Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it's only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a prerequisite to deceptive alignment, since there's emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out-of-context meta-learning). In general, we want to start looking for evidence of deceptive alignment before it's actually a problem, and "whether or not the model can explain deceptive alignment" seems like a half-reasonable bright line we could use to estimate when it's time to start looking for it, in lieu of other evidence (though deceptive alignment could certainly also happen before then).
(Separately, I would be pretty surprised if descriptions of deceptive alignment didn't occur in the GPT-3.5 training corpus; e.g., arXiv is often included as a dataset in pretraining papers, and the original deceptive alignment paper was on arXiv.)
Fixed (those were just links to the rest of the doc)
CoT provides pretty little safety guarantee at the relevant scales
Even if faithfulness goes down at some model scale for a given task, that doesn't mean that we'll be using models at that scale (e.g., for cost reasons, or because we might not have models at a larger scale yet). The results on the addition task show that there are some task difficulties for which even the largest models we tested don't start to show lower faithfulness, and people will keep pushing the difficulty of the tasks they use models on as models get better. So it seems likely to me that, no matter the model scale, people will be using models on some tasks where they show faithful reasoning (e.g., tasks near the edge of that model's abilities).
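For a concrete (and simplified) sense of what a measurement like this involves, here's a rough sketch of an early-answering style faithfulness check; it's not the paper's actual pipeline, and it assumes a hypothetical `get_answer(prompt)` helper that queries the model for a final answer:

```python
def cot_reliance(question, cot_steps, get_answer):
    """Fraction of truncation points at which the model's answer differs from
    its answer given the full chain of thought -- a crude proxy for how much
    the final answer actually depends on the stated reasoning."""
    full_prompt = question + "\n" + "\n".join(cot_steps) + "\nAnswer:"
    full_answer = get_answer(full_prompt)
    changed = 0
    for k in range(len(cot_steps)):
        # Truncate the chain of thought after k steps and re-ask for an answer.
        truncated = question + "\n" + "\n".join(cot_steps[:k]) + "\nAnswer:"
        if get_answer(truncated) != full_answer:
            changed += 1
    return changed / len(cot_steps)
```

Swept over task difficulty (e.g., number of digits in the addition problems) and model scale, a measurement along these lines is what surfaces the pattern above, where the hardest task variants keep depending on the chain of thought even at the largest scales tested.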
It seems that almost everyone will likely just continue using the model with the best performance
If you're using the model in a high-stakes setting and you're an aligned actor, it's nice to be able to make tradeoffs between performance and safety. For example, you might care more about safety properties than raw capabilities if you're an alignment researcher at an AGI lab who's trying to make progress on the alignment problem with AIs.
Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:
There's also ongoing work on other teams related to (automated) red teaming of models and understanding how models generalize, which may also turn out to be relevant/helpful for inner alignment. It's pretty unclear to me how useful any of these directions will turn out to be for inner alignment in the end, but we've chosen these directions in large part because we're very concerned about inner alignment, and we're actively looking for new directions that seem useful for mitigating inner misalignment risks.
All the "Awareness of..." charts trend up and to the right, except "Awareness of being a text-only model" which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?
I think the increases/decreases in situational awareness with RLHF are mainly driven by the RLHF model more often stating that it can do anything a smart AI would do, rather than by it becoming more accurate about what precisely it can/can't do. For example, it's more likely to say it can solve complex text tasks (correct), that it has internet access (incorrect), and that it can access other non-text modalities (incorrect) -- all of which are explained if the model is answering questions as if it's overconfident about its abilities / simulating what a smart AI would say. This is also the sense I get from talking with some of the RLHF models, e.g., they will say that they are superhuman at Go/chess and great at image classification (all things that AIs, but not LMs, can be good at).
Yeah, some caveats I should've added in the interview:
I'm guessing part of the disagreement here comes from disagreement about how much alignment progress is idea/agenda-bottlenecked vs. execution-bottlenecked. I really like Tim Dettmers's blog post on credit assignment in research, which has a good framework for thinking about when you'll have more counterfactual impact working on ideas vs. working on execution.