Agreed. E.g. a model that is corrigible and fairly aligned, but knows there are some imperfections in its alignment that the humans wouldn't want, and intentionally acts in a way where gradient descent will fix those imperfections. It seems like it's doing gradient hacking while also, in some meaningful sense, being aligned.
I disagree re the way we currently use "understand". E.g. I think that SAE reconstructions have the potential to smuggle in lots of things via, say, the exact values of the continuous activations, latents that don't quite mean what we think, etc.
It's plausible that a future and stricter definition of "understand" fixes this though, in which case I might agree? But I would still be concerned that 99.9% understanding involves a really long tail of heuristics, and I don't know what may emerge from combining many things that individually make sense. And I probably put >0.1% probability on a superintelligence being able to adversarially smuggle things we don't like into a system we think we understand.
Anyway, all that pedantry aside, my actual concern is tractability. If addressed, this seems plausibly helpful!
Agreed in principle; my goal in the section on black box stuff is to lay out ideas that I think could work in spite of this optimisation pressure.
I think the correct question is how much of an update you should make in an absolute sense rather than a relative sense. Many people in this community are overconfident, and if you decide that every person is less worth listening to than you thought, this doesn't change who you listen to, but it should make you a lot more uncertain in your beliefs.
Are the joint names separated by spaces? If not, the tokenization is going to be totally broken. More generally, I would be interested to see this tried with a code that e.g. maps familiar tokens to obscure ones, or something like mapping the token with id k to id (max − k). Tokens feel like the natural way an LLM would represent its processing, and thus encode its processing. Doing things with individual letters is kind of hard.
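To make the suggested cipher concrete, here's a minimal sketch of the token-id reversal idea (mapping id k to max − k). The vocabulary size is a stand-in; any real experiment would use the actual tokenizer's vocab size, and this is just one simple choice of invertible mapping:

```python
def remap_ids(token_ids, vocab_size):
    """Map token id k to (vocab_size - 1 - k), sending familiar
    (low-id, common) tokens to obscure high-id ones and vice versa."""
    return [vocab_size - 1 - k for k in token_ids]

def unmap_ids(mapped_ids, vocab_size):
    # The mapping is an involution, so it is its own inverse.
    return remap_ids(mapped_ids, vocab_size)

# Illustrative only: 50257 is GPT-2's vocab size; the ids are arbitrary.
vocab_size = 50257
ids = [464, 2068, 7586]
encoded = remap_ids(ids, vocab_size)
assert unmap_ids(encoded, vocab_size) == ids
```

Because the map operates on whole token ids rather than characters, the model never has to reason about individual letters, which is the property the comment above cares about.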
Thanks a lot for doing this work. This is substantially more evaluation awareness than I would have predicted. I'm not super convinced by the classifying-transcript-purpose experiments, since the evaluator model is plausibly primed to think about this stuff, but the monitoring results seem compelling and very concerning. I guess we should really get concerned when it stops showing up in the chain of thought...
I would disagree that either one or four was achieved because, to my knowledge, the auditing game focused on finding, not fixing. It's also a bit ambiguous whether the prediction involves fixing with interpretability or fixing with unrelated means. It wouldn't surprise me if you could use the SAE-derived insights to filter out the relevant data if you really wanted to, but I'd guess an LLM classifier is more effective. Did they do that in the paper?
(To be clear, I think it's a great paper, and to the degree that there's a disagreement here it's that I think your predictions weren't covering the right comparative advantages of interp)
This is fantastic work, and I'd be very excited to see more work in the vein of auditing games. It seems like one of the best ways so far to test how useful different techniques for understanding models are.
In my incredibly biased opinion, the GDM AGI safety team is great and an effective place to work on reducing AI x-risk, and I would love to get applications from people here
Adding some clarifications re my personal perspective/takes on how I think about this from an AGI Safety perspective: I see these ideas as Been's brainchild, I largely just helped out with the wording and framing. I do not currently plan to work on agentic interpretability myself, but still think the ideas are interesting and plausibly useful, and I’m glad the perspective is written up! I still see one of my main goals as working on robustly interpreting potentially deceptive AIs and my guess is this is not the comparative strength of agentic interpretability.
Why care about it? From a scientific perspective, I'm a big fan of baselines and doing the simple things first. "Prompt the model and see what happens" or "ask the model what it was doing" are the obvious things you should do first when trying to understand a behaviour. In internal experiments, we often find that we can just solve a problem with careful and purposeful prompting, no need for anything fancy like SAEs or transcoders. But it seems kinda sloppy to "just do the obvious thing"; I'm sure there's a bunch of nuance re doing this well, and re training models for this to be easy to do. I would be excited for there to be a rigorous science of when and how well these kinds of simple black box approaches actually work. This is only part of what agentic interpretability is about (there's also white box ideas, more complex multi-turn stuff, an emphasis on building mental models of each other, etc), but it's a direction I find particularly exciting. If nothing else, we need to answer this to know where other interpretability methods can add value.
It also seems that, if we're trying to use any kind of control or scalable oversight scheme where weak trusted models oversee strong untrusted models, then the better we are at having high-fidelity communication with the weaker models, the better. And if the model is aligned, I feel much more excited about a world where the widely deployed systems are doing things users understand, rather than being inscrutable autonomous agents.
Naturally, it's worth thinking about negative externalities. In my opinion, helping humans have better models of AI psychology seems robustly good. AIs having better models of human psychology could be good for the reasons above, but there's the obvious concern that it will make models better at being deceptive, and I would be hesitant to recommend such techniques become standard practice without better solutions to deception. But I expect companies to eventually do things vaguely along the lines of agentic interpretability regardless, so either way I would be keen to see research on how such techniques affect model propensity and capability for deception.