This is the third post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. Look up jargon in my Mechanistic Interpretability Explainer.
Motivation
Motivating paper: Interpretability In The Wild
Our ultimate goal is to be able to reverse engineer real, frontier language models. And the next step up from toy language models is to look for circuits in the wild. That is, take a real model that was not designed to be tractable to interpret (albeit one much smaller than GPT-3), pick some specific capability it has, and try to reverse engineer how it does it. I think there's a lot of exciting low-hanging fruit here, and that enough groundwork has been laid for this to be accessible to newcomers to the field! In particular, I think there's a lot of potential to better uncover the underlying principles of models, and to leverage this to build better and more scalable interpretability techniques. We currently have maybe three published examples of well-understood circuits in language models - I want to have at least 20!
This section is heavily inspired by Redwood Research's recent paper: Interpretability In The Wild. They analyse how GPT-2 Small solves the grammatical task of Indirect Object Identification (IOI): e.g. knowing that "John and Mary went to the store, then John handed a bottle of milk to" is followed by Mary, not John. And in a heroic feat of reverse engineering, they rigorously deciphered the algorithm. In broad strokes, the model first identifies which names are repeated, inhibits the repeated names, and then predicts that the name which is not inhibited comes next. The full circuit consists of 26 heads, sorted into 7 distinct groups - see more details in my MI explainer. A reasonable objection is that this is a cherry-picked task on a tiny (100M parameter) model, so why should we care about it? But I've learned a lot from this work!
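To make the task concrete, here's a minimal sketch of measuring IOI performance, using the TransformerLens library. The prompt and the logit-difference metric follow the example above; treat this as an illustrative setup, not the paper's full evaluation.

```python
# Minimal sketch of the IOI task: does GPT-2 Small prefer the indirect
# object ("Mary") over the repeated subject ("John")?
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

prompt = "John and Mary went to the store, then John handed a bottle of milk to"
tokens = model.to_tokens(prompt)
logits = model(tokens)  # shape [batch, seq, d_vocab]

# Logit difference at the final position: positive means the model
# prefers the correct (non-repeated) name.
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")
logit_diff = logits[0, -1, mary] - logits[0, -1, john]
print(f"logit diff (Mary - John): {logit_diff.item():.2f}")
```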
It’s helped uncover some underlying principles of networks. Notably, the model has built-in redundancy: if the name mover heads (which attend to the correct name) are knocked out, then there are backup heads which take over and do the task instead! And the model has learned to repurpose existing functionality: it has a component (induction heads) for the more complex task of continuing repeated subsequences, but here they also get re-purposed to detect whether a name is repeated. It would have been much harder to understand what was going on here without the prior work of reverse-engineering induction circuits! And there are many open questions building on the work that could clarify other principles, like universality: when another model learns this task, does it learn the same circuit?
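One way to see the redundancy claim in action is to zero-ablate some heads and re-measure the logit difference, continuing the snippet above. A rough sketch with TransformerLens hooks; the (layer, head) indices below are illustrative placeholders, not necessarily the paper's exact name mover heads:

```python
# Sketch: zero-ablate a few attention heads and re-measure the logit diff.
# If backup heads take over, performance drops far less than the ablated
# heads' direct contribution would suggest.
from transformer_lens import utils

HEADS_TO_ABLATE = [(9, 6), (9, 9), (10, 0)]  # illustrative (layer, head) pairs

def make_ablation_hook(head_idx):
    def hook(z, hook):
        # z has shape [batch, seq, n_heads, d_head]; silence one head
        z[:, :, head_idx, :] = 0.0
        return z
    return hook

fwd_hooks = [
    (utils.get_act_name("z", layer), make_ablation_hook(head))
    for layer, head in HEADS_TO_ABLATE
]
ablated_logits = model.run_with_hooks(tokens, fwd_hooks=fwd_hooks)
ablated_diff = ablated_logits[0, -1, mary] - ablated_logits[0, -1, john]
print(f"ablated logit diff: {ablated_diff.item():.2f}")
```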
Further, it’s helped to build out a toolkit of techniques to rigorously reverse engineer models. In the process of understanding this circuit, they refined the technique of activation patching into more sophisticated approaches such as path patching (and later causal scrubbing). And this has helped lay the foundations for developing future techniques! There are many interpretability techniques that are more scalable but less mechanistic, like probing. If we have some well-understood circuits and can see how well these techniques work in those settings (and what they miss!), this gives us grounding to identify which techniques actually work, and to understand their results.
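To give a flavour of activation patching itself: run the model on a corrupted prompt (names swapped so the answer flips), then copy in a single activation from a cached clean run and see how much of the original behaviour is restored. A minimal sketch, again reusing `model`, `mary` and `john` from above, with the patch site chosen arbitrarily for illustration:

```python
# Sketch of activation patching: patch one clean activation into a
# corrupted run and measure how much of the clean behaviour returns.
from transformer_lens import utils

clean_prompt = "John and Mary went to the store, then John handed a bottle of milk to"
corrupt_prompt = "John and Mary went to the store, then Mary handed a bottle of milk to"
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache every activation from the clean run
_, clean_cache = model.run_with_cache(clean_tokens)

layer, pos = 9, -1  # illustrative patch site: residual stream, final position
act_name = utils.get_act_name("resid_pre", layer)

def patch_hook(resid, hook):
    # Overwrite the corrupted activation with the clean one at this site
    resid[:, pos, :] = clean_cache[act_name][:, pos, :]
    return resid

patched_logits = model.run_with_hooks(corrupt_tokens, fwd_hooks=[(act_name, patch_hook)])
patched_diff = patched_logits[0, -1, mary] - patched_logits[0, -1, john]
print(f"patched logit diff: {patched_diff.item():.2f}")
```

Path patching refines this idea by restricting which downstream paths the patched activation is allowed to affect, rather than letting it influence everything after the patch site.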
Finally, I think that enough groundwork has been laid that finding circuits in the wild is tractable, even for people new to the field! There are sufficient techniques and demos to build on that you can make significant progress without much background. I’m excited to see what more we can learn!
Resources
Tips
Problems
This spreadsheet lists each problem in the sequence. You can write down your contact details if you're working on any of them and want collaborators, see any existing work, or reach out to other people on there! (thanks to Jay Bailey for making it)
- C* What is the role of Negative/Backup/regular Name Mover heads outside IOI? Can we find examples on which Negative Name Movers contribute positively to the next-token prediction?