I expect to have more detailed thoughts worth sharing as I spend more time with this content, but one thing stands out brightly as a first impression: this is, head and shoulders above the rest, the best language model interpretability work to date. I'm impressed by the thoroughness of the theory combined with the detailed real examples.
This also seems like good motivation to go back and study layer reordering (à la Sandwich Transformers) as a treatment affecting the induced circuits of a model; see the sketch below.
(h/t Kevin Wang for pointing out the sandwich transformer paper to me recently)
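To make the reordering idea concrete, here is a minimal sketch, assuming a PyTorch-style setup; the `build_sublayers` helper and the specific ordering strings are my own hypothetical illustration, not code from the Sandwich Transformer paper:

```python
import torch.nn as nn

def build_sublayers(ordering: str, d_model: int = 512, n_heads: int = 8) -> nn.ModuleList:
    """Build a stack of sublayers from an ordering string:
    's' = self-attention sublayer, 'f' = feed-forward sublayer."""
    layers = []
    for kind in ordering:
        if kind == "s":
            layers.append(nn.MultiheadAttention(d_model, n_heads, batch_first=True))
        elif kind == "f":
            layers.append(nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.ReLU(),
                nn.Linear(4 * d_model, d_model),
            ))
        else:
            raise ValueError(f"unknown sublayer kind: {kind!r}")
    return nn.ModuleList(layers)

# Standard interleaved ordering vs. a sandwich-style reordering with the same
# number of attention and feed-forward sublayers (and hence the same parameter
# count). The question raised above is how the second ordering changes which
# circuits (e.g. induction heads) the trained model ends up containing.
interleaved = build_sublayers("sfsfsfsfsfsf")
sandwich    = build_sublayers("ssssfsfsffff")
```

Since induction heads rely on attention layers composing across depth, it seems plausible that pushing more attention sublayers toward the bottom of the stack (as sandwich orderings do) would change which of these compositions form, which is what makes this an interesting treatment to study.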
Thanks for the writeup! The first paper covers the first half of the video series, more or less. I've been working on a second paper which will focus primarily on the induction bump phenomenon (and other things described in the second half of the video series), so much more to come there!
Chris Olah, Neel Nanda, Catherine Olsson, Nelson Elhage, and a bunch of other people at Anthropic just published “Transformer Circuits,” an application of the Circuits-style interpretability paradigm to transformer-based language models. From their top-level summary:
They've chosen to release their work in an interestingly novel format, publishing their first paper, “A Mathematical Framework for Transformer Circuits,” alongside a set of YouTube videos that go into even more detail on their findings. I watched the full playlist, found it absolutely fascinating, and would highly recommend it as a way to engage with this research.
Some of my high-level takeaways:
Overall, I think this is clearly the most exciting progress in transparency and interpretability since Circuits, and I'm really happy to see it happening in language models, which, as I've previously emphasized, I think are important for us to focus on. One thing that really sets this sort of transparency and interpretability work apart is the authors' emphasis on understanding the mechanistic building blocks underlying their models, with the hope of eventually being able to reverse-engineer them, rather than just, for example, trying to give humans tools to predict what models are doing (without the output of those tools necessarily having any correspondence to what the models are actually doing).
Some possible future research directions related to understanding how induction heads compose:
One thing I'm still struggling to understand about the induction bump is how it can be the case that, after the induction bump, large models don't do relatively better at meta-learning than small models. I found that extremely surprising, and I almost don't believe it can be the end of the story: qualitatively, we certainly observe much more interesting meta-learning in larger models, and it seems really strange for all of that to be reflected only in an overall loss decrease rather than in an increase in the amount of meta-learning. At the very least, this fact deserves some further explanation. Perhaps there are other interesting phase changes hiding in later parts of the loss curve that might help explain what's going on here, or perhaps the claim that the whole phenomenon of “meta-learning” is just task recognition could shed some light on this result. ↩︎
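As a side note on the measurement itself, here is a minimal sketch of the distinction I have in mind between “overall loss decrease” and “amount of meta-learning”: score in-context learning as the gap between per-token loss early versus late in the context. This framing, the `in_context_learning_score` helper, the positions, and the example loss curves are all my own hypothetical illustration, not the authors' metric or data:

```python
import numpy as np

def in_context_learning_score(per_token_loss: np.ndarray,
                              early: int = 50, late: int = 500) -> float:
    """Average loss at an early context position minus loss at a late one.

    A larger gap means the model benefits more from additional in-context
    tokens, independent of how low its loss is in absolute terms.
    """
    return float(per_token_loss[early] - per_token_loss[late])

# Two made-up loss curves: the "large" model is better at every position, but
# its early-vs-late gap (the meta-learning benefit) is identical to the
# "small" model's, which is the kind of result the footnote finds surprising.
positions = np.arange(1, 1001)
small_model_loss = 4.0 - 0.5 * np.log(positions) / np.log(1000)
large_model_loss = 3.5 - 0.5 * np.log(positions) / np.log(1000)

print(in_context_learning_score(small_model_loss))  # same gap for both models,
print(in_context_learning_score(large_model_loss))  # despite the lower overall loss
```

Under a metric like this, the surprising claim is that larger models look like the second curve: uniformly lower loss, but no relative growth in the early-versus-late gap after the induction bump.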