Current themes in mechanistic interpretability research

Lee Sharkey; Sid Black; beren

This post gives an overview of discussions - from the perspective and understanding of the interpretability team at Conjecture - between mechanistic interpretability researchers from various organizations including Conjecture, Anthropic, Redwood Research, OpenAI, and DeepMind as well as some independent researchers. It is not a review of past work, nor a research agenda. We're thankful for comments and contributions from Neel Nanda, Tristan Hume, Chris Olah, Ryan Greenblatt, William Saunders, and other anonymous contributors to this post, which greatly improved its quality. While the post is a summary of discussions with many researchers and received comments and contributions from several, it may nevertheless not accurately represent their views.

The last two to three years have seen a surge in interest in mechanistic interpretability as a potential path to AGI safety. Now there are no fewer than five organizations working on the topic (Anthropic, Conjecture, DeepMind, OpenAI, Redwood Research) in addition to numerous academic and independent researchers.

In discussions about mechanistic interpretability between a subset of researchers, several themes emerged. By summarizing these themes here, we hope to facilitate research in the field more broadly.

We identify groups of themes that concern:

Object-level research topics in mechanistic interpretability
Research practices and tools in mechanistic interpretability
Field building and research coordination in mechanistic interpretability
Theories of impact for mechanistic interpretability

Object-level research topics in mechanistic interpretability

Solving superposition

Anthropic’s recent article on Toy Models of Superposition laid out a compelling case that superposition is a real phenomenon in neural networks. Superposition appears to be one of the reasons that polysemanticity happens, which makes mechanistic interpretability very difficult because it prevents us from telling simple stories about how features in one layer are constructed from features in previous layers.

A solution to superposition will look like the ability to enumerate all the features that a network represents, even if they’re represented in superposition. If we can do that, then we should be able to make statements like “For all features in the neural network, none violate rule X” (and, more ambitiously, “No features with property X participate in circuits which violate property Y”). Researchers at Anthropic hope this might enable ‘enumerative safety’, which might allow checks of random samples, or comprehensive investigations of safety-critical parts of the model, for unexpected and concerning components. There are many potential reasons researchers could fail to achieve enumerative safety, including failing to solve superposition, scalability challenges, and several other barriers described in the next section.

Anthropic outlined several potential solutions to superposition in their article. Very briefly, these strategies are:

Create models without superposition.
Find a sparse overcomplete basis that describes how features are represented in models with superposition. This will likely involve large scale solutions to sparse coding.
Hybrid approaches in which one changes models, not resolving superposition, but making it easier for a second stage of analysis to find a sparse overcomplete basis that describes it.
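
To make the second strategy a little more concrete, here is a minimal Python sketch of recovering a sparse code over an overcomplete basis. It assumes the dictionary of feature directions is already known, which is exactly the part that is hard in real networks. The three-direction dictionary, the function names, and the choice of greedy matching pursuit are all illustrative assumptions, not a method taken from Anthropic's article.

```python
import math

# Hypothetical toy dictionary: three unit-norm feature directions packed
# into a 2-dimensional activation space (more features than dimensions).
DICTIONARY = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(activation, dictionary, max_atoms=1, tol=1e-9):
    """Greedily explain `activation` as a sparse, non-negative sum of atoms.

    Returns (coefficients, residual). Assumes every atom has unit norm.
    """
    residual = list(activation)
    coeffs = [0.0] * len(dictionary)
    for _ in range(max_atoms):
        scores = [dot(residual, atom) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda i: scores[i])
        if scores[best] <= tol:
            break  # nothing left that any atom explains
        coeffs[best] += scores[best]
        residual = [
            r - scores[best] * a for r, a in zip(residual, dictionary[best])
        ]
    return coeffs, residual


# Usage: an activation produced by feature 1 alone, with strength 0.7.
activation = [0.7 * c for c in DICTIONARY[1]]
coeffs, residual = matching_pursuit(activation, DICTIONARY)
print(coeffs, residual)
```

When exactly one feature is active, the greedy step attributes the activation to that feature and leaves essentially no residual. When several features are active at once, interference between the non-orthogonal directions can make the recovered code only approximate, which is one reason the real problem likely needs large scale sparse coding rather than a toy procedure like this.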

Multiple organizations are pursuing these strategies. Researchers in all organizations are keen to hear from people interested in working together on this problem. However, there is a range of views among researchers on how central superposition is as a problem and how tractable it is.

Barriers beyond superposition?

We’ve been blaming superposition for rather a lot of our interpretability woes, which risks giving the misleading impression that a solution to superposition is a solution to mechanistic interpretability. But this seems unlikely. What other problems are we likely to bump up against when interpreting neural networks?

Non-linear representations

Viewing features as directions in activation space assumes that representations are primarily linear. Anthropic have discussed some of the reasons why we can expect representations to be mostly linear. But nonlinear representations are also possible. In nonlinear representations, networks assign different features to activation vectors that have similar directions but different magnitudes. This means that feature-interpretations that are valid in one context are invalid in others. It might be possible to fool ourselves into thinking that a capable model is safe if we look only at its linear representations and not its nonlinear representations.
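
As a small illustration of why direction-only analysis can miss such a code, the Python sketch below compares two activations that point the same way but differ in norm. The magnitude threshold and the feature names are hypothetical, chosen only to make the point; no real network is claimed to use this exact scheme.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def magnitude_readout(activation, threshold=2.0):
    """Hypothetical nonlinear code: the feature depends on the norm,
    not on the direction."""
    return "feature A" if math.hypot(*activation) < threshold else "feature B"


direction = (0.6, 0.8)
small = (0.6, 0.8)   # norm 1
large = (1.8, 2.4)   # same direction, norm 3

# A direction-based view sees the two activations as identical...
print(cosine(small, direction), cosine(large, direction))
# ...while the hypothetical magnitude-based code separates them.
print(magnitude_readout(small), magnitude_readout(large))
```

Both activations have a cosine similarity of essentially 1 with the chosen direction, so an analysis that only asks "which way does this activation point?" would assign them the same interpretation, even though the toy readout treats them as different features.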

Other exotic representations

We don’t yet know the full range of possible representations in transformers or other future architectures. There may be kinds of representations that we don’t yet know how to recognise. One such example might be ‘variable binding’ in Vector Symbolic Architectures, which transformers might be able to emulate.

Intrinsic messiness of representations

Discussions between mechanistic interpretability researchers revealed differences on how messy they expected neural network representations to be:

On one end of the spectrum, researchers expect neural networks to exhibit clearly identifiable features, circuits, and larger scale structural motifs in their hidden representations.
On the other end of the spectrum, the expectation is that networks learn only complicated compressions of large and dense underlying correlations in the world that are irreducible in terms of features that humans can understand.

Which is correct? Probably both: different networks and tasks will likely result in networks closer to one end of the spectrum or the other. The important question is where researchers expect large transformers to lie on this spectrum. Most mechanistic interpretability researchers expect that they lie in between, close to neither extreme.

Even absent extreme views, disagreement between researchers on this question leads to meaningfully different predictions about mechanistic interpretability. For instance, if you expect networks to be collections of dense correlations, then you might put less emphasis on identifying particular circuits or features in them; instead, you might emphasize building up causal models of network behavior in safety-critical settings on a higher level of abstraction.

Describing learning dynamics in terms of circuits

Inasmuch as identifiable circuits exist in neural networks, they must be learned at specific times during training. One example is induction heads. Researchers at Anthropic discovered that the learning of induction heads caused a consistent drop in language model loss curves at a particular phase in training (the ‘induction bump’). There are likely other such circuits waiting to be discovered. If we can characterize them all, we might be able to predict what large models are learning as well as when and why they’re learning it, which will be helpful for ensuring model safety.

Chris Olah suggests that even seemingly smooth learning curves may be composed of lots of small bumps resulting from the emergence of particular circuits, and that there might be even more patterns common across models.

Deep learning theory questions

Mechanistic interpretability involves understanding the representations learned by deep learning systems. Deep learning theory will therefore probably shed light on how to think about those representations fundamentally. Questions in deep learning theory might therefore be tempting targets of inquiry for mechanistic interpretability researchers. Researchers should be cautious when discussing these questions in public, since their answers might be useful for improving capabilities. (This is also true for other, more empirical results in mechanistic interpretability.)

It’s an open question how relevant deep learning theory questions will be to mechanistic interpretability. Here we include a (very incomplete) list of topics that we think might be relevant to a mechanistic understanding of the representations learned by deep networks.

Generalization vs memorization: Do the representations that we’re interpreting generalize or are they shallow memorizations of the data? What is the difference between these kinds of representation? Chris Olah suspects that one form of memorization involves the model using ‘datapoints as its features’ instead of representing datapoints in terms of (generalizing) features. Why do neural networks’ representations generalize at all?
Double descent: Why does this happen? And is it relevant to mechanistic interpretability?
Lottery Tickets: Dangerous circuits might exist in networks at initialization. Is there a way we can remove these safely, for example without running the network even once?
Inductive biases of stochastic gradient descent and other optimizers: Some circuits might be more likely to be learned due to the inductive biases of different optimizers. Understanding the properties of these inductive biases might be important for predicting whether networks will learn safe or unsafe representations.
Grokking: Have recent results regarding grokking (e.g. this and this) resolved questions on the topic that are relevant to mechanistic interpretability research?

More generally, there is interest among researchers in how mechanistic interpretability might serve as a "microscopic theory" of deep learning, in contrast to something like scaling laws as a "macroscopic theory". This frame suggests seeking bridges from microscopic properties like circuits to macroscopic properties like loss curves or scaling laws.

Automating mechanistic interpretability

Judging by the current pace of progress in AI capabilities, we might very soon be able to automate some components of interpretability research. Some signs of life exist in work that uses models to produce descriptions of neurons in image models or describe differences between text distributions. Assuming further automation becomes possible in the short- to medium-term future, how should interpretability research anticipate these changes and adapt?

Increasing automation elevates the importance of thinking about the ‘automated interpretability OODA loop’ in which we use models to help us interpret networks and decide which experiments or interventions to perform on them. One near-term-automatable component of this loop might be the labeling of neurons or directions. If this becomes possible, interpretability research will look less like a warehouse of researchers trying to identify the common features shared by collections of dataset examples and more like getting capable models to do the labeling work; to quantify their uncertainty about the labels; and to propose experiments to reduce this uncertainty. Eventually, we might also want to automate the process of deciding which interventions to perform on the model to improve AI safety.

Increasing automation also elevates the importance of interpretability theory, since we’ll want to be sure that our automated analyses don’t have systematic blindspots. For instance, automatically labeling polysemantic neurons will yield polysemantic labels, which aren’t very helpful for human-understandable, mechanistic descriptions of neural networks.

Research practices and tools

Interpretability demands good epistemics, which can be hard! This challenge is made especially difficult by the complexity of the objects that we’re studying. How do we avoid fooling ourselves about what our models are doing under the hood? How can we be sure we’re making progress?

Study simpler models

One of the ways to get around this is to test our interpretability approaches on simpler models where it’s easier to tell if our findings are true or not. There are a few potential ways to do this:

Simple models that implement simple functions: This is the approach taken by Anthropic in their recent Toy Models of Superposition paper and by Neel Nanda and Tom Lieberum in their work on grokking. In both cases, they study small networks trained on simple tasks. A small network trained to do a single task is likely to have learned a cleaner (and hopefully more interpretable) algorithm than a larger one trained to do many tasks. Redwood Research has coined the term ‘streetlight interpretability’ for work that focuses on models or behaviors that seem easier to interpret.
Compiling (rather than training) networks from programmes: Having access to the ground truth makes it possible to evaluate whether our analyses reveal that truth or not. If it were possible to create a network that implements a known programme, then we might have an easier time reverse engineering that programme from the weights and activations. This is like practicing reverse engineering of binary files by first writing the code for a programme, compiling it into a binary file, and then seeing if we can recover the original programme from the binary.
Solving superposition in any network: In their recent paper on superposition, Anthropic argue that studying superposition would be made easier by having any network that doesn’t have superposition, even if it greatly hurt performance. Such a network would give us a ground truth for what the features in the model are. This might let us study features in superposition in regular models.
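
As a toy illustration of the 'compiling' idea, the Python sketch below writes down a tiny ReLU network whose weights are chosen by hand to implement XOR, and then checks interpretability hypotheses about its hidden units against that known ground truth. The weights, names, and the hypothesis-checking helper are illustrative assumptions; this is not an existing compiler from programmes to networks.

```python
from itertools import product


def relu(x):
    return max(0.0, x)


# Hand-chosen ("compiled") weights for XOR; nothing here is trained.
W_HIDDEN = [(1.0, 1.0), (1.0, 1.0)]  # each hidden unit sums both inputs
B_HIDDEN = [0.0, -1.0]               # unit 0: "at least one on", unit 1: "both on"
W_OUT = (1.0, -2.0)                  # output = OR-like unit minus 2 * AND unit


def forward(x):
    """Return (output, hidden activations) for a pair of binary inputs."""
    hidden = [
        relu(w[0] * x[0] + w[1] * x[1] + b)
        for w, b in zip(W_HIDDEN, B_HIDDEN)
    ]
    out = sum(w * h for w, h in zip(W_OUT, hidden))
    return out, hidden


def hypothesis_holds(unit, predicate):
    """Check whether a hidden unit fires exactly when `predicate` is true,
    over every possible binary input."""
    return all(
        (forward(x)[1][unit] > 0) == predicate(x)
        for x in product((0, 1), repeat=2)
    )


# Because we wrote the programme ourselves, we know which stories are true.
print(hypothesis_holds(1, lambda x: x[0] == 1 and x[1] == 1))  # unit 1 as AND
print(hypothesis_holds(0, lambda x: x[0] == 1 and x[1] == 1))  # unit 0 as AND
```

The point of the exercise is the second check: a plausible-sounding but wrong interpretation of unit 0 is rejected, and we can be sure the rejection is correct because the network's 'source code' is known.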

Study model systems in depth

Biologists study ‘model systems’, such as Drosophila and mice, not because these species are especially fascinating, but because they have already been studied in depth by other researchers. By focusing on species that are already well studied, biologists can build on previous work, gain more general insights, and devise more powerful tools than permitted by only shallow studies of many different species.

InceptionV1 has served as a model system for early mechanistic interpretability work in convolutional image classifiers (see Circuits thread). But no model system has emerged for transformers yet. What should be the Drosophila and mouse of mechanistic interpretability? It seems worthwhile to choose our model systems carefully. Some desiderata might be:

Size and ease of use: We probably want multiple model systems of varying size and capability. Some models might be prohibitively large and difficult to deal with for most researchers. But large models can learn more interesting representations. We need to balance this tradeoff carefully. Models should also be open source so that researchers in different organizations (as well as independent researchers) can study the same network.
Multiple instances with different seeds: Scientists repeat experiments multiple times in order to make statistical claims about phenomena. We’d like to be able to do the same in mechanistic interpretability. Depending on the phenomenon we’re studying, it may be necessary to compare across models. To allow comparisons, there should be multiple instances of our ideal model system. This may only be possible for smaller models due to costs of training.
Training process replicability: Mechanistic interpretability researchers are not only interested in studying representations at the end of training. Knowledge of training dynamics is probably going to be useful for AI safety. An ideal model system would therefore include many training checkpoints as well as references to the data samples used for each training step. Together, these would let researchers replicate the training process exactly, which will be essential for detailed study.
Ease of interpretability versus typicality: Some networks, like SoLU networks, appear to be easier to interpret than others. But most networks used in production are not SoLU networks. Choosing a SoLU network for our model system might make analysis easier but potentially cost us the ability to generalize our claims to more typical networks.

Approaches grounded in the theory of causality

Circuits-level interpretations about neural networks are fundamentally causal interpretations; they make claims such as “Neuron X activates and connects to neuron Y through weight Z, causing neuron Y to activate”. Many kinds of interpretability are similarly causal, but they abstract away the underlying circuits. For instance, feature visualization makes claims that ‘images that contain feature X cause neuron Y to fire maximally’ without reference to the circuits that achieve neuron Y’s selectivity to feature X. Similarly, Meng et al. (2022) use ‘causal tracing’ to isolate parts of a network that store factual associations, letting them modify the network to remove that knowledge without massively damaging performance. Redwood Research are doing significant work on causally grounded methods (Wang et al., 2022; Chan et al., 2022).
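
A minimal sketch of what a causal intervention on activations can look like is given below, in Python. It copies a single hidden activation from a 'clean' run into a 'corrupted' run and measures how much the output moves. The toy weights and function names are hypothetical, and this is a generic patching experiment for illustration, not the specific procedure of Meng et al. (2022) or of the Redwood Research papers cited above.

```python
def relu(x):
    return max(0.0, x)


# Hypothetical toy weights, chosen by hand for illustration.
W_HIDDEN = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
W_OUT = (2.0, 0.0, 0.5)


def hidden_activations(x):
    return [relu(w[0] * x[0] + w[1] * x[1]) for w in W_HIDDEN]


def output_from_hidden(hidden):
    return sum(w * h for w, h in zip(W_OUT, hidden))


def patch_effect(clean_x, corrupted_x, unit):
    """Change in output when one hidden unit in the corrupted run is
    overwritten with its value from the clean run."""
    clean_hidden = hidden_activations(clean_x)
    patched = hidden_activations(corrupted_x)
    baseline = output_from_hidden(patched)
    patched[unit] = clean_hidden[unit]  # the intervention
    return output_from_hidden(patched) - baseline


clean_x, corrupted_x = (1.0, 0.0), (0.0, 0.0)
effects = [patch_effect(clean_x, corrupted_x, u) for u in range(len(W_HIDDEN))]
print(effects)
```

In this toy case the effects are 2.0, 0.0, and 0.5 for the three hidden units, so the intervention attributes most of the clean behaviour to unit 0 and none to unit 1. The claim "unit 0 matters for this input" is then backed by an experiment rather than by correlation alone.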

In general, it seems prudent to ground our interpretability methods firmly in the theory of causality to be sure that we’re making rigorous claims regardless of the level of abstraction. Although analyses grounded in causality are a gold standard, they are often difficult to conduct in most areas of science. Mechanistic interpretability is thus in a unique position: it's easy to make causal inferences in artificial neural networks thanks to the relative ease of running experiments in silico compared with experiments in the physical world. Mechanistic interpretability therefore can and should have much higher standards of evidence than other similar domains of science such as biology.

Field building and research coordination

The field of mechanistic interpretability has grown quickly over the last few years. It’s unclear to most researchers what lessons to draw from this and which actions to take.

A substantial fraction of the growth has been from new research teams associated with organizations. The number of independent researchers is harder to measure but has also been surging. The field should probably try to make it easier for independent researchers to contribute. This might happen through

Organizations increasing the hiring of independent researchers (How best to do this?)
Super-powering independent research with open source tooling
Skill-building, either through programmes such as Redwood Research’s MLAB or resources such as Neel Nanda’s ‘A Barebones Guide to Mechanistic Interpretability Prerequisites’ or ‘An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers’.

If further growth seems positive, how should we do it? In general, growth strategies are dependent on AI timelines: If timelines are short, then waiting for researchers to climb the academic ladder seems suboptimal. Computational neuroscientists seem like a ready source of researchers with both relevant analytical skills and shared interests. Physicists, computer scientists, and engineers offer the potential for deep theoretical insights and practical skills.

As the field grows, we should pay increasing attention to its health. Questions such as “How can we improve coordination between researchers to avoid wasted effort?” and “How should we encourage healthy norms on disagreements?” become relevant. Engaging with and integrating constructive criticism is also a key marker of field health.

Mechanistic interpretability is in a somewhat unique position compared with other domains of science in that most of it happens outside of academia. This has upsides and downsides with respect to publishing norms, epistemics, and coordination that should be carefully managed.

Open Source Tooling

A strong barrier currently facing people trying to get into the field is the lack of good tooling. There's a strong and thriving ecosystem for conventional ML (in particular, core libraries like PyTorch, TensorFlow and JAX, and the HuggingFace ecosystem), which makes ML much easier to get into. This is particularly important for academics, students and independent researchers. But ML infrastructure and tooling is optimized for using models and for computational efficiency, not for easily exposing and accessing the internals of models, intervening on them, and probing how they work. So there's a lot of room for better ML mechanistic interpretability tooling. As an initial step in this direction, Neel Nanda has been developing a library called EasyTransformer. There's also a need for tooling that better integrates interactive visualizations and the web dev ecosystem into Python and ML workflows, as good visualizations are often key to understanding the high-dimensional objects of neural networks.

Theories of impact

Although the work is fundamentally interesting, most researchers are invested in mechanistic interpretability because of its instrumental use for AI safety. In order to improve our positive impact through mechanistic interpretability research, we should have a carefully considered theory of impact. Neel Nanda (list) and Beth Barnes (list) have put together lists of pathways through which interpretability might contribute to AGI safety.

We should think carefully about the relationships between ‘level of progress in mechanistic interpretability’ and each ‘pathway to impact’. Not all pathways to impact are available at all levels of progress. For instance, if we use interpretability in the loss function before we have interpretability that is robust-to-training, we run a serious risk of simply training our networks to be good at hiding dangerous thoughts. We should therefore think carefully about interactions between these pathways to impact.

Even though mechanistic interpretability research appears to be one of the most promising pathways to AGI safety, many researchers are concerned about potential risks resulting from their research:

As mentioned above, mechanistic interpretability and capabilities work may both rely on better deep learning theory. We want to avoid incentivising theory that benefits AI capabilities without worthwhile gains for AI safety.
Mechanistic interpretability might itself become useful for capabilities. For instance, if takeoff is slow, then capabilities researchers might integrate interpretability into their workflow by studying issues with the capabilities-circuits learned by their models and fixing them.
A certain level of trustworthiness through interpretability might incentivise problematic actors that wouldn’t otherwise have used deep learning models to use them. A notable example of this would be DL models in military applications.
Interpretability might be used by capabilities groups for ‘safety washing’, where the safety of dangerous models is oversold. Inadequate interpretability might give many stakeholders a false sense of security regarding powerful AI systems.

Conclusion

It is a very exciting time in mechanistic interpretability research. To some, it represents one of the most plausible paths to avoiding an AI catastrophe. The field is growing quickly and is beginning to see accelerating research progress. Fortunately, it enjoys a high degree of openness between individuals and organizations, which will be important to foster to keep up the pace of research on this urgent problem.

Conjecture is hiring! We’re currently running a hiring round for 9+ roles, including research engineers, ML Engineering leads and some non-technical roles. We’re based in London and are looking for people who are excited about directly cutting at alignment. Interviews are happening on a rolling basis. Apply by the 2nd of December 2022 to be considered for this round. If you have any questions, reach out to jonny@conjecture.dev. To apply and find out more see: https://www.conjecture.dev/careers.

Robert Kirk

Thanks for writing the post, and it's great to see that (at least implicitly) lots of the people doing mechanistic interpretability (MI) are talking to each other somewhat.

Some comments and questions:

I think "science of deep learning" would be a better term than "deep learning theory" for what you're describing, given that I think all the phenomena you list aren't yet theoretically grounded or explained in a mathematical way, and are rather robust empirical observations. Deep learning theory could be useful, especially if it had results concerning the internals of the network, but I think that's a different genre of work to the science of DL work.
In your description of the relevance of the lottery ticket hypothesis (LTH), it feels like a bit of a non-sequitur to immediately discuss removing dangerous circuits at initialisation. I guess you think this is because lottery tickets are in some way about removing circuits at the beginning of training (although currently we only know how to find out which circuits by getting to the end of training)? I think the LTH potentially has broader relevance for MI, i.e.: if lottery tickets do exist and are of equal performance, then it's possible they'd be easier to interpret (due to increased sparsity); or just understanding what the existence of lottery tickets means for what circuits are more likely to emerge during neural network training.
When you say "Automating Mechanistic Interpretability research", do you mean automating (1) the task of interpreting a given network (automating MI), or automating (2) the research of building methods/understanding/etc. that enable us to better-interpret neural networks (automating MI Research)? I realise that a lot of current MI research, even if the ultimate goal is (2), is mostly currently doing (1) as a first step.
Most of the text in that section implies automating (1) to me, but "Eventually, we might also want to automate the process of deciding which interventions to perform on the model to improve AI safety" seems to lean more towards automating (2), which comes under the general approach of automating alignment research. Obviously it would be great to be able to do both of them, but automating (1) seems both much more tractable, and also probably necessary to enable scalable interpretability of large models, whereas (2) is potentially less necessary for MI research to be useful for AI safety.

AI ALIGNMENT FORUM