Chris Olah and the rest of the rest of the OpenAI Clarity team just published “Zoom In: An Introduction to Circuits,” a Distill article about some of the transparency research they've been doing which I think is very much worth taking a look at. I'll try to go over some of my particular highlights here, but I highly recommend reading the full article.

Specifically, I have previously written about Chris's belief that the field of machine learning should be more like the natural sciences in seeking understanding first and foremost. I think “Zoom In” is a big step towards making something like that a reality, as it provides specific, concrete, testable claims about neural networks upon which you might actually be able to build a field. The three specific claims presented in the article are:

Claim 1: Features

Features are the fundamental unit of neural networks. They correspond to directions [in the space of neuron activations]. These features can be rigorously studied and understood.

Claim 2: Circuits

Features are connected by weights, forming circuits. These circuits can also be rigorously studied and understood.

Claim 3: Universality

Analogous features and circuits form across models and tasks.

“Zoom In” provides lots of in-depth justification and examples for each of these claims which I will mostly leave to the actual article. Some highlights, however:

  • How do convolutional neural networks (CNNs) detect dogs in an orientation-invariant way? It turns out they pretty consistently separately detect leftward-facing and rightward-facing dogs, then union the two together.
  • How do CNNs detect foreground-background boundaries? It turns out they use high-low frequency detectors—which look for high-frequency patterns on one side and low-frequency patterns on the other side—in a bunch of different possible orientations.

What's particularly nice about “Zoom In”'s three claims in my opinion, however, is that they give other researchers a foundation to build upon. Once it's established that neural networks have meaningful features and circuits in them, discovering new such circuits becomes a legitimate scientific endeavor—especially if, as the third claim suggests, those features and circuits are universal across many different networks. From “Zoom In:”

One particularly challenging aspect of being in a pre-paradigmatic field is that there isn’t a shared sense of how to evaluate work in interpretability. There are two common proposals for dealing with this, drawing on the standards of adjacent fields. Some researchers, especially those with a deep learning background, want an “interpretability benchmark” which can evaluate how effective an interpretability method is. Other researchers with an HCI background may wish to evaluate interpretability methods through user studies.

But interpretability could also borrow from a third paradigm: natural science. In this view, neural networks are an object of empirical investigation, perhaps similar to an organism in biology. Such work would try to make empirical claims about a given network, which could be held to the standard of falsifiability.

Why don’t we see more of this kind of evaluation of work in interpretability and visualization? Especially given that there’s so much adjacent ML work which does adopt this frame! One reason might be that it’s very difficult to make robustly true statements about the behavior of a neural network as a whole. They’re incredibly complicated objects. It’s also hard to formalize what the interesting empirical statements about them would, exactly, be. And so we often get standards of evaluations more targeted at whether an interpretability method is useful rather than whether we’re learning true statements.

Circuits side steps these challenges by focusing on tiny subgraphs of a neural network for which rigorous empirical investigation is tractable. They’re very much falsifiable: for example, if you understand a circuit, you should be able to predict what will change if you edit the weights. In fact, for small enough circuits, statements about their behavior become questions of mathematical reasoning. Of course, the cost of this rigor is that statements about circuits are much smaller in scope than overall model behavior. But it seems like, with sufficient effort, statements about model behavior could be broken down into statements about circuits. If so, perhaps circuits could act as a kind of epistemic foundation for interpretability.

I, for one, am very excited about circuits as a direction for building up an understanding-focused interpretability field and want to congratulate Chris and the rest of OpenAI Clarity for putting in the hard work of doing the foundational work necessary to start building a real field around neural network interpretability.

New Comment
10 comments, sorted by Click to highlight new comments since: Today at 2:26 AM

Is conway's life, with random starting state, interpretable? If you zoom in on any single square, it is trivial to predict what it will do. Zoom out and you need a lot of compute. There is no obvious way to predict if a cell will be on in 1000000 timesteps without brute force simulating the whole thing (at least the past light cone). What would an interpretability tool for conway's life look like?

I think this question sort of misses what matters.

There's all sorts of computations which (probably) aren't very interpretable; SHA-256 is a solid example. But it's an empirical fact that our physical world has a lot more interpretable structure in it than SHA-256 computations. We have things like trees or cars, large-scale abstract structures which repeat over and over again, and display similar predictable behavior across instances despite different small-scale configurations.

Trained neural networks are not basically-random computations (like SHA-256); they're trained on the real world. We know that the real world has a lot of interpretable structure, so it's feasible that a network trained on the real world will reflect that structure. That's what Olah et al's research is about - backing out the structure of the real world from a network trained on the real world.

It's the coupling of the (trained) network to the real world which plays the central role. Something like Conway's game of life doesn't have any coupling to the real world, so it's not really analogous.

I think also Conway's game of life has a large bestiary of 'stable patterns' that you could figure out and then dramatically increase your ability to predict things.

As is demonstrated by the Hashlife algorithm, that exploits the redundancies for a massive speedup. That's not possible for things like SHA-256 (by design)!

The worry I'd have about this interpretability direction is that we become very good at telling stories about what 95% of the weights in neural networks do, but the remaning 5% hides some important stuff, which could end up including things like mesa-optimizers or deception. Do you have thoughts on that?

The "Zoom In" work is aimed at understanding what's going on in neural networks as a scientific question, not directly tackling mesa-optimization. This work is relevant to more application-oriented interpretability if you buy that understanding what is going on is an important prerequisite to applications.

As the original article put it:

And so we often get standards of evaluations more targeted at whether an interpretability method is useful rather than whether we’re learning true statements.

Or, as I put it in Embedded Curiosities:

One downside of discussing these problems as instrumental strategies is that it can lead to some misunderstandings about why we think this kind of work is so important. With the “instrumental strategies” lens, it’s tempting to draw a direct line from a given research problem to a given safety concern.

A better understanding of 'circuits' in the sense of Zoom In could yield unexpected fruits in terms of safety. But to name an expected direction: understanding the algorithms expressed by 95% of a neural network, one could re-implement those independently. This would yield a totally transparent algorithm. Obviously a further question to ask is, how much of a performance hit do we take by discarding the 5% we don't understand? (If it's too large, this is also a significant point against the idea that the 'circuits' methodology is really providing much understanding of the deep NN from a scientific point of view.)

I'm not claiming that doing that would eliminate all safety concerns with the resulting reimplementation, of course. Only that it would address the specific concern you mention.

[not affiliated with the author but have thought a fair bit about this sort of thing]

Once you understand what causes these circuits to arise, you could hopefully regularise for interpretability and boost that rookie 95% number up to 99.9%, where you could really believe that pruning the rest isn't a big deal.

I think for the remaining 5% to be hiding really big important stuff like the presence of optimization (which is to say, mesa-optimization) or deceptive cognition, it has to be the case that there was adversarial obfuscation (e.g. gradient hacking). Of course, I'm only hypothesizing here, but it seems quite unlikely for that sort of stuff to just be randomly obfuscated.

Given that assumption, I think it's possible to translate 95% transparency into a safety guarantee: just use your transparency to produce a consistent gradient away from deception such that your model never becomes deceptive in the first place and thus never does any sort of adversarial obfuscation.[1] I suspect that the right way to do this is to use your transparency tools to enforce some sort of simple condition that you are confident in rules out deception such as myopia. For more context, see my comment here and the full “Relaxed adversarial training for inner alignment” post.


  1. It is worth noting that this does introduce the possibility of getting obfuscation by overfitting the transparency tools, though I suspect that that sort of overfitting-style obfuscation will be significantly easier to deal with than actively adversarial obfuscation by a deceptive mesa-optimizer. ↩︎

I think for the remaining 5% to be hiding really big important stuff like the presence of optimization (which is to say, mesa-optimization) or deceptive cognition, it has to be the case that there was adversarial obfuscation (e.g. gradient hacking). Of course, I'm only hypothesizing here, but it seems quite unlikely for that sort of stuff to just be randomly obfuscated.

I read Adversarial Examples are Features Not Bugs as suggesting that this sort of thing happens by default, and the main question is "sure, some of it happens by default, but can really big stuff happen by default?". But if you imagine a LSTM implementing a finite state machine, or something, it seems quite possible to me that it will mostly be hard to unravel instead of easy to unravel, while still being a relevant part of the computation.

I'll probably post a child comment after I actually read the article, but I want to note before I do that I think the power of ResNets are evidence against these claims. Having super-deep networks with residual connections promote a picture that looks much more like a continuous "massaging" of the data than a human-friendly decision tree.