Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort
We would like to thank Atticus Geiger for his valuable feedback and in-depth discussions throughout this project.
tl;dr:
Activation patching is a common method for finding model components (attention heads, MLP layers, …) relevant to a given task. However, features rarely occupy entire components: instead, we expect them to form non-basis-aligned subspaces of these components.
We show that the obvious generalization of activation patching to subspaces is prone to a kind of interpretability illusion. Specifically, it is possible for a 1-dimensional subspace patch in the IOI task to significantly affect predicted probabilities by activating a normally dormant pathway outside the IOI circuit. At the same time, activation patching the entire MLP layer where this subspace lies has no such effect. We call this an "MLP-In-The-Middle" illusion.
We show a simple mathematical model of how this situation may arise more generally, and a priori / heuristic arguments for why it may be common in real-world LLMs.
Introduction
The linear representation hypothesis suggests that language models represent concepts as meaningful directions (or subspaces, for non-binary features) in the much larger space of possible activations. A central goal of mechanistic interpretability is to discover these subspaces and map them to interpretable variables, as they form the “units” of model computation.
However, the residual stream activations (and maybe even the neuron activations!) mostly don’t have a privileged basis. This means that many meaningful subspaces won’t be basis-aligned; rather than iterating over possible neurons and sets of neurons, we need to consider arbitrary subspaces of activations. This is a much larger search space! How can we navigate it?
A natural approach to check “how well” a subspace represents a concept is to use a subspace analogue of the activation patching technique. You run the model on input A, but with the activation along the subspace taken from an input B that differs from A only in the value of the concept in question. If the subspace encodes the information used by the model to distinguish B from A, we expect to see a corresponding change in model behavior (compared to just running on A).
Surprisingly, a subspace can have a causal effect when patched without being meaningful! In this blog post, we present a mathematical example with a spurious direction (1-dimensional subspace[1]) that looks like the correct direction when patched. We then show empirical evidence in the indirect object identification task, where we find a direction with a causal effect on the model’s performance consistent with patching in a task-relevant binary feature, despite it being a subspace of a component that’s outside the IOI circuit. We show that this empirical example closely corresponds to the mathematical example.
We consider this result an important example of the counterintuitive properties of activation patching, and a note of caution when applying activation patching to arbitrary subspaces, such as when using techniques like Distributed Alignment Search.
An abstract example of the illusion
Background: activation patching
Consider a language model completing the sentence “The Eiffel Tower is in” with “ Paris”. How can we find which component of the model is responsible for knowing that “Paris” is the right answer for this landmark’s location? Activation patching, sometimes also referred to as “interchange intervention”, “resample ablation” or “causal tracing”, is a technique that can be used to find model components that are relevant for such a task.
Activation patching works by running the model on input A (e.g. “The Eiffel Tower is in”), storing the activation of some component c, and then running the model on input B (e.g., “The Colosseum is in”) but with the activation of c taken from A. If we find that patching a certain component makes the model output “Paris” on input B, this suggests that this component is important for the task[2].
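For concreteness, here is a minimal sketch of component-level activation patching using TransformerLens; the model, prompts, and the choice of hook point (the residual stream after layer 8) are illustrative assumptions rather than the setup analyzed later in this post:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # illustrative model choice

tokens_A = model.to_tokens("The Eiffel Tower is in")
tokens_B = model.to_tokens("The Colosseum is in")

# Run on A and cache all activations.
_, cache_A = model.run_with_cache(tokens_A)

# Patch one example component: the residual stream after layer 8, at the final token.
hook_name = utils.get_act_name("resid_post", 8)

def patch_hook(activation, hook):
    # activation: [batch, seq, d_model]; overwrite the last position with A's cached value.
    activation[:, -1, :] = cache_A[hook.name][:, -1, :]
    return activation

patched_logits = model.run_with_hooks(tokens_B, fwd_hooks=[(hook_name, patch_hook)])
top_token = patched_logits[0, -1].argmax().item()
print(model.tokenizer.decode(top_token))  # does the model now complete with " Paris"?
```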
Activation patching can be straightforwardly generalized to patching just a subspace of a component instead of the entire component, by patching in the dot products with an orthonormal basis of the subspace, but leaving dot products with the orthogonal complement the same. Equivalently, we apply a rotation (whose first k rows correspond to the subspace), patch the first k entries in this rotated basis, and then rotate back.
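A minimal sketch of this subspace patch in plain PyTorch, assuming `subspace` is a d×k matrix with orthonormal columns:

```python
import torch

def patch_subspace(act_A: torch.Tensor, act_B: torch.Tensor, subspace: torch.Tensor) -> torch.Tensor:
    """Patch a k-dimensional subspace from run B into run A.

    act_A, act_B: [d] activations of the same component on the two inputs.
    subspace:     [d, k] matrix with orthonormal columns spanning the subspace.
    """
    coeffs_A = subspace.T @ act_A  # [k] coordinates of A along the subspace
    coeffs_B = subspace.T @ act_B  # [k] coordinates of B along the subspace
    # Keep A's component in the orthogonal complement, take B's component in the subspace.
    return act_A + subspace @ (coeffs_B - coeffs_A)
```

For k = 1 this reduces to the formula $u_A^{\text{patched}} = u_A + (c_B - c_A)v$ derived in the next section.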
Where activation patching can go wrong for subspaces
As it turns out, activation patching over arbitrary subspaces gives us a lot of power! In particular, imagine patching a one-dimensional subspace (represented by a vector v) from B into A. Let’s decompose the activations from the two prompts along v and its orthogonal complement:
$$u_A = u_A^\perp + c_A v, \qquad u_B = u_B^\perp + c_B v,$$

where $c_A, c_B$ are the coefficients of the activations of A and B along $v$. Then the patched activation will be

$$u_A^{\text{patched}} = u_A^\perp + c_B v,$$

which simplifies to

$$u_A^{\text{patched}} = u_A + (c_B - c_A) v.$$

From this formula, we see that patching $v$ allows us to effectively add multiples of it to the activation, as long as the two examples differ along $v$. In particular, if activations always lie in a subspace not containing $v$, patching can take them "off distribution" by taking them outside this subspace - and this gives us room to do counterintuitive things with the representation.
Let’s unpack this a bit. In order for the patch to have an effect on model behavior, two properties are necessary:

1. Correlation: the activations of A and B must differ along $v$ (i.e. $c_A \neq c_B$), otherwise the patch doesn’t change anything.
2. Causal effect: adding multiples of $v$ to the activation must change the downstream computation, otherwise the change the patch makes is never used.
The crux of the problem is that we may get (1) and (2) from two completely different sources, even in a model component that doesn't participate in the computation. Namely, we can form $v$ as $v = v_{\text{irrelevant}} + v_{\text{dormant}}$, where

- $v_{\text{irrelevant}}$ is a direction that correlates with the patched information (giving us property 1), but has no causal effect on the output;
- $v_{\text{dormant}}$ is a direction that does have a causal effect on the output (giving us property 2), but is (nearly) constant on the inputs we care about - a normally inactive pathway.
Choosing v as the sum of the two creates a “bridge” between these two directions, such that patching along it uses the variation in the correlational component to activate the (previously inactive) causal component (see the picture below).
Note that, at first glance, it is not clear if such pairs of directions $(v_{\text{irrelevant}}, v_{\text{dormant}})$ exist at all: maybe every direction in activation space that is causal also correlates with the information being patched, or the other way around! In such worlds, the above example (and the corresponding interpretability illusion) would not exist. However, we show both a priori arguments and an example in the wild suggesting that this situation is common.
Let’s next distill the essence of the idea in a concrete example.
Setup for the example
In the simplest possible scenario, suppose we have a scalar input $x$ that can take values in $\{-1, 1\}$, and we are doing regression where the target is the input itself (i.e. $y = x$). Consider a linear model with three "hidden neurons" of the form $x \mapsto W_{out} W_{in}^T x$[3], where
$$W_{in} = (1, 1, 0), \qquad W_{out} = (1, 0, 2).$$

We have $W_{out} W_{in}^T = 1 \times 1 + 1 \times 0 + 0 \times 2 = 1$, so the model implements the identity $x \mapsto x$ and thus performs the task perfectly. Let’s look at each of the features in the standard basis $(e_1, e_2, e_3)$:

- the first hidden neuron ($e_1$) takes the value $x$ and has output weight 1: it both correlates with the input and causally drives the output - the "true" feature direction;
- the second hidden neuron ($e_2$) also takes the value $x$, but has output weight 0: it correlates with the input but is causally irrelevant;
- the third hidden neuron ($e_3$) is always 0, but has output weight 2: it is dormant on the data distribution, yet has a causal effect if anything activates it.
We are interested in patching the "feature" x itself, so that patching from x′ into x should make the model output x′ instead of x.
Geometric intuition
So what could go wrong in this example when we do activation patching? Let’s first consider the case when things go right. If we patch along the subspace spanned by the first feature from input x′ into input x, we simply replace the value of the first hidden unit: (x,x,0) becomes (x′,x,0), and then the x′ propagates to the output. This is summarized in the table below:
However, this gets more interesting when we patch along the direction $v$ given by the sum of the 2nd and 3rd neurons, $e_2 + e_3$. If $v^\perp$ denotes the orthogonal complement of $v$, then to patch along $v$ from $x'$ into $x$, we:

- decompose each hidden activation into its component along $v$ and its component in $v^\perp$;
- keep the $v^\perp$ component of the activation from $x$;
- replace the component along $v$ with the corresponding component from $x'$.
Below is a picture illustrating how the patching works (the first neuron is omitted to fit in 2D):
This is summarized in the table below:
As we see, patching from $x'$ into $x$ along $v$ results in the hidden activation $(x, 0, x')$! In particular, the behavior of the model when patching along $v$ is identical to the case when we patch along the "true" subspace $e_1$. This is counterintuitive for a few reasons:

- $v$ is orthogonal to $e_1$, the direction the model actually uses to transmit $x$ to the output;
- neither of $v$’s components works on its own: patching $e_2$ alone has no downstream effect (its output weight is 0), and patching $e_3$ alone does nothing (it is 0 on both inputs, so they don’t differ along it);
- the patch achieves its effect by activating the dormant neuron $e_3$, taking the hidden activations off-distribution.
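Here is a small numerical sketch of the toy example (the helper names are ours):

```python
import torch

W_in = torch.tensor([1., 1., 0.])   # hidden activation for input x is x * W_in = (x, x, 0)
W_out = torch.tensor([1., 0., 2.])  # the output is the dot product of W_out with the hidden activation

def hidden(x):
    return x * W_in

def patch_1d(act_A, act_B, v):
    """Patch the 1-D subspace spanned by v from the run on x' (act_B) into the run on x (act_A)."""
    v = v / v.norm()
    return act_A + (act_B @ v - act_A @ v) * v

x, xp = 1.0, -1.0

# Patching the "true" direction e1 flips the output, as expected:
e1 = torch.tensor([1., 0., 0.])
print(patch_1d(hidden(x), hidden(xp), e1) @ W_out)   # -> -1.0, i.e. x'

# Patching v = e2 + e3 also flips the output, via the dormant third neuron:
v = torch.tensor([0., 1., 1.])
print(patch_1d(hidden(x), hidden(xp), v))            # -> (1., 0., -1.), i.e. (x, 0, x')
print(patch_1d(hidden(x), hidden(xp), v) @ W_out)    # -> -1.0, i.e. x'

# But patching e2 and e3 separately (as a basis-aligned 2-D subspace) does nothing:
print(torch.tensor([x, xp, 0.]) @ W_out)             # -> 1.0, i.e. x (unchanged)
```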
We remark that this construction extends to subspaces of arbitrary dimension by e.g. adding more irrelevant directions.
We term this an “illusion”, because the patch succeeds by turning on a “dormant” neuron, and never even touching the correct circuit for the task. Note that calling this an illusion may be considered a judgment call, and this could be viewed as patching working as intended; see the appendix for more discussion of the nuances here.
An example in the wild: MLP-In-The-Middle Illusion
Finding dormant / irrelevant directions in MLP activations
Why do we expect to find dormant and irrelevant directions in the wild? While they might seem like a waste of model capacity, there are a priori / heuristic arguments for their existence in MLP layers specifically:

- Irrelevant directions: the MLP hidden layer is several times wider than the residual stream, so the output matrix $W_{out}$ has a large nullspace. Directions in this nullspace have no causal effect on the output, but since $W_{in}$ is (close to) full rank, they can still correlate with features represented in the residual stream (see the appendix).
- Dormant directions: model computation tends to be sparse, and most features in the residual stream are written to by only a handful of layers. So it is common for an MLP layer to be able to write to a feature direction through $W_{out}$ without ever doing so on the data distribution.
In summary, whenever some circuit involves two layers communicating some feature via the residual stream (i.e. using some skip connections), we predict that an MLP layer in the middle has a good chance of having a causal effect as in our mathematical example[4]. Next, we discuss a real-world example where we found the elements of our abstract illusion.
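As a rough illustration of the nullspace argument, assuming GPT-2-small-sized dimensions (d_model = 768, d_mlp = 3072):

```python
import torch

d_model, d_mlp = 768, 3072           # assumed GPT-2-small widths of residual stream / MLP hidden layer
W_out = torch.randn(d_mlp, d_model)  # stand-in for an MLP output matrix (hidden -> residual stream)

rank = torch.linalg.matrix_rank(W_out).item()
# Hidden-space directions h with h @ W_out = 0 have no effect on the residual stream:
nullspace_dim = d_mlp - rank
print(rank, nullspace_dim)           # rank <= 768, so at least 2304 causally irrelevant hidden directions
```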
Real-world case study: the IOI task
In the indirect-object-identification (IOI) task, the model is prompted with sentences like "Lisa and Max went to the park and Max gave a flower to”, with “ Lisa” being the correct completion. Prior work has shown that there is a group of 4 heads in layers 7 and 8, the S-Inhibition heads, which compute information about both the position of the repeated name (first in the sentence, as in “Max and Lisa… Max gave…” vs second in the sentence, as in “Lisa and Max… Max gave…”) and the identity of the repeated name in the sentence. After that, another class of heads (the name mover heads) in layers 9, 10 and 11 use this as a signal to attend to the name that only appears once in the sentence, so that they can copy it to the final output.
Intuitively, the position signal (1st vs 2nd name in the sentence) seems like it could plausibly be represented as a binary feature (see also work on synthetic positional embeddings) occupying a one-dimensional subspace. How can we find this subspace?
Distributed alignment search
To find a subspace corresponding to the position information as it is used by the model, we used Distributed Alignment Search (DAS), a technique that finds task-related subspaces using gradient descent. Specifically, we parametrize a one-dimensional subspace of the residual stream at the END token (after layer 8, i.e. between the S-Inhibition heads and the name mover heads), patch along it between prompts that differ only in their position information, and optimize the direction with gradient descent so that the patch flips the model’s prediction.
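A heavily simplified sketch of this optimization; the loss, token handling, and variable names are illustrative assumptions rather than our exact training setup:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad_(False)          # we only optimize the direction, not the model

hook_name = utils.get_act_name("resid_post", 8)
direction = torch.nn.Parameter(torch.randn(model.cfg.d_model) / model.cfg.d_model ** 0.5)
opt = torch.optim.Adam([direction], lr=1e-2)

def das_step(tokens_A, tokens_B, end_pos, flipped_tok, original_tok):
    """One gradient step: patch the learned direction from B into A at the END position,
    and reward the patched run for preferring B's answer over A's."""
    v = direction / direction.norm()
    with torch.no_grad():
        _, cache_B = model.run_with_cache(tokens_B)

    def patch_hook(act, hook):
        a = act[:, end_pos, :]
        b = cache_B[hook.name][:, end_pos, :]
        act = act.clone()
        act[:, end_pos, :] = a + ((b - a) @ v)[:, None] * v
        return act

    logits = model.run_with_hooks(tokens_A, fwd_hooks=[(hook_name, patch_hook)])
    loss = -(logits[:, -1, flipped_tok] - logits[:, -1, original_tok]).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```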
The direction found is able to flip model predictions about 80% of the time when patching between prompts that differ only in position information (on a test set using other templates and names). But how does it do that? We extensively validated that this subspace does the "right" thing through a series of sanity checks.
These sanity checks and the single-dimensional nature of the subspace make us fairly confident that this is indeed a causally meaningful subspace of model activations representing the position information as used by the model. However, is this the only subspace whose patching has a causal effect like this?
Finding a position direction in MLP8
We can try to find another direction by running DAS on any other model component. However, a particularly interesting one is MLP8: it is between the S-Inhibition and name mover heads - so can potentially serve as a “middleman” for the signal transmitted between them - but isn’t implicated in the IOI circuit. We run the optimization described above, but this time on the hidden neuronal activations (after non-linearity) of MLP8 at the END token.
We find that while patching the full activations between examples with opposite position information (ABB -> BAB and vice versa) increases the logit difference between the correct and incorrect name by ~0.4 (so is in fact helpful for the task), patching just the direction that DAS finds decreases it by ~1.3, a ~3x stronger effect in the opposite direction. For some intuition, this is about ~30-40% of the logit difference lost when patching the entire residual stream after layer 8 at the END token, which brings model performance down to ~30% (from ~99%). This is a nontrivial contribution, since the true circuit itself is distributed across multiple heads - so this MLP would appear as a significant circuit component.
How does patching the direction achieve this causal effect? Let $v_{\text{mlp8}}$ be the subspace direction that we found in MLP8, and let $v_{\text{true}}$ be the "ground truth" position direction in the residual stream we validated earlier. We find that $W_{out} v_{\text{mlp8}}$ yields a direction very close to $v_{\text{true}}$ - the cosine similarity is[5] 0.52. This suggests that the downstream causal mechanism is similar to that of the "ground truth" direction we found.
Next, we will show how this direction maps onto the mathematical example of the illusion, by showing it can be (approximately) seen as the sum of an irrelevant and dormant neuron.
Decomposing into an irrelevant and dormant direction
How could we decompose our subspace to match the two components in our mathematical example? As explained in our heuristic arguments, $W_{out}$ always has a high-dimensional null space, and all vectors in it have no causal effect on the output. Thus, it is natural to use the orthogonal decomposition

$$v_{\text{mlp8}} = \text{null\_proj} + \text{proj},$$

where null_proj is the component of $v_{\text{mlp8}}$ in the nullspace of $W_{out}$ and proj is the remaining part (the component in the orthogonal complement of the nullspace). We hypothesize that null_proj is the irrelevant component and proj is the dormant component. We find that

$$\|\text{proj}\|_2 \approx \|\text{null\_proj}\|_2;$$

in other words, the direction is about equal parts in the nullspace and its orthogonal complement.
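A sketch of how such a decomposition can be computed; the DAS direction is replaced by a random placeholder here, and we assume TransformerLens weight conventions (W_out of shape [d_mlp, d_model], acting on the right of the hidden activations):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
W_out = model.blocks[8].mlp.W_out.detach()   # [d_mlp, d_model]: maps MLP8 hidden acts to the residual stream

v_mlp8 = torch.randn(model.cfg.d_mlp)        # placeholder for the trained DAS direction in MLP8
v_mlp8 = v_mlp8 / v_mlp8.norm()

# proj      = component in the column space of W_out (can affect the residual stream),
# null_proj = component that W_out maps to zero (no causal effect on the output).
proj = W_out @ (torch.linalg.pinv(W_out) @ v_mlp8)
null_proj = v_mlp8 - proj

print(proj.norm().item(), null_proj.norm().item())  # for the real v_mlp8, these are roughly equal
print((null_proj @ W_out).abs().max().item())       # ~0: the null component indeed has no output effect
```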
Next, we show that the nullspace component null_proj is much more correlated with the position information than the remaining component proj. We sample a set of prompts which are of the form ABB or ABA, take their MLP8 activations and project those onto null_proj and proj:
We see that both proj and null_proj are correlated with the position information. However, the difference between the ABB and ABA projections onto null_proj is much bigger than the difference onto proj. Thus, we seem to have found a decomposition that matches the mathematical construction of the illusion quite well.
Both components are necessary
The proj component is logically necessary, because without it the patch can have no causal effect. But is null_proj necessary at all for the direction to have a causal effect? To investigate this, we patched along proj, the part of the DAS subspace that does not point into the nullspace of $W_{out}$. The result is that the effect (as measured by the decrease in logit difference) shrinks by a factor of ~3.3x.
Thus, both components are necessary for the patch to have its causal effect. We also note that patching the 2-dimensional subspace spanned by proj and null_proj also loses the effect of the $v_{\text{mlp8}}$ patch. This makes sense given the discussion of our mathematical example: by patching both components separately, there is no interaction between them, and consequently the causal effect is limited to what can be achieved via the (relatively weak) correlation in the proj direction.
Patching along the subspace changes the output distribution of MLP8
Next, we show that patching along the MLP8 direction $v_{\text{mlp8}}$ activates a "dormant" pathway in the model, so that the output of MLP8 contributes significantly to activations along the "ground truth" subspace $v_{\text{true}}$ in the residual stream, where previously there was no such contribution.
Specifically, we patch between prompts with unrelated names for all combinations of position (1st vs 2nd) in the two names. For example, DCD -> ABB means patching from a sentence (DCD) where the repeated name comes first into a sentence (ABB) where it comes second and the two names A, B are different from C, D. For each combination, we project the activations of the output of MLP8 on the “ground truth” residual stream direction vtrue. The dashed distributions are the corresponding projections without any intervention grouped by position:
The clean (unpatched) projections are relatively close to zero and don't differ much between different name patterns, suggesting that MLP8 doesn’t write much to the position subspace (as we also confirmed with activation patching the entire MLP8). The same is true when we patch patterns containing the same positional information (e.g. CDD -> ABB or DCD -> BAB), which makes sense given the formula for patching: in this case, $c_A$ and $c_B$ are close to each other. This is consistent with the desired causal effect, since in this case the intervention should ideally not change anything.
However, when we patch in the opposite positional information, MLP8 suddenly writes a large value to the position subspace. Thus, patching $v_{\text{mlp8}}$ activates a previously dormant direction, suggesting that the subspace isn’t faithful to the model’s actual computation on the task.
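A simplified sketch of this measurement; the directions are random placeholders, and the prompt and END-position handling is omitted:

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
hidden_name = utils.get_act_name("post", 8)      # MLP8 hidden activations (after the nonlinearity)
out_name = utils.get_act_name("mlp_out", 8)      # MLP8's contribution to the residual stream

v_mlp8 = torch.randn(model.cfg.d_mlp); v_mlp8 = v_mlp8 / v_mlp8.norm()    # placeholder DAS direction
v_true = torch.randn(model.cfg.d_model); v_true = v_true / v_true.norm()  # placeholder "ground truth" direction

def mlp8_out_along_v_true(tokens, cache_B=None, end_pos=-1):
    """Project MLP8's output at END onto v_true, optionally patching v_mlp8 from a cached run B."""
    store = {}

    def record_hook(act, hook):
        store["mlp_out"] = act[:, end_pos, :].detach()

    def patch_hook(act, hook):
        a, b = act[:, end_pos, :], cache_B[hook.name][:, end_pos, :]
        act[:, end_pos, :] = a + ((b - a) @ v_mlp8)[:, None] * v_mlp8
        return act

    hooks = [(out_name, record_hook)]
    if cache_B is not None:
        hooks.append((hidden_name, patch_hook))
    model.run_with_hooks(tokens, fwd_hooks=hooks)
    return store["mlp_out"] @ v_true

# Usage: with cache_B from a prompt carrying the opposite position information,
#   mlp8_out_along_v_true(tokens_clean)           is ~0 (MLP8 barely writes to v_true),
#   mlp8_out_along_v_true(tokens_clean, cache_B)  becomes large (the dormant pathway turns on).
```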
DAS subspaces exist in MLPs with random weights
Finally, how certain are we that MLP8 doesn’t actually matter for the IOI task? While we find the IOI paper analysis convincing, to make our results more robust to the possibility that it does matter, we also design a further experiment.
Specifically, we randomly sampled MLP weights and biases such that the norm of the output activations matches that of MLP8. As random MLPs might lead to nonsensical text generation, we don’t replace MLP8 with the random MLP; instead, we train a subspace using DAS on the random MLP’s hidden activations and add the difference between the patched and unpatched output of the random MLP to the real output of MLP8. This setup finds a subspace that reduces the logit difference even more than the $v_{\text{mlp8}}$ direction.
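A sketch of the intervention, with GPT-2-small-sized dimensions assumed and without the output-norm matching:

```python
import torch

d_model, d_mlp = 768, 3072                               # assumed GPT-2-small dimensions
W_in_r = torch.randn(d_model, d_mlp) / d_model ** 0.5    # random MLP weights (in the real experiment,
W_out_r = torch.randn(d_mlp, d_model) / d_mlp ** 0.5     #  rescaled so the output norm matches MLP8's)

def random_mlp_hidden(resid_mid):                        # [batch, d_model] -> [batch, d_mlp]
    return torch.nn.functional.gelu(resid_mid @ W_in_r)

v_rand = torch.randn(d_mlp)                              # DAS direction trained on the random MLP's hidden acts
v_rand = v_rand / v_rand.norm()

def random_mlp_patch_diff(resid_mid_A, resid_mid_B):
    """Difference between the random MLP's patched and unpatched outputs on input A.
    This vector is added on top of MLP8's real output in the residual stream."""
    h_A, h_B = random_mlp_hidden(resid_mid_A), random_mlp_hidden(resid_mid_B)
    h_patched = h_A + ((h_B - h_A) @ v_rand)[:, None] * v_rand
    return (h_patched - h_A) @ W_out_r
```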
This suggests that the existence of the vmlp8 subspace is less about what information MLP8 contains, and more about where MLP8 is in the network.
Discussion and takeaways
What conclusions can we draw from our findings? First, let's take a step back. The core assumptions of mechanistic interpretability are that model computation can be decomposed into simpler building blocks that represent meaningful features of the input, and that the way these building blocks interact to produce behavior can be understood by humans.
But what are the right building blocks, and how do we find them? Many existing mechanistic analyses focus on components (attention heads, MLP layers) as building blocks (IOI, the docstring circuit, ...), and make heavy use of activation patching between carefully chosen examples to localize and verify the causal relevance of a component.
However, each component represents multiple features (and sometimes even multiple task-relevant features, like name and position information in the IOI task). To precisely localize a feature of interest, we need to zoom in on subspaces of these components - and as we discussed, these subspaces need not be basis-aligned.
To do this, a natural idea is to make activation patching more surgical, and carry out the same kind of analysis on arbitrary subspaces of components. A natural intuition one might have here is: if patching an entire component has no causal effect, then neither will patching any subspace of it.
As we saw in this blog post, this intuition is wrong[6]! In particular, we took a circuit heavily analyzed on the component level, and then found a direction in an MLP layer whose patching has a strong causal effect on model behavior, yet the MLP is outside the circuit.
If we only consider the end-to-end effect of this intervention, it looks perfectly reasonable: we do the patch, and the model behaves as if the intended feature was patched. However, it raises an alarming question: should we trust the component-level analysis (which tells us that the MLP layer doesn’t matter much) or the subspace-level analysis (which tells us that it somewhat matters)?
We suggest that we should answer this by performing a more thorough mechanistic analysis of model internals, as opposed to an end-to-end evaluation. In particular, the analysis above shows that patching the MLP8 direction activates a pathway that is dormant on the model’s usual distribution, takes MLP8’s output off-distribution along the position subspace, and that a comparable direction can be found even when MLP8’s weights are replaced by random ones.
We believe that these results suggest that the within-circuit direction is a more faithful explanation of model behavior than the direction found in the MLP layer.
Key Takeaways/Best Practices
More broadly, what are we to take away from this phenomenon?
Appendix
The importance of correct model units
There is a subtlety in our abstract example. Suppose we reparametrized our hidden layer so that instead of the standard basis $(e_1, e_2, e_3)$ we are using a "rotated" basis where one of the directions is $e_2 + e_3$, the other direction is orthogonal to it and to the vector $W_{out}$ (so it will be irrelevant), and the last direction is the one orthogonal to these two. You can work out that this basis is given by the (unnormalized) directions
$$e_2 + e_3, \qquad 2e_1 + e_2 - e_3, \qquad -e_1 + e_2 - e_3,$$

and in this case, the first direction ($e_2 + e_3$) both correlates with $x$ and has a nonzero output weight; the second direction correlates with $x$ but is orthogonal to $W_{out}$, so it is irrelevant; and the third direction is dormant, since the hidden activations $(x, x, 0)$ have no component along it.
So from this perspective, it looks like $e_2 + e_3$ is the correct direction used by the model, and $e_1$ is just a weird add-on! This raises an important question: on what grounds can we claim that the direction $e_1$ is "more meaningful" than $e_2 + e_3$?
The answer lies beyond the scope of the abstract example, and boils down to prior beliefs about which model units should be considered "meaningful". For example, when we realize this abstract example in the wild, the role of $e_1$ is (loosely speaking) played by a direction in the summed outputs of the S-Inhibition heads, whereas $e_2, e_3$ map to directions in the neuron activations of MLP8. We have prior reasons to believe that the S-Inhibition heads are crucial for the IOI circuit, whereas MLP8 is not part of the circuit. This gives us grounds to claim that $e_1$ is a preferable solution to $e_2 + e_3$.
More generally, multiple strands of evidence suggest that attention heads and MLP layers do different kinds of computations in models. This gives us some prior reason to analyze individual MLPs and attention heads as separate units. Conversely, if we could arbitrarily group model components into single “units”, then we could group the MLP8 layer with the S-Inhibition heads, and this would dramatically affect some of the analyses of the illusion. For example, patching along the MLP8 direction would no longer take the output of this “Frankensteinian” component off-distribution - one of our main reasons to consider this subspace “weird”!
This is a very concrete example of how our implicit assumptions about the units of analysis may affect the conclusions of interpretability experiments. We believe it’s always better to keep these assumptions explicitly in mind when doing interpretability work.
MLPs can recover arbitrary directions in the residual stream
For the interpretability illusion to work, there has to be a direction in the MLP that correlates with the task. A priori, we expect the input weight matrix $W_{in}$ of an MLP layer to be a (close to) full rank transform, which implies that "most of the information" in the residual stream will also be present in the neuron activations. In particular, it seems plausible (though certainly not logically necessary) that examples linearly separable in the residual stream remain linearly separable in the neuron activations.
Is this true in practice? To test this, we sample residual stream activations from the model run on a set of prompts, pick a random direction in the residual stream, and train a linear regression to predict the projection of the residual stream onto that direction from the corresponding MLP hidden activations, repeating this for many random directions and for each layer.
Results are shown in the graph below (every marker corresponds to one regression model using a different random direction):
The S-position subspace in the IOI task is a binary feature, so we are also interested in whether binary information in general is linearly recoverable from the MLP activations. To test this, we sample activations from the model run on randomly-sampled prompts. This time however, we add or subtract a multiple of a random direction to the residual stream and calculate the MLP activations based on this new vector:
$$\text{resid\_mid\_new} = \text{resid\_mid} + y \times z \times v \times \|\text{resid\_mid}\|,$$

where

- $y$: either -1 or 1
- $z$: a scaling factor
- $v$: a randomly-sampled direction with unit norm
For each classifier, we randomly sample a direction v that we either add or subtract (using y) from the residual stream. The classifier is trained to predict y. We rescale v to match the average norm of a residual vector and then scale it with a small scalar z.
Then, a logistic regression classifier is trained (with regularization parameter C=10000) on 1600 samples. Again, we repeat this experiment for different v, z, and for each layer. We observe that the classifier works quite well across layers even with very small values of z (though accuracy drops for z = 0.0001).
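A sketch of this probe using scikit-learn; the residual stream activations and the MLP input weights are random stand-ins here to keep the snippet self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_model, d_mlp = 2000, 768, 3072
z = 0.01                                              # small scaling factor

# Random stand-ins for residual stream activations and the MLP input weights.
resid_mid = rng.normal(size=(n, d_model))
W_in = rng.normal(size=(d_model, d_mlp)) / np.sqrt(d_model)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

# Add or subtract (y = +/-1) a scaled random unit direction v, matched to the residual norm.
v = rng.normal(size=d_model); v = v / np.linalg.norm(v)
y = rng.choice([-1, 1], size=n)
resid_mid_new = resid_mid + y[:, None] * z * np.linalg.norm(resid_mid, axis=1, keepdims=True) * v

# Train a logistic probe on the resulting MLP hidden activations to recover y.
hidden = gelu(resid_mid_new @ W_in)
clf = LogisticRegression(C=10000, max_iter=1000).fit(hidden[:1600], y[:1600])
print(clf.score(hidden[1600:], y[1600:]))
```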
DAS performance changes proportionally to the number of MLP neurons it is trained on
Here, we want to investigate how much the ability to find a pathological subspace depends on the overparameterization of the MLP layer. We train DAS on MLP8 again, but only allowing it to change a limited number of neurons.
We observe a linear relationship between the number of neurons used for training and performance of DAS (y axis measures the fraction of logit difference lost compared to patching the entire residual stream after layer 8 at the END token):
The mathematical example straightforwardly generalizes to higher-dimensional subspaces.
Care needs to be taken when choosing the two prompts, so that as much of the sentence structure as possible stays the same while the intended prediction differs. Our example has the nice property of controlling for all other features present in the original example (e.g. that this is a sentence about the location of a landmark, that it is in English, etc). Imagine patching into "The capital of Italy is" instead - there’s a subtle difference in what such a patching would show!
The coefficient 2 in $W_{out}$ is there for technical reasons: patching along $e_2 + e_3$ only moves half of the difference $x' - x$ onto the third neuron, and the output weight of 2 compensates for this factor so that the patched output is exactly $x'$.
We also show in the appendix that this effect really benefits from the fact that the hidden layer has more dimensions than the residual stream.
As a baseline, the cosine similarity between two random vectors in this space is ~0.036.
Another way this intuition could fail is if we have e.g. some layer containing two attention heads, one with a positive contribution to the logit difference, the other with a matching negative contribution. In other words, the heads are perfectly anticorrelated. If we patch the entire layer, the sum of the two heads will remain zero. However, if we patch only one of the heads, we may suddenly observe a downstream effect. This scenario seems less intuitively likely than our mathematical example, because it hinges on the heads (nearly) perfectly canceling each other.
However, we also note that this example is isomorphic to our mathematical example under a change of basis - though with the important difference that here the different components of the pathological subspace are in different model units, which gives us more reason to take this scenario seriously.
In practice, model computation tends to be “sparse” and most features are only written to by a handful of layers. So it is common that an MLP layer “could” write to some feature, which is present in the residual stream, but never does.