This is a linkpost for https://www.goodfire.ai/research/interpreting-lm-parameters
This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD),[1] and use it to decompose the parameters of a small[2] language model.
VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more or less ready to be applied at scale to models people care about.
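To give a concrete picture of what a parameter decomposition looks like, here is a minimal PyTorch sketch of a decomposed linear layer: the weight matrix is represented as a sum of rank-one subcomponents, each gated by a learned per-input causal-importance estimate. All names and the gate parameterisation are illustrative assumptions, not the actual VPD or SPD implementation.

```python
import torch
import torch.nn as nn

class DecomposedLinear(nn.Module):
    """Toy parameter-decomposed linear layer: W = sum_c u_c v_c^T,
    with a learned gate estimating each subcomponent's causal
    importance on the current input. Illustrative sketch only."""

    def __init__(self, d_in: int, d_out: int, n_subcomponents: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n_subcomponents, d_out) * 0.02)
        self.V = nn.Parameter(torch.randn(n_subcomponents, d_in) * 0.02)
        # Hypothetical gate; the real causal-importance function differs.
        self.gate = nn.Linear(d_in, n_subcomponents)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))   # (..., C) per-input importances
        pre = x @ self.V.T                # (..., C) rank-one read-offs
        return (g * pre) @ self.U         # gated sum of subcomponent outputs

    def reconstructed_weight(self) -> torch.Tensor:
        # Faithfulness target: the subcomponents should sum to the original W.
        return self.U.T @ self.V          # (d_out, d_in)
```

Training such a decomposition then trades off faithfulness (the summed subcomponents reproduce the original weights) against minimality (few subcomponents are important on any given input).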
Importantly, we show that we can decompose attention layers, which interp methods like transcoders and SAEs have historically struggled with.
We also build attribution graphs of the model on some prompts, using causally important parameter subcomponents as the nodes, and interpret parts of these graphs.
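As a rough illustration of the graph-building step (not the paper's actual pipeline), the sketch below wires causally important subcomponents into a directed graph, with edges weighted by an assumed pairwise attribution function:

```python
import networkx as nx

def build_attribution_graph(importances, attribution_fn, threshold=0.1):
    """Sketch: nodes are (layer, subcomponent) pairs whose causal importance
    on a given prompt exceeds a threshold; edges carry attribution scores
    between subcomponents in adjacent layers. `importances` maps
    (layer, idx) -> scalar importance; `attribution_fn(src, dst)` is an
    assumed callable (e.g. gradient-times-activation), not the paper's
    exact method."""
    graph = nx.DiGraph()
    nodes = [node for node, imp in importances.items() if imp > threshold]
    graph.add_nodes_from(nodes)
    for src in nodes:
        for dst in nodes:
            if dst[0] == src[0] + 1:  # connect adjacent layers only
                score = attribution_fn(src, dst)
                if abs(score) > threshold:
                    graph.add_edge(src, dst, weight=score)
    return graph
```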
While making these graphs, we discovered that our adversarial ablation method seemed pretty important for faithfully identifying which nodes in them were causally important for computing the final output. We think this casts some doubt on the faithfulness of subnetworks found by the majority of other subnetwork identification methods in the literature.[3][4] More details and some examples can be found in the paper.
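For intuition, here is a schematic of what an adversarial ablation check could look like: rather than ablating the "unimportant" subcomponents stochastically or all at once, an adversary optimises their masks to do as much damage to the output as possible. The `model_fn(x, mask)` interface and the KL objective are assumptions for illustration, not the loss used in the paper.

```python
import torch
import torch.nn.functional as F

def adversarial_ablation_check(model_fn, x, keep_mask, n_steps=100, lr=0.05):
    """Sketch: `keep_mask` marks subcomponents claimed to be causally
    important (1 = keep). An adversary optimises the remaining masks to
    maximise KL divergence from the full model's output; if even the
    worst-case ablation barely moves the output, the kept set is
    plausibly faithful. `model_fn(x, mask)` runs the model with each
    subcomponent scaled by its mask entry (assumed interface)."""
    full_probs = model_fn(x, torch.ones_like(keep_mask)).softmax(-1).detach()
    adv = torch.zeros_like(keep_mask, requires_grad=True)
    opt = torch.optim.Adam([adv], lr=lr)
    for _ in range(n_steps):
        # Kept components stay at 1; the adversary controls the rest in (0, 1).
        mask = keep_mask + (1 - keep_mask) * torch.sigmoid(adv)
        logits = model_fn(x, mask)
        kl = F.kl_div(logits.log_softmax(-1), full_probs, reduction="batchmean")
        opt.zero_grad()
        (-kl).backward()  # gradient ascent on the divergence
        opt.step()
    with torch.no_grad():
        mask = keep_mask + (1 - keep_mask) * torch.sigmoid(adv)
        logits = model_fn(x, mask)
        kl = F.kl_div(logits.log_softmax(-1), full_probs, reduction="batchmean")
    return kl.item()  # worst-case divergence the adversary found
```

A stochastic ablation scheme corresponds to sampling `mask` at random rather than optimising it, which can miss the worst case; that gap is what the adversarial check is meant to close.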
Additionally, as with our previous technique SPD, VPD does not seem to suffer from (the parameter-space analog of) 'feature splitting', either in principle or in practice.[5]
We also do a bunch of other comparisons to per-layer transcoders and CLTs and find that our approach compares somewhat favourably (details in post).
1. APD was taken, but the "adversarial" part really did seem like the most important highlight, so we resorted to desperate measures.
2. Ca. 67 million parameters in total; the decomposition was only applied to the 28 million non-embedding parameters. Trained on the Pile.
3. These methods often use stochastic ablation or resampling schemes with no adversary, or even simply try to find the largest set of nodes they can ablate without changing the network output.
4. Related post and comment.
5. The SPD paper showed something sort of like feature splitting in Figure 8, but that was for making the importance minimality loss coefficient larger rather than making the dictionary larger. It also turned out to be caused by a math bug.[6]
6. The lower-leaky sigmoid function appears to have been the problem. Turns out negative causal importances are a bad idea.