This is a linkpost for https://www.goodfire.ai/research/interpreting-lm-parameters
This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD),[1] and use it to decompose the parameters of a small[2] language model.
VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more or less ready to be applied at scale to models people care about.
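To give a concrete picture of what a parameter decomposition looks like, here is a minimal PyTorch sketch of a decomposed linear layer: the weight matrix is represented as a sum of rank-one subcomponents, each gated by a learned per-input causal-importance estimate. All names and the gate parameterisation are illustrative assumptions, not the actual VPD or SPD implementation.

```python
import torch
import torch.nn as nn

class DecomposedLinear(nn.Module):
    """Toy parameter-decomposed linear layer: W = sum_c u_c v_c^T,
    with a learned gate estimating each subcomponent's causal
    importance on the current input. Illustrative sketch only."""

    def __init__(self, d_in: int, d_out: int, n_subcomponents: int):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n_subcomponents, d_out) * 0.02)
        self.V = nn.Parameter(torch.randn(n_subcomponents, d_in) * 0.02)
        # Hypothetical gate; the real causal-importance function differs.
        self.gate = nn.Linear(d_in, n_subcomponents)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))   # (..., C) per-input importances
        pre = x @ self.V.T                # (..., C) rank-one read-offs
        return (g * pre) @ self.U         # gated sum of subcomponent outputs

    def reconstructed_weight(self) -> torch.Tensor:
        # Faithfulness target: the subcomponents should sum to the original W.
        return self.U.T @ self.V          # (d_out, d_in)
```

Training such a decomposition then trades off faithfulness (the summed subcomponents reproduce the original weights) against minimality (few subcomponents are important on any given input).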
Importantly, we show that we can decompose attention layers, which interp methods like transcoders and SAEs have historically struggled with.
We also build attribution graphs of the model on some prompts, using causally important parameter subcomponents as the nodes, and interpret parts of these graphs.
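As a rough illustration of the graph-building step (not the paper's actual pipeline), the sketch below wires causally important subcomponents into a directed graph, with edges weighted by an assumed pairwise attribution function:

```python
import networkx as nx

def build_attribution_graph(importances, attribution_fn, threshold=0.1):
    """Sketch: nodes are (layer, subcomponent) pairs whose causal importance
    on a given prompt exceeds a threshold; edges carry attribution scores
    between subcomponents in adjacent layers. `importances` maps
    (layer, idx) -> scalar importance; `attribution_fn(src, dst)` is an
    assumed callable (e.g. gradient-times-activation), not the paper's
    exact method."""
    graph = nx.DiGraph()
    nodes = [node for node, imp in importances.items() if imp > threshold]
    graph.add_nodes_from(nodes)
    for src in nodes:
        for dst in nodes:
            if dst[0] == src[0] + 1:  # connect adjacent layers only
                score = attribution_fn(src, dst)
                if abs(score) > threshold:
                    graph.add_edge(src, dst, weight=score)
    return graph
```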
While making these graphs, we discovered that our adversarial ablation method seemed pretty important for faithfully identifying which nodes in them were causally important for computing the final output. We think this casts some doubt on the faithfulness of subnetworks found by the majority of other subnetwork identification methods in the literature.[3][4] More details and some examples can be found in the paper.
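For intuition, here is a schematic of what an adversarial ablation check could look like: rather than ablating the "unimportant" subcomponents stochastically or all at once, an adversary optimises their masks to do as much damage to the output as possible. The `model_fn(x, mask)` interface and the KL objective are assumptions for illustration, not the loss used in the paper.

```python
import torch
import torch.nn.functional as F

def adversarial_ablation_check(model_fn, x, keep_mask, n_steps=100, lr=0.05):
    """Sketch: `keep_mask` marks subcomponents claimed to be causally
    important (1 = keep). An adversary optimises the remaining masks to
    maximise KL divergence from the full model's output; if even the
    worst-case ablation barely moves the output, the kept set is
    plausibly faithful. `model_fn(x, mask)` runs the model with each
    subcomponent scaled by its mask entry (assumed interface)."""
    full_probs = model_fn(x, torch.ones_like(keep_mask)).softmax(-1).detach()
    adv = torch.zeros_like(keep_mask, requires_grad=True)
    opt = torch.optim.Adam([adv], lr=lr)
    for _ in range(n_steps):
        # Kept components stay at 1; the adversary controls the rest in (0, 1).
        mask = keep_mask + (1 - keep_mask) * torch.sigmoid(adv)
        logits = model_fn(x, mask)
        kl = F.kl_div(logits.log_softmax(-1), full_probs, reduction="batchmean")
        opt.zero_grad()
        (-kl).backward()  # gradient ascent on the divergence
        opt.step()
    with torch.no_grad():
        mask = keep_mask + (1 - keep_mask) * torch.sigmoid(adv)
        logits = model_fn(x, mask)
        kl = F.kl_div(logits.log_softmax(-1), full_probs, reduction="batchmean")
    return kl.item()  # worst-case divergence the adversary found
```

A stochastic ablation scheme corresponds to sampling `mask` at random rather than optimising it, which can miss the worst case; that gap is what the adversarial check is meant to close.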
Additionally, as with our previous technique SPD, VPD does not seem to suffer from (the parameter-space analog of) 'feature splitting', either in principle or in practice.[5]
We also do a bunch of other comparisons to per-layer transcoders and CLTs and find that our approach compares somewhat favourably (details in post).
1. APD was taken, but the "adversarial" part really did seem like the most important highlight, so we resorted to desperate measures.
2. Ca. 67 million parameters in total; the decomposition was only applied to the 28 million non-embedding parameters. Trained on the Pile.
3. These methods often use stochastic ablation or resampling schemes with no adversary, or even simply try to find the largest set of nodes they can ablate without changing the network output.
4. Related post and comment.
5. The SPD paper showed something sort of like feature splitting in Figure 8, but that was for making the importance minimality loss coefficient larger rather than making the dictionary larger. It also turned out to be caused by a math bug.[6]
6. The lower-leaky sigmoid function appears to have been the problem. Turns out negative causal importances are a bad idea.