I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.
However, I think there's a likely to be another 'phase' that they don't discuss (possibly it didn't crop up in their small models, since it's only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit norm vectors are almost invariably nearly orthogonal. So gradient descent would have very little work to do to find a very large number of vectors (much larger than the number of dimensions) that are all mutually almost-orthogonal, so that show very little interference between them. This is basically the limiting case of the pattern observed in the paper of packing n features in superposition into d dimensions where n > d >= 1, taking this towards the limit where n >> d >> 1.
Intuitively this phase seems particularly likely in a context like the residual stream of an LLM, where (in theory, if not in practice) the embedding space is invariant under arbitrary rotations, so there's no obvious reason to expect vectors to align with the coordinate axes. On the other hand, in a system where there was a preferred basis (such as a system using L1 regularization), you might get such vectors that were themselves sparse, with most components zero but a significant number of non-zero components, enough for the randomness to still give low interference.
More speculatively, in a neural net that was using at least some of its neurons in this high-dimensionality dense superposition phase, the model will presumably learn ways to manipulate these vectors to do computation in superposition. One possibility for this might be methods comparable to some of the possible Vector Symbolic Architectures (also known as hyperdimensional computing) outlined in e.g. https://arxiv.org/pdf/2106.05268.pdf. Of the primitives used in that, a fully connected layer can clearly be trained to implement both addition of vectors and permutations of their elements, I suspect something functionally comparable to the vector elementwise-multiplication (Hadamard product) operation could be produced by using the nonlinearity of a smooth activation function such as GELU or Swish, and I suspect their their clean-up memory operation could be implemented using attention. If it turned out to be the case that SGD actually often finds solutions of this form, then an understanding of vector symbolic architectures might be helpful for interpretability of models where portions of them used this phase. This seems most likely in models that need to pack vast numbers of features into large numbers of dimensions, such as modern large LLMs.
I think this really incentivizes things like network dissection over "interpret this neuron" approaches.
A new Anthropic interpretability paper—“Toy Models of Superpostion”—came out last week that I think is quite exciting and hasn't been discussed here yet.
Twitter thread from Anthropic
Twitter thread from Chris
Introduction:
And the beginning of their section on the overall strategic picture, which echoes some of my sentiment from “A transparency and interpretability tech tree”, including the importance of universal quantification (what I called “worst-case transparency”) and unknown unknowns: