Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.
Can confirm, that list is SO out of date and does not represent the current frontiers. Zero offence taken. Thanks for publishing this list!
Some takes on some of these research questions:
Looking for opposing feature directions in SAEs
I checked a top-k SAE with 256k features and k=256 trained on GPT-4 activations, and found only 286 features that had any other feature with cosine similarity < -0.9, and 1314 with cosine similarity < -0.7.
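A minimal sketch of this kind of check, using a small random decoder in place of a real SAE (the sizes and the planted anti-aligned pair are illustrative, not from the experiment above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an SAE decoder: n_features unit-norm direction rows.
# (The comment above used a 256k-feature top-k SAE; sizes here are tiny.)
n_features, d_model = 1024, 64
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)
W_dec[1] = -W_dec[0]  # plant one anti-aligned pair so the check finds something

# Pairwise cosine similarities between decoder directions.
cos = W_dec @ W_dec.T
np.fill_diagonal(cos, 0.0)  # ignore self-similarity

# A feature is "opposed" if some other feature points nearly opposite to it.
min_cos = cos.min(axis=1)
n_strongly_opposed = int((min_cos < -0.9).sum())
n_weakly_opposed = int((min_cos < -0.7).sum())
print(n_strongly_opposed, n_weakly_opposed)
```

On a real SAE one would substitute the trained decoder matrix for `W_dec`; for very wide SAEs the pairwise matrix should be computed in chunks rather than all at once.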
SAE/Transcoder activation shuffling
I'm confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which from preliminary experiments+intuition I'd guess is probably substantial.
How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?
It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.
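A minimal numpy sketch of the transpose initialization, assuming a plain ReLU SAE with weights that are untied after init (all names and sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 512

# Initialize the decoder with unit-norm columns (one direction per latent),
# then set the encoder to its transpose. The weights are only tied at init
# and train independently afterwards.
W_dec = rng.normal(size=(d_model, n_features))
W_dec /= np.linalg.norm(W_dec, axis=0, keepdims=True)
W_enc = W_dec.T.copy()
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

# At init, each latent's pre-activation is the projection of the input onto
# that latent's own decoder direction, so every latent can fire on some
# input -- the intuition for why this reduces dead latents.
x = rng.normal(size=d_model)
acts = np.maximum(W_enc @ (x - b_dec) + b_enc, 0.0)  # ReLU SAE encoder
recon = W_dec @ acts + b_dec
```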
Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation.
@Lucius Bushnaq, why would MLPs compute linear transformations?
Because two linear transformations can be combined into a single linear transformation, why wouldn't the downstream MLPs/attention layers that rely on this linearly transformed vector just learn the combined function instead?
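The composition argument here can be checked directly: applying one linear map after another is exactly one combined linear map, so a downstream layer could absorb the upstream linear step. A small sketch with arbitrary matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))  # hypothetical linear map an upstream MLP "computes"
B = rng.normal(size=(3, 4))  # linear read-off a downstream layer applies

x = rng.normal(size=5)
# B applied after A is identical to the single combined map B @ A, so the
# upstream linear step carries no extra expressive power on its own.
assert np.allclose(B @ (A @ x), (B @ A) @ x)
```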
[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
- In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?
This is very interesting! What prior does log(1+|a|) correspond to? And what about using instead of ? Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?
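For reference, the penalty-to-prior correspondence mentioned above comes from reading the sparsity term as a negative log-prior in a MAP objective; a sketch (the log(1+|a|) line is my own extrapolation, not from the comment):

```latex
% Minimizing  \|x - \hat{x}(a)\|^2 + \lambda S(a)  over activations a is
% MAP estimation with a Gaussian likelihood and prior
%     p(a) \propto \exp(-\lambda S(a)).
%
% S(a) = |a|            =>  p(a) \propto e^{-\lambda |a|}          (Laplace)
% S(a) = \log(1 + a^2)  =>  p(a) \propto (1 + a^2)^{-\lambda}      (Cauchy at \lambda = 1)
% S(a) = \log(1 + |a|)  =>  p(a) \propto (1 + |a|)^{-\lambda}      (Lomax-like, heavy-tailed)
```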
[Nix] Toy model of feature splitting
- There are at least two explanations for feature splitting I find plausible:
- Activations exist in higher dimensional manifolds in feature space, feature splitting is a symptom of one higher dimensional mostly-continuous feature being chunked into discrete features at different resolutions.
- There is a finite number of highly-related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but are better summarized as a collection of “split” features.
These do not sound like different explanations to me. In particular, the distinction between "mostly-continuous but approximated as discrete" and "discrete but very similar" seems ill-formed. All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won't change the behavior of the network meaningfully).
As far as toy models go, I'm pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model. If you train SAEs post-unembed (probably also pre-unembed) with width d_vocab, you should find one feature for each sequence maximum (roughly). If you train with SAE width , I expect each feature to split into roughly features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token. (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.). I'm quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.
Edit: I expect this toy model will also permit exploring:
[Lee] Is there structure in feature splitting?
- Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions, are there directions that explain the structure of the splitting? For instance, suppose we have a feature for math and a feature for physics, and suppose these split into (among other things)
- 'topology in a math context'
- 'topology in a physics context'
- 'high dimensions in a math context'
- 'high dimensions in a physics context'
- Is the topology-ifying direction the same for both features? Is the high-dimensionifying direction the same for both features? And if so, why did/didn't the original SAEs find these directions?
I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is. Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely. Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.
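A toy accounting of this sparsity-versus-capacity trade-off, with made-up topics and domains (a sketch of the argument, not an SAE experiment):

```python
from itertools import product

# Hypothetical setup: each input is a (topic, domain) pair.
topics = ["topology", "high-dims", "group-theory"]
domains = ["math", "physics", "CS"]
inputs = list(product(topics, domains))  # 9 input types

# Compositional dictionary: one feature per topic plus one per domain.
comp_features = len(topics) + len(domains)  # 6 dictionary elements
comp_l0 = 2                                 # 2 active features per input

# Split dictionary: one feature per (topic, domain) combination.
split_features = len(inputs)                # 9 dictionary elements
split_l0 = 1                                # 1 active feature per input

# Splitting halves L0 but costs 50% more capacity here; the capacity cost
# grows multiplicatively with more topics/domains while the L0 gain stays
# fixed, so SAE width and the sparsity penalty decide which wins.
print(comp_features, comp_l0, split_features, split_l0)
```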
Why we made this list:
We thought it would be helpful to share our list of project ideas!
Comments and caveats:
We hope some people find this list helpful!
We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.
Foundational work on sparse dictionary learning for interpretability
Transcoder-related project ideas
Other
Applied interpretability
Intrinsic interpretability
Understanding features (not SDL)
Theoretical foundations for interpretability
Singular-learning-theory-related
Other
Meta-research and philosophy
Engineering
Papers from our first project here and here and from our second project here.