Posts

Sorted by New

12Sparse Autoencoders: Future Work

63Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Wikitag Contributions

Comments

Sorted by

Newest

Attribution-based parameter decomposition

Aidan Ewart2mo40

Are you guys aware of the task arithmetic line of work (e.g. this paper and related works following it)? It seems extremely relevant/useful for this line of work (e.g. linear decomposition of the parameter space, some follow-up work ties in with NTK theory and identifies regimes where linearity might be more expected), but you guys don't appear to have cited it.

If you are aware and didn't cite it for another reason, fairs!

Sparsify: A mechanistic interpretability research agenda

Aidan Ewart1y20

Cool post! I often find myself confused/unable to guess why people I don't know are excited about SAEs (there seem to be a few vaguely conflicting reasons), and this was a very clear description of your agenda.

I'm a little confused by this point:

> The reconstruction loss trains the SAE features to approximate what the network does, thus optimizing for mathematical description accuracy

It's not clear to me that framing reconstruction loss as 'approximating what the network does' is the correct framing of this loss. In my mind, the reconstruction loss is more of a non-degeneracy control to encourage almost-orthogonality between features; In toy settings, SAEs are able to recover ground truth directions while still having sub-perfect reconstruction loss, and it seems very plausible that we should be able to use this (e.g. maybe through gradient-based attribution) without having to optimise heavily for reconstruction loss, which might degrade scalability (which seems very important for this agenda) and monosemanticity compared to currently-unexplored alternatives.

Sparse Autoencoders: Future Work

Aidan Ewart1y32

Yep! We are planning to do exactly that for (at least) the models we focus on in the paper (Pythia-70m + Pythia-410m), and probably also GPT2 small. We are also working on cleaning up our codebase (https://github.com/HoagyC/sparse_coding) and implementing some easy dictionary training solutions.