All of Aidan Ewart's Comments + Replies

Are you guys aware of the task arithmetic line of work (e.g. this paper and related works following it)? It seems extremely relevant/useful for this line of work (e.g. linear decomposition of the parameter space, some follow-up work ties in with NTK theory and identifies regimes where linearity might be more expected), but you guys don't appear to have cited it.

If you are aware and didn't cite it for another reason, fair enough!
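For readers unfamiliar with the task arithmetic idea: fine-tuned models are composed by adding weight-space difference vectors onto a shared base. A toy sketch of that operation (all names here are illustrative, not from the paper):

```python
import torch

def task_vector(base_state: dict, finetuned_state: dict) -> dict:
    """Weight-space difference between a fine-tuned model and its base."""
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def apply_task_vectors(base_state: dict, task_vectors: list, scale: float = 1.0) -> dict:
    """Add (scaled) task vectors back onto the base weights,
    relying on approximate linearity of the parameter space."""
    merged = {k: v.clone() for k, v in base_state.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] += scale * tv[k]
    return merged
```

The interesting empirical claim is that these vectors compose roughly linearly, which is the connection to weight-space linearity mentioned above.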

Lee Sharkey
We're aware of model diffing work like this, but I wasn't aware of this particular paper. It's probably an edge case: those papers do happen both to be in weight space and to be suggestive of weight-space linearity. Indeed, our work was informed by various observations from a range of areas that suggest weight-space linearity (some listed here). On the other hand, our work focused on decomposing a given network's parameters, whereas the line of work you linked seems more in pursuit of model editing and of understanding the difference between two similar models, rather than decomposing a particular model's weights. All in all, whether it deserved to be in the related work section is unclear to me; seems plausible either way. The related work section was already pretty long, but it maybe deserves a section on weight-space linearity, though probably not one on model diffing imo.

Cool post! I often find myself confused/unable to guess why people I don't know are excited about SAEs (there seem to be a few vaguely conflicting reasons), and this was a very clear description of your agenda.

I'm a little confused by this point:

> The reconstruction loss trains the SAE features to approximate what the network does, thus optimizing for mathematical description accuracy

It's not clear to me that framing reconstruction loss as 'approximating what the network does' is the correct framing of this loss. In my mind, the reconstruction loss...
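For concreteness, the objective under discussion is the standard SAE training loss: a reconstruction term plus a sparsity penalty. A minimal sketch of my understanding of it (class and variable names are illustrative, not from the post):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations (illustrative)."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(f)           # reconstruction of the input
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction term: how well the dictionary explains the activations.
    recon = (x - x_hat).pow(2).mean()
    # Sparsity term: pushes toward few active features per input.
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```

The question above is whether minimizing the reconstruction term is best described as "approximating what the network does", or merely as approximating the activations at one point in the network.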

Yep! We are planning to do exactly that for (at least) the models we focus on in the paper (Pythia-70m + Pythia-410m), and probably also GPT2 small. We are also working on cleaning up our codebase (https://github.com/HoagyC/sparse_coding) and implementing some easy dictionary training solutions.

Neel Nanda
Awesome! On the residual stream, or also on the MLP/attention outputs? I think both would be great if you have the resources; I expect there's a lot of interest in both and how they interact. (IMO the Anthropic paper training on MLP activations is equivalent to training it on the MLP layer output, just with 4x the parameters.)

Ideally, if you're doing it on attn_out you could instead do it on the mixed value (z in TransformerLens), which has the same dimensionality but makes it super clear which head the dictionary is looking at, and is robust to head superposition.

Having a clean codebase also seems like a really useful resource, esp if you implement some of the tricks from the Anthropic paper, like neuron resampling. Looking forward to it!
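On the neuron resampling trick mentioned above: the idea is to reinitialize dictionary features that never fire, pointing them at inputs the SAE currently handles poorly. A hedged sketch of one way this could look (my reading of the trick; the function, argument names, and reinitialization scheme are assumptions, not Anthropic's exact recipe):

```python
import torch

@torch.no_grad()
def resample_dead_features(encoder_weight: torch.Tensor,   # shape (d_dict, d_model)
                           decoder_weight: torch.Tensor,   # shape (d_model, d_dict)
                           feature_counts: torch.Tensor,   # fires per feature over a window
                           hard_examples: torch.Tensor,    # (n, d_model) poorly reconstructed inputs
                           threshold: int = 0) -> torch.Tensor:
    """Reinitialize features that fired at most `threshold` times,
    aiming them at badly reconstructed activations (illustrative)."""
    dead = (feature_counts <= threshold).nonzero(as_tuple=True)[0]
    for i, idx in enumerate(dead):
        x = hard_examples[i % len(hard_examples)]
        direction = x / x.norm()
        encoder_weight[idx] = direction      # encoder row for this feature
        decoder_weight[:, idx] = direction   # matching decoder column
    return dead
```

In practice one would also reset the optimizer state for the resampled rows, but the core move is just this weight surgery.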