Undergraduate student studying Mathematics @ University of Bristol.
Interested in & pursuing a career in technical AI safety.
Yep! We are planning to do exactly that for (at least) the models we focus on in the paper (Pythia-70m and Pythia-410m), and probably also GPT-2 small. We are also working on cleaning up our codebase (https://github.com/HoagyC/sparse_coding) and implementing some easy-to-use dictionary training options.
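For anyone who wants a feel for what that dictionary training involves, here's a minimal sketch: a sparse autoencoder with an L1 penalty trained on cached model activations. This is illustrative only, not the codebase's actual API; the class name, function, and hyperparameters below are all assumptions.

```python
# Minimal sketch of SAE dictionary training (illustrative, not the repo's API).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative and sparse.
        f = torch.relu(self.encoder(x))
        return self.decoder(f), f

def train_step(sae, opt, acts, l1_coeff=1e-3):
    recon, f = sae(acts)
    # Reconstruction term pulls the dictionary toward spanning the activations;
    # the L1 term pushes each activation to use few features.
    loss = (recon - acts).pow(2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```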
Cool post! I often find myself confused about (or unable to guess) why people I don't know are excited about SAEs (there seem to be a few vaguely conflicting reasons), and this was a very clear description of your agenda.
I'm a little confused by this point:
> The reconstruction loss trains the SAE features to approximate what the network does, thus optimizing for mathematical description accuracy
It's not clear to me that 'approximating what the network does' is the right framing for the reconstruction loss. In my mind, it acts more as a non-degeneracy control that encourages near-orthogonality between features: in toy settings, SAEs can recover the ground-truth directions while still having sub-perfect reconstruction loss. It seems very plausible that we could exploit those recovered directions (e.g. via gradient-based attribution) without optimising heavily for reconstruction loss, since heavy optimisation for it might degrade scalability (which seems very important for this agenda) and monosemanticity compared to currently-unexplored alternatives.
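To make the toy-setting claim concrete, here's a minimal sketch of the kind of check I have in mind: generate activations as sparse combinations of known ground-truth directions, then measure mean max cosine similarity (MMCS) between a trained SAE's decoder directions and the ground truth. The setup, names, and numbers are illustrative assumptions, not anything from the post.

```python
# Toy sketch: ground-truth recovery can be high even with imperfect reconstruction.
import torch

def mmcs(learned: torch.Tensor, ground_truth: torch.Tensor) -> float:
    """Mean max cosine similarity between learned and ground-truth directions."""
    learned = learned / learned.norm(dim=-1, keepdim=True)
    ground_truth = ground_truth / ground_truth.norm(dim=-1, keepdim=True)
    cos = ground_truth @ learned.T            # (n_true, n_learned)
    return cos.max(dim=-1).values.mean().item()

# Synthetic setup: 64-dim space, 256 sparse ground-truth features.
d_model, n_feats, n_samples = 64, 256, 10_000
true_dirs = torch.randn(n_feats, d_model)
true_dirs /= true_dirs.norm(dim=-1, keepdim=True)
# Each sample activates ~2% of features with non-negative coefficients.
coeffs = torch.relu(torch.randn(n_samples, n_feats)) * (torch.rand(n_samples, n_feats) < 0.02)
acts = coeffs @ true_dirs                     # sparse combinations of features

# After training an SAE on `acts` (as in the sketch above), a high
# mmcs(sae.decoder.weight.T, true_dirs) would indicate the ground-truth
# directions were recovered even if reconstruction loss is not ~0.
```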