In the post "Taking features out of superposition with sparse autoencoders", Lee Sharkey, Beren Millidge and Dan Braun (formerly at Conjecture) presented a potential technique for removing superposition from neurons using sparse coding. The original post shows the technique working on simulated data but struggling on real models, while a recent update shows promise, at least for extremely small models.
We've now replicated the toy-model section of this post and are sharing the code on GitHub so that others can test and extend it, as the original code is proprietary.
Additional replications have also been done by Trenton Bricken at Anthropic and most recently Adam Shai.
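For readers who haven't seen the toy-model setup, here is a rough sketch of the kind of experiment involved: generate data as sparse combinations of more ground-truth feature directions than there are dimensions, then score a ReLU autoencoder with a reconstruction loss plus an L1 penalty on its hidden codes. This is not the code from our repository or from the original post. The function names, the tied encoder/decoder weights, the hyperparameters and the cosine-similarity score are all assumptions made for illustration, and no training loop is included.

```python
import numpy as np

def make_toy_data(n_samples, n_features, n_dims, p_active, rng):
    """Sparse combinations of random unit feature directions (n_features > n_dims)."""
    # Hypothetical ground-truth dictionary: one unit-norm direction per feature.
    features = rng.normal(size=(n_features, n_dims))
    features /= np.linalg.norm(features, axis=1, keepdims=True)
    # Each feature is active independently with probability p_active (assumed).
    mask = rng.random((n_samples, n_features)) < p_active
    coeffs = mask * rng.random((n_samples, n_features))
    return coeffs @ features, features

def sae_loss(x, W, b, l1_coef):
    """Reconstruction + L1 sparsity loss for a tied-weight ReLU autoencoder (assumed form)."""
    codes = np.maximum(x @ W.T + b, 0.0)      # encode: (n_samples, n_learned)
    recon = codes @ W                         # decode with the same weights
    mse = np.mean((x - recon) ** 2)
    l1 = np.mean(np.abs(codes).sum(axis=1))   # average L1 norm of the codes
    return mse + l1_coef * l1, codes

def mean_max_cosine_sim(learned, truth):
    """For each true feature, the best cosine similarity among learned rows, averaged."""
    learned = learned / np.linalg.norm(learned, axis=1, keepdims=True)
    truth = truth / np.linalg.norm(truth, axis=1, keepdims=True)
    return float(np.mean(np.max(truth @ learned.T, axis=1)))

rng = np.random.default_rng(0)
x, true_feats = make_toy_data(1000, 64, 16, 0.05, rng)  # 64 features in 16 dims
W = rng.normal(size=(128, 16)) * 0.1                    # untrained dictionary
b = np.zeros(128)
loss, codes = sae_loss(x, W, b, l1_coef=1e-3)
score = mean_max_cosine_sim(W, true_feats)              # low before any training
```

In a real run, `W` and `b` would be optimised against `sae_loss` (for example with an autodiff library), and the similarity score tracked to see whether the learned dictionary rows line up with the ground-truth features.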
Thanks to Lee Sharkey for answering questions and Pierre Peigne for some of the data-generating code.
We hope to expand on this in the coming days/weeks by:
If you're interested in working on similar things, let us know! We'll be working on these directions in the lead-up to SERI MATS.