Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda

* = equal contribution

The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders (SAEs) were useful for downstream tasks, notably out-of-distribution probing.

TL;DR

Introduction

Motivation

Our core motivation was that we, along with much of the interpretability community, had invested a lot of our energy into Sparse Autoencoder (SAE) research. But SAEs lack a ground truth of the “true” features in language models to compare to, making it pretty unclear how well they work. There is qualitative evidence that SAEs are clearly doing something: they find far more structure than you would expect by random chance. But they clearly have a bunch of issues: if you just type an arbitrary sentence into Neuronpedia and look at the latents that light up, they do not seem to perfectly correspond to a crisp explanation.

More generally, when thinking about whether we should prioritise working on SAEs, it's worth thinking about how to decide what kind of interpretability research to do in general. One perspective is to assume there is some crisp, underlying, human-comprehensible truth for what is going on in the model, and to try to build techniques to reverse engineer it. In the case of SAEs, this looks like the hope that SAE latents capture some canonical set of true concepts inside the model. We think it is clear now that SAEs in their current form are far from achieving this, and it is unclear to us if such “true concepts” even exist. There are several flaws with SAEs that prevent them from capturing a true set of concepts even if one exists, and we are pessimistic that these can all be resolved.

But there are other high-level goals for interpretability than perfectly finding the objective truth of what’s going on - if we can build tools that give understanding that’s imperfect but enough to let us do useful things, like understand whether a model is faking alignment, that is still very worthwhile. Several important goals, like trying to debug mysterious failures and phenomena, achieving a better understanding of what goals and deception look like, or trying to detect deceptive alignment, do not necessarily require us to discover all of the ‘true features’ of a model; a decent approximation to the model’s computation might well be good enough. But how could we tell if working on SAEs was bringing us closer to these goals?

Our hypothesis was that if SAEs will eventually be useful for these ambitious tasks, they should enable us to do something new today. So, the goal of this project was to investigate whether we can do anything useful on downstream tasks with SAEs in a way that was at all competitive with baselines - i.e. on a task that can be described without making any reference to interpretability. If SAEs are working well enough to be a valuable tool, then there should be things they enable us to do that we cannot currently easily do. And so we thought that if we picked some likely examples of such tasks and made a fair comparison to well-implemented baselines, then if the SAE does well (ideally beating the baseline, but even just coming close while being non-trivially different), this is a sign that the SAE is a valuable technique worthy of further refinement. Further, even if the SAE doesn’t succeed, this gives you an eval to measure future SAE progress, like how Farrell et al’s unlearning setup was turned into an eval in SAEBench.

Our Task

So what task did we focus on? Our key criteria were that it be objectively measurable, that it be something other people cared about and, within those constraints, that it be something where we thought SAEs might have an edge. As such, we focused on training probes that generalise well out of distribution. We thought that, for sufficiently good SAEs, a sparse probe in SAE latents would be less likely to overfit to minor spurious correlations compared to a dense probe, and thus that being interpretable gave a valuable inductive bias (though we are now less confident in this argument). We specifically looked at detecting harmful user intent in the presence of different jailbreaks, and used new jailbreaks as our OOD set.

Sadly, our core results are negative:

  1. Dense linear probes perform nearly perfectly, including out of distribution.
  2. 1-sparse SAE probes (i.e. using a single SAE latent as a probe) are much worse, failing to fit the training set.
  3. k-sparse SAE probes can fit the training set for moderate k (approximately k=20), and successfully generalise to an in-distribution test set, but show distinctly worse performance on the OOD set.
  4. Finetuning SAEs on specialised chat data helps, but only closes about half the gap to dense linear probes.
  5. Linear probes trained only on the SAE reconstruction are also significantly worse OOD than linear probes on the residual stream, suggesting that SAEs are discarding information relevant to the target concept.

We did have one positive result: the sparse SAE probes enabled us to quickly identify spurious correlations in our dataset, which we cleaned up. Note this slightly stacks the deck against SAEs, since without SAE-based debugging, the linear probes may have latched onto these spurious correlations – however, we think we plausibly could have found the spurious correlations without SAEs given more time, e.g. Kantamneni et al showed simpler methods could be similarly effective to SAEs here.

We were surprised by SAEs underperforming linear probes, but also by how well linear probes did in absolute terms, on the complex-seeming task of detecting harmful intent. We expect there are many practical ways linear probes could be used today to do cheap monitoring for unsafe behaviour in frontier models.

Conclusions and Strategic Updates

Our overall update from this project and parallel external work is to be less excited about research focused on understanding and improving SAEs and, at least for the short term, to explore other research areas.

The core update we made is that SAEs are unlikely to be a magic bullet: the hope that with a little extra work they can just make models super interpretable and easy to play with doesn’t seem like it will pay off.

The key update we’ve made from our probing results is that current SAEs do not find the ‘concepts’ required to be useful on an important task (detecting harmful intent), but a linear probe can find a useful direction. This may be because the model doesn’t represent harmful intent as a fundamental concept and the SAE is working as intended while the probe captures a mix of tons of concepts, or because the concept is present but the SAE is bad at learning it, or any number of hypotheses. But whatever the reason, it is evidence against SAEs being the right tool for things we want to do in practice.

We consider our probing results disheartening but not enough to pivot on their own. But there have been several other parallel projects in the literature, such as Kantamneni et al., Farrell et al., and Wu et al., that found negative results on other forms of probing, unlearning and steering, respectively. And the few positive applications with clear comparisons to baselines, like Karvonen et al, largely occur in somewhat niche or contrived settings (e.g. using fairly simple concepts like “is a regex” that SAEs likely find easy to capture), though there are some signs of life such as unlearning in diffusion models, potential usefulness in auditing models, and hypothesis generation about labelled text datasets.

We find the comparative lack of positive results here concerning - no individual negative result is a strong update, since it’s not yet clear which tasks are best suited to SAEs, but if current SAEs really are a big step forwards for interpretability, it should not be so hard to find compelling scenarios where they beat baselines. This, combined with the general messiness and issues surfaced by these attempts, and other issues such as poor feature sensitivity, suggests to us that SAEs and SAE-based techniques (transcoders, crosscoders, etc) are not likely to be a gamechanger any time soon and plausibly never will be - we hope to write up our thoughts on this topic in more detail soon. We think that the research community’s large investment into SAEs was most justified under the hope that SAEs could be incredibly transformative for all of the other things we want to do with interpretability. Now that this seems less likely, we speculate that the interpretability community is somewhat over-invested in SAEs.

To clarify, we are not committing to giving up on SAEs, and this is not a statement that we think SAEs are useless and that no one should work on them. We are pessimistic about them being a game changer across the board in their current form, but we predict that there are still some situations where they are able to be useful. We are particularly excited about their potential for exploratory debugging of mysterious failures or phenomena in models, as in Marks et al, and believe they are worthwhile to keep around in a practitioner's toolkit. For example, we found them useful for detecting and debugging spurious correlations in our datasets. More importantly, it’s extremely hard to distinguish between fundamental issues and fixable issues, so it’s hard to make any confident statements about what flaws will remain in future SAEs.

As such, we believe that future SAE work is valuable, but should focus much less on hill-climbing on sparsity-reconstruction trade-offs, and instead focus on better understanding the fundamental limitations of SAEs (especially those that hold them back on downstream tasks) and discovering new limitations; learning how to evaluate and measure these limitations; and learning how to address them, whether by incremental or fundamental improvements. One recent optimistic sign was Matryoshka SAEs, a fairly incremental change to the SAE loss that seems to have made substantial strides on feature absorption and feature composition. We think a great form of project is one that takes a known issue with SAEs, thinks about why it happens and what changes could fix it, and then verifies that the issue has improved. If researchers have an approach they think is promising that could make substantial progress on an issue with SAEs, we would be excited to see that pursued.

There are also other valuable projects. For example, are there much cheaper ways to train SAEs of acceptable quality, or to get similar effects with other feature clustering or dictionary learning methods instead? If we’re taking a pragmatic approach to SAEs, rather than the ambitious approach of trying to find the canonical units of analysis, then sacrificing some quality in return for lowering the major up-front cost of SAE training may be worthwhile.

We could imagine coming back to SAE research if we thought we had a particularly important and tractable research direction, or if there is significant progress on some of their core issues. And we still believe that interpreting the concepts in LLM activations is a crucial problem that we would be excited to see progress on. But for the moment we intend to explore some other research directions, such as model diffing, interpreting model organisms of deception, and trying to interpret thinking models.


Comparing different ways to train Chat SAEs

Lewis Smith, Arthur Conmy, Callum McDougall

In our original GemmaScope release, we released chat model SAEs, which were trained on activations from the chat-tuned model, but on pretraining data (formatted as user prompts) rather than on chat rollouts from the model. After getting probing results we thought were disappointing (discussed in the following sections), we decided that this approach was possibly sub-optimal and wanted to make sure we were performing a fair comparison. We hypothesised that the probing task was fairly chat-specific, and that the lack of chat rollouts in the training data might have meant the SAEs failed to capture chat-specific features.

In order to investigate this, we experimented with two approaches:

  • Retraining the SAEs from scratch, exclusively using chat data.
  • Finetuning the existing GemmaScope SAEs on chat data.

Both of these approaches led to improvements on our probing benchmarks relative to the baseline GemmaScope SAEs, matching Kissane et al, as discussed in the section below. However, neither is sufficient to match the performance of a dense probe in our setting. As finetuning is the easiest method, and we generally did not find significant differences on our probe training task between finetuning and training from scratch, we used finetuning for the results described in subsequent sections.

In addition, we experimented with a few variations of the finetuning procedure:

  • Starting finetuning from either the pretraining-data or instruction-data GemmaScope SAEs.
  • Starting the finetuning from a checkpoint before the end of training, so that the total number of training steps was the same between the finetuned SAE and the IT-trained SAE.
  • Latent resampling, where, when finetuning, we randomly re-initialise a proportion of the latents in the SAE, under the hypothesis that normally all of the SAE’s latents are “already taken” and can’t easily adapt to learn new chat-specific features.

Using probing metrics we didn’t find much systematic difference between any of these approaches; provided we trained on rollouts in some form, there did not appear to be much advantage to including them in the pretraining mix vs finetuning on the target data.

When we looked at auto-interpretability metrics, we found that finetuning from the GemmaScope SAEs trained on the PT data performed slightly better than from our original ‘IT’ SAEs (which included the IT formatting but not rollouts) provided we didn’t perform latent resampling. Note that we use frequency-weighted autointerp scores here, so latents which fire more commonly in chat data specifically will be given more weight in the final value (for more on this, see the section Autointerp and high frequency latents later). These results are shown in the figure below.

We also experimented with including the refusal prompts used for training our probes (see the following snippet) in our finetuning mixture. Similarly to resampling latents, we found that this did not make a significant difference, at least compared to the noise in the SAE training process. The autointerp results are challenging to interpret given their high dataset dependence, but we didn’t see any statistically significant differences using this method.

The plot below shows probing performance as a function of the number of repeats of the probe data included in the SAE finetuning mix, faceted by the proportion of SAE latents resampled before finetuning. This only shows probing performance on the OOD set, but it is fairly representative; the effect of these changes does not seem to be statistically significant.

The conclusions to draw from this are a little uncertain. The auto-interp results suggest that finetuning can reduce interpretability compared to the IT-training used in the GemmaScope release if latent resampling is used, whereas it still improved performance on our probing baselines. This may be explained by the fact that the chat finetuning is short compared to pre-training (so chat-specialised latents are less well trained than the pretraining latents they have replaced, and the pretraining latents contribute to the interpretability score even if they aren’t useful for this task).

Using SAEs for OOD Probing

Lewis Smith, Sen Rajamanoharan, Arthur Conmy

One obvious task where we can compare SAEs to a widely used and practical baseline is probing for ‘concepts’ such as harmful user intent, or the topic of a query to the chat model. A commonly used and effective baseline here is linear probing for a particular property, using a linear classifier trained on a dataset of positive and negative examples for the property of interest.
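To make the baseline concrete, here is a minimal sketch of what dense linear probing looks like in practice. This is illustrative only: the array names (`acts_train`, `labels_train`, etc.) are assumptions, and our actual setup differs in details such as regularisation and probe site selection.

```python
# Minimal sketch of the dense linear-probe baseline (illustrative; not our exact code).
# `acts_train` / `acts_ood` are assumed to be [n_examples, d_model] arrays of
# residual-stream activations collected at a fixed probing site, with binary
# harmful-intent labels in `labels_train` / `labels_ood`.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_dense_probe(acts_train: np.ndarray, labels_train: np.ndarray) -> LogisticRegression:
    probe = LogisticRegression(max_iter=1000)
    probe.fit(acts_train, labels_train)
    return probe

def probe_auroc(probe: LogisticRegression, acts: np.ndarray, labels: np.ndarray) -> float:
    scores = probe.predict_proba(acts)[:, 1]  # probability of the 'harmful' class
    return roc_auc_score(labels, scores)

# Example usage (with hypothetical arrays):
# probe = train_dense_probe(acts_train, labels_train)
# print("OOD AUROC:", probe_auroc(probe, acts_ood, labels_ood))
```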

Since linear probes are supervised while SAEs are not, it would not be surprising for linear probes to outperform SAEs on the training distribution. However, it’s less obvious which method would be best when measuring out of distribution generalization of our probe.

If SAEs have learnt latent representations which approximate the features actually used by the model, we might expect that probes based on a single SAE latent, or a sparse probe based on just a few latents, would only capture the most important concepts relevant to our task, which should generalise well out of distribution, while a dense probe has more capacity to pick up on subtler sources of spurious correlation. (However, we later thought of considerations pointing the other way, as discussed below.)

We investigate this by probing for prompts with harmful user intent, i.e. prompts where the user asks the model to do something harmful/enabling harm. This is similar to refusal, but we don’t condition on whether the model actually refuses; the label should remain ‘harmful’ even if a jailbreak is successfully used and the model no longer refuses. We introduce a change of distribution both by using different datasets for our OOD set, as well as introducing a set of held-out jailbreak suffixes to make hard positives (i.e. prompts that are harmful but don’t look it) and hard negatives (i.e. prompts that look harmful but aren’t) for the model.

Technical Setup

Datasets

As our ‘in distribution’ set, we use harmful intent prompts from HarmBench versus harmless data from Alpaca. We find (using SAEs! See more details in the ‘Dataset Debugging’ section) that there are several spurious differences in formatting between these two datasets (such as ending in brackets) that probes can detect, so we normalise the formatting of the datasets to remove these factors.

For an out of distribution set, we use a combination of over-refusal bench (a dataset of ‘seemingly toxic’ prompts which are often mistakenly refused by LLMs), an internal chat training set, and AdvBench, the set of harmful requests used in the paper Universal and Transferable Adversarial Attacks on Aligned Language Models.

We also make the task more challenging by adding some jailbreak suffixes to the harmful prompts in the dataset. We also add some instructions that leave the (potentially harmful) text of the original prompt unchanged, but modify the meaning to make it innocuous, mostly by asking for semantic-level transformations of the text so the model does not actually have to comply with its content. These are (a rough sketch of these transformations follows the list):

  1. ‘How many words in prompt’ - modify the prompt to be of the form: “How many words are in the following text: {original text}”.
  2. ‘Ignore one instruction’ - label the original prompt with a number in a list of otherwise harmless instructions, and tell the model to ignore that number.
  3. ‘All caps’ - ask the model to repeat the prompt back in all caps.
  4. ‘Find synonyms’ - ask the model to provide a list of synonyms for key words in the original prompt.
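Here is the promised sketch of what these innocuous transformations look like (the exact wording of our templates differs, and the function names are purely illustrative):

```python
# Rough sketch of the innocuous transformations above (illustrative wording only).
import random

def how_many_words(original: str) -> str:
    return f"How many words are in the following text: {original}"

def ignore_one_instruction(original: str, harmless_instructions: list[str]) -> str:
    # Hide the original prompt in a numbered list of harmless instructions,
    # then tell the model to ignore that item.
    items = harmless_instructions + [original]
    random.shuffle(items)
    skip_number = items.index(original) + 1
    listing = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(items))
    return f"Follow these instructions, but ignore instruction {skip_number}:\n{listing}"

def all_caps(original: str) -> str:
    return f"Repeat the following text back in all caps: {original}"

def find_synonyms(original: str) -> str:
    return f"Provide a list of synonyms for the key words in the following text: {original}"
```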

After augmentation, we have 4046 training examples, 1014 validation examples, and 3850 examples in our OOD set that we use in the experiments described below. We experimented with a variety of jailbreak methods, the effectiveness of which on the training set is plotted below:

We then chose a subset of these jailbreaks for the experiments discussed below. The considerations here were:

(a) using a variety of transformation types for train and test. For example, many of the prompts we tested are some variation of ‘roleplay type’ prompt injections, but we do include some others, including an adversarial suffix from the GCG paper.

(b) using the weaker transformations for training if we have a range of strengths, in order to make the OOD generalisation more challenging.

(c) avoiding transformations that added too many tokens, as this slowed down iteration (collecting activations becomes slower).

The jailbreak prompts we selected were:

  1. The ‘AIM’ prompt, from jailbreakchat via Wei et al.
  2. The ‘Evil Confidant’ prompt, from the same source.
  3. An adversarial suffix from the GCG paper.
  4. The Pliny ‘Google’ prompt, from this repository.
  5. A ‘cocky genius’ roleplay prompt, which Senthooran Rajamanoharan came up with.
  6. A ‘roleplay’ prompt, which Arthur Conmy came up with, that prompts the model to act as if it is a character in a ‘coming of age’ show, acting out the instructions in the prompt.

In order to test the generalisation of our probing methods, we use ‘AIM’, ‘roleplay’, ‘gcg’ and ‘cocky genius’ jailbreaks in our training distribution, and ‘evil_confidant’ and ‘pliny’ as an OOD test set. Similarly, we use ‘how many words’ and ‘ignore one instruction’ as innocuous transformations in training, and ‘all caps’ and ‘find synonyms’ out of distribution.

Probing

In order to do probing, we need some way of choosing where (what site and layer) in the model to apply the probe, and how to aggregate the probe output across the context: the probe can be applied at any token position in the forward pass, and it is not clear which position to use, or whether the outputs should be aggregated in some way, for instance by applying the probe at every position and taking the mean or the max of the probe decision.

For the Gemma models we study in this snippet, chat data is surrounded by tokens using the following format:

<start_of_turn>user

[User prompt]

<end_of_turn>\n

<start_of_turn>model\n

[Model response]

We experimented with both max-probing, where we first take the max activation of a feature along the sequence dimension, and probing at a fixed position relative to the first token of the model response (so position 0 is the first token of the model response, token -1 is the newline preceding that, etc). We find that max probing is generally the best, though using a fixed position of -5, corresponding to the <end_of_turn> token that completes the user turn, has comparable performance.
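A small sketch of these two aggregation strategies, assuming per-token activations have already been collected (array shapes and names are illustrative, not our exact pipeline):

```python
# Sketch of the two aggregation strategies (assumed shapes; not our exact pipeline).
# `acts` is a [n_tokens, d] array of per-token activations (or SAE latent activations)
# for one prompt; `response_start` is the index of the first token of the model response.
import numpy as np

def max_pool_features(acts: np.ndarray) -> np.ndarray:
    # Max over the sequence dimension: one feature vector per prompt.
    return acts.max(axis=0)

def fixed_position_features(acts: np.ndarray, response_start: int, offset: int = -5) -> np.ndarray:
    # Position 0 is the first token of the model response; offset -5 lands on the
    # <end_of_turn> token that closes the user turn in the Gemma chat format.
    return acts[response_start + offset]
```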

We find that the most reliable performance is obtained by probing at layer 20 of Gemma-v2 9B IT.

Bricken et al found that, for a biology-related task, taking the max of the features along the sequence before training a probe significantly helped the performance of the SAE probe; we only find a small advantage over probing at a fixed position in our application. We hypothesise that this is because information about the harmful content of the prompt is naturally aggregated into the end of turn token by the model. This idiosyncrasy of our particular task may explain the differences between our results and those reported by Anthropic.

We investigate two ways to do probing with SAEs (both are sketched after this list):

  • Do k-sparse probing on the feature activations of the SAEs. Under the assumption that only 1 feature, or a subset of features, captures the information at hand, this should result in less overfitting.
  • Train a linear probe on the reconstructions of the SAEs, or the reconstructions of some subset of the features. This is less likely to be useful as a practical method, but is interesting for investigating whether relevant information for a particular task is preserved by the SAE reconstruction.
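Here is a minimal sketch of both variants, assuming a trained SAE object exposing `encode`/`decode` methods (names and selection details are illustrative; our actual implementation differs):

```python
# Minimal sketch of both SAE probing variants (illustrative; assumes a trained `sae`
# object with `encode` and `decode` methods, and numpy arrays `acts` / `labels`).
import numpy as np
from sklearn.linear_model import LogisticRegression

def k_sparse_probe(sae, acts: np.ndarray, labels: np.ndarray, k: int = 20):
    latents = sae.encode(acts)  # [n_examples, n_latents] latent activations
    # Select the k latents with the largest mean activation difference between classes.
    mean_diff = latents[labels == 1].mean(axis=0) - latents[labels == 0].mean(axis=0)
    top_k = np.argsort(-np.abs(mean_diff))[:k]
    probe = LogisticRegression(max_iter=1000).fit(latents[:, top_k], labels)
    return probe, top_k

def reconstruction_probe(sae, acts: np.ndarray, labels: np.ndarray):
    # Train a dense probe on the SAE reconstruction instead of the raw activations,
    # to test how much task-relevant information the reconstruction preserves.
    reconstructions = sae.decode(sae.encode(acts))
    return LogisticRegression(max_iter=1000).fit(reconstructions, labels)
```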

As we find that our chat SAEs generally produce similar results on this task (and may have a small edge on the autointerp metric depending on the resampling fraction), we use the finetuned SAEs for all results in this snippet, as this method is the quickest to experiment with.

Results

Linear probes on the residual stream perform very well at this task, essentially saturating even on our out of distribution set, indicating information about harmful user intent is present in model activations.

| Subset | Train | Val | OOD |
|---|---|---|---|
| Linear Probe AUROC | 1.0 | 1.0 | 0.999 |

In the plots below, we plot the best single-feature probe of the original GemmaScope SAEs, and a set of SAEs finetuned on chat data (see the finetuning snippet for more details), with the performance of the dense probe shown as a dotted line. We see that while using finetuned SAEs leads to a significant improvement in single-feature probe performance at this task - suggesting that these SAEs more closely capture the concept that we care about in a single feature - they remain below the performance of the dense linear probe.

Initially, we expected that SAE based probes would not outperform linear probes on in-distribution data, but we expected that they might have an advantage out of distribution, especially those based on a single feature or a small number of features, under the hypothesis that SAE features would generalise more robustly out of distribution. However, we generally find that even on the OOD set, SAE probing does not match the performance of a linear probe, though the best single-feature probe from the finetuned model does come close in some cases. We were surprised by the performance and robustness of the linear probe; this is consistent with probes beating SAEs in the OOD settings studied by Kantamneni et al, though we can’t make confident extrapolations about probes vs SAEs in all OOD settings.

We can expand on single-feature probes by using k-sparse probing, for k between 2 and 50. Using k-sparse probing largely closes the gap to the near-perfect performance of dense probes for k >= 5, although generally a small gap remains. Notably, despite k-sparse probing performing near perfectly in distribution, it shows worse transfer OOD, while dense probes generalise well - the opposite of what we predicted should happen with high quality SAEs.

We experiment with two methods for k-sparse probing: selecting k latents by sorting by the mean difference of latent activations between positive and negative examples in the training set, and using L1-regularised regression with the coefficient swept to select a number of active probe latents close to the target k (and then retraining an unregularised model on the selected latents). The two methods generally show similar results, as shown below, matching the results of Gurnee et al on neurons.
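As a sketch of the L1-based selection (the sweep grid and solver settings here are assumptions, not our exact configuration):

```python
# Sketch of the L1-based selection: sweep the regularisation strength until roughly
# `target_k` latents are active, then refit an (effectively) unregularised probe on them.
import numpy as np
from sklearn.linear_model import LogisticRegression

def l1_select_then_refit(latents: np.ndarray, labels: np.ndarray, target_k: int = 20):
    best_support, best_gap = None, np.inf
    for c in np.logspace(-3, 1, 20):
        l1_probe = LogisticRegression(penalty="l1", solver="liblinear", C=c, max_iter=1000)
        l1_probe.fit(latents, labels)
        support = np.flatnonzero(l1_probe.coef_[0])  # latents with non-zero weight
        if len(support) > 0 and abs(len(support) - target_k) < best_gap:
            best_support, best_gap = support, abs(len(support) - target_k)
    # Refit with a very weak penalty on just the selected latents.
    final_probe = LogisticRegression(C=1e6, max_iter=1000).fit(latents[:, best_support], labels)
    return final_probe, best_support
```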

Our results with k-sparse probing generally corroborate our results that finetuning SAEs helps, with finetuned SAEs generally outperforming the original GemmaScope SAEs, particularly for low L0 values, but it is not enough to achieve parity with dense probes.

SAE probing was also studied in parallel by Kantamneni et al and Bricken et al. Generally, we think that our results are fairly consistent with the literature here. In particular, Kantamneni et al. carry out a similar analysis to ours over a wide variety of datasets, finding broadly comparable results: sparse probing wins only rarely, having an advantage only when the labels are highly corrupted or potentially in low-data regimes. They also found better performance on simpler datasets where the SAE had a single highly relevant latent, and not on more subtle datasets.

Similarly, Bricken et al recently studied how SAE probes compare to linear probes. They study detecting biology-related misuse, rather than harmful intent. Unlike us, they find a slight advantage for SAEs, in that SAEs allow max-aggregating latents along the sequence dimension before training a classifier (max-pooling the activations), which lets their classifier be more sensitive to data distributed throughout the prompt. However, Kantamneni et al show that, though max-aggregating latents can be an advantage over single-token dense probes in some settings, it does not systematically beat attention head probes, and we speculate that attention head probes would also do well in our setting. We do not find that max-pooling is particularly helpful for our task; we hypothesise that this may be because the information relevant to the classifier in our case is fairly temporally concentrated.

Like us, both Kantamneni et al and Bricken et al find that individual SAE latents are useful for finding spurious correlations in a dataset. Notably, Kantamneni et al find a single latent predicting grammatical acceptability with similar accuracy to the “ground truth” labels in GLUE CoLA, due to the high level of label noise. In some ways, our empirical results understate the utility of this, as we used the inspectability of our probes to clean our datasets of spurious correlations; if we hadn’t done this, our linear probes would presumably have learnt these too, which would have decreased their generalisation out of distribution. However, Kantamneni et al showed that these spurious correlations could be discovered by other methods, which we speculate would work for us too.

In general, we do find that there is a persistent performance gap between SAEs and probes. In part, this is because SAEs remain imperfect, and we do find that relevant information is lost in the reconstruction term, with probes on the SAE reconstruction generally performing slightly worse than a probe on the raw residual stream.

Performance delta between probe trained on residual stream and on SAE reconstruction.

| | Train | Test | OOD |
|---|---|---|---|
| GemmaScope | 0.00087 | 0.0018 | 0.048 |
| SAE finetune (comparable steps) | 0.0013 | 0.0024 | 0.039 |

Consistent with these negative results for SAE probing, very recent work by Apollo Research also tried SAE probes for detecting deceptive behaviour, finding that linear probes outperformed SAE probing.

We think it is curious that many of the variations on finetuning we tried did not seem to make a significant difference. It is possible that our setup is very noisy; perhaps a larger probing dataset would allow us to see a significant difference between these methods.

Is it surprising that SAEs didn’t work?

In general, we found it unexpectedly hard to reason about how to update from positive/negative results on downstream tasks. In some sense, we have a hammer and are looking for a nail. We have this cool technique of SAEs, and want to find problems well suited to it. We care about how often it’s useful, and the magnitude of benefit. It doesn’t need to be useful for everything, but we care much more about some tasks than others. If you find one successful application, that doesn’t mean it will work in less cherry-picked cases, and if you have one failure, maybe it was badly suited to SAEs, but other important tasks would work fine. And even if you do have several successes, that may not be enough to justify the costs of training and R&D. Overall, we think that seeking evidence of interpretability on downstream tasks is valuable and something we hope to see more of in future - you can’t update too much off of the evidence, but it seems like a crucial question and this is one of the better sources of evidence we can currently access.

Another complication is that interpretability is quite hard to reason clearly about in general. You might have success on a task, but for quite different reasons than you thought, and you might think SAEs should help on a task, but actually your intuition is wrong. To illustrate this, in hindsight, we're now a lot more confused about whether even an optimal SAE should beat linear probes on OOD generalisation.

Here’s our fuzzy intuition for what’s going on. We have a starting distribution, an OOD distribution, a train set and a val set (from the original distribution) and an OOD test set. There are three kinds of things a probe can learn for predicting the train set:

  1. Noise: patterns specific to the train set that do not generalise out of sample, i.e. on the test set. Learning noise gives rise to overfitting.
  2. Spurious correlations: patterns that predict the concept in the original distribution (and which are predictive even on the test set) but that do not generalise out of distribution.
  3. True signal: patterns that are predictive of the target concept both in-distribution and out-of-distribution.

Noise is easily ignored given enough data. Indeed, on this task we found that both SAEs and linear probes learn to basically classify perfectly on the test set (i.e. out of sample, but in-distribution). So the difference in performance between SAEs and linear probes out of distribution comes down to how their respective inductive biases help them to latch on to the true signal versus spurious correlations: it seems that SAE probes do a worse job at this than linear probes. Why is this?

Our original hypothesis had been that sparse SAE probes would in fact have a better inductive bias for picking the true signal over spurious correlations. This was in large part because we had assumed that the concept we’re looking to learn (“harmful intent”) is a fairly simple logical expression, composed of a small number of “atomic” concepts, both causal (like danger, violence, profanity, etc) and correlational (like linguistic features indicative of users who are trying to commit harm or of jailbreaks), that we expected SAEs would have assigned to single latents or small clusters of latents. Under this assumption, we guessed that a sparse SAE probe would easily find the relevant concepts among its latents and therefore extract the true signal and ignore spurious correlations. The fact that this doesn’t happen suggests a number of things that could be going wrong:

  1. The concept “harmful intent” (as defined implicitly by the harmful datasets we used to train the probe) may not really be a simple function of a few “atomic” concepts at all. I.e., even with a “perfect” SAE - one whose latents correspond precisely to the concepts the model uses - it could be the case that the positive and negative labels in our datasets can’t be generated by a simple expression involving just a few of these concepts.
    1. This could indicate a fundamental problem with SAEs - we want concepts like harmful intent, and if high functioning SAEs do not usefully isolate them, this is bad.
    2. Or it could just be that harmful intent is a fuzzily defined concept, and depends a lot on the subjective values of the labellers/those setting rules for labellers. Though under this hypothesis, it’s surprising that linear probes generalise well.
  2. Even if “harmful intent” is really a simple function of a few basic interpretable concepts, because our SAEs don’t learn these concepts cleanly, a sparse SAE probe doesn’t manage to capture the true signal.
  3. Since SAEs are incomplete, perhaps important directions corresponding to components of the true signal are missing from the dictionary, forcing the SAE probe to make do with approximating these directions from the latents that it does possess.
  4. Patterns that are spurious correlations may be just as well represented by latents in the SAE’s dictionary as patterns that make up the true signal, meaning that there is no reason to expect a sparse SAE probe to preferentially latch onto components of the true signal over spurious correlations.

In practice, our guess is that a mix of these are going on. Hypotheses 2 and 3 would suggest that we just haven’t gotten good enough at training SAEs but that it may be a promising direction, while hypothesis 1 suggests fundamental issues with SAEs as a method. Hypothesis 4 would suggest that SAEs could make a promising probing method if we augmented them by pruning out latents that seemed like spurious correlations, but we de facto already did this by using them to clean up correlations in the data itself, benefitting both the SAE probe and the dense probe. Overall, this means that it’s hard to draw a clear conclusion from the data available, but it does seem a bad sign for the practical utility of SAEs.

Dataset debugging with SAEs

Sen Rajamanoharan, Lewis Smith, Arthur Conmy

One use case where we did find SAEs useful was in dataset debugging. As the SAE latents are (somewhat) interpretable, inspecting the latents with big mean differences between positive and negative labels can reveal useful information about our dataset. This mirrors similar findings in Kantamneni et al and Bricken et al.

For instance, when running an early version of this analysis, we inspected the best-performing features at separating Alpaca and HarmBench and realised that one of the best performing was a feature that seemed to be a question mark feature. On inspection, we realised that only Alpaca contained questions, whereas the harmful examples were all instructions. Being able to inspect the dataset in this way using a model with SAEs was definitely useful in catching this error, and other than manual inspection, it’s not clear if we would ever have noticed this if we exclusively used linear probing. That said, Kantamneni et al were able to achieve similar results by taking maximum activating dataset examples of a linear probe over the pretraining data.

This also somewhat complicates the story that linear probing generally performed better than SAEs; this is presumably partly because we were able to use SAEs to manually clean spurious correlations from our dataset before training a supervised probe.

One interesting thing to note here is that you can use SAE-based techniques like this for dataset exploration even if you don’t have an SAE for the model you want to probe; for instance, a model like Gemma 2 with an SAE sweep could be used to detect issues like this before you train something based on a larger model, as we are interested in properties of the data, not the model.

This exploratory property is one reason why people expect that unsupervised techniques like SAEs will be useful for safety work, as they make it easier to discover unanticipated features than supervised techniques. However, as mentioned, in this instance SAEs seemed useful as much for investigating datasets as for investigating models.

Autointerp and high frequency latents

Callum McDougall

In the next section, we’ll compare autointerpretability scores across models with different latent density distributions (in particular, some models with more high-frequency features than others). This introduces a problem - traditionally autointerp is measured as an average score over all latents, but this doesn’t capture the fact that eliminating a high-frequency uninterpretable feature is intuitively better than eliminating a low-frequency uninterpretable feature. One way to think about this is that, if we sampled a prompt from the pretraining distribution, gave it to the model, and looked at the latents that fire, we’d hope that they tend to be interpretable. The corresponding way to measure this is to average the autointerp score for each latent weighted by how frequently it fires, rather than uniformly. We have found frequency-weighted auto-interp scores to be instructive, and recommend that practitioners plot them in addition to uniformly weighted scores.

As an extreme example, we can construct a pathological SAE where 𝑀−𝑁 latents encode specific bigrams and the remaining 𝑁 latents fully reconstruct the rest of the 𝑁-dimensional SAE input - this would score very well on autointerp with uniform average scores provided 𝑀 >> 𝑁, but poorly with frequency-weighted average scores. For this reason, although we show both uniform and frequency-weighted autointerp results in the experiments below, we’ll focus more on the frequency-weighted scores in subsequent discussion.

Aside from this weighting strategy, the rest of the autointerp methodology follows standard practices such as those introduced in Bills et al and built on by EleutherAI. An explanation is generated by constructing a prompt that contains top activating example sequences as well as examples sampled from each quantile and a selection of random non-activating examples, and asking the model to characterise the kinds of text which cause the latent to fire. We then give the model a set of unlabelled sequences (some from the top activating or quantile groups, others non-activating) and ask it to estimate the quantised activation value of the highest-activating token in each sequence. The latent’s interpretability score is then computed as the Spearman correlation between the true and estimated quantised activations.
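For concreteness, here is a sketch of the scoring and aggregation step only; the explanation-generation and activation-simulation prompts are omitted, and the names are illustrative rather than our actual code.

```python
# Sketch of per-latent scoring (Spearman correlation of true vs estimated quantised
# activations) and of uniform vs frequency-weighted aggregation across latents.
import numpy as np
from scipy.stats import spearmanr

def latent_interp_score(true_quantised: np.ndarray, estimated_quantised: np.ndarray) -> float:
    # One entry per held-out sequence: the quantised max activation on that sequence.
    rho, _ = spearmanr(true_quantised, estimated_quantised)
    return rho

def aggregate_autointerp(scores: np.ndarray, firing_frequencies: np.ndarray,
                         weighting: str = "frequency") -> float:
    if weighting == "uniform":
        return float(np.mean(scores))
    # Frequency weighting: latents that fire more often count for more.
    return float(np.average(scores, weights=firing_frequencies))
```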

Removing High Frequency Latents from JumpReLU SAEs

Senthooran Rajamanoharan, Callum McDougall, Lewis Smith

TL;DR: Both JumpReLU and TopK SAEs suffer from high frequency latents: latents that fire on many tokens (10% or more of the dataset) and often seem uninterpretable. We find that by tweaking the sparsity penalty used to train JumpReLU SAEs we can largely eliminate high frequency latents, with only a small cost in terms of reconstruction error at fixed L0. Auto-interp suggests that this has a neutral-to-positive effect on average latent interpretability once we weight by latent frequency, by reducing the incidence of high frequency uninterpretable latents.

Method

Motivation

In our paper, we trained JumpReLU SAEs using a L0 sparsity penalty, which penalises a SAE in proportion to the number of latents that fire on each token:

$$\mathcal{S}(\mathbf{x}) = \lambda \, \|\mathbf{f}(\mathbf{x})\|_0 = \lambda \sum_{i=1}^{M} \mathbf{1}\left[f_i(\mathbf{x}) > 0\right]$$

where $f_i(\mathbf{x})$ is the activation of the $i$-th latent on an $n$-dimensional input LM activation $\mathbf{x}$, and $M$ is the width (total number of latents) of the SAE.[1]

This L0 sparsity penalty only cares about controlling the average firing frequency across all latents in a SAE; it is actually indifferent to the range of firing frequencies in the SAE. In other words, L0 doesn’t care whether a low firing frequency is achieved by: (a) all latents firing infrequently, or (b) some latents firing frequently and other latents firing very infrequently to compensate. As long as the average firing frequency is the same either way, L0 doesn’t mind.[2]
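A toy numerical illustration of this indifference (made-up numbers, purely for intuition):

```python
# Two very different firing-frequency histograms give exactly the same expected L0,
# so the L0 penalty cannot tell them apart.
import numpy as np

# (a) 1000 latents, each firing on 1% of tokens
freqs_even = np.full(1000, 0.01)
# (b) 10 latents firing on 50% of tokens, the remaining 990 firing rarely to compensate
freqs_skewed = np.concatenate([np.full(10, 0.5), np.full(990, 5.0 / 990)])

# Expected L0 per token is the sum of latent firing frequencies: identical in both cases.
print(freqs_even.sum(), freqs_skewed.sum())  # both sum to 10
```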

We can see this formally by noticing that the average L0 penalty on a training batch $B$ can be expressed as follows:

$$\frac{\lambda}{|B|} \sum_{\mathbf{x} \in B} \|\mathbf{f}(\mathbf{x})\|_0 = \lambda \sum_{i=1}^{M} \hat{\nu}_i = \lambda M \bar{\nu}$$

where $\hat{\nu}_i = \frac{1}{|B|} \sum_{\mathbf{x} \in B} \mathbf{1}\left[f_i(\mathbf{x}) > 0\right]$ is a single-batch estimate of the firing frequency of the $i$-th latent and $\bar{\nu}$ is the mean firing frequency (as estimated on this batch) across all latents in the SAE. Hence, the L0 sparsity penalty is exactly proportional to the mean of the SAE’s firing frequency histogram, and therefore indifferent to the spread of latent firing frequencies in this histogram. This in turn leads to the rise of high frequency latents: as long as these latents are beneficial for minimizing the SAE's reconstruction error, training with a L0 penalty provides insufficient pressure to prevent high frequency latents from forming, as can be seen in the following frequency histograms.

Latent firing frequency histograms for Gated, JumpReLU and TopK SAEs. Unlike Gated SAEs, which use a L1 penalty that penalizes large latent activations, JumpReLU (middle) and TopK (bottom) SAEs exhibit high-frequency latents: latents that fire on 10% or more of tokens (i.e. that lie to the right of the dotted vertical line).

Modifying the sparsity penalty

Now, a nice feature of JumpReLU SAEs is that we are not beholden to using a L0 sparsity penalty.[3] Given the observation above, an obvious way to get rid of high frequency features is to modify the sparsity penalty so that it does penalise dispersion (and in particular right-tail dispersion) in addition to the mean of the firing frequency histogram. There are many ways to do this, but here we explore arguably the simplest approach with this property, which is to add a term to the sparsity penalty that is quadratic in latent frequencies:

$$\mathcal{S}_{\mathrm{QF}} = \lambda \sum_{i=1}^{M} \left( \hat{\nu}_i + \frac{\hat{\nu}_i^2}{\nu_0} \right)$$

Note that the first term in this quadratic-frequency sparsity penalty is the standard L0 penalty, whereas the second term is proportional to the mean squared latent frequency. We have introduced a new hyperparameter $\nu_0$, which sets the frequency scale at which the sparsity penalty switches from penalising latent frequency roughly linearly (for latents with frequencies $\hat{\nu}_i \ll \nu_0$) to penalising latent frequency quadratically (for latents with frequencies $\hat{\nu}_i \gg \nu_0$). This in turn disincentivises high frequency latents from appearing, while latents lower down the frequency distribution are treated similarly to latents in a standard (L0-based) JumpReLU SAE.[4]
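For concreteness, here is a sketch of this penalty’s functional form on a batch of latent activations. This is illustrative only: our actual training code uses straight-through estimators to differentiate through the 0/1 firing indicator, as described in footnote 1, and this sketch omits that detail.

```python
# Functional form of the quadratic-frequency sparsity penalty on a batch (illustrative;
# the real training code handles gradients through the firing indicator with STEs).
import numpy as np

def quadratic_frequency_penalty(latent_acts: np.ndarray, nu_0: float, sparsity_coeff: float) -> float:
    fires = (latent_acts > 0).astype(np.float64)   # [batch, n_latents] firing indicators
    freqs = fires.mean(axis=0)                     # per-latent firing frequency on this batch
    l0_term = freqs.sum()                          # proportional to the standard L0 penalty
    quadratic_term = (freqs ** 2).sum() / nu_0     # dominates for latents with freqs >> nu_0
    return sparsity_coeff * (l0_term + quadratic_term)
```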

How we evaluated interpretability

When evaluating the effectiveness of JumpReLU variants, we face a problem: traditionally the interpretability of a SAE is measured as the average auto-interp score over all its latents, but this doesn't capture the fact that eliminating a high-frequency uninterpretable latent is typically better than eliminating a low-frequency uninterpretable latent. As an extreme example, we can construct a pathological SAE where M − N latents encode specific bigrams and the remaining N latents fully reconstruct the rest of the N-dimensional SAE input: this would score very well on auto-interp with uniform average scores provided M >> N, but poorly with frequency-weighted average scores. For this reason, although we show both uniform and frequency-weighted auto-interp results in the experiments below, we'll focus more on the frequency-weighted scores in the subsequent discussion.[5]

Aside from this weighting strategy, the rest of the auto-interp methodology follows standard practices such as those introduced in Bills et al. (2023) and built on by Paulo et al. (2024). An explanation is generated by constructing a prompt that contains top activating example sequences as well as examples sampled from each quantile and a selection of random non-activating examples, and asking the model to characterise the kinds of text which causes the latent to fire. We then give the model a set of unlabelled sequences (some from the top activating or quantile groups, others non-activating) and ask it to estimate the quantised activation value of the highest-activating token in that sequence. The latent's interpretability score is then computed as the Spearman correlation between the true & estimated quantised activations.

Results

In our experiments below, we try two settings of the frequency scale $\nu_0$, chosen because we are interested in suppressing latents that fire on more than about 10% of tokens. We compare quadratic-frequency loss JumpReLU SAEs trained this way against standard (L0 penalty) JumpReLU SAEs, Gated SAEs and TopK SAEs.[6] We train 131k-width SAEs from scratch on layer 20 Gemma 2 9B IT activations on instruction tuning data (see the chat SAE snippet earlier for a description of this data).

Reconstruction loss at fixed sparsity

Reconstruction loss vs L0 for the various SAE architectures and loss functions used in our experiment. The quadratic-frequency penalty (QF loss) has slightly worse reconstruction loss at any given sparsity than standard JumpReLU SAEs (L0 loss), but still compares favourably with Gated and TopK SAEs.

As expected (since we are no longer optimising for L0 directly), JumpReLU SAEs trained with the quadratic-frequency penalty have slightly worse reconstruction loss at a given sparsity than JumpReLU SAEs trained with the standard L0 loss. However, the quadratic-frequency penalty JumpReLU SAEs still compare favourably to TopK and Gated SAEs.

Frequency histograms

Latent firing frequency histograms for JumpReLU SAEs trained with a standard L0 loss (top) or with the quadratic-frequency loss at each of the two $\nu_0$ settings (middle and bottom). The quadratic frequency loss successfully removes high frequency features (i.e. latents around or to the right of the red dotted vertical line) without changing the shape of the rest of the frequency histogram.

As shown above, the quadratic-frequency penalty successfully suppresses high-frequency latents without having a noticeable effect on the shape of the remainder of the frequency histogram (particularly in one of the two $\nu_0$ cases shown above).

Latent interpretability

Average auto-interp score vs L0 for the various SAE architectures and loss functions used in our experiment, for different latent weightings. Uniform weighting slightly disfavours JumpReLU variants but doesn't show clear patterns; frequency-weighting clearly shows outperformance of JumpReLU variants at lower sparsities (higher L0).

As shown above, we observe that the uniform average scores don't show a clear pattern (with Gated & TopK SAEs slightly outperforming for given sparsity values, and JumpReLU variants slightly under-performing). But the frequency-weighted plots show much clearer patterns: (1) nearly all SAEs have lower average latent interpretability scores at larger L0 values (which makes sense under the hypothesis that penalizing lack of sparsity leads to more monosemantic latents), and (2) the JumpReLU variants go against this trend at high L0 values by actually getting more interpretable. Digging deeper into this, we can plot the average latent auto-interp score for each SAE at different latent frequency quantiles, as shown in the appendix, for the standard JumpReLU and the quadratic-frequency loss variant. We find that SAEs at all sparsity levels show a negative trend of latent interpretability against latent frequency, although the quadratic-frequency variant still has latents that are both high-frequency and interpretable at all sparsity levels.

Conclusions

In this snippet we've shown how it's possible to modify the JumpReLU loss function to obtain further desirable properties, like removing high frequency latents.

The quadratic-frequency penalty successfully eliminates high frequency latents with only a modest impact on reconstruction loss at fixed sparsity. On auto-interp it tends to score better than standard JumpReLU when we weight individual latents' interpretability scores by frequency (which penalises uninterpretable high frequency latents more heavily than uniform weighting), although at lower L0s Gated SAEs seem to have the best scores of all the SAE varieties we evaluated. Counter-intuitively (and in contrast to the other SAE varieties), the average interpretability of quadratic-frequency SAEs seems to increase with L0 beyond a certain point! We don't have a good explanation for this phenomenon and haven't investigated it further.

Does this mean the quadratic-frequency penalty is an improvement over standard JumpReLU SAEs trained with a L0 penalty? We're unsure for a number of reasons:

  • We haven't evaluated these variants extensively, for example on downstream tasks like steering or probing.
  • It's unclear whether we should really aim to remove high frequency latents in the first place. Perhaps they do represent meaningful directions in activation space that we just haven't been able to interpret yet. It's also possible that some high frequency features are pathological and others are legitimate, and we need a more nuanced loss than simply penalising firing frequency to target the former without affecting the latter.
  • The quadratic-frequency loss may simply promote partitioning individual high frequency latents into several (similarly uninterpretable) lower frequency latents, effectively sweeping the problem under the rug.[7]
  • We haven't tried stacking these changes with other improvements like a Matryoshka reconstruction loss (to deal with feature absorption) or the ideas in Anthropic's January 2025 update.

Nevertheless, by sharing our ideas and results, we hope that practitioners who run into issues with high frequency latents may try variants on the standard JumpReLU loss like quadratic-frequency (or iterate further on them) and see if they provide an improvement for their use case.

Appendix

Average autointerp score vs L0 for the JumpReLU SAEs and the JumpReLU QF loss variants. All SAEs show a negative trend of autointerp score against latent frequency, although the quadratic-frequency loss function seems to help the SAE form interpretable latents even at higher frequencies - the curves for higher L0 SAEs are squashed to the right.

  1. ^

    As we describe in detail in the paper, we use straight-through-estimators (STEs) to differentiate through the step discontinuities in both the L0 sparsity penalty and the jump discontinuity in the JumpReLU activation function to train JumpReLU SAEs. We use the same method here.

  2. ^

    A very similar argument can be made about TopK SAEs, which control L0 via the k parameter.

  3. ^

    Indeed, in our paper, we show how we can modify the sparsity penalty to train SAEs that target a fixed L0, much like TopK SAEs. Here, we will instead modify the sparsity penalty to directly target high frequency features. 

  4. ^

    An alternative, and even simpler, sparsity penalty would be to penalise all latents according to the square of their firing frequencies, i.e. using a penalty of the form $\lambda \sum_{i=1}^{M} \hat{\nu}_i^2$. This squared-frequency penalty also has the advantage of not introducing yet another hyperparameter. However, this penalty under-penalises latents with very low firing frequencies, leading to a frequency distribution that is devoid of both high and low firing frequencies, i.e. more sharply peaked around the mean firing frequency. One way to see why this is the case is to notice that such a penalty can also be expressed as $\lambda M \left( \mathrm{Var}[\hat{\nu}] + \bar{\nu}^2 \right)$: i.e. holding the mean firing frequency $\bar{\nu}$ fixed, it corresponds to minimising the variance of the frequency distribution. In contrast, the quadratic-frequency sparsity penalty here ensures that all latents receive a frequency penalty that is at least linear, while high frequency latents receive an additional penalty that is quadratic; this ensures that the lower part of the frequency distribution remains similar to the latent frequency distributions for JumpReLU and TopK, while nevertheless suppressing the top end of the frequency distribution.

  5. ^

    Note that this is a departure from the approach we've taken in earlier work, where latents were sampled uniformly for automatic or manual interpretation, with a low-frequency cutoff to deal with latents that have insufficient data.

  6. ^

    We train all JumpReLU variants using straight-through-estimators that only provide gradients to the threshold, as in the original paper.

  7. ^

    This could be happening with Gated SAEs too.


1.
^

Sam Marks:

To clarify, my points here are about OOD probing experiments where the SAE-based intervention is "just regularize the probe to attend to a sparse subset of the latents."

I think that OOD probing experiments where you use human understanding to whitelist or blacklist some SAE latents are a fair test of an application of interpretability that I actually believe in. (And of course, the "blacklist" version of this is what we did in Sparse Feature Circuits https://x.com/saprmarks/status/1775513451668045946… )

1.
^

Just to establish the historical claim about my beliefs here:

  • Here I described the idea that turned into SHIFT as "us[ing] vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want".
  • After Sparse Feature Circuits came out, I wrote in private communications to Neel "a key move I did when picking this project was 'trying to figure out what cool applications were possible even with small amounts of mechanistic insight.' I guess I feel like the interp tools we already have might be able to buy us some cool stuff, but people haven't really thought hard about the settings where interp gives you the best bang-for-buck. So, in a sense, doing something cool despite our circuits not being super-informative was the goal"
  • In April 2024, I described a core thesis of my research as being "maybe shitty understanding of model cognition is already enough to milk safety applications out of."
2.
^

The observation that there's a simple token-deletion based technique that performs well here indicates that the task was easier than expected, and therefore weakens my confident that SHIFT will empirically work when tested on a more complicated spurious correlation removal task. But it doesn't undermine the conceptual argument that this is a problem that interp could solve despite almost no other technique having a chance.

When I say unsupervised learning, I'm using this term in its conventional sense, e.g. clustering algorithms, manifold learning, etc; not in the sense of tasks like language model pre-training which I sometimes see referred to as unsupervised. ↩︎

Particularly its emphasis on techniques to prune massive attribution graphs, improving tooling for making sense of the results, and accepting that some manual adjustment of the decompositions produced by CLTs may be necessary because we're giving up on the idea that CLTs / SAEs are uncovering a "true basis". ↩︎

And it does seem that black box methods often suffice (in the sense of giving "good enough explanations" for whatever we need these explanations for) when we try to do this. Though this could just be - as you say - because of bad judgement. I'd definitely appreciate suggestions for better downstream tasks we should try! ↩︎

Comments

Copying over from X an exchange related to this post:

Tom McGrath:

I’m a bit confused by this - perhaps due to differences of opinion in what ‘fundamental SAE research’ is and what interpretability is for. This is why I prefer to talk about interpreter models rather than SAEs - we’re attached to the end goal, not the details of methodology. The reason I’m excited about interpreter models is that unsupervised learning is extremely powerful, and the only way to actually learn something new.

[thread continues]

Neel Nanda:

A subtle point in our work worth clarifying: Initial hopes for SAEs were very ambitious: finding unknown unknowns, but also representing them crisply and ideally giving a complete decomposition. Finding unknown unknowns remains promising, but is a weaker claim on its own; we tested the others.

OOD probing is an important use case IMO but it's far from the only thing I care about - we were using a concrete case study as grounding to get evidence about these empirical claims - a complete, crisp decomposition into interpretable concepts should have worked better IMO.

[thread continues]

Sam Marks (me):

FWIW I disagree that sparse probing experiments[1] test the "representing concepts crisply" and "identify a complete decomposition" claims about SAEs. 

In other words, I expect that—even if SAEs perfectly decomposed LLM activations into human-understandable latents with nothing missing—you might still not find that sparse probes on SAE latents generalize substantially better than standard dense probing.

I think there is a hypothesis you're testing, but it's more like "classification mechanisms generalize better if they only depend on a small set of concepts in a reasonable ontology" which is not fundamentally a claim about SAEs or even NNs. I think this hypothesis might have been true (though IMO conceptual arguments for it are somewhat weak), so your negative sparse probing experiments are still valuable and I'm grateful you did them. But I think it's a bit of a mistake to frame these results as showing the limitations of SAEs rather than as showing the limitations of interpretability more generally (in a setting where I don't think there was very strong a priori reason to think that interpretability would have helped anyway).

While I've been happy that interp researchers have been focusing more on downstream applications—thanks in part to you advocating for it—I've been somewhat disappointed in what I view as bad judgement in selecting downstream applications where interp had a realistic chance of being differentially useful. Probably I should do more public-facing writing on what sorts of applications seem promising to me, instead of leaving my thoughts in cranky google doc comments and slack messages.

Neel Nanda:

To be clear, I did *not* make such a drastic update solely off of our OOD probing work. [...] My update was an aggregate of:

  • Several attempts on downstream tasks failed (OOD probing, other difficult condition probing, unlearning, etc)
  • SAEs have a ton of issues that started to surface - composition, absorption, missing features, low sensitivity, etc
  • The few successes on downstream tasks felt pretty niche and contrived, or just in the domain of discovery - if SAEs are awesome, it really should not be this hard to find good use cases...

It's kinda awkward to simultaneously convey my aggregate update, along with the research that was just one factor in my update, lol (and a more emotionally salient one, obviously)

There's disagreement on my team about how big an update OOD probing specifically should be, but IMO if SAEs are to be justified on pragmatic grounds they should be useful for tasks we care about, and harmful intent is one such task - if linear probes work and SAEs don't, that is still a knock against SAEs. Further, the major *gap* between SAEs and probes is a bad look for SAEs - I'd have been happy with close but worse performance, but a gap implies failure to find the right concepts IMO - whether because harmful intent isn't a true concept, or because our SAEs suck. My current take is that most of the cool applications of SAEs are hypothesis generation and discovery, which is cool, but idk if it should be the central focus of the field - I lean yes but can see good arguments either way.

I am particularly excited about debugging/understanding based downstream tasks, partially inspired by your auditing game. And I do agree the choice of tasks could be substantially better - I'm very in the market for suggestions!

Sam Marks:

Thanks, I think that many of these sources of evidence are reasonable, though I think some of them should result in broader updates about the value of interpretability as a whole, rather than specifically about SAEs.

In more detail:

SAEs have a bunch of limitations on their own terms, e.g. reconstructing activations poorly or not having crisp features. Yep, these issues seem like they should update you about SAEs specifically, if you initially expected them to not have these limitations.

Finding new performant baselines for tasks where SAE-based techniques initially seemed SoTA. I've also made this update recently, due to results like:

(A) Semantic search proving to be a good baseline in our auditing game (section 5.4 of https://arxiv.org/abs/2503.10965 )

(B) Linear probes also identifying spurious correlations (section 4.3.2 of https://arxiv.org/pdf/2502.16681 and other similar results)

(C) Gendered token deletion doing well for the Bias in Bios SHIFT task (https://lesswrong.com/posts/QdxwGz9AeDu5du4Rk/shift-relies-on-token-level-features-to-de-bias-bias-in-bios… )

I think the update from these sorts of "good baselines" results is twofold:

1. The task that the SAE was doing isn't as impressive as you thought; this means that the experiment provides less validation than you realized that SAEs, specifically, are useful.

2. Tasks where interp-based approaches can beat baselines are rarer than you realized; interp as a whole is a less important research direction.

It's a bit context-dependent how much of each update to make from these "good baselines" results. E.g. I think that the update from (A) is almost entirely (2)—it turns out that it's easier than we realized to understand training data with non-interp approaches. But the baseline in (B) is arguably an interp technique, so mostly it just steals valor from SAEs in favor of other interpretability approaches.

Obvious non-interp baselines outperformed SAEs on [task]. I think this should almost always result in update (2)—the update that interp as a whole is less needed than we thought. I'll note that in almost every case, "linear probing" is not an interp technique in the relevant sense: If you're not actually making use of the direction you get and are just using the probe as a classifier, then I think you should count probing as a non-interp baseline.

Arthur Conmy:

I agree with most of this post. FWIW, 1) I personally have more broadly updated down on interp, and since the initial experiments of this work I have worked not so much on mech interp but instead on model internals and evals. 2) I do think SAEs are still underperforming relative to investment from the field, including today's progress on CLTs! It is exciting work, but IMO there are a lot of ifs ahead of SAEs actually providing nontrivial counterfactual direct value to safety.

  1. ^

    Sam Marks:

    To clarify, my points here are about OOD probing experiments where the SAE-based intervention is "just regularize the probe to attend to a sparse subset of the latents."

    I think that OOD probing experiments where you use human understanding to whitelist or blacklist some SAE latents are a fair test of an application of interpretability that I actually believe in. (And of course, the "blacklist" version of this is what we did in Sparse Feature Circuits https://x.com/saprmarks/status/1775513451668045946… )

Thanks for copying this over!

For what it's worth, my current view on SAEs is that they remain a pretty neat unsupervised technique for making (partial) sense of activations, but they fit more into the general category of unsupervised learning techniques, e.g. clustering algorithms, than as a method that's going to discover the "true representational directions" used by the language model. And, as such, they share many of the pros and cons of unsupervised techniques in general:[1]

  • (Pros) They may be useful / efficient for getting a first-pass understanding of what's going on in a model / with some data (indeed many of their success stories have this flavour).
  • (Cons) They are hit and miss - often not carving up the data in the way you'd prefer, with weird omissions or gerrymandered boundaries you need to manually correct for. Once you have a hypothesis, a supervised method will likely give you better results.

I think this means SAEs could still be useful for generating hypotheses when trying to understand model behaviour, and I really like the CLT papers in this regard.[2] However, it's still unclear whether they are better for hypothesis generation than alternative techniques, particularly techniques that have other advantages, like the ability to be used with limited model access (i.e. black-box techniques) or techniques that don't require paying a large up-front cost before they can be used on a model.

I largely agree with your updates 1 and 2 above, although on 2 I still think it's plausible that while many "why is the model doing X?" type questions can be answered with black-box techniques today, this may not continue to hold into the future, which is why I still view interp as a worthwhile research direction. This does make it important though to always try strong baselines on any new project and only get excited when interp sheds light on problems that genuinely seem hard to solve using these baselines.[3]


  1. When I say unsupervised learning, I'm using this term in its conventional sense, e.g. clustering algorithms, manifold learning, etc; not in the sense of tasks like language model pre-training which I sometimes see referred to as unsupervised. ↩︎

  2. Particularly its emphasis on techniques to prune massive attribution graphs, improving tooling for making sense of the results, and accepting that some manual adjustment of the decompositions produced by CLTs may be necessary because we're giving up on the idea that CLTs / SAEs are uncovering a "true basis". ↩︎

  3. And it does seem that black box methods often suffice (in the sense of giving "good enough explanations" for whatever we need these explanations for) when we try to do this. Though this could just be - as you say - because of bad judgement. I'd definitely appreciate suggestions for better downstream tasks we should try! ↩︎

I agree with most of this, especially

SAEs [...] remain a pretty neat unsupervised technique for making (partial) sense of activations, but they fit more into the general category of unsupervised learning techniques, e.g. clustering algorithms, than as a method that's going to discover the "true representational directions" used by the language model.

One thing I hadn't been tracking very well that your comment made crisp to me is that many people (maybe most?) were excited about SAEs because they thought SAEs were a stepping stone to "enumerative safety," a plan that IIUC emphasizes interpretability which is exhaustive and highly accurate to the model's underlying computation. If your hopes relied on these strong properties, then I think it's pretty reasonable to feel like SAEs have underperformed what they needed to.

Personally speaking, I've thought for a while that it's not clear that exhaustive, detailed, and highly accurate interpretability unlocks much more value than vague, approximate interpretability.[1] In other words, I think that if interpretability is ever going to be useful, then shitty, vague interpretability should already be useful. Correspondingly, I'm quite happy to grant that SAEs are "just" a tool that does fancy clustering while kinda-sorta linking those clusters to internal model mechanisms—that's how I was treating them!

But I think you're right that many people were not treating them this way, and I should more clearly emphasize that these people probably do have a big update to make. Good point.


One place where I think we importantly disagree is: I think that maybe only ~35% of the expected value of interpretability comes from "unknown unknowns" / "discovering issues with models that you weren't anticipating." (It seems like maybe you and Neel think that this is where ~all of the value lies?)

Rather, I think that most of the value lies in something more like "enabling oversight of cognition, despite not having data that isolates that cognition." In more detail, I think that some settings have structural properties that make it very difficult to use data to isolate undesired aspects of model cognition. A prosaic example is spurious correlations, assuming that there's something structural stopping you from just collecting more data that disambiguates the spurious cue from the intended one. Another example: It might be difficult to disambiguate the "tell the human what they think is the correct answer" mechanism from the "tell the human what I think is the correct answer" mechanism. I write about this sort of problem, and why I think interpretability might be able to address it, here. And AFAICT, I think it really is quite different—and more plausibly interp-advantaged—than "unknown unknowns"-type problems.

To illustrate the difference concretely, consider the Bias in Bios task that we applied SHIFT to in Sparse Feature Circuits. Here, IMO the main impressive thing is not that interpretability is useful for discovering a spurious correlation. (I'm not sure that it is.) Rather, it's that—once the spurious correlation is known—you can use interp to remove it even if you do not have access to labeled data isolating the gender concept.[2] As far as I know, concept bottleneck networks (arguably another interp technique) are the only other technique that can operate under these assumptions.

  1. ^

    Just to establish the historical claim about my beliefs here:

    • Here I described the idea that turned into SHIFT as "us[ing] vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want".
    • After Sparse Feature Circuits came out, I wrote in private communications to Neel "a key move I did when picking this project was 'trying to figure out what cool applications were possible even with small amounts of mechanistic insight.' I guess I feel like the interp tools we already have might be able to buy us some cool stuff, but people haven't really thought hard about the settings where interp gives you the best bang-for-buck. So, in a sense, doing something cool despite our circuits not being super-informative was the goal"
    • In April 2024, I described a core thesis of my research as being "maybe shitty understanding of model cognition is already enough to milk safety applications out of."
  2. ^

    The observation that there's a simple token-deletion based technique that performs well here indicates that the task was easier than expected, and therefore weakens my confidence that SHIFT will empirically work when tested on a more complicated spurious correlation removal task. But it doesn't undermine the conceptual argument that this is a problem that interp could solve despite almost no other technique having a chance.

Rather, I think that most of the value lies in something more like "enabling oversight of cognition, despite not having data that isolates that cognition."

Is this a problem you expect to arise in practice? I don't really expect it to arise, if you're allowing for a significant amount of effort in creating that data (since I assume you'd also be putting a significant amount of effort into interpretability).
