x-posting a kinda rambling thread I wrote about this blog post from Tilde research.
---
If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don't use regexes. A big milestone for the field of interpretability!
I'll discuss some things that surprised me about this case study in the thread below.
---
The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem already tried and discarded baseline approaches like better prompt engineering and asking an auxiliary model to rewrite answers. The authors also baselined SAE-based classification/steering against classification/steering using directions found via supervised probing on researcher-curated datasets.
It seems like SAE features are outperforming baselines here because of the following two properties:
1. It's difficult to get high-quality data that isolates the behavior of interest (i.e., it's difficult to make a good dataset for training a supervised probe for regex detection).
2. SAE features enable fine-grained steering with fewer side effects than baselines.
Property (1) is not surprising in the abstract, and I've often argued that if interpretability is going to be useful, then it will be for tasks where there are structural obstacles to collecting high-quality supervised data (see e.g. the opening paragraph to section 4 of Sparse Feature Circuits https://arxiv.org/abs/2403.19647).
However, I think property (1) is a bit surprising in this particular instance—it seems like getting good data for the regex task is more "tricky and annoying" than "structurally difficult." I'd weakly guess that if you are a whiz at synthetic data generation then you'd be able to get good enough data here to train probes that outperform the SAEs. But that's not too much of a knock against SAEs—it's still cool if they enable an application that would otherwise require synthetic datagen expertise. And overall, it's a cool showcase of the fact that SAEs find meaningful units in an unsupervised way.
Property (2) is pretty surprising to me! Specifically, I'm surprised that SAE feature steering enables finer-grained control than prompt engineering. As others have noted, steering with SAE features often results in unintended side effects; in contrast, since prompts are built out of natural language, I would guess that in most cases we'd be able to construct instructions specific enough to nail down our behavior of interest pretty precisely. But in this case, it seems like the task instructions are so long and complicated that the models have trouble following them all. (And if you try to improve your prompt to fix the regex behavior, the model starts misbehaving in other ways, leading to a "whack-a-mole" problem.) And also in this case, SAE feature steering had fewer side-effects than I expected!
I'm having a hard time drawing a generalizable lesson from property (2) here. My guess is that this particular problem will go away with scale, as larger models are able to more capably follow fine-grained instructions without needing model-internals-based interventions. But maybe there are analogous problems that I shouldn't expect to be solved with scale? E.g. maybe interpretability-assisted control will be useful across scales for resisting jailbreaks (which are, in some sense, an issue with fine-grained instruction-following).
Overall, something surprised me here and I'm excited to figure out what my takeaways should be.
---
Some things that I'd love to see independent validation of:
1. It's not trivial to solve this problem with simple changes to the system prompt. (But I'd be surprised if it were: I've run into similar problems trying to engineer system prompts with many instructions.)
2. It's not trivial to construct a dataset for training probes that outcompete SAE features. (I'm at ~30% that the authors just got unlucky here.)
---
Huge kudos to everyone involved, especially the eagle-eyed @Adam Karvonen for spotting this problem in the wild and correctly anticipating that interpretability could solve it!
---
I'd also be interested in tracking whether Benchify (the company that had the fuzz-tests-without-regexes problem) ends up deploying this system to production (vs. later finding out that the SAE steering is unsuitable for a reason that they haven't yet noticed).
Isn't it easy to detect regexes in model outputs and rejection sample lines that contain regexes? This requires some custom sampling code if you want optimal latency/throughput, but the SAEs also require that.
I'm guessing you'd need to rejection sample entire blocks, not just lines. But yeah, good point, I'm also curious about this. Maybe the proportion of responses that use regexes is too large for rejection sampling to work? @Adam Karvonen
@Adam Karvonen I feel like you guys should test this unless there's a practical reason that it wouldn't work for Benchify (aside from "they don't feel like trying any more stuff because the SAE stuff is already working fine for them").
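For concreteness, here's a minimal sketch of what block-level rejection sampling could look like. Everything here is hypothetical: `generate_candidate` stands in for whatever sampling call Benchify actually uses, and the regex check is a crude heuristic rather than a real detector.

```python
import re

# Crude heuristic for "this code block uses regexes": flag imports of `re`
# or obvious regex-flavored calls. A real deployment would want something
# more robust (e.g. AST inspection).
REGEX_MARKERS = re.compile(r"\bimport re\b|\bre\.(compile|match|search|sub|findall)\(")

def uses_regex(code_block: str) -> bool:
    return bool(REGEX_MARKERS.search(code_block))

def sample_without_regex(generate_candidate, max_tries: int = 5) -> str:
    """Rejection sample whole fuzz-test blocks until one avoids regexes.

    `generate_candidate` is a hypothetical zero-argument function that returns
    one complete generated fuzz test as a string.
    """
    for _ in range(max_tries):
        candidate = generate_candidate()
        if not uses_regex(candidate):
            return candidate
    # Fall back to the last candidate (or raise, or switch strategies) if the
    # model keeps producing regexes; this is where the approach could break
    # down if the regex rate is high.
    return candidate
```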
I’m curious how they set up the SAE stuff; I’d have thought that this would require modifying some performance-critical inference code in a tricky way.
The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they're working with small models and adding a handful of SAE calls to a forward pass shouldn't be too big a hit.)
This uses `transformers`, which is IIUC way less efficient for inference than e.g. `vllm`, to an extent that is probably unacceptable for production use cases.
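Here's a minimal sketch of the kind of hook described above, not their actual code. I'm assuming a ReLU-style SAE, that `enc_dirs`, `enc_bias`, and `dec_dirs` have already been sliced out of a trained SAE for the handful of features you care about, and a Llama-style module layout in the usage comment; details like bias handling differ across SAE implementations.

```python
import torch

def make_steering_hook(enc_dirs, enc_bias, dec_dirs, coeff=-2.0):
    """Suppress (or boost) a handful of SAE features during the forward pass.

    enc_dirs: (k, d_model) encoder rows for the selected features
    enc_bias: (k,) encoder biases for those features
    dec_dirs: (k, d_model) the corresponding decoder directions
    coeff:    scale of the intervention; negative pushes the features down
    """
    def hook(module, inputs, output):
        # Hooked modules in `transformers` often return tuples whose first
        # element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        # Activations of just the selected features: (batch, seq, k).
        # Only computing these k features is what makes this cheaper than
        # running the full autoencoder.
        feats = torch.relu(hidden @ enc_dirs.T + enc_bias)
        # Shift activations along each decoder direction by an amount
        # proportional to how active that feature was.
        steered = hidden + coeff * feats @ dec_dirs
        return ((steered,) + output[1:]) if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage, for a Llama-style model:
# handle = model.model.layers[12].register_forward_hook(make_steering_hook(E, b, D))
```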
Some updates about the `dictionary_learning` repo:
- `ActivationBuffer`s now work natively with model components -- like the residual stream -- whose activations are typically returned as tuples; the buffer knows to take the first component of the tuple (and will iteratively do this if working with nested tuples).
- `ActivationBuffer`s can now be stored on the GPU.
- `evaluation.py` contains code for evaluating trained dictionaries. I've found this pretty useful for quickly evaluating dictionaries people send to me.
- Use `reconstructed_acts, features = dictionary(acts, output_features=True)` to get both the reconstruction and the features computed by `dictionary` (quick sketch below).
- Also, if you'd like to train dictionaries for many model components in parallel, you can use the `parallel` branch. I don't promise to never make breaking changes to the `parallel` branch, sorry.
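For example, here's the rough shape of a quick check. This is a sketch, not literally what `evaluation.py` does; I'm assuming you already have a trained `dictionary` and a tensor `acts` of activations, and the metric choices are just illustrative.

```python
import torch

# acts: (batch, activation_dim) activations collected from some model component
# dictionary: a trained sparse autoencoder
reconstructed_acts, features = dictionary(acts, output_features=True)

# Fraction of activation variance explained by the reconstruction
residual_var = (acts - reconstructed_acts).var(dim=0).sum()
total_var = acts.var(dim=0).sum()
frac_variance_explained = 1 - residual_var / total_var

# Mean number of active features per input (L0 "sparsity")
l0 = (features != 0).float().sum(dim=-1).mean()

print(f"variance explained: {frac_variance_explained:.3f}, mean L0: {l0:.1f}")
```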
Finally, we've released a new set of dictionaries for the MLP outputs, attention outputs, and residual stream in all layers of Pythia-70m-deduped. The MLP and attention dictionaries seem pretty good, and the residual stream dictionaries seem like a mixed bag. Their stats can be found here.
Somewhat related to the SolidGoldMagikarp discussion, I thought some people might appreciate getting a sense of how unintuitive the geometry of token embeddings can be. Namely, it's worth noting that the tokens whose embeddings are most cosine-similar to a random vector in embedding space tend not to look very semantically similar to each other. Some examples:
| $v_1$ | $v_2$ | $v_3$ |
| --- | --- | --- |
| characterized | Columb | determines |
| Stra | 1900 | conserv |
| Ire | sher | distinguishes |
| sent | paed | emphasizes |
| Shelter | 000 | consists |
| Pil | mx | operates |
| stro | female | independent |
| wired | alt | operate |
| Kor | GW | encompasses |
| Maul | lvl | consisted |
Here $v_1, v_2, v_3$ are random vectors in embedding space, and the columns give the 10 tokens whose embeddings are most cosine-similar to the corresponding $v_i$. I used GPT-2-large.
Perhaps 20% of the time, we get something like the $v_3$ column, where many of the nearest neighbors have something semantically similar among them (in this case, being present-tense verbs in the 3rd person singular).
But most of the time, we get things that look like $v_1$ or $v_2$: a hodgepodge with no obvious shared semantic content. GPT-2-large seems to agree: picking " female" and " alt" randomly from the $v_2$ column, the cosine similarity between the embeddings of these tokens is 0.06.
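In case you'd like to reproduce this, here's roughly the experiment as a sketch; I'm drawing the random vector from a standard Gaussian, and only its direction matters for cosine similarity.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2Model.from_pretrained("gpt2-large")

# (vocab_size, d_emb) token embedding matrix
E = model.get_input_embeddings().weight.detach()

def nearest_tokens(v, k=10):
    """Return the k tokens whose embeddings are most cosine-similar to v."""
    sims = torch.nn.functional.cosine_similarity(E, v.unsqueeze(0), dim=-1)
    top = sims.topk(k)
    return [(tokenizer.decode([int(i)]), s.item()) for i, s in zip(top.indices, top.values)]

torch.manual_seed(0)
v = torch.randn(E.shape[1])  # a random direction in embedding space
for tok, sim in nearest_tokens(v):
    print(f"{tok!r}  cosine similarity = {sim:.3f}")
```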
[Epistemic status: I haven't thought that hard about this paragraph.] Thinking about the geometry here, I don't think any of this should be surprising. Given a random vector $v$, we should typically find that $v$ is ~orthogonal to all of the ~50,000 token embeddings. Moreover, asking whether the nearest neighbors to $v$ should be semantically clustered seems to boil down to the following. Divide the tokens into semantic clusters $C_1, \dots, C_k$ with means $\mu_1, \dots, \mu_k$; then compare the distribution of intra-cluster variances $\operatorname{Var}_{x \in C_i}(\cos(x, v))$ to the distribution of cosine similarities $\cos(\mu_i, v)$ of the cluster means. From the perspective of cosine similarity to $v$, we should expect these clusters to look basically randomly drawn from the full dataset of token embeddings, so that each variance in the former set should be approximately the variance of $\cos(x, v)$ over the full dataset. This should be greater than the mean of the latter set, implying that we should expect the nearest neighbors to $v$ to mostly be random tokens taken from different clusters, rather than a bunch of tokens taken from the same cluster. I could be badly wrong about all of this, though.
There's a little bit of code for playing around with this here.