Cool work - figuring out how much of scaling up autoencoders is discovering new features vs splitting existing ones feels quite important. Especially since for any one scale of autoencoder there are simultaneously features which are split too finely and features which are too rare to yet be discovered, it seems quite plausible that the most useful autoencoders will be ones with features stitched together from multiple scales.
Some minor nitpicks: I would recommend always thinking of MSE/L0 in terms of the frontier between the two, rather than either alone; in my experiments I found it very easy to misjudge at a glance whether a run with better MSE but worse L0 was better or worse than the frontier.
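To make the frontier comparison concrete, here is a minimal sketch that keeps only the runs not dominated on both L0 and MSE (lower is better for each). The function name `pareto_frontier` and the numbers are hypothetical, made up for illustration, and not results from any experiment.

```python
def pareto_frontier(runs):
    """Return the (l0, mse) pairs that no other run beats on both axes.

    `runs` is a list of (l0, mse) tuples; lower is better for both.
    """
    frontier = []
    for l0, mse in runs:
        dominated = any(
            o_l0 <= l0 and o_mse <= mse and (o_l0, o_mse) != (l0, mse)
            for o_l0, o_mse in runs
        )
        if not dominated:
            frontier.append((l0, mse))
    return sorted(frontier)


# Hypothetical runs: (L0, MSE). The run (25, 0.35) is worse than (20, 0.30)
# on both axes, so it falls behind the frontier.
runs = [(10, 0.50), (20, 0.30), (25, 0.35), (40, 0.20)]
frontier = pareto_frontier(runs)
```

A run with better MSE but worse L0 than some baseline is only an improvement if it lands on or beyond this frontier.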
What I would be interested to understand about feature splitting is whether the fine-grained features are alternatives, describing an ontology, or are defining a subspace (corners of a simplex, like R, G, and B defining color space). Suppose a feature X in a small SAE is split into three features X1, X2, and X3 in a larger SAE for the same model. If occurrences of X1, X2, and X3 are correlated, so activations containing any of them commonly have some mix of them, then they span a 2d subspace (in this case the simplex is a triangle). If, on the other hand, X1, X2 and X3 co-occur in an activation only rarely (just as two randomly-selected features rarely co-occur), then they describe three similar-but-distinct variations on a concept, and X is the result of coarse-graining these together as a single concept at a higher level in an ontology tree (so by comparing SAEs of different sizes we can generate a natural ontology).
This seems like it would be a fairly simple, objective experiment to carry out. (Maybe someone already has, and can tell me the result!) It is of course quite possible that some split features describe subspaces, and others ontologies, or indeed something between the two where the features co-occur rarely but less rarely than two random features. Or X1 could be distinct but X2 and X3 might blend to span a 1-d subspace. Nevertheless, understanding the relative frequency of these different behaviors would be enlightening.
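One way the co-occurrence experiment could be set up is sketched below. It assumes a matrix `acts` of feature activations (examples by features) has already been collected, and it uses the Jaccard index as the co-occurrence measure; that choice of metric and the name `cooccurrence_rate` are assumptions for illustration, not something established above.

```python
import numpy as np


def cooccurrence_rate(acts, i, j):
    """Jaccard index of the active sets of features i and j.

    acts: (n_examples, n_features) array of feature activations.
    Returns |both active| / |either active|, or 0.0 if neither ever fires.
    """
    a = acts[:, i] > 0
    b = acts[:, j] > 0
    either = np.logical_or(a, b).sum()
    if either == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / either)


# Toy activations: features 0 and 1 always fire together (subspace-like),
# features 2 and 3 never fire together (mutually exclusive alternatives).
acts = np.array([
    [1.0, 0.5, 2.0, 0.0],
    [0.3, 0.2, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
])
blend_rate = cooccurrence_rate(acts, 0, 1)
exclusive_rate = cooccurrence_rate(acts, 2, 3)
```

Comparing this rate among the split features X1, X2, X3 against the rate for randomly chosen feature pairs would indicate which of the two pictures a given split is closer to.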
It would be interesting to validate this using a case like the days of the week, where we believe we already understand the answer: they are 7 alternatives laid out in a heptagon in a 2-dimensional subspace, which enables addition and subtraction modulo 7. So if we have an SAE small enough that it represents all day-of-the-week names by a single feature, then if we increase the SAE size somewhat we'd expect to see this split into three features spanning a 2-d subspace, and if we increased it further we'd expect to see it resolve into 7 mutually-exclusive alternatives, and hopefully then stay at 7 in larger SAEs (at least until other concepts started to get mixed in, if that ever happened).
Work done in Neel Nanda’s stream of MATS 6.0. Equal contribution by Bart Bussmann and Patrick Leask. Patrick Leask is concurrently a PhD candidate at Durham University.
TL;DR: When you scale up an SAE, the features in the larger SAE can be categorized in two groups: 1) “novel features” with new information not in the small SAE and 2) “reconstruction features” that sparsify information that already exists in the small SAE. You can stitch SAEs by adding the novel features to the smaller SAE.
Introduction
Sparse autoencoders (SAEs) have been shown to recover sparse, monosemantic features from language models. However, there has been limited research into how those features vary with dictionary size. That is, when you take the same activation in the same model and train a wider dictionary on it, what changes, and how do the learned features vary?
We show that features in larger SAEs cluster into two kinds: those that capture similar information to the smaller SAE (either identical features, or split features; about 65%), and those which capture novel features absent in the smaller SAE (the remaining 35%). We validate this by showing that inserting the novel features from the larger SAE into the smaller SAE boosts the reconstruction performance, while inserting the similar features makes performance worse.
Building on this insight, we show how features from multiple SAEs of different sizes can be combined to create a "Frankenstein" model that outperforms SAEs with an equal number of features, though it tends to have a higher L0, making a fair comparison difficult. Our work provides new understanding of how SAE dictionary size impacts the learned feature space, and how to reason about whether to train a wider SAE. We hope that this method may also lead to a practically useful way of training high-performance SAEs with less feature splitting and a wider range of learned novel features.
Larger SAEs learn both similar and entirely novel features
Set-up
We use sparse autoencoders as in Towards Monosemanticity and Sparse Autoencoders Find Highly Interpretable Directions. In our setup, the feature activations are computed as:
$$f_i(x) = \mathrm{ReLU}\left(W^{enc}_{i,\cdot} \cdot (x - b^{dec}) + b^{enc}_i\right)$$
Based on these feature activations, the input is then reconstructed as
$$\hat{x} = b^{dec} + \sum_{i=1}^{F} f_i(x)\, W^{dec}_{\cdot,i}$$
The encoder and decoder matrices and biases are trained with a loss function that combines an L2 penalty on the reconstruction loss and an L1 penalty on the feature activations:
$$\mathcal{L} = \mathbb{E}_x\left[\|x - \hat{x}\|_2^2 + \lambda \sum_{i=1}^{F} f_i(x)\right]$$
In our experiments, we train a range of sparse autoencoders (SAEs) with varying widths across residual streams in GPT-2 and Pythia-410m. The width of an SAE is determined by the number of features (F) in the sparse autoencoder. Our smallest SAE on GPT-2 consists of only 768 features, while the largest one has nearly 100,000 features. Here is the full list of SAEs used in this research:
The base language models used are those included in TransformerLens.
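The encoder, decoder, and loss from the set-up can be sketched in a few lines of NumPy. This is a minimal illustration with randomly initialised weights and toy dimensions; the function names, shapes, and the value of `lam` are assumptions for the example, not the training code used for the SAEs in this post.

```python
import numpy as np


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """x: (batch, d), W_enc: (F, d), b_enc: (F,), W_dec: (d, F), b_dec: (d,)."""
    # f_i(x) = ReLU(W_enc[i, :] . (x - b_dec) + b_enc[i])
    f = np.maximum(0.0, (x - b_dec) @ W_enc.T + b_enc)
    # x_hat = b_dec + sum_i f_i(x) * W_dec[:, i]
    x_hat = b_dec + f @ W_dec.T
    return f, x_hat


def sae_loss(x, f, x_hat, lam):
    """Mean over the batch of squared L2 error plus lam times the L1 of f."""
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    sparsity = np.sum(f, axis=-1)  # f >= 0, so the sum equals the L1 norm
    return float(np.mean(recon + lam * sparsity))


# Toy sizes and random weights, purely illustrative.
rng = np.random.default_rng(0)
d, F, batch = 8, 16, 4
x = rng.normal(size=(batch, d))
W_enc = rng.normal(size=(F, d))
b_enc = np.zeros(F)
W_dec = rng.normal(size=(d, F))
b_dec = np.zeros(d)

f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, f, x_hat, lam=0.1)
l0 = float(np.mean(np.sum(f > 0, axis=-1)))  # average number of active features
```

The last line computes L0 as the mean number of active features per example, which is the sparsity measure referred to throughout the rest of the post.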
How similar are features in SAEs of different widths?
When we compare the features in pairs of SAEs of different sizes at the same model site (for example GPT2-768 and GPT2-1536), we refer to the SAE with fewer features as the small SAE and the SAE with more features as the large SAE. Our results relate to these pairs, rather than to a universal concept of small and large SAEs.
Given our wide range of SAEs with different dictionary sizes, we can investigate what features larger SAEs learn compared to smaller SAEs. As the loss function consists of two parts, a reconstruction loss and a sparsity penalty, there are two intuitive explanations for the types of feature large SAEs learn, in comparison to smaller SAEs at the same site:
To evaluate whether features are similar or dissimilar between SAEs trained on the same layer, we use the cosine similarity of the decoder directions for those features. Nanda proposes using the decoder directions to identify features rather than the encoder weights, as encoder weights are optimized to minimize interference with other features as well as just detecting features, whereas decoder weights define the downstream impact of feature activations. The cosine similarity between two features is calculated as
$$S_C(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|}$$
Towards Monosemanticity uses a masked activation cosine similarity metric, essentially looking at how much the features co-occurred (i.e. activated on the same data points) to identify similar features. However, we empirically find a high correlation between this metric and the decoder cosine similarity metric for similar features, and decoder cosine similarity is considerably computationally cheaper.
To find the features in GPT2-768 that are most similar to each feature in GPT2-1536, we iterate over each feature in GPT2-1536, take its cosine similarity with each feature in GPT2-768, and keep the maximum. (Note that this metric is not symmetric: if we do it for GPT2-768 against GPT2-1536 we get 768 numbers, not 1536.)
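The maximum decoder cosine similarity described above can be computed in one matrix product. The sketch below assumes decoder matrices are stored with one feature direction per column; the function name and the tiny hand-built matrices are illustrative only.

```python
import numpy as np


def max_cosine_similarity(W_dec_large, W_dec_small):
    """For each large-SAE feature, the best cosine match in the small SAE.

    W_dec_large: (d, F_large), W_dec_small: (d, F_small); columns are
    decoder directions. Returns (max_sim, argmax), each of shape (F_large,).
    """
    large = W_dec_large / np.linalg.norm(W_dec_large, axis=0, keepdims=True)
    small = W_dec_small / np.linalg.norm(W_dec_small, axis=0, keepdims=True)
    sims = large.T @ small  # (F_large, F_small) pairwise cosine similarities
    return sims.max(axis=1), sims.argmax(axis=1)


# Toy example in 3 dimensions: the small SAE has directions e1 and e2.
W_small = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 0.0]])
# The large SAE has e1 (a copy), e1 + e2 (in between), and e3 (unmatched).
W_large = np.array([[1.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

max_sim, best_match = max_cosine_similarity(W_large, W_small)
reverse_sim, _ = max_cosine_similarity(W_small, W_large)
```

Swapping the arguments gives one number per small-SAE feature instead, which is the asymmetry noted in the parenthetical.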
On the right-hand-side there is a cluster of GPT2-1536 features with high cosine similarity to at least one of the GPT2-768 features. For example, GPT2-1536:1290 (left) and GPT2-768:0 (right) have a cosine similarity of 0.99, and both activate strongly on the token “with” and boost similar logits, with some overlap in their max activating dataset.
However, the GPT2-1536 feature GPT2-1536:903 that activates on “make sure” has no counterpart in the 768 feature SAE. The three closest features in the 768-feature SAE are:
If we compare the reconstruction of the two SAEs on dataset examples where GPT2-1536:903 is active vs inactive, we find the difference in MSE is significantly larger in the small SAE, confirming the novel information added by this feature is not present in GPT2-768.
Averaging this metric across all 657 features in GPT2-1536 that have low maximum cosine similarity (<= 0.7) with all features in GPT2-768, we see a similar pattern.
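The active-versus-inactive comparison can be sketched as follows, assuming per-example MSE has already been computed for each SAE and that the feature fires on some examples but not all. The helper name and the toy numbers are hypothetical and are not measurements from GPT2-768 or GPT2-1536.

```python
import numpy as np


def active_inactive_mse_gap(per_example_mse, feature_acts):
    """Mean MSE where the feature is active minus mean MSE where it is not.

    Assumes at least one active and one inactive example.
    """
    per_example_mse = np.asarray(per_example_mse, dtype=float)
    active = np.asarray(feature_acts) > 0
    return float(per_example_mse[active].mean() - per_example_mse[~active].mean())


# Hypothetical data: the feature fires on examples 1 and 3.
feature_acts = np.array([0.0, 2.0, 0.0, 1.5])
mse_small = np.array([1.0, 3.0, 1.0, 3.0])  # small SAE struggles when it fires
mse_large = np.array([1.0, 1.2, 1.0, 1.2])  # large SAE barely notices

gap_small = active_inactive_mse_gap(mse_small, feature_acts)
gap_large = active_inactive_mse_gap(mse_large, feature_acts)
```

A much larger gap for the small SAE than for the large one is the signature of information the small SAE is missing.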
Can we add features from one SAE to another?
We evaluate whether it is possible to add features from one SAE into another SAE without decreasing, and ideally improving, reconstruction performance. Given two base SAEs:
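$$\mathrm{SAE}_1(x) = b^{dec_1} + \sum_{i=1}^{F_1} f_{1,i}(x)\, W^{dec_1}_{\cdot,i} \quad \text{and} \quad \mathrm{SAE}_2(x) = b^{dec_2} + \sum_{i=1}^{F_2} f_{2,i}(x)\, W^{dec_2}_{\cdot,i}$$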
We can construct a hybrid SAE by adding a feature from one to the other. For example we can add feature 3 from SAE1 to SAE2:
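$$\mathrm{SAE}^{\star}_2(x) = f_{1,3}(x)\, W^{dec_1}_{\cdot,3} + \mathrm{SAE}_2(x) = f_{1,3}(x)\, W^{dec_1}_{\cdot,3} + b^{dec_2} + \sum_{i=1}^{F_2} f_{2,i}(x)\, W^{dec_2}_{\cdot,i}$$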
To test if the novel features from larger SAEs can improve smaller SAEs, we add each feature from GPT2-1536 into GPT2-768 one at a time and measure the change in MSE. We find a clear relationship between a feature's maximum cosine similarity to GPT2-768 and its impact on MSE. Features with a smaller maximum cosine similarity almost universally improve performance, while adding more similar features tends to hurt performance.
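The one-feature-at-a-time test can be illustrated with a toy pair of SAEs. In this sketch the `SAE` container, the function names, and the hand-built 2-dimensional weights are all assumptions chosen so the effect is easy to see; it is not the evaluation code behind the results above.

```python
from typing import NamedTuple

import numpy as np


class SAE(NamedTuple):
    W_enc: np.ndarray  # (F, d)
    b_enc: np.ndarray  # (F,)
    W_dec: np.ndarray  # (d, F)
    b_dec: np.ndarray  # (d,)


def activations(sae, x):
    return np.maximum(0.0, (x - sae.b_dec) @ sae.W_enc.T + sae.b_enc)


def reconstruct(sae, x):
    return sae.b_dec + activations(sae, x) @ sae.W_dec.T


def mse(x, x_hat):
    return float(np.mean(np.sum((x - x_hat) ** 2, axis=-1)))


def mse_change_from_adding(x, target, donor, j):
    """Change in MSE when donor feature j is added on top of the target SAE."""
    base = reconstruct(target, x)
    f_j = activations(donor, x)[:, j]                 # (batch,)
    hybrid = base + np.outer(f_j, donor.W_dec[:, j])  # add f_j(x) * W_dec[:, j]
    return mse(x, hybrid) - mse(x, base)


# Toy setting: the small SAE only represents the first coordinate, while the
# large SAE has a copy of that feature (index 0) and a new one (index 1).
small = SAE(np.array([[1.0, 0.0]]), np.zeros(1), np.array([[1.0], [0.0]]), np.zeros(2))
large = SAE(np.eye(2), np.zeros(2), np.eye(2), np.zeros(2))
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

delta_duplicate = mse_change_from_adding(x, small, large, 0)
delta_novel = mse_change_from_adding(x, small, large, 1)
```

In this toy case the duplicated feature double-counts the first coordinate and raises the MSE, while the new feature fills in the missing coordinate and lowers it.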
Based on the results we divide the features of a larger SAE into two groups, using a maximum cosine similarity threshold of 0.7. This threshold is chosen somewhat arbitrarily, but seems to be around the point where the majority of features change from decreasing MSE to increasing MSE. Furthermore, it is close to $1/\sqrt{2} \approx 0.707$, the cosine of a 45° angle, above which two vectors are more aligned than orthogonal.
Based on this threshold we divide the features into two categories:
The reason that many of the reconstruction features with high cosine similarity (> 0.7) increase MSE is that their information is already well-represented in the smaller SAE. Adding them causes the model to overpredict these features, impairing reconstruction performance. In contrast, novel features with lower cosine similarity (<= 0.7) mostly provide new information that was not captured by the smaller SAE, leading to improved reconstruction when added.
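Given the maximum cosine similarity of each large-SAE feature, the split into the two categories is a single comparison against the threshold. The function name and example values below are illustrative assumptions.

```python
import numpy as np


def split_features(max_cos_sim, threshold=0.7):
    """Return (novel, reconstruction) index arrays for the large SAE.

    novel: maximum cosine similarity <= threshold.
    reconstruction: maximum cosine similarity > threshold.
    """
    max_cos_sim = np.asarray(max_cos_sim, dtype=float)
    novel = np.flatnonzero(max_cos_sim <= threshold)
    reconstruction = np.flatnonzero(max_cos_sim > threshold)
    return novel, reconstruction


# Made-up similarities for four large-SAE features.
novel, reconstruction = split_features([0.99, 0.70, 0.35, 0.71])
```

A feature sitting exactly at 0.7 is treated as novel here, matching the `<= 0.7` convention used in the text.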
The figures above further support this categorization into two groups based on the maximum cosine similarity. We see a clear difference in both the individual effect of adding in a reconstruction feature vs a novel feature, and in the cumulative effect of adding in all novel or all reconstruction features. We can decrease the MSE of GPT2-768 by almost 10% just by adding in the 657 novel features from GPT2-1536.
Can we swap features between SAEs?
In the previous section, we saw that adding novel features from larger SAEs to smaller SAEs can improve the performance of the smaller SAEs. However, we can't simply add in all features from the larger SAE, as some of them represent information that is already captured by the smaller SAE (reconstruction features). Instead, we can swap these similar features between the SAEs.
To identify which features can be swapped with which other features, we apply the same threshold to the decoder cosine similarity metric as before. If the cosine similarity between a large SAE feature and a small SAE feature is greater than 0.7, we consider the large SAE feature to be a child of the small SAE feature. This allows us to construct a graph of relationships between features in the small and large SAEs, where connected subgraphs represent potential swaps. These structures are very similar to the feature splitting phenomenon as shown in Towards Monosemanticity.
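The parent-child graph and its connected subgraphs can be sketched with a small union-find. The function below takes a precomputed matrix of decoder cosine similarities (large features by small features); its name, the return format, and the example matrix are assumptions for illustration.

```python
import numpy as np


def swap_groups(sims, threshold=0.7):
    """Group features connected by cosine similarity above the threshold.

    sims: (F_large, F_small). Returns a sorted list of (parents, children),
    where parents index the small SAE and children index the large SAE.
    """
    n_large, n_small = sims.shape
    # Nodes 0..n_small-1 are small features; the rest are large features.
    root = list(range(n_small + n_large))

    def find(a):
        while root[a] != a:
            root[a] = root[root[a]]  # path halving
            a = root[a]
        return a

    edges = [(int(l), int(s)) for l, s in np.argwhere(sims > threshold)]
    for l, s in edges:
        root[find(s)] = find(n_small + l)

    groups = {}
    nodes = {s for _, s in edges} | {n_small + l for l, _ in edges}
    for node in sorted(nodes):
        parents, children = groups.setdefault(find(node), ([], []))
        if node < n_small:
            parents.append(node)
        else:
            children.append(node - n_small)
    return sorted(groups.values())


# Four large features, two small features. Large feature 3 has no parent.
sims = np.array([[0.95, 0.10],
                 [0.80, 0.20],
                 [0.10, 0.90],
                 [0.30, 0.40]])
groups = swap_groups(sims)
```

Each group is a candidate swap: the listed parents in the small SAE can be replaced by the listed children from the large SAE. Features that appear in no group are the novel ones.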
Based on these proposed swaps, we can replace the parents in the small SAE with their children (reconstruction features) in the larger SAE to get a sparser representation without impacting the MSE too much. Most of the swaps individually result in an increase in MSE, but also in a decrease in L0.
Now if we combine these two methods and first add in all the novel features and then swap the reconstruction features in order of cosine similarity, we can smoothly interpolate between two SAE sizes. In this way, we have many possibilities to select a model with different trade-offs between the number of features, reconstruction performance, and L0.
Frankenstein’s SAE
In the previous section we saw that adding in parentless SAE features from large SAEs improves the performance of smaller SAEs. We now investigate whether this insight can be used to create smaller SAEs with a lower loss than the original SAEs. The idea here is to iteratively take the features from SAEs that add new information and to create a new SAE by stitching all these features together: Frankenstein’s SAE.
We construct Frankenstein’s SAEs in the following way:
This allows us to construct sparse autoencoders with lower MSE reconstruction loss than the original SAEs of the same size. Frankenstein's SAE with 40214 features already has better reconstruction performance than the original SAE with 98304 features!
This is partly explained by the fact that the Frankenstein SAEs have a much higher L0, as we keep adding in all the novel features without swapping in the sparsifying reconstruction features. For example, the Frankenstein SAE with 1425 features has an L0 of 53.1 (vs an L0 of 35.0 for the normal SAE of size 1536) and the Frankenstein SAE with 10485 features has an L0 of 109.7 (vs 44.1 for the SAE with dictionary size 12288).
If we simply trained an autoencoder with a lower sparsity penalty, this would also result in a model with higher L0 and lower MSE. However, we argue that the features in the Frankenstein model may be more interpretable than those learned by a regular sparse autoencoder with high L0. Firstly, the encoder directions are unchanged from those of the features in the original SAEs, and thus activate on exactly the same examples as in the low-L0 SAEs. Secondly, the fine-tuned decoder directions have high cosine similarity to features in the base SAEs.
Even the features with low cosine similarity have clear and similar interpretations. For example, feature 867 (left) in the Frankenstein SAE has 0.73 cosine similarity with its GPT2-1536:183 origin (right), yet it boosts largely the same logits.
Similarly, Frankenstein feature 1128 (left) and GPT2-1536:662 (right) have a cosine similarity of 0.74.
Although we have to train a number of SAEs to achieve the same performance as a larger SAE, the total number of features trained is roughly the same. For example, constructing our SAE with 40124 features required training SAEs with a total of
$$\sum_{i=0}^{6} 768 \cdot 2^i = 97536$$
features, slightly fewer than in the largest SAE (98304 features).
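The total can be checked directly. The snippet assumes, as the sum implies, that the seven SAEs have widths that double from 768 upward.

```python
# Widths 768 * 2**i for i = 0..6, as implied by the sum above.
sizes = [768 * 2**i for i in range(7)]
total_features = sum(sizes)   # geometric series: 768 * (2**7 - 1)
largest_sae = 768 * 2**7      # 98304, the width of the largest SAE
shortfall = largest_sae - total_features
```

The shortfall is exactly one base width (768 features), since a doubling series always sums to one term short of the next power.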
Discussion and Limitations
In this brief investigation, we found that when you scale up SAEs, they effectively learn two groups of features. One group consists of entirely novel features, which boost the performance of smaller SAEs when added to them. The second group consists of features that are reconstructions of features present in the smaller SAE, which we can swap into the smaller SAE to decrease the L0.
We used these insights to construct "Frankenstein" SAEs, which stitch together novel features from multiple pre-trained SAEs of different sizes. By iteratively adding in novel features and retraining the decoder, we were able to construct SAEs with lower reconstruction loss than original SAEs with the same number of features (at the cost of a higher L0).
There are several limitations to this work:
Overall, we believe this work provides valuable insights into what happens when you scale up SAEs and introduces a simple approach to stitch features from one SAE to another.
Without retraining the decoder weights, the performance of Frankenstein’s SAEs starts to degrade when adding in novel features from too many different sizes of SAEs. As the novel features still share some common directions (up to cosine similarity 0.7), adding in too many of these features still leads to the over-prediction of some feature directions.