TL;DR

We achieve better SAE performance by:

Removing the lowest activating features
Replacing the L1(feature_activations) penalty function with L1(sqrt(feature_activations))

Here 'better' means: we can reconstruct the original LLM activations with lower MSE & with fewer features/datapoint.
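To make the two penalty functions concrete, here is a minimal NumPy sketch. The function names, the batch-mean reduction, and the toy activations are assumptions for illustration, not the training code used for these results.

```python
import numpy as np

def l1_penalty(feature_acts):
    # Standard sparsity penalty: sum of |activation| per datapoint, averaged over the batch.
    return np.abs(feature_acts).sum(axis=-1).mean()

def l1_sqrt_penalty(feature_acts):
    # L1(sqrt(feature_activations)): take the square root first, then sum.
    # Assumes activations are non-negative (eg post-ReLU); abs() is only a guard.
    return np.sqrt(np.abs(feature_acts)).sum(axis=-1).mean()

# Toy batch: 2 datapoints x 3 features (hypothetical values).
acts = np.array([[4.0, 0.0, 0.25],
                 [1.0, 0.0, 0.0]])

print(l1_penalty(acts))       # (4.25 + 1.0) / 2
print(l1_sqrt_penalty(acts))  # (2.5 + 1.0) / 2
```

Note the relative weighting: an activation of 0.25 costs 0.5 under the sqrt penalty (more than under plain L1), while an activation of 4 costs only 2 (less than under plain L1). In a real training loop, the gradient of sqrt is unbounded near 0, so a small epsilon inside the square root may be needed; that detail is also an assumption here.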

As a sneak peek (the graph should make more sense as we build up to it, don't worry!):

The L1(sqrt()) curve (ie the dotted one) sits closer to the lower-left corner (this is good!): low Cross Entropy loss with few features/datapoint (ie low L0 norm).

Now in more detail:

Sparse Autoencoders (SAEs) reconstruct each datapoint in [layer 3's residual stream activations of Pythia-70M-deduped] using a certain number of features (this is the L0-norm of the hidden activation in the SAE). Typically the highest activations are interpretable & the lowest activations are not.
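As a rough illustration of what "features per datapoint" means, here is a toy SAE forward pass in NumPy. The ReLU encoder, untied linear decoder, random weights, and toy sizes are all assumptions for the sketch; they stand in for a trained SAE on real residual stream activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, batch = 8, 32, 4  # toy sizes, not the real model's

# Randomly initialised weights standing in for a trained SAE.
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))

def sae_forward(x):
    # Encoder: linear map + ReLU gives non-negative feature activations.
    feature_acts = np.maximum(x @ W_enc + b_enc, 0.0)
    # Decoder: reconstruct the input as a weighted sum of feature directions.
    x_hat = feature_acts @ W_dec
    return feature_acts, x_hat

x = rng.normal(size=(batch, d_model))  # stand-in for LLM activations
feature_acts, x_hat = sae_forward(x)

# L0 norm: number of active (non-zero) features for each datapoint.
l0_per_datapoint = (feature_acts > 0).sum(axis=-1)
# Reconstruction error between the original and reconstructed activations.
mse = ((x - x_hat) ** 2).mean()
```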

This is a histogram of the tokens that activated a specific feature. If you run 1M tokens through an LLM, 1000 of those activate this feature. Of those 1000, ~30 activate between 5.53 & 6.15, and all of them are the apostrophe token (aka ').

Here is a feature that activates mostly on apostrophes (removing it also makes the model worse at predicting "s"). The lower activations are conceptually similar, but then we have a huge number of tokens that are something else.

From a datapoint viewpoint, there's a similar story: given a specific datapoint, the top-activating features make a lot of sense, but the lowest ones don't (ie if 20 features activate to reconstruct a specific datapoint, the top ~5 features make a decent amount of sense & the lower 15 make less and less sense).

Are these low-activating features actually important for downstream performance (eg CE)? Or are they modeling noise in the underlying LLM (which is why we see conceptually similar datapoints in lower activation points)?

Ablating Lowest Features

There are a few different ways to remove the "lowest" feature activations.

Dataset View:

Lowest k-features per datapoint
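A minimal sketch of the dataset view, assuming `feature_acts` has shape (datapoints, features) and that "lowest" means the k smallest non-zero activations in each row. The function name is hypothetical.

```python
import numpy as np

def ablate_lowest_k(feature_acts, k):
    # Zero out the k smallest *active* (non-zero) features for each datapoint.
    out = feature_acts.copy()
    for row in out:  # each row is a view into `out`, so edits persist
        active = np.flatnonzero(row > 0)
        if active.size == 0:
            continue
        # Indices of the k weakest active features in this row.
        lowest = active[np.argsort(row[active])[:k]]
        row[lowest] = 0.0
    return out

acts = np.array([[0.0, 3.0, 0.5, 1.0],
                 [2.0, 0.0, 0.0, 0.1]])
ablated = ablate_lowest_k(acts, k=1)
# Row 0 loses its 0.5 activation; row 1 loses its 0.1 activation.
```

The ablated activations can then be passed through the SAE decoder to measure how much reconstruction or downstream loss changes.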

Feature View: Features have different activation values. Some are an order of magnitude (OOM) larger than others on average, so we can set feature-specific thresholds.

Percentage of max activation - remove all feature activations that are < [10%] of max activation for that feature
Quantile - remove all feature activations below the [10th] percentile of that feature's activations
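The two feature-view thresholds can be sketched as follows, again assuming `feature_acts` has shape (datapoints, features). The function names are hypothetical, and computing the quantile over non-zero activations only is an assumption (including the zeros would push the threshold towards 0 for a sparse feature).

```python
import numpy as np

def ablate_below_fraction_of_max(feature_acts, fraction=0.10):
    # Per-feature threshold: a fixed fraction of that feature's max activation.
    max_per_feature = feature_acts.max(axis=0, keepdims=True)
    return np.where(feature_acts < fraction * max_per_feature, 0.0, feature_acts)

def ablate_below_quantile(feature_acts, q=0.10):
    # Per-feature threshold: the q-th quantile of that feature's non-zero activations.
    out = feature_acts.copy()
    for j in range(out.shape[1]):
        col = out[:, j]  # view into `out`
        active = col[col > 0]
        if active.size == 0:
            continue
        threshold = np.quantile(active, q)
        col[(col > 0) & (col < threshold)] = 0.0
    return out

acts = np.array([[10.0, 0.0],
                 [0.5, 4.0],
                 [2.0, 0.2]])
by_max = ablate_below_fraction_of_max(acts, fraction=0.10)
by_quantile = ablate_below_quantile(acts, q=0.10)
```

With the max-based rule, the thresholds here are 1.0 for the first feature and 0.4 for the second, so the 0.5 and 0.2 activations are zeroed while everything else is kept.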

...
