I'm very excited about approaches to add hierarchy to SAEs - seems like an important step forward. In general, approaches that constrain latents in various ways that let us have higher L0 without reconstruction becoming trivial seem exciting.
I think it would be cool to get follow-up work on bigger LMs. It should also be possible to do matryoshka with block size = 1 efficiently with some kernel tricks, which would be cool.
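For concreteness, here's a minimal sketch of the naive matryoshka reconstruction loss (shapes and names are illustrative, not from any particular codebase). Block size = 1 corresponds to prefix_sizes covering every length 1..n_latents, which is exactly the case where the naive loop becomes expensive and you'd want kernel tricks:

```python
import torch

def matryoshka_recon_loss(acts, W_dec, b_dec, x, prefix_sizes):
    # acts: (batch, n_latents) latent activations; W_dec: (n_latents, d_model)
    # naive matryoshka loss: reconstruct x from each nested prefix of latents
    # and average the MSEs over all prefixes
    losses = []
    for m in prefix_sizes:
        recon = acts[:, :m] @ W_dec[:m] + b_dec
        losses.append((recon - x).pow(2).sum(dim=-1).mean())
    return torch.stack(losses).mean()
```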
This is likely not the first instance, but OpenAI was already using the word "aligned" in this way in 2021 in the Codex paper.
https://arxiv.org/abs/2107.03374 (section 7.2)
Very exciting that JumpReLU works well with STE gradient estimation! I think this fixes one of the biggest flaws with TopK, which is that having a fixed number of latents k on each token is kind of wonky. I also like the argument in section 4 a lot - in particular the point about how this works because we're optimizing the expectation of the loss. Because of how sparse the features are, I wonder if it would reduce gradient noise substantially to use a KDE with state persisting across a few recent steps.
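To illustrate the kind of thing I mean, here's a minimal straight-through sketch of a JumpReLU whose threshold gets a kernel-based pseudo-gradient. This is not the paper's exact pseudo-derivative scheme - the kernel here is a simple rectangle and the names are illustrative - just the general shape of the trick:

```python
import torch

def jumprelu_ste(z, theta, eps=1e-3):
    # JumpReLU forward: z * H(z - theta). The hard gate has zero gradient wrt
    # theta, so we route a rectangle-kernel pseudo-gradient to theta through a
    # clamped soft gate; the straight-through trick keeps the forward value hard.
    hard = (z > theta).to(z.dtype)
    soft = torch.clamp((z.detach() - theta) / eps + 0.5, 0.0, 1.0)
    gate = hard + soft - soft.detach()  # forward == hard, backward == d(soft)/d(theta)
    return z * gate
```

The pseudo-gradient to theta is nonzero only when z lands within eps/2 of the threshold, which is why the argument about optimizing the expectation of the loss (and possibly smoothing the estimate with a KDE over recent steps) matters so much.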
Some takes on some of these research questions:
Looking for opposing feature directions in SAEs
I checked a top-k SAE with 256k features and k=256 trained on GPT-4 and found only 286 features that had any other feature with cosine similarity < -0.9, and 1314 with cosine sim < -0.7.
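For anyone who wants to run the same check on their own SAE, the computation is roughly the following (hypothetical helper, with a chunked matmul so the full 256k x 256k similarity matrix never has to materialize at once):

```python
import torch
import torch.nn.functional as F

def count_features_with_opposite(W_dec, threshold=-0.9, chunk=4096):
    # W_dec: (n_latents, d_model) decoder directions
    dirs = F.normalize(W_dec, dim=-1)
    has_opposite = torch.zeros(dirs.shape[0], dtype=torch.bool, device=dirs.device)
    for start in range(0, dirs.shape[0], chunk):
        sims = dirs[start : start + chunk] @ dirs.T  # cosine sims for this chunk
        has_opposite[start : start + chunk] = (sims < threshold).any(dim=-1)
    return int(has_opposite.sum())
```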
SAE/Transcoder activation shuffling
I'm confident that when learning rate and batch size are tuned properly, not shuffling eventually converges to the same thing as shuffling. The right way to frame this imo is the efficiency loss from not shuffling, which, from preliminary experiments and intuition, I'd guess is probably substantial.
How much does initializing the encoder to be the transpose of the decoder (as done here and here) help for SAEs and transcoders?
It helps tremendously for SAEs by very substantially reducing dead latents; see appendix C.1 in our paper.
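In code, the initialization is just something like this (variable names and sizes are illustrative; the tie is only at init, the weights train independently afterwards):

```python
import torch
import torch.nn as nn

d_model, n_latents = 768, 32_768  # illustrative sizes
decoder = nn.Linear(n_latents, d_model)
encoder = nn.Linear(d_model, n_latents)
with torch.no_grad():
    # W_enc = W_dec^T at initialization only
    encoder.weight.copy_(decoder.weight.T)
```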
Cool work - figuring out how much of scaling up autoencoders is discovering new features vs splitting existing ones feels quite important. Especially since for any one scale of autoencoder there are simultaneously features which are split too finely and features which are too rare to yet be discovered, it seems quite plausible that the most useful autoencoders will be ones with features stitched together from multiple scales.
One minor nitpick: I would recommend always thinking of MSE/L0 in terms of the frontier between the two, rather than either alone; in my experiments I found it very easy to misjudge at a glance whether a run with better MSE but worse L0 actually landed above or below the frontier.
VAEs:
a normal autoencoder decodes single latents z to single images (or whatever other kind of data) x, and also encodes single images x to single latents z.
with VAEs, we want our decoder (p(x|z)) to take single latents z and output a distribution over x's. for simplicity we generally declare that this distribution is a gaussian with identity covariance, and we have our decoder output a single x value that is the mean of the gaussian.
because each x can be produced by multiple z's, to run this backwards you also need a distribution of z's for each single x. we call the ideal encoder p(z|x) - the thing that would perfectly invert our decoder p(x|z). unfortunately, we obviously don't have access to this thing. so we have to train an encoder network q(z|x) to approximate it. to make our encoder output a distribution, we have it output a mean vector and a stddev vector for a gaussian. at runtime we sample a random vector eps ~ N(0, 1), multiply it elementwise by the stddev vector, and add the mean vector, giving a sample from N(mu, std).
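in code, the sampling step looks roughly like this (a sketch, assuming the encoder outputs a mean vector and a log-stddev vector):

```python
import torch

def sample_latent(mu, log_std):
    # reparameterization trick: z = mu + std * eps with eps ~ N(0, I), so the
    # sample is distributed as N(mu, std) but gradients still flow to mu and std
    eps = torch.randn_like(mu)
    return mu + log_std.exp() * eps
```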
to train this thing, we would like to optimize the following loss function:
-log p(x) + KL(q(z|x)||p(z|x))
where the terms optimize the likelihood (how good is the VAE at modelling data, assuming we have access to the perfect z distribution) and the quality of our encoder (how good is our q(z|x) at approximating p(z|x)). unfortunately, neither term is tractable - the former requires marginalizing over z, which is intractable, and the latter requires p(z|x) which we also don't have access to. however, it turns out that the following is mathematically equivalent and is tractable:
-E z~q(z|x) [log p(x|z)] + KL(q(z|x)||p(z))
the former term is just the likelihood of the real data under the decoder distribution given z drawn from the encoder distribution (which happens to be equivalent to the MSE up to constants, because it's the log of a gaussian pdf). the latter term can be computed analytically, because both distributions are gaussians with known mean and std. (the distribution p is determined in part by the decoder p(x|z), but that doesn't pin down the entire distribution; we still have a degree of freedom in how we pick p(z). so we typically declare by fiat that p(z) is a N(0, 1) gaussian. then, p(z|x) is implied to be equal to p(x|z) p(z) / ∫ p(x|z') p(z') dz')
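putting that together, the training loss is roughly the following (a sketch under the conventions above: unit-variance gaussian decoder, p(z) = N(0, 1), encoder outputting mu and log-stddev):

```python
import torch

def vae_loss(x, x_recon, mu, log_std):
    # reconstruction term: -log p(x|z) for a unit-variance gaussian is just
    # squared error up to an additive constant
    recon = 0.5 * (x_recon - x).pow(2).sum(dim=-1)
    # KL(q(z|x) || p(z)) between N(mu, std^2) and N(0, I), computed analytically
    var = (2 * log_std).exp()
    kl = 0.5 * (mu.pow(2) + var - 1.0 - 2 * log_std).sum(dim=-1)
    return (recon + kl).mean()
```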
We had done very extensive ablations at small scale where we found TopK to be consistently better than all of the alternatives we iterated through, and by the time we launched the big run we had already worked out how to scale all of the relevant hyperparameters, so we were decently confident.
One reason we might want a progressive code is that it would basically let you train one autoencoder and use it for any k you wanted at test time (which is nice because we don't really know exactly how to set k for maximum interpretability yet). Unfortunately, this is somewhat worse than training for the specific k you want to use, so our recommendation for now is to train multiple autoencoders.
Also, even with a progressive code, the activations on the margin would not generally be near zero (we actually apply a ReLU to make sure that the activations are definitely non-negative, but almost always the (k+1)th value is still substantially positive)
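(The test-time flexibility in question is just something like the following sketch, with hypothetical names - applying a different k to the same trained SAE by re-doing the top-k selection on the pre-activations:)

```python
import torch

def decode_at_k(pre_acts, W_dec, b_dec, k):
    # evaluate a trained TopK SAE at a different k at test time: keep the k
    # largest (post-ReLU) activations, zero out the rest, then decode
    acts = torch.relu(pre_acts)
    vals, idx = acts.topk(k, dim=-1)
    sparse = torch.zeros_like(acts).scatter(-1, idx, vals)
    return sparse @ W_dec + b_dec
```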
in retrospect, we know from chinchilla that gpt3 allocated its compute too much to parameters as opposed to training tokens. so it's not surprising that models since then are smaller. model size is a less fundamental measure of model cost than pretraining compute. from here on i'm going to assume that whenever you say size you meant to say compute.
obviously it is possible to train better models using the same amount of compute. one way to see this is that it is definitely possible to train worse models with the same compute, and it is implausible that the current model production methodology is the optimal one.
it is unknown how much compute the latest models were trained with, and therefore what compute efficiency win they obtain over gpt4. it is unknown how much more effective compute gpt4 used than gpt3. we can't really make strong assumptions using public information about what kinds of compute efficiency improvements have been discovered by various labs at different points in time. therefore, we can't really make any strong conclusions about whether the current models are not that much better than gpt4 because of (a) a shortage of compute, (b) a shortage of compute efficiency improvements, or (c) a diminishing return of capability wrt effective compute.