All of jacob_drori's Comments + Replies

Nice work! I'm not sure I fully understand what the "gated-ness" is adding, i.e. what the role the Heaviside step function is playing. What would happen if we did away with it? Namely, consider this setup:

Let  and   be the encoder and decoder functions, as in your paper, and let  be the model activation that is fed into the SAE.

The usual SAE reconstruction is , which suffers from the shrinkage problem.

Now, introduce a new learned parameter , and define an "expanded" reconstruction ... (read more)

2Rohin Shah
This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs. The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to ReLU(πgate(x)), so you might think of πgate as "tainted", and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on fmag for everything else.

The typical noise on feature  caused by 1 unit of activation from feature , for any pair of features , is (derived from Johnson–Lindenstrauss lemma)

   [1]

  

1. ... This is a worst case scenario. I have not calculated the typical case, but I expect it to be somewhat less, but still same order of magnitude

Perhaps I'm misunderstanding your claim here, but the "typical" (i.e. RMS) inner product between two independently random unit vectors in  is . So I think the&nb... (read more)

2Linda Linsefors
Good point. I need to think about this a bit more. Thanks Just quickly writing up my though for now... What I think is going on here is that Johnson–Lindenstrauss lemma gives a bound on how well you can do, so it's more like a worst case scenario. I.e. Johnson–Lindenstrauss lemma  gives you the worst case error for the best possible feature embedding. I've assumed that the typical noise would be same order of magnitude as the worst case, but now I think I was wrong about this for large m.  I'll have to think about what is more important of worst case and typical case. When adding up noise one should probably use worst typical case. But when calculating how many features to fit in, one should probably use worst case.