Thanks Logan!
2. Unlike local SAEs, our e2e SAEs aren't trained on reconstructing the current layer's activations. So my expectation, at least, was that they would have a worse reconstruction error at the current layer.
Improving training times wasn't our focus for this paper, but I agree it would be interesting and expect there to be big gains to be made by doing things like mixing training between local and e2e+downstream and/or training multiple SAEs at once (depending on how you do this, you may need to be more careful about taking different pathways of com...
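To make the local vs. e2e distinction concrete, here is a minimal toy sketch in NumPy. It assumes the local objective is MSE reconstruction of the current layer's activations plus an L1 sparsity penalty, and that the e2e objective replaces the reconstruction term with a KL divergence between the model's output distribution with and without the SAE spliced in. All names, shapes, and the single linear map standing in for "the rest of the model" are hypothetical and are not taken from the paper's codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict, d_vocab, batch = 8, 16, 5, 4

acts = rng.normal(size=(batch, d_model))          # activations at the SAE's layer
W_enc = rng.normal(size=(d_model, d_dict)) * 0.1  # toy SAE encoder (hypothetical)
W_dec = rng.normal(size=(d_dict, d_model)) * 0.1  # toy SAE decoder (hypothetical)
W_out = rng.normal(size=(d_model, d_vocab))       # stand-in for all later layers

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

feats = np.maximum(acts @ W_enc, 0.0)         # ReLU dictionary activations
recon = feats @ W_dec                         # SAE output at the current layer
sparsity = np.abs(feats).sum(axis=-1).mean()  # L1 penalty on dictionary activations

# Local objective: reconstruct the current layer's activations.
local_recon = ((acts - recon) ** 2).sum(axis=-1).mean()

# e2e objective (assumed form): match the output distribution instead.
logp_orig = log_softmax(acts @ W_out)
logp_sae = log_softmax(recon @ W_out)
kl = (np.exp(logp_orig) * (logp_orig - logp_sae)).sum(axis=-1).mean()

lam = 1e-3  # sparsity coefficient, arbitrary for illustration
local_loss = local_recon + lam * sparsity
e2e_loss = kl + lam * sparsity

print(f"local loss: {local_loss:.4f}  e2e loss: {e2e_loss:.4f}")
```

Under these assumptions, nothing in `e2e_loss` penalizes `local_recon` directly, which is why a worse reconstruction error at the current layer would be expected.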
Nice post.
Pushing back a little on this part of the appendix:
Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding--the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe.
I'm a bit concerned about people assuming this is true for models going forward. A sufficient...
Nice project and writeup. I particularly liked the walkthrough of thought processes throughout the project.
Decision square's Euclidean distance to the top-right corner, positive ().
We are confused and don't fully understand which logical interactions produce this positive regression coefficient.
I'd be wary about interpreting the regression coefficients of features that are correlated (see Multicollinearity). Even the sign may be misleading.
It might be worth making a cross-correlation plot of the features. This won't give you a new coeff...
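As an illustration of the multicollinearity concern, here is a small sketch on synthetic data (not the project's actual features). Two hypothetical features, `x1` and `x2`, are strongly correlated; the cross-correlation matrix makes that visible, and the coefficient on `x2` changes sign depending on whether `x1` is also in the regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Two synthetic, strongly correlated features (hypothetical stand-ins).
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)

# The target depends positively on x1 and negatively on x2.
y = 2.0 * x1 - 1.0 * x2 + 0.1 * rng.normal(size=n)

def ols(features, target):
    """Least-squares fit with an intercept; returns the slope coefficients only."""
    X = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef[1:]

corr = np.corrcoef(np.vstack([x1, x2]))  # 2x2 cross-correlation matrix
joint = ols([x1, x2], y)                 # both features in the model
alone = ols([x2], y)                     # x2 as the only feature

print("correlation matrix:\n", corr.round(3))
print("joint coefficients (x1, x2):", joint.round(3))
print("x2-only coefficient:", alone.round(3))
```

In this construction `x2` is positively associated with `y` on its own but gets a negative coefficient once `x1` is included, so a single coefficient's sign says little without knowing what else is in the model and how the features correlate.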
Every SAE in the paper is hosted on wandb, but only some are hosted on huggingface, so I suggest loading them from wandb for now. We’ll upload more to huggingface if several people prefer that. Info for downloading from wandb can be found in the repo; the easiest way is probably:
...