All of Lennart Buerger's Comments + Replies

This is an excellent question! Indeed, we cannot rule out that the learned truth direction is a linear combination or Boolean function of other features, since we are not able to investigate every possible distribution shift. However, we showed in the paper that the direction generalizes robustly under several significant distribution shifts. Specifically, it is learned from a limited training set consisting of simple affirmative and negated statements on a restricted number of topics, all ending with a "." token. Despite this limited training data, it generalizes reasonably well to (i) unseen topic...

Nice work! I was wondering what context length you used when extracting the LLM activations to train the SAE. I could not find it in the paper, but I might also have missed it. I know that OpenAI used a context length of 64 tokens in all their experiments, which is probably not sufficient to elicit many interesting features. Do you use a variable context length, or also a fixed value?