All of Lennart Buerger's Comments + Replies

This is an excellent question! Indeed, we cannot rule out that the learned truth direction is a linear combination or Boolean function of other features, since we are not able to investigate every possible distribution shift. However, we showed in the paper that the direction generalizes robustly under several significant distribution shifts. Specifically, it is learned from a limited training set consisting of simple affirmative and negated statements on a restricted number of topics, all ending with a "." token. Despite this limited training data, it generalizes reasonably well to (i) unseen topic...

Nice work! I was wondering what context length you used when extracting the LLM activations to train the SAE. I could not find it in the paper, but I might also have missed it. I know that OpenAI used a context length of 64 tokens in all their experiments, which is probably not sufficient to elicit many interesting features. Do you use a variable context length, or also a fixed value?