Attention Output SAEs

We train sparse autoencoders (SAEs) on the output of attention layers, extending prior work that trained them on MLP layers and the residual stream.
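To make the setup concrete, here is a minimal sketch of the standard sparse autoencoder architecture this line of work builds on, applied to an attention layer's output: an overcomplete ReLU dictionary trained to reconstruct activations under an L1 sparsity penalty. The class name, dimensions, initialization, and L1 coefficient below are illustrative assumptions, not our exact training configuration.

```python
import torch
import torch.nn as nn

class AttnOutputSAE(nn.Module):
    """Minimal sparse autoencoder over an attention layer's output.

    d_in is the attention output dimension (e.g. n_heads * d_head);
    d_sae is the (typically much larger) dictionary size.
    """
    def __init__(self, d_in: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_in) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def encode(self, x):
        # Sparse, non-negative feature activations.
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        # Reconstruct the activation as a sparse sum of dictionary directions.
        return f @ self.W_dec + self.b_dec

    def forward(self, x, l1_coeff: float = 1e-3):  # l1_coeff is illustrative
        f = self.encode(x)
        x_hat = self.decode(f)
        mse = (x_hat - x).pow(2).mean()
        l1 = f.abs().sum(dim=-1).mean()  # sparsity penalty on activations
        return x_hat, mse + l1_coeff * l1
```

Training then amounts to collecting a large batch of attention-output activations from the model and minimizing this reconstruction-plus-sparsity loss over them.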

We perform a qualitative study of the features computed by attention layers and find multiple interpretable feature families, including long-range context, short-range context, and induction features.

More importantly, we show that SAEs are a useful tool that enables researchers to explain model behavior in greater detail than prior work. We use our SAEs to analyze the computation performed by the Indirect Object Identification (IOI) circuit, validating that the SAEs find causally meaningful intermediate variables and deepening our understanding of the semantics of the circuit.
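As an illustration of the kind of causal experiment this enables, the sketch below zero-ablates a single SAE feature on the canonical IOI prompt by splicing the edited reconstruction back into the forward pass. It assumes a TransformerLens `HookedTransformer` on GPT-2 Small and reuses the hypothetical `AttnOutputSAE` from the sketch above; the layer index, feature index, and untrained SAE weights are placeholders, not our actual artifacts.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # illustrative model choice
LAYER, FEATURE = 5, 123  # hypothetical layer and SAE feature indices

# Placeholder SAE; in practice one would load trained weights instead.
sae = AttnOutputSAE(d_in=model.cfg.n_heads * model.cfg.d_head, d_sae=24576)

def ablate_feature(z, hook):
    # z: [batch, pos, n_heads, d_head]; the SAE acts on heads concatenated.
    batch, pos, n_heads, d_head = z.shape
    flat = z.reshape(batch, pos, n_heads * d_head)
    acts = sae.encode(flat)      # feature activations, [batch, pos, d_sae]
    acts[..., FEATURE] = 0.0     # zero-ablate one feature
    return sae.decode(acts).reshape(batch, pos, n_heads, d_head)

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)
with torch.no_grad():
    patched_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{LAYER}.attn.hook_z", ablate_feature)],
    )
# Comparing patched_logits against a clean run (e.g. on the " Mary" vs
# " John" logit difference) indicates whether the ablated feature carries
# causally relevant information for the IOI task.
```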

Finally, we open-source the trained SAEs and a tool for exploring arbitrary prompts through the lens of attention SAEs.