This is an interim report that we are currently building on. We hope this update will be useful to related research occurring in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort
In previous work, we trained and open-sourced a set of attention SAEs on all 12 layers of GPT-2 Small. We found that randomly sampled SAE features in each layer were highly interpretable, and highlighted a set of interesting feature families. We’ve since leveraged our SAEs as a tool to interpret the roles of attention heads. The key idea is that although our SAEs are trained to reconstruct an entire layer, the contribution to each individual head can be inferred. This allows us to find the top 10 features most salient to a given head; whenever these share a pattern, it may suggest a role for that head. We used this to manually inspect the role of every head in GPT-2 Small, and we spend the rest of this post exploring various implications of our findings and the technique.
In the spirit of An Overview of Early Vision in InceptionV1, we start with a high-level, guided tour of the different behaviors implemented by heads across every layer, building better intuitions for what attention heads learn in a real language model.
To validate that the technique is teaching us something real about the roles of these heads, we confirm that our interpretations match previously studied heads. Note that our annotator mostly did not know a priori which heads had previously been studied. We find:
In addition to building intuition about what different heads are doing, we use our SAEs to better understand the prevalence of attention head polysemanticity at a high level. Our technique suggests that the vast majority of heads (>90%) are doing multiple tasks. This implies that we’ll either need to understand each head’s multiple roles, or primarily use features (rather than heads) as better units of analysis. However, our work also suggests that not all heads are fully polysemantic. We find 9% plausibly monosemantic and 5% mostly monosemantic heads, though we note that as our technique only explores the top 10 most salient features per head it is not reliable enough to prove monosemanticity on its own.
Finally, we’re excited to find that our SAE-based technique helps enable faster mech interp research on non-SAE questions. A long open question concerns why there are so many induction heads. Inspecting salient features for the two induction heads in Layer 5 immediately motivated the hypothesis that one head specializes in “long prefix induction”, wanting the last several tokens to match, while the other performs “standard induction”, only needing a single token to match. We then verified this hypothesis with more rigorous non-SAE based experiments. With so much of the field investing effort in sparse autoencoders, it is nice to have signs of life that these are a legitimately useful tool for advancing the field’s broader agenda.
Recall that we train our SAEs for GPT-2 Small on the “z” vectors (i.e. the per-head attention outputs before the final linear map, sometimes called the mixed values), concatenated across all heads in each layer. See our first post for more technical details.
We develop a technique specific to this setup: decoder weight attribution by head. For each layer, our attention SAEs are trained to reconstruct the concatenated outputs of each head. Thus each SAE decoder direction is of shape (n_heads * d_head), and we can think of each d_head slice as specifically reconstructing the output of the corresponding head. We then compute the norm of each slice as a proxy for how strongly that head writes the feature. Finally, for each head, we sort features by their decoder weight attribution to get a sense of which features the head is most responsible for outputting.
For example, for head 7.9, Feature 29342 has a maximal decoder weight attribution of 0.49.
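As a minimal sketch, decoder weight attribution can be computed with a few lines of PyTorch. The weights below are random stand-ins for a real SAE decoder, and the variable names (`W_dec`, `decoder_weight_attribution`) are our own for illustration, not from a specific codebase:

```python
import torch

def decoder_weight_attribution(W_dec: torch.Tensor, n_heads: int, d_head: int) -> torch.Tensor:
    """Norm of each head's d_head slice of every decoder direction.

    W_dec: (n_features, n_heads * d_head) SAE decoder matrix.
    Returns: (n_features, n_heads) matrix of per-head slice norms.
    """
    n_features = W_dec.shape[0]
    slices = W_dec.reshape(n_features, n_heads, d_head)
    return slices.norm(dim=-1)

# Toy example with random weights (for GPT-2 Small: n_heads=12, d_head=64)
torch.manual_seed(0)
n_features, n_heads, d_head = 100, 12, 64
W_dec = torch.randn(n_features, n_heads * d_head)
attrib = decoder_weight_attribution(W_dec, n_heads, d_head)

# Top 10 features most strongly written by one head (e.g. head 9 of this layer)
head = 9
top_features = attrib[:, head].argsort(descending=True)[:10]
```

Sorting each head's column of `attrib` is all the ranking step amounts to.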
A better metric for attributing SAE features to attention heads might be the fraction of variance explained (FVE) by each head via DFA, as we did in our first post. However, this is more computationally expensive, and an examination of decoder weight attribution against FVE on random heads shows that the two are correlated and should roughly preserve ranking order. An example for H4.5 is shown below.
There are a few heads, such as H8.5, where the top-ranking features by FVE are less correlated with those by decoder weight attribution, and using FVE may have produced better results. We recommend using FVE for any reproductions.
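For intuition, per-head DFA for a single feature falls out of the same slicing trick: the feature's pre-activation is linear in the concatenated z vector, so it decomposes exactly into one term per head. A toy sketch with random stand-in weights (variable names are ours):

```python
import torch

torch.manual_seed(0)
n_heads, d_head = 12, 64
d_in = n_heads * d_head

w_enc = torch.randn(d_in)   # encoder row for a single SAE feature
b = torch.tensor(0.1)       # that feature's pre-activation bias
z = torch.randn(d_in)       # concatenated head outputs at one position

# Per-head direct feature attribution: each head's slice of z dotted with
# the matching slice of the encoder row.
dfa = (z.reshape(n_heads, d_head) * w_enc.reshape(n_heads, d_head)).sum(-1)

# The per-head terms sum exactly back to the full pre-activation.
full_pre_act = z @ w_enc + b
assert torch.allclose(dfa.sum() + b, full_pre_act, atol=1e-4)
```

Aggregating these per-head terms over a dataset is what lets FVE-by-head be computed, at the cost of a forward pass rather than a weight lookup.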
Based on a review of the top 10 SAE features by decoder weight attribution for each attention head, we observed that features become more abstract up to layer 9 and then less abstract in the final layers.
The table below summarizes the kind of feature groupings that we identified across the various layers. Notable heads indicated with an asterisk (*) were particularly interesting as possibly monosemantic due to exhibiting the same feature grouping for all top 10 features.
Layer | Feature groups / possible roles | Notable Heads |
--- | --- | --- |
0 | - Single-token (“of”) - Bi-gram features (following “S”) - Micro-context features (cars, Apple tech, solar) | - H0.1: Top 6 features are all variants capturing “of” - H0.9: Long range context tracking family (headlines, nutrition facts, lists of nouns) |
1 | - Single-token (Roman numerals) - Bi-gram features (following “L”) - Specific noun tracking (choice, refugee, gender, film/movie) | - H1.5*: Succession or pairs related behavior - H1.8: Long range context tracking with very weak weight attribution |
2 | - Short phrases (“never been…”) - Entity features (court, media, govt) - Bi-gram & tri-gram features (“un-”) - Physical direction and logical relationships (“under”) - Entities followed by what happened (govt) | - H2.0: Short phrases following a predicate like not/just/never/more - H2.5: Subject tracking for physical directions (under, after, between, by), logical relationships (then X, both A and B) - H2.7: Groups of context tracking features |
3 | - Entity-related fragments (“world’s X”) - Tracking of a characteristic (ordinality or extremity) (“whole/entire”) - Single-token and double-token (eg) - Tracking following commas (while, though, given) | - H3.0: Identified as duplicate token head from 8/10 features - H3.11: Tracking of ordinality or entirety or extremity |
4 | - Active verbs (do, share) - Context tracking families (story highlights) | - H4.5: Characterizations of typicality or extremity - H4.7*: Weak/non-standard duplicate token head - H4.11*: Identified as a previous token head based on all features |
5 | - Induction (F) | - H5.1: Long prefix induction head - H5.5: Induction head |
6 | - Induction (M) - Active verbs (want to, going to) - Local context tracking for certain concepts (vegetation) | - H6.3: Active verb tracking following a comma - H6.7: Local context tracking for certain concepts (payment, vegetation, recruiting, death) - H6.9*: Induction head - H6.11: Suffix completions on specific verb and phrase forms |
7 | - Induction (al-) - Active verbs (asked/needed) - Reasoning and justification phrases (because, for which) | - H7.2*: Non-standard induction - H7.10*: Induction head |
8 | - Active verbs (“hold”) - Compound phrases (either) - Time and distance relationships - Quantity or size comparisons or specifiers (larger/smaller) - Url completions (twitter) | - H8.1*: Prepositions copying (with, for, on, to, in, at, by, of, as, from) - H8.5: Grammatical compound phrases (either A or B, neither C nor D, not only Z) - H8.8: Quantity or time comparisons or specifiers |
9 | - Complex concept completions (time, eyes) - Grammatical relationship joiners (between) | - H9.0*: Complex tracking on specific concepts (what is happening to time, where focus should be, actions done to eyes, etc.) - H9.2: Complex concept completions (death, diagnosis, LGBT discrimination, problem and issue, feminism, safety) - H9.9*: Copying, usually names, with some induction |
10 | - Grammatical adjusters - Physical or spatial property assertions | - H10.1: Assertions about a physical or spatial property (up/back/down/ over/full/hard/soft/long/lower) - H10.4: Various separator character (colon for time, hyphen for phone, period for quantifiers) - H10.5: Counterfactual and timing/tense assertions (if/than/had/since/will/would/until/has X/have Y) - H10.6: Official titles - H10.10*: Capital letter completions with some context tracking (possibly non-standard induction) - H10.11: Certain conceptual relationships |
11 | - Grammatical adjustments - Bi-grams - Capital letter completions - Long range context tracking | - H11.3: Late layer long range context tracking, possibly for output confidence calibration |
While our technique is not sufficient to prove that a head is monosemantic, we believe that having multiple unrelated features attributed to a head is evidence that the head is doing multiple tasks. We also note that there is a possibility we missed some monosemantic heads due to missing patterns at certain levels of abstraction (e.g. some patterns might not be evident from a small sample of SAE features, and in other instances an SAE might have mistakenly learned some red herring features).
During our investigations of each head, we found 14 monosemantic candidates (i.e. all of the top 10 attributed features for these heads were closely related). This suggests that over 90% of the attention heads in GPT-2 small are performing at least two different tasks.
To explicitly show an example of a polysemantic head, we use evidence from the SAE features attributed to the head to deduce that it performs multiple tasks. In H10.2, we find both integer-copying behavior and url-completion behavior in the top SAE features. Zero-ablating each head in Layer 10 and recording the mean change in loss on synthetic datasets[1] for each task shows a clear jump for H10.2 relative to the other heads in L10, confirming that this head really is doing both of these tasks:
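A sketch of this zero-ablation experiment. The hook function below is runnable on its own; the commented portion shows how it would plug into TransformerLens, where `make_task_dataset` is a hypothetical stand-in for our synthetic integer-copying / url-completion prompts:

```python
import torch

def zero_ablate_head(z, hook, head_idx):
    """Zero one head's output slice; z has shape [batch, pos, head, d_head]."""
    z[:, :, head_idx, :] = 0.0
    return z

# Sketch of the full loop in TransformerLens (not run here):
# from functools import partial
# from transformer_lens import HookedTransformer
# model = HookedTransformer.from_pretrained("gpt2")
# tokens = make_task_dataset()  # hypothetical: integer-copying or url prompts
# clean_loss = model(tokens, return_type="loss")
# for head in range(model.cfg.n_heads):
#     loss = model.run_with_hooks(
#         tokens, return_type="loss",
#         fwd_hooks=[("blocks.10.attn.hook_z", partial(zero_ablate_head, head_idx=head))],
#     )
#     print(head, (loss - clean_loss).item())
```

The head whose ablation spikes the loss on both task datasets is the one performing both tasks.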
Note that the line between polysemantic and monosemantic heads is blurry, and we had a high bar for considering monosemantic candidates. For example, consider H5.10: all 10 features look like context features, boosting the logits of tokens related to that context. We labeled this as polysemantic since some of the contexts are unrelated, but we could plausibly think of it as a general monosemantic “context” head. We also think that head polysemanticity is on a spectrum (e.g. heads with 2 roles are less polysemantic than heads with 10). If we can understand the multiple roles of a polysemantic head, it still might be worth trying to fully understand it in the style of McDougall et al.
While our technique doesn’t prove that heads are monosemantic, we are excited that our SAEs might narrow down the search space for monosemantic heads, and reveal new, messier behaviors that have been historically harder to understand in comparison to cleaner algorithmic tasks.
For the monosemantic candidates, we see a range of different behaviors:
For example, consider H8.1. For all top 10 features, the top DFA suggests that the head is attending to a different preposition (including ‘with’, ‘from’, ‘for’, ‘to’), and the top logits boost that same preposition. This suggests that this might be a monosemantic “Preposition mover” head that specifically looks for opportunities to copy prepositions in the context.
As shown in the table below, we also found instances where a head was almost monosemantic or plausibly bisemantic.
Head Type | Fraction of Heads |
--- | --- |
Plausibly monosemantic: All top 10 features were deemed conceptually related by our annotator. | 9.7% (14/144) |
Plausibly monosemantic, with minor exceptions: All features were deemed conceptually related with just one or two exceptions. | 5.5% (8/144) |
Plausibly bisemantic: All features were clearly in only two conceptual categories. | 2.7% (4/144) |
To further verify that our SAE features are teaching us something real about the roles of heads, we show that we can distinguish meaningfully different roles for two induction heads in L5 of GPT-2 Small, shedding some light on why there are so many induction heads in language models. We find that one head seems to specialize in “long prefix induction”, while the other mostly does “standard induction”.
Notably, this hypothesis was motivated by our SAEs: glancing at the top features attributed to 5.1 shows “long induction” features, defined as features that activate on examples of induction with at least two repeated prefix tokens (e.g. 2-prefix induction: ABC … AB -> C). We spot these by comparing the tokens at (and before) each top feature activation with the tokens preceding the corresponding top DFA[2] source position. While previous work from Goldowsky-Dill et al. found similar “long induction” behavior in a 2-layer model, we (Connor and Rob) were not aware of this during our investigations, showing that our SAEs can teach us novel insights about attention heads.
As an illustrative example, we compare two “‘-’ is next by induction” features attributed to heads 5.1 and 5.5 respectively. Notice that all of the top examples for 5.1’s feature are examples of long prefix induction, while almost all of the examples in 5.5’s feature are standard (AB ... A -> B) induction. For example, comparing the top DFA to the feature activation for 5.1’s top example shows a 4-prefix match (.| ”|Stop| victim), while 5.5’s top feature is 1-prefix ( center).
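Concretely, the prefix-match check can be written as a small helper: walk backwards from the destination (the feature activation) and from the token before the DFA source, counting consecutive matches. The function and its name are ours, a sketch of the comparison rather than our exact implementation:

```python
def prefix_match_length(tokens, dest: int, src: int) -> int:
    """Length of the matching prefix for an induction candidate.

    dest: position where the head fires (the feature activation).
    src:  top DFA source position (the token induction would copy).
    Compares tokens at/before dest with tokens before src:
    AB ... A -> B is a 1-prefix match; ABC ... AB -> C is a 2-prefix match.
    """
    n = 0
    while dest - n >= 0 and src - 1 - n >= 0 and tokens[dest - n] == tokens[src - 1 - n]:
        n += 1
    return n

# "ABC ... AB": the head fires at the final "B" (index 5) and attends
# back to "C" (index 2), giving a 2-prefix (long induction) match.
toks = ["A", "B", "C", "X", "A", "B"]
assert prefix_match_length(toks, dest=5, src=2) == 2
```

Applying this to each top feature activation and its top DFA source position separates “long induction” features (length >= 2) from standard ones.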
To confirm that this isn’t just an artifact of our SAEs, we reinforce the hypothesis with independent lines of evidence. We first generate synthetic induction datasets from random repeated tokens with varying prefix lengths. We confirm that while both heads’ induction scores rise as we increase prefix length, 5.1 shows a much more dramatic phase change as we transition to long prefixes (i.e. prefix length >= 2):
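A minimal sketch of how such a synthetic example can be generated (the function name and the `gap` parameter are our own choices for illustration):

```python
import random

def make_induction_example(vocab_size: int, prefix_len: int, gap: int = 8, seed=None):
    """Random tokens containing a repeated prefix, so the final prefix_len
    tokens match an earlier occurrence and induction predicts the token
    (`answer`) that followed that earlier occurrence.
    """
    rng = random.Random(seed)
    prefix = [rng.randrange(vocab_size) for _ in range(prefix_len)]
    answer = rng.randrange(vocab_size)
    filler = [rng.randrange(vocab_size) for _ in range(gap)]
    # layout: filler ... prefix answer ... filler ... prefix -> predict answer
    tokens = filler[:gap // 2] + prefix + [answer] + filler[gap // 2:] + prefix
    return tokens, answer
```

Sweeping `prefix_len` from 1 upwards and measuring each head's induction score on batches of such sequences gives the phase-change plot described above.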
We also check each head’s average direct logit attribution (DLA) to the correct next token as a function of prefix length. We again see that 5.1’s DLA skyrockets as we enter the long prefix regime, while 5.5’s DLA remains relatively constant:
We now check that these results hold on a random sample of the training distribution. We first filter for examples where the heads attend non-trivially to some token[3] (i.e. not just attending to BOS), and check how often these are examples of n-prefix induction. We find that 5.1 mostly attends to tokens in long prefix induction, while 5.5 mostly does normal 1-prefix induction.
We then intervene on the long induction examples from the training distribution, corrupting them so that only a 1-token prefix matches, and show that 5.1’s average induction score plummets from ~0.55 to ~0.05, while 5.5 still maintains an induction score of ~0.45.
Finally, we see hints of universality: checking average induction scores on our synthetic induction dataset for a larger model, GPT-2 Medium, reveals signs of both “long prefix” and “standard” induction heads.
We recorded the groupings for all heads in this Google Sheet based on the corresponding attention head feature dashboards. We thank Callum McDougall for providing the visualization codebase on top of which these dashboards were constructed. The code for generating the dashboards is available by messaging either of the first two authors.
Feel free to use the citation from the first post, or this citation specifically for this current post:
@misc{gpt2_attention_saes_3,
    author = {Robert Krzyzanowski and Connor Kissane and Arthur Conmy and Neel Nanda},
    title = {We Inspected Every Head in GPT-2 Small Using SAEs So You Don’t Have To},
    howpublished = {Alignment Forum},
    year = {2024},
    url = {https://www.alignmentforum.org/posts/xmegeW5mqiBsvoaim/we-inspected-every-head-in-gpt-2-small-using-saes-so-you-don}
}
Connor and Rob were core contributors on this project. Rob performed high-level grouping analysis of every attention head in GPT-2 Small and some corresponding shallow dives. Connor performed the long prefix induction deep dive and the H10.2 polysemanticity experiment. Arthur and Neel gave guidance and feedback throughout the project. The original project idea was suggested by Neel.
We would like to thank Georg Lange and Joseph Bloom for extremely helpful criticism about our claims on polysemanticity in an earlier draft of this work.
For other heads, we also used proxies to detect examples in OpenWebText with the hypothesized head behaviors, but had messier results. Our initial hypotheses were often too broad (e.g. "succession" vs "succession for integers"), which led to false negatives. Synthetic data was helpful for filtering these out.
Note that DFA is attention-weighted, so you can think of it as similar to an attention pattern.
We show a threshold of 0.3. The results generally hold for a range of thresholds.