Produced as part of the MATS Program, under the mentorship of @Neel Nanda and @Lee Sharkey
Epistemic status: optimized to get the post out quickly, but we are confident in the main claims
TL;DR: head 1.4 in attn-only-4l exhibits many different attention patterns, all of which are relevant to the model's performance
· is for a space, ⏎ for a new line, and labels such as ⏎[·×K] mean "\n and K spaces".
⏎[·×3] and ⏎[·×7] have basically the same distributions, with ~87% of patterns not classified.
Token: ⏎[·×3]
Behaviour: previous+induction
There are many ways to understand this pattern; there is likely more going on than simple previous-token and induction behaviours.
Token: ·R
Behaviour: inactive
Token: ⏎[·×7]
Behaviour: unknown
This is a very common pattern, where attention is paid from "new line and indentation" to "new line and bigger indentation". We believe it accounts for most of what is classified as unknown for ⏎[·×7] and ⏎[·×3].
Token: width
Behaviour: unknown
We did not see many examples like this, but it looks like attention is being paid to recent tokens representing arithmetic operations.
Token: dict
Behaviour: previous
Mostly previous token, but ·collections gets more attention than . and default, which points at something more complicated.
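The labels used in the examples above (previous, induction, inactive, unknown) can be illustrated with a small heuristic. This is only a sketch under stated assumptions, not the classification procedure used for the post: the function name, the 0.5 threshold, and the reading of "inactive" as "most attention on the first (BOS) token" are all hypothetical.

```python
def classify_attention_row(tokens, attn_row, query_pos, threshold=0.5):
    """Label the attention distribution of one query position.

    tokens:    list of token strings; tokens[0] is assumed to be BOS.
    attn_row:  attention probabilities from query_pos to each key position.
    threshold: hypothetical cut-off on attention mass (not from the post).
    """
    # Assumption: a head that mostly attends to BOS is treated as "inactive".
    if attn_row[0] > threshold:
        return "inactive"

    # Mass on the immediately preceding token (BOS is not counted as "previous").
    prev_mass = attn_row[query_pos - 1] if query_pos >= 2 else 0.0

    # Mass on tokens that directly follow an earlier copy of the current token.
    induction_mass = sum(
        attn_row[i + 1]
        for i in range(query_pos - 1)
        if tokens[i] == tokens[query_pos]
    )

    labels = []
    if prev_mass > threshold:
        labels.append("previous")
    if induction_mass > threshold:
        labels.append("induction")
    return "+".join(labels) if labels else "unknown"


# The previous token is also the token after an earlier copy of "a",
# so both labels apply at once.
tokens = ["<BOS>", "x", "a", "b", "a"]
row = [0.0, 0.0, 0.1, 0.9, 0.0]
print(classify_attention_row(tokens, row, query_pos=4))  # previous+induction
```

A rule this simple would leave patterns such as the "bigger indentation" one unlabelled, which is consistent with a large share of patterns ending up as unknown.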
To demonstrate the behaviours, we set up three distinct templates, wherein each template is structured to be as similar as possible to code examples found within the training dataset.
The dataset for demonstrating induction behaviour is 120 prompts of the following structure:
The current position is highlighted orange; attention head 1.4 attends heavily to the token immediately after an earlier copy of the orange token (the red token). The correct next token here is highlighted green. The dataset is generated with three distinct pairs of red and green tokens and many variants of blue and indentation tokens.
The dataset for demonstrating similar indentation token behaviour is 50 prompts of the following structure:
The current position is highlighted orange; attention head 1.4 attends heavily to a similar token (the red token), exhibiting the similar indentation token behaviour. The correct next token here is highlighted green. The dataset is generated by taking random variable names for the blue tokens.
The dataset for demonstrating fuzzy previous token behaviour is 50 prompts of the following structure:
Again, the current position is highlighted orange; attention head 1.4 attends heavily to the previous token (the red token), acting as a previous token head. The correct next token here is highlighted green. The dataset is generated by taking random variable names for the blue tokens (the off-blue shade indicates that the prompt definition contains two kinds of name tokens, a parent class name and a child class name). Four random tokens are used to generate each prompt: two for the class name, one for the parent class, and one for the init argument.
We start by studying the importance of attention head 1.4 on the aforementioned induction task (an example presented below).
In this prompt template, we BOS-ablate the orange token (in this particular example, the ⏎··· token) when doing the forward pass for the corrupted run. In the clean run, over the 120 dataset examples, 93% have the correct next token (the green token) as the top-predicted token, compared to 25% on the corrupted run. The same holds for the mean probability of the correct token: it is 0.28 on the clean run and 0.09 on the corrupted run.
run | % of correct top prediction | mean probability of correct prediction |
clean | 93% | 0.28 |
ablated | 25% | 0.09 |
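For concreteness, the two columns of the table can be computed from per-prompt logits roughly as follows. This is a minimal sketch: `summarize_run` and its inputs are hypothetical names, and it assumes we already have, for each prompt, the vocabulary logits at the position where the next token is predicted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_run(final_logits, correct_ids):
    """Return (fraction of prompts with the correct top prediction,
    mean probability assigned to the correct token).

    final_logits: one list of vocabulary logits per prompt.
    correct_ids:  index of the correct next token for each prompt.
    """
    top1_hits = 0
    prob_sum = 0.0
    for logits, correct in zip(final_logits, correct_ids):
        probs = softmax(logits)
        top = max(range(len(probs)), key=probs.__getitem__)
        top1_hits += int(top == correct)
        prob_sum += probs[correct]
    n = len(correct_ids)
    return top1_hits / n, prob_sum / n


# Toy example with a 3-token vocabulary and two prompts.
top1_rate, mean_prob = summarize_run([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0, 2])
print(top1_rate, mean_prob)
```

Running the same function once on clean logits and once on logits from the BOS-ablated forward pass would give the two rows of the table.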
We now explore the effect of BOS-ablation at other sequence positions, to see whether it is similarly large there. For clarity, we measure the average effect across the 120 prompts in the plot below, but only display a single example on the x-axis.
The line plot above indicates that performing BOS-ablation on other positions (besides the orange token) is not associated with a large decrease in the probability assigned to the correct answer (< 5 percentage point drop in probability). BOS-ablating the orange token position, however, where attention head 1.4 is acting as an induction head, decreases the probability assigned to the correct answer (the green tokens across the dataset distribution) by approximately 20 percentage points.
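The per-position sweep can be sketched as below. The sketch assumes that BOS-ablating a position means replacing the head's attention pattern at that query position with one that puts all attention on the first (BOS) token; the function names and the `prob_correct` callable, which stands in for a forward pass with a patched pattern, are hypothetical.

```python
def bos_ablate_pattern(pattern, query_pos):
    """Return a copy of a [query][key] attention pattern in which the row for
    query_pos attends only to position 0 (assumed to be the BOS token)."""
    ablated = [list(row) for row in pattern]
    ablated[query_pos] = [1.0] + [0.0] * (len(pattern[query_pos]) - 1)
    return ablated


def probability_drop_by_position(prob_correct, clean_pattern):
    """For each query position, BOS-ablate that position only and record how
    much the probability of the correct token drops relative to the clean run.

    prob_correct: hypothetical callable mapping the head's attention pattern
                  to the probability the model assigns to the correct token.
    """
    clean = prob_correct(clean_pattern)
    return [
        clean - prob_correct(bos_ablate_pattern(clean_pattern, q))
        for q in range(len(clean_pattern))
    ]


# Toy stand-in for the model: the answer depends only on how much the final
# position attends to key position 2.
def toy_prob(pattern):
    return 0.1 + 0.5 * pattern[-1][2]


pattern = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.2, 0.3, 0.5, 0.0],
    [0.1, 0.05, 0.8, 0.05],
]
print(probability_drop_by_position(toy_prob, pattern))
```

In the toy example only ablating the final position changes the output, mirroring the shape of the plot. Averaging these per-position drops over all prompts in a dataset would give the curve described above.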
We move on to understanding the importance of the similar indentation token behaviour performed by attention head 1.4 on the corresponding dataset (an example is presented below).
As before, over the 50 dataset examples, the correct token is the top-ranked (0th-rank) prediction in 100% of clean runs, compared to 4% of corrupted runs. We also find that the mean probability of the correct token is 0.55 on the clean run and 0.11 on the corrupted run.
run | % of correct top prediction | mean probability of correct prediction |
clean | 100% | 0.55 |
ablated | 4% | 0.11 |
Again, we test the importance of different token positions by BOS-ablating all tokens in the prompt iteratively and recording the change in probability associated with the correct token.
BOS-ablating attention head 1.4 at any token besides the indent token is associated with only negligible changes in the probability assigned to the correct token (the green tokens across the dataset). BOS-ablating the indent token, and thus attention head 1.4’s similar indentation token behaviour, decreases the probability assigned to the correct next token by approximately 40 percentage points.
Finally, we study the importance of attention head 1.4’s fuzzy previous token behaviour on the corresponding dataset, an example presented below.
In this case, the clean run’s 0th-rank token is always the correct token, while this is true for only 80% of the corrupted runs on the dataset examples. The mean probability associated with the correct token is 0.87 for the clean run and 0.40 for the corrupted run.
run | % of correct top prediction | mean probability of correct prediction |
clean | 100% | 0.87 |
ablated | 80% | 0.40 |
As above, we iteratively BOS-ablate each token in prompts across the dataset and record the drop in probability assigned to the correct token.
The drop in the probability across all tokens besides the open bracket token is negligible, whereas this token is associated with a 45 percentage point drop when BOS-ablated.
Our results suggest that head 1.4 in attn-only-4l exhibits multiple simple attention patterns that are relevant to the model's performance. We believe the model is incentivized to use a single head for many purposes because it saves parameters. We are curious how these behaviours are implemented by the head, but we did not make meaningful progress trying to understand this mechanistically.
We believe the results are relevant to circuit analysis, because researchers often label attention heads based purely on their behaviour on a narrow task (IOI, Docstring, MMLU). Copy Suppression is an exception.
We would like to thank @Yeu-Tong Lau and @jacek for feedback on the draft.