All of Aryaman Arora's Comments + Replies

Very useful list Neel!! Thanks for mentioning AxBench, but unfortunately we don't own the domain you linked to 😅. The actual link is https://github.com/stanfordnlp/axbench

Cool that you figured that out; that easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since it means GPT-Neo's later layers have to do computation taking that offset into account, which doesn't seem like an efficient choice at all. I will def poke around more.

Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?

Neel Nanda
Haven't checked lol

I'm pretty sure! I don't think I messed up anywhere in my code (just a nested for loop lol). An interesting consequence of this is that for GPT-2, applying the logit lens to the embedding matrix (i.e. computing $W_E W_U$) gives us a near-perfect autoencoder (the top output for each token is that token itself), but for GPT-Neo it always gives us the token whose embedding has the largest magnitude: since the cosine similarities are nearly identical across tokens, the cosine term in the dot product is useless and the magnitude term dominates.
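A minimal sketch of that check (not the original code), assuming TransformerLens conventions where `W_E` is `[d_vocab, d_model]` and `W_U` is `[d_model, d_vocab]`, and using the standard model names TransformerLens accepts; the chunking is just to avoid materialising a full `d_vocab × d_vocab` matrix:

```python
# Does unembedding each token's embedding recover that token?
import torch
from transformer_lens import HookedTransformer

for name in ["gpt2", "EleutherAI/gpt-neo-125M"]:
    model = HookedTransformer.from_pretrained(name, device="cpu")
    correct = 0
    with torch.no_grad():
        for i in range(0, model.cfg.d_vocab, 1024):
            # Logit lens applied directly to a chunk of embedding rows.
            chunk = model.W_E[i : i + 1024] @ model.W_U + model.b_U  # [chunk, d_vocab]
            preds = chunk.argmax(dim=-1)
            correct += (preds == torch.arange(i, i + chunk.shape[0])).sum().item()
    print(f"{name}: {correct / model.cfg.d_vocab:.3f} of tokens map back to themselves")
```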

What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first a...

Neel Nanda
See my other comment: it turns out to be the boring fact that there's a large constant offset in the GPT-Neo embeddings. If you subtract the mean of the GPT-Neo embed, it looks normal. (Though the fact that this offset exists is interesting! I wonder what that direction is used for?) I mean that, as far as I can tell (medium confidence), attn0 in GPT-2 isn't used for much, and MLP0 contains most of the information about the value of the token at each position. E.g., ablating MLP0 completely kills performance, while ablating other MLPs doesn't. And generally, the kinds of tasks that I'd expect to depend on the token values depend substantially on MLP0.
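A rough sketch of that ablation comparison, assuming TransformerLens hook names and using zero-ablation on an arbitrary prompt (so the exact numbers are only illustrative):

```python
# Zero-ablate each MLP layer's output in turn and compare next-token loss.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2", device="cpu")
tokens = model.to_tokens("The Eiffel Tower is located in the city of Paris, which is in France.")

def zero_ablate(value, hook):
    # Replace the MLP output with zeros at every position.
    return torch.zeros_like(value)

baseline = model(tokens, return_type="loss").item()
print(f"baseline loss: {baseline:.3f}")
for layer in range(model.cfg.n_layers):
    loss = model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(f"blocks.{layer}.hook_mlp_out", zero_ablate)],
    ).item()
    print(f"ablate MLP{layer}: loss {loss:.3f}")
```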

Huh, interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have those. Some ideas to consider:

  • the backup heads could have other main functions but incidentally are useful for the specific task we're looking at, so they end up taking the place of the main heads
  • thinking of virtual attention heads, the computations performed are not easily interpretable at the individual head-level once you have a lot of layers, sort of like how neurons aren't interpretable in big models due to superposition

Re: GPT-Neo being weird, one o...

Neel Nanda
Just dug into it more: the GPT-Neo embed just has a large constant offset. The average norm is 11.4, and the norm of the mean is 11. Avg cosine sim is 0.93 before subtracting the mean, and 0.0024 after (avg absolute value of cosine sim is 0.1831).
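For reference, a quick way to compute statistics like these (a sketch assuming TransformerLens; it samples a subset of tokens for the pairwise cosine sims, so the numbers will only approximately match):

```python
# Embedding-offset statistics for GPT-Neo: norms and pairwise cosine sims,
# before and after subtracting the mean embedding.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M", device="cpu")
W_E = model.W_E.detach()
print("average embedding norm:", W_E.norm(dim=-1).mean().item())
print("norm of mean embedding:", W_E.mean(dim=0).norm().item())

idx = torch.randperm(W_E.shape[0])[:2000]  # sample tokens to keep this cheap

def avg_cos_sim(emb):
    unit = torch.nn.functional.normalize(emb[idx], dim=-1)
    sims = unit @ unit.T
    off_diag = sims[~torch.eye(len(idx), dtype=torch.bool)]  # drop self-similarities
    return off_diag.mean().item(), off_diag.abs().mean().item()

print("avg / avg-abs cosine sim (raw):", avg_cos_sim(W_E))
print("avg / avg-abs cosine sim (mean-subtracted):", avg_cos_sim(W_E - W_E.mean(dim=0)))
```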
Neel Nanda
Wait, WTF? Are you sure? 0.96 is super high. The only explanation I can see for that is a massive constant offset dominating the cosine sim (which isn't crazy tbh). The Colab claims that the logit lens doesn't work for GPT-Neo, but does work if you include the final block, which seems sane to me. I think that in GPT-2 the MLP0 is basically part of the embed, so it doesn't seem crazy for the inverse to be true (esp if you do the dumb thing of making your embedding + unembedding matrix the same)

Understand IOI in GPT-Neo: it's a same-size model but does IOI via composition of MLPs

GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.
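For concreteness, here's roughly what a logit-lens check looks like (a sketch assuming TransformerLens cache names; "failing" would show up as low top-1 agreement between intermediate layers and the final prediction):

```python
# Logit lens: decode each layer's residual stream through the final LayerNorm
# and unembedding, and compare to the model's final prediction.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M", device="cpu")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")
logits, cache = model.run_with_cache(tokens)
final_pred = logits.argmax(dim=-1)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer]  # [batch, pos, d_model]
    layer_logits = model.ln_final(resid) @ model.W_U + model.b_U
    agreement = (layer_logits.argmax(dim=-1) == final_pred).float().mean().item()
    print(f"layer {layer}: top-1 agreement with final prediction = {agreement:.2f}")
```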

Arthur Conmy
Thanks for the comment! I have spent some time trying to do mechanistic interp on GPT-Neo, to try and answer whether compensation only occurs because of dropout. TLDR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.

In depth: when GPT-Neo is fed a sequence of tokens $t_1 t_2 \ldots t_{10} t_{11} t_{12} \ldots t_{20}$, where $t_1, \ldots, t_{10}$ are uniformly random and $t_i = t_{i-10}$ for $i \geq 11$, there are four heads in Layer 6 that have the induction attention pattern (i.e. attending from $t_i$ to $t_{i-9}$). Three of these heads (6.0, 6.6, 6.11) decrease loss when ablated, and one (6.1) increases loss on ablation. Interestingly, when 6.1 is ablated, additionally ablating 6.0, 6.6, and 6.11 causes loss to increase (perhaps this is confusing, see this table!). My guess is that the model is able to use the outputs of 6.0, 6.6, and 6.11 differently in the two regimes, so they "compensate" when 6.1 is ablated.
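A rough sketch of that kind of experiment (assuming TransformerLens; it zero-ablates head outputs via hook_z and reports the overall loss, which is not necessarily Arthur's exact setup):

```python
# Repeated-random-token prompts: the second half of each sequence repeats the
# first half, so induction heads can predict it. Then zero-ablate layer-6 heads.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("EleutherAI/gpt-neo-125M", device="cpu")
torch.manual_seed(0)
first_half = torch.randint(0, model.cfg.d_vocab, (8, 10))
tokens = torch.cat([first_half, first_half], dim=1)  # t_i = t_{i-10} for i >= 11

def ablate_heads(heads):
    def hook(z, hook):  # z: [batch, pos, head_index, d_head]
        z[:, :, heads, :] = 0.0  # zero out these heads' outputs
        return z
    return [(utils.get_act_name("z", 6), hook)]

print(f"clean loss: {model(tokens, return_type='loss').item():.3f}")
for heads in [[1], [0, 6, 11], [0, 1, 6, 11]]:
    loss = model.run_with_hooks(
        tokens, return_type="loss", fwd_hooks=ablate_heads(heads)
    ).item()
    print(f"ablate layer-6 heads {heads}: loss {loss:.3f}")
```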
Neel Nanda
Honestly I expect that training without dropout makes it notably better. Dropout is fucked! Interesting that you say logit lens fails and later layers don't matter - can you say more about that? Arthur mentions something in the walkthrough about how GPT-Neo does seem to have some backup heads, which is wild - I agree that intuitively backup heads should come from dropout.