Cool that you figured that out, easily explains the high cosine similarity! It does seem to me that a large constant offset to all the embeddings is interesting, since that means GPT-Neo's later layers have to do computation taking that into account, which seems not at all like an efficient decision. I will def poke around more.
Interesting on MLP0 (I swear I use zero indexing lol just got momentarily confused)! Does that hold across the different GPT sizes?
I'm pretty sure! I don't think I messed up anywhere in my code (just nested for loop lol). An interesting consequence of this is that for GPT-2, applying logit lens to the embedding matrix (i.e. ) gives us a near-perfect autoencoder (the top output is the token fed in itself), but for GPT-Neo it always gets us the vector with the largest magnitude since in the dot product the cosine similarity is a useless term.
What do you mean about MLP0 being basically part of the embed btw? There is no MLP before the first a...
Huh interesting about the backup heads in GPT-Neo! I would not expect a dropout-less model to have that--some ideas to consider:
Re: GPT-Neo being weird, one o...
Understand IOI in GPT-Neo: it's a same size model but does IOI via composition of MLPs
GPT-Neo might be weird because it was trained without dropout iirc. In general, it seems to be a very unusual model compared to others of its size; e.g. logit lens totally fails on it, and probing experiments find most of its later layers add very little information to its logit predictions. Relatedly, I would think dropout is responsible for backup heads existing and taking over if other heads are knocked out.
Very useful list Neel!! Thanks for mentioning AxBench, but unfortunately we don't own the domain you linked to 😅 the actual link is https://github.com/stanfordnlp/axbench