Hey, I'm not finished reading this yet, but I noticed something off about what you said.
You write that, at the end, the final 1600-dimensional vector is multiplied by W's transpose to project back into vocab space.
This isn't quite right. The final projection isn't done with W's transpose. Instead, there is a completely separate matrix at the end, whose shape matches the transpose of W.
You can see this in Hugging Face's code for GPT-2: in the GPT2LMHeadModel class, the final matrix multiplication is performed by the matrix called "lm_head".
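If it helps, here's a quick way to check. This is just a rough sketch, assuming the transformers package (with its torch dependency) is installed; "gpt2-xl" is the checkpoint with the 1600-dimensional hidden state:

```python
# Sketch: inspect the output projection in Hugging Face's GPT-2 implementation.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# lm_head is its own nn.Linear module mapping the 1600-dim hidden state
# to vocabulary logits, i.e. a 1600 -> 50257 projection.
print(model.lm_head)
# Linear(in_features=1600, out_features=50257, bias=False)

# For comparison, the input token-embedding matrix (what I read as W in the post):
print(model.transformer.wte)
# Embedding(50257, 1600)
```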
Thanks for the info.
This was a great read, very informative.