In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.
I disagree at least with the "almost completely" version of this claim:
Notice that the operation you want to learn is manifestly a convolution operation, i.e.
This also applies to the non-modular addition operation, but I think it's pretty plausible that if you train on non-modular addition (to the point of ~perfect generalization), the network would learn an embedding that converts the "tokenized" representation back into the "magnitude" representation, and then simply adds them as normal.
Some evidence for this:
It seems like if you believe "when the operation you want to learn is a convolution operation, then you will learn the Fourier basis", you should also believe that you'll get a Fourier basis for non-modular addition on one-hot-encoded numbers, and currently my guess is that that's false.
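To make the comparison concrete: on one-hot inputs, non-modular addition is also a convolution, just a linear (zero-padded) one rather than a circular one. Below is a minimal NumPy sketch of that fact; the helper `one_hot` and the size `n = 8` are illustrative choices of mine, not anything from the post or the notebook.

```python
import numpy as np

def one_hot(x, n):
    """Return the length-n indicator vector with a 1 at position x (hypothetical helper)."""
    v = np.zeros(n, dtype=int)
    v[x] = 1
    return v

n = 8          # illustrative input range 0..n-1 (assumption)
x, y = 3, 6

# Linear (non-circular) convolution: the output has length 2n - 1,
# so every possible sum 0..2(n - 1) has its own position and nothing wraps.
linear = np.convolve(one_hot(x, n), one_hot(y, n))
linear_sum = int(np.argmax(linear))   # position of the single spike

print(linear_sum)
```

This only shows that the target operation is a convolution in both the modular and non-modular cases; it says nothing about which basis a trained network would actually end up using.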
Fwiw, I agree that the algorithm is quite "mathematically natural" (indeed, one person came up with the algorithm independently, prompted by "how would you solve this problem if you were a Transformer"), though I feel like the "modular" part is pretty crucial for me (and the story I'd tell would be the one in Daniel's comment).
These remarks are basically me just wanting to get my thoughts down after a Twitter exchange on this subject. I've not spent much time on this post and it's certainly plausible that I've gotten things wrong.
In the 'Key Takeaways' section of the Modular Addition part of the well-known post 'A Mechanistic Interpretability Analysis of Grokking', Nanda and Lieberum write:
And
But the casual reader should use caution! It is in fact the case that "Inputs $x, y$ are given as one-hot encoded vectors in $\mathbb{R}^p$". This point is of course emphasized more in the full notebook (it has to be, that's where the code is), and the arXiv paper that followed is also much clearer about this point. However, when giving brief takeaways from the work, especially when it comes to discussing how 'natural' the learned algorithm is, I would go as far as saying that it is actually misleading to suggest that the network is literally given $x$ and $y$ as inputs. It is not trained to 'act' on the numbers $x$, $y$ themselves.
$$(1_x * 1_y)(t) = \sum_{s} 1_x(t-s)\, 1_y(s) = 1_{x+y}(t).$$

When thinking seriously about why the network is doing the particular thing that it is doing at the mechanistic level, I would want to emphasize that one-hotting is already a significant transformation. You have moved away from having the number $x$ be represented by its own magnitude. You instead have a situation in which $x$ and $y$ now really live 'in the domain' (it's almost like a dual point of view: the number $x$ is not the size of the signal, but the position at which the input signal is non-zero).
So, while I of course fully admit that I too am looking at it through my own subjective lens, one might say that (before the embedding happens) it is more mathematically natural to think that what the network is 'seeing' as input is something like the indicator functions $t \mapsto 1_x(t)$ and $t \mapsto 1_y(t)$. Here, $t$ is something like the 'token variable' in the sense that these are functions on the vocabulary. And if we essentially ignore the additional tokens for | and =, we can think that these are functions on the group $\mathbb{Z}/p\mathbb{Z}$ and that we would like the network to learn to produce the function $t \mapsto 1_{x+y}(t)$ at its output neurons.
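As a small illustration of this 'position, not magnitude' picture, here is a sketch in NumPy. The modulus `p = 7` and the helper name `indicator` are my own illustrative choices. It shows that, in this representation, adding $y$ to $x$ amounts to cyclically shifting the spike of $1_x$ by $y$ positions.

```python
import numpy as np

p = 7  # small illustrative modulus (assumption; any positive integer works)

def indicator(x, p):
    """Indicator function t -> 1_x(t) on Z/pZ, stored as a length-p vector."""
    v = np.zeros(p)
    v[x % p] = 1.0
    return v

x, y = 5, 4
target = indicator(x + y, p)   # 1_{x+y}, with x + y reduced mod p

# Adding y moves the spike of 1_x by y positions, wrapping around mod p.
shifted = np.roll(indicator(x, p), y)

print(np.array_equal(shifted, target))
```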
In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.
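Notice that the operation you want to learn is manifestly a convolution operation, i.e. the convolution of the two input indicator functions on $\mathbb{Z}/p\mathbb{Z}$ is the indicator function of the sum: $1_x * 1_y = 1_{x+y}$.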
And (as I distinctly remember being made to practically chant in an 'Analysis of Boolean Functions' class given by Tom Sanders) the Fourier Transform is the (essentially unique) change of basis that simultaneously diagonalizes all convolution operations. This is coming close to saying something like: There is one special basis that makes the operation you want to learn uniquely easy to do using matrix multiplications, and that basis is the Fourier basis.
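A quick numerical sanity check of both statements, as a sketch rather than anything taken from the original notebook: the names `indicator` and `circular_convolve` and the modulus `p = 7` are assumptions made for illustration. It checks that $1_x * 1_y = 1_{x+y}$ on $\mathbb{Z}/p\mathbb{Z}$, that the discrete Fourier transform turns this convolution into a pointwise product, and that the matrix of 'convolve with $1_y$' becomes diagonal in the Fourier basis.

```python
import numpy as np

p = 7  # small illustrative modulus (assumption)

def indicator(x, p):
    """Indicator function t -> 1_x(t) on Z/pZ as a length-p vector."""
    v = np.zeros(p)
    v[x % p] = 1.0
    return v

def circular_convolve(f, g):
    """Direct circular convolution: (f * g)(t) = sum_s f(t - s) g(s), indices mod p."""
    n = len(f)
    return np.array([sum(f[(t - s) % n] * g[s] for s in range(n)) for t in range(n)])

x, y = 5, 4
f, g = indicator(x, p), indicator(y, p)

# 1) Convolving the two indicators gives the indicator of x + y (mod p).
direct = circular_convolve(f, g)

# 2) Convolution theorem: in the Fourier basis, convolution is a pointwise product.
via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

# 3) "Convolve with g" is the circulant matrix C; conjugating by the DFT matrix F
#    yields a diagonal matrix whose entries are the Fourier coefficients of g.
C = np.array([[g[(t - s) % p] for s in range(p)] for t in range(p)])
F = np.fft.fft(np.eye(p))            # DFT matrix: F @ v equals np.fft.fft(v)
D = F @ C @ np.linalg.inv(F)

print(np.allclose(direct, indicator(x + y, p)))
print(np.allclose(via_fft, direct))
print(np.allclose(D, np.diag(np.fft.fft(g))))
```

The same matrix `F` diagonalizes the circulant matrix of every `g`, which is the sense in which the Fourier basis diagonalizes all convolution operations simultaneously.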