Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm

carboniferous_umbraculum

These remarks are basically me just wanting to get my thoughts down after a Twitter exchange on this subject. I've not spent much time on this post and it's certainly plausible that I've gotten things wrong.

In the 'Key Takeaways' section of the Modular Addition part of the well-known post 'A Mechanistic Interpretability Analysis of Grokking' , Nanda and Lieberum write:

This algorithm operates via using trig identities and Discrete Fourier Transforms to map , and then extracting $x + y (mod p)$

And

The model is trained to map $x, y$ to $z \equiv x + y (mod 113)$ (henceforth 113 is referred to as $p$ )

But the casual reader should use caution! It is in fact the case that "Inputs $x, y$ are given as one-hot encoded vectors in $R^{p}$ ". This point is of course emphasized more in the full notebook (it has to be, that's where the code is), and the arXiv paper that followed is also much clearer about this point. However, when giving brief takeaways from the work, especially when it comes to discussing how 'natural' the learned algorithm is, I would go as far as saying that it is actually misleading to suggest that the network is literally given $x$ and $y$ as inputs. It is not trained to 'act' on the numbers $x, y$ themselves.

When thinking seriously about why the network is doing the particular thing that it is doing at the mechanistic level, I would want to emphasize that one-hotting is already a significant transformation. You have moved away from having the number $x$ be represented by its own magnitude. You instead have a situation in which $x$ and $y$ now really live 'in the domain' (its almost like a dual point of view: The number $x$ is not the size of the signal, but the position at which the input signal is non-zero).

So, while I of course fully admit that I too am looking at it through my own subjective lens, one might say that (before the embedding happens) it is more mathematically natural to think that what the network is 'seeing' as input is something like the indicator functions $t \mapsto 1_{x} (t)$ and $t \mapsto 1_{y} (t)$ . Here, $t$ is something like the 'token variable' in the sense that these are functions on the vocabulary. And if we essentially ignore the additional tokens for | and =, we can think that these are functions on the group $Z / p Z$ and that we would like the network to learn to produce the function $t \mapsto 1_{x + y} (t)$ at its output neurons.

In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.

Notice that the operation you want to learn is manifestly a convolution operation, i.e.

(1_{x} * 1_{y}) (t) = \sum s 1_{x} (t - s) 1_{y} (s) = 1_{x + y} (t) .

And (as I distinctly remember being made to practically chant in an 'Analysis of Boolean Functions' class given by Tom Sanders) the Fourier Transform is the (essentially unique) change of basis that simultaneously diagonalizes all convolution operations. This is coming close to saying something like: There is one special basis that makes the operation you want to learn uniquely easy to do using matrix multiplications, and that basis is the Fourier basis.

[-]Rohin Shah2y30

In particular, this point of view further (and perhaps almost completely) demystifies the use of the Fourier basis.

I disagree at least with the "almost completely" version of this claim:

Notice that the operation you want to learn is manifestly a convolution operation, i.e.

This also applies to the non-modular addition operation, but I think it's pretty plausible that if you train on non-modular addition (to the point of ~perfect generalization), the network would learn an embedding that converts the "tokenized" representation back into the "magnitude" representation, and then simply adds them as normal.

Some evidence for this:

I've heard it claimed that an LLM represented the number of characters in a token as a magnitude, which was used for deciding whether a line of code was > 80 characters (useful for predicting line breaks).
This paper trains on non-modular addition and gets this result. (Note however that the paper has a highly unusual setting that isn't representative of typical network training, and arguably the setup is such that you have to get this result: in particular the architecture enforces that the embeddings of the two inputs are added together, which wouldn't work in a Fourier basis. I cite it as evidence that networks do learn magnitude representations when forced to do so.)

It seems like if you believe "when the operation you want to learn is a convolution operation, then you will learn the Fourier basis", you should also believe that you'll get a Fourier basis for non-modular addition on one-hot-encoded numbers, and currently my guess is that that's false.

Fwiw, I agree that the algorithm is quite "mathematically natural" (indeed, one person came up with the algorithm independently, prompted by "how would you solve this problem if you were a Transformer"), though I feel like the "modular" part is pretty crucial for me (and the story I'd tell would be the one in Daniel's comment).

AI ALIGNMENT FORUM
AF

54

Short Remark on the (subjective) mathematical 'naturalness' of the Nanda--Lieberum addition modulo 113 algorithm

54