All of Tom Lieberum's Comments + Replies

(The quote refers to the usage of binary attention patterns in general, so I'm not sure why you're quoting it)

I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries.

iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup) 

(To show that my idea is compatible with Boolean attention.) I use a single head, and the ranks add up linearly.

I don't think that's right. Iiuc this is a logical and, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.

Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values {0,1000,2000}, the outputs will take only two values. From the RASP paper: Addition of such attention patterns corresponds to anding of such selectors.
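The numerical claim about softargmax over {0, 1000, 2000} can be checked with a small sketch. This is not tracr or RASP code: the selector rows, the `SCALE` constant, and the `softmax` helper below are hypothetical and only illustrate the arithmetic of adding two scaled Boolean score rows.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical Boolean selector rows for a single query position
# (1 = key selected, 0 = key not selected).
select_a = [1, 1, 0, 1]
select_b = [0, 1, 1, 1]

SCALE = 1000.0  # assumed large constant, so each summed score lies in {0, 1000, 2000}
scores = [SCALE * (a + b) for a, b in zip(select_a, select_b)]

# Only keys selected by both rows (score 2000) keep non-zero weight.
pattern = softmax(scores)
print(pattern)
```

In this example the non-zero weights sit exactly on the keys where both selectors are 1, shared uniformly between them. The row only matches the AND when at least one key satisfies both selectors; if none does, the largest score is 1000 and the weight falls on keys selected by either row.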

I can't speak to the option for remote work but as a counterpoint, it seems very straightforward to get a UK visa for you and your spouse/children (at least straightforward relative to the US). The relevant visa to google is the Skilled Worker / Tier 2 visa if you want to know more.

ETA: Of course, there are still legitimate reasons for not wanting to move. Just wanted to point out that the legal barrier is lower than you might think.

Michael Y. Zuo
I'm not familiar with London housing prices. Is it possible to affordably rent or mortgage a decent 2 bedroom condo within a 5 minute walk of the offices with your compensation package? (By affordable I mean less than 1/3 of total comp spent on housing, stretching to 1/2 if comp is unusually high.)

Oh, I thought figure 1 was S5, but it is actually modular division. I'll give that a go.

Here are results for modular division. Not super sure what to make of them. Small increases in learning rate work, but so does just choosing a larger learning rate from the beginning. In fact, increasing lr to 5x from the beginning works super well, but switching to 5x once grokking arguably starts just destroys any progress. 10x lr does not work, either from the start or when switching later.

So maybe the initial observation is more a general/global property of the loss landsca...
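The two regimes compared above (a larger learning rate from the start versus switching to it later) can be written as one step-dependent schedule. This is only a sketch: the function name, the base learning rate, and the switch step are hypothetical and are not the values used in the experiments.

```python
def stepped_lr(step, base_lr=1e-3, switch_step=300, factor=5.0):
    """Return base_lr before switch_step and base_lr * factor from then on.

    All defaults are placeholder values, not the settings from the report.
    """
    return base_lr * factor if step >= switch_step else base_lr

# "Larger learning rate from the beginning" is the special case switch_step=0.
from_start = [stepped_lr(s, switch_step=0) for s in range(3)]
# "Switch once grokking arguably starts" uses a later, hand-picked switch_step.
switched = [stepped_lr(s, switch_step=2) for s in range(4)]
```

In a training loop the returned value would be written into the optimizer's learning rate before each update.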

Rohin Shah
Yeah, that seems right, I think I'm basically at "no, you can't just 10x the learning rate once grokking starts".

So I ran some experiments for the permutation group S_5 with the task x o y = ?

Interestingly here increasing the learning rate just never works. I'm very confused.
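For readers unfamiliar with the task, here is a minimal sketch of how an S_5 composition dataset could be enumerated. The composition convention (apply `y` first, then `x`) and the integer token encoding are assumptions for illustration, not necessarily what the experiments used.

```python
from itertools import permutations

def compose(p, q):
    """Return p o q, i.e. apply q first, then p: (p o q)(i) = p[q[i]]."""
    return tuple(p[j] for j in q)

# The 120 elements of S_5, each a tuple giving the images of 0..4.
elements = list(permutations(range(5)))
# Map each permutation to an integer token id (hypothetical encoding).
index = {perm: i for i, perm in enumerate(elements)}

# Each example is (token for x, token for y, token for x o y).
dataset = [
    (index[x], index[y], index[compose(x, y)])
    for x in elements
    for y in elements
]
```

Unlike modular addition, this operation is not commutative, so seeing x o y during training does not reveal y o x in general.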

Rohin Shah
Also interestingly, in the default setting for these new experiments, grokking happens in ~1000 steps while memorization happens in ~1500 steps, so the grokking is already faster than the memorization, in stark contrast to the graphs in the original post. (This does depend on when you start the counter for grokking, as there's a long period of slowly increasing validation accuracy. You could reasonably say grokking took ~2500 steps.)

I updated the report with the training curves. Under default settings, 100% training accuracy is reached after 500 steps.

There is actually an overlap between the train/val curves going up. Might be an artifact of the simplicity of the task or that I didn't properly split the dataset (e.g. x+y being in train and y+x being in val). I might run it again for a harder task to verify.
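One way to rule out the x+y / y+x leak is to split by unordered pair, so that both orderings always land on the same side. The sketch below assumes modular addition with a hypothetical modulus `p`; the function name and the 50% split fraction are placeholders as well.

```python
import random

def split_by_unordered_pair(p, train_frac=0.5, seed=0):
    """Split all (x, y, (x + y) % p) examples so (x, y) and (y, x) stay together."""
    rng = random.Random(seed)
    # One representative (x <= y) per unordered pair {x, y}.
    reps = [(x, y) for x in range(p) for y in range(x, p)]
    rng.shuffle(reps)
    cut = int(train_frac * len(reps))

    def expand(pairs):
        examples = []
        for x, y in pairs:
            examples.append((x, y, (x + y) % p))
            if x != y:  # add the mirrored ordering to the same side
                examples.append((y, x, (x + y) % p))
        return examples

    return expand(reps[:cut]), expand(reps[cut:])

train, val = split_by_unordered_pair(p=7)
```

With this split, a model that has only learned commutativity gains nothing on the validation set, which makes the train/val curves easier to interpret.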

Rohin Shah
Huh, intriguing. Yeah, it might be worth running with a non-commutative function and seeing if it holds up -- it seems like in the default setting the validation accuracy hits almost 0.5 once the training accuracy is 1, which is about what you'd get if you understood commutativity but nothing else about the function. So the "grokking" part is probably happening after that, i.e. at roughly the 1.5k steps location in the default setting.

Yep I used my own re-implementation, which somehow has slightly different behavior.

I'll also note that the task in the report is modular addition while figure 1 from the paper (the one with the red and green lines for train/val) is the significantly harder permutation group task.

I'm not sure I understand.

I chose the grokking starting point as 300 steps, based on the yellow plot. I'd say it's reasonable to say that 'grokking is complete' by the 2000 step mark in the default setting, whereas it is complete by the 450 step mark in the 10x setting (assuming appropriate LR decay to avoid overshooting). Also note that the plots in the report are not log-scale.

Rohin Shah
Ah, I just looked at your plots, verified that the grokking indeed still happened with 5x and 10x learning rates, and then just assumed 10x faster convergence in the original plots in the post. Apparently that reasoning was wrong. Presumably you're using different hyperparameters than the ones used in this post? You seem to have faster grokking in the "default setting" than in the plots shown in the post. (And it does look like, given some default setting, "10x faster convergence" is basically right, since in your case 10x higher LR makes the grokking stage go from 1700 steps to 150 steps.) (Partly the issue was that I wasn't sure whether the x-axis in your plots was starting from the beginning of training, or from the point that grokking started, so I instead reasoned about the impact on the graphs in this post. Though looking at the LR plot it's now obvious that it's from the beginning of training.) I now think this is relatively strong evidence for my view, given that grokking happens pretty quickly (~a third of total training), though it probably is still decently slower than the memorization. (Do you happen to have the training loss curves, so we can estimate how long it takes to memorize under your hyperparameters?)

It would be interesting to see if, once grokking had clearly started, you could just 100x the learning rate and speed up the convergence to zero validation loss by 100x.

I ran a quick-and-dirty experiment and it does in fact look like you can just crank up the learning rate at the point where some part of grokking happens to speed up convergence significantly. See the wandb report: (read more)

Rohin Shah
Very nice! Thanks for actually running the experiment :) It's not clear to me which story this supports since 10x-ing the learning rate only brings the grokking phase down to 8×10^4 steps, which is still the majority of the training run.

While I agree that using percentages would make impact more comparable between agents and timesteps, it also leads to counterintuitive results (at least counterintuitive to me)

Consider a sequence of utilities u_0, u_1, u_2 at times 0, 1, 2 with u_0 > 0, u_1 = 0.01·u_0, and u_2 = 0.

Now the drop from u_1 to u_2 would be more dramatic (decrease by 100%) compared to the drop from u_0 to u_1 (decrease by 99%) if we were using percentages. But I think the agent should 'care more' about the larger drop in absolute utility (i.e. spend more resources to prevent it from happening) an

...
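The contrast between percentage and absolute drops can be made concrete with a short calculation. The utilities 100, 1, 0 below are hypothetical values chosen only to reproduce the 99% and 100% figures mentioned above; the helper name is also just for illustration.

```python
def drop(u_before, u_after):
    """Return (absolute drop, relative drop) between two utility values."""
    absolute = u_before - u_after
    relative = absolute / u_before
    return absolute, relative

# Hypothetical utilities at times 0, 1, 2.
u = [100.0, 1.0, 0.0]

first_abs, first_rel = drop(u[0], u[1])    # drop from time 0 to time 1
second_abs, second_rel = drop(u[1], u[2])  # drop from time 1 to time 2

print(f"0 -> 1: absolute {first_abs}, relative {first_rel:.0%}")
print(f"1 -> 2: absolute {second_abs}, relative {second_rel:.0%}")
```

Under the percentage measure the second drop ranks higher, while under the absolute measure the first drop is 99 times larger.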