All of Tom Lieberum's Comments + Replies

If paraphrasing was a bijection between semantically-equivalent sentences, then I agree you could learn the structure in the bijection space. Is that what you were pointing at?

 

Yes exactly. I further claim that non-chunk paraphrasing seems more likely to be closer to the bijection end of the spectrum than chunk paraphrasing, though I agree that it could still be a many-to-one mapping (i.e. standardization).
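To make the bijection-vs-standardization distinction concrete, here is a toy sketch (my own illustration, not code from the post, borrowing the enumerated-list/equation example from Fabien's reply below): a bijective remapping of syntax preserves all of the mutual information between syntax and hidden content, while randomizing the syntax at sampling time destroys it.

```python
from collections import Counter
import math
import random

random.seed(0)

def mutual_information(pairs):
    """I(S; H) in bits, estimated from a list of (syntax, hidden_content) samples."""
    n = len(pairs)
    p_sh = Counter(pairs)
    p_s = Counter(s for s, _ in pairs)
    p_h = Counter(h for _, h in pairs)
    mi = 0.0
    for (s, h), c in p_sh.items():
        p_joint = c / n
        mi += p_joint * math.log2(p_joint / ((p_s[s] / n) * (p_h[h] / n)))
    return mi

# Hidden content is perfectly encoded in syntax: enumerated <-> equation, bulleted <-> other.
data = [("enumerated", "equation") if random.random() < 0.5 else ("bulleted", "other")
        for _ in range(10_000)]

bijection = {"enumerated": "bulleted", "bulleted": "enumerated"}   # swap the two styles
swapped = [(bijection[s], h) for s, h in data]
randomized = [(random.choice(["enumerated", "bulleted"]), h) for _, h in data]

print(f"original:   {mutual_information(data):.3f} bits")        # ~1.0
print(f"bijection:  {mutual_information(swapped):.3f} bits")     # ~1.0 (link survives)
print(f"randomized: {mutual_information(randomized):.3f} bits")  # ~0.0 (link broken)
```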

I am especially concerned about this when training on the final answer.

Maybe my question just boils down to

How do you distinguish between "paraphra

... (read more)
2Fabien Roger
I agree this is a concern. My view here is that if you look at the paraphrases, it seems that a decent amount of standardization happened. Empirically, it also looks like T=1 sampling of the paraphraser introduces random variations in the syntax. So I would be very surprised if a decent chunk of the syntactic link was not broken (e.g. at least 20% of it).

Therefore I think it's reasonable to bound the performance due to encoded reasoning to ~5x the gap between the original distillation and the paraphrased distillation, which would be ~7% (in absolute pass@1). That still seems not very frightening to me given that the no-CoT vs CoT gap is >50%.

This is sensitive to the size of the gap. If future work finds a ~5% gap on a new model, I'd be very concerned because it would make it more likely that the majority of the reasoning is encoded.

(Edit: This is a rough sketch of the argument; there are complications around diminishing returns on reasoning: maybe only a very small amount of visible reasoning is responsible for the majority of the no-CoT vs CoT gap. But it's unclear how much this undermines my argument, because I care most about the ability of the model to not leak anything about what it is reasoning about.)
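As a rough sketch of the arithmetic behind this bound (the distillation gap of ~1.4% is back-solved from the ~5x and ~7% figures above, not a number quoted directly in the post):

```python
# Back-of-the-envelope version of the bound above. All numbers are assumptions
# taken or inferred from the comment, not exact experimental results.
broken_fraction = 0.20     # assume paraphrasing breaks at least 20% of the syntactic link
distill_gap = 0.014        # inferred original-vs-paraphrased distillation gap (absolute pass@1)
no_cot_vs_cot_gap = 0.50   # the no-CoT vs CoT gap is stated to be >50%

# If only broken_fraction of the encoded-reasoning channel is disrupted, the full
# channel can account for at most distill_gap / broken_fraction of pass@1.
bound = distill_gap / broken_fraction
print(f"bound on encoded reasoning: {bound:.1%}")                                # ~7%
print(f"fraction of the CoT uplift it could explain: {bound / no_cot_vs_cot_gap:.0%}")  # ~14%
```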

Great experiments!

Questions: 

  1. To clarify, when distilling did you also train the model to predict the final answer or only the CoT? I assume the latter but it wasn't 100% clear from the text.
  2. What's the variance on the pass@1 performance? (e.g. via bootstrap sampling) You say the performance of "chunked paraphrasing" is worse, but the degradation looks pretty minor, and without knowing the variance it's hard to know whether this is a meaningful difference.
  3. "We find that the pass@1 results are worse, especially for small models." Do you mean large models
... (read more)
2Fabien Roger
1. Both (the final answers are usually very short compared to the CoT).

2. The 1-sigma uncertainty over the mean is sigma = sqrt(acc * (1-acc) / n) ~ 1%. But I think the results are slightly more significant than this suggests, as I evaluate all models on the exact same set of questions; and given that chunk-paraphrased is worse across experiments on multiple models, I would guess that the same effect would exist for a much larger eval.

3. Good catch, I think I wrote that based on earlier results. I'll edit the post.

On your notes ("If paraphrasing was a bijection between semantically-equivalent sentences, then I agree you could learn the structure in the bijection space. Is that what you were pointing at?"):

My hope is that paraphrasing removes the mutual information through standardization / randomization (where what is random is not the map, but the actual samples, because I am sampling at temperature 1). For example, if enumerated list => equation and bulleted list => sth else, then swapping enumerated and bulleted would not help, but having the model always use a bulleted list, or use a bulleted list a random 50% of the time, does help to break the enumerated list <=> equation link. I agree that it's unclear where paraphrasing is on the [bijection]-[no syntax left unrandomized] spectrum.
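For reference, a minimal sketch of the standard-error calculation quoted in point 2 (acc and n are placeholder values, since the exact eval size isn't given here):

```python
import math

# 1-sigma uncertainty on a pass@1 mean, assuming n independent binary outcomes.
acc = 0.7    # assumed pass@1 accuracy
n = 2000     # assumed number of eval questions

sigma = math.sqrt(acc * (1 - acc) / n)
print(f"1-sigma uncertainty on the mean: {sigma:.3f}")   # ~0.010, i.e. ~1%
```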

(The quote refers to the usage of binary attention patterns in general, so I'm not sure why you're quoting it)

I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries.

iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup).

1Gurkenglas
(To show my idea is compatible with Boolean attention.) I use a single head, and the ranks add up linearly.

I don't think that's right. Iiuc this is a logical AND, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.

1Gurkenglas
Appendix C attempts to discard the softargmax, but it's an integral part of my suggestion. If the inputs to softargmax take values in {0, 1000, 2000}, the outputs will take only two values. From the RASP paper (https://arxiv.org/pdf/2106.06981): addition of such attention patterns corresponds to ANDing of such selectors.
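A toy numpy sketch of this point (my own illustration, not tracr code): saturating pre-softmax scores in {0, 1000, 2000} yield an effectively binary attention pattern, and adding two such score matrices behaves like ANDing the underlying Boolean selectors whenever the AND is non-empty.

```python
import numpy as np

def selector_scores(selector, scale=1000.0):
    """Map a Boolean selector (1 = attend, 0 = don't) to saturating scores."""
    return scale * selector.astype(float)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

sel_a = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [1, 0, 1]])   # first Boolean selector (toy 3-position example)
sel_b = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 1]])   # second Boolean selector

# Adding the two score matrices gives values in {0, 1000, 2000}; after softmax,
# essentially all attention mass sits on positions selected by BOTH selectors.
scores = selector_scores(sel_a) + selector_scores(sel_b)
pattern = softmax(scores)

print(np.round(pattern, 3))
print("AND of the selectors:\n", sel_a & sel_b)
```

In each row, the nonzero entries of the resulting pattern coincide (up to normalization) with the AND of the two selectors, which is the behavior at issue in this exchange.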

I can't speak to the option for remote work, but as a counterpoint, it seems very straightforward to get a UK visa for you and your spouse/children (at least straightforward relative to the US). If you want to know more, the relevant visa to look up is the Skilled Worker / Tier 2 visa.

ETA: Of course, there are still legitimate reasons for not wanting to move. Just wanted to point out that the legal barrier is lower than you might think.

0Michael Y. Zuo
I'm not familiar with London housing prices. Is it possible to affordably rent or mortgage a decent 2 bedroom condo within a 5 minute walk of the offices with your compensation package? (By affordable I mean less than 1/3 of total comp spent on housing, stretching to 1/2 if comp is unusually high.)

Oh, I thought figure 1 was S5, but it actually is modular division. I'll give that a go.

Here are results for modular division. Not super sure what to make of them. Small increases in learning rate work, but so does just choosing a larger learning rate from the beginning. In fact, increasing lr to 5x from the beginning works super well, but switching to 5x once grokking arguably starts just destroys any progress. 10x lr from the start does not work (nor does switching to it later).
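For concreteness, a minimal sketch (not the actual grokking codebase) of how one might bump the learning rate at a chosen step; GROKKING_STEP and LR_FACTOR are illustrative, and the model and loss are placeholders.

```python
import torch

GROKKING_STEP = 300   # illustrative: step at which grokking is judged to start
LR_FACTOR = 5.0       # illustrative: 5x bump, as in the report

model = torch.nn.Linear(128, 128)   # stand-in for the actual transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

def maybe_bump_lr(optimizer, step):
    """Multiply the learning rate of every param group once, at GROKKING_STEP."""
    if step == GROKKING_STEP:
        for group in optimizer.param_groups:
            group["lr"] *= LR_FACTOR

for step in range(2000):
    maybe_bump_lr(optimizer, step)
    x = torch.randn(64, 128)            # placeholder batch
    loss = model(x).pow(2).mean()       # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```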

So maybe the initial observation is more a general/global property of the loss landsca... (read more)

1Rohin Shah
Yeah, that seems right, I think I'm basically at "no, you can't just 10x the learning rate once grokking starts".

So I ran some experiments for the permutation group S_5 with the task x o y = ?

Interestingly, here increasing the learning rate just never works. I'm very confused.

3Rohin Shah
Also interestingly, in the default setting for these new experiments, grokking happens in ~1000 steps while memorization happens in ~1500 steps, so the grokking is already faster than the memorization, in stark contrast to the graphs in the original post. (This does depend on when you start the counter for grokking, as there's a long period of slowly increasing validation accuracy. You could reasonably say grokking took ~2500 steps.)

I updated the report with the training curves. Under default settings, 100% training accuracy is reached after 500 steps.

There is actually an overlap between the train/val curves going up. Might be an artifact of the simplicity of the task or that I didn't properly split the dataset (e.g. x+y being in train and y+x being in val). I might run it again for a harder task to verify.
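A sketch of one way to avoid that kind of leak (hypothetical code, not the report's implementation): split over unordered pairs, so that x+y and y+x always land on the same side. P and the train fraction are illustrative.

```python
import random

P = 97              # illustrative modulus
FRAC_TRAIN = 0.5    # illustrative train fraction

# Each unordered pair {x, y} is assigned to train or val as a unit.
keys = list({frozenset((x, y)) for x in range(P) for y in range(P)})
random.seed(0)
random.shuffle(keys)
cut = int(FRAC_TRAIN * len(keys))

def expand(keys):
    """Turn unordered pairs back into ordered (x, y, (x + y) % P) examples."""
    data = []
    for k in keys:
        xs = sorted(k)
        x, y = (xs[0], xs[0]) if len(xs) == 1 else (xs[0], xs[1])
        data.append((x, y, (x + y) % P))
        if x != y:
            data.append((y, x, (y + x) % P))
    return data

train, val = expand(keys[:cut]), expand(keys[cut:])
```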

3Rohin Shah
Huh, intriguing. Yeah, it might be worth running with a non-commutative function and seeing if it holds up -- it seems like in the default setting the validation accuracy hits almost 0.5 once the training accuracy is 1, which is about what you'd get if you understood commutativity but nothing else about the function. So the "grokking" part is probably happening after that, i.e. at roughly the 1.5k steps location in the default setting.
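A quick simulation of that back-of-the-envelope (hypothetical, assuming a random 50/50 split over ordered pairs): a model that answers f(x, y) correctly exactly when f(y, x) appeared in training lands at roughly 0.5 validation accuracy.

```python
import random

P = 97   # illustrative modulus
random.seed(0)

pairs = [(x, y) for x in range(P) for y in range(P) if x != y]
random.shuffle(pairs)
train = set(pairs[: len(pairs) // 2])
val = pairs[len(pairs) // 2 :]

# Fraction of validation pairs whose transpose was seen in training.
acc = sum((y, x) in train for (x, y) in val) / len(val)
print(f"val accuracy from commutativity alone: {acc:.3f}")   # ~0.5
```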

Yep I used my own re-implementation, which somehow has slightly different behavior.

I'll also note that the task in the report is modular addition while figure 1 from the paper (the one with the red and green lines for train/val) is the significantly harder permutation group task.

I'm not sure I understand.

I chose the grokking starting point as 300 steps, based on the yellow plot. I'd say it's reasonable to say that 'grokking is complete' by the 2000-step mark in the default setting, whereas it is complete by the 450-step mark in the 10x setting (assuming appropriate LR decay to avoid overshooting). Also note that the plots in the report are not log-scale.

3Rohin Shah
Ah, I just looked at your plots, verified that the grokking indeed still happened with 5x and 10x learning rates, and then just assumed 10x faster convergence in the original plots in the post. Apparently that reasoning was wrong.

Presumably you're using different hyperparameters than the ones used in this post? You seem to have faster grokking in the "default setting" than in the plots shown in the post. (And it does look like, given some default setting, "10x faster convergence" is basically right, since in your case 10x higher LR makes the grokking stage go from 1700 steps to 150 steps.)

(Partly the issue was that I wasn't sure whether the x-axis in your plots was starting from the beginning of training, or from the point that grokking started, so I instead reasoned about the impact on the graphs in this post. Though looking at the LR plot it's now obvious that it's from the beginning of training.)

I now think this is relatively strong evidence for my view, given that grokking happens pretty quickly (~a third of total training), though it probably is still decently slower than the memorization. (Do you happen to have the training loss curves, so we can estimate how long it takes to memorize under your hyperparameters?)

It would be interesting to see if, once grokking had clearly started, you could just 100x the learning rate and speed up the convergence to zero validation loss by 100x.

I ran a quick-and-dirty experiment and it does in fact look like you can just crank up the learning rate at the point where some part of grokking happens to speed up convergence significantly. See the wandb report:

https://wandb.ai/tomfrederik/interpreting_grokking/reports/Increasing-Learning-Rate-at-Grokking--VmlldzoxNTQ2ODY2?accessToken=y3f00qfxot60n709pu8d049wgci69g53pki6pq6khsemnncca1... (read more)

2Rohin Shah
Very nice! Thanks for actually running the experiment :) It's not clear to me which story this supports since 10x-ing the learning rate only brings the grokking phase down to 8×10^4 steps, which is still the majority of the training run.

While I agree that using percentages would make impact more comparable between agents and timesteps, it also leads to counterintuitive results (at least counterintuitive to me).

Consider a sequence of utilities at times 0, 1, 2 with u_0 = 100, u_1 = 1, and u_2 = 0.

Now the drop from u_1 to u_2 would be more dramatic (decrease by 100%) compared to the drop from u_0 to u_1 (decrease by 99%) if we were using percentages. But I think the agent should 'care more' about the larger drop in absolute utility (i.e. spend more resources to prevent it from happening) an

... (read more)
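For reference, the numbers behind the example above (the specific values are illustrative ones consistent with the stated 99% and 100% drops):

```python
# Illustrative utility sequence consistent with the 99% / 100% drops described above.
u0, u1, u2 = 100.0, 1.0, 0.0

absolute_drops = [u0 - u1, u1 - u2]                 # 99.0 and 1.0
relative_drops = [(u0 - u1) / u0, (u1 - u2) / u1]   # 0.99 and 1.0

# The second drop looks larger in percentage terms (100% vs 99%) even though it
# is far smaller in absolute utility (1 vs 99), which is the counterintuitive part.
print(absolute_drops, relative_drops)
```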