Tom Lieberum

Research Engineer at DeepMind, focused on mechanistic interpretability and large language models. Opinions are my own.


Comments


> If paraphrasing was a bijection between semantically-equivalent sentences, then I agree you could learn the structure in the bijection space. Is that what you were pointing at?


Yes, exactly. I further claim that non-chunked paraphrasing seems more likely to be closer to the bijection end of the spectrum than chunked paraphrasing, though I agree that it could still be a many-to-one mapping (i.e. standardization).

I am especially concerned about this when training on the final answer.

Maybe my question just boils down to:

How do you distinguish between "paraphrasing didn't break the syntactic <-> semantic link" and "the model isn't relying on that link / the link doesn't exist in the first place"?

Great experiments!

Questions: 

  1. To clarify, when distilling did you also train the model to predict the final answer or only the CoT? I assume the latter but it wasn't 100% clear from the text.
  2. What's the variance on the pass@1 performance (e.g. via bootstrap sampling; see the sketch after these questions)? You say the performance of "chunked paraphrasing" is worse, but the degradation looks pretty minor, and without knowing the variance it's hard to know whether this is a meaningful difference.
  3. "We find that the pass@1 results are worse, especially for small models." Do you mean large models?


Notes:

I'm somewhat confused why you claim that paraphrasing the whole CoT (i.e. non-chunked) breaks the causal link between syntax(t=0) and semantics(t=1) if you then train on the CoT. I'll try to paraphrase (ha!) what's going on in my own words:

Let $x$ be the original CoT and $f(x)$ the paraphrase. Let $y$ be the correct final answer. Then $f(x)$ might still contain the same structure as $x$, e.g. the mutual information between $x$ and $f(x)$ is high. If you now trained a model to predict $y \mid f(x)$, I would guess it will learn to pick up on any steganographic structure that is present and predictive of $y$.

If you trained a model to only predict the CoT, i.e. $f(x, t=1) \mid f(x, t=0)$, I would still guess the model will learn to pick up on the structure if it is present and helpful. E.g. if previously the model always put an equation after outputting an enumerated list, and now it does so after a bullet-point list because $f$ converted all enumerations into bullet points.
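 
To make that distinction concrete, here is a minimal sketch (my own illustration, not the post's actual setup) of the two objectives as a loss-masking choice; `cot_mask`, `answer_mask`, and `train_on_answer` are hypothetical names:

```python
import torch
import torch.nn.functional as F

def distillation_loss(logits, targets, cot_mask, answer_mask, train_on_answer):
    """Cross-entropy over the paraphrased CoT tokens f(x), optionally also over
    the final-answer tokens y. logits: [T, V]; targets: next-token labels [T],
    already aligned with logits; masks: bool [T]."""
    per_token = F.cross_entropy(logits, targets, reduction="none")    # [T]
    mask = (cot_mask | answer_mask) if train_on_answer else cot_mask
    mask = mask.float()
    return (per_token * mask).sum() / mask.sum()
```

When `train_on_answer=True` the loss includes the $y \mid f(x)$ term discussed above; when `False`, only the $f(x, t=1) \mid f(x, t=0)$ term remains.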

So this seems to hinge on $f$ choosing incoherent syntax, e.g. what you observe in the chunk-paraphrase setting, but it's not clear to me that this is the behavior we should expect from the non-chunked paraphraser.

(The quote refers to the usage of binary attention patterns in general, so I'm not sure why you're quoting it)

I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries.
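 
For concreteness, a quick numerical check (plain numpy, using the {0, 1000, 2000} scores above):

```python
import numpy as np

scores = np.array([0.0, 1000.0, 2000.0])
weights = np.exp(scores - scores.max())  # numerically stable softmax
weights /= weights.sum()
print(weights)  # [0. 0. 1.] (up to underflow): gaps of 1000 saturate the softmax
                # to effectively binary attention weights.
```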

Iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup).

I don't think that's right. Iiuc this is a logical AND, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem, see appendix C.

I can't speak to the option for remote work, but as a counterpoint, it seems very straightforward to get a UK visa for you and your spouse/children (at least straightforward relative to the US). If you want to know more, the relevant visa to google is the Skilled Worker / Tier 2 visa.

ETA: Of course, there are still legitimate reasons for not wanting to move. Just wanted to point out that the legal barrier is lower than you might think.

Oh, I thought figure 1 was S5, but it's actually modular division. I'll give that a go.

Here are results for modular division. I'm not super sure what to make of them. Small increases in learning rate work, but so does just choosing a larger learning rate from the beginning. In fact, increasing the lr to 5x from the beginning works super well, but switching to 5x once grokking arguably starts just destroys any progress. 10x lr from the start does not work (nor does switching to it later).

So maybe the initial observation is more a general/global property of the loss landscape for the task and not of the particular region during grokking?
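 
For reference, the mid-run switch can be implemented as in this sketch (the tiny model and dummy objective are placeholders for the grokking setup; `switch_step` and the 5x factor mirror the numbers above):

```python
import torch

# Sketch: scale the optimizer's learning rate by `factor` at `switch_step`.
model = torch.nn.Linear(10, 10)
base_lr = 1e-3                      # stand-in for the default run's learning rate
opt = torch.optim.AdamW(model.parameters(), lr=base_lr)
switch_step, factor = 300, 5.0      # e.g. jump to 5x once grokking arguably starts

for step in range(1000):
    if step == switch_step:
        for group in opt.param_groups:
            group["lr"] = base_lr * factor
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()   # dummy objective
    opt.zero_grad()
    loss.backward()
    opt.step()
```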

So I ran some experiments for the permutation group $S_5$ with the task $x \circ y = ?$

Interestingly, increasing the learning rate here just never works. I'm very confused.
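 
For reference, the task data can be generated along these lines (my own minimal sketch; the real setup tokenizes the pairs, and the seed/split fraction here are arbitrary):

```python
from itertools import permutations
import random

# Composition table of S_5: a permutation is a tuple p, and (p ∘ q)(i) = p[q[i]].
elements = list(permutations(range(5)))       # the 120 elements of S_5

def compose(p, q):
    return tuple(p[q[i]] for i in range(5))

examples = [((x, y), compose(x, y)) for x in elements for y in elements]

random.seed(0)
random.shuffle(examples)
cut = len(examples) // 2
train, val = examples[:cut], examples[cut:]
```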

I updated the report with the training curves. Under default settings, 100% training accuracy is reached after 500 steps.

There is actually an overlap between the train and val curves going up. This might be an artifact of the simplicity of the task, or of me not properly splitting the dataset (e.g. x+y being in train and y+x being in val). I might run it again for a harder task to verify.
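 
A minimal sketch of the kind of split that would rule that out, keeping x+y and y+x in the same split (p = 97 and the 50/50 split are placeholders):

```python
import random

p = 97  # placeholder modulus for the modular-addition task
# Group each unordered pair {x, y} so x+y and y+x always land in the same split.
groups = [
    list({(x, y, (x + y) % p), (y, x, (x + y) % p)})
    for x in range(p) for y in range(x, p)
]
random.seed(0)
random.shuffle(groups)
cut = len(groups) // 2
train = [ex for g in groups[:cut] for ex in g]
val = [ex for g in groups[cut:] for ex in g]
```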

Yep, I used my own re-implementation, which somehow has slightly different behavior.

I'll also note that the task in the report is modular addition while figure 1 from the paper (the one with the red and green lines for train/val) is the significantly harder permutation group task.

I'm not sure I understand.

I chose the grokking starting point as 300 steps, based on the yellow plot. I'd say it's reasonable to say that 'grokking is complete' by the 2000-step mark in the default setting, whereas it is complete by the 450-step mark in the 10x setting (assuming appropriate LR decay to avoid overshooting). Also note that the plots in the report are not log-scale.
