> Does this mean that this fine-tuning process can be thought of as training a NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws?
My guess is that the answer is mostly yes (maybe not the exact numbers predicted by existing scaling laws, but in a similar ballpark).
> how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?
I think this is mostly irrelevant to timelines / previous scaling laws for transfer:
You probably could come to a unified view that incorporates both this new paper and previous scaling-law papers, but I expect you'd need to spend a bunch of time getting into the minutiae of the two methods. (Probably high tens to low hundreds of hours.)
Thanks! Your answer no. 2 is especially convincing to me; I didn't realize the authors used smaller models as the comparison--that seems like an unfair comparison! I would like to see how well these 0.1%-tuned transformers do compared to similarly-sized transformers trained from scratch.
I think it's plausible that the data dependence will act like it's 3 OOMs smaller. Compute dependence will be different, though, right? Even if you're just fine-tuning part of the model, you still have to run the whole thing to evaluate it (rough FLOP numbers in the sketch at the end of this comment). In a sense this actually seems like the worst of both worlds (but you get the benefit from pretraining).
Edit: Actually, I'm confused why you say a smaller model needs that factor fewer steps. I thought the slope on that one was actually quite gentle. It's just that smaller models are cheap - or am I getting it wrong?
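A quick way to see the "you still have to run the whole thing" point, using back-of-the-envelope FLOP counts that I'm assuming here (roughly 2N FLOPs per token for a forward pass and 6N for forward plus backward; illustrative numbers, not something established in this thread):

```python
# Rough sketch: freezing most of the model does not shrink the cost of each
# training step, because the forward pass (and most of the backward pass)
# still goes through every parameter. Numbers are assumptions for illustration.

N_total = 1e14      # total parameters the forward/backward pass must run through
N_trainable = 1e11  # hypothetical ~0.1% slice that actually receives updates

per_token_forward = 2 * N_total       # unchanged by freezing parameters
per_token_step = 6 * N_total          # backward still flows through the frozen layers
per_step_updates = 10 * N_trainable   # optimizer update: a handful of FLOPs per trainable parameter

print(f"forward FLOPs per token:     {per_token_forward:.0e}")
print(f"train-step FLOPs per token:  {per_token_step:.0e}")
print(f"optimizer-update FLOPs:      {per_step_updates:.0e}")
```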
I think compute cost equals data × parameters, so even if the parameter count is the same, if the data requirement is 3 OOMs smaller, then the compute cost will be 3 OOMs smaller.
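Spelling that out, and combining it with the per-step picture above, using the common C ≈ 6·N·D approximation for total training compute (my assumption; the exact constant doesn't matter for the order-of-magnitude comparison):

```python
# Back-of-the-envelope: total training compute C ~ 6 * N * D
# (N = parameters run through on every step, D = data points/tokens).
# If fine-tuning needs 3 OOMs less data at the same N, total compute
# drops by 3 OOMs even though each step is just as expensive.

N = 1e14                      # parameter count (full model, even when fine-tuning)
D_scratch = 1e13              # data needed to train from scratch (per the estimate above)
D_finetune = D_scratch / 1e3  # hypothetical: 3 OOMs less data for fine-tuning

C_scratch = 6 * N * D_scratch
C_finetune = 6 * N * D_finetune

print(f"compute from scratch: {C_scratch:.1e} FLOPs")
print(f"compute to fine-tune: {C_finetune:.1e} FLOPs")
print(f"ratio: {C_scratch / C_finetune:.0f}x")  # 1000x, i.e. 3 OOMs
```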
I'm not sure I understand your edit question. I'm referring to the scaling laws as discussed and interpreted by Ajeya. Perhaps part of what's going on is that, in the range of model sizes we've explored so far, bigger models only need a little bit more data, because bigger models are more data-efficient. But very soon it is prophesied that this will stop and we will transition to a slower scaling ...
The scaling laws, at least according to the interpretation used in Ajeya's framework (which seems to be basically endorsed by tons of people I respect on this matter), say roughly that if you increase parameter count by an order of magnitude, you also need to increase training steps/data points by about an order of magnitude, or else you are wasting your compute and could get the same performance with a smaller parameter count. For example, for a 10^14-parameter model (the size of the human brain, basically) we'd need about 10^13 training steps/data points.
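To put toy numbers on that heuristic, here is a sketch that treats the scaling as exactly linear and anchors it at the 10^14-parameter / 10^13-data-point example above (the fitted exponents in the scaling-law papers are actually a bit below 1, so this is only illustrative):

```python
# Toy illustration of the "1 OOM more parameters -> ~1 OOM more data" reading
# of the scaling laws described above. Exactly linear scaling is an
# approximation; the fitted exponents are somewhat below 1.

def data_needed(params, ref_params=1e14, ref_data=1e13):
    """Data points for compute-efficient training, anchored at 10^14 params -> 10^13 data."""
    return ref_data * (params / ref_params)

for params in [1e12, 1e13, 1e14]:
    print(f"{params:.0e} params -> ~{data_needed(params):.0e} data points")
# 1e+12 params -> ~1e+11 data points
# 1e+13 params -> ~1e+12 data points
# 1e+14 params -> ~1e+13 data points
```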
Now we have papers like this one claiming that pre-trained transformers can be fine-tuned to do well at completely different tasks (incl. different modalities!) by only modifying 0.1% of the parameters.
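For concreteness, here is a rough PyTorch sketch of what "fine-tune only ~0.1% of the parameters" can look like: freeze everything except the layer norms. This is my own illustrative recipe, not necessarily the exact procedure from the paper, and a real experiment would of course start from actual pretrained weights rather than the randomly initialized stand-in below:

```python
import torch
from torch import nn

# Stand-in for a pretrained transformer (a real use would load pretrained weights).
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)

# Freeze everything...
for p in model.parameters():
    p.requires_grad = False

# ...then unfreeze only the LayerNorm parameters.
for module in model.modules():
    if isinstance(module, nn.LayerNorm):
        for p in module.parameters():
            p.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")

# The optimizer only sees the small trainable subset, but every forward/backward
# pass still runs through the full model.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```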
Does this mean that this fine-tuning process can be thought of as training a NN that is 3 OOMs smaller, and thus needs 3 OOMs fewer training steps according to the scaling laws? I'm guessing the answer is no, but I don't know why, so I'm asking.
(If the answer is yes, how does that not contradict the scaling laws for transfer described here and used in this calculation by Rohin?)