The implication depends on the distribution of lottery tickets. If there is a short-tailed distribution, then the rewards of scaling will be relatively small; bigger would still get better, but very slowly. A long-tailed distribution, on the other hand, would suggest continued returns to getting more lottery tickets.
I ask a question here about what's true in practice.
I wouldn't say the scaling hypothesis is purely about Transformers. Quite a few of my examples are RNNs, and it's unclear how much of a difference there is between RNNs and Transformers anyway. Transformers just appear to be a sweet spot in terms of power while being still efficiently optimizable on contemporary GPUs. CNNs for classification definitely get better with scale and do things like disentangle & transfer & become more robust as they get bigger (example from today), but whether they start exhibiting any meta-learning specifically I don't know.
When the network is randomly initialized, there is a sub-network that is already decent at the task.
From what I can tell, the paper doesn't demonstrate this--i.e. I don't think they ever test the performance of a sub-network with random weights (rather they test the performance of a subnetwork after training only the subnetwork). Though maybe this isn't what you meant, in which case you can ignore me :)
The original paper doesn't demonstrate this but later papers do, or at least claim to. Here are several papers with quotes:
https://arxiv.org/abs/2103.09377
"In this paper, we propose (and prove) a stronger Multi-Prize Lottery Ticket Hypothesis:
A sufficiently over-parameterized neural network with random weights contains several subnetworks (winning tickets) that (a) have comparable accuracy to a dense target network with learned weights (prize 1), (b) do not require any further training to achieve prize 1 (prize 2), and (c) is robust to extreme forms of quantization (i.e., binary weights and/or activation) (prize 3)."
https://arxiv.org/abs/2006.12156
"An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network."
https://arxiv.org/abs/2006.07990
The strong {\it lottery ticket hypothesis} (LTH) postulates that one can approximate any target neural network by only pruning the weights of a sufficiently over-parameterized random network. A recent work by Malach et al. \cite{MalachEtAl20} establishes the first theoretical analysis for the strong LTH: one can provably approximate a neural network of width d and depth l, by pruning a random one that is a factor O(d4l2) wider and twice as deep. This polynomial over-parameterization requirement is at odds with recent experimental research that achieves good approximation with networks that are a small factor wider than the target. In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. We show that any target network of width d and depth l can be approximated by pruning a random network that is a factor O(log(dl)) wider and twice as deep.
None of those quotes claim that training just reinforces the 'winning tickets'. Also those are referred to as the "strong" or "multi-ticket" LTH.
Yeah, fair enough. I should amend the title of the question. Re: reinforcing the winning tickets: Isn't that implied? If it's not implied, would you not agree that it is happening? Plausibly, if there is a ticket at the beginning that does well at the task, and a ticket at the end that does well at the task, it's reasonable to think that it's the same ticket? Idk, I'm open to alternative suggestions now that you mention it...
Re: reinforcing the winning tickets: Isn't that implied? If it's not implied, would you not agree that it is happening?
I don't think it's implied, and I'm not confident that it's happening. There are lots of neural networks!
Hmmm, ok. Can you say more about why? Isn't the simplest explanation that the two tickets are the same?
I expect that there are probably a bunch of different neural networks that perform well at a given task. We sort of know this because you can train a dense neural network to high accuracy, and also prune it to get a definitely-different neural network that also has high accuracy. Is it the case that these sparse architectures are small enough that there's only one optimum? Maybe, but IDK why I'd expect that.
Whoa, the thing you are arguing against is not at all what I had been saying -- but maybe it was implied by what I was saying and I just didn't realize it? I totally agree that there are many optima, not just one. Maybe we are talking past each other?
(Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning! So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!)
I guess I'm imagining that 'by default', your distribution over which optimum SGD reaches should be basically uniform, and you need a convincing story to end up believing that it reliably gets to one specific optimum.
So for them not to be the same, the training process would need to kill the first ticket and then build a new ticket on exactly the same spot!
Yes, that's exactly what I think happens. Training takes a long time, and I expect the weights in a 'ticket' to change based on the weights of the rest of the network (since those other weights have similar magnitude). I think the best way to see why I think that is to manually run thru the backpropagation algorithm.
If I'm wrong, it's probably because of this paper that I don't have time to read over right now (but that I do recommend you read).
Oh here's where I think things went wrong:
Part of why I think the two tickets are the same is that the at-initialization ticket is found by taking the after-training ticket and rewinding it to the beginning!
This is true in the original LTH paper, but there the "at-initialization ticket" doesn't actually perform well: it's just easy to train to high performance.
In the multi-prize LTH paper, it is the case that the "at-initialization ticket" performs well, but they don't find it by winding back the weights of a trained pruned network.
If you got multi-prize at-initialization tickets by winding back the weights of a trained pruned network, I would find that pretty convincing - the idea that they'd be totally different networks would seem like too much of a coincidence. But I would still want to actually check whether the weights were actually the same (which funnily enough isn't trivial if you're not familiar with a little-discussed symmetry of DNNs: for a hidden layer neuron with a ReLU activation function, you can scale the input weights up by a positive constant and the output weights down by the same constant without changing the functioning of the network).
OH this indeed changes everything (about what I had been thinking) thank you! I shall have to puzzle over these ideas some more then, and probably read the multi-prize paper more closely (I only skimmed it earlier)
Ah to be clear I am entirely basing my comments off of reading the abstracts (and skimming the multi-prize paper with an eye one develops after having been a ML PhD student for mumbles indistinctly years).
The lottery ticket hypothesis, as I (vaguely) understand it, is that artificial neural networks tend to work in the following way: When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.
[EDIT: This understanding goes beyond what the original paper proved, it draws from things proved (or allegedly proved) in later papers. See thread below. EDIT EDIT: Daniel Filan has now convinced me that my understanding of the LTH as expressed above was importantly wrong, or at least importantly goes-beyond-the-evidence.]
By the scaling hypothesis I mean that in the next five years, many other architectures besides the transformer will also be shown to get substantially better as they get bigger. I'm also interested in defining it differently, as whatever Gwern is talking about.