On March 29th, DeepMind published a paper, "Training Compute-Optimal Large Language Models", that shows that essentially everyone -- OpenAI, DeepMind, Microsoft, etc. -- has been training large language models with a deeply suboptimal use of compute.
Following the new scaling laws that they propose for the optimal use of compute, DeepMind trains a new, 70-billion parameter model that outperforms much larger language models, including the 175-billion parameter GPT-3 and DeepMind's own 280-billion parameter "Gopher".
I'm going to walk through the background of the now-falsified scaling laws from prior to this paper; then I'm going to describe the new laws given by this paper, and why they weren't found earlier; and finally I'll briefly mention some possible implications of this paper.
Independently of the consequences -- this paper is exciting! Machine learning researchers thought they knew laws about how to scale compute optimally, and the laws turned out to be wrong! It's a nice clear instance of science-functioning-in-ways-it-should in ML.
Background
In 2020 OpenAI proposed scaling laws which have since been used (at least implicitly) to guide the training of large models.
These scaling laws attempt to answer several questions. One of these questions is "Given a certain quantity of compute, how large of a model should I train in order to get the best possible performance?"
The answer isn't "as large a model as possible" because, for a fixed quantity of compute, a larger model must be trained on less data. So training a 1-million parameter model on 10 books takes about as many floating point operations (FLOPs) as training a 10-million parameter model on one book.
In the case of very large language models like GPT-3, these alternatives look more like training a 20-billion parameter model on 40% of an archive of the Internet, or training a 200-billion parameter model on 4% of an archive of the Internet, or any of an infinite number of points along the same boundary.
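To make the tradeoff concrete, here is a minimal sketch. It assumes a common rule of thumb that this post doesn't state: training costs roughly 6 FLOPs per parameter per token. The function names and the 400-billion-token budget are made up for illustration.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost, ASSUMING ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """How many tokens a model of a given size can see within a fixed budget."""
    return flops_budget / (6.0 * n_params)


# Hypothetical budget: whatever it costs to train 20B parameters on 400B tokens.
budget = training_flops(20e9, 400e9)

small_model_tokens = tokens_for_budget(budget, 20e9)
large_model_tokens = tokens_for_budget(budget, 200e9)

# Under this approximation, a 10x larger model gets 10x less data for the same compute.
print(f"20B model:  {small_model_tokens:.3g} tokens")
print(f"200B model: {large_model_tokens:.3g} tokens")
```

Under that assumption, every point on the boundary is just a different way of splitting the same product of parameters and tokens.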
Compute on this scale is not cheap -- so if you're going to be spending 10 million dollars per training run on a model scaled up to be 100x bigger than your toy version of the model, you want principles better than a feeling in your gut to guide how you allocate this compute between "amount of data the model sees" and "how big the model should be."
So if you get 10x more compute, how much bigger do you make your model? What about 100x more compute? Or 1000x more compute?
Well, the OpenAI paper answers the question. If you get 10x more compute, you increase your model size by about 5x and your data size by about 2x. Another 10x in compute, and model size is 25x bigger and data size is only 4x bigger.
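As a rough sketch, the rule above can be written as a pair of power laws. The exponents here are backed out of the rounded 5x and 2x figures in this post, not taken from the OpenAI paper, so treat them as approximations; the function name is made up.

```python
import math

# Exponents implied by the rounded figures above: 10x compute -> 5x model, 2x data.
MODEL_EXPONENT = math.log10(5)  # roughly 0.70
DATA_EXPONENT = math.log10(2)   # roughly 0.30


def old_rule(compute_multiplier: float) -> tuple[float, float]:
    """Return (model size multiplier, data size multiplier) under the older law."""
    model = compute_multiplier ** MODEL_EXPONENT
    data = compute_multiplier ** DATA_EXPONENT
    return model, data


for mult in (10, 100, 1000):
    model, data = old_rule(mult)
    print(f"{mult}x compute -> model {model:.0f}x bigger, data {data:.0f}x bigger")
```

Note that the two exponents sum to 1, so the model and data multipliers always multiply back to the compute multiplier.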
Model size is almost everything.
Model Size Is (Almost) Everything
Subsequent researchers and institutions took this philosophy to heart, and focused mostly on figuring out how to engineer increasingly-large models, rather than training comparatively smaller models over more data. Hence the many headlines about ever-larger models that we've seen coming from ML research institutions and AI accelerator startups.
See, for instance, the following chart from the new DeepMind paper.
Large Subsequent Models
Note the increase to half a trillion parameters, with identical quantities of training data.
And note that this understanding of the world has also been used to project forward future data requirements -- NVIDIA, for instance, talks about training a trillion parameter model with only 450 billion tokens. Everyone had decided model size was much more important than data size.
Falsification
The DeepMind paper re-approaches the issue of scaling laws.
It uses three separate methods to try to find the correct scaling law, but I'm going to zoom in on the second because I think it's the easiest to comprehend.
The method is simple. They choose 9 different quantities of compute, ranging from about 10^18 FLOPs to 10^21 FLOPs.
For each quantity of compute, they train many different-sized models. Because the quantity of compute is constant for each level, the smaller models are trained for more time and the larger models for less.
The following chart from the paper illustrates this. Each line connects models (at different sizes) trained using the same amount of compute. The vertical axis is the loss, where lower is better:
IsoFLOP Curves
Each of these curves has a clear interpretation. To the left of the minima on each curve, models are too small -- a larger model trained on less data would be an improvement. To the right of the minima on each curve, models are too large -- a smaller model trained on more data would be an improvement. The best models are at the minima.
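Here is a small sketch of how you might locate the minimum of one such curve: fit a parabola through the three points around the lowest observed loss, with model size on a log scale. The loss values are invented for illustration (generated from a made-up quadratic), and this three-point fit is my simplification, not necessarily the paper's exact procedure.

```python
import math


def parabola_vertex(p1, p2, p3):
    """x-coordinate of the vertex of the parabola through three (x, y) points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3**2 * (y1 - y2) + x2**2 * (y3 - y1) + x1**2 * (y2 - y3)) / denom
    return -b / (2 * a)


# Hypothetical model sizes trained at ONE fixed compute budget.
sizes = [1e8, 2e8, 4e8, 8e8, 1.6e9]
# Invented losses: a quadratic in log10(size) with its minimum at 10^8.7 parameters.
curve = [(math.log10(n), 2.5 + 0.2 * (math.log10(n) - 8.7) ** 2) for n in sizes]

# Pick the interior point with the lowest loss, then fit through it and its neighbors.
i = min(range(1, len(curve) - 1), key=lambda k: curve[k][1])
log_n_opt = parabola_vertex(curve[i - 1], curve[i], curve[i + 1])

print(f"estimated optimal size: {10 ** log_n_opt:.3g} parameters")
```

Repeating this for each compute budget gives one (compute, optimal size) point per curve, which is what gets connected into a line.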
If you connect the minima at each curve and extend the line outwards, you get a new law! Specifically, it looks like for every increase in compute, you should increase data size and model size by approximately the same amount.
If you get a 10x increase in compute, you should make your model 3.1x bigger and the data you train over 3.1x bigger; if you get a 100x increase in compute, you should make your model 10x bigger and your data 10x bigger.
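A sketch of this rule, assuming the split is exactly even (an exponent of 0.5 on each side). That is my idealization: an exact square root of 10 is about 3.16, which is close to the 3.1x quoted above but not identical, so the fitted exponents are presumably not exactly 0.5.

```python
def even_split_rule(compute_multiplier: float) -> tuple[float, float]:
    """Return (model multiplier, data multiplier), ASSUMING an exact 50/50 split."""
    scale = compute_multiplier ** 0.5
    return scale, scale


for mult in (10, 100, 1000):
    model, data = even_split_rule(mult)
    print(f"{mult}x compute -> model {model:.2f}x bigger, data {data:.2f}x bigger")
```

Compare this with the older rule, where almost all of any extra compute went into model size.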
Now, all of these experimental runs graphed above were on relatively small models, trained with non-insane quantities of compute. So you could have argued that this rule wouldn't work with much larger numbers.
But to verify that the law was right, DeepMind trained a 70-billion parameter model ("Chinchilla") using the same compute as had been used for the 280-billion parameter Gopher. That is, they trained the smaller Chinchilla with 1.4 trillion tokens, while the larger Gopher had only been trained with 300 billion tokens.
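As a sanity check on "the same compute", here is the arithmetic under the assumed rule of thumb of about 6 FLOPs per parameter per token. That approximation is mine, not something stated in this post, so read the result as ballpark only.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training cost, ASSUMING ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# Model sizes and token counts as quoted in the text above.
chinchilla_flops = training_flops(70e9, 1.4e12)
gopher_flops = training_flops(280e9, 300e9)

compute_ratio = chinchilla_flops / gopher_flops
chinchilla_tokens_per_param = 1.4e12 / 70e9
gopher_tokens_per_param = 300e9 / 280e9

print(f"Chinchilla / Gopher compute: {compute_ratio:.2f}")
print(f"tokens per parameter: Chinchilla {chinchilla_tokens_per_param:.1f}, "
      f"Gopher {gopher_tokens_per_param:.1f}")
```

Under this approximation the two budgets come out within about 20% of each other, while the tokens-per-parameter ratio differs by more than an order of magnitude.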
And, as the new scaling laws predict, Chinchilla is a lot better than Gopher on pretty much everything. It is better by the standard less-perplexity-per-word measure, and by the more interesting usefulness-on-downstream-task measures. I could insert a bunch of graphs here, but if you aren't familiar with the measures in question they basically all sum to "Hey, number goes up!"
Number goes up (Or down when appropriate)
Given the evidence of Chinchilla, it appears pretty definite that OpenAI got the scaling laws wrong. So one natural question is "What happened that they got it wrong?"
Well, background: The learning rate of a deep neural network dictates how much the parameters of a network are updated for each piece of training data. Learning rates on large training runs are typically decreased according to a schedule, so that data towards the end of a training run adjusts the parameters of a neural network less than data towards the beginning of it. You can see this as reflecting the need to not "forget" what was learned earlier in the training run.
It looks like OpenAI used a single total annealing schedule for all of their runs, even those of different lengths. This shifted the apparent best-possible performance downwards for the networks on a non-ideal annealing schedule. And this led to a distorted notion of what the laws should be.
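To illustrate the mismatch, here is a toy sketch. It assumes a cosine decay schedule with a floor at 10% of the peak learning rate; the schedule shape, the floor, and the step counts are all my assumptions for illustration, not details taken from either paper.

```python
import math


def cosine_lr(step: int, horizon: int, peak: float = 1.0, floor: float = 0.1) -> float:
    """Cosine decay from `peak` at step 0 down to `floor` at step `horizon`."""
    progress = min(step / horizon, 1.0)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


run_length = 1_000        # hypothetical short training run
shared_horizon = 10_000   # hypothetical schedule shared by runs of every length

# Learning rate at the last step of the short run, under each schedule.
matched = cosine_lr(run_length, horizon=run_length)
mismatched = cosine_lr(run_length, horizon=shared_horizon)

print(f"final LR, schedule matched to run length: {matched:.3f}")
print(f"final LR, one shared long schedule:       {mismatched:.3f}")
```

With the shared schedule, the short run stops while its learning rate is still near the peak, so it never gets the low-learning-rate phase at the end and its final loss looks worse than it could have been.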
Consequences
One funky thing about this is that we shouldn't see larger language models... at all, for at least a few years.
DeepMind provides a helpful chart of how much training data and compute you'd need to optimally train models of various sizes.
Note that it wouldn't make sense to train a model with 520 billion parameters until you had 60x as much compute as was used for Gopher / Chinchilla. You don't hit the need for a trillion parameters until you have 200x as much compute as was used for Gopher / Chinchilla.
(You might need even more compute; in part of the paper, DeepMind says that at large quantities of compute the scaling laws bend slightly, and the optimal behavior might be to scale data by even more than you scale model size. In which case you might need to increase compute by more than 200x before it would make sense to use a trillion parameters.)
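These multipliers are roughly what you'd expect from the equal-scaling rule. A back-of-envelope sketch, assuming model size grows exactly as the square root of compute and taking the 70-billion parameter Chinchilla as the baseline (both are my simplifications, so the numbers only land in the neighborhood of the chart's figures):

```python
CHINCHILLA_PARAMS = 70e9  # baseline model size, as quoted in the text


def compute_multiplier_for(target_params: float) -> float:
    """Compute needed relative to the baseline, ASSUMING params scale as sqrt(compute)."""
    return (target_params / CHINCHILLA_PARAMS) ** 2


mult_520b = compute_multiplier_for(520e9)
mult_1t = compute_multiplier_for(1e12)

print(f"520B parameters: about {mult_520b:.0f}x the baseline compute")
print(f"1T parameters:   about {mult_1t:.0f}x the baseline compute")
```

If data has to scale faster than model size at large budgets, as the parenthetical above suggests, the true multipliers would be higher than this estimate.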
So until wafer-scale chips decrease the cost of compute tenfold, and Google also decides all it really needs for AGI is to put ten times as much money into LMs, we've seen the largest LMs we're likely to see. However long that may be.
One potential thing that could follow from this is that, because inference costs are obviously smaller for small language models, services such as OpenAI's GPT-3 should be cheaper for them to provide. The cost to run them, at the same level of quality, should drop by at least 3x. I don't know what percentage of the cost of providing these services comes from running them rather than training them, but potentially it could make services based on these models more efficient than they were before, and open up economic viability in places where it didn't exist before.
One last consequence is that this paper makes the engineering involved in training large language models easier. Gathering more good data would be (I think) far easier than trying to efficiently split computation for increasingly large LMs across thousands of machines.
(Cross post)