I've been thinking for a while that one could do syllabus learning for LLMs. It's fairly easy to classify text by reading age. So start training the LLM on only text with a low reading age, and then increase the ceiling on reading age until it's training on the full distribution of text. (https://arxiv.org/pdf/2108.02170.pdf experimented with curriculum learning in early LLMs, with little effect, but oddly didn't test reading age.)
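As a rough illustration of how cheap the classification step can be, here is a minimal sketch that scores text with the Flesch-Kincaid grade formula. The choice of metric is my assumption (the idea above doesn't depend on any particular one), the syllable counter is a crude vowel-group heuristic, and the function names are made up for this example. The result is a US school grade level, which is a proxy for reading age rather than the same thing.

```python
import re

def count_syllables(word):
    """Crude heuristic: count runs of consecutive vowels, with a minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Approximate Flesch-Kincaid grade level of a piece of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

simple = "The cat sat. The dog ran. It was fun."
hard = ("Quantitative macroeconomic equilibrium analysis necessitates "
        "sophisticated mathematical formalization.")
print(flesch_kincaid_grade(simple) < flesch_kincaid_grade(hard))
```

A real pipeline would probably want a better syllable counter or a learned classifier, but even something this simple separates children's stories from dense academic prose.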
To avoid distorting the final training distribution by much, you would need to raise the reading-age limit fairly fast, so that by the time it reaches the maximum you've only used up, say, ten percent of the low-reading-age text. In the final training distribution that text is then only about ten percent underrepresented, so the LLM is still capable of generating children's stories if needed (just slightly less likely to do so randomly).
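The schedule described above can be sketched as follows. This is only an illustration under assumptions of mine: a linear ramp, a ceiling that starts at age 6 and ends at age 18, and documents that arrive as `(text, reading_age)` pairs. All names and defaults are hypothetical, and the per-batch scan is far too slow for a real corpus.

```python
import math
import random

def reading_age_ceiling(step, ramp_steps, min_age=6.0, max_age=18.0):
    """Linearly raise the ceiling from min_age to max_age, then remove it entirely."""
    if step >= ramp_steps:
        return math.inf  # ramp finished: train on the full distribution
    return min_age + (max_age - min_age) * step / ramp_steps

def curriculum_batches(docs, batch_size, ramp_steps, seed=0):
    """Yield (step, ceiling, batch), sampling without replacement.

    docs is a list of (text, reading_age) pairs. Each document is used
    exactly once, and only once its reading age is under the current ceiling.
    """
    rng = random.Random(seed)
    remaining = list(docs)
    rng.shuffle(remaining)
    step = 0
    while remaining:
        ceiling = reading_age_ceiling(step, ramp_steps)
        batch, deferred = [], []
        for doc in remaining:
            if len(batch) < batch_size and doc[1] <= ceiling:
                batch.append(doc)
            else:
                deferred.append(doc)  # too hard for now, or batch already full
        remaining = deferred
        if batch:  # skip steps where nothing is eligible yet
            yield step, ceiling, batch
        step += 1
```

Making `ramp_steps` small relative to the total number of steps is what keeps the distortion low: the fraction of easy text consumed during the ramp is roughly the fraction by which it ends up underrepresented afterwards.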
The hope is that this would improve quality faster early in the training run, getting the LLM sooner to a level where it can extract more benefit from even the more difficult texts, and so reach a slightly higher final quality from the same amount of training data and compute. Without a curriculum, the really difficult texts that happen to be used early in the training run presumably give the LLM less value than they would have later on. I'd expect any resulting improvement to be fairly small, but then this isn't very hard to do.
A more challenging approach would be to do the early training on low-reading-age material in a smaller LLM, and then do something like add more layers near the middle, or distill the behavior of the small LLM into a larger one, before continuing the training. Here the aim would be to also save some compute during the early parts of the training run. The potential issue is that the distillation process, or the loss of quality from adding new randomly-initialized layers, might end up costing more compute/quality than we'd saved/gained.
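One way to sidestep the quality loss from new layers, sketched below, is to swap the random initialization for a zero-initialized output projection, so that each inserted residual block starts out as an identity function. This is my substitution, not something proposed above, and the toy residual MLP blocks and all function names are purely illustrative (no attention, no normalization, pure Python lists).

```python
import random

def make_block(width, rng, zero_output=False):
    """Toy residual block computing x + W_out @ relu(W_in @ x)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(width)] for _ in range(width)]
    if zero_output:
        # Zero output projection: the block initially adds nothing to the residual stream.
        w_out = [[0.0] * width for _ in range(width)]
    else:
        w_out = [[rng.gauss(0, 0.1) for _ in range(width)] for _ in range(width)]
    return {"w_in": w_in, "w_out": w_out}

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def forward(blocks, x):
    """Run x through the residual stack."""
    for blk in blocks:
        hidden = [max(0.0, h) for h in matvec(blk["w_in"], x)]
        update = matvec(blk["w_out"], hidden)
        x = [xi + ui for xi, ui in zip(x, update)]
    return x

def grow_in_middle(blocks, n_new, rng):
    """Insert n_new identity-initialized blocks at the midpoint of the stack."""
    width = len(blocks[0]["w_in"])
    mid = len(blocks) // 2
    new_blocks = [make_block(width, rng, zero_output=True) for _ in range(n_new)]
    return blocks[:mid] + new_blocks + blocks[mid:]

rng = random.Random(0)
small = [make_block(8, rng) for _ in range(4)]
large = grow_in_middle(small, n_new=2, rng=rng)
x = [rng.gauss(0, 1) for _ in range(8)]
print(forward(small, x) == forward(large, x))
```

The grown model computes exactly the same function as the small one at the moment of insertion, so nothing is lost up front. Whether the new layers then learn fast enough to be worth the extra compute is the open question.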
[In general, the Bitter Lesson suggests that sadly the time and engineering effort spent on these sorts of small tweaks might be better spent on just scaling up more.]
Abstract
Implications
Interpretability
One part that isn't mentioned in the abstract but is interesting:
The difference between the highly activating tokens for a neuron in each model is striking. Here's the tiny model:
...and here's GPT2-XL:
Capabilities
Again from the introduction (emphasis mine):
If this is true, there could be ways to drastically cut LLM training costs while maintaining (or increasing) the capabilities of the final model.
This could be related to dataset quality. QLoRA found (among other things) that a high-quality dataset of 9000 examples (OpenAssistant) beat a lower-quality dataset of 1M examples.