Thanks to Collin Burns, Ruiqi Zhong, Cassidy Laidlaw, Jean-Stanislas Denain, and Erik Jones, who generated most of the considerations discussed in this post.
Previously, I evaluated the accuracy of forecasts about performance on the MATH and MMLU (Massive Multitask) datasets. I argued that most people, including myself, significantly underestimated the rate of progress, and encouraged ML researchers to make forecasts for the next year in order to become more calibrated.
In that spirit, I’ll offer my own forecasts for state-of-the-art performance on MATH and MMLU. Following the corresponding Metaculus questions, I’ll forecast accuracy as of June 30, 2023. My forecasts are based on a one-hour exercise I performed with my research group, where we brainstormed considerations, looked up relevant information, formed initial forecasts, discussed, and then made updated forecasts. It was fairly easy to devote one group meeting to this, and I’d encourage other research groups to do the same.
Below, I’ll describe my reasoning for the MATH and MMLU forecasts in turn. I’ll review relevant background info, describe the key considerations we brainstormed, analyze those considerations, and then give my bottom-line forecast.
Metaculus does a good job of describing the MATH dataset and corresponding forecasting question:
The MATH dataset is a dataset of challenging high school mathematics problems constructed by Hendrycks et al. (2021). Hypermind forecasters were commissioned to predict state-of-the-art performance on June 30, 2022, '23, '24, and '25. The 2022 result of 50.3% was significantly outside forecasters' prediction intervals, so we're seeing what the updated forecasts are for 2023, '24, and '25.
What will be state-of-the-art performance on the MATH dataset in the following years?
These questions should resolve identically to the Hypermind forecasts:
"These questions resolve as the highest performance achieved on MATH by June 30 in the following years by an eligible model.
Eligible models may use scratch space before outputting an answer (if desired) and may be trained in any way that does not use the test set (few-shot, fine tuned, etc.). The model need not be publicly released, as long as the resulting performance itself is reported in a published paper (on arxiv or a major ML conference) or through an official communication channel of an industry lab (e.g. claimed in a research blog post on the OpenAI blog, or a press release). In case of ambiguity, the question will resolve according to Jacob Steinhardt’s expert judgement."
It’s perhaps a bit sketchy for me to be both making and resolving the forecast, but I expect in most cases the answer will be unambiguous.
Below I list key considerations generated during our brainstorming:
Minerva incorporated several changes that improved performance relative to previous attempts:
Other low-hanging fruit:
Overall summary: the lowest-hanging fruit towards further improvement would be (in order):
Aggregating these, it feels easy to imagine a >14% improvement, fairly plausible to get >21%, and >28% doesn’t seem out of the question. Concretely, conditional on Google or some other large organization deciding to try to further improve MATH performance, my prediction of how much they would improve it in the next year would be:
(This prediction is specifically using the "how much low-hanging fruit" frame. I'll also consider other perspectives, like trend lines, and average with these other perspectives when making a final forecast.)
As noted above, the 62B parameter model has best-of-256 performance (filtered for correct reasoning) of at least 68%. My guess is that the true best-of-256 performance is in the low-to-mid 70s for 62B. Since Minerva-540B is 7% better than Minerva-62B, the model is at least capable of generating the correct answer around 80% of the time.
We can also look at errors by type of error. For instance, we estimated that calculation errors accounted for around 30% of the remaining errors (or around 15% absolute performance). These are probably fairly easy to fix.
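The conversion from "share of remaining errors" to "absolute performance" can be checked in a few lines. This is a minimal sketch: it assumes the 50.3% result quoted above as the current accuracy and takes the 30% share at face value; the variable names are illustrative.

```python
# Convert a share of the remaining errors into absolute accuracy points.
sota_accuracy = 50.3      # MATH result quoted above, in percent (assumed starting point)
calc_error_share = 0.30   # rough estimate: fraction of remaining errors that are calculation errors

remaining_error = 100.0 - sota_accuracy                  # percent of problems still answered incorrectly
calc_error_points = calc_error_share * remaining_error   # absolute points attributable to calculation errors

print(f"Remaining error: {remaining_error:.1f} points")
print(f"Calculation errors: about {calc_error_points:.1f} points of absolute accuracy")
```

Fixing every calculation error would therefore add roughly 15 points, which is an upper bound on what this one fix could deliver.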
In the other direction, the remaining MATH questions are harder than the ones that Minerva solves currently. I couldn’t find results grouped by difficulty, but Figure 4 of the Minerva paper shows lower accuracy for harder subtopics such as Intermediate Algebra.
We estimated that using all of arXiv would only generate about 10B words of mathematical content, compared to the 20B tokens used in Minerva. At a conversion rate of 2 tokens/word, this suggests that Minerva is already using up most relevant content on arXiv. I’d similarly guess that Minerva makes use of most math-focused web pages currently on the internet (it looks for everything with MathJax). I’d guess it’s possible to find more (e.g. math textbooks) as well as to synthetically generate mathematical exposition, and probably also to clean the existing data better. But overall I’d guess there aren’t huge remaining gains here.
For math specifically, it’s possible to use calculators and verifiers, which aren’t used by Minerva but could further improve performance. Table 9 of the PaLM paper shows that giving PaLM a calculator led to a 4% increase in performance on GSM8K (much smaller than the gains from chain-of-thought prompting).
In the same table, we see that GPT-3 gets a 20% gain using a task-specific verifier. Given that the MATH problems are fairly diverse compared to GSM8K, I doubt it will be easy to write an effective verifier for that domain, and it’s unclear whether researchers will seriously try in the next year. The calculator seems more straightforward and I’d give a ~50% chance that someone tries it (conditional on there being at least one industry lab paper that focuses on math in the next year).
This is roughly a 2.9-point accuracy gain per month (though progress will almost certainly be slower in the future). Taking this extrapolation literally would give 85.1% for 06/30/2023.
The Dynabench paper plots historical progress on a number of ML datasets, normalized by baseline and ceiling performance (see Figure 1, reproduced below).
We seem to often see immediate huge gains, while the next ones are somewhat slower.
Here’s another benchmark for reference. It got 67% -> 86% within 1-2 months, then took 4 months to break 90%.
Overall, it seems clear we should expect some sort of slow-down. In some cases, the slow-down was huge. I think progress should not slow down that much in this case since there’s still lots of low-hanging fruit. Maybe progress is 60% as fast as before? So that would give us 71% on 06/30/2023.
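The two trend-based numbers (85.1% and 71%) come from the same linear extrapolation with different speed factors. Here is a minimal sketch, assuming we start from the 50.3% result on June 30, 2022 and project 12 months ahead; the function name and the `speed_factor` parameter are illustrative.

```python
def extrapolate(current, gain_per_month, months, speed_factor=1.0):
    """Linear extrapolation, with future progress scaled by speed_factor."""
    return current + gain_per_month * months * speed_factor

math_sota = 50.3     # MATH accuracy as of June 30, 2022, in percent (assumed starting point)
monthly_gain = 2.9   # historical gain in points per month
months_ahead = 12    # June 30, 2022 to June 30, 2023

literal = extrapolate(math_sota, monthly_gain, months_ahead)
slowed = extrapolate(math_sota, monthly_gain, months_ahead, speed_factor=0.6)

print(f"No slow-down:        {literal:.1f}%")
print(f"60% as fast as before: {slowed:.1f}%")
```

The 0.6 factor is a guess rather than something estimated from data, so it is worth rerunning with other values (say 0.4 to 0.8) to see how sensitive the forecast is to it.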
Two sources of progress:
How many language papers have been released historically?
(This only counts language models that achieved broad state-of-the-art performance. E.g. I'm ignoring OPT, BLOOM, GPT-J, etc.)
By this count, there have been 6 papers since the beginning of 2019, so the base rate is around 1.7 per year. If we model this as a Poisson process, it predicts that over the next year we will see 0 new papers with probability 18%, 1 with probability 31%, 2 with probability 26%, and >2 with probability 25%.
What about math-specific work? Harder to measure what “counts” (lots of math papers but how many are large-scale / pushing state-of-the-art?). Intuitively I’d expect more like 1.1 such papers per year. So around 33% chance of zero, 37% chance of 1, 20% chance of 2, 10% chance of >2.
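The probabilities for both paper counts follow from the Poisson formula $P(k) = e^{-\lambda} \lambda^k / k!$ with a one-year horizon. The sketch below recomputes them for the two rates used above (1.7 and 1.1 per year); the helper names are illustrative. Exact values can differ by a point from the figures in the text, which were rounded to sum to 100%.

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events when the expected count is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)

def count_distribution(rate):
    """Probabilities of 0, 1, 2, and more than 2 events over one year."""
    p0, p1, p2 = (poisson_pmf(k, rate) for k in range(3))
    return {"0": p0, "1": p1, "2": p2, ">2": 1.0 - (p0 + p1 + p2)}

rates = {
    "language model papers": 1.7,   # base rate from the count above
    "math-specific papers": 1.1,    # intuitive guess
}

for label, rate in rates.items():
    dist = count_distribution(rate)
    summary = ", ".join(f"{k}: {p:.0%}" for k, p in dist.items())
    print(f"{label} ({rate}/year) -> {summary}")
```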
An important special case is if there are no developments on either the language models or the math-specific front. Under the above model these have probabilities 18% and 33%, and are probably positively correlated. Additionally, it's possible that language model papers might not bother to evaluate on MATH or might not use all the ideas in the Minerva paper (and thus fail to hit SOTA). Combining these considerations, I’d forecast around a 12% chance that there is no significant progress on MATH on any front.
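As a sanity check on the 12% figure (not the calculation that produced it), we can bound the probability that both events happen together. With marginals of 18% and 33%, independence gives the product, and perfect positive correlation gives the smaller of the two marginals. The variable names below are illustrative.

```python
p_no_lm_paper = 0.18     # P(no new SOTA language model paper), from the Poisson model
p_no_math_paper = 0.33   # P(no new large-scale math-specific paper), from the Poisson model

# If the two events were independent, the joint probability is the product.
independent = p_no_lm_paper * p_no_math_paper

# Frechet bounds: the joint probability can never leave this range,
# whatever the dependence between the two events.
upper_bound = min(p_no_lm_paper, p_no_math_paper)
lower_bound = max(0.0, p_no_lm_paper + p_no_math_paper - 1.0)

print(f"Independent:  {independent:.1%}")
print(f"Joint range:  {lower_bound:.1%} to {upper_bound:.1%}")
```

A forecast of 12% sits between the independent case and the upper bound, which is consistent with positive correlation plus some extra weight for papers that appear but do not set a new state of the art on MATH.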
From the above lines of reasoning, we have a few different angles on the problem:
If I intuitively combine these, I produce the following forecast:
The Metaculus community prediction has a median of 74 and a 75th percentile of 83. So I’ll adjust up slightly more. New forecast adjusted towards community prediction:
Rough approximation of this distribution on Metaculus (red is me, green is the community prediction):
Interestingly, Hypermind forecasts a much smaller median of 64.1%.
Again borrowing from Metaculus:
The Massive Multitask Language Understanding (MMLU) dataset is a dataset of high school, college, and professional multiple choice exams that test expert subject knowledge. It was constructed by Hendrycks et al. (2021). Hypermind forecasters were commissioned to predict state-of-the-art performance on June 30, 2022, '23, '24, and '25. The 2022 result of 67.5% was significantly outside forecasters' prediction intervals, so we're seeing what the updated forecasts are for 2023, '24, and '25.
What will be state-of-the-art accuracy on the Massive Multitask dataset in the following years?
These questions should resolve identically to the Hypermind forecasts:
"These questions resolve as the highest performance achieved on MMLU by June 30 in the following years by an eligible model. Eligible models must not have been specifically trained on data from the MMLU dataset. A model need not be publicly released, as long as the resulting performance itself is reported in a published paper (on arxiv or a major ML conference) or through an official communication channel of an industry lab (e.g. claimed in a research blog post on the OpenAI blog, or a press release). If there's uncertainty about whether something counts, we will defer to this leaderboard."
At a high level, these are fairly similar to those of the MATH dataset. Since more people have worked on MMLU and there’s been steadier progress, we rely more on base rates and less on detailed considerations of how one could improve it further.
Below is a time series of MMLU results, taken from the MMLU leaderboard (note MMLU was published in Jan. 2021). I've bolded few-shot/zero-shot results.
Model | Date | Average |
---|---|---|
Chinchilla (70B, few-shot) | Mar 29, 2022 | 67.5 |
Gopher (280B, few-shot) | Dec 8, 2021 | 60.0 |
GPT-3 (175B, fine-tuned) | Jul 22, 2020 | 53.9 |
UnifiedQA | Oct 7, 2020 | 48.9 |
GPT-3 (175B, few-shot) | Jul 22, 2020 | 43.9 |
GPT-3 (6.7B, fine-tuned) | | 43.2 |
GPT-2 | | 32.4 |
If we restrict to few-shot results, we see:
It's not clear which time horizon is best to use here. I came up with an approximate base rate of 1.2 pts / month.
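To see how much the choice of time horizon matters, here is a minimal sketch that computes points per month between each pair of dated few-shot results in the table above. The month length (365.25 / 12 days) and the helper names are assumptions made for illustration.

```python
from datetime import date
from itertools import combinations

# Dated few-shot results from the table above: (model, date, average accuracy).
few_shot = [
    ("GPT-3 (175B)", date(2020, 7, 22), 43.9),
    ("Gopher (280B)", date(2021, 12, 8), 60.0),
    ("Chinchilla (70B)", date(2022, 3, 29), 67.5),
]

DAYS_PER_MONTH = 365.25 / 12  # average month length (assumption)

def points_per_month(start, end):
    """Accuracy gain per month between two (model, date, accuracy) entries."""
    _, start_date, start_acc = start
    _, end_date, end_acc = end
    months = (end_date - start_date).days / DAYS_PER_MONTH
    return (end_acc - start_acc) / months

for start, end in combinations(few_shot, 2):
    rate = points_per_month(start, end)
    print(f"{start[0]} -> {end[0]}: {rate:.2f} pts/month")
```

The longest horizon (GPT-3 to Chinchilla) comes out close to the 1.2 pts/month base rate, while the most recent interval alone is noticeably faster.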
Other notes:
We analyzed this already in the previous section on MATH. It seems like there's usually an initial period of rapid progress, followed by a slow-down. However, MMLU has had enough attempts that I’d say it’s past the “huge initial gains” stage. Therefore, I don’t expect as much of a level-off as for MATH, even though there is less obvious low-hanging fruit; maybe progress will be 75% as fast as before. At the base rate of 1.2 points per month, this would suggest +10.8 points over the next year (1.2 × 12 × 0.75).
The current SOTA of 67.5 comes from Chinchilla. But Minerva does much better than Chinchilla on the MMLU-STEM subset of MMLU. Here’s a rough calculation of how much taking max(Chinchilla, Minerva) would improve things:
So adding in Minerva would add (75% - 54.9%) * 19/57 = 6.7% points of accuracy.
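The same calculation as a short sketch. It reads the formula as follows, which is an interpretation rather than something stated explicitly: 75% and 54.9% are the Minerva and Chinchilla accuracies on MMLU-STEM, 19 of MMLU's 57 tasks are in the STEM subset, and every task gets equal weight in the overall average.

```python
# Effect on overall MMLU of swapping in Minerva's score on the STEM subset.
chinchilla_overall = 67.5   # current SOTA on all of MMLU, in percent
chinchilla_stem = 54.9      # assumed: Chinchilla accuracy on MMLU-STEM
minerva_stem = 75.0         # assumed: Minerva accuracy on MMLU-STEM
stem_tasks = 19             # assumed: number of STEM tasks
total_tasks = 57            # assumed: total number of MMLU tasks

stem_weight = stem_tasks / total_tasks                   # share of the overall average
gain = (minerva_stem - chinchilla_stem) * stem_weight    # points added to the overall score
combined = chinchilla_overall + gain

print(f"Gain from Minerva on STEM: {gain:.1f} points")
print(f"max(Chinchilla, Minerva) overall: {combined:.1f}%")
```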
Will this happen? It's not obvious, since PaLM is owned by Google and Chinchilla is owned by DeepMind. At least one org would need to train a new model. I think there’s a good chance this happens, but not certain (~65% probability).
Result of a quick brainstorm:
In addition, the STEM-specific improvements (e.g. Minerva) will continue to improve MMLU-STEM. Based on the MATH forecast above, on median I expect about half as much improvement over the next year as we saw from the Minerva paper, or around another 3% improvement on MMLU overall (since Minerva gave a 6.7% improvement).
We thought it was possible but unlikely that there are significant advances in general knowledge retrieval in the next year that also get used by MMLU (~20% probability).
Unlike MATH, there is nothing “special” that makes MMLU stand out from other language modeling benchmarks. So I’d guess most gains will come from general-purpose improvements to language models, plus a bit of STEM-specific improvement if people focus on quantitative reasoning.
In some sense, MMLU performance is already “at” 74.2% because of the Minerva result. Additional low-hanging fruit would push us up another 5 points to 79.2%. Alternately, simply extrapolating historical progress would suggest 10.8 points of improvement, or 85%. Putting these together, I’d be inclined towards a median of 83%.
If we instead say that progress doesn’t slow down at all, we’d get 89%.
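These scenarios can be laid side by side. The sketch below assumes each one starts from the 74.2% "effective" level (Chinchilla plus the Minerva gain) and uses the 1.2 pts/month base rate from earlier; the scenario labels are illustrative.

```python
chinchilla_sota = 67.5   # current MMLU SOTA, in percent
minerva_gain = 6.7       # points from combining with Minerva on STEM
base_rate = 1.2          # historical progress, points per month
months = 12              # forecast horizon

effective_start = chinchilla_sota + minerva_gain  # where MMLU is already "at"

scenarios = {
    "low-hanging fruit only (+5 points)": effective_start + 5.0,
    "historical rate at 75% speed": effective_start + base_rate * months * 0.75,
    "historical rate, no slow-down": effective_start + base_rate * months,
}

for name, value in scenarios.items():
    print(f"{name}: {value:.1f}%")
```

The median of 83% falls between the first two scenarios, leaning towards the trend-based one.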
As before, I’d give an 18% chance of no new SOTA language model papers, in which case MMLU performance likely stays between 67.5% and 74.2%. This also means we should adjust the previous numbers down a bit.
Overall forecast:
This seems pretty similar to the Metaculus community prediction, so I won’t do any further adjustment.
Interestingly, the Hypermind median is only at 72.5% right now. Given the ability to combine Minerva + Chinchilla, this intuitively seems too low to me.
My personal forecasts ended up being pretty similar to the Metaculus community forecasts, aside from me expecting slightly slower MATH progress (but only by about a percentage point). So, we can ask what Metaculus expects for 2024 and 2025 as well, as an approximation to what I "would" believe if I thought about it more.
MATH forecast (community prediction in green, top row of each cell):
MMLU forecast (community prediction in green):
So, on median Metaculus expects MATH to be at 83% in 2024 and at 88% in 2025. It expects MMLU to be at 88% in 2024 and at 93% (!) in 2025. The last one is particularly interesting: since MMLU tests domain-specific subject knowledge across many areas, it is predicting that a single model will be able to match domain-specific expert performance across a wide variety of written subject exams.
Do you agree with these forecasts? Disagree? I strongly encourage you to leave your own forecasts on Metaculus: here for MATH, and here for MMLU.