In 2021, I proposed measuring progress in language modeling by tracking perplexity and extrapolating past results to determine when language models were expected to reach roughly "human-level" performance. Here, I build on that approach by introducing a more systematic and precise method of forecasting progress in language modeling that employs scaling laws to make predictions.

The full report for this forecasting method can be found in this document. In this blog post, I'll try to explain the essential elements of the approach without going into excessive detail on the technical derivations.

This approach provides a new method for forecasting the arrival of transformative AI (TAI), which can be contrasted with Ajeya Cotra's Bio Anchors model. I will tentatively call it the "Direct Approach", since it makes use of scaling laws directly to make predictions about compute requirements for AI.

Naturally, the Direct Approach is a very speculative framework and might end up being useless for forecasting TAI (in fact, I consider this the most likely outcome). Nonetheless, I'm hopeful that something like it can serve as a better foundation than current TAI timelines models, which I currently think are likely even worse. Note that there may be errors in the report and Colab notebook, as they were not extensively fact-checked.

Some background

In a nutshell, this approach is simply about taking the cross-entropy loss of an autoregressive model and trying to find a way of interpreting that quantity qualitatively: that is, something we can put on a chart and extrapolate until the quantity reaches a natural threshold that we identify with something important.

In my 2021 post about predicting language model performance, I drew a trendline through a plot of language model perplexities on various benchmarks and noted when the trendline went through estimates of "human-level" perplexity. This approach felt reasonable to me at the time, but I now think it too easily hand-waved away some important details.

The error of omission I committed in my old approach becomes more apparent when you think about language model performance from the perspective of scaling laws, for example the parametric scaling law from Hoffmann et al. 2022:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Here, we see cross-entropy loss as a function of parameters $N$ and training tokens $D$ seen during training. Notably, if we take the limit as the number of parameters and training tokens goes to infinity, then we're left with $E$. Theoretically, $E$ corresponds to the "entropy of natural text" under certain assumptions, which is precisely the thing I identified with "roughly human-level" performance in my previous post. In other words, if we take this scaling law naively, it seems as though it will take infinite compute to reach human-level performance.

I believe the resolution to this apparent issue is to say that "human-level" performance will not be obtained when the loss hits $E$, but rather when it reaches some small level above $E$. How close to $E$ is enough? Well, that's the question we tried to answer with this report.
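To make the limiting behavior concrete, here is a minimal Python sketch that plugs numbers into the parametric law above. The constants are approximately the published fits from Hoffmann et al. 2022 and are used purely for illustration:

```python
import numpy as np

# Approximate constants from the Hoffmann et al. (2022) parametric fit;
# treat these as illustrative rather than authoritative.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def chinchilla_loss(n_params, n_tokens):
    """Predicted cross-entropy loss (nats per token) for a model with
    n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# As parameters and tokens grow, the loss approaches E but never reaches it.
for n, d in [(1e9, 2e10), (7e10, 1.4e12), (1e13, 2e14), (1e16, 2e17)]:
    print(f"N={n:.0e}, D={d:.0e}: predicted loss ≈ {chinchilla_loss(n, d):.3f}")
```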

Summary of the Direct Approach

We begin by considering a language task, which in this post will be scientific research for illustration. For simplicity, let's imagine that this task consists of writing high-quality research papers or reports, although more nuanced specifications are possible.

Of course, real scientific research involves more than merely writing research papers. It involves proposing hypotheses, devising experiments, and collecting data, but for now, let's imagine that we can simplify all these steps into one step that involves writing high quality research papers. This simplification may not be entirely unrealistic, since if the papers are genuinely judged to be high quality and not fraudulent or p-hacked etc., then presumably they are the end result of a process that reliably performs all the essential steps to proper scientific research.

Next, we estimate a sufficient horizon length, which I'll call the k-horizon, over which we expect the most complex reasoning to emerge during the task. For the case of scientific research, we might reasonably take the k-horizon to roughly be the length of an average scientific paper, which is likely between 3,000 and 10,000 words. However, we can also explicitly model our uncertainty about the right choice for this parameter.

Our goal is to find a value for the KL-divergence of the "true" distribution of text from a model that roughly corresponds to "the model cannot be reliably distinguished from the true distribution over lengths equal to the k-horizon". Note that if the KL-divergence were exactly zero, then there would be no detectable difference between the two distributions, and thus, the model could directly substitute for scientific researchers. In the more realistic case, the KL-divergence will be non-zero, allowing us to calculate the expected number of tokens over which it becomes theoretically possible to discriminate between the model and the true distribution, to some degree of confidence.

In general, the key insight of this approach is that indistinguishability implies competence. The reason is simple. If there is no way of reliably distinguishing between what the model produces and the true distribution, then there cannot be defects in the model's competence, as otherwise, we could exploit those defects to distinguish its outputs from the true distribution.

I am not saying that competence implies indistinguishability, as a model can be superhuman and quite distinguishable from human performance. I think the Direct Approach is best seen as a way of calculating an upper bound on the hardness of training a model that can think reliably over long sequences, rather than a statement about how transformative models will be trained (in particular, it is not a claim that they will be trained to copy human behavior exactly).

In the next sections, I introduce the concept of k-performance, which informally means that the model produces outputs indistinguishable from the true distribution, according to a trained judge, up to the horizon length k. All of this can be made more formal by introducing a simple model of human abilities to discriminate between outputs.

When combined with empirically derived scaling laws, the result permits us to directly calculate the compute requirements to train a model with a KL-divergence corresponding to a target k-performance value. Then, just as in Bio Anchors, we can forecast a distribution over the arrival date of this model by forecasting future growth in price performance, willingness to spend, and algorithmic efficiency, among other variables we might wish to consider.

Interpreting the training loss

In language modeling, the training loss is generally an estimate of the cross-entropy between the model's distribution and the "true" distribution. This loss is convenient for our purposes, since there is a neat mathematical relationship between the cross-entropy of two distributions and our ability to distinguish between samples from those distributions.

First, note that the cross-entropy between the true distribution $P$ and the model distribution $Q$ can be decomposed into a sum of an entropy term and a KL-divergence term,

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q).$$

Since the Hoffmann et al. scaling law also decomposes training loss into an irreducible loss, which they associate with the intrinsic entropy of internet text, and a reducible loss component, we can use their scaling law to estimate the KL-divergence given data and parameter training inputs. The KL-divergence can then be used to calculate how "distinguishable" $P$ and $Q$ are in the following sense.
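As a quick numeric sanity check of this identity, here is a minimal Python sketch using a made-up three-symbol vocabulary (the distributions are purely illustrative):

```python
import numpy as np

# Hypothetical next-token distributions over a tiny 3-symbol vocabulary,
# used only to verify H(P, Q) = H(P) + D_KL(P || Q) numerically.
P = np.array([0.6, 0.3, 0.1])   # "true" distribution
Q = np.array([0.5, 0.3, 0.2])   # model distribution

cross_entropy = -np.sum(P * np.log(Q))        # H(P, Q), in nats
entropy       = -np.sum(P * np.log(P))        # H(P)
kl_divergence =  np.sum(P * np.log(P / Q))    # D_KL(P || Q)

print(cross_entropy, entropy + kl_divergence)  # the two values agree
```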

Suppose you were given i.i.d. samples from either $P$ or $Q$, but you did not initially know which distribution they were being sampled from. If you start with some prior over whether you are sampling from $P$ or $Q$, then you can use the KL-divergence to calculate the number of samples it should take, in expectation, for your posterior to exceed some threshold confidence about which distribution you are sampling from.

In the full report, I derive this result by borrowing from these lecture notes on the Sequential Probability Ratio Test.

Let $c$ be the stopping threshold probability, and let $H_P$ and $H_Q$ be the hypotheses that we are sampling from $P$ or $Q$ respectively. We will stop when our odds $\Pr(H_P)/\Pr(H_Q)$ either fall below $\frac{1-c}{c}$ or exceed $\frac{c}{1-c}$. We can then show that the expected stopping time, $\mathbb{E}[T]$, is approximately inversely linear in the KL-divergence,

$$\mathbb{E}[T \mid H_P] \approx \frac{\log\frac{c}{1-c}}{D_{\mathrm{KL}}(P \,\|\, Q)}$$

and

$$\mathbb{E}[T \mid H_Q] \approx \frac{\log\frac{c}{1-c}}{D_{\mathrm{KL}}(Q \,\|\, P)}.$$

Importantly, these are the expected stopping times for an ideal discriminator: it can be shown that no matter what method one uses, there is no way to discriminate between samples from these distributions in a shorter expected time.

In other words, for any desired level of confidence $c$ and any horizon length $k$, we can find a value of $D_{\mathrm{KL}}(P \,\|\, Q)$ small enough that it is impossible to reliably distinguish between $P$ and $Q$, with confidence $c$, using fewer than $k$ samples.
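Here is a small simulation sketch of this result, using made-up discrete distributions standing in for $P$ and $Q$. It runs the Sequential Probability Ratio Test on samples drawn from $P$ and compares the empirical mean stopping time to the $\log\frac{c}{1-c} / D_{\mathrm{KL}}(P \,\|\, Q)$ approximation; the agreement is only approximate, since the formula ignores overshoot and erroneous stops:

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([0.5, 0.3, 0.2])    # stand-in for the "true" distribution
Q = np.array([0.4, 0.35, 0.25])  # stand-in for the model

c = 0.9                                    # stopping threshold probability
log_odds_threshold = np.log(c / (1 - c))
kl_pq = np.sum(P * np.log(P / Q))          # D_KL(P || Q)

def stopping_time():
    """Draw samples from P and run the SPRT: stop once the log-odds in favor
    of 'samples come from P' exceed the threshold or fall below its negation."""
    log_odds = 0.0
    for t in range(1, 100_000):
        x = rng.choice(3, p=P)
        log_odds += np.log(P[x] / Q[x])
        if abs(log_odds) >= log_odds_threshold:
            return t
    return t

empirical = np.mean([stopping_time() for _ in range(1000)])
print(f"empirical mean stopping time: {empirical:.1f} samples")
print(f"log(c/(1-c)) / KL(P||Q):      {log_odds_threshold / kl_pq:.1f} samples")
```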

In plainer language, we can use scaling laws in data and parameters to calculate the loss of a language model with respect to its training data, which can be used to calculate the horizon over which the model's output is essentially indistinguishable from the training data. If this horizon length is longer than the horizon length of the transformative task, then our approach is to predict that the model is competent, and can directly substitute for humans in performing the task.

(For technical reasons, the preceding analysis is slightly incomplete, since in the case of language models, we do not obtain i.i.d. samples. In the full report, I address this concern by appealing to the Shannon–McMillan–Breiman theorem.)
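Putting the pieces together, here is a minimal sketch of the calculation described above, again reusing the illustrative Hoffmann et al. constants. Identifying the reducible loss with the per-token KL-divergence of the model from the true distribution is the key (and debatable) interpretive step of this approach:

```python
import numpy as np

A, B = 406.4, 410.7        # illustrative Hoffmann et al. fit
alpha, beta = 0.34, 0.28

def reducible_loss(n_params, n_tokens):
    """Reducible part of the predicted loss, interpreted here as the
    per-token KL-divergence of the model from the true distribution (nats)."""
    return A / n_params**alpha + B / n_tokens**beta

def ideal_horizon(n_params, n_tokens, confidence=0.9):
    """Expected number of tokens an ideal discriminator would need to tell
    the model apart from the true distribution at the given confidence."""
    kl_per_token = reducible_loss(n_params, n_tokens)
    return np.log(confidence / (1 - confidence)) / kl_per_token

# Example: roughly Chinchilla-scale inputs (70B parameters, 1.4T tokens).
print(f"ideal-discriminator horizon ≈ {ideal_horizon(70e9, 1.4e12):.0f} tokens")
```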

Building a more realistic model

In the last section, I only showed how to calculate the horizon length over which two distributions are hard to distinguish for an ideal discriminator. A more realistic model is to assume that there is some imperfection in real human discrimination abilities.

To make things slightly more precise, let the k-performance of a model refer to the horizon length k over which a model cannot be reliably distinguished from the true distribution relative to some expert human judge. In particular, we are interested in calculating the k-performance relative to a trained human judge, who, despite falling short of being an ideal discriminator, possesses a variety of tools at their disposal which they can use to discriminate between outputs from the model, and outputs from the true distribution.

In the full report, I show that two intuitive models of human discrimination abilities yield the same result: roughly, that the k-performance of a model with respect to a human discriminator will be some constant multiple of its k-performance with respect to an ideal discriminator, which we can calculate directly. Performing experiments to measure this factor, which I call the human slowdown factor, is a tractable way of reducing uncertainty in this approach to forecasting AI.
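As a trivial extension of the sketch above (the slowdown value here is made up), under this model the k-performance relative to a human judge is just the ideal-discriminator horizon scaled up by the slowdown factor:

```python
def human_horizon(ideal_horizon_tokens, slowdown_factor):
    """k-performance relative to a human judge, modeled as a constant
    multiple of the ideal-discriminator horizon."""
    return slowdown_factor * ideal_horizon_tokens

# e.g., an ideal horizon of ~9 tokens (from the earlier sketch) and a
# hypothetical 10x slowdown gives a horizon on the order of a tweet.
print(human_horizon(9, 10))
```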

When will TAI arrive?

Given an estimate of the slowdown factor of human discrimination abilities, a horizon length k over a transformative task (in our illustrative case, scientific research), and a scaling law in compute for the relevant distribution, it is possible to calculate a distribution over the upper bound of the training compute for TAI, assuming those scaling laws hold.

In the chart below (from Tamay Besiroglu), we can see the relationship between training compute and k-performance under various possible values of the slowdown parameter, according to the Hoffmann et al. compute-optimal scaling law. However, to be clear, the Hoffmann et al. scaling law was derived from internet data and by scaling a particular class of models, which may not be a reliable guide for measuring the difficulty of learning how to do scientific reasoning. Ultimately, better data will be needed on scaling behavior for higher-quality distributions to make this analysis more conclusive.

One way to read this chart is to imagine the horizon length of text over which you think Chinchilla is capable of reliably (~90% of the time) performing coherent reasoning. For example, suppose you thought that Chinchilla was capable of reliably reasoning coherently over tweet-length tasks, but not short-blog-post-length tasks. Then, your estimate of the slowdown parameter would be at least 10x but less than 50x. You can then look at the corresponding lines and extrapolate until they reach the threshold for scientific-manuscript-length tasks. This type of reasoning personally convinced me that a reasonable hard upper bound for training TAI was about 10^40 FLOP, with something between 10^30 and 10^35 FLOP as my central estimate for the training requirements, using 2022 algorithms.
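As a rough sketch of this kind of calculation, the following snippet inverts the compute-optimal form of the illustrative scaling law above to find the training compute at which the reducible loss falls below the target per-token KL implied by a chosen horizon, confidence level, and slowdown factor. Every input value here (the 10,000-token horizon, the 90% confidence, the 100x slowdown) is a placeholder, not an endorsed estimate:

```python
import numpy as np

A, B = 406.4, 410.7        # illustrative Hoffmann et al. fit
alpha, beta = 0.34, 0.28

def optimal_reducible_loss(compute):
    """Reducible loss of the compute-optimal model at a given training compute,
    using the closed-form allocation from Hoffmann et al. and C ~ 6*N*D."""
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    n_opt = G * (compute / 6) ** (beta / (alpha + beta))
    d_opt = (compute / 6) / n_opt
    return A / n_opt**alpha + B / d_opt**beta

# Placeholder inputs: task horizon, judge confidence, human slowdown factor.
k_horizon, confidence, slowdown = 10_000, 0.9, 100
target_kl = slowdown * np.log(confidence / (1 - confidence)) / k_horizon

# Scan over compute until the compute-optimal reducible loss drops below target.
for log10_c in np.arange(24, 60, 0.1):
    if optimal_reducible_loss(10**log10_c) <= target_kl:
        print(f"required training compute ≈ 10^{log10_c:.1f} FLOP")
        break
```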

If we're also given estimates for growth in computing price-performance, willingness to spend, and algorithmic progress, then it is possible to provide a distribution over dates when we expect TAI to arrive.
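As an illustration of that last step, here is a toy sketch in which the effective training compute available in a year is modeled as spending times hardware price-performance times an algorithmic-efficiency multiplier, and we read off the first year it crosses the requirement. Every starting value and growth rate below is a placeholder rather than an estimate from the report, which models uncertainty over these parameters instead of using point values:

```python
# Toy timeline calculation; all parameter values are placeholders.
required_flop = 1e32          # hypothetical training requirement (2022 algorithms)

spend_2022 = 1e8              # training budget in dollars
spend_growth = 1.3            # yearly growth in willingness to spend
flop_per_dollar_2022 = 1e17   # hardware price-performance
price_perf_growth = 1.35      # yearly growth in FLOP per dollar
algo_progress = 1.6           # yearly effective-compute multiplier from algorithms

for year in range(2022, 2101):
    t = year - 2022
    effective_flop = (spend_2022 * spend_growth**t
                      * flop_per_dollar_2022 * price_perf_growth**t
                      * algo_progress**t)
    if effective_flop >= required_flop:
        print(f"compute requirement first crossed in {year}")
        break
```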

You can visit this Google Colab notebook to create a timeline estimate using this approach. Note that I did not tweak these parameters in order to produce a desired result (indeed, I was surprised by the end result). However, I have been updating it as I find better estimates for each parameter.

[Note: originally I didn't put this plot in the blog post because I didn't want it shared everywhere without context. I was convinced to put it in here with the caveat that this is highly preliminary and sensitive to the parameters in the frequently updated notebook. 

Perhaps the most misleading thing right now is that the model does not yet update on the fact that TAI has not already arrived. Thus, it puts non-trivial credence on us already having TAI. See this comment for a longer explanation. Don't take this plot too seriously.]

Alternatively, you can incorporate this approach into Tom Davidson's takeoff model to build a more theoretically grounded timelines model, but I have not done that yet.

Also, it is worth noting that I am trying to model my uncertainty over the underlying parameters, yielding a very uncertain bottom-line result, despite my opinion that this model slightly reduces true uncertainty about the arrival of TAI relative to Bio Anchors.

Comparison to Bio Anchors

A good summary of Bio Anchors can be found in this comment from Rohin Shah.

The Bio Anchors report actually considers a version of the "direct extrapolation" model, but eventually dismisses the idea since it's unclear what metric we should use to measure performance. Cotra wrote,

A very different high-level approach to estimating TAI timelines (which in our experience most people initially gravitate toward) involves more holistically assessing progress in AI systems’ capabilities, rather than leaning heavily on biological anchors. Essentially, this approach is to: 

  1. Judge how “impressive”, “capable”, “general”, or “useful” state-of-the-art (SOTA) AI systems currently are (for example by synthesizing information from various key benchmarks and AI challenges such as performance on board games, Winograd schemas, adversarial examples, etc). 
  2. Assess how quickly the impressiveness of AI systems has been improving recently.
  3. Extrapolate how many years of progress at the current pace would be required to reach the level of impressiveness required for TAI.

I’ll call this approach the subjective impressiveness extrapolation approach, which stands in contrast with the biological anchors framework used in this report. Here is a visualization of a hypothetical TAI timelines forecast using a subjective impressiveness extrapolation approach, where the x-axis is the year (from 2000 to 2100), and the red line represents the holistic “impressiveness” of AI systems in that year (which reaches the level of TAI around ~2065-2070): 

[...]

The most important disadvantage of the subjective impressiveness extrapolation is that it is extremely unclear what exactly the y-axis refers to, and different people will have different intuitions about it.

My alternative is simply to take scaling laws at face value, and try my best to calculate the training loss associated with something like "transformative" or "human-level" abilities. Since many transformative tasks can arguably be translated into language modeling tasks, and the best data we have on neural scaling laws comes from language modeling, it's natural to examine language models, though I'd be excited if someone tried this approach for other modalities too, including RL.

Since it was published, Cotra's report has been subject to a number of critiques. As I pointed out last year, Cotra's final distribution over the compute required for training TAI is extremely broad, spanning over 20 orders of magnitude, making her model relatively uninformative. In addition, her report arguably underestimates our true uncertainty over TAI timelines since it relies on point estimates for algorithmic progress and price-performance declines in hardware, rather than modeling our uncertainty over these parameters.

Furthermore, Cotra's model is a bit complex in some places, requiring considerable guesswork before producing any results. Given the absence of historical precedent for biological-anchor models anticipating AI developments, the lack of model simplicity, and the uninformative nature of the bottom-line results, among other methodological issues, it is worth seeking alternative approaches for modeling TAI timelines.

By contrast, while I think my model still requires lots of guesswork and has massive uncertainty, I think there are several advantages of the Direct Approach relative to Bio Anchors. These include,

  1. It's simpler than Bio Anchors, with fewer free parameters.
  2. The model can be generalized to many tasks, even non-transformative ones, enabling us to actually test the model in the short term, and see if it generates reliable predictions.
  3. We can plausibly greatly reduce our uncertainty over the bottom line distribution through experiment, especially by measuring the human slowdown factor.

That said, I'm also aware of several issues with the Direct Approach, including,

  1. It employs a somewhat dubious interpretation of the cross-entropy loss of language models. In general, it is unclear whether we can readily obtain much information about the abilities of a language model given only its cross-entropy loss on a given task. Ultimately, downstream benchmarking data may be required instead.
  2. It assumes that we can naively extrapolate neural scaling laws over many orders of magnitude. (Note however that Bio Anchors also assumes this).
  3. The model currently lacks confirmation about many key sources of data, without which it is very difficult to use it to make good predictions. For example, I wouldn't put too much trust in the Hoffmann et al. scaling law.
  4. The Direct Approach only produces a soft upper bound over the compute distribution required for TAI, since a more efficient method than simply scaling language models could be employed. This upper bound may be far above the actual requirements, which could render the Direct Approach fairly useless.

Overall, I'm hopeful that the Direct Approach can supplement Bio Anchors as a means of forecasting advanced AI, even if it does not serve as an adequate replacement.

Comments

Nice post, and nice argument! I think I agree that this is a worthy alternative to the Bio Anchors brain-and-genome-size-comparison stuff. I even tentatively agree that it's better overall, though I'd want to think about it more. (When I go on about how great Bio Anchors is, it's not because I'm in love with the brain size comparison--though I used to like it more than I do now--it's because I'm in love with the "useful core" of it, the breakdown into computing price-performance, willingness to spend, algorithmic progress, and compute requirements. Which it seems you also are doing.)

This type of reasoning personally convinced me that a reasonable hard upper bound for training TAI was about 10^40 FLOP, with something between 10^30 to 10^35 FLOP as my central estimate for the training requirements, using 2022 algorithms.

...
If we're also given estimates for growth in computing price-performance, willingness to spend, and algorithmic progress, then it is possible to provide a distribution over dates when we expect TAI to arrive.
...

Alternatively, you can incorporate this approach into Tom Davidson's takeoff model to build a more theoretically grounded timelines model but I have not done that yet.

Tom's model just uses a training requirements variable; it doesn't appeal to all the fancy bio anchors stuff that your method is a viable alternative to. Insofar as you are still using what I consider the "useful core" of the bio anchors model, I think your bottom line numbers for compute requirements can just be straightforwardly plugged into Tom's model.

So I just went and plugged in the values of 10^30 and 10^35 FLOP for the training requirements variable at takeoffspeeds.com. Playing around with it a bit (modifying their preset scenarios) it looks like this gives you somewhere between 2029 and 2044.

Though if you also do what I recommend and increase software returns from 1.25 to 2.5, to be more consistent with the data we have so far about algorithmic progress, the top end of the range cuts down considerably: now you get 2027–2033 as the range.

 

I'm confused about how heterogeneity in data quality interacts with scaling. Surely training an LM on scientific papers would give different results from training it on web spam, but data quality is not an input to the scaling law... This makes me wonder whether your proposed forecasting method might have some kind of blind spot in this regard, for example failing to take into account that AI labs have probably already fed all the scientific papers they can into their training processes. If future LMs train on additional data that have little to do with science, could that keep reducing overall cross-entropy loss (as scientific papers become a smaller fraction of the overall corpus) but fail to increase scientific ability?