Forecasting future gains due to post-training enhancements

elifland; Joel Becker; simeon_c

This is a linkpost for https://docs.google.com/document/d/19VL31R_0oZUGe7aE5OD5njX2l46fXgpgL6HvTrAJVZI/edit

This work has been done in the context of SaferAI’s work on risk assessment. Equal contribution by Eli and Joel. I'm sharing this writeup in the form of a Google Doc and reproducing the summary below.

Disclaimer: this writeup is context for upcoming experiments, not completed work. As such, it contains a lot of (not always well-justified) guesswork and untidy conceptual choices. We are publishing now despite this to get feedback.

If you are interested in this work — perhaps as a future collaborator or funder, or because this work could provide helpful input into e.g. risk assessments or RSPs — please get in touch with us at joel@qallys.com and/or simeon@safer-ai.org.

Summary

A recent report documented how the performance of AI models can be improved after training, via post-training enhancements (PTEs) such as external tools, scaffolding, and fine-tuning. The gain from a PTE is measured as a compute-equivalent gain (CEG): the factor by which training compute would need to be multiplied for the model without the PTE to match the performance of the model combined with the PTE.
We are interested in understanding the contribution that PTEs make to AI system capabilities over time.
1. This question is in turn motivated by SaferAI’s work on quantitative risk assessments of frontier models. In particular, any risk assessment of open-sourcing models, or of closed-source models being stolen or leaked, should take PTEs into account: the capabilities of a system built on top of a given base model will increase over time as PTEs are added to it.
We extend a recent analysis of PTEs in order to understand the trend in CEG over time, arriving at very rough estimates for the rate of improvement of PTEs. Our primary takeaways are that current data is insufficient and experiments are needed to better forecast the effects of PTEs, as described below.
There are serious limitations in our preliminary analysis, including: problems with the CEG metric, many uninformed parameter estimates, and reliance on an ill-defined “average task”.
High-priority future work includes running experiments to get more evidence on important uncertainties for our forecasts of capability gains due to PTEs. In particular, we think it will be important to understand how well different PTEs combine, as well as to directly study performance on benchmarks relevant to dangerous capabilities rather than relying on the CEG and average task abstractions.
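To make the CEG definition above concrete, here is a minimal sketch of how a CEG could be read off a compute-versus-performance curve. It is an illustration rather than the method used in the writeup: the function name `compute_equivalent_gain`, the interpolation in log-compute, and all of the numbers are hypothetical assumptions.

```python
import math

def compute_equivalent_gain(scaling_points, base_compute, enhanced_score):
    """Estimate a CEG from a curve of (training_compute, score) points.

    Assumption (not from the writeup): between neighbouring points, the score
    of models without the PTE is linear in log10(training compute).
    """
    pts = sorted(scaling_points)
    for (c_lo, s_lo), (c_hi, s_hi) in zip(pts, pts[1:]):
        if s_hi > s_lo and s_lo <= enhanced_score <= s_hi:
            # Fraction of the way from the lower to the upper point.
            frac = (enhanced_score - s_lo) / (s_hi - s_lo)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            # CEG = compute needed without the PTE / compute actually used.
            return 10 ** log_c / base_compute
    raise ValueError("enhanced_score lies outside the range of the curve")

# Hypothetical numbers: scores of models trained without any PTE.
curve = [(1e22, 40.0), (1e23, 50.0), (1e24, 60.0)]

# A 1e23-FLOP model scores 50 on its own and 55 with a PTE applied.
ceg = compute_equivalent_gain(curve, base_compute=1e23, enhanced_score=55.0)
print(round(ceg, 2))  # about 3.16, i.e. the PTE is worth roughly 3x training compute
```

Under these made-up numbers, the PTE moves the model halfway (in score) towards a model trained with 10x the compute, so the CEG is 10^0.5. The result depends heavily on the assumed shape of the curve, which is one reason the CEG metric is hard to pin down.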

In this write-up, we will:

Outline our methodology.
Present CEG estimates for various PTEs.
Aggregate total CEG, using subjective estimates of ‘composability.’
Note limitations of our analysis and important future work.