TL;DR: We reviewed METR’s “Example evaluation protocol” and found a couple of points that are unclear or leave room for improvement. We also make a couple of suggestions around scoring, outsourcing, etc.

This review was done by 2 people without previous knowledge of METR in ~1.5 days at the end of ML4Good UK in April, followed by limited (yet procrastinated) editing before posting this today.
Overall, we find METR's work very well documented. Our opinions are our own.

Background

A few months ago, METR announced a list of resources for evaluating potentially dangerous autonomous capabilities of frontier models. These include their:

Task suite: a public set of example tasks that involve some kinds of general autonomous capabilities.
Guidelines for capability elicitation: Guidelines that evaluators can follow to reduce the risk of dramatically underestimating model capabilities.
Task standard and “workbench”: A specification for how to define tasks in code, a very basic “workbench” to allow running an agent on a task, and some simple example agents.
Example evaluation protocol: An example protocol for the whole evaluation process, based on our task suite, elicitation protocol, and scoring methods. Includes:
- An explanation of what the evaluations do and do not cover
- Guidelines for how to run your model on the tasks
- An example of how to calculate an overall score
- Guidance on getting a meaningful estimate of capabilities given the potential for elicitation gaps
- A list of checks to run to ensure that the evaluation results are meaningful

Here we review the last one only.

Task Scoring

The example protocol is built around a set of tasks for the model to accomplish. The tasks are grouped into levels by complexity and by the time they would take a human to complete.

METR proposes a way to aggregate a model's scores on (multiple) runs of each task, across all levels, into an overall capability score.

Score aggregation doesn't weigh complex tasks more

The proposed method for aggregating the success rate of each level into a global score gives the same weight to the model managing short, simple tasks and long, complex tasks. The global score is expressed as a total level $L$ or a time horizon $H$ in hours.

$L = \sum_{\text{level} \in \text{levels}} \text{success\_rate}_{\text{level}}$

$H = \frac{4^{L}}{60}$
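
To make the equal weighting concrete, here is a minimal Python sketch of the two formulas above. The function names, the number of levels, and the success rates are our own made-up illustration, not taken from METR's code or data. We also read the division by 60 as a conversion from minutes to hours, which is our interpretation of the formula.

```python
# Minimal sketch of the aggregation formulas above.
# Function names and example success rates are hypothetical,
# chosen for illustration; they are not taken from METR's code.


def total_level(success_rates):
    """Sum the per-level success rates (each in [0, 1]) into a total level L."""
    return sum(success_rates)


def time_horizon_hours(level):
    """Convert a total level L into a time horizon H = 4**L / 60, in hours."""
    return 4 ** level / 60


# Two made-up models over four levels, ordered from easiest to hardest.
solves_easy_levels = [1.0, 1.0, 0.5, 0.0]
solves_hard_levels = [0.0, 0.5, 1.0, 1.0]

for name, rates in [("easy", solves_easy_levels), ("hard", solves_hard_levels)]:
    L = total_level(rates)
    print(f"{name}: L = {L}, H = {time_horizon_hours(L):.2f} hours")
```

Both profiles sum to $L = 2.5$ and therefore get the same $H$ (roughly 0.53 hours), even though one model only succeeds on the easiest levels and the other only on the hardest ones.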

The specific aggregation function over task levels is meant to have limited impact as the protocol aims for models to always have roughly the same distributio...
