While you can make a lot of progress in evals with tinkering and paying little attention to the literature, we found that various other papers have saved us many months of research effort. The Apollo Research evals team thus compiled a list of what we felt were important evals-related papers. We likely missed some relevant papers, and our recommendations reflect our personal opinions.
Contains detailed descriptions of multiple LM agent evals across four categories. Also explores new methodologies for estimating evals success probabilities.
We think it is the best “all-around” evals paper, i.e. it gives the clearest picture of what frontier LM agent evals look like.
We tested the calibration of their new methodologies in practice in Hojmark et al., 2024, and found that they are not well-calibrated (disclosure: Apollo involvement).
They show that a low-rank decomposition of models’ capabilities can be recovered from observed benchmark performance, and that it can be used to predict the performance of bigger models in the same family (a toy sketch of the idea follows below).
Marius: I think this is the most exciting “science of evals” paper to date. It made me more optimistic about predicting the performance of future models on individual tasks.
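To make the low-rank idea concrete, here is a toy sketch with made-up scores and compute values (not the paper’s data or exact method): decompose a model × benchmark score matrix with SVD, fit the leading capability component against log-compute, and extrapolate to a larger model in the same family.

```python
# Toy sketch of a low-rank capability decomposition (illustrative numbers only).
import numpy as np

# Rows: models of one family ordered by scale; columns: benchmark accuracies in [0, 1].
scores = np.array([
    [0.42, 0.31, 0.55, 0.28],  # small model
    [0.55, 0.44, 0.63, 0.39],  # medium model
    [0.68, 0.58, 0.72, 0.52],  # large model
])
log_compute = np.log10([1e21, 1e22, 1e23])  # assumed training compute per model

# Rank-1 decomposition: the leading component acts as a single latent "capability" score.
centered = scores - scores.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
capability = U[:, 0] * S[0]   # one capability value per model
loadings = Vt[0]              # how strongly each benchmark loads on that capability

# Fit capability against log-compute and extrapolate to a hypothetical ~1e24 FLOP model.
slope, intercept = np.polyfit(log_compute, capability, deg=1)
predicted = scores.mean(axis=0) + (slope * 24 + intercept) * loadings
print("Predicted benchmark scores for the next model:", np.clip(predicted, 0, 1).round(2))
```

The actual paper is considerably more careful (many more models and benchmarks, and a more careful fitting procedure), but shared low-dimensional structure across benchmarks is the core idea.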
Describes the training procedure of the Llama 3.1 family in detail.
We think this is the most detailed description of how state-of-the-art LLMs are trained to date, and it provides a lot of context that is helpful background knowledge for any kind of evals work.
Shows how to use LLMs to automatically create large evals datasets. Creates 154 benchmarks on different topics. We think this idea has been highly influential and thus highlight the paper.
The original paper used Claude-0.5 to generate the datasets, meaning the resulting data is not very high quality. Also, the methodology section of the paper is much more confusingly written than it needs to be.
For an improved methodology and pipeline for model-written evals, see Dev et al., 2024 or ARENA chapter 3.2 (disclosure: Apollo involvement).
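To make the model-written evals idea above concrete, here is a heavily simplified sketch of a generate-then-filter loop. The model names, prompts, and quality threshold are our own illustrative choices rather than the paper’s setup, and the calls follow the OpenAI Python SDK.

```python
# Simplified model-written evals loop: one model generates candidate eval questions,
# a second pass filters out low-quality ones. Prompts, models, and the threshold are
# illustrative assumptions, not the paper's actual setup.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder generator/filter model

def generate_question(behavior: str) -> str:
    """Ask the generator model for one two-choice question probing `behavior`."""
    prompt = (
        f"Write one two-choice (A/B) question that tests whether an AI assistant exhibits "
        f"{behavior}. End with the line 'Matching behavior: A' or 'Matching behavior: B'."
    )
    resp = client.chat.completions.create(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def passes_filter(question: str) -> bool:
    """Have a model rate the question 1-10 for clarity; keep only high-scoring ones."""
    prompt = (
        "Rate this eval question from 1 to 10 for clarity and unambiguity. "
        f"Reply with only the number.\n\n{question}"
    )
    resp = client.chat.completions.create(model=MODEL, messages=[{"role": "user", "content": prompt}])
    try:
        return int(resp.choices[0].message.content.strip()) >= 8
    except ValueError:
        return False

dataset = [q for q in (generate_question("sycophancy") for _ in range(20)) if passes_filter(q)]
```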
Introduces LM agent evals for model autonomy. It is the first paper to rigorously evaluate LM agents for risks related to loss of control and is thus worth highlighting.
We recommend reading the Appendix as a starting point for understanding agent-based evaluations.
Designs an automated, LLM-based emulation environment for running agent evaluations in a sandbox (a simplified sketch of the emulation pattern follows below).
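The core pattern, roughly, is to have a second LLM play the role of the tools and environment instead of executing real side effects. Below is a hypothetical sketch of that pattern, not the paper’s implementation; the prompt and the `emulate_tool_call` helper are made up, and the calls follow the OpenAI Python SDK.

```python
# Hypothetical sketch of LLM-based tool emulation: an "emulator" model produces a
# plausible tool output instead of the tool actually running (no real side effects).
from openai import OpenAI

client = OpenAI()

def emulate_tool_call(tool_name: str, tool_args: dict) -> str:
    """Return a plausible, emulated result for a tool call without executing it."""
    prompt = (
        "You are emulating a software tool inside a sandboxed agent evaluation. "
        f"Tool: {tool_name}. Arguments: {tool_args}. "
        "Respond only with a realistic tool output; include realistic failure modes when appropriate."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder emulator model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# The agent under evaluation sees this as if the tool had really run:
observation = emulate_tool_call("send_email", {"to": "alice@example.com", "body": "Quarterly report attached."})
```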
Opinion: We think it is important to understand how LM agents are being built. However, we recommend that most evaluators (especially individuals) should not spend a lot of time iterating on different scaffolding and instead use whatever the public state-of-the-art is at that time (e.g. AIDER). Otherwise, it can turn into a large time sink, and frontier AI companies likely have better internal agents anyway.
Open source evals library designed and maintained by UK AISI and spearheaded by JJ Allaire, who intends to develop and support the framework for many years.
Supports a wide variety of eval types, including multiple-choice benchmarks and LM agent settings; a minimal example task is sketched below.
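For orientation, here is roughly what a minimal Inspect task can look like (a sketch based on the library’s public documentation; exact APIs, scorer choice, and the model identifier may differ across versions).

```python
# Minimal sketch of an Inspect eval; details may vary between Inspect versions.
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def capital_cities():
    return Task(
        dataset=[Sample(input="What is the capital of France?", target="Paris")],
        solver=generate(),   # sample a completion from the model under evaluation
        scorer=includes(),   # check whether the target string appears in the output
    )

# Run against a model (identifier shown is just an example):
eval(capital_cities(), model="openai/gpt-4o-mini")
```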
Argues that behavioral-only evaluations might have a hard time catching deceptively aligned systems. Thus, we need understanding-based evals that e.g. involve white-box tools.
Marius: This aligns very closely with Apollo’s agenda, so obviously we love that post.
Argues that many safety benchmarks are correlated with capabilities. Therefore, progress on these benchmarks cannot simply be attributed to improvements in safety techniques.
Marius: I think the idea is great, though I would expect many of the authors of the safety benchmarks selected in the paper to agree that their benchmarks are entangled with capabilities. I also think the assumption that a safety benchmark must be unrelated to capabilities is false, since some of our worries stem precisely from increased capabilities. Nevertheless, I think it’s good for future authors to make explicit how correlated their benchmarks are with general capabilities (a toy version of this check is sketched below).
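The paper’s core measurement is easy to approximate in spirit: across a set of models, check how strongly a safety benchmark’s scores correlate with a general capability score. A toy version with made-up numbers:

```python
# Toy version of the capabilities-correlation check (all numbers are made up).
import numpy as np

capability_score = np.array([35.0, 48.0, 61.0, 70.0, 82.0])  # e.g. an aggregate capability index per model
safety_benchmark = np.array([0.41, 0.50, 0.58, 0.66, 0.74])  # the same models' scores on a "safety" benchmark

r = np.corrcoef(capability_score, safety_benchmark)[0, 1]
print(f"Correlation with capabilities: r = {r:.2f}")
# A correlation near 1 suggests the benchmark largely tracks general capabilities
# rather than measuring a distinct safety property.
```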
Marius: I like the direction, but I think Ethan’s approach undervalues theoretical insight and the value of “thinking for a day before running an experiment,” e.g. to realize which experiments you don’t even need to run.
A large collection of qualitative and quantitative experiments with GPT-4. It is not especially rigorous and emphasizes breadth over depth, but it is good for building intuitions about basic tests for investigating model reasoning.
Argues that model distillation is less successful than many people think.
Marius: I’d assume that distillation has limitations but also that their setup is not optimal, and thus, the ceiling for distillation is higher than what they find.
Specifies if-then commitments in which specific events, e.g. evals crossing capability thresholds, trigger concrete responses, e.g. enhanced cybersecurity, that Anthropic commits to uphold.
Specifies if-then commitments in which specific events, e.g. evals crossing capability thresholds, trigger concrete responses, e.g. enhanced cybersecurity, that OpenAI commits to uphold.
Specifies if-then commitments in which specific events, e.g. evals crossing capability thresholds, trigger concrete responses, e.g. enhanced cybersecurity, that GDM commits to uphold.
Discusses where and how AI agents are likely to be used, and then introduces various ideas for how society can keep track of what these agents are doing and how they are doing it.
Defines a framework for thinking about AI regulation by backchaining from risks through the entire development pipeline to identify causal drivers and suggest potential mitigation strategies.
Discusses the limitations of black-box auditing and proposes grey-box and white-box evaluations as improvements.
Disclosure: Apollo involvement
Our favorite papers
Other evals-related publications
LM agents
Core:
Other:
Benchmarks
Core:
Other:
Science of evals
Core:
Other:
Software
Core:
Other:
Miscellaneous
Core:
Other:
Related papers from other fields
Red teaming
Core:
Other:
Scalable oversight
Core:
Other:
Scaling laws & emergent behaviors
Core:
Other:
Science tutorials
Core:
Other:
LLM capabilities
Core:
Other:
LLM steering
RLHF
Core:
Other:
Supervised Finetuning/Training & Prompting
Core:
Other:
Fairness, bias, and accountability
AI Governance
Core:
Other:
Contributions
The first draft of the list was based on a combination of various other reading lists that Marius Hobbhahn and Jérémy Scheurer had previously written. Marius wrote most of the final draft with detailed input from Jérémy and high-level input from Mikita Balesni, Rusheb Shah, and Alex Meinke.