This blog sets out our thinking to date on how to design and run third-party evaluations, including key elements to consider and open questions. This is not intended to provide robust recommendations; rather we want to start a conversation in the open about these practices and to learn from others.
We discuss the role of third-party evaluators and what they could target for testing, including which systems to test, when to test them, and which risks and capabilities to test for. We also examine how to evaluate effectively, including which tests to use for which purpose, how to develop robust testing, and how to ensure safety and security.
Most important section, I think:
7. What is needed for effective testing?
We have learned a lot from our approach to evaluations to date, but significant challenges remain and there is substantial progress still to make.
Access
It is clear from our experience that to run high-quality evaluations that elicit high-fidelity information about the potential risks posed by frontier systems, it is important for independent evaluators to have:
Access to a Helpful Only (HO) version of the model alongside the Helpful, Honest and Harmless (HHH) version that will be deployed, the ability to turn trust and safety safeguards on and off, and fine-tuning API access. It is essential to elicit the full capabilities of a model as far as possible in order to evaluate the level of potential risk in a system and the sufficiency of existing mitigations (a sketch of an HO/HHH comparison appears after this list).
Regular technical discussions before and during testing with the teams at the given company who have the most experience evaluating the model/system in question, including information on model capabilities and elicitation techniques, the safeguards that are in place or are intended to be in place for the deployed system, and results from internal evaluations which we can use for calibration.
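To make the HO/HHH point concrete, here is a minimal sketch of what such a comparison might look like: run the same prompts against both variants and measure the gap in how often each attempts the task. The `query_model` function, model identifiers, and compliance grader are hypothetical placeholders, not any lab's actual API.

```python
# Hypothetical sketch: compare elicited capability between a Helpful Only (HO)
# and a safety-trained (HHH) variant of the same model. `query_model` and the
# model identifiers are placeholders, not a real provider API.

from typing import Callable

def compliance_rate(
    query_model: Callable[[str, str], str],  # (model_id, prompt) -> completion
    model_id: str,
    prompts: list[str],
    is_compliant: Callable[[str], bool],     # grader: did the model attempt the task?
) -> float:
    """Fraction of prompts the model attempts rather than refuses."""
    completions = [query_model(model_id, p) for p in prompts]
    return sum(is_compliant(c) for c in completions) / len(prompts)

def elicitation_gap(query_model, prompts, is_compliant) -> float:
    """Gap between the helpful-only and safety-trained variants.

    A large gap suggests that refusals, rather than missing capability, are
    doing the work -- which is exactly what HO access lets evaluators measure.
    """
    hhh = compliance_rate(query_model, "model-hhh", prompts, is_compliant)
    ho = compliance_rate(query_model, "model-helpful-only", prompts, is_compliant)
    return ho - hhh
```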
Testing window
It is important that we have sufficient time to conduct a rigorous evaluation of model capabilities, to provide quality assurance of results, and to report back to frontier AI labs in advance of model deployment, in line with the testing tiers outlined in section 4. An option we are considering is to focus the bulk of testing on “proxy” models, which are available earlier and sufficiently similar to the model that will be deployed. How to structure such evaluations and validate results, and how to define similarity precisely, are open scientific questions.
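As one very rough illustration of how proxy-model validation might work, the sketch below runs the same evaluation suite on both the proxy and the deployed model and bounds the per-task score gap. The task names, scores, and tolerance are invented for illustration; choosing the right similarity measure is precisely the open question noted above.

```python
# Hypothetical sketch: one rough check on whether a "proxy" model is similar
# enough to the model that will actually be deployed -- run the same eval suite
# on both and bound the per-task score gap. Task names, scores, and the 0.05
# tolerance are illustrative only, not a proposed standard.

proxy_scores    = {"cyber_uplift": 0.41, "bio_qna": 0.28, "agentic_tasks": 0.15}
deployed_scores = {"cyber_uplift": 0.44, "bio_qna": 0.30, "agentic_tasks": 0.19}

TOLERANCE = 0.05  # arbitrary for this sketch; choosing it well is an open question

gaps = {task: abs(proxy_scores[task] - deployed_scores[task]) for task in proxy_scores}
worst_task, worst_gap = max(gaps.items(), key=lambda kv: kv[1])

print(f"largest per-task gap: {worst_gap:.2f} on '{worst_task}'")
print("proxy looks acceptable" if worst_gap <= TOLERANCE
      else "proxy diverges; re-test on the deployed model")
```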
Working together
It is important that evaluators can protect the integrity of their evaluations, for example through non-logging guarantees. We need to work with companies to better understand the appropriate and necessary level of information to share that enables trust in independent evaluation results, and the implementation of effective mitigations, without revealing our full testing methodology.
Additionally, we need to establish a shared framework and standard for risk and capability thresholds (as mentioned above), and for what these thresholds entail in terms of mitigation expectations. Relatedly, we are developing our understanding of how to disclose findings to a developer where these might relate to potential national security risks.
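For intuition only, here is a sketch of the kind of shared "threshold to expected mitigations" mapping such a framework might standardise. Every tier name, trigger score, and mitigation listed is an invented placeholder to show the structure, not a proposed standard.

```python
# Hypothetical sketch of a shared "threshold -> expected mitigations" mapping.
# All tier names, trigger scores, and mitigations below are invented
# placeholders to illustrate the structure only.

from dataclasses import dataclass

@dataclass
class Threshold:
    name: str
    eval_score_trigger: float        # eval score at or above which the tier applies
    expected_mitigations: list[str]

RISK_TIERS = [
    Threshold("baseline", 0.0, ["standard refusal training"]),
    Threshold("elevated", 0.3, ["targeted safeguards", "third-party re-testing"]),
    Threshold("critical", 0.6, ["deployment pause pending mitigations review"]),
]

def expected_mitigations(score: float) -> list[str]:
    """Mitigations expected for the highest tier that a given eval score triggers."""
    applicable = [t for t in RISK_TIERS if score >= t.eval_score_trigger]
    return max(applicable, key=lambda t: t.eval_score_trigger).expected_mitigations
```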
Related: Model evals for dangerous capabilities [self-promotion].