Third-party testing as a key ingredient of AI policy

Zac Hatfield-Dodds

This is a linkpost for https://www.anthropic.com/news/third-party-testing

(nb: this post is written for anyone interested, not specifically aimed at this forum)

We believe that the AI sector needs effective third-party testing for frontier AI systems. Developing a testing regime and associated policy interventions based on the insights of industry, government, and academia is the best way to avoid societal harm—whether deliberate or accidental—from AI systems.

Our deployment of large-scale, generative AI systems like Claude has shown us that work is needed to set up the policy environment to respond to the capabilities of today’s most powerful AI models, as well as those likely to be built in the future. In this post, we discuss what third-party testing looks like, why it’s needed, and describe some of the research we’ve done to arrive at this policy position. We also discuss how ideas around testing relate to other topics on AI policy, such as openly accessible models and issues of regulatory capture.

Policy overview

Today’s frontier AI systems demand a third-party oversight and testing regime to validate their safety. In particular, we need this oversight for understanding and analyzing model behavior relating to issues like election integrity, harmful discrimination, and the potential for national security misuse. We also expect more powerful systems in the future will demand deeper oversight - as discussed in our ‘Core views on AI safety’ post, we think there’s a chance that today’s approaches to AI development could yield systems of immense capability, and we expect that increasingly powerful systems will need more expansive testing procedures. A robust, third-party testing regime seems like a good way to complement sector-specific regulation as well as develop the muscle for policy approaches that are more general as well.

Developing a third-party testing regime for the AI systems of today seems to give us one of the best tools to manage the challenges of AI today, while also providing infrastructure we can use for the systems of the future. We expect that ultimately some form of third-party testing will be a legal requirement for widely deploying AI models, but designing this regime and figuring out exactly what standards AI systems should be assessed against is something we’ll need to iterate on in the coming years - it’s not obvious what would be appropriate or effective today, and the way to learn that is to prototype such a regime and generate evidence about it.

An effective third-party testing regime will:

Give people and institutions more trust in AI systems
Be precisely scoped, such that passing its tests is not so great a burden that small companies are disadvantaged by them
Be applied only to a narrow set of the most computationally-intensive, large-scale systems; if implemented correctly, the vast majority of AI systems would not be within the scope of such a testing regime
Provide a means for countries and groups of countries to coordinate with one another via developing shared standards and experimenting with Mutual Recognition agreements

Such a regime will have the following key ingredients ^[1]:

Effective and broadly-trusted tests for measuring the behavior and potential misuses of a given AI system
Trusted and legitimate third-parties who can administer these tests and audit company testing procedures

Why we need an effective testing regime

This regime is necessary because frontier AI systems—specifically, large-scale generative models that consume substantial computational resources—don’t neatly fit into the use-case and sector-specific frameworks of today. These systems are designed to be ‘everything machines’ - Gemini, ChatGPT, and Claude can all be adapted to a vast number of downstream use-cases, and the behavior of the downstream systems always inherits some of the capabilities and weaknesses of the frontier system it relies on.

These systems are extremely capable and useful, but they also present risks for serious misuse or AI-caused accidents. We want to help come up with a system that greatly reduces the chance of major misuses or accidents caused by AI technology, while still allowing for the wide deployment of its beneficial aspects. In addition to obviously wanting to prevent major accidents or misuse for its own sake, major incidents are likely to lead to extreme, knee-jerk regulatory actions, leading to a 'worst of both worlds' where regulation is both stifling and ineffective. We believe it is better for multiple reasons to proactively design effective and carefully thought through regulation.

Systems also have the potential to display emergent, autonomous behaviors which could lead to serious accidents - for instance, systems might insert vulnerabilities into code that they are asked to produce or, when asked to carry out a complex task with many steps, carry some actions which contradict human intentions. Though these kinds of behaviors are inherently hard to measure, it’s worth developing tools to measure for them today as insurance against these manifesting in widely deployed systems.

At Anthropic, we’ve implemented self-governance systems that we believe should meaningfully reduce the risk of misuse or accidents from the technologies we’ve developed. Our main approach is our Responsible Scaling Policy (RSP), which commits us to testing our frontier systems, like Claude, for misuses and accident risks, and to deploy only models that pass our safety tests. Multiple other AI developers have subsequently adopted or are adopting frameworks that bear a significant resemblance to Anthropic's RSP.

However, although Anthropic is investing in our RSP (and other organizations are doing the same), we believe that this type of testing is insufficient as it relies on self-governance decisions made by single, private sector actors. Ultimately, testing will need to be done in a way which is broadly trusted, and it will need to be applied to everyone developing frontier systems. This type of industry-wide testing approach isn’t unusual - most important sectors of the economy are regulated via product safety standards and testing regimes, including food, medicine, automobiles, and aerospace.

What would a robust testing regime look like?

A robust third-party testing regime can help identify and prevent the potential risks of AI systems. It will require:

A shared understanding across industry, government, and academia of what an AI safety testing framework looks like - what it should and shouldn’t include
An initial period where companies complete practice runs of implementing such testing, sometimes with third-party oversight, to make sure the tests work, are feasible to run, and can be validated by a third party
A two-stage testing regime: there should be a very fast, automated testing stage that companies apply to their systems. This stage should cover a wide area and be biased towards avoiding false negatives. If this stage spots potential problems, there should be a more thorough secondary test, likely using expert human-led elicitation
Increased resources to the parts of government that will oversee and validate tests - building and analyzing tests is detailed, expensive, technical work, so governments will need to find a way to fund the entities that do this
A carefully scoped set of mandated tests - we’ll need specific, legally mandated tests where it becomes clear there are poor incentives for industry self-governance, and the benefits of public safety from government oversight outweigh the regulatory burdens. We should ensure this is a well scoped, small set of tests, or else we’ll create regulatory burdens and increase the possibility of regulatory capture
An effective balance of the assurance of safety with ease of administration of these tests

When it comes to tests, we can already identify one area today where testing by third-parties seems helpful and draws on the natural strengths of governments: national security risks. We should identify a set of AI capabilities that, if misused, could compromise national security, then test our systems for these capabilities. Such capabilities might include the ability to meaningfully speed up the creation of bioweapons or to carry out complex cyberattacks. (If systems are capable of this, then that would lead to us changing how we deployed the model - e.g, remove certain capabilities from broadly deployed models and/or gate certain model capabilities behind ‘know your customer’ regimes, and ensuring relevant government agencies were aware we had systems with these capabilities.) We expect there are several areas where society will ultimately demand there be legitimate, third-party testing approaches, and national security is just one of them.

When it comes to the third party doing the testing, there will be a multitude of them and the tests will be carried out for different reasons, which we outline here:

Private companies: Companies may subcontract other companies to build tests and evaluations for their systems, as we have done with firms like Gryphon Scientific. We can also imagine companies doing tests for other companies where the tests are mandated by law but not carried out by government agencies, similar to how accounting firms audit the books of private companies.
Universities: Today, many researchers at many academic institutions have free or subsidized access to models developed by AI labs; in the future, we could imagine some of these research institutions administering their own testing initiatives, some of which may be supervised or elevated via government bodies.
Governments: Some tests (we suspect, a relatively small number) may be mandated by law and carried out by government actors - for instance, for testing for national security misuses of AI systems. Here, government agencies may carry out the tests directly.

Ultimately, we expect that third-party testing will be accomplished by a diverse ecosystem of different organizations, similar to how product safety is achieved in other parts of the economy today. Because broadly commercialized, general purpose AI is a relatively new technology, we don’t think the structure of this ecosystem is clear today and it will become clearer through all the actors above running different testing experiments. We need to start working on this testing regime today, because it will take a long time to build.

We believe that we - and other participants in AI development - will need to run multiple testing experiments to get this right. The stakes are high: if we land on an approach that doesn’t accurately measure safety but is easy to administer, we risk not doing anything substantive or helpful. If we land on an approach that accurately measures safety but is hard to administer, we risk creating a testing ecosystem that favors companies with greater resources and thus reduces the ability for smaller actors to participate.

How Anthropic will support fair, effective testing regimes

In the future, Anthropic will carry out the following activities to support governments in the development of effective third-party testing regimes:

Prototyping a testing regime via implementing our RSP and sharing what we learn
Testing third-party assessment of our systems via contractors and government partners
Deepening our frontier red teaming work to give us and the broader sector a clearer sense of the risks of AI systems and their mitigations
Advocating for governments to fund the agencies and organizations that could help to develop an effective third-party testing regime (e.g, in the United States, NIST, the US AI Safety Institute, the National AI Research Resource, the usage of DoE supercomputers for AI testing, and so on)
Encouraging governments to build their own ‘National Research Clouds’ (like the National AI Research Resource in the US) so that they can a) develop independent capacity in academia and government to build, study, and test frontier AI systems, and b) work on the science of evaluating AI systems, including those developed by private companies like Anthropic

Developing a testing regime and associated policy interventions based on the insights of industry, government, and academia is the best way to avoid societal harm—whether deliberate or accidental—from AI systems.

How testing connects to our broader policy priorities

Our overarching policy goal is to have appropriate oversight of the AI sector. We believe this will mostly be achieved via there being an effective ecosystem for third-party testing and evaluation of AI systems. Here are some AI policy ideas you can expect to see us advocating for in support of that:

Greater funding for AI testing and evaluation in government

Effective testing and evaluation procedures are a necessary prerequisite of any effective form of AI policy. We think that governments should stand up and support institutions that develop AI evaluations, as well as bringing together industry, academia, and other stakeholders to agree on standards for the safety of AI systems. In the US, we specifically advocate for greater funding for NIST.

Support greater evaluation of AI systems through public sector infrastructure for doing AI research

We urgently need to increase the number and breadth of people working to test and evaluate AI systems, for both current and future risks. It’s therefore crucial that governments create experimental infrastructure to help academic researchers test out and evaluate frontier AI systems, and develop their own frontier systems for beneficial purposes. For more information, see our support for a US national research cloud via the CREATE AI Act, and our written Senate testimony.

Developing tests for specific, national security-relevant capabilities

We should know if AI systems can be used in ways that strengthen or (if fielded by another entity) weaken national security. Whereas the private sector and academia can develop the vast majority of tests, some testing and evaluation questions relate to national security capabilities which are classified, so only certain governments are able to effectively evaluate them. Therefore, we want to support US Government efforts to develop ways of testing AI systems for national security-relevant capabilities. We will also continue our own work to better understand the capabilities of our own systems.

Scenario planning and test development for increasingly advanced systems

Our Responsible Scaling Policy is designed to frontload work about evaluating and testing future, hypothetical capabilities of AI systems. This is to ensure we have the relevant tests to better assess and minimize accident and misuse risks from increasingly powerful AI systems. But we don’t claim that our RSP delineates all the tests that need to be run on increasingly powerful models. As AI advances driven by growing computational power increase, a broader set of actors should work to anticipate the future capabilities of AI systems, and develop tests for them.

Aspects of AI policy we believe are important to discuss

While developing our policy approach, we’ve also found ourselves returning again and again to a few specific issues such as openly accessible models and regulatory capture. We’ve outlined our current policy thinking below but recognize these are complicated issues where people often disagree.

Openly-disseminated and/or open-source models: Science moves forward largely due to a culture of openness and transparency around research. This is especially true in AI, where much of the currently-unfolding revolution is built on the open publication of research and models like the Transformer, BERT, Vision Transformers, and so on. There is also a long history of open source and openly accessible systems increasing the robustness of the security environment by helping a greater number of people experiment with technologies and identify their potential weaknesses.

We believe that the vast majority of AI systems today (perhaps even all of them) are safe to openly disseminate and will be safe to broadly disseminate in the future. However, we believe in the future it may be hard to reconcile a culture of full open dissemination of frontier AI systems with a culture of societal safety.

If — and ‘if’ is a key and unresolved point — increasingly capable AI models can lead to detrimental effects, or hold the possibility of catastrophic accidents, then we’ll need to adjust the norms of what is openly disseminated at the frontier.

Specifically, we’ll need to ensure that AI developers release their systems in a way that provides strong guarantees for safety - for example, if we were to discover a meaningful misuse in our model, we might put in place classifiers to detect and block attempts to elicit that misuse, or we might gate the ability to finetune a system behind a ‘know your customer’ rule along with contractual obligations to not finetune towards a specific misuse. By comparison, if someone wanted to openly release the weights of a model which was capable of the same misuse, they would need to both harden the model against that misuse (e.g, via RLHF or RLHAIF training) and find a way to make this model resilient to attempts to fine-tune it onto a dataset that would enable this misuse. We will also need to experiment with disclosure processes, similar to how the security community has developed norms around pre-notification of disclosures of zero days.

Though what we’ve described is inherently very costly we also believe it is necessary - we must do everything we can to avoid AI systems enabling significant misuses or causing major accidents. But carrying out any restrictions on the open dissemination of AI systems depends on there being broad agreement on what unacceptable misuses of AI systems or system behaviors are.

Anthropic is not an impartial actor here - we are a company that primarily develops proprietary systems, and we don’t have the legitimacy to make claims here about what should or shouldn’t be acceptable in openly disseminated systems. Therefore, to resolve questions of open source models we need legitimate third parties to develop testing and evaluation approaches that are broadly accepted as legitimate, we need these third parties (or other trusted entities) to define a narrow and serious set of misuses of AI systems as well as adverse AI system behaviors, and we will need to apply these tests to models that are both controlled (e.g., via API) or openly disseminated (e.g., via the weights being released).

Third party testing of openly disseminated and closed proprietary models can generate the essential information we need to understand the safety properties of the AI landscape ^[2]. If we don’t do this, then you could end up in a situation where either a proprietary model or openly accessible model directly enables a serious misuse or causes a major AI accident - and if that happens, there could be significant harm to people and also likely adverse regulations applied to the AI sector.

Regulatory capture: Any form of policy can suffer regulatory capture by a sufficiently motivated and well-resourced actor: for example, a well-capitalized AI company. Some of the ideas we discuss above about openly accessible models are the kinds of things which themselves are prone to regulatory capture. It’s important that the AI ecosystem remains robust and competitive - AI is a complicated field and humanity’s best chance at getting it right likely comes from there being a diverse, broad set of actors engaged in its development and oversight.

We generally advocate for third-party testing and measurement initiatives because they seem like the kind of policy infrastructure that helps us to identify and prevent concrete harms as well as building capacity that exists independently of large companies. Therefore, we think that focusing on the development of third-party testing capacity can reduce the risk of regulatory capture and create a level playing field for developers. Conversely, industry-led consortia might have a tendency to favor approaches that involve high compliance costs on the parts of companies regardless of their scale - an approach that inherently advantages larger businesses which can spend more money on policy compliance.

Why we’re being careful in what we advocate for in AI policy

When developing our policy positions, we assume that regulations tend to create an administrative burden both for the party that enforces the regulation (e.g, the government), and for the party targeted by the regulation (e.g, AI developers). Therefore, we should advocate for policies that are both practical to enforce and feasible to comply with. We also note that regulations tend to be accretive - once passed, regulations are hard to remove. Therefore, we advocate for what we see as the ‘minimal viable policy’ for creating a good AI ecosystem, and we will be open to feedback.

Why AI policy is important

The AI systems of today and those of the future are immensely powerful and are capable of yielding great benefits to society. We also believe these systems have the potential for non-trivial misuses, or could cause accidents if implemented poorly. Though the vast majority of our work is technical in nature, we’ve come to believe that testing is fundamental to the safety of our systems - it’s not only how we better understand the capabilities and safety properties of our own models, but also how third-parties can validate claims we make about AI systems.

We believe that building out a third-party testing ecosystem is one of the best ways for bringing more of society into the development and oversight of AI systems. We hope that by publishing this post we’ve been able to better articulate the benefits of third-party testing as well as outline our own position for others to critique and build upon.

Some countries may also experiment with ‘regulatory markets’ where AI developers can buy and sell AI testing services and compete with one another to try to build and deploy successively safer, more useful systems. ↩︎
For example, if you openly release an AI model, it’s relatively easy for a third-party to fine-tune that model on a dataset of their own choosing. Such a dataset could be designed to optimize for a misuse (e.g, phishing or offensive hacking). If you were able to develop technology that made it very hard to fine-tune an AI model away from its original capability distribution, then it’d be easier to confidently release models without potentially compromising on downstream safety. ↩︎

[-]Stephen Casper1mo10

Thanks for the useful post. There are a lot of things to like about this. Here are a few questions/comments.

First, I appreciate the transparency around this point.

we advocate for what we see as the ‘minimal viable policy’ for creating a good AI ecosystem, and we will be open to feedback.

Second, I have a small question. Why the use of the word "testing" instead of "auditing"? In my experience, most of the conversations lately about this kind of thing revolve around the term "auditing."

Third, I wanted to note that this post does not talk about access levels for testers and ask why this is the case. My thoughts here relate to this recent work. I would have been excited to see (1) a commitment to providing auditors with methodological details, documentation, and data or (2) a commitment to working on APIs and SREs that can allow for auditors to be securely given grey- and white-box access.

AI ALIGNMENT FORUM
AF

Third-party testing as a key ingredient of AI policy

9

Policy overview

An effective third-party testing regime will:

Such a regime will have the following key ingredients ^[1]:

Why we need an effective testing regime

What would a robust testing regime look like?

How Anthropic will support fair, effective testing regimes

How testing connects to our broader policy priorities

Greater funding for AI testing and evaluation in government

Support greater evaluation of AI systems through public sector infrastructure for doing AI research

Developing tests for specific, national security-relevant capabilities

Scenario planning and test development for increasingly advanced systems

Aspects of AI policy we believe are important to discuss

Why we’re being careful in what we advocate for in AI policy

Why AI policy is important

9

Third-party testing as a key ingredient of AI policy

9

Policy overview

An effective third-party testing regime will:

Such a regime will have the following key ingredients [1]:

Why we need an effective testing regime

What would a robust testing regime look like?

How Anthropic will support fair, effective testing regimes

How testing connects to our broader policy priorities

Greater funding for AI testing and evaluation in government

Support greater evaluation of AI systems through public sector infrastructure for doing AI research

Developing tests for specific, national security-relevant capabilities

Scenario planning and test development for increasingly advanced systems

Aspects of AI policy we believe are important to discuss

Why we’re being careful in what we advocate for in AI policy

Why AI policy is important

9

Such a regime will have the following key ingredients ^[1]: