Frontier Model Security

Vaniver

As the capabilities of frontier artificial intelligence models continue to increase rapidly, ensuring the security of these systems has become a critical priority. In our previous posts, we’ve focused on Anthropic’s approach to safety, and Claude’s capabilities and applications. In this post, we are sharing some of the steps we are taking to ensure our models are developed securely. We hope to advance public discussion about how all labs can deploy top models securely, as well as share recommendations for government regulatory approaches that encourage adoption of strong cybersecurity practices. Below we discuss some of our recommendations for cybersecurity best practices, which Anthropic itself is in the process of implementing.

Summary

Future advanced AI models have the potential to upend economic and national security affairs within and among nation-states. Given the strategic nature of this technology, frontier AI research and models must be secured to levels far exceeding standard practices for other commercial technologies in order to protect them from theft or misuse.

In the near term, governments and frontier AI labs must be ready to protect advanced models and model weights, and the research that feeds into them. This should include measures such as the development of robust best practices widely diffused among industry, as well as treating the advanced AI sector as something akin to “critical infrastructure” in terms of the level of public-private partnership in securing these models and the companies developing them.

Many of these measures can begin as voluntary arrangements, but in time it may be appropriate to use government procurement or regulatory powers to mandate compliance.

Cybersecurity Best Practices

We believe “two-party control” is necessary to secure advanced AI systems. Two-party control is already used in a range of domains; for example, two people with two keys are needed to open the most secure vaults, and multi-party review patterns have been applied in manufacturing (GMP, ISO 9001), food (FSMA PCQI, ISO 22000), medical (ISO 13485) and finance tech (SOX).

This pattern should be applied to all systems involved in the development, training, hosting, and deployment of frontier AI models.
This pattern is already in widespread use within major tech companies to defend against the most advanced threat actors and mitigate insider risk.
It is manifested as a system design where no person has persistent access to production-critical environments, and they must ask a coworker for time-limited access with a business justification for that request.
Even emerging AI labs, without large enterprise resources, can implement these controls.

We call this multi-party authorization to AI-critical infrastructure design. This is a leading security requirement that depends on the gamut of cybersecurity best practices to implement correctly.

In addition, secure software development practices should pervade the frontier AI model environment. The gold-standard for these practices are the NIST Secure Software Development Framework (SSDF) and the Supply Chain Levels for Software Artifacts (SLSA). Executive Orders have been leveraged successfully to encourage major tech companies to adopt higher development standards: in 2021, EO 14028 directed OMB to set Federal Procurement guidelines. This motivated action: the software industry has invested heavily to meet the SSDF’s requirements to retain federal contracts.

While frontier AI research is already benefiting from the implementation of some of these standards by dint of cloud providers hosting their models, applying these existing standards can step-change the security of these AI systems:

SSDF and SLSA are largely translatable into the development of models and their coupled software; producing a model and then deploying it is almost identical to building software and then deploying it.

SSDF and SLSA coupled together mean that any AI system deployed has a chain of custody. By this, we mean that when applied correctly these practices make it so you are able to tie back a deployed model to the company that developed it, which helps provide provenance.

We call this a secure model development framework. We encourage extending SSDF to encompass model development inside of NIST’s standard-setting process.

In the near term, these two best practices could be established as procurement requirements applying to AI companies and cloud providers contracting with governments – alongside standard cybersecurity practices that also apply to these companies. As U.S. cloud providers provide the infrastructure that many current frontier model companies use, procurement requirements will have an effect similar to broad market regulation and can work in advance of regulatory requirements.

Anthropic is implementing two-party controls, SSDF, SLSA, and other cybersecurity best practices. As model capabilities scale, we will need to further enhance security protections, moving beyond the above recommendations. This will necessarily be an iterative process in consultation with government and industry.

Public-Private Cooperation

Frontier AI research labs should participate in public-private cooperation in the same way as companies in critical infrastructure sectors like financial services. This sector could be designated as a special sub-sector of the existing IT sector, for example. Such a designation would be a vehicle for enhanced cooperation and information sharing between and among industry labs and government agencies, helping all labs better guard against highly resourced malicious cyber actors.

Conclusion

It can be tempting to deprioritize security: when everything is going well, it can feel like it isn’t necessary or that it is in tension with other company goals. But this technology is becoming more powerful and will require enhanced precautions. We also believe that while security can sometimes interfere with productivity, that there are creative ways to ensure that its effects are limited and that research and other work can proceed effectively.

The development of artificial intelligence has incredible potential to benefit humanity, but it also comes with risks if not pursued thoughtfully. As an AI company working on the frontier of this technology, we take our responsibility seriously to build and deploy Claude in a way that is safe, secure and aligned with human values. We will continue sharing our perspectives on the responsible development of AI.

I'm happy to see Anthropic taking this seriously (and a little sad about the paucity of details in the public-private cooperation section, the part that I've thought about the most, but hope there will eventually be more). Suggestions for what else they should think about or do?

Nice. (I'm interested in figuring out what labs should do regarding security and would appreciate information/takes.)

Suggestions for what else they should think about or do?

I've collected quotes from some other sources; here are ~~two~~ three:

Towards best practices in AGI safety and governance (Schuett et al. 2023):

"Security incident response plan. AGI labs should have a plan for how they respond to security incidents (e.g. cyberattacks)."
"Industry sharing of security information. AGI labs should share threat intelligence and information about security incidents with each other." They should also share infosec/cyberdefense innovations and best practices.
"Security standards. AGI labs should comply with information security standards (e.g. ISO/IEC 27001 or NIST Cybersecurity Framework). These standards need to be tailored to an AGI context."
"Background checks. AGI labs should perform rigorous background checks before hiring/appointing members of the board of directors, senior executives, and key employees."
"Model containment. AGI labs should contain models with sufficiently dangerous capabilities (e.g. via boxing or air-gapping)."

"Appropriate security" in "Model evaluation for extreme risks" (DeepMind 2023):

Models at risk of exhibiting dangerous capabilities will require strong and novel security controls. Developers must consider multiple possible threat actors: insiders (e.g. internal staff, contractors), outsiders (e.g. users, nation-state threat actors), and the model itself as a vector of harm. We must develop new security best practices for high-risk AI development and deployment, which could include for example:
Red teaming: Intensive security red-teaming for the entire infrastructure on which the model is developed and deployed.
Monitoring: Intensive, AI-assisted monitoring of the model's behaviour, e.g. for whether the model is engaging in manipulative behaviour or making code recommendations that would lower the overall security of a system.
Isolation: Appropriate isolation techniques for preventing risky models from exploiting the underlying system (e.g. sole-tenant machines and clusters, and other software-based isolation). The model's network access should be tightly controlled and monitored, as well as its access to tools (e.g. code execution).
Rapid response: Processes and systems for rapid response to disable model actions and the model's integrations with hardware, software, and infrastructure in the event of unexpected unsafe behaviour.
System integrity: Formal verification that served models, memory, or infrastructure have not been tampered with. The development and serving infrastructure should require two-party authorization for any changes and auditability of all changes.

From a private doc (sharing with permission):

Many common best practices are probably wise, but could take years to implement and iterate and get practice with: use multi-factor authentication, follow the principle of least privilege, use work-only laptops and smartphones with software-enforced security policies, use a Crowdstrike/similar agent and host-based intrusion detection system, allow workplace access via biometrics or devices rather than easily scanned badges, deliver internal security orientations and training and reminders, implement a standard like ISO/IEC 27001, use penetration testers regularly, do web browsing and coding in separate virtual machines, use keys instead of passwords and rotate them regularly and store them in secure enclaves, use anomaly detection software, etc.
Get staff to not plug devices/cables into their work devices unless they're provided by the security team or ordered from an approved supplier.
For cloud security, use AWS Nitro enclaves or gVizor or Azure confidential, create flags or blocks for large data exfiltration attempts. Separate your compute across multiple accounts. Use Terraform so that infrastructure management is more auditable and traceable. Keep code and weights on trusted cloud compute only, not on endpoints (e.g. by using Github Codespaces).
Help make core packages, dependencies, and toolchains more secure, e.g. by funding the Python Software Foundation to do that. Create a mirror for the software you use and hire someone to review packages and updates (both manually and automatically) before they're added to your mirror.
What about defending against zero-days, i.e. vulnerabilities that haven't been broadly discovered yet and thus haven't been patched?
One idea is that every month or week, you could pick an employee at random to give a new preconfigured laptop and hand their previous laptop to a place like Citizen Lab or Mandiant that can tear the old one apart and find zero-days and notify the companies who can patch them. After several zero days are burned this way, maybe attackers with zero-days will decide you're not worth constantly burning their zero-days on.
At some point in the future it'll probably be wise to run the largest training runs on air gapped compute like what Microsoft and Amazon provide to the US intelligence community already, so that you can do lots of testing and so on behind the airgap before deployment.
What about insider threat? A lab could create their own version of a security clearance/personnel reliability process, and/or find a way to get most/all staff with the most privileged access to get a US security clearance (with all that entails). . . . Create a process for vetting cleaners/etc. . . .
Set up stronger physical security measures than a typical tech company, e.g. armed guards and strong physical security around critical devices.

More miscellanea:

Insider risk program
Source code should live in a hardened cloud environment; it should never touch company/developer machines
Security audits & pentests
Talk about security best practices (yay Anthropic for this post) and be transparent where appropriate (yay Anthropic and OpenAI for their security portals)
Incident tracking and sharing/reporting; maybe a breach disclosure policy

Also OWASP Top 10 for Large Language Model Applications has a section on Model theft. (In addition to standard hacking, they include generating model outputs to use as training data to replicate the model.) They recommend:

Implement strong access controls (E.G., RBAC and rule of least privilege) and strong authentication mechanisms to limit unauthorized access to LLM model repositories and training environments.
This is particularly true for the first three common examples, which could cause this vulnerability due to insider threats, misconfiguration, and/or weak security controls about the infrastructure that houses LLM models, weights and architecture in which a malicious actor could infiltrate from inside[] or outside the environment.
Supplier management tracking, verification and dependency vulnerabilities are important focus topics to prevent exploits of supply-chain attacks.
Restrict the LLM's access to network resources, internal services, and APIs.
This is particularly true for all common examples as it covers insider risk and threats, but also ultimately controls what the LLM application "has access to" and thus could be a mechanism or prevention step to prevent side-channel attacks.
Regularly monitor and audit access logs and activities related to LLM model repositories to detect and respond to any suspicious or unauthorized behavior promptly.
Automate MLOps deployment with governance and tracking and approval workflows to tighten access and deployment controls within the infrastructure.
Implement controls and mitigation strategies to mitigate and|or reduce risk of prompt injection techniques causing side-channel attacks.
Rate Limiting of API calls where applicable and|or filters to reduce risk of data exfiltration from the LLM applications, or implement techniques to detect (E.G., DLP) extraction activity from other monitoring systems.
Implement adversarial robustness training to help detect extraction queries and tighten physical security measures.
Implement a watermarking framework into the embedding and detection stages of an [LLM's lifecycle].

Also https://llmsecurity.net has collected lots of papers.

AI ALIGNMENT FORUM
AF

15

Frontier Model Security

15