People talk about model weight security being really hard and crucial around the advent of AGI. (E.g. the RAND report, Leopold; see here for some distinctions in these threat models that I think are important.) But I think that the thinking on this has not been sufficiently attentive to the fact that during that crucial time period, by assumption we’ll have access to powerful AIs. I think that such access might make security wildly easier, by a bunch of different mechanisms, some of which I’ll describe in this post.

The story I’m telling here is pretty different from the main story I've heard people tell in the past about AIs helping with computer security, which is that AIs can help with hardening software and other infrastructure. I agree that that seems really useful and important, but here I'm not talking about AIs automating the process of security research. I'm talking about AI making it affordable to adopt policies that improve security in the organization but that wouldn’t work with humans alone (because they’d be too labor-intensive or lead to who-guards-the-guards problems).

In this post I’m going to be ignoring problems related to AIs themselves trying to cause security failures. There are two ways that such AIs affect this situation:

  • The techniques I describe in this post could also be used, AI-control style, to use trusted models to reduce the risk of security failures caused by untrusted models. You can basically just replace “humans” by “untrusted models” everywhere in the text below.
  • When applying these techniques to reduce risk from humans, you have to worry that these techniques will be sabotaged by scheming models. I’ll ignore that possibility in this post, and talk about ways that you can improve security using AIs that are assumed not to be intentionally trying to subvert these security measures (which we call “trusted”). If you’re worried about your models scheming, you’ll have to do some other stuff too (mostly various sting operations), but I think the arguments in this post still mostly hold up. I’ll talk about this more in the future.

I’m going to focus on insider threats here; most threats from outside humans involve first gaining the ability to impersonate some insider.

Much of this post came out of thinking about the question “how could AI control techniques be applied to humans, instead of AIs?”

I don’t think these arguments are an airtight case for “SL5 security will be trivial”. But they do sketch a plausible scenario in which infosec ends up being quite simple to resolve, if labs decide it’s an important priority.

Four strategies for using powerful AI to improve security

Monitoring

As your humans take actions, you configure your system to have an AI that looks at these actions and then flags them for further review if they seem suspicious. If the actions seem potentially catastrophic, then the AI might even flag them before they can run.

Concretely, I'm imagining that instead of being allowed to directly run any command you write in the bash console of the computer you're working on, whenever you write a command, it and the surrounding context are shown to an AI, which is charged with deciding how suspicious the action is. If it is very suspicious, the AI blocks the command and sends an alert to someone else at the company who is supposed to look it over. If the action is just somewhat suspicious, the AI flags it for further review but does not block the action or alert the user.

This is somewhat similar to using multi-party authorization. But with abundant AI labor, we could be more comprehensive: fewer false positives, fewer false negatives, and a much greater proportion of actions evaluated (i.e. all of them).
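To make this more concrete, here's a minimal sketch of what the gatekeeper could look like. The `score_suspicion` stub, the thresholds, and the alerting behavior are all placeholders for illustration, not a real implementation:

```python
# Minimal sketch of a pre-execution gatekeeper for shell commands.
# score_suspicion is a placeholder; in practice it would query a trusted
# monitoring model with the command plus its surrounding context.
import subprocess

BLOCK_THRESHOLD = 0.9   # very suspicious: block and page a reviewer
REVIEW_THRESHOLD = 0.5  # somewhat suspicious: allow, but queue for review

def score_suspicion(command: str, context: str) -> float:
    """Placeholder for a call to the trusted monitoring model."""
    return 0.0

def guarded_run(command: str, context: str) -> None:
    score = score_suspicion(command, context)
    if score >= BLOCK_THRESHOLD:
        # Block the command and alert someone else at the company
        # who is supposed to look it over.
        print(f"BLOCKED (suspicion={score:.2f}); alerting a reviewer.")
        return
    if score >= REVIEW_THRESHOLD:
        # Flag for asynchronous review, but don't interrupt the user.
        print(f"flagged for later review (suspicion={score:.2f})")
    subprocess.run(command, shell=True, check=False)

guarded_run("tar czf /tmp/backup.tgz /srv/model-weights",
            context="on-call debugging session")
```

The important design property is that the model sees the command and its context before anything runs, so the blocking path doesn't depend on a human being awake to review it.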

Trust displacement

The most extreme version of trust displacement is: suppose the AI can do a task that humans could only do if they had affordances they could use to cause security failures. You can just have the AI do the task instead, remove those affordances from the humans, and have the humans ask the AI whenever they need the task done.

For example, you might make it so that your employees usually are not allowed to write bash commands in the course of their system administration. Instead, they must make requests to the system administrator AI.

The basic advantage of trust displacement over monitoring is that with monitoring alone, humans might be able to carefully craft commands which subtly cause problems in ways that the monitoring AI can't notice. Trust displacement defends against this by restricting humans' ability to write carefully crafted bash commands at all: they just produce English descriptions instead. The hope here is that it's hard for an attacker to persuade the model to write something which has nefarious, unanticipated consequences.
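As a rough sketch of the request flow (the `draft_command` stub stands in for the system administrator AI, and the audit line is illustrative), the key property is that the human only ever supplies an English description and never touches the shell directly:

```python
# Sketch of trust displacement: the human submits a natural-language request,
# the AI sysadmin drafts and runs the command under its own service account,
# and the requester never gets shell access themselves.
import subprocess

def draft_command(request: str) -> str:
    """Placeholder for the system administrator AI turning an English request
    into a concrete command (possibly after asking clarifying questions)."""
    return "systemctl status training-dashboard"

def handle_request(request: str, requester: str) -> None:
    command = draft_command(request)
    # Keep an audit trail of who asked for what and what the AI actually ran.
    print(f"[audit] {requester}: {request!r} -> {command!r}")
    subprocess.run(command, shell=True, check=False)

handle_request("check whether the dashboard service on the training cluster is up",
               requester="alice")
```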

Fine-grained permission management

A core constraint in computer security is that it's annoying and dampens productivity for your employees to have to work with extremely fine-grained security policies. Within this constraint, it's a best practice to give users as restricted a set of affordances as possible: they should be allowed to do everything required to do their job (well), but they shouldn't have the capability to do anything else. This principle of least privilege is crucially important in computer security.

One big bottleneck on this is just that it's slow for humans to set up. For example, SELinux is a set of security tools that allows much more fine-grained permission management on Linux. It's almost never used by people who aren't in the military, because it's extremely annoying to configure; people in industry who use something like this at all tend to use AppArmor instead, which is generally understood to have a weaker security model but is much more approachable.

But AI seems potentially strongly comparatively advantaged at this kind of fine-grained permissioning. Humans don't want to write 100 lines of fine-grained permissions required to run a script, then try running the script, and go through the debugging process of understanding what permissions it missed and why. But this seems like a task that AIs could be very fast at.
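Here's a sketch of that iterate-on-denials loop, the kind of thing an AI could run tirelessly. The audit-log path, the `propose_rules` stub, and the append-to-profile step are assumptions for illustration rather than a real SELinux or AppArmor workflow:

```python
# Sketch of an AI-driven permission-debugging loop: run the script under a
# restrictive profile, collect the denials it generates, ask a model for the
# minimal additional rules, and repeat. Paths and stubs are illustrative.
import subprocess

AUDIT_LOG = "/var/log/audit/audit.log"  # assumed location of denial records

def propose_rules(denials: list[str]) -> list[str]:
    """Placeholder for asking the trusted model to propose the minimal extra
    permissions implied by the observed denials."""
    return []

def tighten_profile(script: str, profile_path: str, max_rounds: int = 10) -> bool:
    for _ in range(max_rounds):
        result = subprocess.run([script], capture_output=True)
        with open(AUDIT_LOG) as log:
            # A real version would only consider denials emitted by this run.
            denials = [line for line in log if "denied" in line.lower()]
        if result.returncode == 0 and not denials:
            return True  # the script now runs cleanly under the profile
        with open(profile_path, "a") as profile:
            profile.writelines(rule + "\n" for rule in propose_rules(denials))
        # Reloading the updated profile is system-specific and omitted here.
    return False
```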

AI investigation of automatically detected suspicious activity

A classic strategy in computer security is to detect when certain classes of suspicious things happen, and then look into them. One big bottleneck is how much time you have to look into why something happened, and to understand the relevant code. Cheap access to powerful AI will drastically widen that bottleneck.
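For instance, something like the sketch below, where both the detection rule and the `investigate` stub are placeholders; the point is just that cheap AI labor lets every alert get an actual investigation instead of sitting in a queue waiting for scarce human attention:

```python
# Sketch of handing automatically detected anomalies to an AI investigator.
# The detector and the investigation step are illustrative placeholders.
def detect_anomalies(events: list[dict]) -> list[dict]:
    """Placeholder detector: flag network accesses to hosts never seen before."""
    known_hosts = {"ci.internal", "artifact-store.internal"}
    return [e for e in events if e.get("dest_host") not in known_hosts]

def investigate(event: dict) -> str:
    """Placeholder for an AI investigation that reads the relevant code and
    logs, explains why the event happened, or escalates to a human."""
    return f"escalate: unexplained access to {event['dest_host']}"

events = [
    {"user": "svc-train", "dest_host": "artifact-store.internal"},
    {"user": "bob", "dest_host": "203.0.113.7"},
]
for event in detect_anomalies(events):
    print(investigate(event))
```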

How vulnerable to jailbreaks or trickery is this?

Now I want to talk about how robust these techniques are to models being jailbroken or scheming. Jailbreaks are a crucial problem. 

Ideally, we will get good at making jailbreak-free models. This might require a more specific notion of what it means to be jailbreak-free than we've previously gotten (see here). I think this is an important topic for future research. 

Even if we are unable to make generically jailbreak-free models, there's plausibly a wide variety of strategies you can apply. It wouldn't surprise me if some classifier for whether the model is currently jailbroken would actually work fine here. There's a lot of stuff to try. I'm excited for further research on this.

Even if you are unable to produce a model that your red team cannot jailbreak, you might still be okay because your internal enemies might have a lot of trouble iterating against your security measures. The situation is high stakes for them. Even if they would have succeeded on their hundredth attempted attack, if you can catch (and fire) them during their first attempt, you'll be fine. The spies can go home after work (where you’re probably not constantly surveilling them) and look at publicly accessible models for inspiration when constructing attempted jailbreaks (or their state collaborators can do this research for them). The situation here is harder for schemers than spies, because the schemers are probably monitored at all times, so we might be able to prevent them from ever learning about the state of the art in model jailbreaking in general.

Another thing to note is that not all of these techniques rely on robustness to jailbreaks. For example, with monitoring, it's plausible that in order for a jailbroken model to be dangerous, it needs to have access to some carefully chosen text produced by the attacker. It's plausible that for some of these roles, we are able to redact text that could plausibly contain a jailbreak while still having the technique work. For instance, when investigating irregularities in network accesses (that is, using the “AI investigation of automatically detected suspicious activity” strategy listed above), you don't necessarily need to show the investigating model any text that was freely chosen by the suspicious process.
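A minimal sketch of that kind of redaction, with illustrative field names: structured fields (timestamps, hosts, ports, byte counts) pass through to the investigating model, and any free-form text the suspicious process could have chosen is stripped out first:

```python
# Sketch of redacting attacker-chosen text before an AI investigator sees it.
# Field names are illustrative; the allowlist would be tuned per log source.
SAFE_FIELDS = {"timestamp", "user", "src_host", "dest_host", "dest_port", "bytes_sent"}

def redact(record: dict) -> dict:
    """Keep structured fields; replace anything the process could have
    freely chosen (and thus could contain a jailbreak payload)."""
    return {k: (v if k in SAFE_FIELDS else "[REDACTED]") for k, v in record.items()}

record = {
    "timestamp": "2030-01-01T03:14:00Z",
    "user": "bob",
    "dest_host": "203.0.113.7",
    "dest_port": 443,
    "bytes_sent": 9_218_113_536,
    "request_body": "...text freely chosen by the suspicious process...",
}
print(redact(record))  # only this redacted view is shown to the investigator
```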

These techniques seem really powerful

Do I think that this set of techniques radically reduces computer security risk overall? I’m very unsure. My inside view is that if you have access to powerful AI with a very good understanding of computer networks and what's happening inside your AI lab, it should be possible to drastically reduce the risk of insiders exfiltrating substantial amounts of data. It's pretty plausible that the kind of interventions we're discussing here will be fairly easy to implement, and would drastically reduce the risk of rogue deployments.

Labs currently aren’t planning to implement these types of security measures, but they might be able to do so quickly once powerful AI is available, and these interventions seem like the kind of things targeted pressure might be able to achieve.

Another question is whether we actually have the software required to implement this set of interventions (particularly relevant for new companies). My understanding is that there's a reasonable amount of open-source software available for fine-grained permissioning, like SELinux and other tools. If you were at a company like Google or Amazon that had already invested a lot in this kind of thing, you'd probably be totally fine.

One obvious question (as someone who loves analyzing safety problems through near-term perspectives whenever possible): what if the models we currently have access to are the most trusted models we'll ever have? Would these kinds of security methods work, or are those models not powerful enough? I think this question would be a great topic for research (and might turn into a lucrative startup).

Comments

I think this is an application of a more general, very powerful principle of mechanism design: when cognitive labor is abundant, near-omnipresent surveillance becomes feasible.

For domestic life, this is terrifying. 

But for some high stakes, arms race-style scenarios, it might have applications. 

Beyond what you mentioned, I'm particularly interested in this being a game-changer for bilateral negotiation. Two parties make an agreement, consent to being monitored by an AI auditor, and verify that the auditor's design will communicate with the other party if and only if there has been a rule breach. (Beyond the rule breach, it won't be able to leak any other information. And, being an AI, it can be designed to have its memory erased, never be recruited as a spy, etc.) However, one big challenge of building this is how two adversarial parties could ever gain enough confidence to allow such a hardware/software package into a secure facility, especially if its whole point is to have a communication channel to their adversary.


Isn't it enough to constrain the output format so that no steganographic leaks would be possible? Won't the counterparty usually be satisfied just with an hourly signal saying either "Something is wrong" (encompassing "the auditor saw a violation" / "no signal: the host has censored the auditor's report" / "invalid signal: the host has tampered with the auditor system" / "the auditor has been blinded to the host's operations, or has ascertained that there are operations it cannot see") or "The auditor confirms that all systems are nominal and without violation"?

The host can remain in control of their facilities, as long as the auditor is running on tamperproof hardware. It's difficult to prove that a physical device can't be tampered with, but it may be possible to take some components of the auditor even further and run them in a zero-knowledge virtual machine, which provides a cryptographic guarantee that the program wasn't tampered with, so long as you can make it lithe enough to fit (zero-knowledge virtual machines currently run at a roughly 10,000x slowdown, though I don't think specialized hardware for them is available yet; crypto may drive that work). A ZKVM won't provide a guarantee that the inputs to the system aren't being controlled, but the auditor is monitoring inputs of such complexity (either footage of the real world or logs of a large training run) that it may be able to prove algorithmically to itself that the sensory inputs weren't tampered with either, and that the algorithm does have a view into the real world (I'm contending that even large state actors could not create Descartes's evil demon).

Good to see people thinking about how AI can make security better. It's usually the other way round, so I expect there is unexplored benefit there.