This is the sixth of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper.

6. Deployment strategy

Previously, we focused on how to make conditioning predictive models as safe and competitive as possible. Now, we want to take a step back and discuss considerations for using conditioning predictive models to address AI existential risk and what sorts of difficulties we might run into doing so in practice.

In particular, just as AI safety researchers naturally think of using predictive models to advance AI safety research, AI capabilities researchers might naturally jump to using predictive models to advance capabilities. It may not even be necessary to generate additional research to build AGI with a powerful predictive model: simply ignoring the previously-discussed ELK-related difficulties and training a model to take actions that lead to predicted futures that a predicted human approves of may be sufficient. Either way, the existence of powerful predictive models seems likely to substantially shorten AI timelines.

As a result, by the time predictive models can be used to predict a full solution to AI safety, the time available to do so will be minimal—and as such, it is important to have fleshed-out plans for how to use them safely well ahead of time.

Dealing with other, less careful actors

As we mentioned previously, using a predictive model to generate alignment research is only one possible use case—one that we restricted our attention to on the basis that we thought it contained the difficult aspects of using a predictive model safely. Restricting our attention to these sorts of particular conditionals—and figuring out how to do them safely—is fine if we have control over the ways in which our model will be used. If we don’t have that control, however—e.g. if we are in a world where people are using predictive models in all sorts of different ways—then we have to consider what might happen when our predictive model is used in a much less careful way than described here, and figure out how to either handle that or prevent it from happening.

We think that getting other actors to use predictive models at all should be quite doable, for standard homogeneity reasons: why would a non-leading actor want to invest a ton of resources training a model in a different way than the approach the leading actor has already demonstrated successfully produces transformative AI? The problem, however, is that this same argument does not apply to what particular conditionals the non-leading actors might try, since trying a particular conditional is likely to be substantially cheaper than training an entire predictive model.

In a multipolar world, one team using very careful conditioning to get a predictive model to generate good alignment research means that other teams will likely soon have equivalently good models and might use them less carefully—e.g. resulting in them accidentally predicting malign superintelligences. Even in a unipolar world, a member of the team that created the predictive model might try to predict their future great-grandchildren out of curiosity, or check the predicted stock prices when they plan to retire, and inadvertently become exposed to manipulative outputs.

Since powerful predictive models can easily be used in less careful ways, any deployment strategy built around them needs to ensure that leading actors, once they gain access to such models, are able to use them to quickly, proactively, and safely do something transformative to tackle AI existential risk—e.g. substantially advance AI safety, substantially improve coordination, etc.

Furthermore, leading actors themselves need to be capable of the internal coordination required to pull off such a transformative task without at any point using such powerful predictive models in less careful, potentially unsafe ways. Such coordination additionally needs to happen at many scales, from simple filters to prevent obviously unsafe inputs from being passed to models all the way to significant internal planning around testing and deployment.

This is a pretty tall list of asks, and in our opinion making this sort of deployment strategy work is one of the largest difficulties with an approach that focuses primarily on predictive models. Nevertheless, we’re optimistic that safe and effective strategies can be developed.

Possible pivotal acts with predictive models

The first, perhaps most obvious thing one could do with a predictive model to substantially decrease overall AI existential risk—i.e. perform a “pivotal act”—is to significantly advance AI safety, e.g. by directly producing AI safety research. Since this is the example we’ve primarily been using throughout this text, we won’t go into much more detail on what that might look like here.

Besides boosting AI safety research directly, however, predictive models could be used in many other ways to reduce AI existential risk. What follows is not meant to be an exhaustive list, just some other possibilities.

First, we could use a predictive model to get an alignment warning shot. While we absolutely cannot trust a prediction that the world will appear great after an AGI is deployed, we probably can trust a prediction that the world will be ruined. For weakly superintelligent AGIs, this may be sufficient to catch them and convince their creators not to deploy them.[1]

Second, we could try to use narrow, short-term predictions to form superintelligent plans. If strong predictive models exist and AGI is imminent, then the predictive models cannot be safely used to evaluate the long-term consequences of taking different actions. However, evaluating short-term consequences, on the scale of a few days to a week, could potentially still be safe—though doing so would still be quite risky. Notably, though, convincing other people/engineering human coordination falls into this category.[2] A predictive model could iterate through many possible arguments and actions to identify which would be most effective at getting other people to slow capability advancement, postpone deployment, change focus to safety, and/or join forces at a single organization to avoid a race.

Third, we could try to use a predictive model for STEM-AI-style tasks, perhaps achieving something like whole brain emulation or nano-scale manufacturing. What one would then do with such a technology to substantially decrease AI existential risk, however, is somewhat unclear—whole brain emulation could potentially speed up alignment research, and nano-scale manufacturing could potentially give a leading actor additional resources and compute to build even more powerful predictive models, but as we’ve discussed, we don’t actually know how to align arbitrarily intelligent predictive models; we only know how to align models up to the level of capability that some human or group of humans could ever achieve without AI assistance.

Fourth, predictive models also have many commercial applications, so using them to predict technological developments in fields unrelated to AI could be extremely lucrative. The resulting amassed resources could then be used to invest in AI safety work, buy off or redirect capabilities researchers, etc.

Continuous deployment

Though we just talked about a “pivotal act” as if it were a discrete action, we think that this is not a very accurate model of deployment. Rather, we think that more realistic scenarios will likely look far more continuous: at each stage, as we get better and better predictive models, we continuously ratchet up our ability to use them for all of the different sorts of goals listed above.[3]

For example, persuasion is still a possibility for weaker models (though less effective), and predicting consequences over longer time horizons also becomes safer because weaker models arise earlier, and so are further in time from potential malign superintelligences. This opens up the possibility of using predictive models to evaluate AI policy and determine how best to delay AGI, as well as pathways to get such policies implemented.[4]

Perhaps the primary continuous deployment target, however, would be using early predictive models to accelerate AI safety—e.g. as described by Jan Leike in “A minimal viable product for alignment.” Consider, for example, a predictive model that just mimics one researcher and can be conditioned on that researcher being in a good mood, having slept well the night before, etc. The difference between researchers on their best and worst days is large, so this model could be much more productive than the person it mimics—e.g. at tasks such as brainstorming directions, designing experiments, critiquing arguments, and coding. If all we can get is a long string of someone’s best research days in a row, that could still be incredibly helpful. Furthermore, the benefits of creating such models could go beyond just the alignment research produced: they could also yield empirical insights into the conditioning process that could help with aligning more advanced models.

Additionally, if this paradigm of conditioning predictive models proves competitive on capabilities, it may be possible to attract key actors into using models in this way. This could be beneficial if, as we suggest, inner alignment is an easier task with predictive models than with many other kinds of models. Moreover, there could be broader benefits: to the extent that safety work on predictive models can readily be applied by all actors with the resources to build them, the overall safety story improves.

We are excited about the strategy of continuously doing the best we can to use these techniques to help with alignment research as we go—that is, staying as close as we can to the capability elicitation frontier for eliciting AI safety research specifically, while using the safest careful conditioning approaches that we’re aware of at each point in time. This approach has the added benefit that we’ll learn how to use models to do research early, which puts us in a better position as things start to heat up during various takeoff scenarios.

Using a predictive model to execute a pivotal plan

In “Strategy For Conditioning Generative Models,” Lucassen et al. discuss how we might be able to use powerful predictive models to generate plans for reducing AI existential risk that we can then implement—and the various risks that arise when attempting to do so. In the Lucassen et al. model, there are three sources of risk arising from any predictive model deployment strategy:

  1. Timeout risk: the risk that, because we do too little, another actor comes along and causes AI existential risk.
  2. Simulation risk: the risk that, by trying to ask for too much, we end up predicting a manipulative malign superintelligence.
  3. Non-simulation downside risk: the risk that we directly cause AI existential risk, excluding simulation risk (e.g. our predictive model is actually a deceptive agent, the outputs of our predictive model are unsafe despite being outputs from predicted humans, etc.).

By being more aggressive with what we ask our model for—e.g. directly asking for a plan that solves AI existential risk—we can reduce timeout risk, but in exchange we substantially increase simulation risk, probably by enough in that example that it’s not worth the tradeoff. Most of the strategies discussed in Section 2 are ways of decreasing simulation risk without increasing other risks, but at some point we’ll also likely have to deal with strategies that trade some amount of one risk for some amount of another. Importantly, whenever we ask for more out of our predictive model, we must be doing so by conditioning on something that is less likely, which means we’re incurring additional simulation risk.

Given that our strategy for avoiding simulation risk is to try to predict humans solving the problem themselves without the assistance of AIs, it’s worth addressing why using a predictive model to predict humans is ever actually better than just trying, as humans, to solve the problem ourselves without the help of AIs.

First, even though every time we condition on things going better from an alignment perspective we incur some additional simulation risk, the additional simulation risk might be substantially less than the additional timeout or non-simulation downside risk we’d be taking on if we tried to just hope that whatever we’re conditioning on actually happens in practice. As Lucassen et al. point out, this is true primarily in situations where the predictive model’s baseline credence that a superintelligent malign AI already exists is low—when that probability is substantially under 50%, conditioning a predictive model on something good happening can substantially outperform just hoping that the good thing will happen in practice.[5] We give a toy illustration of this comparison below.

Second, as Lucassen et al. point out, the victory condition is different for predicted humans than real humans—when predicting humans, we only need the predicted humans to eventually figure out the thing we should do, not actually be able to implement it. For example, conditioning on something like an all-out nuclear war that leaves some alignment researchers alive to keep doing alignment research isn’t a thing we actually want to happen, not least because there’s not clearly anything that those alignment researchers can do with their research—but it’s potentially a great thing to condition on, since we get to extract their research and can potentially implement it ourselves.
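
To make the first of these points concrete, here is a minimal, purely illustrative Bayesian sketch of the “condition vs. just hope” comparison. The numbers and function names are our own and this is not the exact model from Lucassen et al.; we simply assume the worst case from footnote 5, in which a malign superintelligence that already exists can spoof our conditional perfectly, while in benign worlds the conditional holds with probability q, and the model’s prior credence that such a malign superintelligence already exists is p.

```python
# Toy Bayesian sketch of the "condition vs. just hope" comparison. Purely
# illustrative, with made-up numbers; not the exact model from Lucassen et al.
# Worst-case assumption (see footnote 5): a malign superintelligence, if it
# already exists, can spoof whatever we condition on, so
# P(conditional | malign) = 1. In benign worlds the conditional holds with
# probability q, and the model's prior credence that a malign
# superintelligence already exists is p.

def simulation_risk(p: float, q: float) -> float:
    """Posterior probability that the conditioned output comes from a
    (spoofing) malign superintelligence rather than a benign world."""
    return p / (p + (1 - p) * q)

def genuine_success(p: float, q: float) -> float:
    """Posterior probability that the conditioned output comes from a benign
    world in which the good thing genuinely happened."""
    return (1 - p) * q / (p + (1 - p) * q)

if __name__ == "__main__":
    p = 0.1  # illustrative prior on an already-existing malign superintelligence
    for q in (0.5, 0.2, 0.05, 0.01):
        hope = q  # chance the good thing just happens on its own in a benign world
        print(
            f"q={q:<4}  just hoping: {hope:.3f}  "
            f"conditioning (genuine): {genuine_success(p, q):.3f}  "
            f"simulation risk: {simulation_risk(p, q):.3f}"
        )
```

In this toy setup, the lower the model’s credence p that a malign superintelligence already exists, the more conditioning concentrates probability on genuinely benign worlds where the good thing happened, relative to just hoping it happens; at the same time, asking for a less likely conditional (smaller q) drives the simulation risk up, which is the tradeoff described above.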

Misuse risk

Even if predictive models are used safely and conditioned to avoid predicting worlds with malign superintelligences, they still pose a danger. Malicious actors could use predictive models for a discontinuous jump in technology, for example by getting access to weapon designs from worlds where the outbreak of global war prevented AGI development. This would be particularly appealing to nations that could then use the outputs for military purposes, possibly justified by the idea of preventing others from doing so. Terrorist groups could similarly use predictive models to simulate worlds where AGI was not developed due to e.g. a pandemic, and then build the virus behind that pandemic in reality.

The incentive for state actors and terrorists to access predictive models reinforces the necessity of information security, since powerful predictive models could be dangerous even if all the groups that develop them are responsible users. This includes security around the capabilities of these models: once it is known that a capability is achievable, the odds of another group producing models with that capability rise substantially, and so some level of security around the knowledge of what these models can do is also important.

Unknown unknowns

When thinking about strategies for building advanced AI systems, we think it’s important to consider not just how likely they are to fail, but also how well we understand the potential failure modes.

As an analogy, suppose we are building what will become the longest-ever bridge and are trying to decide between two building materials, steel and concrete. We have a computer simulation that tells us that steel is likely to make the bridge sturdier than concrete, but we know that the computer is worse at modeling steel than concrete. Which material do we pick in that situation? If we trust the computer’s estimate and its modeled uncertainties, we should pick steel—but if we think that the computer’s uncertainty in the steel case might be hiding unknown unknowns that could make steel substantially worse, we should pick concrete.

On this spectrum, we think predictive models are easier to reason about than many other sorts of AI systems, and pure prediction as a task is substantially easier to understand than, say, trying to predict exactly what sorts of individual values will be incentivized by a complex multi-agent RL system. Furthermore, we think that almost any approach, if investigated to a sufficient level of detail, will likely run into similar problems to those detailed here. Thus, overall, though we have extensively discussed many potential difficulties with conditioning predictive models, we think that the case for their use, at least compared to many other competing approaches, remains solid.


  1. Specifically, this approach catches ruinous AGIs that are not capable of manipulating the world to still look good to us (given we have access to arbitrary cameras) after they’ve taken over. While we’re pretty uncertain here, that could catch a lot of the malign AGI takeover probability mass. ↩︎

  2. This sort of persuasion could be dangerous and/or backfire if the model helps us make convincing arguments for false statements. We think it’s important to be sure that the statements we’re convincing people of are true before using a model to amplify persuasion efforts. ↩︎

  3. We could even get a warning shot from an early predictive model’s poor attempts to mimic a deceptive superintelligence. We could also try to induce something like that on purpose, though that is of course a very risky thing to try. In particular, for that to work, the model would need to become capable of imitating a deceptive AI before it becomes capable of imitating one at high enough fidelity to be actively dangerous—the approach fails if the predictive model imitates the deceptive AI well enough to effectively hide the deception until it becomes dangerous. ↩︎

  4. While such uses of predictive models to choose actions may have a positive impact, it is important to remember that they burn the weirdness commons. Each unusual action taken based on a predictive model is a signal that something bizarre is going on, which will marginally increase the model’s credence that a manipulative AI exists (even if the true explanation, a non-manipulative predictive model, remains more likely). This reduces how much new weirdness a later counterfactual can add before a predictive model concludes that the correct explanation is a manipulative AI. As such, actions should be partially chosen based on their reasonableness, and a few actions with broad effects would be better than many actions with narrow effects. ↩︎

  5. Additionally, the Lucassen et al. model assumes that all future malign AIs know exactly what we’re going to condition on and can always spoof it exactly—but that’s clearly not fully true, and to the extent that it isn’t, it should work in our favor. ↩︎
