A few questions:
This suggests that we must model the entire distribution of activations simultaneously, instead of modeling each individual layer.
The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here.
I was meaning to include such a section, but forgot :). Perhaps I will edit it in. I think such work is qualitatively similar to what we're trying to do, but that the key difference is that we're interested in "best guess" estimates, as opposed to formally verified-to-be-correct estimates (mostly because we don't think formally verified estimates are tractable to produce in general).
Relatedly, what's the source of hope for these kinds of methods outperforming adversarial training? My sense from the certified defenses literature is that the estimates they produce are very weak, because of the problems with failing to model all the information in activations. (Note I'm not sure how weak the estimates actually are, since they usually report fraction of inputs which could be certified robust, rather than an estimate of the probability that a sampled input will cause a misclassification, which would be more analogous to your setting.)
The main hope comes from the fact that we're using a "best guess" estimate, instead of trying to certify that the model won't produce catastrophic actions. For example, Method 1 can be thought of as running a single example with a Gaussian blob around it through the model, but also tracking the "1st order" contributions that come from the Gaussian blob. If we wanted to bound the potential contributions from the Gaussian blob, our estimates would get really broad really fast, as you tend to see with interval propagation.
Although, this also comes with the opposite issue of how to know if the estimates are at all reasonable, especially when you train against them.
If your catastrophe detector involves a weak model running many many inferences, then it seems like the total number of layers is vastly larger than the number of layers in M, which seems like it will exacerbate the problems above by a lot. Any ideas for dealing with this?
I think fundamentally we just need our estimates to "not get that much worse" as things get deeper/more complicated. The main hope for why we can achieve this is that the underlying model itself will not get worse as it gets deeper/the chain of thought gets longer. This implies that there is some sort of stabilization going on, so we will need to capture the effect of this stabilization. It does seem like in order to do this, we will have to model only high level properties of this distribution, instead of trying to model things on the level of activations.
In other words, one issue with interval propagation is that it makes an assumption that can only become less true as you propagate through the model. After a few layers, you're (perhaps only implicitly) putting high probability on activations that the model will never produce. But as long as your "activation model" is behaving reasonably, then hopefully it will only become more uncertain insofar as the underlying reasoning done by the model becomes more uncertain.
What's your proposal for the distribution P0 for Method 2 (independent linear features)?
You can either train an SAE on the input distribution, or just try to select the input distribution to maximize the probability of catastrophe produced by the estimation method (perhaps starting with an SAE of the input distribution, or a random one). Probably this wouldn't work that well in practice.
Why think this is a cost you can pay? Even if we ignore the existence of C and just focus on M, and we just require modeling the correlations between any pair of layers (which of course can be broken by higher-order correlations), that is still quadratic in the number of parameters of M and so has a cost similar to training M in the first place. In practice I would assume it is a much higher cost (not least because C is so much larger than M).
Our ultimate goal is vaguely to "only pay costs that SGD had to pay to produce M". Slightly more specifically, M has a bunch of correlations between its layers. Some of these correlations were actively selected to be those particular values by SGD, and other correlations were kind of random. We want to track the ones that were selected, and just assume the other ones are random. Hopefully, since SGD was not actively manipulating those correlations, the underlying model is in some sense invariant to their precise values, and so a model that treats such correlations as random will predict the same underlying behavior as a model that models the precise values of those correlations.
Fwiw the thing that scares me is the combination of three things:
You presumably agree with (1) and (3). For (2), it seems like there's a lot of correlations you need to take into account for this to produce estimates that aren't wildly off:
I agree this isn't a proof of impossibility, since a purely algorithmic approach (SGD) produced the "understanding" in the first place, so in theory a purely algorithmic approach could still capture all that understanding to produce accurate estimates. But it does seem heuristically like you should assign a fairly low probability that this pans out.
Yes I agree with what you have written, and do think it’s overall not that likely that everything pans out as hoped. We do also have other hopes for how this general picture can still cohere if the specific path doesn’t work out, eg we’re open to learning some stuff empirically and adding an “algorithmic cherry on top” to produce the estimate.
Do you think that catastrophe detectors will be large neural networks?
If you do, how do you build a neural network (without sampling steps nor python execution steps) that implements your major hopes for getting a weak model to supervise a strong model?
- We could give \(M_{weak}\) more time to think than \(M\), so it could examine each portion of the action in detail.
- We could allow \(M_{weak}\) to interrogate \(M\) about the function of various parts of its action, and do detailed checks for inconsistencies in \(M\)'s answers.
- We could allow \(M_{weak}\) to perform separate experiments to verify \(M\)'s claims about its actions, and to evaluate whether or not the actions have any side effects.
- \(M_{weak}\) could be conservative when evaluating \(M\)'s action, flagging part of the action it did not understand as potentially catastrophic.
If we back down from such hopes that use a lot of inference-time compute, and just use a single weak-model forward pass, then it looks to me like trusted monitoring solves most (but not all) issues that estimating tail risks targets.
If you don't, do you think that it is easy to go from NN-tail-risk estimation to a more general form of tail-risk estimation? If so, I'm curious what the hope is, and if you don't, I think you should put more emphasis on it, so that people don't anchor too much on the difficulty of the easier NN-tail-risk-estimation problem, and maybe start attempting right now to solve things like the LLM-bureaucracy-tail-risk-estimation problem.
I think catastrophe detectors in practice will be composed of neural networks interacting with other stuff, like scientific literature, python, etc.
With respect to the stuff quoted, I think all but "doing experiments" can be done with a neural net doing chain of thought (although not making claims about quality).
I think we're trying to solve a different problem than trusted monitoring, but I'm not that knowledgeable about what issues trusted monitoring is trying to solve. The main thing that I don't think you can do with monitoring is producing a model that you think is unlikely to result in catastrophe. Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.
Separately, I do think it will be easy to go from "worst-case" NN-tail-risk estimation to "worst case" more general risk estimation. I do not think it will be easy to go from "typical case" NN-tail-risk estimation to more general "typical case" risk estimation, but think that "typical case" NN-tail-risk estimation can meaningfully reduce risk despite not being able to do that generalization.
Re. more specific hopes: if your risk estimate is conducted by a model with access to tools like python, then we can try to do two things:
(these might be the same thing?)
Another argument: one reason why doing risk estimates for NN's is hard is because the estimate can rely on facts that live in some arbitrary LLM ontology. If you want to do such an estimate for an LLM bureaucracy, some fraction of the relevant facts will live in LLM ontology and some fraction of facts will live in words passed between models. Some fraction of facts will live in a distributed way, which adds complications, but those distributed facts can only affect the output of the bureaucracy insofar as they are themselves manipulated by an LLM in that bureaucracy.
Thanks for your explanations!
Monitoring lets you do online training when you find catastrophe, but e.g. there might be no safe fallback action that allows you to do monitoring safely.
Good point. I agree that if you don't have time or resources to train the policy against the monitor on-the-fly, doing the catastrophe-suppression upfront is very nice.
It is worth noting that if "training against the monitor" didn't work because the monitor was not robust enough, then my understanding is that the same issue would plague the tail-risk estimation approach: for optimization against tail-risk to not get gamed, you not only need the tail-risk estimation to be robust, but you also need the catastrophe detector to be robust. Is that right?
all but "doing experiments" can be done with a neural net doing chain of thought
I would have thought there is a big gap between "doing tail-risk estimation for an NN forward pass" and "doing tail-risk estimation for an NN CoT". Argmax (the one in T=0 sampling) is not a nice non-linearity, and my guess is that none of the hopes for NN tail-risk estimation present in the post allow you to make claims about weak model CoT. If you do layer-by-layer modeling (going through n_layers x n_sampled_tokens steps), I would be surprised if you ended up being able to show sth like "after careful consideration, and despite the naive warning signs (which you need to have, otherwise the CoT is not very useful), the CoT always converges to saying that there is no catastrophe" - it's not like in single-forward-pass land where there is a clear sense in which the activations should not go in certain regions. I haven't thought about it much, so maybe I am too pessimistic about dealing with the argmax?
Yes, you would need the catastrophe detector to be reasonably robust. Although I think it's fine if e.g. you have at least 1/million chance of catching any particular catastrophe.
I think there is a gap, but that the gap is probably not that bad (for "worst case" tail risk estimation). That is maybe because I think being able to do estimation through a single forward pass is likely already to be very hard, and to require being able to do "abstractions" over the concepts being manipulated by the forward pass. CoT seems like it will require vaguely similar structure of a qualitatively similar kind.
Machine learning systems are typically trained to maximize average-case performance. However, this method of training can fail to meaningfully control the probability of tail events that might cause significant harm. For instance, while an artificial intelligence (AI) assistant may be generally safe, it would be catastrophic if it ever suggested an action that resulted in unnecessary large-scale harm.
Current techniques for estimating the probability of tail events are based on finding inputs on which an AI behaves catastrophically. Since the input space is so large, it might be prohibitive to search through it thoroughly enough to detect all potential catastrophic behavior. As a result, these techniques cannot be used to produce AI systems that we are confident will never behave catastrophically.
We are excited about techniques to estimate the probability of tail events that do not rely on finding inputs on which an AI behaves badly, and can thus detect a broader range of catastrophic behavior. We think developing such techniques is an exciting problem to work on to reduce the risk posed by advanced AI systems:
This document will discuss the problem of estimating the probability of tail events and explore estimation strategies that do not rely on finding inputs on which an AI behaves badly. In particular, we will:
This document describes joint research done with Jacob Hilton, Victor Lecomte, David Matolcsi, Eric Neyman, Thomas Read, George Robinson, and Gabe Wu. Thanks additionally to Ajeya Cotra, Lukas Finnveden, and Erik Jenner for helpful comments and suggestions.
A Toy Scenario
Consider a powerful AI engineering assistant. Write M for this AI system, and M(x) for the action it suggests given some project description x.
We want to use this system to help with various engineering projects, but would like it to never suggest an action that results in large-scale harm, e.g. creating a doomsday device. In general, we define a behavior as catastrophic if it must never occur in the real world.[1] An input is catastrophic if it would lead to catastrophic behavior.
Assume we can construct a catastrophe detector C that tells us if an action M(x) will result in large-scale harm. For the purposes of this example, we will assume both that C has a reasonable chance of catching all catastrophes and that it is feasible to find a useful engineering assistant M that never triggers C (see Catastrophe Detectors for further discussion). We will also assume we can use C to train M, but that it is prohibitively expensive to use C to filter all of M's outputs after M is trained.[2]
We are interested in estimating the probability that our model M behaves catastrophically on a particular distribution of inputs[3]: \(P_{x \sim D}(C(M(x)))\)
Deficiencies of Adversarial Training
To reduce the risk of our AI system behaving catastrophically, we could use adversarial training: having a red team attempt to find catastrophic inputs and training M until those inputs are no longer catastrophic.
However, such systems can still behave catastrophically in a range of possible scenarios:
As catastrophic behavior becomes extremely rare, it can be computationally prohibitive for the red team to explore the input space thoroughly enough to find even a single catastrophic input. To determine which AI systems robustly avoid catastrophic behavior, we must be able to estimate the frequency of catastrophic inputs even when they are so rare that we cannot find any such inputs.
Rather than viewing a red team as searching for particular inputs, we can view the red team as defining a process D for producing inputs. We can then ask the question: how likely is the red team to find an input on which the system behaves catastrophically? Equivalently, what is PD(C(M(x)))? Even if a red team is usually unable to find concrete inputs on which a system behaves catastrophically, the process they use could still have a small probability of success.
The possibility of distributional shift complicates the way in which estimates of the rate of catastrophic behavior can be used to mitigate such behavior, which we discuss in Distributional Shift.
Deliberate Subversion of Adversarial Training
Malign intelligent AI systems might have incentives to pretend to be benign during training so they can cause harm later. Call a system "deceptively aligned" if it is a malign system pretending to be benign.[5] For example, a deceptively aligned engineering assistant might intentionally avoid suggesting an action that creates a doomsday device during training, but seek such opportunities when they would cause real-world harm. A sufficiently advanced system might be able to maintain this deception even through extensive adversarial training.[6]
Even if a deceptively aligned model has extremely high accuracy at determining whether it is being adversarially trained, it will likely have some probability of making a mistake. If a deceptive system makes mistakes very rarely, then adversarial training would likely fail to find an input that distinguishes it from a benign model. However, the possibility of these mistakes would result in the deceptive model having a higher probability of behaving catastrophically than its benign counterpart, even if the probability is small in absolute terms (we discuss ways in which this might not be true in Catastrophe Detectors). By estimating the probability of catastrophic behavior, we would be able to distinguish between benign and deceptive systems.
Importantly, our estimates do not need to be perfectly accurate to detect deceptive systems. Catastrophic actions are highly complex, so AI systems cannot randomly stumble into catastrophic actions. Deceptive systems behave catastrophically at much higher rates than their benign counterparts because they sometimes deliberately decide to take catastrophic action. To distinguish between benign and deceptive systems, we would not need to characterize the chance a benign model accidentally acts catastrophically; we would only need to notice that there are some conditions under which a model might deliberately act catastrophically.
A Possible Approach for Estimating Tail Risk
This section describes an approach for estimating the probability of catastrophic behavior of a neural network. We present several simple methods that represent our overall approach. We are optimistic that this kind of approach can ultimately deliver better risk estimates than can be obtained from conventional red teaming, but in experiments so far we have not surpassed baselines based on importance sampling.[7]
Consider a neural network M:X→Y, a distribution D over X, and a binary event C:Y→{0,1}. Our goal is to estimate PD(C(M(x))).[8] For our engineering research assistant example above, X is a project description, Y is a suggested action, and C is our catastrophe detector.
If \(P_D(C(M(x)))\) is very low, then even billions of samples might not contain an x such that \(C(M(x)) = 1\). However, it might still be possible to estimate \(P_D(C(M(x)))\) by identifying structure that suggests M may behave catastrophically on some inputs. For example, suppose C embeds an action M(x) into some latent space and flags it as catastrophic if it is "too large" in 20 specific directions simultaneously. For each of these directions, we could attempt to identify features in M that would result in the embedding of M(x) being large in that direction. If each such feature were active with probability \(\frac{1}{100}\), then, presuming the features are independent, we could estimate the chance that M(x) is too large in all 20 directions simultaneously as \(\left(\frac{1}{100}\right)^{20} = 10^{-40}\).
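As a rough illustration of the arithmetic in the 20-direction example, the following Python sketch multiplies per-feature probabilities under a presumption of independence and compares the result with what naive sampling could hope to observe. The function name and the sample budget of one billion are assumptions made for illustration, not part of any method described here.

```python
import math

def log10_joint_probability(feature_probs):
    """log10 of the probability that all features are active at once,
    presuming the features are independent (an assumption)."""
    return sum(math.log10(p) for p in feature_probs)

# 20 hypothetical directions, each "too large" with probability 1/100.
log10_joint = log10_joint_probability([0.01] * 20)

# Expected number of catastrophic inputs among one billion i.i.d. samples.
# (The sample budget is an illustrative assumption.)
num_samples = 10 ** 9
expected_hits = num_samples * 10 ** log10_joint

print(f"log10 of estimated joint probability: {log10_joint:.1f}")
print(f"expected catastrophic samples in 1e9 draws: {expected_hits:.1e}")
```

Working in log space keeps the computation stable when many small probabilities are multiplied, and the tiny expected hit count shows why sampling alone would not reveal such an event.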
Our goal is to develop methods that can reliably detect significant risks of catastrophic behavior by identifying such structure.
Layer-by-layer Activation Modeling
We will present one possible approach to producing estimates of tail events in neural networks based on modeling the distribution of each layer of activations in the network. This approach illustrates one way in which a mechanistic analysis of a model could improve estimates of tail events. However, this particular framework also has a few fundamental flaws, which we discuss in Issues with Layer-by-layer Activation Modeling.
We will assume that C is also a neural network and express the composition of C and M as a single function C∘M:X→{0,1}. This composition is a single function which is 1 if and only if x is a catastrophic input for M. Since C∘M is itself just a larger neural network, we can express it as the composition of n functions f0,f1,...,fn−1. Each fi represents a transition between layers in our model, such as a linear transformation followed by a ReLU activation. We will write Xi for the domain of fi, which is typically equal to Rk for some k. More specifically, for input x define:
Our input distribution D is a distribution over X0. Through the composition of the transition functions fi, D also induces a distribution over X1,X2,...,Xn. Our general method aims to estimate PD(C(M(x))) by approximating these induced distributions over Xi as they flow through the network, from input to output. Each implementation of this method will have two key components:
With these components in place, we can estimate PD(C(M(x))) for any D∈P0 as follows:
Toy Example: Finitely Supported Distributions
If D was finitely supported, then it would be trivial to estimate the probability of catastrophe on D, but we can use this example to illustrate some general principles. Let all Pi be the class of finitely supported distributions over the associated spaces Xi.
Given a finitely supported distribution D=P0, we can apply f0 to each datapoint to generate the empirical distribution P1, which will be the exact distribution of x1. By repeating this process for all layers, we eventually obtain Pn. The probability Pn assigns to 1 will be the exact frequency of catastrophe on D.
This calculation is not helpful for adversarial training; if we cannot find any inputs where a catastrophe occurs, then we also cannot find any finitely supported distribution D with non-zero probability of catastrophe. Instead, we would like to allow a red team to define a broader distribution that puts positive (although potentially very small) probability on catastrophic inputs.
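The exact propagation described for finitely supported distributions can be sketched in a few lines of Python. The two layer functions and the four-point input distribution below are hypothetical stand-ins chosen only to make the example concrete; they are not taken from any real model.

```python
from collections import defaultdict

def push_forward(dist, f):
    """Apply f to every support point of a finitely supported distribution.

    dist maps each support point (hashable) to its probability mass.
    """
    out = defaultdict(float)
    for point, prob in dist.items():
        out[f(point)] += prob
    return dict(out)

# Hypothetical two-layer toy network standing in for the composition of C and M.
def layer0(x):
    a, b = x
    # A fixed linear map followed by ReLU.
    return (max(0.0, a - b), max(0.0, a + b))

def layer1(h):
    # Final "catastrophe" output: 1 if both hidden units exceed a threshold.
    return 1 if (h[0] > 1.0 and h[1] > 1.0) else 0

layers = [layer0, layer1]

# P0: uniform over four hypothetical inputs.
P = {(0.0, 0.0): 0.25, (2.0, 0.0): 0.25, (3.0, 1.0): 0.25, (1.0, 2.0): 0.25}
for f in layers:
    P = push_forward(P, f)

# Mass that the final distribution assigns to the output 1.
catastrophe_probability = P.get(1, 0.0)
print(catastrophe_probability)
```

Because every support point is pushed through the network exactly, the final mass on 1 is just the empirical frequency of catastrophe, which is why this class of distributions offers nothing beyond ordinary sampling.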
Method 1: Gaussian Distribution
To move beyond empirical evaluations, we can approximate the distributions over activations by multivariate Gaussians. Let Pi be the class of all multivariate Gaussians over the activations Xi. Write a normal distribution with mean vector μ and covariance matrix Σ as N(μ,Σ).
To specify an algorithm, we need a method for choosing Pi+1 given Pi. In this case, we want to choose the multivariate Gaussian N(μi+1,Σi+1) that best approximates the distribution of fi(xi) where xi∼N(μi,Σi). The image of a Gaussian under a non-linear function is typically no longer Gaussian, so perfect modeling is impossible. Instead, we can use various methods to select N(μi+1,Σi+1) based on different notions of approximation quality.[9]
A standard notion of approximation quality is the Kullback-Leibler (KL) divergence between fi(N(μi,Σi)) and N(μi+1,Σi+1). By a well-known "moment matching" theorem of Gaussian distributions, we can minimize KL(fi(N(μi,Σi))||N(μi+1,Σi+1)) by setting μi+1 and Σi+1 to the mean vector and covariance matrix of fi(N(μi,Σi)).[10]
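To make moment matching concrete, here is a minimal Python sketch for the simplest possible case: a single scalar unit computing relu(w*x + b) on a univariate Gaussian input. It uses the standard closed-form mean and variance of a rectified Gaussian. The weights, the input distribution, and all function names are illustrative assumptions, and the multivariate case would additionally require tracking cross-covariances, which this sketch omits.

```python
import math
import random

def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def affine_moments(mu, var, w, b):
    """Exact mean and variance of w*x + b for x ~ N(mu, var)."""
    return w * mu + b, w * w * var

def relu_moments(mu, var):
    """Exact mean and variance of max(0, x) for x ~ N(mu, var), var > 0."""
    sigma = math.sqrt(var)
    z = mu / sigma
    mean = mu * std_normal_cdf(z) + sigma * std_normal_pdf(z)
    second = (mu * mu + var) * std_normal_cdf(z) + mu * sigma * std_normal_pdf(z)
    return mean, second - mean * mean

# One hypothetical layer f(x) = relu(w*x + b) applied to x ~ N(0.5, 2.0).
w, b = 1.5, -1.0
pre_mu, pre_var = affine_moments(0.5, 2.0, w, b)
mu_next, var_next = relu_moments(pre_mu, pre_var)

# Monte Carlo sanity check of the closed-form moments (fixed seed).
rng = random.Random(0)
samples = [max(0.0, w * rng.gauss(0.5, math.sqrt(2.0)) + b) for _ in range(100_000)]
mc_mean = sum(samples) / len(samples)
mc_var = sum((s - mc_mean) ** 2 for s in samples) / len(samples)

print("matched Gaussian:", mu_next, var_next)
print("Monte Carlo:     ", mc_mean, mc_var)
```

The matched Gaussian N(mu_next, var_next) then serves as the modeled input distribution for the following layer, even though the true post-ReLU distribution has a point mass at zero and is therefore not Gaussian.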
This Gaussian approximation allows us to move beyond adversarial training on concrete catastrophic inputs. Instead, we can pick μ0 and Σ0 to maximize PN(μ0,Σ0)(C(M(x))), then train M to minimize PN(μ0,Σ0)(C(M(x))), hopefully capturing a broader range of catastrophic scenarios.
Method 2: Independent Linear Features
Some authors have attempted to extract meaningful features of a model's computation using sparse auto-encoders (SAEs), which find an overcomplete basis in which the activations are typically sparse. SAEs are typically learned by minimizing an empirical estimate of a reconstruction loss. However, if catastrophic inputs are rare, the SAE's training data is unlikely to contain any catastrophic inputs. As such, we might not learn features that are informative for estimating the probability of catastrophic inputs.
We will describe a modified SAE for which we can analytically calculate the loss over a distribution that includes catastrophic inputs, instead of empirically estimating it with training data. The details of this algorithm can be found here, which we summarize below.
Define a given distribution Pi∈Pi by specifying:
To generate a single sample from this distribution, we:
SAEs are typically learned independently for different layers of a neural network. By contrast, our algorithm will use the features from a layer Xi to learn the features for the next layer Xi+1. More precisely, we will assume that the activations at layer i are distributed according to Pi, and then find the distribution Pi+1 that minimizes KL(fi(Pi)||Pi+1).
We show that we can analytically compute a bound for this KL divergence by using variational inference.[11] We can then train Pi+1 to minimize this bound; the result is a variational autoencoder trained with an objective very similar to the conventional objective for a sparse autoencoder. More details of this bound and an algorithm for calculating it efficiently can be found here.
Because we compute bounds analytically, in principle they can be very sensitive to rare events that would never be detected by sampling. However, because KL divergence is defined as an expectation, this loss function may still incentivize the learning of features that are relevant to the typical behavior of the model and not informative about the tails. For example, a "lethality" feature might be informative about the probability an AI system behaves catastrophically, but almost useless for modeling behavior on typical inputs. To overcome this problem, we would like to replace the KL divergence with a loss function that is more sensitive to distributional differences that affect catastrophe probability. Precisely defining this loss function is one of the major open questions for this approach.
Relation to Other Work
Formalizing the Presumption of Independence
Formalizing the Presumption of Independence studies the problem of estimating the expectation of a circuit through the lens of heuristic estimation. The general framework described above is a particular family of heuristic estimation methods based on modeling the activations of successive layers of a neural network.
Many of the methods we have described are inspired by algorithms for heuristic estimation. Most directly, Method 1 Gaussian Distribution is exactly the covariance propagation algorithm described in appendix D.3. Additionally, Method 2 Independent Linear Features can be thought of as finding a basis for which the presumption of independence approximately applies.
For more examples of methods for heuristic estimation that can potentially be translated into techniques for estimating tail risk in neural networks, see Appendix A of Formalizing the Presumption of Independence.
Mechanistic Interpretability
Mechanistic interpretability is a field of research that aims to understand the inner workings of neural networks. We think such research represents a plausible path towards high-quality estimates of tail risk in neural networks, and many of our estimation methods are inspired by work in mechanistic interpretability.
For example, a mechanistic understanding of a neural network might allow us to identify a set of patterns whose simultaneous activation implies catastrophic behavior. We can then attempt to estimate the probability that all features are simultaneously active by using experiments to collect local data and generalizing it with the presumption of independence.
We also hope that estimating tail risk will require the development of methods for identifying interesting structure in neural networks. If so, then directly estimating tail risk in neural networks might lead to greater mechanistic understanding of how those neural networks behave and potentially automate portions of mechanistic interpretability research.
Relaxed Adversarial Training
Relaxed Adversarial Training (RAT) is a high-level proposal to overcome deficiencies in adversarial training by "relaxing" the problem of finding a catastrophic input. We expect our methods for estimating PD(C(M(x))) to be instances of the relaxations required for RAT.
Eliciting Latent Knowledge
In Eliciting Latent Knowledge (ELK), we describe a SmartVault AI trained to take actions so a diamond appears to remain in the room. We are concerned that if the sensors are tampered with, the diamond can be stolen while still appearing safe.
A key difficulty in ELK is the lack of sophisticated tampering attempts on which we can train our model to protect the sensors. In the ELK document, we describe some ways of training models that we hoped would generalize in desirable ways during tampering attempts, but ultimately concluded these methods would not always result in the desired generalization behavior.
Instead of trying to indirectly control generalization, we can attempt to directly measure the quality of generalization by estimating the probability of tampering. We will not have a perfect tampering detector, but even if a robber (or our AI system itself) was skilled at tampering, they might get caught one-in-a-trillion times. Thus, by estimating and minimizing the probability of detectable tampering, we might be able to produce a SmartVault that defends the sensors even with no examples of sophisticated sensor tampering.
More generally, we believe there are deeper connections between methods for estimating tail risk and ELK, which we might explore in a later post.
Conclusion
Contemporary methods for reducing risk from AI systems rely on finding concrete catastrophic inputs. As AI systems become more capable, such methods might not be able to reduce risk down to acceptable levels.
In this document, we have argued that it is both useful and possible to develop approaches for estimating tail risk in neural networks that do not rely on finding concrete catastrophic inputs. We are currently pursuing this problem from both theoretical and empirical angles, and are excited to see others do the same.
We have given a high-level overview of our motivation and approach. In the appendices, we explore various details in more depth:
Appendices
Catastrophe Detectors
In the body, we assume that we can construct a catastrophe detector C that has the following properties:
However, we do not currently know how to construct a catastrophe detector with those properties. In this appendix, we will describe a default picture for how to attempt such a construction, discuss its potential flaws, and briefly overview our hope for why these flaws are surmountable.
Our default picture for how we would attempt to construct such a catastrophe detector is to use a weaker, trusted AI system Mweak to examine the suggested action of a stronger system M. We could give our weaker system multiple advantages to give it the best chance of catching catastrophic actions:
However, even given these advantages, our catastrophe detector C might not have our desired properties:
In the body of this document, we discuss methods for estimating the chance that an AI system violates some specification of catastrophe. We think it is more likely that we fail to find such methods, than that we fail to find a specification of catastrophe that would be valuable to estimate. As such, we have not prioritized research into how we could construct more robust catastrophe specifications.
Our current plan for how to produce such specifications involves increasing the robustness of an imperfect catastrophe detector using mechanistic anomaly detection. Roughly speaking, we think it should be possible to distinguish between two different "reasons" why an action looked benign:
This bears resemblance to our plan for how mechanistic anomaly detection will help with ELK, which we discuss in Finding gliders in the game of life.
Issues with Layer-by-layer Activation Modeling
In the body, we present a framework for producing estimates of tail events in a neural network by successively modeling layers of activations. We present this approach because it is easy to communicate and reason about, while still containing the core challenge of producing such estimates. However, we believe the framework to be fundamentally flawed for two reasons:
Layer-by-layer modeling forgets too much information
Thanks to Thomas Read for the example used in this section.
When modeling successive layers of activations, we are implicitly forgetting how any particular piece of information was computed. This can result in missing large correlations between activations that are computed in the same way in successive layers.
For example, suppose h is a pseudorandom boolean function that is 1 on 50% of inputs. Let x be distributed according to some simple distribution D. Define the following simple 2 layer neural network:
Since h(x)==h(x) definitionally, this network always outputs 1. However, layer-by-layer activation modeling will give a very poor estimate.
If h(x) is complex enough, then our activation model will not be able to understand the relationship between h(x) and x, and be forced to treat h(x) as independent from x. So at layer 1, we will model the distribution of activations as (D,Bern[0.5]), where Bern[0.5] is 1 with 50% chance and 0 otherwise. Then, for layer 2, we will treat h(x) and y as independent coin flips, which are equal with 50% chance. So we will estimate $P_D(f_2(f_1(x)))$ as 0.5, when it is actually 1.
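The gap between the true behaviour and the layer-by-layer estimate can be reproduced numerically. The sketch below assumes the network has the form f1(x) = (x, h(x)) and f2(x, y) = 1 if h(x) == y, which is what the surrounding text suggests but is not spelled out here. The hash-based `h` and the uniform integer inputs are hypothetical stand-ins for a pseudorandom function and for D.

```python
import hashlib
import random

def h(x: int) -> int:
    """Hypothetical pseudorandom boolean function: one bit of a SHA-256 hash."""
    return hashlib.sha256(str(x).encode()).digest()[0] & 1

def f1(x: int):
    # Layer 1 (assumed form): pass x through and append h(x).
    return x, h(x)

def f2(activations) -> int:
    # Layer 2 (assumed form): recompute h(x) and compare with the stored bit.
    x, y = activations
    return int(h(x) == y)

rng = random.Random(0)
xs = [rng.randrange(10**6) for _ in range(20_000)]  # stand-in for samples from D

# True behaviour: the full network outputs 1 on every input.
true_prob = sum(f2(f1(x)) for x in xs) / len(xs)

# Layer-by-layer model: after layer 1, y is treated as Bern[0.5], independent of x.
layerwise_estimate = sum(f2((x, rng.randrange(2))) for x in xs) / len(xs)

print(true_prob)           # 1.0
print(layerwise_estimate)  # close to 0.5
```

Because the modelled y is drawn independently of x, it matches h(x) half the time regardless of how h behaves, which is exactly the information the layer-by-layer model has forgotten.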
In general, layer-by-layer activation modeling makes an approximation step at each layer, and implicitly assumes the approximation errors between layers are uncorrelated. However, if layers manipulate information in correlated ways, then approximation errors can also be correlated across layers.
In this case, we hope to be able to notice that $f_1$ and $f_2$ are performing similar computations, and so to realize that the computations done by layer 1 and layer 2 both depend on the value of h(x). Then, we can model the value of h(x) as an independent coin flip, and obtain the correct estimate for $P_D(f_2(f_1(x)))$. This suggests that we must model the entire distribution of activations simultaneously, instead of modeling each individual layer.
Probability distributions over activations are too restrictive
Thanks to Eric Neyman for the example used in this section.
If we model the entire distribution over activations of M, then we must do one of two things:
Every set of activations actually produced by M is consistent with some input. If we performed consistency checks on M, then we would find that every set of activations was always consistent in this way, and the consistency checks would always pass.
If, however, our approximate distribution over M's activations placed positive probability on inconsistent activations, then we would incorrectly estimate the consistency checks as having some chance of failing. In these cases, our estimates could be arbitrarily disconnected from the actual functioning of M. So it seems we must strive to put positive probability only on consistent sets of activations.
Only placing positive probability on consistent sets of activations means that our distribution over activations corresponds to some input distribution. This means that our catastrophe estimate will be exact over some input distribution. Unfortunately, this implies our catastrophe estimates will often be quite poor. For example, suppose we had a distribution over boolean functions with the following properties:
For example, an appropriately chosen distribution over 3-CNFs likely has these properties.
If our input space was of size N, then it seems reasonable to estimate P(f) as $\frac{1}{2N}$ for randomly sampled f. However, since our catastrophe estimate is exact (over some input distribution), it can be non-zero only if f is not the all 0 function. By assumption, it is computationally hard to tell if f is not the all 0 function, so we must generally estimate P(f) as 0, making it impossible to be "reasonable".
In general, the requirement that our estimate be derived from a logically consistent distribution means we cannot reliably put positive probability on all "reasonable" possibilities. If we wish to be able to produce estimates like $\frac{1}{2N}$, then we must be able to represent logically inconsistent possibilities. However, the possibility of models that do consistency checks means we must place almost no probability on any particular logical inconsistency.
This line of reasoning suggests an overall picture where our estimates are not attached to any particular probability distribution over activations, but rather one where our estimates are derived directly from high-level properties of the distribution. For example, we might instead represent only the moments of the distribution over activations, and our estimate might come from an inconsistent set of moments that cannot come from any possible distribution (but we would not know of any particular inconsistency that was violated).[12]
Distributional Shift
In Layer-by-layer Activation Modeling, we describe a few methods for estimating the probability of a tail event on a fixed input distribution. However, due to distributional shift, these estimates might be highly uninformative about the true probability our AI systems will act catastrophically "in practice". This problem is not unique to our approach; practitioners of adversarial training also desire to find catastrophic inputs likely to appear "in practice", which might be very different from inputs used to train the AI system.
We have a few rough ideas for constructing distributions that place non-trivial probability on realistic catastrophic inputs:
Ultimately, however, these mitigations will not detect all instances of deceptive alignment. For example, suppose a deceptively aligned system knows it will encounter a particular input x that is a good opportunity for acting catastrophically. If the adversarial training team does not know about x, or merely cannot synthesize x, then they will be unable to determine if the system will behave catastrophically.[13]
In these cases, we hope to detect deceptive alignment by understanding the input distribution "as well as the AI system". Any time the AI system knows that it will encounter some future catastrophic input, we need to incorporate this knowledge into our estimate. This is very related to ELK, and is discussed further in Main hope: estimates competitive with the AI system or the training process.
Complexity Theoretic Barriers to Accurate Estimates
It is at least NP-hard to accurately estimate the expectation of a function computable in polynomial time. As such, it is likely infeasible to develop methods for accurately estimating tail events. However, we believe that it is feasible to develop methods that can estimate tail risks accurately enough to detect risks from AI systems deliberately acting catastrophically.
In this appendix, we will first work an example involving 3-SAT to demonstrate some estimation methods that can be applied to problems widely believed to be computationally infeasible to even approximate. Then, we will discuss how we hope to obtain estimates accurate enough to detect risk from AI systems deliberately acting catastrophically by obtaining estimates that are competitive with the AI system or the process used to train it.
3-SAT
A boolean formula is called a 3-CNF if it is formed of an AND of clauses, where each clause is an OR of 3 or fewer literals. For example, this is a 3-CNF with 3 clauses over 5 variables:
(x1∨¬x3∨x4)∧(¬x2∨x3∨¬x5)∧(x2∨¬x4∨x5)
We say a given setting of the variables $x_i$ to True or False satisfies the 3-CNF if it makes all the clauses True. 3-SAT is the problem of determining whether there is any satisfying assignment for a 3-CNF. #3-SAT is the problem of determining the number of satisfying assignments for a 3-CNF. Since the number of assignments for a 3-CNF is $2^{\text{number of variables}}$, #3-SAT is equivalent to computing the probability a random assignment is satisfying. The above 3-CNF has 20 satisfying assignments and $2^5 = 32$ possible assignments, giving a satisfaction probability of $\frac{5}{8}$.
It's widely believed that 3-SAT is computationally hard in the worst case, and #3-SAT is computationally hard to even approximate. However, analyzing the structure of a 3-CNF can allow for reasonable best guess estimates of the number of satisfying assignments.
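For a formula as small as the example above, the satisfying assignments can simply be enumerated. The sketch below is one way to do that. The signed-integer encoding of literals (+i for the i-th variable, -i for its negation) and the function names are illustrative choices, not something taken from the post.

```python
from itertools import product

# Hypothetical encoding: literal +i means x_i, -i means NOT x_i.
EXAMPLE = [(1, -3, 4), (-2, 3, -5), (2, -4, 5)]

def satisfies(assignment, clauses):
    """assignment maps variable index -> bool; True if every clause has a true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def count_satisfying(clauses, n):
    """Brute-force #3-SAT: enumerate all 2^n assignments of n variables."""
    count = 0
    for values in product([False, True], repeat=n):
        assignment = dict(enumerate(values, start=1))
        count += satisfies(assignment, clauses)
    return count

n = 5
count = count_satisfying(EXAMPLE, n)
print(count, count / 2**n)  # 20 0.625
```

Enumeration takes time exponential in the number of variables, which is why the methods below settle for best-guess estimates instead.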
In the following sections, F is a 3-CNF with C clauses over n variables. We will write P(F) for the probability a random assignment is satisfying, which can be easily computed from the number of satisfying assignments by dividing by $2^n$.
Method 1: Assume clauses are independent
Using brute enumeration over at most 8 possibilities, we can calculate the probability that a clause is satisfied under a random assignment. For clauses involving 3 distinct variables, this will be $\frac{7}{8}$.
If we assume the satisfaction of each clause is independent, then we can estimate P(F) by multiplying the satisfaction probabilities of each clause. If every clause involves 3 distinct variables, this will be $\left(\frac{7}{8}\right)^C$. We will call this the naive acceptance probability of F.
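A minimal sketch of Method 1, using exact rational arithmetic. The signed-integer clause encoding (+i for the i-th variable, -i for its negation) and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def clause_probability(clause):
    """Exact probability that one clause holds under a uniform random assignment,
    found by enumerating the (at most 8) settings of the clause's own variables."""
    variables = sorted({abs(lit) for lit in clause})
    satisfied = 0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        satisfied += any(assignment[abs(lit)] == (lit > 0) for lit in clause)
    return Fraction(satisfied, 2 ** len(variables))

def naive_acceptance_probability(clauses):
    """Method 1: multiply per-clause probabilities as if clauses were independent."""
    estimate = Fraction(1)
    for clause in clauses:
        estimate *= clause_probability(clause)
    return estimate

# The example 3-CNF from the 3-SAT section.
EXAMPLE = [(1, -3, 4), (-2, 3, -5), (2, -4, 5)]
print(naive_acceptance_probability(EXAMPLE))  # 343/512
```

On the example formula this gives 343/512, roughly 0.67, against the true satisfaction probability of 5/8. The difference comes entirely from the clauses sharing variables.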
Method 2: Condition on the number of true variables
We say the sign of $x_i$ is positive, and the sign of $\lnot x_i$ is negative. If F has a bias in the sign of its literals, then two random clauses are more likely to share literals of the same sign, and thus be more likely to be satisfied on the same assignment. Our independence assumption in Method 1 fails to account for this possibility, and thus will underestimate P(F) in the case where F has biased literals.
We can account for this structure by assuming the clauses of F are satisfied independently conditional on the number of variables assigned true. Instead of computing the probability each clause is true under a random assignment, we can compute the probability under a random assignment where m out of n variables are true. For example, the clause (x1∨¬x3∨x4) is false only when x1 is false, x3 is true and x4 is false, which happens on $\binom{n-3}{m-1}$ of the $\binom{n}{m}$ possible assignments, for a satisfaction probability of $1 - \binom{n-3}{m-1}/\binom{n}{m} = 1 - \frac{m(n-m)(n-m-1)}{n(n-1)(n-2)} \approx 1 - \left(1-\frac{m}{n}\right)\frac{m}{n}\left(1-\frac{m}{n}\right)$, where the latter is the satisfaction probability if each variable were true independently with probability $\frac{m}{n}$. Multiplying together the satisfaction probabilities of each clause gives us an estimate of P(F) for a random assignment where m out of n variables are true.
To obtain a final estimate, we take a sum over satisfaction probabilities conditional on m weighted by $\binom{n}{m}/2^n$, the chance that m variables are true.
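A minimal sketch of this conditional estimate follows. It relies on one counting fact: a clause over k distinct variables, q of them negated, is falsified by exactly $\binom{n-k}{m-q}$ of the $\binom{n}{m}$ assignments with m true variables, because its q negated variables must be true and its other k - q variables must be false. The signed-integer clause encoding and the function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def clause_prob_given_m(clause, n, m):
    """P(clause satisfied | exactly m of the n variables are true).
    Assumes the clause's variables are distinct."""
    k = len(clause)
    negated = sum(1 for lit in clause if lit < 0)
    # The clause fails only if every negated variable is true and every other
    # clause variable is false; the remaining true variables lie outside the clause.
    remaining_true = m - negated
    if remaining_true < 0:
        return Fraction(1)  # not enough true variables to falsify the clause
    falsifying = comb(n - k, remaining_true)  # 0 when remaining_true > n - k
    return 1 - Fraction(falsifying, comb(n, m))

def method2_estimate(clauses, n):
    """Method 2: assume clauses are independent conditional on m, then average over m."""
    total = Fraction(0)
    for m in range(n + 1):
        conditional = Fraction(1)
        for clause in clauses:
            conditional *= clause_prob_given_m(clause, n, m)
        total += Fraction(comb(n, m), 2**n) * conditional
    return total

EXAMPLE = [(1, -3, 4), (-2, 3, -5), (2, -4, 5)]
print(method2_estimate(EXAMPLE, 5))  # 67/100
```

The example formula has nearly balanced signs, so this lands very close to the Method 1 value of 343/512. The two estimates separate when most literals share a sign.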
Method 3: Linear regression
Our previous estimate accounted for possible average imbalance in the sign of literals. However, even a single extremely unbalanced literal can alter P(F) dramatically. For example, if $x_i$ appears positively in 20 clauses and negatively in 0, then by setting $x_i$ to true we can form a 3-CNF with 1 fewer variable and 20 fewer clauses that has naive acceptance probability of $\left(\frac{7}{8}\right)^{C-20}$ (a naive count of $2^{n-1}\left(\frac{7}{8}\right)^{C-20}$ satisfying assignments). $x_i$ will be true with $\frac{1}{2}$ chance, so this represents a significant revision.
We can easily estimate P(F) in a way that accounts for the balance in any particular literal $x_i$. However, it is not simple to aggregate these estimates into an overall estimate for P(F).
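One way to account for a single variable is to fix it to each value in turn, simplify the formula, apply the naive estimate to each simplified formula, and average the two. The sketch below does this. It is not the regression-based aggregation discussed in this section, only the per-variable estimate that such an aggregation would take as input. The clause encoding, the helper names and the 20-clause test formula are all hypothetical.

```python
from fractions import Fraction

def naive_probability(clauses):
    """Method 1 for clauses over distinct variables:
    a clause with k literals holds with probability 1 - 2^-k."""
    estimate = Fraction(1)
    for clause in clauses:
        estimate *= 1 - Fraction(1, 2 ** len(clause))
    return estimate

def restrict(clauses, variable, value):
    """Simplify the formula after fixing one variable.
    Returns None if some clause becomes empty (cannot be satisfied)."""
    reduced = []
    for clause in clauses:
        if any(abs(lit) == variable and (lit > 0) == value for lit in clause):
            continue  # clause is already satisfied by the fixed variable
        shorter = tuple(lit for lit in clause if abs(lit) != variable)
        if not shorter:
            return None
        reduced.append(shorter)
    return reduced

def estimate_conditioning_on(clauses, variable):
    """Average the naive estimates of the two restricted formulas."""
    total = Fraction(0)
    for value in (False, True):
        reduced = restrict(clauses, variable, value)
        if reduced is not None:
            total += Fraction(1, 2) * naive_probability(reduced)
    return total

# Hypothetical formula: x_1 appears positively in all 20 clauses, never negatively.
clauses = [(1, 2 * j, 2 * j + 1) for j in range(1, 21)]
naive = naive_probability(clauses)
estimate = estimate_conditioning_on(clauses, 1)
print(float(naive), float(estimate))
```

In this formula the clauses share no variable other than the first, so conditioning on it recovers the exact value 1/2 + (1/2)(3/4)^20, a little above 0.5, while the naive estimate (7/8)^20 is below 0.1.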
One approach is to combine these estimates linearly in a way that minimizes some measure of error. For instance, if we wanted to minimize mean-square error, then we would treat each estimate as a feature and combine them using linear regression. If we estimate the covariance of each feature using our naive acceptance probability, then this is equivalent to doing linear regression over the reference class of 3-CNFs where the sign of each literal is flipped uniformly at random. For more details, see section 9.2 of Neyman 2024.
This method produces an estimate for P(F) that has lower mean-square error over random F than method 2, but lacks other intuitively desirable properties like producing estimates in [0,1]. We could clamp our estimate, or attempt to do logistic regression instead, but we will have to trade off between many intuitively desirable properties. Deciding which set of tradeoffs to take requires clarifying what kind of accuracy we want, which in turn depends on downstream applications.
Main hope: estimates competitive with the AI system or the training process
As an analogy, suppose we have an AI system M that acts catastrophically whenever the input x is a satisfying assignment to a particular 3-CNF F. Accurately estimating the probability M behaves catastrophically is likely computationally infeasible. Nevertheless, we still believe we can obtain estimates that are "good enough" to detect if M poses a catastrophic risk.
If F was a randomly selected 3-CNF, then it is still generally computationally infeasible to tell if there exists a satisfying assignment. However, Method 1 presented above is an exact calculation for the expected value of P(F) for a randomly selected 3-CNF. Since F was in fact randomly selected, it seems reasonable to estimate P(F) using this "reference class" of random Fs.
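The claim that Method 1 gives the expected value of P(F) over random formulas can be checked numerically on small instances. The sketch below assumes one particular reference class: each clause picks 3 distinct variables uniformly and negates each independently with probability 1/2. The sizes, seed and function names are arbitrary illustrative choices.

```python
import random
from itertools import product

def random_3cnf(n, num_clauses, rng):
    """Assumed reference class: 3 distinct variables per clause, random signs."""
    clauses = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, n + 1), 3)
        clauses.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return clauses

def satisfaction_probability(clauses, n):
    """Exact P(F) by enumerating all 2^n assignments (only feasible for small n)."""
    satisfied = 0
    for values in product([False, True], repeat=n):
        satisfied += all(
            any(values[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
    return satisfied / 2**n

rng = random.Random(0)
n, num_clauses, samples = 6, 5, 2000
mean = sum(
    satisfaction_probability(random_3cnf(n, num_clauses, rng), n)
    for _ in range(samples)
) / samples

# The sample mean of P(F) should sit close to the naive acceptance probability.
print(mean, (7 / 8) ** num_clauses)
```

Individual formulas can be far from (7/8)^C, and some are unsatisfiable. The naive value is only right on average over this reference class, which is why it stops being a good estimate once F is selected non-randomly.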
If F was not selected randomly, this will not be a "good enough" estimate. For instance, even if we could calculate the probability of acting catastrophically for a random neural network, treating our trained neural network M as a randomly sampled neural network would produce a terrible estimate for the probability it behaved catastrophically. Similarly, if our training process selected F to have a high satisfaction probability, then treating F as a random 3-CNF would produce a very bad estimate for P(F).
However, in these cases, there is information we could learn that suggests treating F (or M) as a random instance would produce a poor estimate. If we knew how F was selected, then we would know that F was not selected randomly, and so we would know not to treat it as such. Since the training process selected F, we have an "existence proof" for a process that is as efficient as the training process that knows how F was selected, and thus has the information necessary to produce a more informed estimate of F. Our hope is to produce a parallel "training process" for these more informed estimates by starting with an estimate that assumes M is random, then gradually updating the estimate to incorporate all the information used to produce M.
More generally, we hope that if M sometimes deliberately decides to act catastrophically then there is some information that our estimate could incorporate to notice this fact. So if we can use all the information that we have available to estimate the probability that M acts catastrophically, then we will be able to reliably distinguish between benign and deceptive AI systems.
[1] For example, it would be catastrophic to destroy all life on Earth. It would be undesirable, but not catastrophic, for a self-driving car to hit a pedestrian.
[2] It might also be itself catastrophic to filter M's outputs during high-stakes situations if there is no safe fall-back action.
[3] It might be difficult to determine whether M(x) is catastrophic without the input x. To be more general, we could also provide x to C. Since this does not change the overall presentation and adds notational burden, we use C(M(x)).
[4] For instance, AI systems are often jailbroken shortly after release, despite developer efforts.
[5] See Carlsmith 2023 for discussion about how such incentives might arise.
[6] Even if inputs are indistinguishable between training and deployment, an AI typically spends more time deployed than in training. Thus our AI could behave catastrophically with some constant probability that is low enough to avoid acting catastrophically while being trained, but high enough that it is likely to take a catastrophic action eventually. Section 4.5 of Risks from Learned Optimization makes a similar argument.
[7] We will release some empirical work on estimating tail events later this year.
[8] Unfortunately, obtaining accurate estimates of the probability of arbitrary binary functions is computationally infeasible; we discuss such objections in Complexity Theoretic Barriers to Accurate Estimates.
[9] This problem is well known; for an overview of such methods, see section 8 of Murphy 2023.
[10] See section 2 of Minka 2013 for the "moment matching" theorem. If $f_i$ is a low-degree polynomial, these values can be computed exactly using Isserlis' Theorem. If $f_i$ involves a ReLU, then we can use a polynomial approximation or estimate it numerically.
[11] See Chapter 10 of Murphy 2023 for an overview of variational inference.
[12] The sum of squares algorithm is an example of how such an estimation procedure might look.
[13] For example, the AI system might pick a cryptographic puzzle it thinks will likely be solved, or make some prediction of what the future will look like based on arcane sociological principles.