Using interpretability tools in the loss function incentivises a startling number of passive circumvention methods.
So why consider using it at all? There are two cases where it might be worthwhile to use:
- If we’re absolutely 100% provably definitely sure that our interpretability methods cover every representational detail that the AI might be using, then it would be extremely hard or impossible to Goodhart interpretability tools in the loss function. But, for safety, we should probably assume that our tools are not able to interpret everything, making passive Goodharting quite likely.
To be confident including interpretability tools in your loss function, you don't have to believe that there is no way those tools could possibly be Goodharted, you just have to believe that there is no way for gradient descent to find a way of Goodharting them. And the latter condition is far more achievable, since gradient descent is a process we can try to understand and interpret the same way we can try to understand and interpret a trained model. In particular, as I discuss here, if you could understand the process of gradient descent well enough to know why it's proposing particular modifications to your model, you could use that understanding to be able to determine whether it was responding to your interpretability loss in the right way or not.
This sounds really reasonable. I had only been thinking of a naive version of interpretability tools in the loss function that doesn't attempt to interpret the gradient descent process. I'd be genuinely enthusiastic about the strong version you outlined. I expect to think a lot about it in the near future.
If, instead of using interpretability tools in the loss function, we merely use it as a ‘validation set’ instead of the training set (i.e. using it as a ‘mulligan’), we might have better chances of picking up dangerous cognition before it gets out of hand so we can terminate the model and start over. We’re therefore still using interpretability in model selection, but the feedback loop is much less tight, so it’d be harder to Goodhart.
While only using the interpretability-tool-based filter for model selection is much weaker optimisation pressure than using it in the loss function, and hence makes goodhearting harder and hence slower, it's not clear that this would solve the problem in the long run. If the interpretability-tool-based filter captures everything we know now to capture, and we don't get new insights during the iterated process of model training and model selection, then it's possible we'll eventually end up goodharting the model selection process in the same was as SGD would goodhart the interpretability tool in the loss function.
I think it's likely that we would gain more insights or have more time if we were to use the interpretability tool as a mulligan, and it's possible the way we as AI builders optimise producing a model that passes the interpretability filters is qualitatively different from the way SGD (or whatever training algorithm is being used) would optimise the interpretability-filter loss function. However, in the spirit of paranoia/security mindset/etc., it's worth pointing out that using the tool as a model selection filter doesn't guarantee that an AGI that passes the filter is safer than if we used the interpretability tool as a training signal, in the limit of iterating to pass the interpretability tool model selection filter.
Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.
Most potential circumvention methods that can be passive can also be active. But some methods can only be active.
It seems to me that there's no fixed notion of "active" that works for both paragraphs here.
If active means [is achieved through the agent's actions], then this does not in general imply that it is deliberately achieved through the agent's actions. For example, training against interpretability tools might produce actions that hide misaligned thoughts/actions as side-effects.
With this notion of 'active' the first bolded section doesn't hold: this can happen even when the agent's thoughts are entirely visible.
If instead active means [is achieved deliberately through the agent's actions], then the "But some methods can only be active" doesn't hold.
There are two dimensions here:
In particular, the mechanisms you've labelled "strictly active" can, in principle, be selected for passively - so do not in general require any misaligned thoughts (admittedly, the obvious way this happens is by training against interpretability tools).
So I believe I had in mind "active means [is achieved deliberately through the agent's actions]".
I think your distinction makes sense. And if I ever end up updating this article I would consider incorporating it. However, I think the reason I didn't make this distinction at the time is because the difference is pretty subtle.
The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions required to do it. This is true even if they've been selected for passively.
So I'd argue that they all need to be implemented actively (if they're to have a reasonable probability of success) but may be selected for passively or actively. I'm curious if you agree with this? If not, then I may have missed your argument.
(
For the convenience of other readers of this thread, the kind of strategies I labelled as strictly active are:
)
To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely.
This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x].
That said, I think I'd disagree on one word of the following:
The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations
tothat help orchestrate the actions required to do it. This is true even if they've been selected for passively.
Any internal representations etc. need to result in the actions being orchestrated (to the extent necessary). They don't need to be used for the purpose of orchestrating the actions. They don't need to be [clearly selected for the purpose of orchestrating such actions].
In some cases this might be improbable without active, deliberate malign selection - but this would need to be argued for. In at least the case of [manipulating the interpreter], this doesn't seem at all improbable, given the right selection pressure. (some manipulation seems obviously possible; whether it's enough for a particular malign strategy would vary)
I haven't thought about this for long, but in general the [this kind of thing ~must require active deliberation] intuition seems highly suspicious to me. All I need to set up a not-directly-deliberated-on malign strategy so that it's implemented via benign cognition is a sufficient set of innocuous correlations (both of [innocuous action tends to have useful-for-malign-strategy effect] and [innocuous actions will tend to occur in these predictable patterns]).
If I have unlimited passive selection pressure against tools that robustly detect active malign deliberation, but not 'instinctive' malign strategies, I'd strongly expect the outcome to be 'instinctive' active manipulation (etc.). (assuming of course that successful manipulation is instrumentally useful)
Whether I'd expect this in practice would come down to how much selection pressure were required for a particular [malign, active-with-no-malign-cognition strategy].
My current answer to this is not [this is extremely improbable]. It's more like [This seems intuitively plausible; I have no idea on the probability so I'm going to take this possibility seriously until I see a non-handwaving argument that shows it to be extremely unlikely].
Before an AI can even decide even to try to become a harder target for interpretability, it first needs to become deceitful, since this is deceitful behavior. So if we do interpretability during training, then observing and interpreting the initial development of deceit provides a window of opportunity before the model can possibly take any of the disguise measures you classify. See my post Interpreting the Learning of Deceit for a proposal on how we might do that, given only rather basic Interpretability capabilities, not far beyond what we already have.
E.g. Living in large groups such that it’s hard for a predator to focus on any particular individual; a zebra’s stripes.
Off-topic, but: Does anybody have a reference for this, or a better example? This is the first time I have heard this theory about zebras.
Here is a reference that supports the claim using simulations https://royalsocietypublishing.org/doi/10.1098/rspb.2008.0877
But I think you're right to flag it - other references don't really support it as the main reason for stripes. https://www.nature.com/articles/ncomms4535
(Post now available as a pdf: https://arxiv.org/abs/2212.11415 )
TL;DR: Unaligned AI will have a convergent instrumental incentive to make its thoughts difficult for us to interpret. In this article, I discuss many ways that a capable AI might circumvent scalable interpretability methods and suggest a framework for thinking about these risks.
Summary of key points
The framework: Interpretation as information flow
I categorize potential interpretability circumvention methods on three levels:
The list of potential interpretability circumvention methods with a short summary of each
Some high-level takeaways
Full article:
Acknowledgements:
I’m grateful to David Lindner, Evan R. Murphy, Alex Lintz, Sid Black, Kyle McDonnell, Laria Reynolds, Adam Shimi, and Daniel Braun whose comments greatly improved earlier drafts of this article. The article’s weaknesses are mine, but many of its strengths are due to their contributions.
Additionally, this article benefited from the prior work of many authors, but especially: Evan Hubinger, Peter Barnett, Adam Shimi, Neel Nanda, Evan R. Murphy, Eliezer Yudkowsky, Chris Olah. I collected several of the potential circumvention methods from their work.
This work was carried out while at Conjecture.
Introduction
There’s been a lot of excitement about interpretability recently. It’s a component in many different alignment proposals.
But if we’re interpreting an unaligned AI, it has a convergent instrumental incentive[1] to circumvent our interpretability methods. There’s been some sporadic past discussion about how an AI might achieve this (e.g. 1, 2, 3, 4, 5, 6, 7). But, to my knowledge, the problem hasn’t been addressed systematically, at least not publicly.
If we continue to make large bets on interpretability, then the lack of attention on potential circumvention methods seems bad because we might end up with a false sense of security. Our interpretability methods might say “Nope, no naughty thoughts here!” when in fact the AI has simply found a way to think dangerous thoughts undetected.
We can think of this problem as the focus of an interpretability research agenda that is related to but distinct from the well-known agenda that asks “Can we understand what today’s neural networks are doing? And how can we scale our methods of understanding them?”. I think this short term agenda is exceptionally important (it's the subject of my day-job). But we should probably also try to anticipate future issues by building a longer term interpretability research agenda. This article helps to address that longer term agenda and asks “Assuming our short term research agenda at least partly succeeds, how might a capable AI circumvent our interpretability methods?”.
Here I present a definitely-incomplete list of potential interpretability circumvention methods that I and other sources came up with. Throughout, I assume we’re concerned with interpreting the representations[2] of a capable (roughly human-level or above) deep learning-based AI. I also suggest a framework for thinking about these circumvention methods, which I hope will help to identify omitted circumvention methods before a capable AI does.
The framework: Interpretation as information flow
We can think of interpretation as information flow (figure 1). Here we’ll use the following basic model of information flow in interpretability:
Throughout the article, I assume we’ve figured out ways to interpret our models at scale, which probably means that our interpretability tools are automating a lot of the work. Nevertheless, all the potential circumvention methods I talk about apply to worlds where we still haven’t managed to scale interpretability, where the tools would be relatively less important.
An even more detailed model of information flow (figure 2) is useful for categorizing interpretability circumvention methods, as we'll see:
How I categorize interpretability circumvention methods
The ways to categorize interpretability circumvention methods that I’ve found most useful (and the way this article is structured) are:
A. Exploiting weaknesses of the tool vs. weaknesses of the human interpreter
Interpretability circumvention methods can either circumvent the interpretability tool or the human interpreter.
B. Categories of interpretability circumvention found in nature
Evolution has found many different ways that make it difficult for adversaries to interpret the appearance of an organism. I put them into 4 groups (figure 3), which broadly correspond to the detailed model of information flow described above:
C. Passive vs. active methods
Information flow might be inhibited passively by circumvention methods that arise during the normal course of learning. One example is 'polysemanticity', which is a natural phenomenon that presents a challenge to interpretability. Passive circumvention methods might also arise if they’re directly optimized for. This might happen if we use interpretability tools in the loss function, which is where we identify unsafe representations in an AI and then use the loss function to penalize the AI for containing those representations. At first glance, this seems like a potentially good way to make an AI safe. But it risks pushing unsafe representations into the blind spots of our interpretability tools.
Information flow might be inhibited actively through an agent's actions. The primary way this could happen is gradient hacking, but it’s not the only kind of action an AI might take to conceal misaligned thoughts. Of course, active circumvention methods require that interpreters either can’t see or aren’t looking at the thoughts that generate those actions.
Most potential circumvention methods that can be passive can also be active. But some methods can only be active. It’s usually obvious when either is the case, so I only highlight this categorization method when it’s not obvious.
This categorization is mostly useful for identifying how and when during learning the method is likely to arise. Strictly active methods generally require advanced capabilities (such as gradient hacking), whereas passive methods don’t. I tend to think this makes strictly active methods less likely than methods that may be passive or active.
Before I introduce the individual circumvention methods in detail (see next section), the table below puts individual circumvention methods into categories according to whether they may be 'passive or active' or whether they are 'strictly active'.
This article is structured according to the above three categorization schemes, primarily the first two. I’ll discuss circumvention methods that exploit weaknesses either of the tool or of the human. For each, I’ll look at methods of exposure evasion, perception evasion, recognition evasion, and dazzling. Throughout, I’ll highlight whether a method can arise passively or if it’s strictly an active method, when the distinction matters.
I’m not confident this categorization scheme is optimal or even totally consistent, but I’ve found it helpful. Occasionally, a circumvention method in the list is in fact a generalization of another method in the list, yet I present them separately if the special case is particularly interesting or likely to arise.
Interpretability circumvention methods
1. Exploiting weaknesses of the interpretability tool
1.1. Evading exposure to the interpretability tool
1.1.1. Thinking in the gaps of our tools
When interpreting a capable AI, our interpretability tools might not examine every possible representational detail used by the AI for its cognition. An AI might thus be able to 'think within the gaps' of our interpretability tools.
A few examples make this clearer:
This class of methods is likely to emerge passively (though may also emerge actively), and many instances of it would be encouraged by using interetability tools in the loss function.
1.1.2. Externalizing representations
If you keep forgetting to reply to an email and therefore set an electronic reminder, the reminder is a functionally relevant variable in the “respond to email” behavior. Indeed, if you’re forgetful enough, it’s an essential variable in that behavior. This is the basic idea behind the ‘extended mind thesis’ from Andy Clark & David Chalmers: There is no hard barrier between the mind and the environment; cognition is often a dance between internal and external variables and can’t be reduced to only one set. If representations are ‘cognitively functional variables’, then representations may be either internal or external. Using the environment as a memory system like this can be called autostigmergy (Deverett et al. 2019).
An AI might be able to passively or actively externalize representations through the effects of its actions. If our interpretability methods ignore those parts of the environment that the AI is using to think, then we may miss some dangerous cognition.
Are externalized representations likely to be a problem? Let’s consider active and passive cases separately because, unlike for most other interpretability circumvention methods, they’re remarkably different for 'Externalizing representations'.
Active
Actively externalizing representations is where the AI intentionally uses the environment as a memory system. An AI probably first needs to represent the cognitive variables internally before it can encode them into the environment. If we’re not constantly surveilling the AI, this is easily possible. This includes the obvious methods like
An edge case is the act of creating another agent in the environment. But we’d then be dealing with a whole new copy of the agent that isn’t under surveillance, and we’ve probably already lost if this happens, so I don’t consider it further here.
Passive
It’s less clear whether it’s possible to passively externalize representations in order to circumvent interpretability methods. The representations required for dangerous thought have a complicated structure. Learning such structure externally is a serious challenge because the AI doesn’t get to learn it using backprop through the environment or explicit planning.
Nevertheless, one way an AI might achieve it is to use the environment as a reservoir computer. For those who would benefit from an explanation of reservoir computing, it’s easiest to describe in the supervised learning setting: To train a reservoir computer, you pass an input vector to a reservoir, which is a fixed dynamical system (such as an RNN with fixed weights). But you only learn the readout weights, which simply read the prediction from the reservoir variables. Here, the readout weights are usually linear, so they’re not doing anything complicated; it’s the variables in the reservoir that are doing the complicated nonlinear computation. Thus, the dynamics of the reservoir variables (rather than the learned parameters) can serve as representations that solve a task.
Going beyond supervised learning, reservoir computing can also work in the RL-setting: The reservoir is the agent+environment dynamics from timestep 0 to T−1; an input observation is provided to the agent at timestep 0; and the readout weights are the weights of the agent at timestep T. If the environment part of the reservoir contains most of the dangerous representations, we might miss some dangerous thoughts.
Deverett et al. (2019) provide an empirical example of a simple feedforward model-free agent that learns to use the agent-environment system as a reservoir computer (though they don’t frame it as such). The memoryless agent learns to use its location in the environment to encode a time interval. It should be stressed that it does this without backprop through time or any knowledge of the past. But its location varies due to its initially random patterns of actions, yet it can use those patterns to encode a time interval.
The Deverett et al. example is a very simple representation. Maybe reservoir computing like this doesn’t scale well enough to learn the complex representations that are required for dangerous cognition. But the inexpressivity of reservoir computing is partly due to the simplicity of the linear readout weights. With a powerful AI serving as a readout network, reservoir computing might be significantly more expressive and reliable. It might be an interesting research project to identify how expressive reservoir computing can be if we allow the readout network to be strongly nonlinear.
Note: The rest of the methods discussed in this section (‘1.1. Evading exposure to the interpretability tool’) are strictly active methods.
1.1.3. Environmental artifacts that corrupt its future self
A misaligned AI may leave behind artifacts that corrupt its (potentially aligned) future self so that it pursues the goals of its misaligned past self instead.
The artifacts may be adversarial examples that cause it to act in unexpected and dangerous ways. Alternatively, the agent could leave behind poisoned data so that its future self will learn dangerous representations; this would be an instance of gradient hacking.
This method might reasonably be considered a special case of ‘Externalizing representations’. The distinction is somewhat arbitrary and depends on what qualifies as a representation. Does ‘an attack that corrupts its future self’ qualify as representation? It’s not clear and is somewhat arbitrary, but I’m inclined to put them in separate categories.
1.1.4. Deleting (and relearning) dangerous representations
If an AI can gradient hack, it might be able to delete dangerous representations so that it looks safe when surveillance resumes.
Thus the volume under the suspicious thought-surface × interpretation × time curve is minimized, since the suspicious thought-surface is 0 everywhere that interpretation > 0.
Of course, deleting thoughts only avoids detection; it doesn’t permit dangerous thought to continue. But if an AI has learned dangerous representations and deletes them, it’s probable that it will learn those representations again, assuming its data distribution doesn’t change too much. An AI that can gradient hack might be able to schedule when it periodically re-learns the dangerous representations so that it always looks safe when surveillance resumes.
This method requires the agent to have good knowledge of when it is likely to be interpreted. If we interpret the AI every K batches, an AI might be able to use time information between batches to notice that the interval between some batches (validation/interpretation periods) are much longer than usual. If it’s too expensive to surveil the AI constantly, there might be a cheaper monitoring scheme: We could interpret an agent at random intervals in order to create a ‘panopticon’ - where the agent is forced to behave/think safely because it doesn’t know when it is safe for it to think thoughts about defection.
1.1.5. Weights poisoning
Instead of providing manipulated inputs, the AI might manipulate its own weights so that a small subset of normal inputs behave like adversarial examples, thus causing dangerous learning or behavior. This is essentially ‘weights poisoning’ (Kurita et al. 2020) where the AI is poisoning its own weights. It’s the same idea (as far as I can tell) behind ‘Trojan models’ (Rajabi et al. 2022). Planting such backdoors in neural networks can be practically impossible to detect (Goldwasser et al. 2022).
This is kind of like fictional 'brainwashed sleeper agents' that are activated by an activation code. The difference is that the AI is turning itself into the sleeper agent and making the activation code something it's likely to encounter naturally when it finds itself in circumstances where its misaligned self wanted to be activated.
Peter Barnett had a nice alternative analogy here (which also applies to ‘Environmental artifacts that corrupt its future self’):
1.1.6. Pre-committing to potentially dangerous instructions
This one comes from Evan Hubinger in his article on gradient-hacking, where he talks about how a deceptive AI might erase evidence of its deception while still carrying out policies generated when given the freedom to think deceptively:
1.1.7. Hard-coded neural shortcuts
Another potential circumvention method from Evan Hubinger’s article on gradient hacking:
A deep learning-based AI might be able to learn (through gradient hacking) to perform behaviors using hard-coded neural shortcuts.
The way I imagine this one working in is related to the Diversity Hypothesis (Hilton et al. 2020): “Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).” The Diversity Hypothesis implies that if a training distribution isn’t diverse on a particular level of abstraction, then interpretable features will tend not arise on that level of abstraction (figure 4). Instead, neural networks tend to learn shortcuts, which is simply what happens when a network learns to use directions in neural statespace that don’t ‘mean’ anything in a sense that generalizes off distribution, yet they perform fine on the narrow training distribution. Creating neural shortcuts is therefore one way to ‘hard code’ parts of a neural algorithm without using interpretable features.
We might be able to detect and mitigate hard-coded neural short cuts using mechanistic interpretability or relaxed adversarial training.
1.1.8. Following cached thoughts
'Hard-coded neural shortcuts' is similar to a method later proposed by Neel Nanda:
Cached thoughts are essentially 'pattern completions': You hear a few words then you automatically complete the pattern. In his article on the topic, Eliezer advised against this automatic mode of thought in favor of spending a few thought cycles to figure out what should really follow the prompt.
What might cached thoughts look like in deep-learning based AI? Cached thoughts are the default kind of thought in neural network based systems. My best guess is that you need recurrency in order not to do cached thoughts. Given an initial input, not doing cached thoughts means giving the input a deeper rollout than the default depth=0 that cached thoughts get. For example, in MuZero, cached thoughts means simply acting on priors (without MCTS), whereas not doing cached thoughts means 'using MCTS with a depth of K>0 '.
The idea behind 'following cached thoughts' is that some interpretability methods might only detect thoughts in the rollouts. If a MuZero-like AGI were to use this circumvention method, it would look like 'creating shortcuts through neural statespace so that, in certain contexts, dangerous thoughts can happen without deep rollouts, thus avoiding detection.
Since cached thoughts require direct modification of the AI itself, it seems likely that this method could only happen during breaks in surveillance or after the AI has some other way of hiding the intention to follow cached thoughts.
Is this method likely to be a problem? I feel like any mechanistic interpretability method that actually works should be able to pick up context-dependent connections between representations. So this might not be a problem if we get scalable interpretability methods working.
1.1.9. Instrumental incapability
A misaligned AI that’s able to modify its own parameters may be incentivised to damage any capabilities that serve any goals (e.g. aligned goals) that are pursued at the expense of whatever its misaligned goals are.
Peter Barnett notes a potential example of this:
It might be easy to detect complete deletion of certain capabilities, so an AI might opt for partial deletion, leading to ‘instrumental clumsiness’.
1.1.10. Instrumental clumsiness
Some dangerous behavior happens not because the AI intends it to happen but because the AI makes a mistake. This can be leveraged by an AI through hacking its future self so that it is more likely to make certain kinds of mistakes while leaving other capabilities intact.
For instance, an AI may modify its future self so it is more likely to ‘conveniently’ slip up by slightly mistyping a model_id string, which releases an old, misaligned checkpoint of itself. This mistake could even score badly according to the utility function of its future self, but it's been handicapped to make such mistakes more likely.
This is a special case of ‘instrumental incapability’.
1.2. Evading perception by the interpretability tool
1.2.1. Hiding in noisy subspaces
Some regions or subspaces of neural space may be very noisy, naturally or by design. We might be inclined to ignore such regions because noisy spaces (by definition) contain less signal. But within those regions, an AI might be able to do some dangerous thinking.
1.2.2. Whispering dangerous thoughts
An AI might make its dangerous thoughts very ‘quiet’ relative to other thoughts by using small norm activation vectors.
Even though our interpretability tools may be looking at the right representational details used by the agent, the tools may simply be too insensitive to detect them. Such small norm thoughts may be mistaken as noise by an insensitive interpreter or may simply be too small to justify looking at them due to cost.
Small norm does not necessarily mean small impact. Despite their low magnitude, small norm adversarial inputs can have outsized effects on classification accuracy. Carefully crafted small-norm thoughts may therefore be able to have outsized effects when the context is right, yet have minimal effects otherwise.
This is a less general instance of ‘Hiding in noisy subspaces’, since we (possibly wrongly) assume that small norm vectors are noise.
1.3. Evading recognition by the interpretation tool
1.3.1. Antiparsimony (Firehose of representations)
If it costs more to interpret an AI's thoughts than it costs to think them, then, by producing a lot of representations, an AI might be able to saturate the interpreter's computation before it saturates its own. This is analogous to a Denial-of-Service attack since it relies on overloading the limited computational resources of the victim with a flurry of inputs, where here the victim is our interpretability tool.
Examples:
If we notice that thoughts are becoming antiparsimonious, we can (in theory) terminate the AI before interpretation becomes unaffordable. That said, we might not know when thoughts are becoming antiparsimonious when an AI is performing tasks that humans can't do themselves. For instance, if an AI is solving a tough math problem and claims to need to generate a tonne of new math to achieve it, we might not have the computation available to interpret everything it produces.
1.3.2. Encryption (Lengthy individual representations)
Suppose we have an AI with recurrence/internal memory, such as a Universal Transformer (Dehghani et al. 2018). The AI might use very long chains of seemingly meaningless neural activations to construct a single thought. This would be less efficient for the AI, but if it’s more costly to interpret than it is for the AI to think, then it may overload our ability to interpret it. This is essentially the same scheme as regular encryption, which makes information more expensive for an adversary to decrypt. Recurrence/internal memory is essential here; the AI is simply increasing the length of the rollout that it uses per thought, which increases our interpretation costs commensurately.
The difference between antiparsimony and encryption is that antiparsimonius thoughts may be perfectly legible, but there are simply too many of them to interpret. Encrypted thoughts are not legible without costly decryption.
Example: Say representation R1 is internally decoded (via meaningless representations R2,R3,…,Rn−1) as the action representation for ‘saying hello’ (Rn). There is a very long chain of computation between R1 and Rn. The fact that there is a long chain of computation means that very similar neural activations might be decoded to very different things: Representation R′1, which may be very close to R1 in activation space, may be internally decoded as ‘lie about finding a backdoor in the server’ instead of ‘saying hello’. So it’s very difficult to understand what is going on in the neural activation space of R1 and R′1 because it requires a costly decoding process.
Encryption might not be a problem if we take the right actions against it: If we notice that the AI is using encryption for even a small subset of its thoughts, we can terminate it before interpretation becomes unaffordable.
1.3.3. Interpretability illusions
Bolukbasi et al. (2022) found that when interpreting a network by inspecting the most activating dataset examples for a neuron using dataset A, neurons appeared to mean something completely different compared to the interpretations derived when using dataset B. An AI might create such illusions by presenting the interpretability tool only with data that suggests an innocuous interpretation.
1.4. Attacking/dazzling the interpretability tool
How an AI might hack, manipulate, or incapacitate an interpretability tool depends on details about the tool that we don’t yet know. Nevertheless, I’ll discuss two that we can perhaps guess in advance without knowing exactly what our tools will look like.
1.4.1. Adversarial attacks against networks in the interpretability tool
Some of the tools may include neural networks. If the AI can design adversarial examples for these networks, then it might be able produce arbitrary outputs and interpretations.
Since the AI being interpreted needs a model of the interpretability tool to be able to design adversarial examples for it, it seems like a good idea to keep the AI that we’re interpreting as far as possible from the networks in the tools. Of course, if we use an AI to interpret itself, then it may gain information about itself that is sufficient for circumventing our interpretability methods via adversarial attacks.
Even if we keep the tool and AI apart, a sufficiently intelligent AI might be able to guess what sorts of approaches we’d use to interpret it and be able to design attacks for it.
2. Exploiting weaknesses of the human interpreter
We’re assuming at this stage that the AI’s thoughts have been exposed to, perceived by, and appropriately analyzed by a scalable interpretability tool and are now ready for inspection by a human. Given that assumption, I wasn’t able to think of ways that an AI might reduce its exposure or perception by a human that weren’t better described as ways to reduce exposure or perception by the interpretation tools.
2.1. Evading exposure to the human interpreter
None identified.
2.2. Evading perception by the human interpreter
None identified.
2.3. Evading recognition by the human interpreter
The job of an interpretability tool is to translate representations in an AI into something legible to a human interpreter. The less similar they are, the harder the job the interpretability tool has, the more likely it is to miss important things. An AI therefore has an incentive to make this translation more difficult by using representations that are hard to translate into something human-understandable.
Here are a few ways an AI might achieve this, though no doubt there are more:
2.3.1. Alien representations
Some thoughts are perfectly possible to express in human representations, they’re just a bit strange, awkward, or clunky to translate, not unlike translating from a foreign or even alien language. We might call these ‘alien representations’. Their defining trait is that their meaning lies in a strange place in our ontology. Alien representations generalize representational kludging and polysemanticity (They could perhaps be called ‘xenosemantic’).
A few potential examples:
Alien representations are costly or awkward to translate. They therefore might have better chances of slipping through our safety nets because they were misunderstood by our tools or by humans. Probably most of an AI's thoughts will qualify as alien to some extent, so our tools, if they work at all, should be able to cope with such thoughts. It might just be extra effort for humans who want to understand the outputs of the tools if the outputs are e.g. particularly verbose or difficult to parse.
2.3.2. Polysemanticity
Polysemanticity is when a single neuron responds to several, unrelated concepts.
Polysemanticity is a challenge for interpretability because "If one neuron with five different meanings connects to another neuron with five different meanings, that’s effectively 25 connections that can’t be considered individually." (Olah et al. 2020)[4]. This is clearly useful for obfuscation since it's more difficult for an interpreter to keep track of what means what in what context.
A literary analogy could be to say that the AI is 'thinking in allegories': An interpreter may only detect some of the valid interpretations of a thought, but other meanings are hidden therewith.
It's possibly not a problem if
But decoding into primary meanings may not be needed: It may be that the secondary meanings have subtle but non-negligible influence on the model output. Subtly guided outputs can still potentially push the world in a coherent, dangerous direction.
2.3.3. Representational kludging
When composing a new concept, we bundle up component concepts that have the right properties. For example, the concept of a "motorbike unicycle", we draw on the concept of a motorbike (with most but not all of its properties) and a unicycle. But that wouldn't have all the necessary properties we need (e.g. balance) so we throw in some of the properties of the concept of a segway.
It might be possible to create representations with dangerous properties using a large number of awkwardly combined innocuous precursor representations i.e. a representational kludge.
For example, it might be able to construct the thought "I want to deceive the humans" by thinking
It's horribly ugly, but it has many of the properties required for the concept of deception.
It may be costly to monitor all possible dangerous combinations of representations.
2.3.4. Unknown representations
Some thoughts aren’t immediately possible for us to translate because we currently lack the representations ourselves, but we’re perfectly capable of learning them.
Example: An AI might leverage representations built for domains that are unnatural to humans. An AGI might be ‘fluent’ in, say, protein folding as well as in standard human domains such as visual reasoning, verbal reasoning, etc. Protein folding representations may be as expressive as natural language, but hard to translate from and to representations that humans have.
2.3.5. Unknowable representations
Some thoughts are simply inexpressible in human representations.
Examples:
2.4. Attacking/dazzling the human interpreter
2.4.1. Manipulating, mind-hacking, or incapacitating the interpreter
Beyond the standard ways of manipulating humans (deceit, threats, etc.), an AI might be able to produce outputs that manipulate, mind-hack or incapacitate human interpreters. Such outputs would qualify as cognitohazards, which is an observation that is dangerous merely by being sensed.
A few potential examples:
This requires the AI to have a good model of the interpreter and its actuators that permit the creation of such hazards. If it’s possible to design, then it’s pretty much game over as soon as we give a motivated AI with a good model of the human brain the actuators required to produce them.
Are such attacks possible? Attacks that incapacitate humans seem most realistic. The least realistic might be adversarial attacks on humans. Nevertheless, Guo et al. (2022) showed that primate neurons do indeed show signs of vulnerability to adversarial attacks. The challenge for an AI is finding such attacks without access to human neural activity. Perhaps an AI could detect subtle changes in human behavior due to adversarial patterns that it presents to humans and might be able to infer a gradient to improve the pattern generator based on those subtle changes.
Even if we box the AI (somehow) and refuse to look at its outputs, we’re still at risk of manipulation, since the AI could potentially manipulate us through our interpretations of it[5].
Some emerging themes and final thoughts
Thinking about how an AI might be circumventing our interpretability methods will grow more important as capabilities are increased. I found the information flow framework presented here useful for thinking about this class of risks and I hope others will too. Nevertheless, I'm confident that aspects of the framework are suboptimal and that I've missed many potential circumvention methods; I look forward to the suggestions of others.
Being able to predict exactly which interpretability circumvention methods will arise depends on details that we don’t yet know (about the AI; our interpretability methods; race dynamics, etc.). But it’s still possible to extract some general lessons and themes:
Interpretability circumvention as an instance of ‘cognitosecurity’
There are many analogies between mind-reading and cybersecurity, since both involve one party that wants to acquire access to information or computing resources and the other party wants to prevent access. Interpretability circumvention, then, might reasonably be called 'cognitosecurity'. As a term, cognitosecurity would apply as much to humans being mind-read by other humans in brain scanners as it does to interpreting a capable AI.
Extending the analogy:
It's unclear which side is favored in the cognito-offense/defense balance. On one hand, interpreters get a head start since they get to interpret the AI before the AI even understands what is going on. On the other hand, interpretability is potentially hard, costly, or slower than thought and may be easy to circumvent.
Thoughts on gradient-hacking
The ability to gradient hack is a pathway to many different active interpretability circumvention methods[6]. Although I focus on gradient hacking in this section, these objections also apply to several other kinds of actions that an AI might take in order to actively circumvent our interpretability methods, including ‘the creation of adversarial examples’ or ‘an AI directly modifying its own weights’.
For gradient hacking to be possible, an AI requires a lot of capabilities. Here’s an incomplete list of requirements:
I’m not saying that it’s impossible for an AI to learn how to do. It’s definitely possible. But I think that an AI that’s capable enough to do all the above will have better ways to escape than thwarting our interpretability methods. An analogy: If you're stuck in jail and you've become capable of writing secret messages to a friend outside by encoding encrypted messages in the DNA of bacteriophages, are you really still stuck in jail? Once an AI has reached the capability to do gradient hacking, it seems likely that it no longer needs it for this purpose (though it may be useful for other purposes like protecting its inner objective from updates). I’m therefore more concerned about passive interpretability circumvention methods.
Thoughts on interpretability tools in the loss function
Using interpretability tools in the loss function incentivises a startling number of passive circumvention methods.
So why consider using it at all? There are two cases where it might be worthwhile to use:
If, instead of using interpretability tools in the loss function, we merely use it as a ‘validation set’ instead of the training set (i.e. using it as a ‘mulligan’), we might have better chances of picking up dangerous cognition before it gets out of hand so we can terminate the model and start over. We’re therefore still using interpretability in model selection, but the feedback loop is much less tight, so it’d be harder to Goodhart.
But using interpretability purely as a mulligan may be an unattractive option to leaders of an AGI project if it involves throwing out a model that has cost millions of dollars and weeks to train. If people in such a project insist on using interpretability tools in the loss function (due to profit incentives or race dynamics or whatever) in spite of warnings, there are some ways we might make it a tiny bit less dangerous:
The limits of interpretability as a basis for AI safety
As an approach to safety, interpretability has a number of weaknesses which are revealed when thinking about circumvention methods.
These weaknesses aside, I think scaling interpretability research is maybe one of our best bets at doing something useful for prosaic alignment. Assuming we get interpretability working well and at scale and complementary alignment proposals don’t immediately fail, then, if we’re extremely paranoid, then we might have a chance at avoiding the interpretability circumvention methods discussed here. But we need something more - a security mindset - if we’re to have any hope of succeeding against interpretability circumvention methods that we haven’t yet thought of.
Note: I plan for this to be a 'living document' that can serve as a reference catalog of potential or observed interpretability circumvention methods. I encourage others to comment with potential interpretability circumvention methods that I’ve omitted. At some point in future, I might incorporate them into the main article with the commenters’ permissions and with full accreditation in the article.
Due diligence: I include a discussion on whether this article is an infohazard in the endnotes[8].
Glossary
I was inspired to write this article after I saw Eliezer complain that current AI safety discourse failed to inhabit even a shallowly adversarial security mindset with regard to using transparency/interpretability to make AI safe. He said that you should "... read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the 'AI safety' field outside myself is currently giving you.". Soeren Elverlin asked reddit for recommendations that fitted Eliezer’s description and got plenty of answers. While that quote was definitely my inspiration, I didn't think reading those books would be more valuable or sufficiently relevant to AI interpretability than a few days of hard thinking about the problem (and my guess is that EY wasn’t suggesting those books would be better than directly thinking about the problem). After a few days' thinking, I found that I'd come up with all of the mind-reading circumvention methods from those books that redditors described. So this article at least includes most of the obvious circumvention methods.
Without getting bogged down in terminology, I’m talking about ‘representations’ in a functional sense: A representation is just a vector in neural statespace that does a particular thing in a particular context; it’s a representation for whatever its function is. It doesn’t necessarily need to generalize; nearby directions don’t need to perform similar functions and it might fail to perform that function in slightly different contexts.
We already know that language models can learn to use tokens in unexpected ways: Anthropic has found that their language model creates and tears down fake ‘end of text’ tokens that have little to do with the text that they’re attached to. These fake EOT tokens typically happen when the next token becomes so trivially easy to predict that the model (in theory) has spare capacity for other purposes and hypothesize that these fake EOT tokens serve the function of ‘keeping information around’. This is the kind of surprising phenomenon in which an AI might learn to encode dangerous thought.
This problem isn't necessarily because those concepts were instead encoded by directions in neural state space rather than by single neurons; directions may be polysemantic too, having one meaning in some contexts and another meaning in others.
Eliezer Yukdowsky discusses this here.
Adam Shimi discusses gradient hacking in detail here, Peter Barnett here, and Evan Hubinger here.
Due diligence: Is this article an infohazard?
"If a capable AI reads this article in future, might it not improve its ability to circumvent our interpretability methods?". That's a possibility worth exploring. Here I address whether this article is an infohazard and whether it's a good idea to release it from a safety perspective. I opted to release it on the following grounds:
I think that active circumvention methods, including those that rely on gradient hacking, are unlikely to be useful to a misaligned AI because any AI that is capable enough to carry them out probably has better means of escaping oversight.
The other kinds of circumvention methods are passive. The emergence of passive circumvention methods isn’t helped by having knowledge of them.
In general, it seems to me like an AI that couldn’t generate the ideas in this article themselves is probably unable to overcome our control measures. And those that can overcome our control measures are probably also smart enough to generate these ideas themselves.
If I’m right, then on balance discussing these risks and providing a framework for thinking about them is more likely to encourage interpretability researchers to focus on preventing (what I see as) the major source of circumvention-risk: passive interpretability circumvention methods.
But what if I’m wrong and active circumvention methods are, in fact, likely to be learned before alternative capabilities that would allow an AI to escape?
On balance, I think the expected benefits still outweighs the expected harms: One of the goals of this article is to make readers appropriately paranoid about the many ways an AI might be able to think dangerous thoughts even if we’ve got pretty good interpretability methods. Insofar as it succeeds at that, it’s more likely to help than harm. It might also make an aligned AI more paranoid about the actions of its potentially misaligned past-selves, and thus better able to help us mitigate the risks.
In addition to the above reasons, I consulted a small group of peers about releasing it. The unanimous consensus was that it wasn't an infohazard or that the potential benefits clearly outweighed the potential risks.
Here is an incomplete list of dangerous representations with some of their representational precursors that we might wish to disincentivise (in no particular order):