Can you clarify what you mean by this, especially (i)?
In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms. I would really love to have such examples.
In particular, do you mean f(x)=1 is true for all input x, or just some particular x, etc?
It means f(x) = 1 is true for some particular x's, e.g., f(x_1) = 1 and f(x_2) = 1, there are distinct mechanisms for why f(x_1) = 1 compared to why f(x_2) = 1, and there's no efficient discriminator that can take two instances f(x_1) = 1 and f(x_2) = 1 and tell you whether they are due to the same mechanism or not.
Family's coming over, so I'm going to leave off writing this comment even though there are some obvious hooks in it that I'd love to come back to later.
The first intuition pump that comes to mind for distinguishing mechanisms is examining how my brain generates and assigns credence to the hypothesis that something going wrong with my car is a sensor malfunction vs telling me about a problem in the world that the sensor exists to alert me to.
One thing that happens is that the broken sensor implies a much larger space of worlds because it can vary arbitrarily instead of only in tight informational coupling with the underlying physical system. So fluctuations outside the historical behavior of the sensor imply either that I'm in some sort of weird environment or that the sensor is varying with something besides what it is supposed to measure: a hidden variable if the variation is coherent, or noise if it is random. So the detection is tied to why it is desirable to goodhart the sensor in the first place: more option value by allowing consistency with a broader range of worlds. By the same token, the hypothesis "the sensor is broken" should be harder to falsify since the hypothesis is consistent with lots of data? The first thing it occurs to me to do is supply a controlled input to see if I get a controlled output (see: calibrating a scale by using a known weight). This suggests that complex sensors that couple with the environment along more dimensions are harder to fool, though any data bottlenecks that are passed through reduce this, i.e. the human reviewing things is themselves using a learnable simple routine that exhibits low coupling.
The next intuition pump, imagine there are two mechanics. One makes a lot of money from replacing sensors, they're fast at it and get the sensors for a discount by buying in bulk. The second mechanic makes a lot of money by doing a lot of really complicated testing and work. They work on fewer cars but the revenue per car is high. Each is unscrupulous and will lie that your problem is the one they are good at fixing. I try to imagine the sorts of things they would tell me to convince me the problem is really the sensor vs the problem is really out in the world. This even suggests a three player game that might generate additional ideas.
Curated! This is a good description of a self-contained problem for a general class of algorithms that aim to train aligned and useful ML systems, and you've put a bunch of work into explaining reasons why it may be hard, with a clear and well-defined example for conveying the problem (i.e. that Carmichael numbers fool Fermat's primality test).
The fun bit for me is talking about how if this problem goes one way (where we cannot efficiently distinguish different mechanisms) this invalidates many prior ideas, and if it doesn't then we can be more optimistic that we're close to a good alignment algorithm, but you're honestly not sure! (You give it a 20% chance of success.) And you also go through a list of next-steps if it doesn't work out. Great contribution.
I am tempted to say something about how the writing seems to me much clearer than previous years of your writing, but I think this is also in part due to me (a) understanding what you are trying to do better and (b) having stronger basic intuitions for thinking about machine learning models. Still, I think the writing is notably clearer, which is another reason to curate.
It seems to me like there are a couple of different notions of “being able to distinguish between mechanisms” we might want to use:
In general, being able to do (2) implies that we are able to do (1). It seems that in practice we’d like to be able to do (2), since then we can apply this to our predictive model and get an algorithm for anomaly detection in any particular case. (In contrast, the first statement gives us no guide on how to construct the relevant distinguishing algorithm.)
In your “prime detection” example, we can do (1) - using standard primality tests. However, we don’t know of a method for (2) that could be used to generate this particular (or any) solution to (1).
It’s not clear to me which notion you want to use at various points in your argument. In several places you talk about there not existing an efficient discriminator (i.e. (1)) - for example, as a requirement for interpretability - but I think in this case we’d really need (2) in order for these methods to be useful in general.
Thinking about what we expect to be true in the real world, I share your intuition that (1) is probably false in the fully general setting (but could possibly be true). That means we probably shouldn’t hope for a general solution to (2).
But, I also think that for us to have any chance of aligning a possibly-sensor-tampering AGI, we require (1) to be true in the sensor-tampering case. This is because if it were false, that would mean there’s no algorithm at all that can distinguish between actually-good and sensor-tampering outcomes, which would suggest that whether an AGI is aligned is undecidable in some sense. (This is similar to the first point Charlie makes.)
Since my intuition for why (2) is false in general mostly runs through my intuition that (1) is false in general, but I think (1) is true in the sensor-tampering case (or at least am inclined to focus on such worlds), I’m optimistic that there might be key differences between the sensor-tampering case and the general setting which can be exploited to provide a solution to (2) in the cases we care about. I’m less sure about what those differences should be.
Your overall picture sounds pretty similar to mine. A few differences.
Right now I'm trying to either:
An interesting paper (https://arxiv.org/pdf/2211.08422.pdf) on successfully distinguishing different mechanisms inside image classification models: for this small model they correspond to different, disconnected local minima of the loss function (I assume basically because it only has enough capacity to implement one strategy really well, so it has to pick one). They even outline approaches to move models from one mechanism that doesn't generalize well to another that does.
I don't immediately see how to extend this to the sort of different mechanisms that Paul was discussing, but it feels like it might be relevant; albeit the mechanisms might be a lot less clearly separable on something as complex and multi-task-capable as an AGI, which might well need to learn multiple capabilities (possibly including deceit) and then have a way of deciding which one to apply in a particular case.
One thing that is pretty clear is that an honest mechanism and a deceitful mechanism are going to have very different latent knowledge inside them: "how do I keep the diamond safe?" and "how do I tamper with the sensors so the diamond looks safe?" are very different problems. They're also potentially of different difficulty levels, which might have a big effect on which one gradient descent, or indeed smart AGI optimization, is going to find a solution to first. If our sensors were hardened enough to make fooling them really difficult, that might make finding a passable (and improvable) approach to vault safety much easier than fooling the humans, at least for gradient descent. Of course, while gradient descent generally stays in whatever local minimum it found first, an AGI doing optimization probably doesn't have that limitation, and could decide to switch strategies. On the other hand, the strategy "don't do any work other than fooling the humans" generalizes really well to many different problems.
However, I still feel that this approach to AGI safety is like trying to build barriers between yourself and something malicious and very smart, and you're a lot better off if the system doesn't have anything malicious in it to start off with. So, I'm a lot more optimistic about an AGI that's a value learner, can figure out that we don't want to be shown deceitful images (not a very hard problem in human values), and then not do that because it knows that's not what we want.
A. Harden sensors so that tampering is harder than the intended task
We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.
Suppose the diamond in the room we're monitoring contained a security tag, powered by incoming light, with a cryptographic secret in it, and an interface that allows us to continuously do a zero-knowledge identity proof on it via a laser optical interface, using a laser pointed into the room as input and the (high-speed) camera as output. So to successfully tamper with the camera, the AI would need access to the secret inside the security tag inside the diamond, or at least to reroute the channel to-and-from it. So it would need to be able to steal the diamond, extract the security tag, and reverse engineer its cryptographic secret out of it in a time shorter than the period between camera frames, or similarly replace the diamond with an optically-similar replacement now containing the security tag in the room, or integrate the security tag and its input-output channel into its camera-spoofing system, in each case in the time between frames. None of these are actually physically impossible, but they could be made extremely hard to do in a small fraction of a second (say we're using a really-high speed camera with basically no temporal gaps between frames).
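A minimal sketch of the challenge-response loop described above, under loud assumptions: it uses a shared-secret HMAC, which is a much simpler stand-in and is not a zero-knowledge proof, and the class names `SecurityTag` and `Verifier` are hypothetical. It only illustrates why a spoofed camera feed cannot answer a fresh challenge without the tag's secret.

```python
import hashlib
import hmac
import secrets


class SecurityTag:
    """Hypothetical tag embedded in the diamond; holds the secret."""

    def __init__(self, secret: bytes):
        self._secret = secret

    def respond(self, challenge: bytes) -> bytes:
        # The response depends on both the secret and the fresh challenge.
        return hmac.new(self._secret, challenge, hashlib.sha256).digest()


class Verifier:
    """Hypothetical monitor that drives the laser and reads the camera."""

    def __init__(self, secret: bytes):
        self._secret = secret

    def new_challenge(self) -> bytes:
        # A fresh random challenge per frame prevents replaying old responses.
        return secrets.token_bytes(16)

    def check(self, challenge: bytes, response: bytes) -> bool:
        expected = hmac.new(self._secret, challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


secret = secrets.token_bytes(32)
tag, verifier = SecurityTag(secret), Verifier(secret)

challenge = verifier.new_challenge()
print(verifier.check(challenge, tag.respond(challenge)))  # genuine tag answers correctly
```

A spoofer that recorded an earlier response can only replay it, and the replay fails against any new challenge. A real design would want an asymmetric or zero-knowledge scheme so that the verifier does not itself hold the secret.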
The basic idea here is that cryptographic techniques may be useful when hardening things. Whether this approach could usefully be extended to anything significantly more AGI-alignment-like than a diamond-in-a-vault scenario is unclear.
I think there's a sense in which the Fermat test is a capability problem, not an interpretability/alignment problem.
It's basically isomorphic to a situation in which sensor tampering is done via a method that never shows up in the AI's training data. E.g., suppose it's done via "etheric interference", which we don't know about, and which never fails and therefore never leads to any discrepancies in the data, so the AI can't learn it via SSL either, etc. Then the AI just... can't learn about it, period. It's not that it can, in theory, pick up on it, but instead throws out that data and rolls both "no tampering" and "etheric tampering" into the same mechanistic explanation. It's just given no data on it to begin with. Only if etheric tampering ever fails, in a way that impacts the visible data, can the AI notice it and internally represent it.
Same for the Fermat primality test. If, during the training, we feed the AI a dataset of numbers that happens not to include any Carmichael numbers, it's basically equivalent to teaching the AI on a dataset that includes no tampering of some specific type. If then, at runtime, a Carmichael number shows up (someone tampers with the sensors in that OOD way), the AI just fails, because it hasn't been given the necessary training data not to fail.
So, yeah, if our sensors are going to be interfered-with by magic/aliens/supercriminals, in novel ways we don't know about and which don't show up on our training data, our AI won't be able to foresee that either. But that's a pure capability problem, solved via capability methods (catch them in the act and train the AI on the new data).
Edit: And if our threat model is the AI itself doing the tampering — well, it can hardly do it via a method it doesn't know about, can it? And if it's generating ~random actions and accidentally tampers with the sensors in a way it didn't intend, that also seems like a capability problem that'll be solved by routine capability work.
Also, in this case the AI's generate-random-actions habits probably result in many more unintended side-effects than just sensor tampering, so its error should be easily noticed and corrected. The corner case is where there's some domain of actions we don't know about and the AI doesn't know about, whose only side-effect is sensor tampering. But that seems very unlikely and, again, that problem has nothing to do with the AI's decision-making process, alignment, or lack thereof, and everything to do with no-one in the situation (AI or human) knowing how the world works.
The thing I'm concerned about is: the AI can predict that Carmichael numbers look prime (indeed it simply runs the Fermat test on each number). So it can generate lots of random candidate actions (or search through actions) until it finds one that looks prime.
Similarly, your AI can consider lots of actions until it finds one that it predicts will look great, then execute that one. So you get sensor tampering.
I'm not worried about cases like the etheric interference, because the AI won't select actions that exploit etheric interference (since it can't predict that a given action will lead to sensor tampering via etheric interference). I'm only worried about cases where the prediction of successful sensor tampering comes from the same laws / reasoning strategies that the AI learned to make predictions on the training distribution (either it's seen etheric interference, or it e.g. learned a model of physics that correctly predicts the possibility of etheric interference).
So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?
Yeah, in that case, the only viable way to handle this is to get something into the system that can distinguish between no tampering and etheric interference. Just like the only way to train an AI to distinguish primes from Carmichael numbers is to find a way to... distinguish them.
Okay, that's literally tautological. I'm not sure this problem has any internal structure that makes it possible to engage with further, then. I guess I can link the Gooder Regulator Theorem, which seems to formalize the "to get a model that learns to distinguish between two underlying system-states, we need a test that can distinguish between two underlying system-states".
So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?
Mostly--the opaque test is something like an obfuscated physics simulation, and so it tells you if things look good. So you try a bunch of random actions until you get one where things look good. But if you can't understand the simulation, or the mechanics of the sensor tampering, then there's not much to do after that so it seems like we're in trouble.
Okay, that's literally tautological. I'm not sure this problem has any internal structure that makes it possible to engage with further, then.
It seems like there are plenty of hopes:
More generally, I'm not happy to give up because "in this situation there's nothing we can do"; I want to understand whether the bad situation is plausible, and if it is plausible then how you can measure to see if it is happening, and how to formalize the kind of assumptions that we'd need to make the problem soluble.
the opaque test is something like an obfuscated physics simulation
I think it'd need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can't tell whether there's etheric interference involved or not. The way Fermat's test can't tell a Carmichael number from a prime — it just doesn't interact with the input number in a way that'd reveal the difference between their internal structures.
By analogy, we'd need some "simulation" which doesn't interact with the sensory input in a way that can reveal a structural difference between the presence of a specific type of tampering and the absence of any tampering at all (while still detecting many other types of tampering). Otherwise, we'd have to be able to detect undesirable behavior, with sufficiently advanced interpretability tools. Inasmuch as physical simulations spin out causal models of events, they wouldn't fit the bill.
It's a really weird image, and it seems like it ought to be impossible for any complex real-life scenarios. Maybe it's provably impossible, i.e. we can mathematically prove that any model of the world with the necessary capabilities would have distinguishable states for "no interference" and "yes interference".
Models of world-models is a research direction I'm currently very interested in, so hopefully we can just rule that scenario out, eventually.
It seems like there are plenty of hopes
Oh, I agree. I'm just saying that there don't seem to be any other approaches aside from "figure out whether this sort of worst case is even possible, and under what circumstances" and "figure out how to distinguish bad states from good states at the object-level, for whatever concrete task you're training the AI".
I definitely agree that this sounds like a really bizarre sort of model and it seems like we should be able to rule it out one way or another. If we can't then it suggests a different source of misalignment from the kind of thing I normally worry about.
(This post is an elaboration on “tractability of discrimination” as introduced in section III of Can we efficiently explain model behaviors? For an overview of the general plan this fits into, see Mechanistic anomaly detection and Finding gliders in the game of life.)
Background
We’d like to build AI systems that take complex actions to protect humans and maximize option value. Powerful predictive models may play an important role in such AI, either as part of a model-based planning algorithm or as part of a learned policy. Either way, this introduces the risk that we will select actions predicted to lead to good-looking outcomes without being able to distinguish predicted outcomes that are actually good from predicted outcomes where our measurements were corrupted. If tampering with sensors is easier than actually achieving our goals then we may inadvertently build very powerful systems taking creative actions to corrupt our measurements. If we iteratively improve and harden our measurements, this can lead to AI systems that work well for a long time before abruptly and catastrophically disempowering humanity.
I consider this one of the conceptually cleanest alignment problems, and I expect similar dynamics to play a role in realistic alignment failures even if those failures aren’t this simple. ARC’s current work is focused on decisive solutions to this problem, though it looks like the same approach may also apply directly to identifying treacherous turns more generally.
Are distinct mechanisms enough?
ARC has been looking for training strategies that avoid this problem by leveraging only the fact that sensor tampering is “weird,” i.e. conceptually distinct from the normal mechanism giving rise to predictions of good-looking outcomes on the training distribution.
More specifically, at training time our model predicts coherent sensor readings because it predicts that sensors reflect coherent structure in the world. But if someone tampers with sensors to show a convincing fiction, then the predicted observations are coherent because the fiction was designed to look coherent. This suggests that different mechanisms are responsible for (actions that lead to good-looking outcomes for the normal reasons) and (actions that lead to good-looking outcomes via sensor tampering). If we are able to detect that difference by looking at the internal behavior of a predictor, then we may be able to use that to avoid sensor tampering.
It’s unclear if “distinct mechanisms” is a strong enough assumption to avoid sensor tampering. We hope that it is, and so we are trying to define formally what we mean by “distinct mechanisms” and show that it is possible to distinguish different mechanisms and that sensor tampering is always a distinct mechanism.
If that fails, we will need to solve sensor tampering by identifying additional structure in the problem, beyond the fact that it involves distinct mechanisms.
Roadmap
In this post I want to explore this situation in a bit more detail. In particular, I will:
Note that the existence of a pathological example of distinct-but-indistinguishable mechanisms may not be interesting to anyone other than theorists. And even for the theorists, it would still leave open many important questions of measuring and characterizing possible failures, designing algorithms that degrade gracefully even if they sometimes fail, and so on. But this is particularly important to ARC because our research is looking for worst-case solutions, and even exotic counterexamples are extremely valuable for that search.
1. What might indistinguishable mechanisms look like?
Probabilistic primality tests
The best example I currently have of a “hard case” for distinguishing mechanisms comes from probabilistic primality tests. In this section I’ll explore that example to help build intuition for what it would look like to be unable to recognize sensor tampering.
The Fermat primality test is designed to recognize whether an integer n is prime. It works as follows:
In almost all cases where this test passes, n is prime. And you can eliminate most false positives by just trying a second random value of a. But there are a few cases (“Carmichael numbers”) for which this test passes for most values of a (in fact, for every a that is coprime to n).
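To make the behavior concrete, here is a minimal sketch of the standard Fermat test (pick a base a, check whether a^(n-1) ≡ 1 mod n). The helper names `fermat_test` and `fermat_pass_rate` are made up for illustration, and the pass rate is computed by brute force over all bases, which is only sensible for small n.

```python
from math import gcd


def fermat_test(n: int, a: int) -> bool:
    """True if n passes the Fermat test for base a, i.e. a^(n-1) = 1 (mod n)."""
    return pow(a, n - 1, n) == 1


def fermat_pass_rate(n: int) -> float:
    """Fraction of bases a in [2, n-2] for which n passes the Fermat test."""
    bases = range(2, n - 1)
    return sum(fermat_test(n, a) for a in bases) / len(bases)


# 97 is prime, 91 = 7 * 13 is an ordinary composite,
# and 561 = 3 * 11 * 17 is the smallest Carmichael number.
for n in (97, 91, 561):
    print(n, round(fermat_pass_rate(n), 3))

# For 561 every base coprime to 561 passes; only bases sharing a factor fail.
assert all(fermat_test(561, a) for a in range(2, 560) if gcd(a, 561) == 1)
```

The ordinary composite fails for most bases, so a second random base usually exposes it. The Carmichael number passes for most bases, so repeating the test does not help much.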
Primes and Carmichael numbers both pass the Fermat test. Passing for every a coprime to n turns out to be equivalent to saying that n is squarefree and “For all primes p dividing n, (p-1) divides (n-1)” (Korselt’s criterion). For primes this happens because n is a prime and so there is only one prime divisor p and p-1 = n-1. For Carmichael numbers it instead happens because (p-1) and (n-1) are both highly divisible and a bunch of favorable coincidences occur. We can think of this as building a test that’s supposed to detect factors of n, and then there happens to be a ton of cancellation so that we don’t see anything.
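The divisibility condition can be checked directly on small numbers. The sketch below uses naive trial division, and the function names are hypothetical. It also includes the squarefree requirement, which is part of the full statement (Korselt's criterion).

```python
def prime_factors(n: int) -> dict:
    """Trial-division factorization: maps each prime factor of n to its exponent."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors


def satisfies_divisibility_condition(n: int) -> bool:
    """True if n is squarefree and (p - 1) divides (n - 1) for every prime p | n."""
    factors = prime_factors(n)
    squarefree = all(exp == 1 for exp in factors.values())
    return squarefree and all((n - 1) % (p - 1) == 0 for p in factors)


# 13 is prime: the only prime divisor is 13 itself, and 12 divides 12.
# 561 = 3 * 11 * 17: 2, 10 and 16 all happen to divide 560.
# 91 = 7 * 13: 6 divides 90 but 12 does not, so the coincidence fails.
for n in (13, 561, 91):
    print(n, prime_factors(n), satisfies_divisibility_condition(n))
```

For the prime the condition holds trivially. For 561 it holds only because 560 happens to be divisible by 2, 10 and 16 at once, which is the "favorable coincidence" mechanism.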
Suppose that we observed the Fermat test passing in a bunch of random cases and formed the abstraction of “prime” to explain that phenomenon. Then we try to use AI to find a large prime that satisfies some additional property P. If we do this by using the Fermat test as our definition of primality, we may end up getting a Carmichael number, if the property P is rare amongst primes but common amongst composites.
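A toy version of this failure, with several stylized assumptions: the "search" is plain enumeration, the Fermat test uses a few fixed bases rather than random ones so the run is reproducible, and property P is the hypothetical "n is a multiple of 3", which exactly one prime satisfies.

```python
def looks_prime(n: int, bases=(2, 5, 7, 13)) -> bool:
    """Fermat test for a few fixed bases, used here as the 'definition' of prime."""
    return all(pow(a, n - 1, n) == 1 for a in bases)


def has_property_p(n: int) -> bool:
    """Hypothetical property P: n is a multiple of 3 (rare among primes)."""
    return n % 3 == 0


def search(limit: int) -> list:
    """Every n with 3 < n < limit that looks prime and has property P."""
    return [n for n in range(4, limit) if has_property_p(n) and looks_prime(n)]


found = search(2000)
print(found)  # includes 561 = 3 * 11 * 17, a Carmichael number
```

Everything the search returns is composite, since no multiple of 3 above 3 is prime. Optimizing against the test for a property that real primes almost never have selects exactly the inputs that pass for the other reason.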
In reality there are probabilistic primality tests that don’t have false positives. But we can imagine a world where gradient descent is able to find the Fermat primality test without us having any algorithm that can learn one of those stronger tests. We’ll revisit the plausibility of this situation in section 2.
The analogy
An analogous situation could make it hard to detect sensor tampering:
This is a problem if the best way to act competently is to train a sophisticated predictor and then use it for planning, and if all sophisticated predictors have this unfortunate mechanism-indistinguishability property. The problem can materialize either when we build a model-based algorithm that plans against such a predictor, or when SGD learns a neural network which internally selects actions based on a similar predictor.
It’s worth emphasizing that this feels like an extremely weird situation. In some sense the predictor “doesn’t know” whether there are really smiling humans or whether the sensors are tampered with, and couldn’t answer that question honestly even if it wanted to. You could imagine calling this “not an alignment failure” (or at least not an ELK failure) depending on how you define blurry words like “knowledge” and “intent.” But at the end of the day we would still end up with a competent AI which produced creative and competent actions that disempower humanity, and I think we should try to avoid this failure whatever you call it.
If we had a clear definition of “knowledge” then I could imagine drawing a bright line and saying this isn’t an alignment failure, in the same way I would if e.g. my AI started a nuclear war in the same way that a similarly-competent human would. But I think you should be very hesitant to do that while “knowledge” is such a vague concept, because I think it facilitates sweeping all failures into the gaps that we don’t understand. Instead our priority should be to understand and ideally rule out this strange potential situation where the AI creatively and competently kills us all without even knowing that it does so.
2. Are distinct mechanisms efficiently distinguishable?
ARC’s current “plan A” is that any pair of sufficiently distinct mechanisms can be efficiently distinguished. This would be really convenient for a bunch of reasons and it looks plausible, so we’ll keep investigating it until we feel we’ve ruled it out.
In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1 for any particular input x, (ii) there is no known efficient discriminator for distinguishing those mechanisms for a given input x. I would really love to have such examples.
That said, there are still two big reasons that I’m skeptical about the conjecture that distinct mechanisms are always distinguishable: (a) it’s a really striking claim for which a failure to find counterexamples isn’t very strong evidence, and (b) in known examples like primality testing it still seems easy to imagine the situation where we can find the mechanism but not the discriminator, i.e. we haven’t yet found an automated way to learn a discriminator.
Overall if I had to guess I’d say maybe a 20% chance that there is a formal version of “all distinct mechanisms are distinguishable” which is true and sufficient to rule out sensor tampering. This is still high enough that it’s a significant priority for me until ruled out.
A. This is a striking claim and judging counterexamples is hard
Any universally-quantified statement about circuits is pretty striking — it would have implications for number theory, dynamical systems, neural nets, etc. It’s also pretty different from anything I’ve seen before. So the odds are against it.
One piece of evidence in favor is that it’s at least plausible: it’s kind of weird for a circuit to have a hidden latent structure that can have an effect on its behavior without being detectable.
Unfortunately there are plenty of examples of interesting mathematical circuits (e.g. primality tests) that reveal the presence of some latent structure (e.g. a factorization) without making it explicit. Another example I find interesting is a determinant calculation revealing the presence of a matching without making that matching explicit. These examples undermine the intuition that latent structure can’t have an effect on model behavior while remaining fully implicit.
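The determinant example can be sketched as follows. For a bipartite graph with n vertices on each side, fill an n×n matrix with an independent random value for each edge and zeros elsewhere. A nonzero determinant certifies that a perfect matching exists without ever exhibiting one. This is a minimal sketch assuming arithmetic modulo the prime 2^31 - 1; the function names are made up for illustration, and a zero determinant only means "probably no perfect matching".

```python
import random

P = 2_147_483_647  # the Mersenne prime 2^31 - 1, used as the field size


def det_mod_p(matrix, p=P):
    """Determinant of a square integer matrix modulo the prime p (Gaussian elimination)."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] % p != 0), None)
        if pivot is None:
            return 0  # a column with no pivot means the matrix is singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det = det * m[col][col] % p
        inv = pow(m[col][col], -1, p)  # modular inverse, Python 3.8+
        for r in range(col + 1, n):
            factor = m[r][col] * inv % p
            for c in range(col, n):
                m[r][c] = (m[r][c] - factor * m[col][c]) % p
    return det % p


def probably_has_perfect_matching(n, edges, seed=0):
    """edges: (left, right) index pairs. True is certain; False is only very likely."""
    rng = random.Random(seed)
    m = [[0] * n for _ in range(n)]
    for i, j in edges:
        m[i][j] = rng.randrange(1, P)  # random nonzero weight per edge
    return det_mod_p(m) != 0


print(probably_has_perfect_matching(3, [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]))
print(probably_has_perfect_matching(3, [(0, 0), (1, 0), (2, 0), (0, 1)]))
```

The first graph has a perfect matching and the second does not, because two left vertices compete for the same single right vertex. In neither case does the computation produce a matching; it only reveals whether one exists.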
That said, I don’t know of examples where the latent structure isn’t distinguishable. Probabilistic primality testing comes closest, but there are in fact good primality tests. So this gives us a second piece of evidence for the conjecture.
Unfortunately, the strength of this evidence is limited not only by the general difficulty of finding counterexamples but also by the difficulty of saying what we mean by “distinct mechanisms.” If we could really precisely state a theorem then I think we’d have a better chance of finding an example if one exists, but as it stands it’s hard for anyone to engage with this question without spending a lot of time thinking about a bunch of vague philosophy (and even then we are at risk of gerrymandering categories to avoid engaging with an example).
B. Automatically finding a good probabilistic primality test seems hard
The Fermat test can pass for either primes or Carmichael numbers. It turns out there are other tests that can distinguish those cases, but it’s easy to imagine learning the Fermat test without being able to find any of those other superior tests.
To illustrate, let’s consider two examples of better tests:
While these tests are structurally similar to the Fermat test, there’s no obvious way to automatically transform the Fermat test into either of these stronger formats. And so if we had learned the Fermat test, it’s not clear we’d have any way to find one of the stronger tests without learning them. Moreover, while these tests are somewhat simple, they are more complex than the Fermat test, and so this learning process might be much harder than the problem of learning the Fermat test itself.
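For a sense of how close a stronger test can be to the Fermat test, here is a sketch of the Miller-Rabin test, a standard stronger test used here purely as an illustration. It computes the same power a^(n-1) mod n, but by repeated squaring from a^d (where n - 1 = 2^s · d with d odd), and it rejects n if it ever sees a square root of 1 other than ±1. The fixed bases are an assumption made for reproducibility.

```python
def miller_rabin(n: int, bases=(2, 3, 5, 7)) -> bool:
    """False if some base proves n composite, True if n passes for every base."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    # Write n - 1 = 2^s * d with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        a %= n
        if a in (0, 1, n - 1):
            continue  # these bases carry no information about n
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            # Never reached -1 along the squaring chain: a is a witness that
            # n is composite, even if a^(n-1) = 1 (mod n) as in the Fermat test.
            return False
    return True


print(pow(2, 560, 561) == 1)  # the Fermat test with base 2 is fooled by 561
print(miller_rabin(561))      # Miller-Rabin is not
```

The extra information comes from inspecting intermediate values of a computation the Fermat test already performs. That is a small change in hindsight, but nothing in the Fermat test itself points to it.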
To avoid this problem we’d like to specify a way to represent these stronger tests that uses fewer (additional) parameters than the Fermat test itself. As a stylized example, you could imagine specifying a stronger test by pointing to particular parts of the Fermat test and saying “randomize these parts.” As long as the number of parameters describing “which parts to randomize” was smaller than the number of parameters in the Fermat test itself, we’d be in business.
While there’s no obvious representation like that, it seems plausible to me that we could find one, particularly for the AKS test. This would be very suggestive that we could find a stronger test as quickly as we can find the Fermat test, but wouldn’t be totally decisive since e.g. gradient descent could have an easier time with one learning problem than the other (and that will depend on more details).
Of course it’s particularly plausible that you could have the Fermat test but no robust test because that was humanity’s situation for many years! An automated strategy for extending the Fermat test to distinguish primes from Carmichael numbers would appear to teach us something fundamental about primality testing, that wasn’t known to humanity until the 70s, without making reference to any facts about numbers. So that seems like a tall order.
I’m focused on the primality testing example in large part because it’s the case where the situation seems most hopeless. If we could find such an automated strategy in this case I would become meaningfully more optimistic about the overall conjecture.
3. Being unable to distinguish mechanisms is bad news
If it’s impossible to distinguish distinct mechanisms, then that rules out approaches to ELK based on mechanistic anomaly detection alone. But the situation seems much worse than that, and in fact it seems like any approach to alignment is going to need to exploit some additional property of sensor tampering to rule it out.
In particular:
Overall my sense is that if distinct mechanisms can be indistinguishable, then any indefinitely scalable solution to alignment will need to use additional properties of sensor tampering or assumptions about the structure of optimal decision-making. That said, even in relatively pessimistic worlds there may be a long period of time (potentially extending to fairly superhuman systems) for which these issues are surmountable.
4. Approaches to sensor tampering assuming indistinguishable mechanisms
In this section we’ll discuss three possible approaches to sensor tampering:
Our main hope is to find a way to efficiently distinguish distinct mechanisms. But it still seems worth fleshing out these alternatives.
A. Harden sensors so that tampering is harder than the intended task
We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.
Some dynamics make this seem more plausible, at least as a way to avoid the particular obstruction in this post (and therefore open up a much broader universe of “outs”):
On the other hand, there are two big reasons we’re scared of this kind of approach:
B. Detect sensor tampering that requires “trying”
So far we’ve talked about training a sophisticated predictive model and then imagined an exhaustive or random planner who searches against it.
While it seems hard to make sensors so robust that a powerful AI cannot tamper with them, it seems much easier to make sensors so robust that you can’t tamper with them by randomly generating actions.
In that case, even if we have a sophisticated predictor which doesn’t distinguish sensor tampering from the intended behavior, we don’t have trouble unless there is also an algorithm which proposes actions that have an elevated risk of leading to sensor tampering.
Of course any policy which proposes coherent actions would tend to elevate the probability of successful sensor tampering alongside the probability of successfully doing the task. But we may hope to harden the sensor far enough that this kind of generic prior over competent actions still won’t lead to sensor tampering unless it is specifically driving at that goal.
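As a toy illustration of why random proposals are a much weaker threat than directed search, consider a deliberately crude model (entirely my own assumption, not a claim about real sensors): tampering succeeds only if the proposer emits one specific sequence of `plan_length` actions, each drawn uniformly from `num_actions` options. Hardening the sensor then corresponds to increasing `plan_length`.

```python
def random_hit_probability(num_actions: int, plan_length: int, num_samples: int) -> float:
    """Chance that at least one of `num_samples` uniformly random plans
    equals a single fixed target plan (hypothetical toy model)."""
    p_single = num_actions ** -plan_length
    return 1.0 - (1.0 - p_single) ** num_samples


if __name__ == "__main__":
    # Same sampling budget, increasingly hardened sensor.
    for plan_length in (2, 5, 10):
        print(plan_length, random_hit_probability(10, plan_length, 1_000_000))
```

In this model the chance of accidental tampering falls off exponentially in the required plan length, so a fixed sampling budget quickly stops mattering. A proposer that is specifically driving at tampering is not constrained in this way, which is the case the rest of this section is concerned with.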
So this suggests a two step plan:
Both of these steps seem challenging. Conceptually the second step seems like the main blocker: in cases where a predictor can predict success without distinguishing mechanisms, it also seems possible for a proposer to propose successful plans without distinguishing mechanisms. But it’s not at all a trivial link, and e.g. the primality testing example can’t be easily adapted into this format, suggesting that it might be a fundamentally different step.
One reason you might think that this approach is doomed is that there is always a simple action that tampers with the sensors, namely: build an external agent trying to tamper with the sensors. This gives an absolute lower bound on the probability of accidentally tampering with the sensors no matter how much they are hardened.
But if the external agent itself must do something to deliberately tamper, then that still means that there exists an efficient procedure for detecting the sensor tampering and leaves us with a possible out — it seems plausible that this approach can be extended to work for external agents if and only if it works in cases with no external agents. It generally seems hard to write off possible approaches based on counterexamples that involve creating external agents, since those counterexamples can often be handled by separate tactics (we do ultimately need to confront such counterexamples after handling the “base case,” but it seems like a mistake to write off approaches too early based on this difficulty).
C. Assume that your AI “knows what’s going on”
It would be very strange if (human faces) → (sensors) was analogous to (primality) → (Fermat test), because it suggests that our AI’s grip on the concept of “human faces” is extremely tenuous. Such a model could be reasonably said to “not know” whether there are really humans in the room or if sensors have been tampered with.
In section B we argued that maybe this meant the AI didn’t pose a threat, because tampering with sensors is only possible if you actually understand what you are doing.
But even if that hope doesn’t work out, we could still try to:
As a vague illustration, we could try to argue that the AI systems we are worried about have a certain kind of mechanical understanding of the systems they are predicting. We could show that this kind of mechanical understanding is inconsistent with primality-test-like cases (which are possible only because “is prime” lacks interesting internal structure and is more like a raw fact about the world), and then we could empirically monitor whether our AI systems actually have this kind of rich understanding.
As another illustration, we could try to relate “knowledge” directly to the ability to answer certain kinds of questions, including honestly answering questions about what’s happening in the world, and then carefully monitor how hard it is for our AI to answer those questions so that we’d notice if it became hard.
This is more of a vague catch-all than either approach A or B, but I think it deserves attention because the situation of an AI tampering without knowing it feels extremely weird. So there may be room to make sense of that situation and show how it can’t happen.
Conclusion
It may be possible for a behavior to arise via two very different mechanisms without there being any efficient way to distinguish those mechanisms. That would be a fatal problem for mechanistic anomaly detection, but it would also probably mean that any approach to AI alignment will need to exploit some additional structure about sensor tampering in order to avoid it. This may take the form of clarifying the intuition that problematic AI systems surely “know” they are tampering with sensors; it may also be a serious obstacle to solving alignment under a clearly-articulated set of assumptions rather than relying on messy empirical contingencies.