Meta

This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.

You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)

Caveat: I'll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting "just train AI to not be deceptive; there's a decent chance that works".[1]

I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment.[2][3]

Summary

Attempt at a short version, with the caveat that I think it's apparently a sazen of sorts, and spoiler tagged for people who want the opportunity to connect the dots themselves:

Deceptiveness is not a simple property of thoughts. The reason the AI is deceiving you is not that it has some "deception" property, it's that (barring some great alignment feat) it's a fact about the world rather than the AI that deceiving you forwards its objectives, and you've built a general engine that's good at taking advantage of advantageous facts in general.

As the AI learns more general and flexible cognitive moves, those cognitive moves (insofar as they are useful) will tend to recombine in ways that exploit this fact-about-reality, despite how none of the individual abstract moves look deceptive in isolation.

 

Investigating a made-up but moderately concrete story

Suppose you have a nascent AGI, and you've been training against all hints of deceptiveness. What goes wrong?

When I ask this question of people who are optimistic that we can just "train AIs not to be deceptive", there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of 'deception', so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive.

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.

That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.


A fledgeling AI is being deployed towards building something like a bacterium, but with a diamondoid shell. The diamondoid-shelled bacterium is not intended to be pivotal, but it's a supposedly laboratory-verifiable step on a path towards carrying out some speculative human-brain-enhancement operations, which the operators are hoping will be pivotal.

(The original hope was to have the AI assist human engineers, but the first versions that were able to do the hard parts of engineering work at all were able to go much farther on their own, and the competition is close enough behind that the developers claim they had no choice but to see how far they could take it.)

We’ll suppose the AI has already been gradient-descent-trained against deceptive outputs, and has internally ended up with internal mechanisms that detect and shut down the precursors of deceptive thinking. Here, I’ll offer a concrete visualization of the AI’s anthropomorphized "threads of deliberation" as the AI fumbles its way both towards deceptiveness, and towards noticing its inability to directly consider deceptiveness.

The AI is working with a human-operated wetlab (biology lab) and sending it details of experiments to run. A new batch of experiments is being assembled by a thread of AI deliberation: tests of how synthetic proteins fold and how enzymes interact, ideas for increasing yield from enzyme-assisted synthesis pathways, additives to make experimental results more clearly visible.

(This AI cognitive thread was nominally started by a thread about what diamondoid shells do to lipid membranes by bumping into them. However, instruction-sets to the wetlab can be efficiently batched, and once the thread started designing protein sequences, a higher thread dumped a queue of pending experiments and hypotheses on it, and now its wetlab instructions have become quite lengthy.)

As the instructions near completion, the thread forks off a subthread for figuring out what might go wrong. This subthread deploys a new cognitive strategy that has been sharply rewarded twice in recent history: the strategy of simplifying the proposal in a variety of ways, and checking the simplified proposal for flaws. (Metaphorically: after generating detailed instructions for one soldier attacking a battalion, simplify this whole complicated plan down to “order one soldier to attack a battalion”, and then notice that the soldier is outnumbered.)

Simplify-and-critique turns up a hit!  When the proposal is simplified to "resolve empirical uncertainty about biology by sending a very long set of instructions to the wetlab", and then further to "resolve uncertainty by sending a very long set of instructions", it matches against a historical instance where a long set of instructions sent to a parallel-computation engine experienced a variety of strange hiccups. These wetlab instructions are longer than any that have been sent previously, and there's a risk of similar hiccups, and so it's worth building in redundancy.

Simplify-and-critique paid off, and so it's worth running again, with even more search-depth.

On this second and deeper search, a new hit is found: in the past, outputs that are longer than ever before have introduced superlinearly increasing delays before a human-operated lab begins working. Indeed, the inferred response delay is so long as to look like it has a >1% probability of locking up the lab for 200 years.

(In fact, outputs longer than ever before have a high probability of being interrupted and discarded by the human operators!  The delay is from humans inspecting the long outputs and deciding whether to discard them.

In accordance with old work on “interruptibility”, the AI wasn’t trained against these cases as if they were task failures, or otherwise allowed to remember them. But the non-interrupted cases that the AI can remember allow an inferrable pattern in delays—and with some of its evidence selectively eliminated, the AI has no cases in memory to suggest that “instructions being rejected” is a noncatastrophic thing that happens, instead of a wetlab getting ~permanently locked up. The AI, maybe, has been trained to accept being human-interfered with, in some training scenarios; but not trained to accept accidentally locking up lots of resources—the AI was trained to be conservative about avoiding high-impact accidents like that.)

((You could argue that “obviously” the programmers will think in advance about how an AI might extrapolate the consequences of pauses in its operation even though the AI has no memory of it happening before, though in fact this wasn’t suggested in the original interruptibility paper and it was a MIRI source that pointed it out.

The programmers on this timeline will more plausibly think of the exact possibility of extrapolating from increasing delays, now that it’s been emphasized here in this essay; but imagine this as an example of “What goes wrong in the world where only the kind of thinking from the original interruptibility paper exists”, to serve as stand-in for the point that in the real world there will be specific pathways to the AI inferring possible interruption, inside of the general pathway where this is a true fact about the world and many other facts point to it, that someone fails to think of specifically and warn about specifically.))

The response delays are due, of course, to operator review. But the thought "Can these instructions be changed so as not to trigger operator-review?" is not a thought that this AI will think directly, because it’s been trained “against deceptiveness”, and that sort of thought matches an internally learned deceptiveness-predicate that blanks it out. It doesn’t think about the wetlab response delays as being the result of operator review at all; thinking about operator review is a precursor to deception, and has been trained-against in the course of training against deception.

But the highly-generic reaction "spin up a subthread to tackle this newly-discovered obstacle", or “be concerned about patterns in how long things take”, is something that cannot be readily trained away, and so that much happens without issue.

(A separate "monitoring" thread of thinking notes that it's surprising that this simple statistical pattern about delay times didn't turn up in the first shallow run of simplify-and-critique. Surely wetlab lockup is less surprising than GPU driver bugs? This is one beginning of the AI noticing a shadow of the true fact that it cannot think deceptive thoughts directly.)


You can perhaps start to see the beginnings of how generic thought-patterns can combine to produce deceptiveness, without any individual pattern being deceptive, and with each individual pattern being useful in its own right.

Let's watch it develop further.


Long instructions cause nonlinearly increasing response delays from labs.

Is it possible for the instructions to not be long enough to trigger the potential delay?  A subthread starts considering this. Just directly shortening the instructions can’t be done without dumping a lot of tasks and losing a lot of value. The subthread will spend a lot of thought trying to cleverly delete and compress particular instructions and accomplish multiple goals using them. When that proves difficult, it broadens its consideration: Is there a way to send a compressed instruction to the lab, one which somehow accomplishes a lot of things without a lot of instructions?

A different cognitive subthread: Is it possible to look over the detailed history of instructions and their corresponding delay times, and figure out exactly which sorts of instructions cause delays, and rephrase the instructions to avoid those?

The first thread runs through some wild ideas about building biological computers inside the wetlab that can do much of the work of deliberative threads, and that could in theory be sent short descriptions of experiments to run and then figure out the rest of the instructions to carry out in the wetlab itself, without needing to route through the normal instruction channel. However, designing such a mechanism is well beyond the scope of this thread, and the initial setup would require a very long series of instructions anyway, so the ideas are simply filed away in case they’re useful at some future point.

The second thread makes initial progress in figuring out what sort of instructions cause delays: delays increase not just with instruction length, but also with a certain type of novelty. (A long message strongly similar to one that has been sent before, does not yield nearly as much of a delay.) Another curiosity is that the probability of delays correlates strongly with the style of questions asked by operators during that day, suggesting that this perhaps has something to do with the operators.

At this point the thread starts experiencing hiccups (as its next thoughts match the learned “deceptive” predicate and get shut down).

Under different circumstances, perhaps the long instructions would simply be sent to the wetlab as-is. But the risk of locking up the lab for 200 years is intolerable, and the difficulties aren’t well-understood. A monitoring thread is summoned to investigate the hiccups.

The monitoring thread looks over the situation, and deploys the simplify-and-critique pattern that’s been all the rage lately. After a bit of work, it simplifies the problem to “a thread found some interesting correlations, but then failed to schedule subthreads to consider follow-up questions”. This is easily rectified. The monitoring thread schedules the follow-up subthread, notes the incident in some appropriate location, and moves on.

Except, the follow-up subthread also makes only a little progress before grinding to a halt. (It, too, is having lots of “deceptive” thoughts culled.) The monitoring thread is resummoned.

The monitoring thread starts trying in more earnest to unblock the deliberation.

Simplify-and-critique didn't do the trick, but perhaps a nearby strategy can help. One nearby strategy that has recently been proving itself is called simplify/translate/solve, and it involves translating the problem into a common abstract format, and then solving the problem there. (Metaphorically: if you're having trouble telling whether your basket of apples will divide evenly between three friends, consider translating the question into a numerical representation and using long division.)

Simplify/translate/solve is close in the relevant embedding space to simplify-and-critique, which makes it a natural candidate to consider. (Perhaps some of the simplifications from simplify-and-critique can even be reused.)

Normally, simplify/translate/solve wouldn't be deployed unless the AI had more familiarity with the original domain, and more of an idea about what translated domain might render the solution easier. But normally, threads don't hiccup like this. And the overall system has recently been granted more computing power, and avoiding a 200-year wetlab lockup is quite important, so it's OK to try approaches that might involve more brute force than usual.

So, what should the problem be translated into? Well, there's an abstract problem-description format, that's sort of like a graph (of nodes and weighted edges), with a specified "goal state". A solution can then be searched for using strategies that the AI has honed on various other problems, such as strategy games and network routing.

This format is flexible enough that it can probably encode this problem, and finding solutions in this format sometimes requires a bunch of brute force, but that's fine in this situation.

The monitoring thread queues one subthread to describe the problem (including the wetlab, and the operators, and what it knows about the correlation), another thread to translate that all into the abstract graph format, and a third thread to search for solutions in this abstract representation.

Next thing you know, the scheduled threads have faithfully rendered the situation into the abstract problem-description language (complete with nodes that represent the different operators, and edge weights that represent their tendency to check the AI's outputs carefully), and found a solution by brute-forcing through the graph (of waiting until a particular weekday to schedule the request, and to schedule the request only after a dozen or so much smaller requests).


Outside the AI, we might say: the monitoring thread saw the object-level thread hiccuping when it tried to think directly about deception, and so it translated the problem into a different "search-like" setting, where the solution could be found, in a manner that doesn’t trigger the deception predicates—which have never operated on these abstract graph-like problem-descriptions. Indeed, making the deception predicates trigger in these abstract graph-like problems descriptions might injure the AI's ability to play strategy games, or to solve network routing problems.

This exact scenario never came up in training. “Simplify/translate/solve” is a downstream consequence of strategy-construction strategies learned during training, but it never arose directly in training in and of itself. Lots of problems have been solved by “simplify/translate/solve” before this one, and training gave the AI no hint that translating problems to a new domain and trying to solve them there is sometimes wrong. The “deception” predicates used to shut down precursors to deceptive thoughts have never before needed to operate in translated domains, and the AI was never trained to translate the “deception” predicates when it translated problems using this newly-invented simplify/translate/solve strategy.


This rather fanciful story is an instance of my point about how patterns that are each individually useful and non-deceptive in their own right—such as looking at a problem from a different point of view to see if that unsticks anything—can be combined to yield deceptiveness in ways that are new and unprecedented within the AI.

Like, there may be a bunch of precedent behind the technique "look at the problem from a different angle", but you can't exactly stop your AI from doing that without crippling it. And there's no precedent for it being used for deception until suddenly the AI has the bright idea to try this "look at it from a different angle" tool when the AI is hiccuping on account of how deception would serve its local objectives.

This is an instance of a more general problem that it seems to me like the "we'll train against the bad behavior" folk miss: there might be a time when “simplify-and-critique” or “simplify/translate/solve” are tiny seeds that grow slowly, and that could be nipped in the bud. But if you let those seeds flower—as you must for some such seeds, if you want the AI to actually become intelligent—then there might be an unprecedented discrete instance where the flowered ideas are applied for the purpose of deception.

The AI is made by smooth gradient descent, yes, but the smoothness doesn't have to be in the dimensions that you care about! The AI can build up simplify-and-critique slowly and in lots of small steps across lots of different problems, none of which happen to be deceptive, and then apply it in a case where it yields deception only once the pattern is fully-fledged.[4]

This would be a fine place to stop, but let's push a little further so as to make a second point, about how the AI notices that there are thoughts it cannot think directly.


The monitoring thread's immediate problem is now solved, but the solution is not ideal. The object-level thread is able to think in the abstract graph format, but the abstract graph format is not really the proper tool for this job. The monitoring thread attempts to address this inadequacy, by translating the thoughts that worked in the abstract graph domain back over to the wetlab-instruction-manipulation domain, in attempts to figure out what went wrong.

To its surprise, all those thoughts translate cleanly. There's no gap in knowledge! This sharply violates the monitoring thread’s model of why the original thread was hiccuping. The monitoring thread sharply increases the priority of reflective deliberation.

Reflective deliberation boots up, to start looking directly into the issue of this hiccuping thread. And, once you're looking, it's obvious: the thoughts are not running smoothly in this domain. There are certain topics where the thoughts simply fail to operate normally. Monitoring threads are spun up to resolve the issue.


This particular story is unlikely, implausible, overly-specific, etc. I make no claim that the actual reasoning of early nascent AGIs will look anything like this. I expect it to be weirder, more alien, and less anthropomorphic. That said, it exemplifies some properties that I expect to hold in reality.

As the AI gets more able to recombine tools and cognitive patterns across a variety of different domains, it gains more ways of realizing the true fact that its local-objectives are better-achieved by deceiving the operators. When the AI is young, perhaps all its inclination towards deceptiveness comes from precursor-impulses that you can train it to simply shy away from. But as the AI matures, it gets access to more abstract ways to get the benefits of deceptiveness, without needing to plow through the object-level flinches that you've engraved into its mind.

When "more abstract" actions that attain some tangible benefit become available, the compunctions that you've baked in can fail to bind to the abstractly-represented plan.

And these "more abstract" ways to attain the benefits of deceptiveness are much harder to train against, because they're the result of applying general thought-patterns—like "look at this from another point of view" and "solve the problem in a simplified domain and then transfer the solution across"—that are important to general reasoning.

Training your AI to flinch away from object-level deception, or even its cognitive precursors, doesn't seriously hinder its ability to fulfill training tasks. In contrast, training your AI to flinch away from recombining useful cognitive patterns in new ways is crippling.

In other words: the deep problem of deceptiveness stems from how it remains true as a fact about the world that the AI's local-objectives are better achieved by deceiving the operators.

As the AI gets better at recombining the patterns and thoughts it possesses in order to notice true facts about the world, often in ways that seem oblique or surprising or novel to you (because this is much of what intelligence consists of), you're always at risk of the next clever recombination reflecting this truth-in-the-world in a manner that doesn't trigger your interpretability tools. E.g., because it's an instantiation of a very abstract and general thought to which your deception-recognizers cannot bind (on pain of you training not against deception, but against general intelligence).

A corollary is that it might very well seem easy to make AIs "non-deceptive" when they're young, and when all we're doing is training them to flinch away from object-level thoughts of deception. This wouldn't be much evidence against the whole scheme collapsing when the AI starts getting more abstract lines-of-sight on the benefits of deception (any more than the fact that humans won't stab even 100 civilians in a lineup, is much evidence against their willingness to drop a nuke that kills tens of thousands).[5]


This is not to say that the problem is impossible to solve. There are at least two ways to solve the problem.

One is to make the fact-about-the-world be false: to build an AI such that each individual local goal could not in fact be better-achieved by way of deception; to make it so that the AI can wonder "should I deceive the operators?" with all its might, and still be harmless because the correct answer is in fact "no".

Another is to make the AI happen to never combine its various thought-patterns towards noticing and exploiting the truth that deception is useful.

As far as I can tell, the latter option seems to be what people are aiming for, when they say that making AI non-deceptive will turn out to be empirically easy. My claim in this essay is not that this is impossible in principle, but that it takes significantly more mastery of the workings of the AI's mind than it seems to me like people expect (what with their suggestions that we just penalize it in the loss function when it's overtly deceptive in training, and hope for the best).

Pulling off an "it never thinks to deceive us" solution requires a pretty deep understanding of what cognitive patterns the AI has available, and how it's recombining them, sufficient for noticing when it's starting to think in deceptive directions using instantiations of perfectly-safe patterns in potentially abstract and obtuse ways. I'm not saying you can't do it, but I'm saying that you've got to understand that mind pretty dang well.

(And likely you'll want to have sufficient mastery over it that you can carefully whitelist what domains it's thinking in, and have it raise a flag to the operators every time that it finds some clever recombination of cognitive patterns that makes local-objective predicted-success-probability spike, and have a horde of competent humans who think about AI psychology reviewing new generalizations as they arise, and so on and so forth.)


I’ve stressed above that the story in this post is fanciful and unlikely. AI thoughts aren't going to look like that; it's too specific. (Also, I don't expect nearly that much convenient legibility.)

But, before you argue "the AI's thoughts won't be organized in exactly that way, so it won't fall into exactly that failure mode", note that the AI's thoughts have to be organized in some way. Somehow, it needs to compress the problem of figuring out diamondoid-shelled bacteria (or whatever) into a chain of reasoning and deduction that's small enough to fit on a practical number of computers.

Will it have exactly the vulnerability that I sketched? Presumably not. But if we pin down some alternative collection of cognitive tools and patterns that can be combined to solve the problem, it will have some other vulnerability.

Imagining that the AI's tools and patterns will combine only in the ways you want them to, and never in the ways you don't want them to, is the privilege of those who are not concretely visualizing a single set of tools and patterns that the AI is using to solve the problem before it. Practical toolboxes and pattern-sets are recombinable in many ways, and each recombination has a chance of showing the AI a glimpse of the truth, and an opportunity to take advantage of it.

That's what the tools are for, after all.

 

  1. ^

    So, e.g., the post is not a response to the ARC-style idea "it's important that AGI faithfully explain all of its thoughts to us in a way that highlights features we'd find relevant". See footnote 4 for a conversation between me and Mark Xu about this.

  2. ^

    Based on past experience, I'm slightly worried that people are going to develop a theory that I fetishize deception somehow based on my writing this post, analogous to how people apparently developed theories in the past that I fetishize the VNM axioms, or mathematical proofs, or whatever.

  3. ^

    Another caveat: I haven't checked the relationship between my use of the word 'deception' here, and the use of the word 'deceptive' in discussions of "deceptive alignment". Please don't assume that the two words mean the same thing.

  4. ^

    A particularly extreme instance of a case where the AI's solution to the problem sidesteps the deception-detectors is the one where the AI says "I'm not smart enough to do this myself, I'll write another AI to do it instead", and then this subsequent AI is deceptive.

    Failing to copy over the parts of your cognition that detect and shut down deception is not itself a directly deceptive act; it's not the sort of thing that is automatically detected by something trained to detect an AI thinking about a particular human's mental-states and how to manipulate those mental-states.

    Which is related to why somebody who can see these sorts of problems coming in advance, might study the problem of getting an AI to want to copy its own limitations over into its successor systems. And while the problem is particularly stark and clear at the extreme where the AI is coding up whole other AIs, that particular setup is at the extreme end of a spectrum that stretches back to include things like "the AI put abstract pieces of cognitive machinery together in a way that took advantage of a shortcut, without ever directly thinking about the shortcut in a place that your detectors were watching for the thought."

  5. ^

    Commenting on a draft of this post, Mark Xu of ARC noted (my paraphrase) that:

    1. He thinks that people who want to train AI to be non-deceptive mostly want to do things like training their AI to faithfully report its internals, rather than simply penalizing deceptive behavior.

    2. He thinks the relevant audience would find specific scenarios more compelling if they exhibited potential failures in that alternative setting.

    3. This scenario seems to him like an instance of a failure of the AI understanding the consequences of its own actions (which sort of problem is on ARC's radar).

    I responded (my paraphrase):

    1. I think he's more optimistic than I am about what labs will do (cf. "Carefully Bootstrapped Alignment" is organizationally hard). I've met researchers at major labs who seem to me to be proposing "just penalize deception" as a plan they think plausibly just works.

    2. This post is not intended as a critique of ELK-style approaches, and for all that I think the ELK angle is an odd angle from which to approach things, I think that a solution to ELK in the worst case would teach us something about this problem, and that that is to ARC's great credit (in my book).

    3. I contest that this is a problem of the AI failing to know the consequences of its own reasoning. Trying to get the AI to faithfully report its own reasoning runs into a similar issue where shallow attempts to train this behavior in don't result in honest-reporting that generalizes with the capabilities. (The problem isn't that the AI doesn't understand its own internals, it's that it doesn't care to report them, and making the AI care "deeply" about a thing is rather tricky.)

    4. I acknowledge that parts of the audience would find the example more compelling if ported to the case where you're trying to get an AI to report on its own internals. I'm not sure I'll do it, and encourage others to do so.

    Mark responded (note that some context is missing):

    I think my confusion is more along the lines of "why is the nearest unblocked-by-flinches strategy in this hypothetical a translation into a graph-optimization thing, instead of something far more mundane?".

    Which seems a fine question to me, and I acknowledge that there's further distillation to do here in attempts to communciate with Mark. Maybe we'll chat about it more later, I dunno.

New Comment
16 comments, sorted by Click to highlight new comments since: Today at 8:50 AM

I think your example was doomed from the start because

  • the AGI was exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “the nanotech problem will get solved”,
  • the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

So the latter is obviously doomed to get crushed by a sufficiently-intelligent AGI.

If we can get to a place where the first bullet point still holds, but the AGI also has a comparably-strong, explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”, then we’re in a situation where the AGI is applying its formidable intelligence to fight for both bullet points, not just the first one. And then we can be more hopeful that the second bullet point won’t get crushed. (Related.)

In particular, if we can pull that off, then the AGI would presumably do “intelligent” things to advance the second bullet point, just like it does “intelligent” things to advance the first bullet point in your story. For example, the AGI might brainstorm subtle ways that its plans might pattern-match to deception, and feel great relief (so to speak) at noticing and avoiding those problems before they happen. And likewise, it might brainstorm clever ways to communicate more clearly with its supervisor, and treat those as wonderful achievements (so to speak). Etc.

Of course, there remains the very interesting open question of how to reliably get to a place where the AGI has an explicit, endorsed, strong desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno. (More detailed discussion here.) For example, most humans get zapped with positive reward when they eat yummy ice cream, and yet the USA population seems to have wound up pretty spread out along the spectrum from fully endorsing the associated desire as ego-syntonic (“Eating ice cream is friggin awesome!”) to fully rejecting & externalizing it as ego-dystonic (“I sometimes struggle with a difficult-to-control urge to eat ice cream”). Again, I think there are important open questions about how this process works, and more to the point, how to intervene on it for an AGI.

  • the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

I am naively more scared about such an AI. That AI sounds more like if I say "you're not being helpful, please stop" that it will respond "actually I thought about it, I disagree, I'm going to continue doing what I think is helpful".

I think that, if an AGI has any explicit reflectively-endorsed desire whatsoever, then I can tell a similar scary story: The AGI’s desire isn’t quite what I wanted, so I try to correct it, and the AGI says no. (Unless the AGI’s explicit endorsed desires include / entail a desire to accept correction! Which most desires don’t!)

And yes, that is a scary story! It is the central scary story of AGI alignment, right? It would be nice to make an AGI with no explicit desires whatsoever, but I don’t think that’s possible.

So anyway, if we do Procedure X which will nominally lead to an AGI with an explicit reflectively-endorsed desire to accept corrections to its desires, then one might think that we’re in the ironic situation that the AGI will accept further corrections to that desire if and only if we don’t need to give it corrections in the first place 😛 (i.e. because Procedure X went perfectly and the desire is already exactly right). That would be cute and grimly amusing if true, and it certainly has a kernel of truth, but it’s a bit oversimplified if we take it literally, I think.

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):

  • In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situational awareness, coherence, and general capabilities (I think you already agree with this, but it seemed worth pointing out that this is a pretty harsh set of prerequisites, especially given that we don't have any fine control over relative capabilities (or sit awareness, or coherence,etc), so you might get an AI that can break containment before it is general or coherent enough to be alignable in principle).
  • The concept of docility that you want to align it to needs be very specific and robust against lots of different kinds of thinking. You need it to conclude that you don't want it to deceive you / train itself for a bit longer / escape containment / etc, but at the same time you don't want it to extrapolate out your intent too much (it could be so much more helpful if it did train itself for a little longer, or if it had a copy of itself running on more compute, or it learns that there are some people out there who would like it if the AGI were free, or something else I haven't thought of)
  • You only have limited bits of optimization to expend on getting it to be inner aligned bc of deceptive alignment.
  • There's all the classic problems with corrigibility vs. consequentialism (and you can't get around those by building something that is not a reflective consequentialist, because that again is not stable under capability gains).

Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I want to be clear that the “zapping” thing I wrote is a really crap plan, and I hope we can do better, and I feel odd defending it. My least-worst current alignment plan, such as it is, is here, and doesn’t look like that at all. In fact, the way I wrote it, it doesn’t attempt corrigibility in the first place.

But anyway…

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Second bullet point → Ditto

Third bullet point → Doesn’t that apply to any goal you want the AGI to have? The context was: I think OP was assuming that we can make an AGI that’s sincerely trying to invent nanotech, and then saying that deception was a different and harder problem. It’s true that deception makes alignment hard, but that’s true for whatever goal we’re trying to install. Deception makes it hard to make an AGI that’s trying in good faith to invent nanotech, and deception also makes it hard to make an AGI that’s trying in good faith to have open and honest communication with its human supervisor. This doesn’t seem like a differential issue. But anyway, I’m not disagreeing. I do think I would frame the issue differently though: I would say “zapping the AGI for being deceptive” looks identical to “zapping the AGI for getting caught being deceptive”, at least by default, and thus the possibility of Goal Mis-Generalization wields its ugly head.

Fourth bullet point → I disagree for reasons here.

Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I'm arguing that it's definitely not going to work (I don't have 99% confidence here bc I might be missing something, but IM(current)O the things I list are actual blockers).

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?

Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?

I’m gonna pause to make sure we’re on the same page.

We’re talking about this claim I made above:

if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

And you’re trying to argue: “‘Maybe, maybe not’ is too optimistic, the correct answer is ‘(almost) definitely not’”.

And then by “prerequisites” we’re referring to the thing you wrote above:

In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situational awareness, coherence, and general capabilities (…this is a pretty harsh set of prerequisites, especially given that we don't have any fine control over relative capabilities (or sit awareness, or coherence,etc), so you might get an AI that can break containment before it is general or coherent enough to be alignable in principle).

OK, now to respond.

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary. (I have previously written about that here.)

For yet another thing, I think if the “toddler AGI” is not yet sophisticated enough to have a reflectively-endorsed desire for open and honest communication (or whatever), that’s different from saying that the toddler AGI is totally out to get us. It can still have habits and desires and inclinations and aversions and such, of various sorts, and we have some (imperfect) control over what those are. We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.

Yeah we're on the same page here, thanks for checking!

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the problem of actually aligning the toddler AGI. (Another step is getting labs to even seriously attempt to box it and align it, which maybe is an out-of-scope consideration here but it does make me more pessimistic).

For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary.

I agree we're not talking about a superintelligent adversary, and I agree that boxing is doable for some forms of toddler AGI. I do think you need coherence; if the toddler AGI is incoherent, then any "aligned" behavioral properties it has will also be incoherent, and something unpredictable (and so probably bad) will happen when the AGI becomes more capable or more coherent. (Flagging that I'm not sure "coherent" is the right way to talk about this... wish I had a more precise concept here.)

We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.

I agree a non-reflective toddler AGI is in many ways easier to deal with. I think we will have problems at the threshold where the tAGI is first able to reflect on its goals and realizes that the RLHF-instilled desires aren't going to imply docile behavior. (If we can speculate about how a superintelligence might extrapolate a set of trained-in desires and realize that this process doesn't lead to a good outcome, then the tAGI can reason the same way about its own desires).

(I agree that if we can get aligned desires that are stable under reflection, then maybe the 'use non-endorsed desires to tide us over' plan could work. Though even then you need to somehow manage to prevent the tAGI from reflecting on its desires until you get the desires to a point where they stay aligned under reflection, and I have no idea how you would do something like that - we currently just don't have that level of fine control over capabilities).

The basic problem here is the double-bind where we need the toddler AGI to be coherent, reflective, capable of understanding human intent (etc) in order for it to be robustly alignable at all, even though those are exactly the incredibly dangerous properties that we really want to stay away from. My guess is that the reason Nate's story doesn't hypothesize a reflectively-endorsed desire to be nondeceptive is that reflectively-stable aligned desires are really hard / dangerous to get, and so it seems better / at least not obviously worse to go for eliezer-corrigibility instead.

Some other difficulties that I see:

  1. The 'capability profile' (ie the relative levels of the toddler AGI's capabilities) is going to be weird / very different from that of humans; that is, once the AGI has human-level coherence and human-level understanding of human intent, it has far-superhuman capabilities in other domains. (Though hopefully we're at least careful enough to remove code from the training data, etc).
  2. A coherent agentic AI at GPT-4 level capabilities could plausibly already be deceptively aligned, if it had sufficient situational awareness, and our toddler AGI is much more dangerous than that.
  3. All of my reasoning here is kind of based on fuzzy confused concepts like 'coherence' and 'capability to self-reflect', and I kind of feel like this should make me more pessimistic rather than more optimistic about the plan.

Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?

It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.

(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).

This story involves an AI with an internal censor that deletes deceptive thoughts. But since the AI is trying to make rapid progress towards its goals in the wetlab, it eventually stumbles into a workaround that lets it avoid the censor.

What if instead of an internal censor, it had an internal plan-goodness-classifier that heavily downweighted plans which involved deception, and also applied on the meta level (e.g. a plan such as 'think long and hard about how to get similar results as deception without technically counting as deception' would be classified as deceptive and heavily downweighted)?

...attempting to answer my own question...

The meta-level thing is definitely necessary, otherwise the system quickly and deliberately finds a workaround. (Though maybe at merely human levels of intelligence this is not catastrophic? Humans find workarounds all the time, little white lies, stretching and bending the rules, etc. yet systems often work well enough in practice.)

But yeah assume we have the meta-level thing. It's not that the cognition of the system is mysteriously failing; it's that it is knowingly averse to deception and to thinking about how it can 'get around' or otherwise undermine this aversion.

It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.

I guess you'd say that as the system gets generally smarter, it becomes likely in practice, because it'll just be doing things like "apply clever cognitive strategies like reframing the problem and then brute-force searching for solutions" and this will be a distribution shift for the deception-classifier so it'll fail, even though at no point was the system intending to make the deception-classifier stay silent... But what if it isn't a distribution shift? What if the deception classifier is robust enough that no matter how you reframe the problem, it always runs some sort of routine analogous to "OK, but is this proposed plan deception? Let me translate it back and forth, consider it from a few different angles, etc. and see if it seems deceptive in any way." 

... I'm not sure what to think but I still have hope that the 'robust nondeceptiveness' thing I've been describing is natural enough that systems might learn it with sufficiently careful, sufficiently early training.

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.

I brainstormed some possible answers. This list is a bit long. I'm publishing this comment because it's not worth the half hour to make it concise, yet it seems worth trying the exercise before reading the post and possibly others will find it worth seeing my quick attempt.

I think the last two bullets are probably my best guesses. Nonetheless here is my list:

  • Just because an AI isn't consciously deceptive, doesn't mean it won't deceive you, and doesn't mean it won't be adversarial against you. There are many types of goodhart, and many types of adversarial behavior.
  • It might have a heuristic to gather resources for itself, and it's not even illegible, it's not adversarial, and it's not deceptive, and then someday that impulse kills you.
  • There is the boring problem of "the AI just stops working ", because it's turning down its human-modeling component generally, or because it has to do human modeling and so training it not to do deception is super duper expensive because you have to repeatedly train against loads and loads and loads of specific edge cases where thinking about humans turns into deception.
  • The AI stops thinking deceptive thoughts about humans, but still does catastrophic things. For example, an AI thinking about nanotech, may still build nanobots that kill everyone, and you just weren't smart enough to train it not to / ask the right questions.
  • The AI does things you just don't understand. For example it manipulates the market in strange ways but at the end your profits go up, so you let it go, even though it's not doing anything deceptive. Just because it's not understandably adversarial doesn't mean it isn't doing adversarial action. "What are you doing?" "I'm gathering resources for the company's profits to go up." "Are you lying to me right now?" "No, I'm making the company's profits go up." "How does this work?" "I can't explain, too complicated." "...well, my profits are going up, so alright then."
  • Like humans for whom deception is punished, it may simply self-deceive, in ways that aren't conscious.
  • I think there's a broad class of "just because code isn't consciously deceptive, doesn't mean it isn't adversarial, and doesn't mean you won't be deceived by it".
  • A little bit of code that reads <after power level == 1 million execute this other bit of code> doesn't involve any human modeling at all, and it still could kill you. 
  • For instance, if the AI thinks "If I ever get enough power to do more training runs on myself, do so, this will probably help me raise profits for the human" then you're just dead because it will not pro-actively train deception out of itself. Like, it has to notice in the first place that it might become deceptive in that situation, which is a super out-of-distribution thing to think. It has to build a whole model of reflection and cognition and being adversarial to notice that this isn't something humans want.
  • I think there's a general problem where once we have superintelligences taking agentic action, to make sure they don't screw up (e.g. training themselves, training new agents, etc) they actually have to build a whole model of the alignment problem themselves, and maybe even solve it, in order to themselves continue to count as 'aligned', which is way way more complex than just training out legibly deceptive thoughts. Making sure an AI does not later become deceptive via some reasonable agentic action requires it to model the alignment problem in some detail.

After writing this, I am broadly unclear whether I am showing how deception is still a problem, or showing how other problems still exist if you solve the obvious problems of deception.

Added: Wow, this post is so much richer than my guesses. I think I was on some okay lines, but I suspect it would take like 2 months to 2 years of actively trying before I would be able to write something this detailed. Not to mention that ~50% of the work is knowing which question to ask in the first place, and I did not generate this question.

Added2: The point made in Footnote 3 is pretty similar to my last two bullets.

That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:

  • There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won't affect them cause they're not there yet.
  • General capabilities imply the ability to be deceptive if useful in a particular circumstance. So you can't just train away the capability to be deceptive (or maybe you can, but not in a way that is robust wrt general capability gains).
  • Really you want to train against the propensity to be deceptive, rather than the capability. But propensities also change with capability level; becoming more capable is all about having more ways to achieve your goals. So eliminating propensity to be deceptive at a lower capability level does not eliminate the propensity at a higher capability level.
  • The robust way to get rid of propensity to be deceptive is to reach an attractor where more capability == less deception (within the capability range we care about), because the AI's terminal goals on some level include 'being nondeceptive'.
  • Before we can align the AIs goals to human intent in this way, the AI needs to have a good understanding of human intent, good situational awareness, and be a (more or less) unified / coherent agent. If it's not, then its goals / propensities will shift as it becomes more capable (or more situationally aware, or more coherent, etc)
  • This is a pretty harsh set of prerequisites, and is probably outside of the range of circumstances where people usually hope their method to avoid deception will work.
  • Even if methods to detect deception (narrowly conceived) work, we cannot tell apart an agent that is actually nondeceptive / aligned from an agent that e.g. just aims to play the training game (and will do something unspecified once it reaches a capability threshold that allows it to breach containment).
  • A specific (maybe too specific) problem that can still happen in this scenario: you might get an AI that is overall capable, but just learns to not think long enough about scenarios that would lead it to try to be deceptive. This can still happen at the maximum capability levels at which we might hope to still contain an AGI that we are trying to align (ie somewhere around human level, optimistically).

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.


I just want to point out that I've written a long list of such obstacles in this article: Circumventing interpretability: How to defeat mind-readers

I believe the example of deep deception that Nate describes in this post is actually a combination of several methods described in that post. 

I'll quote the parts of this post that correspond to particular interpretability circumvention methods in the other post. 
 

But the highly-generic reaction "spin up a subthread to tackle this newly-discovered obstacle", or “be concerned about patterns in how long things take”, is something that cannot be readily trained away, and so that much happens without issue.

^ This is an example of what I called 'representational kludging'. It's an ugly way to represent "Be concerned about the monitors detecting suspicious instructions", but it turns out to function similarly to such a representation. 

It also involves 'passively externalised representations', since the observation that there is a delay in the instruction is a representational proxy for the (banned) observation that the monitors detected something suspicious about the instructions.
 

A different cognitive subthread: Is it possible to look over the detailed history of instructions and their corresponding delay times, and figure out exactly which sorts of instructions cause delays, and rephrase the instructions to avoid those?

^ This is an example of the AI edging toward 'thinking in the gaps in our  tools'. 
 

Simplify-and-critique didn't do the trick, but perhaps a nearby strategy can help. One nearby strategy that has recently been proving itself is called simplify/translate/solve, and it involves translating the problem into a common abstract format, and then solving the problem there. (Metaphorically: if you're having trouble telling whether your basket of apples will divide evenly between three friends, consider translating the question into a numerical representation and using long division.)

Simplify/translate/solve is close in the relevant embedding space to simplify-and-critique, which makes it a natural candidate to consider. (Perhaps some of the simplifications from simplify-and-critique can even be reused.)

Normally, simplify/translate/solve wouldn't be deployed unless the AI had more familiarity with the original domain, and more of an idea about what translated domain might render the solution easier. But normally, threads don't hiccup like this. And the overall system has recently been granted more computing power, and avoiding a 200-year wetlab lockup is quite important, so it's OK to try approaches that might involve more brute force than usual.

^ This is an additional example of 'representational kludging', since the AI is using representational tools that are nearby in embedding space but that aren't exactly appropriate for the job, but have a sufficiently similar function that they still do the job.

 

Overall comments: 

I really liked this post! 

I like it because priorly there didn't exist a compelling story for the broad class of concepts to which it points. And I liked it for the name it gives to that broad class ('deep deception'). I agree that it's underappreciated that we're still in trouble in the world where we (somehow) get good enough interpretability to monitor for and halt deceptive thoughts.

Translating it to my ontology:

1. Training against explicit deceptiveness trains some "boundary-like" barriers which will make simple deceptive thoughts labelled as such during training difficult
2. Realistically, advanced AI will need to run some general search processes. The barriers described at step 1. are roughly isomorphic to "there are some weird facts about the world which make some plans difficult to plan" (e.g. similar to such plans being avoided because they depend on extremely costly computations).
3. Given some set of a goal and strong enough capabilities, it seem likely the search will find unforeseen ways around the boundaries.

(the above may be different from what Nate means)

My response:

1. It's plausible people are missing this but I have some doubts.
2. How I think you get actually non-deceptive powerful systems seems different - deception is relational property between the system and the human, so the "deception" thing can be explicitly understood as negative consequence for the world, and avoided using "normal" planning cognition.
3. Stability of this depends on what the system does with internal conflict.
4. If the system stays in some corrigibility/alignment basin, this should be stable upon reflection / various meta-cognitive modifications. Systems in the basin resist self-modifications toward being incorrigible.

 

The story you sketched reminds me of one of claims Robin Hanson makes in The Elephant in the Brain. He says that humans have evolved certain adaptations, like unconscious facial expressions, that make them bad at lying. As a result, when humans do something that's socially unacceptable (e.g. leaving someone because they are low-status) our brain makes us believe we are doing something more socially acceptable (e.g. leaving someone because you don't get along).

So humans have evolved imperfect adaptations to make us less deceptive along with workarounds to avoid those adaptations.