Deep Deceptiveness

I think your example was doomed from the start because

the AGI was exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “the nanotech problem will get solved”,
the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

So the latter is obviously doomed to get crushed by a sufficiently-intelligent AGI.

If we can get to a place where the first bullet point still holds, but the AGI also has a comparably-strong, explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”, then we’re in a situation where the AGI is applying its formidable intelligence to fight for both bullet points, not just the first one. And then we can be more hopeful that the second bullet point won’t get crushed. (Related.)

In particular, if we can pull that off, then the AGI would presumably do “intelligent” things to advance the second bullet point, just like it does “intelligent” things to advance the first bullet point in your story. For example, the AGI might brainstorm subtle ways that its plans might pattern-match to deception, and feel great relief (so to speak) at noticing and avoiding those problems before they happen. And likewise, it might brainstorm clever ways to communicate more clearly with its supervisor, and treat those as wonderful achievements (so to speak). Etc.

Of course, there remains the very interesting open question of how to reliably get to a place where the AGI has an explicit, endorsed, strong desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno. (More detailed discussion here.) For example, most humans get zapped with positive reward when they eat yummy ice cream, and yet the USA population seems to have wound up pretty spread out along the spectrum from fully endorsing the associated desire as ego-syntonic (“Eating ice cream is friggin awesome!”) to fully rejecting & externalizing it as ego-dystonic (“I sometimes struggle with a difficult-to-control urge to eat ice cream”). Again, I think there are important open questions about how this process works, and more to the point, how to intervene on it for an AGI.

[-]Ben Pace3y40

the AGI was NOT exercising its intelligence & reason & planning etc. towards an explicit, reflectively-endorsed desire for “I am being helpful / I am being docile / I am acting with integrity / blah blah”.

I am naively more scared about such an AI. That AI sounds more like if I say "you're not being helpful, please stop" that it will respond "actually I thought about it, I disagree, I'm going to continue doing what I think is helpful".

[-]Steven Byrnes3y77

I think that, if an AGI has any explicit reflectively-endorsed desire whatsoever, then I can tell a similar scary story: The AGI’s desire isn’t quite what I wanted, so I try to correct it, and the AGI says no. (Unless the AGI’s explicit endorsed desires include / entail a desire to accept correction! Which most desires don’t!)

And yes, that is a scary story! It is the central scary story of AGI alignment, right? It would be nice to make an AGI with no explicit desires whatsoever, but I don’t think that’s possible.

So anyway, if we do Procedure X which will nominally lead to an AGI with an explicit reflectively-endorsed desire to accept corrections to its desires, then one might think that we’re in the ironic situation that the AGI will accept further corrections to that desire if and only if we don’t need to give it corrections in the first place 😛 (i.e. because Procedure X went perfectly and the desire is already exactly right). That would be cute and grimly amusing if true, and it certainly has a kernel of truth, but it’s a bit oversimplified if we take it literally, I think.

[-]Lauro Langosco3y30

In particular, if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

Curious what your take is on these reasons to think the answer is no (IMO the first one is basically already enough):

In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situational awareness, coherence, and general capabilities (I think you already agree with this, but it seemed worth pointing out that this is a pretty harsh set of prerequisites, especially given that we don't have any fine control over relative capabilities (or sit awareness, or coherence,etc), so you might get an AI that can break containment before it is general or coherent enough to be alignable in principle).
The concept of docility that you want to align it to needs be very specific and robust against lots of different kinds of thinking. You need it to conclude that you don't want it to deceive you / train itself for a bit longer / escape containment / etc, but at the same time you don't want it to extrapolate out your intent too much (it could be so much more helpful if it did train itself for a little longer, or if it had a copy of itself running on more compute, or it learns that there are some people out there who would like it if the AGI were free, or something else I haven't thought of)
You only have limited bits of optimization to expend on getting it to be inner aligned bc of deceptive alignment.
There's all the classic problems with corrigibility vs. consequentialism (and you can't get around those by building something that is not a reflective consequentialist, because that again is not stable under capability gains).

[-]Steven Byrnes3y2-1

Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I want to be clear that the “zapping” thing I wrote is a really crap plan, and I hope we can do better, and I feel odd defending it. My least-worst current alignment plan, such as it is, is here, and doesn’t look like that at all. In fact, the way I wrote it, it doesn’t attempt corrigibility in the first place.

But anyway…

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Second bullet point → Ditto

Third bullet point → Doesn’t that apply to any goal you want the AGI to have? The context was: I think OP was assuming that we can make an AGI that’s sincerely trying to invent nanotech, and then saying that deception was a different and harder problem. It’s true that deception makes alignment hard, but that’s true for whatever goal we’re trying to install. Deception makes it hard to make an AGI that’s trying in good faith to invent nanotech, and deception also makes it hard to make an AGI that’s trying in good faith to have open and honest communication with its human supervisor. This doesn’t seem like a differential issue. But anyway, I’m not disagreeing. I do think I would frame the issue differently though: I would say “zapping the AGI for being deceptive” looks identical to “zapping the AGI for getting caught being deceptive”, at least by default, and thus the possibility of Goal Mis-Generalization wields its ugly head.

Fourth bullet point → I disagree for reasons here.

[-]Lauro Langosco3y10

Are you arguing that it’s probably not going to work, or that it’s definitely not going to work? I’m inclined to agree with the first and disagree with the second.

I'm arguing that it's definitely not going to work (I don't have 99% confidence here bc I might be missing something, but IM(current)O the things I list are actual blockers).

First bullet point → Seems like a very possible but not absolutely certain failure mode for what I wrote.

Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?

[-]Steven Byrnes3y30

Do you mean we possibly don't need the prerequisites, or we definitely need them but that's possibly fine?

I’m gonna pause to make sure we’re on the same page.

We’re talking about this claim I made above:

if we zap the AGI with negative reward when it’s acting from a deceptive motivation and positive reward when it’s acting from a being-helpful motivation, would those zaps turn into a reflectively-endorsed desire for “I am being docile / helpful / etc.”? Maybe, maybe not, I dunno.

And you’re trying to argue: “‘Maybe, maybe not’ is too optimistic, the correct answer is ‘(almost) definitely not’”.

And then by “prerequisites” we’re referring to the thing you wrote above:

In order to have reflectively-endorsed goals that are stable under capability gains, the AGI needs to have reached some threshold levels of situational awareness, coherence, and general capabilities (…this is a pretty harsh set of prerequisites, especially given that we don't have any fine control over relative capabilities (or sit awareness, or coherence,etc), so you might get an AI that can break containment before it is general or coherent enough to be alignable in principle).

OK, now to respond.

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary. (I have previously written about that here.)

For yet another thing, I think if the “toddler AGI” is not yet sophisticated enough to have a reflectively-endorsed desire for open and honest communication (or whatever), that’s different from saying that the toddler AGI is totally out to get us. It can still have habits and desires and inclinations and aversions and such, of various sorts, and we have some (imperfect) control over what those are. We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.

[-]Lauro Langosco3y30

Yeah we're on the same page here, thanks for checking!

For one thing, you use the “might” near the end of that excerpt. That seems more compatible with a ‘maybe, maybe not’ claim, than with an ‘(almost) definitely not’ claim, right?

I feel pretty uncertain about all the factors here. One reason I overall still lean towards the 'definitely not' stance is that building a toddler AGI that is alignable in principle is only one of multiple steps that need to go right for us to get a reflectively-stable docile AGI; in particular we still need to solve the problem of actually aligning the toddler AGI. (Another step is getting labs to even seriously attempt to box it and align it, which maybe is an out-of-scope consideration here but it does make me more pessimistic).

For another thing, if we have, umm, “toddler AGI” that’s too unsophisticated to have good situational awareness, coherence, etc., then I would think that the boxing / containment problem is a lot easier than we normally think about, right? We’re not talking about hardening against a superintelligent adversary.

I agree we're not talking about a superintelligent adversary, and I agree that boxing is doable for some forms of toddler AGI. I do think you need coherence; if the toddler AGI is incoherent, then any "aligned" behavioral properties it has will also be incoherent, and something unpredictable (and so probably bad) will happen when the AGI becomes more capable or more coherent. (Flagging that I'm not sure "coherent" is the right way to talk about this... wish I had a more precise concept here.)

We can use non-reflectively-endorsed desires to help tide us over until the toddler AGI develops enough reflectivity to form any reflectively-endorsed desires at all.

I agree a non-reflective toddler AGI is in many ways easier to deal with. I think we will have problems at the threshold where the tAGI is first able to reflect on its goals and realizes that the RLHF-instilled desires aren't going to imply docile behavior. (If we can speculate about how a superintelligence might extrapolate a set of trained-in desires and realize that this process doesn't lead to a good outcome, then the tAGI can reason the same way about its own desires).

(I agree that if we can get aligned desires that are stable under reflection, then maybe the 'use non-endorsed desires to tide us over' plan could work. Though even then you need to somehow manage to prevent the tAGI from reflecting on its desires until you get the desires to a point where they stay aligned under reflection, and I have no idea how you would do something like that - we currently just don't have that level of fine control over capabilities).

The basic problem here is the double-bind where we need the toddler AGI to be coherent, reflective, capable of understanding human intent (etc) in order for it to be robustly alignable at all, even though those are exactly the incredibly dangerous properties that we really want to stay away from. My guess is that the reason Nate's story doesn't hypothesize a reflectively-endorsed desire to be nondeceptive is that reflectively-stable aligned desires are really hard / dangerous to get, and so it seems better / at least not obviously worse to go for eliezer-corrigibility instead.

Some other difficulties that I see:

The 'capability profile' (ie the relative levels of the toddler AGI's capabilities) is going to be weird / very different from that of humans; that is, once the AGI has human-level coherence and human-level understanding of human intent, it has far-superhuman capabilities in other domains. (Though hopefully we're at least careful enough to remove code from the training data, etc).
A coherent agentic AI at GPT-4 level capabilities could plausibly already be deceptively aligned, if it had sufficient situational awareness, and our toddler AGI is much more dangerous than that.
All of my reasoning here is kind of based on fuzzy confused concepts like 'coherence' and 'capability to self-reflect', and I kind of feel like this should make me more pessimistic rather than more optimistic about the plan.

[-]Johannes Treutlein3y20

Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?

[-]Lauro Langosco3y33

It does make me more uncertain about most of the details. And that then makes me more pessimistic about the solution, because I expect that I'm missing some of the problems.

(Analogy: say I'm working on a math exercise sheet and I have some concrete reason to suspect my answer may be wrong; if I then realize I'm actually confused about the entire setup, I should be even more pessimistic about having gotten the correct answer).

[-]Daniel Kokotajlo3y101

This story involves an AI with an internal censor that deletes deceptive thoughts. But since the AI is trying to make rapid progress towards its goals in the wetlab, it eventually stumbles into a workaround that lets it avoid the censor.

What if instead of an internal censor, it had an internal plan-goodness-classifier that heavily downweighted plans which involved deception, and also applied on the meta level (e.g. a plan such as 'think long and hard about how to get similar results as deception without technically counting as deception' would be classified as deceptive and heavily downweighted)?

...attempting to answer my own question...

The meta-level thing is definitely necessary, otherwise the system quickly and deliberately finds a workaround. (Though maybe at merely human levels of intelligence this is not catastrophic? Humans find workarounds all the time, little white lies, stretching and bending the rules, etc. yet systems often work well enough in practice.)

But yeah assume we have the meta-level thing. It's not that the cognition of the system is mysteriously failing; it's that it is knowingly averse to deception and to thinking about how it can 'get around' or otherwise undermine this aversion.

It could still randomly stumble into a workaround. If a particular plan occurs to it that doesn't get classified as deception but achieves similar results, it'll go for it. But this is unlikely in practice because it won't be spending cognition trying to find such plans, because doing so would trigger the aversion.

I guess you'd say that as the system gets generally smarter, it becomes likely in practice, because it'll just be doing things like "apply clever cognitive strategies like reframing the problem and then brute-force searching for solutions" and this will be a distribution shift for the deception-classifier so it'll fail, even though at no point was the system intending to make the deception-classifier stay silent... But what if it isn't a distribution shift? What if the deception classifier is robust enough that no matter how you reframe the problem, it always runs some sort of routine analogous to "OK, but is this proposed plan deception? Let me translate it back and forth, consider it from a few different angles, etc. and see if it seems deceptive in any way."

... I'm not sure what to think but I still have hope that the 'robust nondeceptiveness' thing I've been describing is natural enough that systems might learn it with sufficiently careful, sufficiently early training.

[-]Ben Pace3y*73

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.

I brainstormed some possible answers. This list is a bit long. I'm publishing this comment because it's not worth the half hour to make it concise, yet it seems worth trying the exercise before reading the post and possibly others will find it worth seeing my quick attempt.

I think the last two bullets are probably my best guesses. Nonetheless here is my list:

Just because an AI isn't consciously deceptive, doesn't mean it won't deceive you, and doesn't mean it won't be adversarial against you. There are many types of goodhart, and many types of adversarial behavior.
It might have a heuristic to gather resources for itself, and it's not even illegible, it's not adversarial, and it's not deceptive, and then someday that impulse kills you.
There is the boring problem of "the AI just stops working ", because it's turning down its human-modeling component generally, or because it has to do human modeling and so training it not to do deception is super duper expensive because you have to repeatedly train against loads and loads and loads of specific edge cases where thinking about humans turns into deception.
The AI stops thinking deceptive thoughts about humans, but still does catastrophic things. For example, an AI thinking about nanotech, may still build nanobots that kill everyone, and you just weren't smart enough to train it not to / ask the right questions.
The AI does things you just don't understand. For example it manipulates the market in strange ways but at the end your profits go up, so you let it go, even though it's not doing anything deceptive. Just because it's not understandably adversarial doesn't mean it isn't doing adversarial action. "What are you doing?" "I'm gathering resources for the company's profits to go up." "Are you lying to me right now?" "No, I'm making the company's profits go up." "How does this work?" "I can't explain, too complicated." "...well, my profits are going up, so alright then."
Like humans for whom deception is punished, it may simply self-deceive, in ways that aren't conscious.
I think there's a broad class of "just because code isn't consciously deceptive, doesn't mean it isn't adversarial, and doesn't mean you won't be deceived by it".
A little bit of code that reads <after power level == 1 million execute this other bit of code> doesn't involve any human modeling at all, and it still could kill you.
For instance, if the AI thinks "If I ever get enough power to do more training runs on myself, do so, this will probably help me raise profits for the human" then you're just dead because it will not pro-actively train deception out of itself. Like, it has to notice in the first place that it might become deceptive in that situation, which is a super out-of-distribution thing to think. It has to build a whole model of reflection and cognition and being adversarial to notice that this isn't something humans want.
I think there's a general problem where once we have superintelligences taking agentic action, to make sure they don't screw up (e.g. training themselves, training new agents, etc) they actually have to build a whole model of the alignment problem themselves, and maybe even solve it, in order to themselves continue to count as 'aligned', which is way way more complex than just training out legibly deceptive thoughts. Making sure an AI does not later become deceptive via some reasonable agentic action requires it to model the alignment problem in some detail.

After writing this, I am broadly unclear whether I am showing how deception is still a problem, or showing how other problems still exist if you solve the obvious problems of deception.

Added: Wow, this post is so much richer than my guesses. I think I was on some okay lines, but I suspect it would take like 2 months to 2 years of actively trying before I would be able to write something this detailed. Not to mention that ~50% of the work is knowing which question to ask in the first place, and I did not generate this question.

Added2: The point made in Footnote 3 is pretty similar to my last two bullets.

[-]Lauro Langosco3y61

That's a challenge, and while you (hopefully) chew on it, I'll tell an implausibly-detailed story to exemplify a deeper obstacle.

Some thoughts written down before reading the rest of the post (list is unpolished / not well communicated)
The main problems I see:

There are kinds of deception (or rather kinds of deceptive capabilities / thoughts) that only show up after a certain capability level, and training before that level just won't affect them cause they're not there yet.
General capabilities imply the ability to be deceptive if useful in a particular circumstance. So you can't just train away the capability to be deceptive (or maybe you can, but not in a way that is robust wrt general capability gains).
Really you want to train against the propensity to be deceptive, rather than the capability. But propensities also change with capability level; becoming more capable is all about having more ways to achieve your goals. So eliminating propensity to be deceptive at a lower capability level does not eliminate the propensity at a higher capability level.
The robust way to get rid of propensity to be deceptive is to reach an attractor where more capability == less deception (within the capability range we care about), because the AI's terminal goals on some level include 'being nondeceptive'.
Before we can align the AIs goals to human intent in this way, the AI needs to have a good understanding of human intent, good situational awareness, and be a (more or less) unified / coherent agent. If it's not, then its goals / propensities will shift as it becomes more capable (or more situationally aware, or more coherent, etc)
This is a pretty harsh set of prerequisites, and is probably outside of the range of circumstances where people usually hope their method to avoid deception will work.
Even if methods to detect deception (narrowly conceived) work, we cannot tell apart an agent that is actually nondeceptive / aligned from an agent that e.g. just aims to play the training game (and will do something unspecified once it reaches a capability threshold that allows it to breach containment).
A specific (maybe too specific) problem that can still happen in this scenario: you might get an AI that is overall capable, but just learns to not think long enough about scenarios that would lead it to try to be deceptive. This can still happen at the maximum capability levels at which we might hope to still contain an AGI that we are trying to align (ie somewhere around human level, optimistically).

[-]Lee Sharkey3y*30

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.

I just want to point out that I've written a long list of such obstacles in this article: Circumventing interpretability: How to defeat mind-readers

I believe the example of deep deception that Nate describes in this post is actually a combination of several methods described in that post.

I'll quote the parts of this post that correspond to particular interpretability circumvention methods in the other post.

But the highly-generic reaction "spin up a subthread to tackle this newly-discovered obstacle", or “be concerned about patterns in how long things take”, is something that cannot be readily trained away, and so that much happens without issue.

^ This is an example of what I called 'representational kludging'. It's an ugly way to represent "Be concerned about the monitors detecting suspicious instructions", but it turns out to function similarly to such a representation.

It also involves 'passively externalised representations', since the observation that there is a delay in the instruction is a representational proxy for the (banned) observation that the monitors detected something suspicious about the instructions.

A different cognitive subthread: Is it possible to look over the detailed history of instructions and their corresponding delay times, and figure out exactly which sorts of instructions cause delays, and rephrase the instructions to avoid those?

^ This is an example of the AI edging toward 'thinking in the gaps in our tools'.

Simplify-and-critique didn't do the trick, but perhaps a nearby strategy can help. One nearby strategy that has recently been proving itself is called simplify/translate/solve, and it involves translating the problem into a common abstract format, and then solving the problem there. (Metaphorically: if you're having trouble telling whether your basket of apples will divide evenly between three friends, consider translating the question into a numerical representation and using long division.)
Simplify/translate/solve is close in the relevant embedding space to simplify-and-critique, which makes it a natural candidate to consider. (Perhaps some of the simplifications from simplify-and-critique can even be reused.)
Normally, simplify/translate/solve wouldn't be deployed unless the AI had more familiarity with the original domain, and more of an idea about what translated domain might render the solution easier. But normally, threads don't hiccup like this. And the overall system has recently been granted more computing power, and avoiding a 200-year wetlab lockup is quite important, so it's OK to try approaches that might involve more brute force than usual.

^ This is an additional example of 'representational kludging', since the AI is using representational tools that are nearby in embedding space but that aren't exactly appropriate for the job, but have a sufficiently similar function that they still do the job.

Overall comments:

I really liked this post!

I like it because priorly there didn't exist a compelling story for the broad class of concepts to which it points. And I liked it for the name it gives to that broad class ('deep deception'). I agree that it's underappreciated that we're still in trouble in the world where we (somehow) get good enough interpretability to monitor for and halt deceptive thoughts.

[-]Jan_Kulveit3y1-4

Translating it to my ontology:

1. Training against explicit deceptiveness trains some "boundary-like" barriers which will make simple deceptive thoughts labelled as such during training difficult
2. Realistically, advanced AI will need to run some general search processes. The barriers described at step 1. are roughly isomorphic to "there are some weird facts about the world which make some plans difficult to plan" (e.g. similar to such plans being avoided because they depend on extremely costly computations).
3. Given some set of a goal and strong enough capabilities, it seem likely the search will find unforeseen ways around the boundaries.

(the above may be different from what Nate means)

My response:

1. It's plausible people are missing this but I have some doubts.
2. How I think you get actually non-deceptive powerful systems seems different - deception is relational property between the system and the human, so the "deception" thing can be explicitly understood as negative consequence for the world, and avoided using "normal" planning cognition.
3. Stability of this depends on what the system does with internal conflict.
4. If the system stays in some corrigibility/alignment basin, this should be stable upon reflection / various meta-cognitive modifications. Systems in the basin resist self-modifications toward being incorrigible.

[-][anonymous]3y13

The story you sketched reminds me of one of claims Robin Hanson makes in The Elephant in the Brain. He says that humans have evolved certain adaptations, like unconscious facial expressions, that make them bad at lying. As a result, when humans do something that's socially unacceptable (e.g. leaving someone because they are low-status) our brain makes us believe we are doing something more socially acceptable (e.g. leaving someone because you don't get along).

So humans have evolved imperfect adaptations to make us less deceptive along with workarounds to avoid those adaptations.

^{^}

So, e.g., the post is not a response to the ARC-style idea "it's important that AGI faithfully explain all of its thoughts to us in a way that highlights features we'd find relevant". See footnote 4 for a conversation between me and Mark Xu about this.

^{^}

Based on past experience, I'm slightly worried that people are going to develop a theory that I fetishize deception somehow based on my writing this post, analogous to how people apparently developed theories in the past that I fetishize the VNM axioms, or mathematical proofs, or whatever.

^{^}

Another caveat: I haven't checked the relationship between my use of the word 'deception' here, and the use of the word 'deceptive' in discussions of "deceptive alignment". Please don't assume that the two words mean the same thing.

^{^}

A particularly extreme instance of a case where the AI's solution to the problem sidesteps the deception-detectors is the one where the AI says "I'm not smart enough to do this myself, I'll write another AI to do it instead", and then this subsequent AI is deceptive.

Failing to copy over the parts of your cognition that detect and shut down deception is not itself a directly deceptive act; it's not the sort of thing that is automatically detected by something trained to detect an AI thinking about a particular human's mental-states and how to manipulate those mental-states.

Which is related to why somebody who can see these sorts of problems coming in advance, might study the problem of getting an AI to want to copy its own limitations over into its successor systems. And while the problem is particularly stark and clear at the extreme where the AI is coding up whole other AIs, that particular setup is at the extreme end of a spectrum that stretches back to include things like "the AI put abstract pieces of cognitive machinery together in a way that took advantage of a shortcut, without ever directly thinking about the shortcut in a place that your detectors were watching for the thought."

^{^}

Commenting on a draft of this post, Mark Xu of ARC noted (my paraphrase) that:

1. He thinks that people who want to train AI to be non-deceptive mostly want to do things like training their AI to faithfully report its internals, rather than simply penalizing deceptive behavior.

2. He thinks the relevant audience would find specific scenarios more compelling if they exhibited potential failures in that alternative setting.

3. This scenario seems to him like an instance of a failure of the AI understanding the consequences of its own actions (which sort of problem is on ARC's radar).

I responded (my paraphrase):

1. I think he's more optimistic than I am about what labs will do (cf. "Carefully Bootstrapped Alignment" is organizationally hard). I've met researchers at major labs who seem to me to be proposing "just penalize deception" as a plan they think plausibly just works.

2. This post is not intended as a critique of ELK-style approaches, and for all that I think the ELK angle is an odd angle from which to approach things, I think that a solution to ELK in the worst case would teach us something about this problem, and that that is to ARC's great credit (in my book).

3. I contest that this is a problem of the AI failing to know the consequences of its own reasoning. Trying to get the AI to faithfully report its own reasoning runs into a similar issue where shallow attempts to train this behavior in don't result in honest-reporting that generalizes with the capabilities. (The problem isn't that the AI doesn't understand its own internals, it's that it doesn't care to report them, and making the AI care "deeply" about a thing is rather tricky.)

4. I acknowledge that parts of the audience would find the example more compelling if ported to the case where you're trying to get an AI to report on its own internals. I'm not sure I'll do it, and encourage others to do so.

Mark responded (note that some context is missing):

I think my confusion is more along the lines of "why is the nearest unblocked-by-flinches strategy in this hypothetical a translation into a graph-optimization thing, instead of something far more mundane?".

Which seems a fine question to me, and I acknowledge that there's further distillation to do here in attempts to communciate with Mark. Maybe we'll chat about it more later, I dunno.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

93

93

Meta

Summary

Investigating a made-up but moderately concrete story