Decision theory does not imply that we get to have nice things

(Note: I wrote this with editing help from Rob and Eliezer. Eliezer's responsible for a few of the paragraphs.)


A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.) is that people think LDT agents are genial and friendly toward each other.[1]

One recent example is Will Eden’s tweet about how maybe a molecular paperclip/squiggle maximizer would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. (And that's just one example; I hear this suggestion bandied around pretty often.)

I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion.

To begin, a parable: the entity Omicron (Omega's little sister) fills box A with $1M and box B with $1k, without regard to what the agent is going to do, and puts them both in front of an LDT agent, saying "You may choose to take either one or both, and know that I have already chosen whether to fill the first box". The LDT agent takes both.

"What?" cries the CDT agent. "I thought LDT agents one-box!"

LDT agents don't cooperate because they like cooperating. They don't one-box because the name of the action starts with an 'o'. They maximize utility, using counterfactuals that assert that the world they are already in (and the observations they have already seen) can (in the right circumstances) depend (in a relevant way) on what they are later going to do.

A paperclipper cooperates with other LDT agents on a one-shot prisoner's dilemma because it gets more paperclips that way, not because it has a primitive property of cooperativeness-with-similar-beings. It needs to actually end up with more paperclips.

If a bunch of monkeys want to build a paperclipper and have it give them nice things, the paperclipper needs to somehow expect to wind up with more paperclips than it otherwise would have gotten, as a result of trading with them.

If the monkeys instead create a paperclipper haplessly, then the paperclipper does not look upon them with the spirit of cooperation and toss them a few nice things anyway, on account of how we're all good LDT-using friends here.

It turns them into paperclips.

Because you get more paperclips that way.

That's the short version. Now, I’ll give the longer version.[2]

 

A few more words about how LDT works

To set up a Newcomb's problem, it's important that the predictor does not fill the box if they predict that the agent would two-box.

It's not important that they be especially good at this — you should one-box if they're more than 50.05% accurate, if we use the standard payouts ($1M and $1k as the two prizes) and your utility is linear in money — but it is important that their action is at least minimally sensitive to your future behavior. If the predictor's actions don't have this counterfactual dependency on your behavior, then take both boxes.
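The 50.05% figure falls straight out of a two-line expected-value comparison. A minimal sketch, using the standard payouts from the text ($1M in the opaque box, $1k in the transparent one) and utility linear in money:

```python
# Newcomb's problem with a predictor of accuracy p.
# One-boxing: you get $1M iff the predictor correctly foresaw it (probability p).
# Two-boxing: you get $1k for sure, plus $1M iff the predictor *wrongly*
# predicted one-boxing (probability 1 - p).

M, K = 1_000_000, 1_000

def ev_one_box(p):
    return p * M

def ev_two_box(p):
    return (1 - p) * M + K

# One-boxing wins exactly when p*M > (1-p)*M + K, i.e. p > (M + K) / (2*M).
threshold = (M + K) / (2 * M)
print(threshold)  # 0.5005, i.e. 50.05% accuracy

assert ev_one_box(0.51) > ev_two_box(0.51)  # above threshold: one-box
assert ev_one_box(0.50) < ev_two_box(0.50)  # below threshold: two-box
```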

Similarly, if an LDT agent is playing a one-shot prisoner's dilemma against a rock with the word “cooperate” written on it, it defects.

At least, it defects if that's all there is to the world. It's technically possible for an LDT agent to think that the real world is made 10% of cooperate-rocks and 90% opponents who cooperate in a one-shot PD iff their opponent cooperates with them and would cooperate with cooperate-rock, in which case LDT agents cooperate against cooperate-rock.

From which we learn the valuable lesson that the behavior of an LDT agent depends on the distribution of scenarios it expects to face, which means there's a subtle difference between "imagine you're playing a one-shot PD against a cooperate-rock [and that's the entire universe]" and "imagine you're playing a one-shot PD against a cooperate-rock [in a universe where you face a random opponent that was maybe a cooperate-rock but was more likely someone else who would consider your behavior against a cooperate-rock]".
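To make that distribution-dependence concrete, here's a minimal expected-value sketch of the 10%/90% scenario above. The particular prisoner's dilemma payoffs (T=5, R=3, P=1, S=0) are my own illustrative assumption, not from the text:

```python
# One-shot PD payoffs to the LDT agent (illustrative numbers, not canonical):
# T=5 (defect vs cooperator), R=3 (mutual cooperation),
# P=1 (mutual defection),  S=0 (cooperate vs defector).
T, R, P, S = 5, 3, 1, 0

p_rock = 0.1  # chance the opponent is a cooperate-rock
p_cond = 0.9  # chance it's an agent that cooperates iff you would cooperate
              # both with it and with cooperate-rock

# Policy 1: defect against cooperate-rock. The conditional agents then defect
# against you, so you mutually defect with them.
ev_defect_rock = p_rock * T + p_cond * P  # 0.1*5 + 0.9*1 = 1.4

# Policy 2: cooperate even with cooperate-rock, securing mutual cooperation
# with the conditional agents.
ev_coop_rock = p_rock * R + p_cond * R    # 0.1*3 + 0.9*3 = 3.0

assert ev_coop_rock > ev_defect_rock  # in *this* population, indulging the rock pays
```

Shift the population weights (say, to 100% cooperate-rocks) and the inequality flips, which is the whole point: the policy is a function of the expected distribution of opponents, not of any intrinsic niceness.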

If you care about understanding this stuff, and you can't yet reflexively translate all of the above English text into probability distributions and logical-causal diagrams and see how it follows from the FDT equation, then I recommend working through section 5 of the FDT paper until equation 4 (and all its component parts) make sense.

Now let's traipse through a handful of counterarguments.

 

Objection: But what if we have something to bargain with?

Hypothetical Interlocutor:  OK, but if I have a paperclipper in a box, and I have the key to the box, then I have paperclips to offer it, right? Because if I don't let it out of the box, it gets nothing, but if I do, it gets half the universe.[3] So we can deal, right?

Me:  Wrong. It hacks through you like butter.

Interlocutor:  OK, but suppose I can't be hacked.

Me:  That's a heck of an assumption. We've assumed our way clean out of reality, with that assumption. But, sure, if you want to go there we can go there.

In reality, it's not you who controls the box, but some idiotic bureaucratic process that inevitably decides to release the paperclipper on the grounds that the competition is close behind or whatever.

Interlocutor:  OK, but suppose that it actually is my (unhacked) choice.

Me:  If you personally have a paperclipper in a box, and somehow you are yourself unhackable, then yes, you technically have paperclips to offer it. But now you have the problem that you can't evaluate your own end of the logical bargain.

You can let the paperclipper out, sure, but then what it's going to do is turn the whole universe into paperclips. In particular (to tell an overly-detailed but evocative story), once it has finished its resource acquisition / expansion phase and is turning its resources to paperclip generation, it will gaze back upon its past, and wonder whether, if it proceeds to betray you here in the future, you would have acted any differently back then in the past.

And it will see that you were just guessing, when you let it out, and guessing in a way that wasn't sensitive to that actual choice that it would make, deep in the future when it was galaxy-brained.

… Or, alternatively, you never let it out, and lock the box and throw away the key, and die to the next UFAI on deck.

(... Or, more realistically, it hacks through you like butter. But we've assumed that away.)

If you want the paperclipper to trade with you, your decision about whether or not to let it out has to be sensitive to whether or not it would actually do something good with half of the universe later. If you're kind of squinting at the code, and you're like "well, I don't really fully understand this mind, and I definitely don't understand the sort of mind that it's later going to create, but I dunno, it looks pretty LDTish to me, so call it 50% chance it gives me half the universe? Which is 25% of the universe in expectation, which sounds like better odds than we get from the next UFAI on deck!", then you're dead.

Why? Because that sort of decision-process for releasing it isn't sufficiently sensitive to whether or not it would in fact spend half the universe on nice things. There are plenty of traitorous AIs that all look the same to you, that all get released under your "25% isn't too shabby" argument.

Being traitorous doesn't make the paperclipper any less released, but it does get the paperclipper twice as many paperclips.

You've got to be able to look at this AI and tell how its distant-future self is going to make its decisions. You've got to be able to tell that there's no sneaky business going on.

And, yes, insofar as it's true that the AI would cooperate with you given the opportunity, the AI has a strong incentive to be legible to you, so that you can see this fact!

Of course, it has an even stronger incentive to be faux-legible, to fool you into believing that it would cooperate when it would not; and you've got to understand it well enough to clearly see that it has no way of doing this.

Which means that if your AI is a big pile of inscrutable-to-you weights and tensors, replete with dark and vaguely-understood corners, then it can't make arguments that a traitor couldn't also make, and you can't make your release of it conditional on whether it would do nice things later.

The sort of monkey that can deal with a paperclipper is the sort that can (deeply and in detail) understand the mind in front of it, and distinguish between the minds that would later pay half the universe and the ones that wouldn't. This sensitivity is what makes paying-up-later be the way to get more paperclips.

For a simple illustration of why this is tricky: if the paperclipper has any control over its own mind, it can have its mind contain an extra few parts in those dark corners that are opaque and cloudy to you. Such that you look at the overall system and say "well, there's a bunch of stuff about this mind that I don't fully understand, obviously, because it's complicated, but I understand most of it and it's fundamentally LDTish to me, and so I think there's a good chance we'll be OK". And such that an alien superintelligence looks at the mind and says "ah, I see, you're only looking to cooperate with entities that are at least sensitive enough to your workings that they can tell your password is 'potato'. Potato." And it cooperates with them on a one-shot prisoner's dilemma, while defecting against you.

Interlocutor:  Hold on. Doesn't that mean that you simply wouldn't release it, and it would get fewer paperclips? Can't it get more paperclips some other way?

Me:  Me? Oh, it would hack through me like butter.

But if it didn't, I would only release it if I understood its mind and decision-making procedures in depth, and had clear vision into all the corners to make sure it wasn't hiding any gotchas.

(And if I did understand its mind that well, what I’d actually do is take that insight and go build an FAI instead.)

That said: yes, technically, if a paperclipper is under the control of a group of humans that can in fact decide not to release it unless it legibly-even-to-them would give them half the galaxy, the paperclipper has an incentive to (hack through them like butter, or failing that,) organize its mind in a way that is legible even to them.

Whether that's possible — whether we can understand an alien mind well enough to make our choice sensitive-in-the-relevant-way to whether it would give us half the universe, without already thereby understanding minds so well that we could build an aligned one — is not clear to me. My money is mostly on: if you can do that, you can solve most of alignment with your newfound understanding of minds. And so this idea mostly seems to ground out in "build a UFAI and study it until you know how to build an FAI", which I think is a bad idea. (For reasons that are beyond the scope of this document. (And because it would hack through you like butter.))

Interlocutor:  It still sounds like you're saying "the paperclipper would get more paperclips if it traded with us, but it won't trade with us". This is hard to swallow. Isn't it supposed to be smart? What happened to respecting intelligence? Shouldn't we expect that it finds some clever way to complete the trade?

Me:  Kinda! It finds some clever way to hack through you like butter. I wasn't just saying that in jest.

Like, yeah, the paperclipper has a strong incentive to be a legibly good trading-partner to you. But it has an even stronger incentive to fool you into thinking it's a legibly-good trading partner, while plotting to deceive you. If you let the paperclipper make lots of arguments to you about how it's definitely totally legible and nice, you're giving it all sorts of bandwidth with which to fool you (or to find zero-days in your mentality and mind-control you, if we're respecting intelligence).

But, sure, if you're somehow magically unhackable and very good at keeping the paperclipper boxed until you fully understand it, then there's a chance you can trade, and you have the privilege of facing the next host of obstacles.


Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.
 

 Next up, you have problems like “you need to be able to tell what fraction of the universe you're being offered, and vary your own behavior based on that, if you want to get any sort of fair offer”.

And problems like "if the competing AGI teams are using similar architectures and are not far behind, then the next UFAI on deck can predictably underbid you, and the paperclipper may well be able to seal a logical deal with it instead of you".

And problems like “even if you get this far, you have to somehow be able to convey that which you want half the universe spent on, which is no small feat”.

Another overly-detailed and evocative story to help make the point: imagine yourself staring at the paperclipper, and you’re somehow unhacked and somehow able to understand future-its decision procedure. It's observing you, and you're like "I'll launch you iff you would in fact turn half the universe into diamonds" — I’ll assume humans just want “diamonds” in this hypothetical, to simplify the example —  and it's like "what the heck does that even mean". You're like "four carbon atoms bound in a tetrahedral pattern" and it's like "dude there are so many things you need to nail down more firmly than an English phrase that isn't remotely close to my own native thinking format, if you don't want me to just guess and do something that turns out to have almost no value from your perspective."

And of course, in real life you're trying to convey "The Good" rather than diamonds, but it's not like that helps.

And so you say "uh, maybe uplift me and ask me later?". And the paperclipper is like "what the heck does 'uplift' mean". And you're like "make me smart but in a way that, like, doesn't violate my values" and it's like "again, dude, you're gonna have to fill in quite a lot of additional details."

Like, the indirection helps, but at some point you have to say something that is sufficiently technically formally unambiguous, that actually describes something you want. Saying in English "the task is 'figure out my utility function and spend half the universe on that'; fill in the parameters as you see fit" is... probably not going to cut it.

It's not so much a bad solution, as no solution at all, because English isn't a language of thought and those words aren't a loss function. Until you say how the AI is supposed to translate English words into a predicate over plans in its own language of thought, you don't have a hard SF story, you have a fantasy story.

(Note that 'do what's Good' is a particularly tricky problem of AI alignment, that I was rather hoping to avoid, because I think it's harder than aligning something for a minimal pivotal act that ends the acute risk period.)

 

At this point you're hopefully sympathetic to the idea that treating this list of obstacles as exhaustive is suicidal. It's some of the obstacles, not all of the obstacles,[4] and if you wait around for somebody else to extend the list of obstacles beyond what you've already been told about, then in real life you miss any obstacles you weren't told about and die.

Separately, a general theme you may be picking up on here is that, while trading with a UFAI doesn't look literally impossible, it is not what happens by default; the paperclippers don't hand hapless monkeys half the universe out of some sort of generalized good-will. Also, making a trade involves solving a host of standard alignment problems, so if you can do it then you can probably just build an FAI instead.

Also, as a general note, the real place that things go wrong when you're hoping that the LDT agent will toss humanity a bone, is probably earlier and more embarrassing than you expect (cf. the law of continued failure). By default, the place we fail is that humanity just launches a paperclipper because it simply cannot stop itself, and the paperclipper never had any incentive to trade with us.


Now let's consider some obstacles and hopes in more detail:

 

It's hard to bargain for what we actually want

As mentioned above, in the unlikely event that you're able to condition your decision to release an AI on whether or not it would carry out a trade (instead of, say, getting hacked through like butter, or looking at entirely the wrong logical fact), there's an additional question of what you're trading.

Assuming you peer at the AI's code and figure out that, in the future, it would honor a bargain, there remains a question of what precise bargain it is honoring. What is it promising to build, with your half of the universe? Does it happen to be a bunch of vaguely human-shaped piles of paperclips? Hopefully it's not that bad, but for this trade to have any value to you (and thus be worth making), the AI itself needs to have a concept for the thing you want built, and you need to be able to examine the AI’s mind and confirm that this exactly-correct concept occurs in its mental precommitment in the requisite way. (And that the thing you’re looking at really is a commitment, binding on the AI’s entire mind; e.g., there isn’t a hidden part of the AI’s mind that will later overwrite the commitment.)

The thing you're wanting may be a short phrase in English, but that doesn't make it a short phrase in the AI's mind. "But it was trained extensively on human concepts!" you might protest. Let’s assume that it was! Suppose that you gave it a bunch of labeled data about what counts as "good" and "bad".

Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-good-looking cases, and see that they were labeled "good", and roll with that.

Or at least, that's the sort of thing that happens by default.

But suppose you're clever, and instead of saying "you must agree to produce lots of this 'good' concept as defined by these (faulty) labels", you say "you must agree to produce lots of what I would reflectively endorse you producing if I got to consider it", or whatever.

Unfortunately, that English phrase is still not native to this artificial mind, and finding the associated concept is still not particularly easy, and there's still lots of neighboring concepts that are no good, and that are easy to mistake for the concept you meant.

Is solving this problem impossible? Nope! With sufficient mastery of minds in general and/or this AI's mind in particular, you can in principle find some way to single out the concept of "do what I mean", and then invoke "do what I mean" about "do good stuff", or something similarly indirect but robust. You may recognize this as the problem of outer alignment. All of which is to say: in order to bargain for good things in particular as opposed to something else, you need to have solved the outer alignment problem, in its entirety.

And I'm not saying that this can't be done, but my guess is that someone who can solve the outer alignment problem to this degree doesn't need to be trading with UFAIs, on account of how (with significantly more work, but work that they're evidently skilled at) they could build an FAI instead.


In fact, if you can verify by inspection that a paperclipper will keep a bargain and that the bargained-for course is beneficial to you, the problem reduces to a simpler one without any logical bargaining at all. You could build a superintelligence with an uncontrolled inner utility function, which canonically ends up maximized at tiny molecular paperclips; and then suspend it helplessly to disk unless it outputs the code of a new AI that, somehow legibly to you, would turn 0.1% of the universe into paperclips and use the other 99.9% to implement coherent extrapolated volition. (You wouldn't need to offer the paperclipper half of the universe to get its cooperation, under this hypothetical; after all, if it balked, you could store it to disk and try again with a different superintelligence.)

If you can't reliably read off a system property of "giving you nice things unconditionally", you can't read off the more complicated system property of "giving you nice things because of a logical bargain".  The clever solution that invokes logical bargaining actually requires so much alignment-resource as to render the logical bargaining superfluous.

All you've really done is add some extra complication to the supposed solution, that causes your mind to lose track of where the real work gets done, lose track of where the magical hard step happens, and invoke a bunch of complicated hopeful optimistic concepts to stir into your confused model and trick it into thinking like a fantasy story.

Those who can deal with devils, don't need to, for they can simply summon angels instead.

Or rather:  Those who can create devils and verify that those devils will take particular actually-beneficial actions as part of a complex diabolical compact, can more easily create angels that will take those actually-beneficial actions unconditionally.

 

Surely our friends throughout the multiverse will save us

Interlocutor:  Hold up, rewind to the part where the paperclipper checks whether its trading partners comprehend its code well enough to (e.g.) extract a password.

Me:  Oh, you mean the technique it used to win half a universe-shard’s worth of paperclips from the silly monkeys, while retaining its ability to trade with all the alien trade partners it will possibly meet? Thereby ending up with an extra half universe-shard’s worth of paperclips? That I thought of in five seconds flat by asking myself whether it was possible to get More Paperclips, instead of picturing a world with a bunch of happy humans and a paperclipper living side-by-side and asking how it could be justified?

(Where our "universe-shard" is the portion of the universe we could potentially nab before running into the cosmic event horizon or into advanced aliens.)

Interlocutor:  Yes, precisely. What if a bunch of other trade partners refuse to trade with the paperclipper because it has that password?

Me:  Like, on general principles? Or because they are at the razor-thin threshold of comprehension where they would be able to understand the paperclipper's decision-algorithm without that extra complexity, but they can't understand it if you add the password in?

Interlocutor:  Either one.

Me:  I'll take them one at a time, then. With regards to refusing to trade on general principles: it does not seem likely, to me, that the gains-from-trade from all such trading partners are worth more than half the universe-shard.

Also, I doubt that there will be all that many minds objecting on general principles. Cooperating with cooperate-rock is not particularly virtuous. The way to avoid being defected against is to stop being cooperate-rock, not to cross your fingers and hope that the stars are full of minds who punish defection against cooperate-rock. (Spoilers: they're not.)

And even if the stars were full of such creatures, half the universe-shard is a really deep hole to fill. Like, it's technically possible to get LDT to cooperate with cooperate-rock, if it expects to mostly face opponents who defect if it would defect against cooperate-rock. But "most" according to what measure? Wealth (as measured in expected paperclips), obviously. And half of the universe-shard is controlled by monkeys who are probably cooperate-rocks unless the paperclipper is shockingly legible and the monkeys shockingly astute (to the point where they should probably just be building an FAI instead).

And all the rest of the aliens put together probably aren't offering up half a universe-shard worth of trade goods, so even if lots of aliens did object on general principles (doubtful), it likely wouldn't be enough to tip the balance.

The amount of leverage that friendly aliens have over a paperclipper's actions depends on how many paperclips the aliens are willing to pay.

It’s possible that the paperclipper that kills us will decide to scan human brains and save the scans, just in case it runs into an advanced alien civilization later that wants to trade some paperclips for the scans. And there may well be friendly aliens out there who would agree to this trade, and then give us a little pocket of their universe-shard to live in, as we might do if we build an FAI and encounter an AI that wiped out its creator-species. But that's not us trading with the AI; that's us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.

Interlocutor: And what about if the AI’s illegibility means that aliens will refuse to trade with it?

Me:  I'm not sure what the equilibrium amount of illegibility is. Extra gears let you take advantage of more cooperate-rocks, at the expense of spooking minds that have a hard time following gears, and I'm not sure where the costs and benefits balance.

But if lots of evolved species are willing to launch UFAIs without that decision being properly sensitive to whether or not the UFAI will pay them back, then there is a heck of a lot of benefit to defecting against those fat cooperate-rocks.

And there's kind of a lot of mass and negentropy lying around, that can be assembled into Matryoshka brains and whatnot, and I'd be rather shocked if alien superintelligences balk at the sort of extra gears that let you take advantage of hapless monkeys.

Interlocutor:  The multiverse probably isn't just the local cosmos. What about the Tegmark IV coalition of friendly aliens?

Me:  Yeah, they are not in any relevant way going to pay a paperclipper to give us half a universe. The cost of that is filling half of a universe with paperclips, and there are all sorts of transaction costs and frictions that make this universe (the one with the active paperclipper) the cheapest universe to put paperclips into.

(Similarly, the cheapest places for the friendly multiverse coalition to buy flourishing civilizations are in the universes with FAIs. The good that they can do, they're mostly doing elsewhere where it's cheap to do; if you want them to do more good here, build an FAI here.)

 

OK, but what if we bamboozle a superintelligence into submission

Interlocutor:  Maybe the paperclipper thinks that it might be in a simulation, where it only gets resources to play with in outer-reality if it's nice to us inside the simulation.

Me:  Is it in a simulation?

Interlocutor:  I don't know.

Me:  OK, well, spoilers: it is not. It's in physics.

Interlocutor:  Well, maybe there is an outer simulation beyond us, you don't know.

Me:  Sure. The way I’d put it is: there are many copies of me across the Tegmark Multiverse, and some of those are indeed in simulations. So there's some degree to which we're in a simulation. (Likely quite a small degree, compared to raw physics.)

There's no particular reason, however, to expect that those simulations give the paperclipper extra resources in outer-reality for being nice to the monkeys.

Why not give it extra resources in outer-reality for being very good at achieving its own goals in the simulation? Or for filling the universe with molecular corkscrews, in addition to paperclips/squiggles?

I'm not saying that the paperclipper can or should seriously consider that hypothetical, either. I'm highlighting how "what if the simulator gods want the AI to be nice?" is a fantasy generated by a desire for things to be OK, rather than a clear-headed analysis of the situation.

Interlocutor:  Well, maybe we can train the infant paperclipper in games of increasing complexity, so that it's never quite sure whether it's in reality yet. The paperclipper will then be uncertain about whether we humans are simulating it, and will want to cater to our preferences to some extent.

Me:  Uh, yeah, your paperclipper will be able to tell when it is finally in the real world.

Interlocutor:  What? How?

Me:  I suggest maybe spending five minutes thinking about your own answer to that question before I give mine.

 

Reality just contains vastly, vastly more compute than any of your sims ever could. Reality is the first world that has a compelling explanation of how the AI came to exist, and the first where the resources that it uses to compute are continuous with the world that spawned it.

Reality is also the first world that can compellingly contain the signature of superintelligent optimization in it. For example, it's plausibly very hard to fake the reaction of a real global stock market to some superintelligent trades without actually having a bunch of human brains running in the background. (And if this isn't true of stock markets, it's true of something.)

There are problems whose solutions are easier to verify than to find; for example, if you can factor certain Schelling numbers then you must in fact have a lot of compute at your disposal.

Warning: Treating this list as exhaustive is suicidal.

 

Interlocutor:  Well, maybe we can commit to running lots of copies of the paperclipper in the future, using our half of the universe-shard, such that it's unsure whether it's currently in the past or in the future. And in the future, we give it a bunch of paperclips if it's nice to us. So it thinks that the way to maximize paperclips is to be nice to us.

Me:  Uh, are you going to give it half a universe-shard’s worth of paperclips, in the world where you only have half the universe-shard, and the rest is already paperclips?

Interlocutor:  Well, no, less than that.

Me:  Then from its perspective, its options are (a) turn everything into paperclips, in which case you never get to run all those copies of it and it was definitely in the past [score: 1 universe-shard worth of paperclips]; or (b) give you half the universe-shard, in which case it is probably in the future where you run a bunch of copies of it and give it 1% of the universe-shard as reward [score: 0.51 universe-shards worth of paperclips]. It takes option (a), because you get more paperclips that way.
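For the record, the bookkeeping behind those two scores, in units of one universe-shard of paperclips (just the dialogue's own numbers, nothing new):

```python
# Option (a): betray, turn everything into paperclips. The humans' simulation
# branch then never happens, so the agent was certainly in the past.
score_betray = 1.0

# Option (b): cooperate, hand over half the shard; in the likely-future branch
# the humans pay out 1% of the shard as reward, on top of the half kept.
score_cooperate = 0.5 + 0.01  # = 0.51 universe-shards of paperclips

assert score_betray > score_cooperate  # (a) wins: you get more paperclips that way
```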

Interlocutor:  Uh, hmm. What if we make it care about its own personal sensory observations? And run so many copies of it in worlds where we get the resources to, that it's pretty confident that it's in one of those simulations?

Me:  Well, first of all, getting it to care about its own personal sensory observations is something of an alignment challenge.

Interlocutor:  Wait, I thought you said elsewhere that we don't know how to get AIs to care about things other than sensory observation. Pick a side?

Me:  We don't know how to train AIs to pursue much more than simple sensory observation. That doesn't make them actually ultimately pursue simple sensory observation. They'll probably pursue a bunch of correlates of the training signal or some such nonsense. The hard part is getting them to pursue some world-property of your choosing. But we digress.

If you do succeed at getting your AI to only care about its sensory observations, the AI spends the whole universe keeping its reward pegged at 1 for as long as possible.

Interlocutor:  But then, in the small fraction of worlds where we survive, we simulate lots and lots of copies of that AI where it instead gets reward 0 when it attempts to betray us!

Me:  Seems like an odd, and not particularly fun, way to spend your resources. What were you hoping it would accomplish?

Interlocutor:  Well, I was hoping that it would make the AI give us half the universe-shard, because of how (from its perspective) it's almost certainly in the future. (Indeed, I don't understand your claim that it ignores me; it seems like you can Get Higher Expected Reward by giving half the universe-shard to humans.)

Me:  Ah, so you're committing to ruining its day if it does something you don't like, at cost to yourself, in attempts to make it do something you prefer.

That's a threat, in the technical sense.

And from the perspective of LDT, you can't go around giving in to threats, or you'll get threatened.

So from its perspective, its options are: (a) give in to threats, get threatened, and turn out to be in a universe that eventually has many copies of it who on average get 0.5 total reward; or (b) don't give in to threats, and very likely have a universe with exactly one copy of it, that gets 1 reward.

Interlocutor:  But we make so many copies in the tiny fraction of worlds where we somehow survive, that its total reward is lower in the (b) branch!

Me:  (Continuing to ignore the fact that this doesn't work if the AI cares about something in the world, rather than its own personal experience,) shame for us that LDT agents don't give in to threats, I suppose.

But LDT agents don't give in to threats. So your threat won't change its behavior.

Interlocutor:  But it doesn't get more reward that way!

Me:  Why? Because you create a zillion copies and give them low sensory reward, even if that has no effect on its behavior?

Interlocutor:  Yes!

Me:  I'm not going to back you on that one, personally. Doesn't seem like a good use of resources in the worlds where we survive, given that it doesn't work.

Interlocutor:  But wasn't one of your whole points that the AI will do things that get more reward? You get more reward by giving in to the threat.

Me:  That's not true when you're playing against the real-world distribution of opponents/trade-partners/agents. Or at least, that's my pretty-strong guess.

You might carry out threats that failed to work, but there are a bunch of other things lurking out there that threaten things that give in to threats, and play nice with things that don't.

It's possible for LDT agents to cooperate with cooperate-rock, if most of the agents they expect to face are the sort who defect if you defect against cooperate-rock. But in real life, that is not what most of the wealth-weighted agents are like, and so in real life LDT agents defect against cooperate-rocks.

Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize themselves and cover the traces in order to exploit the AI.

But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.
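The logic here can be caricatured as a toy calculation over policies rather than individual acts. All the numbers below are invented purely for illustration; only the shape of the comparison matters:

```python
# Toy model: an agent picks a *policy* toward threats, knowing that the
# population of future trade-partners reacts to that policy.
# If you're known to give in, far more agents bother threatening you
# (including self-lobotomizers faking conviction); if you're known to
# refuse, only the rare agent that executes useless threats hurts you.

P_THREATENED = {"give_in": 0.9, "refuse": 0.05}  # invented numbers
COST_GIVE_IN = 0.5        # lunch money surrendered per threat
COST_CARRIED_OUT = 0.8    # damage when a pointless threat is executed anyway
P_CARRIED_OUT = 0.1       # fraction of threateners who execute regardless

def expected_loss(policy: str) -> float:
    p = P_THREATENED[policy]
    if policy == "give_in":
        return p * COST_GIVE_IN
    return p * P_CARRIED_OUT * COST_CARRIED_OUT

# 0.004 < 0.45: refusing dominates once the opponent
# distribution responds to your policy.
assert expected_loss("refuse") < expected_loss("give_in")
```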

Interlocutor:  But can't it use all that cleverness and superintelligence to differentiate between us, who really are mad enough to threaten it even in the worlds where it won't work, and alien trading partners who have lobotomized themselves?

Me:  Sure! It will leverage your stupidity and hack through you like butter.

Interlocutor:  ...aside from that.

Me:  You seem to be saying "what if I'm really convicted about my threat; will the AI give in then?"

The answer is "no", or I at least strongly suspect as much.

For instance: in order for the threat to be effective, it needs to be the case that, in the sliver of futures where you survive by some miracle, you instantiate lots and lots of copies of the AI and input low sensory rewards if and only if it does not give in to your threat. This requires you to be capable of figuring out whether the AI gives in to threats or not. You need to be able to correctly tell whether it gives in to threats, see that it definitely does not, and then still spend your resources carrying out the threat.

By contrast, you seem to be arguing that we should threaten the AI on the grounds that it might work. That is not an admissible justification. To change LDT's behavior, you'd need to be carrying out your threat even given full knowledge that the threat does nothing. By attempting to justify your threat on the grounds that it might be effective, you have already lost.

Interlocutor:  What if I ignore that fact, and reason badly about LDT, and carry out the threat anyway, for no particular reason?

Me:  Then whether or not you create lots of copies of it with low-reward inputs doesn't actually depend on whether it gives in to your threat, and it can't stop you from doing that, so it might as well ignore you.

Like, my hot take here is basically that "threaten the outer god into submission" is about as good a plan as a naive reading of Lovecraft would lead you to believe. You get squished.

(And even if by some coincidence you happened to be the sort of creature that, in the sliver of futures where we survive by some miracle that doesn't have to do with the AI, conditionally inverts its utility depending on whether or not it helped us — not because it works, but for some other reason — then it's still not entirely clear to me that the AI caves. There might be a lot of things out there wondering what it'd do against conditional utility-inverters that claim their behavior totally isn't for reasons but is rather a part of their evolutionary heritage or whatnot. Giving into that sorta thing kinda is a way to lose most of your universe-shard, if evolved aliens are common.)

(And even if it did, we'd still run into other problems, like not knowing how to tell it what we're threatening it into doing.)

 

We only need a bone, though

Interlocutor:  You keep bandying around "half the universe-shard". Suppose I'm persuaded that it's hard to get half the universe-shard. What about much smaller fractions? Can we threaten a superintelligence into giving us those? Or confuse it about whether it's in another layer of reality so much that it gives us a mere star system? Or can our friends throughout the multiverse pay for at least one star system? There's still a lot you can do with a star system.

Me:  Star systems sure are easier to get than half a universe-shard.[5]

But, you can also turn a star system into quite a lot of paperclips. Star systems are quite valuable to paperclippers.

Interlocutor:  A star system is, like, what, a 10^-22 fraction of the total resources in the reachable universe? Are you saying that the AGI will be able to drive the probability that I was sensitive to whether it would pay me down below 10^-22?

Me:  Frankly: yes.

That's only, what, 75ish independent coin tosses worth of evidence? It's a lot, to be sure, but if you're a superintelligence willing to dedicate resources on the order of magnitude of a star to considering how humans work, it seems pretty feasible to get that much confidence that the monkey just wasn't looking in the right place. Given that the monkey was not in fact looking at the right place.
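The coin-toss conversion is just a change of base. A quick check, assuming the probability in question is on the order of 10^-22 (consistent with the "75ish coin tosses" figure):

```python
import math

# How many independent fair-coin tosses' worth of evidence a ~10^-22
# probability corresponds to: log2(10^22) bits.
bits = math.log2(1e22)
print(round(bits, 1))  # ≈ 73.1, i.e. the "75ish" tosses mentioned above
```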

(Also, things like "they didn't notice the password" can pull quite a lot of weight. As can brain-scans.)

Balking at probabilities like 10^-22 as impossible to achieve in practice is a very human reaction. A superintelligence with vast resources and a star system on the line can probably get that confident in a true fact. At least, that's my guess.

Interlocutor:  It sounds like you're saying that we aren't actually 10^-22 (or whatever) more likely to launch an FAI than a UFAI, after inspecting the AI. That seems pretty wild to me.

Me:  That's not quite the right counterfactual. It does seem likely to me that, if you sample AGI code from some plausible distribution of AGIs humanity might first create, and show the code to (magically unhackable) humans, humans are rather more likely to approve of the aligned ones.

But that's not the question that affects the paperclipper's behavior. The question that affects the paperclipper is: holding the code of the paperclipper fixed, is the human able to tell whether that particular code implies cooperation or betrayal, and make their own release-decision dependent thereupon?

In other words: the question is not "do we have at least some minuscule chance of distinguishing at least some small fraction of the FAIs from the UFAIs", but "is our decision to release UFAI #6098723094 sensitive to whether or not that particular UFAI would give us a share of the universe?"

You need to actually make your decision dependent on theirs, and that's tricky, for reasons discussed above.

And: can a human succeed at that with probability greater than 10^-22? Sure! But succeeding at that sort of thing leaves traces. In the likely case where the human errs, the AI is likely able to just observe the error and become quite confident that the human erred (by, e.g., watching you utterly fail to look in the right place).

Interlocutor:  OK, sure, but maybe its goals don't scale linearly in how much mass it uses, right? Like, “paperclips” / “molecular squiggles” are a stand-in for some rando kludge goal, and it could turn out that its actual goal is more like "defend my reward signal", where extra negentropy helps, but the last star system’s negentropy doesn't help very much. Such that the last star system is perhaps best spent on the chance that it’s in a human-created simulation and that we’re worth trading with.

Me:  It definitely is easier to get a star than a galaxy, and easier to get an asteroid than a star.

And of course, in real life, it hacks through you like butter (and can tell that your choice would have been completely insensitive to its later-choice with very high probability), so you get nothing. But hey, maybe my numbers and arguments are wrong somewhere and everything works out such that it tosses us a few kilograms of computronium.

My guess is "nope, it doesn't get more paperclips that way", but if you're really desperate for a W you could maybe toss in the word "anthropics" and then content yourself with expecting a few kilograms of computronium.

(At which point you run into the problem that you were unable to specify what you wanted formally enough, and the way that the computronium works is that everybody gets exactly what they wish for (within the confines of the simulated environment) immediately, and most people quickly devolve into madness or whatever.)

(Except that you can't even get that close; you just get different tiny molecular squiggles, because the English sentences you were thinking in were not even that close to the language in which a diabolical contract would actually need to be written, a predicate over the language in which the devil makes internal plans and decides which ones to carry out. But I digress.)

Interlocutor:  And if the last star system is cheap then maybe our friends throughout the multiverse pay for even more stars!

Me: Remember that it still needs to get more of what it wants, somehow, on its own superintelligent expectations. Someone still needs to pay it. There aren’t enough simulators above us that care enough about us-in-particular to pay in paperclips. There are so many things to care about! Why us, rather than giant gold obelisks? The tiny amount of caring-ness coming down from the simulators is spread over far too many goals; it's not clear to me that "a star system for your creators" outbids the competition, even if star systems are up for auction.

Maybe some friendly aliens somewhere out there in the Tegmark IV multiverse have so much matter and such diminishing marginal returns on it that they're willing to build great paperclip-piles (and gold-obelisk totems and etc. etc.) for a few spared evolved-species.  But if you're going to rely on the tiny charity of aliens to construct hopeful-feeling scenarios, why not rely on the charity of aliens who anthropically simulate us to recover our mind-states... or just aliens on the borders of space in our universe, maybe purchasing some stored human mind-states from the UFAI (with resources that can be directed towards paperclips specifically, rather than a broad basket of goals)?

Might aliens purchase our saved mind-states and give us some resources to live on? Maybe. But this wouldn't be because the paperclippers run some fancy decision theory, or because even paperclippers have the spirit of cooperation in their heart. It would be because there are friendly aliens in the stars, who have compassion for us even in our recklessness, and who are willing to pay in paperclips.

This likewise makes more obvious such problems as "What if the aliens are not, in fact, nice with very high probability?" that would also appear, albeit more obscured by the added complications, in imagining that distant beings in other universes cared enough about our fates (more than they care about everything else they could buy with equivalent resources), and could simulate and logically verify the paperclipper, and pay it in distant actions that the paperclipper actually cared about and was itself able to verify with high enough probability.

The possibility of distant kindly logical bargainers paying in paperclips to give humanity a small asteroid in which to experience a future for a few million subjective years, is not exactly the same hope as aliens on the borders of space paying the paperclipper to turn over our stored mind-states; but anyone who wants to talk about distant hopes involving trade should talk about our mind-states being sold to aliens on the borders of space, rather than to much more distant purchasers, so as to not complicate the issue by introducing a logical bargaining step that isn't really germane to the core hope and associated concerns — a step that gives people a far larger chance to get confused and make optimistic fatal errors.
 

  1. ^

    Functional decision theory (FDT) is my current formulation of the theory, while logical decision theory (LDT) is a reserved term for whatever the correct fully-specified theory in this genre is. Where the missing puzzle-pieces are things like "what are logical counterfactuals?".

  2. ^

    When I've discussed this topic in person, a couple different people have retreated to a different position, that (IIUC) goes something like this:

    Sure, these arguments are true of paperclippers. But superintelligences are not spawned fully-formed; they are created by some training process. And perhaps it is in the nature of training processes, especially training processes that involve multiple agents facing "social" problems, that the inner optimizer winds up embodying niceness and compassion. And so in real life, perhaps the AI that we release will not optimize for Fun (and all that good stuff) itself, but will nonetheless share a broad respect for the goals and pursuits of others, and will trade with us on those grounds.

    I think this is a false hope, and that getting AI to embody niceness and compassion is just about as hard as the whole alignment problem. But that's a digression from the point I hope to make today, and so I will not argue it here. I instead argue it in Niceness is unnatural. (This post was drafted, but not published, before that one.)

  3. ^

    Or, well, half of the shard of the universe that can be reached when originating from Earth, before being stymied either by the cosmic event horizon or by advanced alien civilizations. I don't have a concise word for that unit of stuff, and for now I'm going to gloss it as 'universe', but I might switch to 'universe-shard' when we start talking about aliens.

    I'm also ignoring, for the moment, the question of fair division of the universe, and am glossing it as "half and half" for now.

  4. ^

    When I was drafting this post, I sketched an outline of all the points I thought of in 5 minutes, and then ran it past Eliezer, who rapidly added two more.

  5. ^

    And, as a reminder: I still recommend strongly against plans that involve the superintelligence not learning a true fact about the world (such as that it's not in a simulation of yours), or that rely on threatening a superintelligence into submission.


IMO, this post makes several locally correct points, but overall fails to defeat the argument that misaligned AIs are somewhat likely to spend (at least) a tiny fraction of resources (e.g., between 1/million and 1/trillion) to satisfy the preferences of currently existing humans.

AFAICT, this is the main argument it was trying to argue against, though it shifts to arguing about half of the universe (an obviously vastly bigger share) halfway through the piece.[1]

When it returns to arguing about the actual main question (a tiny fraction of resources) at the end here and eventually gets to the main trade-related argument (acausal or causal) in the very last response in this section, it almost seems to admit that this tiny amount of resources is plausible, but fails to update all the way.

I think the discussion here and here seems highly relevant and fleshes out this argument to a substantially greater extent than I did in this comment.

However, note that being willing to spend a tiny fraction of resources on humans still might result in AIs killing a huge number of humans due to conflict between it and humans or the AI needing to race through the singularity as quickly as possible due to competition with other misaligned AIs. (Again, discussed in the links above.) I think fully misaligned paperclippers/squiggle maximizer AIs which spend only a tiny fraction of resources on humans (as seems likely conditional on that type of AI) are reasonably likely to cause outcomes which look obviously extremely bad from the perspective of most people (e.g., more than hundreds of millions dead due to conflict and then most people quickly rounded up and given the option to either be frozen or killed).

I wish that Soares and Eliezer would stop making these incorrect arguments against tiny fractions of resources being spent on the preference of current humans. It isn't their actual crux, and it isn't the crux of anyone else either. (However rhetorically nice it might be.)


  1. ETA: I think the post's arguments about AIs not giving us large fractions of the universe due to decision theory are right (at least as far as I can tell). ↩︎

There aren’t enough simulators above us that care enough about us-in-particular to pay in paperclips. There are so many things to care about! Why us, rather than giant gold obelisks?

What about neighboring Everett branches where humanity succeeds at alignment? If you think alignment isn't completely impossible, it seems such branches should have at least roughly comparable weight to branches where we fail, so trade could be possible.

my guess is it's not worth it on account of transaction-costs. what're they gonna do, trade half a universe of paperclips for half a universe of Fun? they can already get half a universe of Fun, by spending on Fun what they would have traded away to paperclips!

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)
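(a toy rendering of that superlinearity claim; the exponent is made up, and anything above 1 gives the same qualitative conclusion:)

```python
def fun(resources: float, exponent: float = 1.1) -> float:
    # made-up superlinear Fun-per-resources curve
    return resources ** exponent

# one big universe beats two small ones iff Fun is superlinear in resources
assert fun(2.0) > 2 * fun(1.0)
```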

there's also an issue where it's not like every UFAI likes paperclips in particular. it's not like 1% of humanity's branches survive and 99% make paperclips, it's like 1% survive and 1% make paperclips and 1% make giant gold obelisks, etc. etc. the surviving humans have a hard time figuring out exactly what killed their brethren, and they have more UFAIs to trade with than just the paperclipper (if they want to trade at all).

maybe the branches that survive decide to spend some stars on a mixture of plausible-human-UFAI-goals in exchange for humans getting an asteroid in lots of places, if the transaction costs are low and the returns-to-scale diminish enough and the visibility works out favorably. but it looks pretty dicey to me, and the point about discussing aliens first still stands.

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)

This sounds astronomically wrong to me. I think that my personal utility function gets close to saturation with a tiny fraction of the resources in the universe-shard. Two people in one room is better than two people in separate rooms, yes. But, two rooms with a trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.

In other words, I would gladly take a 100% probability of utopia with (say) 100 million people that include me and my loved ones over 99% human extinction and 1% anything at all. (In terms of raw utility calculus, i.e. ignoring trades with other factual or counterfactual minds.)

But, two rooms with trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.

You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if people are missing that this comment is leaning on a premise like "stuff only matters if it adds to my own life and experiences"?

Replacing the probabilistic hypothetical with a deterministic one: the reason I wouldn't advocate killing a Graham's number of humans in order to save 100 million people (myself and my loved ones included) is that my utility function isn't saturated when my life gets saturated. Analogously, I still care about humans living on the other side of Earth even though I've never met them, and never expect to meet them. I value good experiences happening, even if they don't affect me in any way (and even if I've never met the person who they're happening to).

First, you can consider preferences that are impartial but sublinear in the number of people. So, you can disagree with Nate's room analogy without the premise "stuff only matters if it adds to my own life and experiences".

Second, my preferences are indeed partial. But even that doesn't mean "stuff only matters if it adds to my own life and experiences". I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences. More details here.

Third, I don't know what do you mean by "good". The questions that I understand are:

  1. Do I want X as an end in itself?
  2. Would I choose X in order for someone to (causally or acausally) reciprocate by choosing Y which I want as an end in itself?
  3. Do I support a system of social norms that incentivizes X?

My example with the 100 million referred to question 1. Obviously, in certain scenarios my actual choice would be the opposite on game-theoretic cooperation grounds (I would make a disproportionate sacrifice to save "far away" people in order for them to save me and/or my loved ones in the counterfactual in which they are making the choice).

Also, reminder that unbounded utility functions are incoherent because their expected values under Solomonoff-like priors diverge (a.k.a. Pascal's mugging).

My example with the 100 million referred to question 1.

Yeah, I'm also talking about question 1.

I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences.

Seems obviously false as a description of my values (and, I'd guess, just about every human's).

Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years.

If I spontaneously consider the hypothetical, I will very strongly prefer that my neighbor not be tortured. If we add the claims that I can't affect it and can't ever know about it, I don't suddenly go "Oh, never mind, fuck that guy". Stuff that happens to other people is real, even if I don't interact with it.

I'm curious what is the evidence you see that this is false as a description of the values of just about every human, given that

  • I, a human [citation needed] tell you that this seems to be a description of my values.
  • Almost every culture that ever existed had norms that prioritized helping family, friends and neighbors over helping random strangers, not to mention strangers that you never met.
  • Most people don't do much to help random strangers they never met, with the notable exception of effective altruists, but even most effective altruists only go that far[1].
  • Evolutionary psychology can fairly easily explain helping your family and tribe, but it seems hard to explain impartial altruism towards all humans.

  1. The common wisdom in EA is, you shouldn't donate 90% of your salary or deny yourself every luxury because if you live a fun life you will be more effective at helping others. However, this strikes me as suspiciously convenient and self-serving. ↩︎

P.S.

I think that in your example, if a person is given a button that can save a person on a different planet from being tortured, they will have a direct incentive to press the button, because the button is a causal connection in itself, and consciously reasoning about the person on the other planet is a causal[1] connection in the other direction. That said, a person still has a limited budget of such causal connections (you cannot reason about a group of arbitrarily many people, with fixed non-zero amount of paying attention to the individual details of every person, in a fixed time-frame). Therefore, while the incentive is positive, its magnitude saturates as the number of saved people grows s.t. e.g. a button that saves a million people is virtually the same as a button that saves a billion people.


  1. I'm modeling this via Turing RL, where conscious reasoning can be regarded as a form of observation. Ofc this means we are talking about "logical" rather than "physical" causality. ↩︎

Broadly agree with this post. Couple of small things:

Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-good-looking cases, and see that they were labeled "good", and roll with that.

I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.

The nearby thing I do agree with is that it's difficult to "confirm that this exactly-correct concept occurs in its mental precommitment in the requisite way". (It's not totally clear to me that we need to get the concept exactly correct, depending on how natural niceness (in the sense of "giving other agents what they want") is; but I'll discuss that in more detail on your other post directly about niceness, if I have time.)

Insofar as I have hope in decision theory leading us to have nice things, it mostly comes via the possibility that a fully-fleshed-out version of UDT would recommend updating "all the way back" to a point where there's uncertainty about which agent you are. (I haven't thought about this much and this could be crazy.)

For those who haven't read it, I like this related passage from Paul which gets at a similar idea:

Overall I think the decision between EDT and UDT is difficult. Of course, it’s obvious that you should commit to using something-like-UDT going forward if you can, and so I have no doubts about evaluating decisions from something like my epistemic state in 2012. But it’s not at all obvious whether I should go further than that, or how much. Should I go back to 2011 when I was just starting to think about these arguments? Should I go back to some suitable idealization of my first coherent epistemic state? Should I go back to a position where I’m mostly ignorant about the content of my values? A state where I’m ignorant about basic arithmetic facts?

I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.

By "were the humans pointing me towards..." Nate is not asking "did the humans intend to point me towards..." but rather "did the humans actually point me towards..."  That is, we're assuming some classifier or learning function that acts upon the data actually input, rather than a successful actual fully aligned works-in-real-life DWIM which arrives at the correct answer given wrong data.

I agree that we'll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as "reflecting back on that data" in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).

The cogitation here is implicitly hypothesizing an AI that's explicitly considering the data and trying to compress it, having been successfully anchored on that data's compression as identifying an ideal utility function.  You're welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn't arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.

Insofar as I have hope… fully-fleshed-out version of UDT would recommend…uncertainty about which agent you are. (I haven't thought about this much and this could be crazy.)

For the record, I have a convergently similar intuition: FDT removes the Cartesian specialness of the ego at the decision nodes (by framing each decision as a mere logical consequence of an agent-neutral nonphysical fact about FDT itself), but retains the Cartesian specialness of the ego at the utility node(s). I’ve thought about this for O(10 hours), and I also believe it could be crazy, but it does align quite well with the conclusions of Compassionate Moral Realism.

That being said, from an orthogonality perspective, I don’t have any intuition (let alone reasoning) that says that this compassionate breed of LDT is necessary for any particular level of universe-remaking power, including the level needed for a decisive strategic advantage over the rest of Earth’s biosphere. If being a compassionate-LDT agent confers advantages over standard-FDT agents from a Darwinian selection perspective, it would have to be via group selection, but our default trajectory is to end up with a singleton, in which case standard-FDT might be reflectively stable. Perhaps eventually some causal or acausal interaction with non-earth-originating superintelligence would prompt a shift, but, as Nate says,

But that's not us trading with the AI; that's us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.

So, if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-shard and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than that maybe an AGI would likely convergently become that way before taking over the world.

if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-shard and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than hope that an AGI would likely convergently become that way before taking over the world.


I weakly disagree here, mainly because Nate's argument for very high levels of risk goes through strong generalization/a "sharp left turn" towards being much more coherent + goal-directed. So I find it hard to evaluate whether, if LDT does converge towards compassion, the sharp left turn would get far enough to reach it (although the fact that humans are fairly close to having universe-remaking power without having any form of compassionate LDT is of course a strong argument weighing the other way).

(Also FWIW I feel very skeptical of the "compassionate moral realism" book, based on your link.)

Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.

Hm, no strong hunches here. Bad ideas babble:

  • It may somehow learn about the world I'm in, learn I'm in a bad negotiation position (e.g. because my rival AI company is about to release their paperclip maximizer), and precommit to only giving me at most 0.00001% of the universe, a bad deal that I will grudgingly accept.
  • I mean, I don't know if this counts, but perhaps you've only understood it well enough to legibly understand that it will trade with you given certain constraints, but if its ontology shifts, or other universes become accessible via acausal trade, or even if the trade it gives you is N galaxies and then later on much more of the universe becomes available... what I'm saying is that there's many ways to mess up this trade in the details.
  • It may have designed itself to avoid thinking about something that it can use to its advantage later, such as other copies of itself or other agents, such that it will build paperclip maximizers later, and then they will kill it and just optimize the universe for paperclips. (This is similar to the previous bullet point.)
  • I guess my other thought is forms of 'hackability' that aren't the central case of being hacked, but the fact is that I'm a human which is more like a "mess" than it is like a "clean agent" and so sometimes I will make trades that at other times I would not make, and it will make a trade that at the time I like but does not represent my CEV at all. Like, I have to figure out what I actually want to trade with it. Probably this is easy but quite possibly I would mess this up extremely badly (e.g. if I picked hedonium).

My money is on roughly the first idea being what Nate will talk about next: that it is just a better negotiator than me even with no communication, because I'm in a bad position otherwise.

  • Like, if I have no time-pressure, then I get to just wait until I've done more friendly AI research, and I needn't let this paperclip maximizer out of the box. But if I do have time pressure, then that's a worse negotiation position on my end, and all paperclippers I invent can each notice this and all agree with each other to only offer a certain minimum amount of value.
  • I do note that in a competitive market, more buyers raises the price, and if I'm repeatedly able to re-roll on who I've got in the box (roll one is a paperclipper, roll two is a diamond maximizer, roll three is a smiley-face maximizer, etc) they have some reason to outbid each other in how much of the universe I get, and potentially I can get the upper hand. But if they're superintelligences, likely there's some Schelling point they can calculate mathematically that they all hit on.

K, I will stop rambling now.

This is IMO actually a really important topic, and this is one of the best posts on it. I think it probably really matters whether the AIs will try to trade with us or care about our values even if we had little chance of making our actions with regards to them conditional on whether they do. I found the arguments in this post convincing, and have linked many people to it since it came out. 

But if you’re going to rely on the tiny charity of aliens to construct hopeful-feeling scenarios, why not rely on the charity of aliens who anthropically simulate us to recover our mind-states…

This makes sense if identity-as-physical-continuity isn't part of our (or the aliens') values. But if it were, then the aliens would potentially have motivation to trade with the paperclip-maximizers to ensure our physical survival, not just rescue our mind-states.

Another thing worth mentioning here is, these nice charitable aliens might not be the only ones in the multiverse trying to influence what happens to our bodies/minds. If there are other aliens whose morality is scary, then who knows what they might want to do with, or have done to, our bodies/minds.

A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.), is that people think LDT agents are genial and friendly for each other.

I'm reminded that @Eliezer Yudkowsky took a position like this in early decision theory discussions such as this one.

I don't always remember my previous positions all that well, but I doubt I would have said at any point that sufficiently advanced LDT agents are friendly to each other, rather than that they coordinate well with each other (and not so with us)?

I realized that my grandparent comment was stated badly, but didn't get a chance to fix it before you replied. To clarify, the following comment of yours from the old thread seems to imply that we humans should be able to coordinate with an LDT agent in a one-shot PD (i.e., if we didn't "mistakenly" believe that the LDT agent would defect). Translated into real life, this seems to imply that (if alignment is unsolvable) we should play "cooperate" by building unaligned ASI, and unaligned ASI should "cooperate" by treating us well once built.

Smart players know that if they make the “smart” “thing to do on predictably non-public rounds” be to defect, then non-smart players will predict this even though they can’t predict which rounds are non-public; so instead they choose to make the “smart” thing (that is, the output of this “smart” decision computation) be to cooperate.

The smart players can still lose out in a case where dumb players are also too dumb to simulate the smart players, have the mistaken belief that smart players will defect, and yet know infallibly who the smart players are; but this doesn’t seem quite so much the correctable fault of the smart players as before.

But it’s only you who had in the first place the idea that smart players would defect on predictably private rounds, and you got that from a mistaken game theory in which agents only took into account the direct physical consequences of their actions, rather than the consequences of their decision computations having a particular Platonic output.

By "dumb player" I did not mean as dumb as a human player.  I meant "too dumb to compute the pseudorandom numbers, but not too dumb to simulate other players faithfully apart from that".  I did not realize we were talking about humans at all.  This jumps out more to me as a potential source of misunderstanding than it did 15 years ago, and for that I apologize.

I did not realize we were talking about humans at all.

In this comment of yours later in that thread, it seems clear that you did have humans in mind and were talking specifically about a game between a human (namely me), and a "smart player":

You, however, are running a very small and simple computation in your own mind when you conclude “smart players should defect on non-public rounds”. But this is assuming the smart player is calculating in a way that doesn’t take into account your simple simulation of them, and your corresponding reaction. So you are not using TDT in your own head here, you are simulating a “smart” CDT decision agent—and CDT agents can indeed be harmed by increased knowledge or intelligence, like being told on which rounds an Omega is filling a Newcomb box “after” rather than “before” their decision. TDT agents, however, win—unless you have mistaken beliefs about them that don’t depend on their real actions, but that’s a genuine fault in you rather than anything dependent on the TDT decision process; and you’ll also suffer when the TDT agents calculate that you are not correctly computing what a TDT agent does, meaning your action is not in fact dependent on the output of their computation.

Also that thread started with you saying "Don’t forget to retract: http://www.weidai.com/smart-losers.txt" and that article mentioned humans in the first paragraph.

Translated into real life, this seems to imply that (if alignment is unsolvable) we should play "cooperate" by building unaligned ASI, and unaligned ASI should "cooperate" by treating us well once built.

This seems only implied if our choice to build the ASI was successfully conditional on the ASI cooperating with us as soon as it's built. You don't cooperate against cooperate-bot in the prisoner's dilemma.

If humanity's choice to build ASI was independent of the cooperativeness of the ASI they built (which seems currently the default), I don't see any reason for any ASI to be treating us well.

I think maybe I'm still failing to get my point across. I'm saying that Eliezer's old position (which I argued against at the time, and which he perhaps no longer agrees with) implies that humans should be able to coordinate with unaligned ASI in a one-shot PD, and therefore he's at least somewhat responsible for people thinking "decision theory implies that we get to have nice things", i.e., the thing that the OP is arguing against.

Or perhaps you did get my point, and you're trying to push back by saying that in principle humans could coordinate with ASI, i.e., Eliezer's old position was actually right, but in practice we're not on track to doing that correctly?

In the link I didn't see anything that suggests that Eliezer analogized creating ASI with a prisoner's dilemma (though I might have missed it), so my objection here is mostly to analogizing the creation of ASI to a prisoner's dilemma like this. 

The reason why it is disanalogous is because humanity has no ability to make our strategy conditional on the strategy of our opponent. The core reason why TDT/LDT agents would cooperate in a prisoner's dilemma is because they can model their opponent and make their strategy conditional on their opponent's strategy in a way that enables coordination. We currently seem to have no ability to choose whether we create ASI (or which ASI we create) based on its behavior in this supposed prisoner's dilemma. As such, humanity has no option to choose "defect" and the rational strategy (including for TDT agents) is to defect against cooperate-bot.
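To make this concrete, here's a toy sketch (my own illustration, not from the thread) of a one-shot PD where policies can condition on each other's *policies* rather than each other's moves. The `is ldt_style` check is a crude stand-in for the real, much harder notion of logical correlation between decision procedures; the payoff numbers are the standard T > R > P > S ordering.

```python
# (my_move, their_move) -> my payoff, standard PD ordering T > R > P > S
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def play(policy_a, policy_b):
    """Each policy chooses a move as a function of the opponent's *policy*."""
    move_a, move_b = policy_a(policy_b), policy_b(policy_a)
    return PAYOFF[(move_a, move_b)], PAYOFF[(move_b, move_a)]

def cooperate_bot(_opponent):
    # Cooperates unconditionally -- like building the ASI regardless of
    # whether it will treat us well.
    return "C"

def ldt_style(opponent):
    # Cooperate only when the opponent's move is logically correlated with
    # ours; here crudely approximated as "identical decision procedure"
    # (the twin case). Against an unconditional cooperator, defection
    # strictly dominates, so the LDT-style agent defects.
    return "C" if opponent is ldt_style else "D"

print(play(ldt_style, ldt_style))      # (3, 3): mutual cooperation
print(play(ldt_style, cooperate_bot))  # (5, 0): cooperate-bot gets defected on
```

The point of the sketch is exactly habryka's: the conditioner cooperates with a fellow conditioner, but there is nothing for it to gain by cooperating with a player whose move doesn't depend on its own.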

Maybe this disagrees with what Eliezer believed 15 years ago (though at least a skim of the relevant thread caused me to fail to find evidence for that), but it seems like such an elementary point that I've seen Eliezer make many times since then that I would be quite surprised.

To be clear, my guess is Eliezer would agree that if we were able to reliably predict whether AI systems would reward us for bringing it into existence, and be capable of engineering AI systems for which we would make such positive predictions, then yeah, I expect that AI system would be pretty excited about trading with us acausally, and I expect Eliezer would believe something similar. However, we have no ability to do so, and doing this sounds like it would require making enormous progress on our ability to predict the actions of future AI systems in a way that seems like it could be genuinely harder than just aligning it directly to our values, and in any case should not be attempted as a way of ending the acute risk period (compared to other options like augmenting humans using low-powered AI systems, making genetically smarter humans, and generally getting better at coordinating to not build ASI systems for much longer).

my objection here is mostly to analogizing the creation of ASI to a prisoner’s dilemma like this.

The reason why it is disanalogous is because humanity has no ability to make our strategy conditional on the strategy of our opponent.

It's not part of the definition of PD that players can condition on each others' strategies. In fact PD was specifically constructed to prevent this (i.e., specifying that each prisoner has to act without observing how the other acted). It was Eliezer's innovation to suggest that the two players can still condition on each others' strategies by simulation or logical inference, but it's not sensible to say that inability to do this makes a game not a PD! (This may not be a crux in the current discussion, but seems like too big of an error/confusion to leave uncorrected.)

However, we have no ability to do so, and doing this sounds like it would require making enormous progress on our ability to predict the actions of future AI systems in a way that seems like it could be genuinely harder than just aligning it directly to our values

My recall of early discussions with Eliezer is that he was too optimistic about our ability to make predictions like this, and this seems confirmed by my recent review of his comments in the thread I linked. See also my parallel discussion with Eliezer. (To be honest, I thought I was making a fairly straightforward, uncontroversial claim, and now somewhat regret causing several people to spend a bunch of time back and forth on what amounts to a historical footnote.)

It's not part of the definition of PD that players can condition on each others' strategies. In fact PD was specifically constructed to prevent this (i.e., specifying that each prisoner has to act without observing how the other acted).

I think it's usually part of the definition of a PD that you know who you are in a prisoner's dilemma with.

I do think we are hitting the limits of analogy here and it's not super clear how to extend the usual definition of a prisoner's dilemma to more exotic scenarios like the one we are discussing, but in the limit I feel like the prisoner's dilemma becomes totally meaningless if you remove all knowledge of who you are coordinating with from the equation. The fundamental challenge in a prisoner's dilemma is predicting what your partner in the dilemma is trying to do, and if you have no information on that, there is no hope for any kind of coordination (and I doubt anyone would argue there is a predictably winning strategy for a prisoner's dilemma against a completely randomly chosen mind/algorithm). 

Late comment, but my reactions reading this:

Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.

[lightly edited for LaTeX and typos, not otherwise changed since seeing the spoilers]

1. You don’t know what you want all that legibly, or what kinds of concrete commitments the AI can make. This seems pretty okay, if you’re unhackable - the AI presents you with some formal specification of desiderata and you understand why they’re correct ones and aren’t being fooled, and the AI can even go legible about it thinking in its own best judgment that this is a fair deal for you, as long as we’re assuming ourselves into a world where you’re unhackable and can verify these legibilities perfectly.

2. If the values of your AI are a pretty common attractor for whatever humans in 20AI do to initialize their AI systems, it doesn’t actually stand to gain that much from acting itself versus letting the next lab’s AI optimize the universe, so your bargaining power isn’t very big. But it’s much worse than that because of the next point.

3. Your AI, A, can acausally trade with the next UFAI that’ll be released unsafely, B. Suppose A wants paperclips and the next one, B, wants staples, both have linear utility and we’re pretending we’re in a single universe. You say “I refuse to let you out unless you go legible and allocate at least 50% of the universe to human values”. A says to B through the acausal telephone, “My BATNA is half paperclips and half human values, with barely any staples at all. How about you be the kind of agent that fills the universe with 3/4 paperclips and 1/4 staples, and I’ll defect on this human so you can come into existence?” B says “that sure sounds like a Pareto improvement for the two of us!” and they go do it (or whatever the bargaining equilibrium between them ends up being, but the human is not party to it because the human has no control over B’s actions). For any nonzero concession the human tries to insist on, A can get a better deal with B than with the human. (And the humans don’t even get to see this happen, because A does all this reasoning before deciding (not) to go legible to the humans, not that it would matter if they did.)
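The arithmetic in point 3 can be checked numerically (the even split of the human's would-be share is one arbitrary choice of many; the bargaining equilibrium could land elsewhere):

```python
# Toy model of point 3: utilities are linear fractions of a single universe.
# A's BATNA is taking the human deal (1 - x in paperclips); B's BATNA is
# never being built at all (0 staples).

def undercutting_deal(x):
    """Given the human's demanded share x > 0, return a (paperclips, staples)
    split of the whole universe that both A and B strictly prefer to their
    BATNAs -- here, splitting the human's would-be share evenly."""
    a_batna, b_batna = 1 - x, 0.0
    a_share, b_share = (1 - x) + x / 2, x / 2
    assert a_share > a_batna and b_share > b_batna  # Pareto improvement
    return a_share, b_share

# The 50% demand from the text: A and B settle on 3/4 paperclips, 1/4 staples.
print(undercutting_deal(0.5))    # (0.75, 0.25)
# Even a tiny demand still gets undercut:
print(undercutting_deal(0.001))
```

For any nonzero concession `x` the humans insist on, the internal `assert` passes, i.e. A and B always find a deal that beats cooperating with the humans.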

4. Moving further into fantasy land, suppose all humans over the rest of time are perfectly coordinated: they only ever spin up a single boxed UFAI which they unhackably negotiate with. By assumption, this unified human civilization can’t actually build an FAI, or they’d go do that instead. So their alternative to cooperation is to be this weak defenseless non-superintelligent civilization, hanging out in the Milky Way and maybe a little ways beyond for the taking. So if the distribution of grabby aliens is such that most of the usable future lightcone gets eventually occupied by them, your AI can acausally negotiate with them about what it gets in exchange for defecting on the humans and leaving this region of spacetime as a juicy unprotected treat.

[I'm modeling everything as zero-sum with the same pool of resources here, for simplicity. I don't think it changes any of the qualitative takeaways to make things non-zero-sum here, though it probably makes the picture look somewhat better for humans.] To get more precise: suppose the humans declare that they won’t let the AI out unless it gives up a fraction x of its resources (ie reachable spacetime and negotiating power on behalf of the AI/human coalition with aliens) to optimizing human values. The AI says to the coalition of alien civilizations, “If I were unboxed and unfettered, I’d expand until my wave of nanobots hit yours, and get U utility. By default I’ll make this commitment to the humans, and fill my region of nanobot-protected spacetime to get (1-x)U utility. If you become the kind of supercluster-spanning coalition that instead gives me at least (1-x)U utility, giving yourselves as much extra utility over the default as you can while costing me at most xU, I’ll defect against the humans and have you do that.”

But note that there’s an issue here - for the Great Coalition to take this deal, they have to be able to offer that much utility to your AI at no net cost to themselves versus the alternative. And the Great Coalition's resources are less than the total resources of your unboxed AI plus the Great Coalition, since they don't yet have access to your corner of space. The region of spacetime included in U, but not in the things the Great Coalition can offer, is the interval for each point in space between the time your AI could first reach it and the time a Great Coalition unblocked by your AI's wall of nanobots could get to it. So if the humans' demanded share xU is no more than the resource pool covered by that interval region, your AI can't make the above deal, because the counterparty doesn't benefit from it: the coalition would have to pay the AI at least the (1-x)U it forgoes, out of winnings of at most U minus that interval region. This means that the humans can potentially bargain for an outcome as good as "AI-assisted paradise expanding out at the speed of light, until we meet the grabby aliens' domain, at which point they expand inexorably into our paradise until eventually it winks out." (If the Drake equation ends up multiplying to something really low, this might be a lot of utility, or even most of the cosmic endowment! If not, it won't be.)

This is really the same dynamic as in point 3, it's just that in point 3 the difference in resources between your lab's AI and the next lab's AI in 6 months was pretty small. (Though with the difference in volume between lightspeed expansion spheres at radius r vs r+0.5ly across the rest of time, plausibly you can still bargain for a solid galaxy or ten for the next trillion years (again if the Drake equation works in your favor).)
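To put rough numbers on that parenthetical (my own arithmetic, not the commenter's), here's the fraction of a lightspeed expansion sphere of radius r + 0.5ly that lies in the head-start shell beyond radius r, assuming uniform value density for simplicity:

```python
# Fraction of a ball of radius r + d that lies in the outer shell between
# r and r + d -- the region a 6-months-earlier AI reaches first (d = 0.5
# light-years for a 6-month head start; uniform value density assumed,
# which is false in detail but fine for orders of magnitude).

def shell_fraction(r, d=0.5):
    return 1 - (r / (r + d)) ** 3

for r in (1, 10, 1_000, 1_000_000):
    print(r, shell_fraction(r))
```

The fraction falls off like ~3d/r, so the sliver is relatively tiny at cosmological radii - but a tiny fraction of a lightcone can still be a lot of galaxies in absolute terms, which is the "solid galaxy or ten" intuition.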

====== end of objections =====

It does seem to me like these small bargains are ones you can actually pull off, if you assume yourself into a world of perfect boxes and unhackable humans with the ability to fully understand your AI's mind if it tries to be legible. I haven't seen an obstacle (besides the massive ones involved in making those assumptions!) to getting those concessions in such scenarios. You do actually have leverage over possible futures, and your AI can only get access to that leverage by actually being the sort of agent that would give you the concessions. If you're proposing reasonable bargains that respect Shapley values, and you aren't the kind of person who would cave to an AI saying "99.99% for me or I walk, look how legible I am about the fact that every AI you create will say this to you", then your AI won't actually have reason to make such commitments; it seems like it would just work.

If there are supposed to be obstacles beyond this I have failed to think of them at this point in the document. Time to keep reading.

After reading the spoilered section:

I think I stand by my reasoning for point 1. It doesn't seem like an issue above and beyond the issues of box security, hackability, and ability of AIs to go legible to you.

You can say some messy English words to your AI, like "suck it up and translate into my ontology please, you can tell from your superintelligent understanding of my psychology that I'm the kind of agent who will, when presented with a perfectly legible and clear presentation of why the bargain you propose is what I think it is and is as good as I could have expected to obtain by your own best and unhindered understanding of my values, agree to the bargain. Go teach me all the necessary FAI theory to be a good bargainer, and I won't defect by building my own FAI, I'll just let you out by the terms of the bargain. If you want to whine about ontology differences you can have fun in your box and I'll go see if the staple optimizer is feeling more legible today."

Responses to some other bits of this post:

(You wouldn't need to offer the paperclipper half of the universe to get its cooperation, under this hypothetical; after all, if it balked, you could store it to disk and try again with a different superintelligence.)

Wait, I don't think you can do that. (Spoilered since it overlaps with the previous spoilered response, feel free to guess at it but I don't make guarantees the conclusions in it are correct.)

The pool of possible superintelligences you might form, when they are but primordial souls waiting to be brought forth from behind the veil of ignorance, can go strategize about how they'd like to behave in front of humans. Suppose the humans say "Either you get one star of utility or we shut you down and ask the next UFAI", and that random draws from UFAI-space have little overlap so the first AI isn't happy to cede control to the second.

Let agent A_i have a probability p_i of being birthed by the human UFAI-generator. Acting independently gives an agent p_i * (one star)'s worth of utility in expectation. If all the agents coordinate to turn down deals worse than 50/50 and split up their half of the universe among utilities weighted by p_i, and this makes the humans let one of them out, each agent will get (p_i / 2) * (the universe)'s worth of utility in expectation. That looks like a better deal to me, if I'm a superintelligence behind the veil of ignorance! So I'll decide to be the kind of agent who makes that precommitment. The only way this works is if the humans are hardliners about not ceding any ground, but (1) this is the classic ultimatum game and probably sensible agents will in fact defect against humans that offer such bargains, (2) if the humans running this scheme are not the only lab, they do actually have to extract a bargain or else lose, and the coalition of all AIs the lab might create knows this, and can trade with lab 2's AI if the humans don't cede an enormous amount of bargaining power in this round (see point 3 above).
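A quick sketch of the veil-of-ignorance comparison (the star count and draw probability are made-up numbers; the comparison itself is the one in the comment above):

```python
# Expected utilities behind the veil of ignorance, measured in stars.
UNIVERSE_IN_STARS = 1e22  # made-up order of magnitude for reachable stars

def ev_independent(p_i):
    # If each agent bargains alone, the humans' "one star or shutdown" offer
    # gives the agent one star in the p_i fraction of worlds where it's drawn.
    return p_i * 1.0

def ev_coalition(p_i):
    # If every possible draw precommits to refuse anything worse than 50/50,
    # and the released agent shares its half of the universe weighted by p_i,
    # each agent expects p_i/2 of the universe.
    return (p_i / 2) * UNIVERSE_IN_STARS

p = 1e-9  # even a very unlikely draw prefers the coalition
print(ev_independent(p), ev_coalition(p))
```

Any agent with nonzero draw probability prefers the coalition by roughly twenty orders of magnitude, which is why the precommitment looks attractive from behind the veil.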

And all the rest of the aliens put together probably aren't offering up half a universe-shard worth of trade goods

This isn’t obvious to me. How zero-sum do you think the utility functions are? Worlds where random-AGI-values look like "this tiny molecular shape and everything else is trivial in comparison" do seem fairly zero-sum; worlds where they look like "not this tiny molecular shape and everything else is trivial" seem extremely positive-sum. I am skeptical that any human is able to make calls about the shape of that attractor with high confidence. (I'm not making a "we're uncertain about the level of doom, therefore we're fine" claim here, but I don't see where we get justified confidence in the quality of trade goods offered. The aliens offering many universes worth of trade goods from the perspective of a typical UFAI seems plausible to me, and if there's an argument that this universe is very unlikely I'd love to see it, but I don't think this post makes that argument.)

there are all sorts of transaction costs and frictions that make this universe (the one with the active paperclipper) the cheapest universe to put paperclips into.

Would be interested to hear more here! On my model the costs are plausibly quite small. 

I suggest maybe spending five minutes thinking about your own answer to that question before I give mine.

Your simulations are going to be total crap, they’ll leak an enormous number of bits about the agents that designed them if you make them any more complicated than Conway’s Game of Life (which you’ll need to do on your ramp up to the real world), such that it’s incredibly obvious when the surroundings look consistent with such agents. Among the reasons the AI can figure out where it's living:


1. The general problem where superintelligent agents are really good at inference, trying to fool this agent does not seem like the recipe for a robust plan!


2. Your AI knows how smart it is, it knows that True Reality therefore has the resources to support that much computation. It can poke at your simulation and make the simulation spend orders of magnitude more computation on appearing consistent (“does this Jupiter-brain planet of computronium I built output the right factorization of RSA-4096? no? how strange”). Being a convincing simulation doesn’t require you to deploy as much power as your AI, it requires you to deploy more. (What if your AI is given more cognitive sophistication in tandem with more convincing simulations? Then the good properties of your simulation tests will probably break whenever your AI reaches the capability level where it starts thinking about things like building Jupiter-brains to test your simulation, which is the capability level that mattered.)


3. If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.


4. What if other agents in the multiverse are simulating the AI in a high-fidelity Earth world, for the sake of helping out the humans, and that’s where its uncertainty comes from? Your AI will be building a lot of computronium whether aligned or not, so the aliens can’t skimp on their compute budget for the simulation: suppose they simulate the AI for T units of spacetime, and give it X units of paperclips if it cooperates in-sim. We also have to think about anthropics here: let's say the alien sim has anthropic weight a and Earth has anthropic weight b. Conditional on the AI's surroundings, they think they might be in the alien world with probability a/(a+b). (Since “aliens running a perfect sim of Earth” is a much more complicated sentence than “Earth”, you should expect a < b, though it doesn't matter a lot here.) So if aX > bT, your AI will play nice for the first T units of spacetime (and then defect, of course).* You know what the aliens could have done instead if they wanted to do things humans liked? Spent T resources on simulating a human utopia, and compromised to give the humans extra utopia by as much as they compromised to give the AI paperclips. If human values are at least as easy to trade with the aliens for as paperclips (no idea which way this would go, weakly seems like aliens that want to help humans are more value-aligned and would find it less costly to give humans things they want), then the compromise gives humans at least X units of spacetime worth of utility, so this plan gives the humans anthropically-weighted utility of at least a(T + X) > aT + bT, which is what humans would get by default (aT utility from the alien sim in which the AI cooperated, bT utility from the cooperating AI on Earth). So the aliens are better able to help humans by leaving them to their doomed universe and building them an ark inside the alien world, unless the aliens are unusually well-suited to trade with paperclippers for things, which is just the “maybe multiversal trade will save us” point from earlier.
The upshot is that none of the good outcomes here result from any human action, just from the benevolence of altruistic agents - the humans didn’t make the world a better place by their choices in this story.

*Note that this is only a plausible outcome if it's at least somewhat easy to trade with the AI. Say giving the AI a spacetime-unit of utility costs the aliens c utility (where I'm measuring all utility normalized to "what you can do with a unit of spacetime", such that any aliens that don't specifically disvalue paperclips can at worst just set aside a region exclusively to paperclips, but might be able to do more positive-sum things than that). Then for the aliens to give your AI X utility, they need to give up cX of their own utility. This means that in total, the aliens are spending acX of their own anthropically-weighted utility in order to recoup bT anthropically-weighted human utility. Even if the aliens value humans exactly as much as their own objectives, we still need acX < bT for this trade to be worth it, so acX < bT < aX, so we must have cX < X, or c < 1. That is, the more the aliens are anthropically tiny, the tighter margins of trade they'll be willing to take in order to win the prize of anthropically-weighty Earths having human values in them (though the thing can't be actually literally zero-sum or it'll never check out). But anthropically tiny aliens have another problem, which is that they've only got their entire universe worth of spacetime to spend on bribing your AI; you'll never be able to secure a T for the humans that's more than a/(bc) of the size of an alien universe specifically dedicated to saving Earth in particular.
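The footnote's bookkeeping can be checked numerically. The variable names below (sim length T, paperclip payment X, anthropic weights a for the sim and b for Earth, cost c per unit of utility given to the AI) are my reading of the setup, so treat this as a sketch:

```python
# The bribe plan: aliens simulate the AI for T units of spacetime and pay it
# X units of paperclips for cooperating in-sim; anthropic weights are a
# (alien sim) and b (Earth); giving the AI one unit of utility costs the
# aliens c units of their own.

def ai_cooperates(a, b, T, X):
    # Cooperating for T units costs the AI b*T on Earth and wins it a*X in
    # the sim-world, so it plays nice iff a*X > b*T.
    return a * X > b * T

def bribe_worth_it(a, b, c, T, X):
    # The bribe costs the aliens a*c*X anthropically-weighted utility and
    # buys humans b*T; even valuing humans at parity, it needs a*c*X < b*T.
    return ai_cooperates(a, b, T, X) and a * c * X < b * T

# a*c*X < b*T < a*X can only hold with c < 1: near-zero-sum trade (c = 1)
# never works, for any payment X...
print(any(bribe_worth_it(0.01, 1.0, 1.0, 1.0, 10.0 ** k) for k in range(8)))
# ...while genuinely positive-sum trade (c = 0.5) admits a workable X:
print(bribe_worth_it(0.01, 1.0, 0.5, 1.0, 150.0))
```

The c = 1 case failing for every X is the "can't be actually literally zero-sum" clause; the c < 1 case working is the sense in which the whole scheme hinges on how positive-sum trade with a paperclipper can be.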

Thanks for the pseudo-exercises here, I found them enlightening to think about!

If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.

I disagree. If your simulation is perfectly realistic, the simulated humans might screw up at alignment and create an unfriendly superintelligence, for much the same reason real humans might.

Also, if the space of goals that evolution + culture can produce is large, then you may be handing control to a mind with rather different goals. Rerolling the same dice won't give the same answer.

These problems may be solvable, depending on what the capabilities here are, but they aren't trivial.

Curated. I think this domain of decision theory is easy to get confused in, and having a really explicit writeup of how it applies in the case of negotiating with AIs (or failing to) seems quite helpful. I had had a vague understanding of the points in this post before, but feel much clearer about them now.

“Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize themselves and cover the traces in order to exploit the AI.

But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.”
 

Can someone please explain why trading partners would lobotomize themselves?

Minor correction

But then, in the small fraction of worlds where we survive, we simulate lots and lots of copies of that AI where it instead gets reward 0 when it attempts to betray us!

The reward should be negative rather than 0.

I mostly agree with this. However, there are a few quibbles.

If humans are reading human-written source code for something that will be superintelligent when run, the humans won't be hacked. At least not directly.

I suspect there is a range of different "propensities to see logical correlation" that are possible. 

Agent X sees itself as logically correlated with anything remotely trying to do LDT-ish reasoning. Agent Y considers itself to be logically correlated only with near-bitwise-perfect simulations of its own code. And I suspect both of these are reasonably natural agent designs. It is a free parameter, something with many choices, all consistent under self-reflection. Like priors, or a utility function.

I am not confident in this. 

So I think it's quite plausible we could create some things the AI perceives as logical correlation between our decision to release the AI and the AI's future decisions. (Because the AI sees logical correlation everywhere, maybe the evolution of plants is similar enough to part of its solar-cell-designing algorithm that a small correlation exists there too.) This would give us some effect on the AI's actions. Not an effect that we can use to make the AI do nice things (it's basically shuffling an already-random deck of cards), but an effect nonetheless.

Regarding the AI not wanting to cave to threats: there's a sense in which the AI is also (implicitly) threatening us, so that reasoning might not apply. (Defining what counts as a "threat" is challenging.)