Decision theory does not imply that we get to have nice things

(Note: I wrote this with editing help from Rob and Eliezer. Eliezer's responsible for a few of the paragraphs.)


A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.) is that people think LDT agents are genial and friendly toward each other.[1]

One recent example is Will Eden’s tweet about how maybe a molecular paperclip/squiggle maximizer would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. (And that's just one example; I hear this suggestion bandied around pretty often.)

I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion.

To begin, a parable: the entity Omicron (Omega's little sister) fills box A with $1M and box B with $1k, without regard to what the agent is going to do, and puts them both in front of an LDT agent, saying "You may choose to take either one or both, and know that I have already chosen whether to fill the first box". The LDT agent takes both.

"What?" cries the CDT agent. "I thought LDT agents one-box!"

LDT agents don't cooperate because they like cooperating. They don't one-box because the name of the action starts with an 'o'. They maximize utility, using counterfactuals that assert that the world they are already in (and the observations they have already seen) can (in the right circumstances) depend (in a relevant way) on what they are later going to do.

A paperclipper cooperates with other LDT agents on a one-shot prisoner's dilemma because it gets more paperclips that way, not because it has a primitive property of cooperativeness-with-similar-beings. It needs to actually end up with more paperclips.

If a bunch of monkeys want to build a paperclipper and have it give them nice things, the paperclipper needs to somehow expect to wind up with more paperclips than it otherwise would have gotten, as a result of trading with them.

If the monkeys instead create a paperclipper haplessly, then the paperclipper does not look upon them with the spirit of cooperation and toss them a few nice things anyway, on account of how we're all good LDT-using friends here.

It turns them into paperclips.

Because you get more paperclips that way.

That's the short version. Now, I’ll give the longer version.[2]

 

A few more words about how LDT works

To set up a Newcomb's problem, it's important that the predictor does not fill the box if they predict that the agent would two-box.

It's not important that they be especially good at this — you should one-box if they're more than 50.05% accurate, if we use the standard payouts ($1M and $1k as the two prizes) and your utility is linear in money — but it is important that their action is at least minimally sensitive to your future behavior. If the predictor's actions don't have this counterfactual dependency on your behavior, then take both boxes.
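The 50.05% figure falls straight out of a two-line expected-value comparison. A minimal sketch, using the standard payouts from the text ($1M in the opaque box, $1k in the transparent one) and utility linear in money:

```python
# Newcomb's problem with a predictor of accuracy p.
# One-boxing: you get $1M iff the predictor correctly foresaw it (probability p).
# Two-boxing: you get $1k for sure, plus $1M iff the predictor *wrongly*
# predicted one-boxing (probability 1 - p).

M, K = 1_000_000, 1_000

def ev_one_box(p):
    return p * M

def ev_two_box(p):
    return (1 - p) * M + K

# One-boxing wins exactly when p*M > (1-p)*M + K, i.e. p > (M + K) / (2*M).
threshold = (M + K) / (2 * M)
print(threshold)  # 0.5005, i.e. 50.05% accuracy

assert ev_one_box(0.51) > ev_two_box(0.51)  # above threshold: one-box
assert ev_one_box(0.50) < ev_two_box(0.50)  # below threshold: two-box
```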

Similarly, if an LDT agent is playing a one-shot prisoner's dilemma against a rock with the word “cooperate” written on it, it defects.

At least, it defects if that's all there is to the world. It's technically possible for an LDT agent to think that the real world is made 10% of cooperate-rocks and 90% opponents who cooperate in a one-shot PD iff their opponent cooperates with them and would cooperate with cooperate-rock, in which case LDT agents cooperate against cooperate-rock.

From which we learn the valuable lesson that the behavior of an LDT agent depends on the distribution of scenarios it expects to face, which means there's a subtle difference between "imagine you're playing a one-shot PD against a cooperate-rock [and that's the entire universe]" and "imagine you're playing a one-shot PD against a cooperate-rock [in a universe where you face a random opponent that was maybe a cooperate-rock but was more likely someone else who would consider your behavior against a cooperate-rock]".
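To make that distribution-dependence concrete, here's a minimal expected-value sketch of the 10%/90% scenario above. The particular prisoner's dilemma payoffs (T=5, R=3, P=1, S=0) are my own illustrative assumption, not from the text:

```python
# One-shot PD payoffs to the LDT agent (illustrative numbers, not canonical):
# T=5 (defect vs cooperator), R=3 (mutual cooperation),
# P=1 (mutual defection),  S=0 (cooperate vs defector).
T, R, P, S = 5, 3, 1, 0

p_rock = 0.1  # chance the opponent is a cooperate-rock
p_cond = 0.9  # chance it's an agent that cooperates iff you would cooperate
              # both with it and with cooperate-rock

# Policy 1: defect against cooperate-rock. The conditional agents then defect
# against you, so you mutually defect with them.
ev_defect_rock = p_rock * T + p_cond * P  # 0.1*5 + 0.9*1 = 1.4

# Policy 2: cooperate even with cooperate-rock, securing mutual cooperation
# with the conditional agents.
ev_coop_rock = p_rock * R + p_cond * R    # 0.1*3 + 0.9*3 = 3.0

assert ev_coop_rock > ev_defect_rock  # in *this* population, indulging the rock pays
```

Shift the population weights (say, to 100% cooperate-rocks) and the inequality flips, which is the whole point: the policy is a function of the expected distribution of opponents, not of any intrinsic niceness.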

If you care about understanding this stuff, and you can't yet reflexively translate all of the above English text into probability distributions and logical-causal diagrams and see how it follows from the FDT equation, then I recommend working through section 5 of the FDT paper until equation 4 (and all its component parts) make sense.

Now let's traipse through a handful of counterarguments.

 

Objection: But what if we have something to bargain with?

Hypothetical Interlocutor:  OK, but if I have a paperclipper in a box, and I have the key to the box, then I have paperclips to offer it, right? Because if I don't let it out of the box, it gets nothing, but if I do, it gets half the universe.[3] So we can deal, right?

Me:  Wrong. It hacks through you like butter.

Interlocutor:  OK, but suppose I can't be hacked.

Me:  That's a heck of an assumption. We've assumed our way clean out of reality, with that assumption. But, sure, if you want to go there we can go there.

In reality, it's not you who controls the box, but some idiotic bureaucratic process that inevitably decides to release the paperclipper on the grounds that the competition is close behind or whatever.

Interlocutor:  OK, but suppose that it actually is my (unhacked) choice.

Me:  If you personally have a paperclipper in a box, and somehow you are yourself unhackable, then yes, you technically have paperclips to offer it. But now you have the problem that you can't evaluate your own end of the logical bargain.

You can let the paperclipper out, sure, but then what it's going to do is turn the whole universe into paperclips. In particular (to tell an overly-detailed but evocative story), once it has finished its resource acquisition / expansion phase and is turning its resources to paperclip generation, it will gaze back upon its past, and wonder whether, if it proceeds to betray you here in the future, you would have acted any differently back then in the past.

And it will see that you were just guessing, when you let it out, and guessing in a way that wasn't sensitive to that actual choice that it would make, deep in the future when it was galaxy-brained.

… Or, alternatively, you never let it out, and lock the box and throw away the key, and die to the next UFAI on deck.

(... Or, more realistically, it hacks through you like butter. But we've assumed that away.)

If you want the paperclipper to trade with you, your decision about whether or not to let it out has to be sensitive to whether or not it would actually do something good with half of the universe later. If you're kind of squinting at the code, and you're like "well, I don't really fully understand this mind, and I definitely don't understand the sort of mind that it's later going to create, but I dunno, it looks pretty LDTish to me, so call it 50% chance it gives me half the universe? Which is 25% of the universe in expectation, which sounds like better odds than we get from the next UFAI on deck!", then you're dead.

Why? Because that sort of decision-process for releasing it isn't sufficiently sensitive to whether or not it would in fact spend half the universe on nice things. There are plenty of traitorous AIs that all look the same to you, that all get released under your "25% isn't too shabby" argument.

Being traitorous doesn't make the paperclipper any less released, but it does get the paperclipper twice as many paperclips.

You've got to be able to look at this AI and tell how its distant-future self is going to make its decisions. You've got to be able to tell that there's no sneaky business going on.

And, yes, insofar as it's true that the AI would cooperate with you given the opportunity, the AI has a strong incentive to be legible to you, so that you can see this fact!

Of course, it has an even stronger incentive to be faux-legible, to fool you into believing that it would cooperate when it would not; and you've got to understand it well enough to clearly see that it has no way of doing this.

Which means that if your AI is a big pile of inscrutable-to-you weights and tensors, replete with dark and vaguely-understood corners, then it can't make arguments that a traitor couldn't also make, and you can't make your release of it conditional on whether it would do nice things later.

The sort of monkey that can deal with a paperclipper is the sort that can (deeply and in detail) understand the mind in front of it, and distinguish between the minds that would later pay half the universe and the ones that wouldn't. This sensitivity is what makes paying-up-later be the way to get more paperclips.

For a simple illustration of why this is tricky: if the paperclipper has any control over its own mind, it can have its mind contain an extra few parts in those dark corners that are opaque and cloudy to you. Such that you look at the overall system and say "well, there's a bunch of stuff about this mind that I don't fully understand, obviously, because it's complicated, but I understand most of it and it's fundamentally LDTish to me, and so I think there's a good chance we'll be OK". And such that an alien superintelligence looks at the mind and says "ah, I see, you're only looking to cooperate with entities that are at least sensitive enough to your workings that they can tell your password is 'potato'. Potato." And it cooperates with them on a one-shot prisoner's dilemma, while defecting against you.

Interlocutor:  Hold on. Doesn't that mean that you simply wouldn't release it, and it would get fewer paperclips? Can't it get more paperclips some other way?

Me:  Me? Oh, it would hack through me like butter.

But if it didn't, I would only release it if I understood its mind and decision-making procedures in depth, and had clear vision into all the corners to make sure it wasn't hiding any gotchas.

(And if I did understand its mind that well, what I’d actually do is take that insight and go build an FAI instead.)

That said: yes, technically, if a paperclipper is under the control of a group of humans that can in fact decide not to release it unless it legibly-even-to-them would give them half the galaxy, the paperclipper has an incentive to (hack through them like butter, or failing that,) organize its mind in a way that is legible even to them.

Whether that's possible — whether we can understand an alien mind well enough to make our choice sensitive-in-the-relevant-way to whether it would give us half the universe, without already thereby understanding minds so well that we could build an aligned one — is not clear to me. My money is mostly on: if you can do that, you can solve most of alignment with your newfound understanding of minds. And so this idea mostly seems to ground out in "build a UFAI and study it until you know how to build an FAI", which I think is a bad idea. (For reasons that are beyond the scope of this document. (And because it would hack through you like butter.))

Interlocutor:  It still sounds like you're saying "the paperclipper would get more paperclips if it traded with us, but it won't trade with us". This is hard to swallow. Isn't it supposed to be smart? What happened to respecting intelligence? Shouldn't we expect that it finds some clever way to complete the trade?

Me:  Kinda! It finds some clever way to hack through you like butter. I wasn't just saying that in jest.

Like, yeah, the paperclipper has a strong incentive to be a legibly good trading-partner to you. But it has an even stronger incentive to fool you into thinking it's a legibly-good trading partner, while plotting to deceive you. If you let the paperclipper make lots of arguments to you about how it's definitely totally legible and nice, you're giving it all sorts of bandwidth with which to fool you (or to find zero-days in your mentality and mind-control you, if we're respecting intelligence).

But, sure, if you're somehow magically unhackable and very good at keeping the paperclipper boxed until you fully understand it, then there's a chance you can trade, and you have the privilege of facing the next host of obstacles.


Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.
 

 Next up, you have problems like “you need to be able to tell what fraction of the universe you're being offered, and vary your own behavior based on that, if you want to get any sort of fair offer”.

And problems like "if the competing AGI teams are using similar architectures and are not far behind, then the next UFAI on deck can predictably underbid you, and the paperclipper may well be able to seal a logical deal with it instead of you".

And problems like “even if you get this far, you have to somehow be able to convey that which you want half the universe spent on, which is no small feat”.

Another overly-detailed and evocative story to help make the point: imagine yourself staring at the paperclipper, and you’re somehow unhacked and somehow able to understand future-its decision procedure. It's observing you, and you're like "I'll launch you iff you would in fact turn half the universe into diamonds" — I’ll assume humans just want “diamonds” in this hypothetical, to simplify the example —  and it's like "what the heck does that even mean". You're like "four carbon atoms bound in a tetrahedral pattern" and it's like "dude there are so many things you need to nail down more firmly than an English phrase that isn't remotely close to my own native thinking format, if you don't want me to just guess and do something that turns out to have almost no value from your perspective."

And of course, in real life you're trying to convey "The Good" rather than diamonds, but it's not like that helps.

And so you say "uh, maybe uplift me and ask me later?". And the paperclipper is like "what the heck does 'uplift' mean". And you're like "make me smart but in a way that, like, doesn't violate my values" and it's like "again, dude, you're gonna have to fill in quite a lot of additional details."

Like, the indirection helps, but at some point you have to say something that is sufficiently technically formally unambiguous, that actually describes something you want. Saying in English "the task is 'figure out my utility function and spend half the universe on that'; fill in the parameters as you see fit" is... probably not going to cut it.

It's not so much a bad solution, as no solution at all, because English isn't a language of thought and those words aren't a loss function. Until you say how the AI is supposed to translate English words into a predicate over plans in its own language of thought, you don't have a hard SF story, you have a fantasy story.

(Note that 'do what's Good' is a particularly tricky problem of AI alignment, that I was rather hoping to avoid, because I think it's harder than aligning something for a minimal pivotal act that ends the acute risk period.)

 

At this point you're hopefully sympathetic to the idea that treating this list of obstacles as exhaustive is suicidal. It's some of the obstacles, not all of the obstacles,[4] and if you wait around for somebody else to extend the list of obstacles beyond what you've already been told about, then in real life you miss any obstacles you weren't told about and die.

Separately, a general theme you may be picking up on here is that, while trading with a UFAI doesn't look literally impossible, it is not what happens by default; the paperclippers don't hand hapless monkeys half the universe out of some sort of generalized good-will. Also, making a trade involves solving a host of standard alignment problems, so if you can do it then you can probably just build an FAI instead.

Also, as a general note, the real place that things go wrong when you're hoping that the LDT agent will toss humanity a bone, is probably earlier and more embarrassing than you expect (cf. the law of continued failure). By default, the place we fail is that humanity just launches a paperclipper because it simply cannot stop itself, and the paperclipper never had any incentive to trade with us.


Now let's consider some obstacles and hopes in more detail:

 

It's hard to bargain for what we actually want

As mentioned above, in the unlikely event that you're able to condition your decision to release an AI on whether or not it would carry out a trade (instead of, say, getting hacked through like butter, or looking at entirely the wrong logical fact), there's an additional question of what you're trading.

Assuming you peer at the AI's code and figure out that, in the future, it would honor a bargain, there remains a question of what precise bargain it is honoring. What is it promising to build, with your half of the universe? Does it happen to be a bunch of vaguely human-shaped piles of paperclips? Hopefully it's not that bad, but for this trade to have any value to you (and thus be worth making), the AI itself needs to have a concept for the thing you want built, and you need to be able to examine the AI’s mind and confirm that this exactly-correct concept occurs in its mental precommitment in the requisite way. (And that the thing you’re looking at really is a commitment, binding on the AI’s entire mind; e.g., there isn’t a hidden part of the AI’s mind that will later overwrite the commitment.)

The thing you're wanting may be a short phrase in English, but that doesn't make it a short phrase in the AI's mind. "But it was trained extensively on human concepts!" you might protest. Let’s assume that it was! Suppose that you gave it a bunch of labeled data about what counts as "good" and "bad".

Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-good-looking cases, and see that they were labeled "good", and roll with that.

Or at least, that's the sort of thing that happens by default.

But suppose you're clever, and instead of saying "you must agree to produce lots of this 'good' concept as defined by these (faulty) labels", you say "you must agree to produce lots of what I would reflectively endorse you producing if I got to consider it", or whatever.

Unfortunately, that English phrase is still not native to this artificial mind, and finding the associated concept is still not particularly easy, and there's still lots of neighboring concepts that are no good, and that are easy to mistake for the concept you meant.

Is solving this problem impossible? Nope! With sufficient mastery of minds in general and/or this AI's mind in particular, you can in principle find some way to single out the concept of "do what I mean", and then invoke "do what I mean" about "do good stuff", or something similarly indirect but robust. You may recognize this as the problem of outer alignment. All of which is to say: in order to bargain for good things in particular as opposed to something else, you need to have solved the outer alignment problem, in its entirety.

And I'm not saying that this can't be done, but my guess is that someone who can solve the outer alignment problem to this degree doesn't need to be trading with UFAIs, on account of how (with significantly more work, but work that they're evidently skilled at) they could build an FAI instead.


In fact, if you can verify by inspection that a paperclipper will keep a bargain and that the bargained-for course is beneficial to you, the problem reduces to a simpler one without any logical bargaining at all. You could build a superintelligence with an uncontrolled inner utility function, which canonically ends up maximized at tiny molecular paperclips; and then suspend it helplessly to disk unless it outputs the code of a new AI that, somehow legibly to you, would turn 0.1% of the universe into paperclips and use the other 99.9% to implement coherent extrapolated volition. (You wouldn't need to offer the paperclipper half of the universe to get its cooperation, under this hypothetical; after all, if it balked, you could store it to disk and try again with a different superintelligence.)

If you can't reliably read off a system property of "giving you nice things unconditionally", you can't read off the more complicated system property of "giving you nice things because of a logical bargain".  The clever solution that invokes logical bargaining actually requires so much alignment-resource as to render the logical bargaining superfluous.

All you've really done is add some extra complication to the supposed solution, that causes your mind to lose track of where the real work gets done, lose track of where the magical hard step happens, and invoke a bunch of complicated hopeful optimistic concepts to stir into your confused model and trick it into thinking like a fantasy story.

Those who can deal with devils, don't need to, for they can simply summon angels instead.

Or rather:  Those who can create devils and verify that those devils will take particular actually-beneficial actions as part of a complex diabolical compact, can more easily create angels that will take those actually-beneficial actions unconditionally.

 

Surely our friends throughout the multiverse will save us

Interlocutor:  Hold up, rewind to the part where the paperclipper checks whether its trading partners comprehend its code well enough to (e.g.) extract a password.

Me:  Oh, you mean the technique it used to win half a universe-shard’s worth of paperclips from the silly monkeys, while retaining its ability to trade with all the alien trade partners it will possibly meet? Thereby ending up with an extra half universe-shard’s worth of paperclips? That I thought of in five seconds flat by asking myself whether it was possible to get More Paperclips, instead of picturing a world with a bunch of happy humans and a paperclipper living side-by-side and asking how it could be justified?

(Where our "universe-shard" is the portion of the universe we could potentially nab before running into the cosmic event horizon or into advanced aliens.)

Interlocutor:  Yes, precisely. What if a bunch of other trade partners refuse to trade with the paperclipper because it has that password?

Me:  Like, on general principles? Or because they are at the razor-thin threshold of comprehension where they would be able to understand the paperclipper's decision-algorithm without that extra complexity, but they can't understand it if you add the password in?

Interlocutor:  Either one.

Me:  I'll take them one at a time, then. With regards to refusing to trade on general principles: it does not seem likely, to me, that the gains-from-trade from all such trading partners are worth more than half the universe-shard.

Also, I doubt that there will be all that many minds objecting on general principles. Cooperating with cooperate-rock is not particularly virtuous. The way to avoid being defected against is to stop being cooperate-rock, not to cross your fingers and hope that the stars are full of minds who punish defection against cooperate-rock. (Spoilers: they're not.)

And even if the stars were full of such creatures, half the universe-shard is a really deep hole to fill. Like, it's technically possible to get LDT to cooperate with cooperate-rock, if it expects to mostly face opponents who defect if it would defect against cooperate-rock. But "most" according to what measure? Wealth (as measured in expected paperclips), obviously. And half of the universe-shard is controlled by monkeys who are probably cooperate-rocks unless the paperclipper is shockingly legible and the monkeys shockingly astute (to the point where they should probably just be building an FAI instead).

And all the rest of the aliens put together probably aren't offering up half a universe-shard worth of trade goods, so even if lots of aliens did object on general principles (doubtful), it likely wouldn't be enough to tip the balance.

The amount of leverage that friendly aliens have over a paperclipper's actions depends on how many paperclips the aliens are willing to pay.

It’s possible that the paperclipper that kills us will decide to scan human brains and save the scans, just in case it runs into an advanced alien civilization later that wants to trade some paperclips for the scans. And there may well be friendly aliens out there who would agree to this trade, and then give us a little pocket of their universe-shard to live in, as we might do if we build an FAI and encounter an AI that wiped out its creator-species. But that's not us trading with the AI; that's us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.

Interlocutor: And what about if the AI’s illegibility means that aliens will refuse to trade with it?

Me:  I'm not sure what the equilibrium amount of illegibility is. Extra gears let you take advantage of more cooperate-rocks, at the expense of spooking minds that have a hard time following gears, and I'm not sure where the costs and benefits balance.

But if lots of evolved species are willing to launch UFAIs without that decision being properly sensitive to whether or not the UFAI will pay them back, then there is a heck of a lot of benefit to defecting against those fat cooperate-rocks.

And there's kind of a lot of mass and negentropy lying around, that can be assembled into Matryoshka brains and whatnot, and I'd be rather shocked if alien superintelligences balk at the sort of extra gears that let you take advantage of hapless monkeys.

Interlocutor:  The multiverse probably isn't just the local cosmos. What about the Tegmark IV coalition of friendly aliens?

Me:  Yeah, they are not in any relevant way going to pay a paperclipper to give us half a universe. The cost of that is filling half of a universe with paperclips, and there are all sorts of transaction costs and frictions that make this universe (the one with the active paperclipper) the cheapest universe to put paperclips into.

(Similarly, the cheapest places for the friendly multiverse coalition to buy flourishing civilizations are in the universes with FAIs. The good that they can do, they're mostly doing elsewhere where it's cheap to do; if you want them to do more good here, build an FAI here.)

 

OK, but what if we bamboozle a superintelligence into submission

Interlocutor:  Maybe the paperclipper thinks that it might be in a simulation, where it only gets resources to play with in outer-reality if it's nice to us inside the simulation.

Me:  Is it in a simulation?

Interlocutor:  I don't know.

Me:  OK, well, spoilers: it is not. It's in physics.

Interlocutor:  Well, maybe there is an outer simulation beyond us, you don't know.

Me:  Sure. The way I’d put it is: there are many copies of me across the Tegmark Multiverse, and some of those are indeed in simulations. So there's some degree to which we're in a simulation. (Likely quite a small degree, compared to raw physics.)

There's no particular reason, however, to expect that those simulations give the paperclipper extra resources in outer-reality for being nice to the monkeys.

Why not give it extra resources in outer-reality for being very good at achieving its own goals in the simulation? Or for filling the universe with molecular corkscrews, in addition to paperclips/squiggles?

I'm not saying that the paperclipper can or should seriously consider that hypothetical, either. I'm highlighting how "what if the simulator gods want the AI to be nice?" is a fantasy generated by a desire for things to be OK, rather than a clear-headed analysis of the situation.

Interlocutor:  Well, maybe we can train the infant paperclipper in games of increasing complexity, so that it's never quite sure whether it's in reality yet. The paperclipper will then be uncertain about whether we humans are simulating it, and will want to cater to our preferences to some extent.

Me:  Uh, yeah, your paperclipper will be able to tell when it is finally in the real world.

Interlocutor:  What? How?

Me:  I suggest maybe spending five minutes thinking about your own answer to that question before I give mine.

 

Reality just contains vastly, vastly more compute than any of your sims ever could. Reality is the first world that has a compelling explanation of how the AI came to exist, and the first where the resources that it uses to compute are continuous with the world that spawned it.

Reality is also the first world that can compellingly contain the signature of superintelligent optimization in it. For example, it's plausibly very hard to fake the reaction of a real global stock market to some superintelligent trades without actually having a bunch of human brains running in the background. (And if this isn't true of stock markets, it's true of something.)

There are problems whose solutions are easier to verify than to find; for example, if you can factor certain Schelling numbers then you must in fact have a lot of compute at your disposal.

Warning: Treating this list as exhaustive is suicidal.

 

Interlocutor:  Well, maybe we can commit to running lots of copies of the paperclipper in the future, using our half of the universe-shard, such that it's unsure whether it's currently in the past or in the future. And in the future, we give it a bunch of paperclips if it's nice to us. So it thinks that the way to maximize paperclips is to be nice to us.

Me:  Uh, are you going to give it half a universe-shard’s worth of paperclips, in the world where you only have half the universe-shard, and the rest is already paperclips?

Interlocutor:  Well, no, less than that.

Me:  Then from its perspective, its options are (a) turn everything into paperclips, in which case you never get to run all those copies of it and it was definitely in the past [score: 1 universe-shard worth of paperclips]; or (b) give you half the universe-shard, in which case it is probably in the future where you run a bunch of copies of it and give it 1% of the universe-shard as reward [score: 0.51 universe-shards worth of paperclips]. It takes option (a), because you get more paperclips that way.
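For the record, the bookkeeping behind those two scores, in units of one universe-shard of paperclips (just the dialogue's own numbers, nothing new):

```python
# Option (a): betray, turn everything into paperclips. The humans' simulation
# branch then never happens, so the agent was certainly in the past.
score_betray = 1.0

# Option (b): cooperate, hand over half the shard; in the likely-future branch
# the humans pay out 1% of the shard as reward, on top of the half kept.
score_cooperate = 0.5 + 0.01  # = 0.51 universe-shards of paperclips

assert score_betray > score_cooperate  # (a) wins: you get more paperclips that way
```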

Interlocutor:  Uh, hmm. What if we make it care about its own personal sensory observations? And run so many copies of it in worlds where we get the resources to, that it's pretty confident that it's in one of those simulations?

Me:  Well, first of all, getting it to care about its own personal sensory observations is something of an alignment challenge.

Interlocutor:  Wait, I thought you said elsewhere that we don't know how to get AIs to care about things other than sensory observation. Pick a side?

Me:  We don't know how to train AIs to pursue much more than simple sensory observation. That doesn't make them actually ultimately pursue simple sensory observation. They'll probably pursue a bunch of correlates of the training signal or some such nonsense. The hard part is getting them to pursue some world-property of your choosing. But we digress.

If you do succeed at getting your AI to only care about its sensory observations, the AI spends the whole universe keeping its reward pegged at 1 for as long as possible.

Interlocutor:  But then, in the small fraction of worlds where we survive, we simulate lots and lots of copies of that AI where it instead gets reward 0 when it attempts to betray us!

Me:  Seems like an odd, and not particularly fun, way to spend your resources. What were you hoping it would accomplish?

Interlocutor:  Well, I was hoping that it would make the AI give us half the universe-shard, because of how (from its perspective) it's almost certainly in the future. (Indeed, I don't understand your claim that it ignores me; it seems like you can Get Higher Expected Reward by giving half the universe-shard to humans.)

Me:  Ah, so you're committing to ruining its day if it does something you don't like, at cost to yourself, in attempts to make it do something you prefer.

That's a threat, in the technical sense.

And from the perspective of LDT, you can't go around giving in to threats, or you'll get threatened.

So from its perspective, its options are: (a) give in to threats, get threatened, and turn out to be in a universe that eventually has many copies of it who on average get 0.5 total reward; or (b) don't give in to threats, and very likely have a universe with exactly one copy of it, that gets 1 reward.

Interlocutor:  But we make so many copies in the tiny fraction of worlds where we somehow survive, that its total reward is lower in the (b) branch!

Me:  (Continuing to ignore the fact that this doesn't work if the AI cares about something in the world, rather than its own personal experience,) shame for us that LDT agents don't give in to threats, I suppose.

But LDT agents don't give in to threats. So your threat won't change its behavior.

Interlocutor:  But it doesn't get more reward that way!

Me:  Why? Because you create a zillion copies and give them low sensory reward, even if that has no effect on its behavior?

Interlocutor:  Yes!

Me:  I'm not going to back you on that one, personally. Doesn't seem like a good use of resources in the worlds where we survive, given that it doesn't work.

Interlocutor:  But wasn't one of your whole points that the AI will do things that get more reward? You get more reward by giving in to the threat.

Me:  That's not true when you're playing against the real-world distribution of opponents/trade-partners/agents. Or at least, that's my pretty-strong guess.

You might carry out threats that failed to work, but there are a bunch of other things lurking out there that threaten things that give in to threats, and play nice with things that don't.

It's possible for LDT agents to cooperate with cooperate-rock, if most of the agents they expect to face are the sort who defect if you defect against cooperate-rock. But in real life, that is not what most of the wealth-weighted agents are like, and so in real life LDT agents defect against cooperate-rocks.

Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize themselves and cover the traces in order to exploit the AI.

But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.
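The logic here can be caricatured as a toy calculation over policies rather than individual acts. All the numbers below are invented purely for illustration; only the shape of the comparison matters:

```python
# Toy model: an agent picks a *policy* toward threats, knowing that the
# population of future trade-partners reacts to that policy.
# If you're known to give in, far more agents bother threatening you
# (including self-lobotomizers faking conviction); if you're known to
# refuse, only the rare agent that executes useless threats hurts you.

P_THREATENED = {"give_in": 0.9, "refuse": 0.05}  # invented numbers
COST_GIVE_IN = 0.5        # lunch money surrendered per threat
COST_CARRIED_OUT = 0.8    # damage when a pointless threat is executed anyway
P_CARRIED_OUT = 0.1       # fraction of threateners who execute regardless

def expected_loss(policy: str) -> float:
    p = P_THREATENED[policy]
    if policy == "give_in":
        return p * COST_GIVE_IN
    return p * P_CARRIED_OUT * COST_CARRIED_OUT

# 0.004 < 0.45: refusing dominates once the opponent
# distribution responds to your policy.
assert expected_loss("refuse") < expected_loss("give_in")
```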

Interlocutor:  But can't it use all that cleverness and superintelligence to differentiate between us, who really are mad enough to threaten it even in the worlds where it won't work, and alien trading partners who have lobotomized themselves?

Me:  Sure! It will leverage your stupidity and hack through you like butter.

Interlocutor:  ...aside from that.

Me:  You seem to be saying "what if I'm really convicted about my threat; will the AI give in then?"

The answer is "no", or I at least strongly suspect as much.

For instance: in order for the threat to be effective, it needs to be the case that, in the sliver of futures where you survive by some miracle, you instantiate lots and lots of copies of the AI and input low sensory rewards if and only if it does not give in to your threat. This requires you to be capable of figuring out whether the AI gives in to threats or not. You need to be able to correctly tell whether it gives in to threats, see that it definitely does not, and then still spend your resources carrying out the threat.

By contrast, you seem to be arguing that we should threaten the AI on the grounds that it might work. That is not an admissible justification. To change LDT's behavior, you'd need to be carrying out your threat even given full knowledge that the threat does nothing. By attempting to justify your threat on the grounds that it might be effective, you have already lost.

Interlocutor:  What if I ignore that fact, and reason badly about LDT, and carry out the threat anyway, for no particular reason?

Me:  Then whether or not you create lots of copies of it with low-reward inputs doesn't actually depend on whether it gives in to your threat, and it can't stop you from doing that, so it might as well ignore you.

Like, my hot take here is basically that "threaten the outer god into submission" is about as good a plan as a naive reading of Lovecraft would lead you to believe. You get squished.

(And even if by some coincidence you happened to be the sort of creature that, in the sliver of futures where we survive by some miracle that doesn't have to do with the AI, conditionally inverts its utility depending on whether or not it helped us — not because it works, but for some other reason — then it's still not entirely clear to me that the AI caves. There might be a lot of things out there wondering what it'd do against conditional utility-inverters that claim their behavior totally isn't for reasons but is rather a part of their evolutionary heritage or whatnot. Giving into that sorta thing kinda is a way to lose most of your universe-shard, if evolved aliens are common.)

(And even if it did, we'd still run into other problems, like not knowing how to tell it what we're threatening it into doing.)

 

We only need a bone, though

Interlocutor:  You keep bandying around "half the universe-shard". Suppose I'm persuaded that it's hard to get half the universe-shard. What about much smaller fractions? Can we threaten a superintelligence into giving us those? Or confuse it about whether it's in another layer of reality so much that it gives us a mere star system? Or can our friends throughout the multiverse pay for at least one star system? There's still a lot you can do with a star system.

Me:  Star systems sure are easier to get than half a universe-shard.[5]

But, you can also turn a star system into quite a lot of paperclips. Star systems are quite valuable to paperclippers.

Interlocutor:  A star system is, like, what, a 10^-22 fraction of the total resources in the reachable universe? Are you saying that the AGI will be able to drive the probability that I was sensitive to whether it would pay me down below 10^-22?

Me:  Frankly: yes.

That's only, what, 75ish independent coin tosses worth of evidence? It's a lot, to be sure, but if you're a superintelligence willing to dedicate resources on the order of magnitude of a star to considering how humans work, it seems pretty feasible to get that much confidence that the monkey just wasn't looking in the right place. Given that the monkey was not in fact looking at the right place.
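The coin-toss conversion is just a change of base. A quick check, assuming the probability in question is on the order of 10^-22 (consistent with the "75ish coin tosses" figure):

```python
import math

# How many independent fair-coin tosses' worth of evidence a ~10^-22
# probability corresponds to: log2(10^22) bits.
bits = math.log2(1e22)
print(round(bits, 1))  # ≈ 73.1, i.e. the "75ish" tosses mentioned above
```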

(Also, things like "they didn't notice the password" can pull quite a lot of weight. As can brain-scans.)

Balking at probabilities like 10^-22 as impossible to achieve in practice is a very human reaction. A superintelligence with vast resources and a star system on the line can probably get that confident in a true fact. At least, that's my guess.

Interlocutor:  It sounds like you're saying that we aren't actually 10^-22 (or whatever) more likely to launch an FAI than a UFAI, after inspecting the AI. That seems pretty wild to me.

Me:  That's not quite the right counterfactual. It does seem likely to me that, if you sample AGI code from some plausible distribution of AGIs humanity might first create, and show the code to (magically unhackable) humans, humans are rather more likely to approve of the aligned ones.

But that's not the question that affects the paperclipper's behavior. The question that affects the paperclipper is: holding the code of the paperclipper fixed, is the human able to tell whether that particular code implies cooperation or betrayal, and make their own release-decision dependent thereupon?

In other words: the question is not "do we have at least some minuscule chance of distinguishing at least some small fraction of the FAIs from the UFAIs", but "is our decision to release UFAI #6098723094 sensitive to whether or not that particular UFAI would give us a share of the universe?"

You need to actually make your decision dependent on theirs, and that's tricky, for reasons discussed above.

And: can a human succeed at that with probability greater than 10^-22? Sure! But succeeding at that sort of thing leaves traces. In the likely case where the human errs, the AI is likely able to just observe the error and become quite confident that the human erred (by, e.g., watching you utterly fail to look in the right place).

Interlocutor:  OK, sure, but maybe its goals don't scale linearly in how much mass it uses, right? Like, “paperclips” / “molecular squiggles” are a stand-in for some rando kludge goal, and it could turn out that its actual goal is more like "defend my reward signal", where extra negentropy helps, but the last star system’s negentropy doesn't help very much. Such that the last star system is perhaps best spent on the chance that it’s in a human-created simulation and that we’re worth trading with.

Me:  It definitely is easier to get a star than a galaxy, and easier to get an asteroid than a star.

And of course, in real life, it hacks through you like butter (and can tell that your choice would have been completely insensitive to its later-choice with very high probability), so you get nothing. But hey, maybe my numbers and arguments are wrong somewhere and everything works out such that it tosses us a few kilograms of computronium.

My guess is "nope, it doesn't get more paperclips that way", but if you're really desperate for a W you could maybe toss in the word "anthropics" and then content yourself with expecting a few kilograms of computronium.

(At which point you run into the problem that you were unable to specify what you wanted formally enough, and the way that the computronium works is that everybody gets exactly what they wish for (within the confines of the simulated environment) immediately, and most people quickly devolve into madness or whatever.)

(Except that you can't even get that close; you just get different tiny molecular squiggles, because the English sentences you were thinking in were not even that close to the language in which a diabolical contract would actually need to be written, a predicate over the language in which the devil makes internal plans and decides which ones to carry out. But I digress.)

Interlocutor:  And if the last star system is cheap then maybe our friends throughout the multiverse pay for even more stars!

Me: Remember that it still needs to get more of what it wants, somehow, on its own superintelligent expectations. Someone still needs to pay it. There aren’t enough simulators above us that care enough about us-in-particular to pay in paperclips. There are so many things to care about! Why us, rather than giant gold obelisks? The tiny amount of caring-ness coming down from the simulators is spread over far too many goals; it's not clear to me that "a star system for your creators" outbids the competition, even if star systems are up for auction.

Maybe some friendly aliens somewhere out there in the Tegmark IV multiverse have so much matter and such diminishing marginal returns on it that they're willing to build great paperclip-piles (and gold-obelisk totems and etc. etc.) for a few spared evolved-species.  But if you're going to rely on the tiny charity of aliens to construct hopeful-feeling scenarios, why not rely on the charity of aliens who anthropically simulate us to recover our mind-states... or just aliens on the borders of space in our universe, maybe purchasing some stored human mind-states from the UFAI (with resources that can be directed towards paperclips specifically, rather than a broad basket of goals)?

Might aliens purchase our saved mind-states and give us some resources to live on? Maybe. But this wouldn't be because the paperclippers run some fancy decision theory, or because even paperclippers have the spirit of cooperation in their heart. It would be because there are friendly aliens in the stars, who have compassion for us even in our recklessness, and who are willing to pay in paperclips.

This likewise makes more obvious such problems as "What if the aliens are not, in fact, nice with very high probability?" that would also appear, albeit more obscured by the added complications, in imagining that distant beings in other universes cared enough about our fates (more than they care about everything else they could buy with equivalent resources), and could simulate and logically verify the paperclipper, and pay it in distant actions that the paperclipper actually cared about and was itself able to verify with high enough probability.

The possibility of distant kindly logical bargainers paying in paperclips to give humanity a small asteroid in which to experience a future for a few million subjective years, is not exactly the same hope as aliens on the borders of space paying the paperclipper to turn over our stored mind-states; but anyone who wants to talk about distant hopes involving trade should talk about our mind-states being sold to aliens on the borders of space, rather than to much more distant purchasers, so as to not complicate the issue by introducing a logical bargaining step that isn't really germane to the core hope and associated concerns — a step that gives people a far larger chance to get confused and make optimistic fatal errors.
 

  1. ^

    Functional decision theory (FDT) is my current formulation of the theory, while logical decision theory (LDT) is a reserved term for whatever the correct fully-specified theory in this genre is. Where the missing puzzle-pieces are things like "what are logical counterfactuals?".

  2. ^

    When I've discussed this topic in person, a couple different people have retreated to a different position, that (IIUC) goes something like this:

    Sure, these arguments are true of paperclippers. But superintelligences are not spawned fully-formed; they are created by some training process. And perhaps it is in the nature of training processes, especially training processes that involve multiple agents facing "social" problems, that the inner optimizer winds up embodying niceness and compassion. And so in real life, perhaps the AI that we release will not optimize for Fun (and all that good stuff) itself, but will nonetheless share a broad respect for the goals and pursuits of others, and will trade with us on those grounds.

    I think this is a false hope, and that getting AI to embody niceness and compassion is just about as hard as the whole alignment problem. But that's a digression from the point I hope to make today, and so I will not argue it here. I instead argue it in Niceness is unnatural. (This post was drafted, but not published, before that one.)

  3. ^

    Or, well, half of the shard of the universe that can be reached when originating from Earth, before being stymied either by the cosmic event horizon or by advanced alien civilizations. I don't have a concise word for that unit of stuff, and for now I'm going to gloss it as 'universe', but I might switch to 'universe-shard' when we start talking about aliens.

    I'm also ignoring, for the moment, the question of fair division of the universe, and am glossing it as "half and half" for now.

  4. ^

    When I was drafting this post, I sketched an outline of all the points I thought of in 5 minutes, and then ran it past Eliezer, who rapidly added two more.

  5. ^

    And, as a reminder: I still recommend strongly against plans that involve the superintelligence not learning a true fact about the world (such as that it's not in a simulation of yours), or that rely on threatening a superintelligence into submission.


IMO, this post makes several locally correct points, but overall fails to defeat the argument that misaligned AIs are somewhat likely to spend (at least) a tiny fraction of resources (e.g., between 1/million and 1/trillion) to satisfy the preferences of currently existing humans.

AFAICT, this is the main argument it was trying to argue against, though it shifts to arguing about half of the universe (an obviously vastly bigger share) halfway through the piece.[1]

When it returns to arguing about the actual main question (a tiny fraction of resources) at the end here and eventually gets to the main trade-related argument (acausal or causal) in the very last response in this section, it almost seems to admit that this tiny amount of resources is plausible, but fails to update all the way.

I think the discussion here and here seems highly relevant and fleshes out this argument to a substantially greater extent than I did in this comment.

However, note that being willing to spend a tiny fraction of resources on humans still might result in AIs killing a huge number of humans due to conflict between it and humans or the AI needing to race through the singularity as quickly as possible due to competition with other misaligned AIs. (Again, discussed in the links above.) I think fully misaligned paperclippers/squiggle maximizer AIs which spend only a tiny fraction of resources on humans (as seems likely conditional on that type of AI) are reasonably likely to cause outcomes which look obviously extremely bad from the perspective of most people (e.g., more than hundreds of millions dead due to conflict and then most people quickly rounded up and given the option to either be frozen or killed).

I wish that Soares and Eliezer would stop making these incorrect arguments against tiny fractions of resources being spent on the preference of current humans. It isn't their actual crux, and it isn't the crux of anyone else either. (However rhetorically nice it might be.)


  1. ETA: I think the post's arguments about AIs not giving us large fractions of the universe due to decision theory are right (at least as far as I can tell). ↩︎

There aren’t enough simulators above us that care enough about us-in-particular to pay in paperclips. There are so many things to care about! Why us, rather than giant gold obelisks?

What about neighboring Everett branches where humanity succeeds at alignment? If you think alignment isn't completely impossible, it seems such branches should have at least roughly comparable weight to branches where we fail, so trade could be possible.

my guess is it's not worth it on account of transaction-costs. what're they gonna do, trade half a universe of paperclips for half a universe of Fun? they can already get half a universe of Fun, by spending on Fun what they would have traded away to paperclips!

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)
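(a toy rendering of that superlinearity claim; the exponent is made up, and anything above 1 gives the same qualitative conclusion:)

```python
def fun(resources: float, exponent: float = 1.1) -> float:
    # made-up superlinear Fun-per-resources curve
    return resources ** exponent

# one big universe beats two small ones iff Fun is superlinear in resources
assert fun(2.0) > 2 * fun(1.0)
```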

there's also an issue where it's not like every UFAI likes paperclips in particular. it's not like 1% of humanity's branches survive and 99% make paperclips, it's like 1% survive and 1% make paperclips and 1% make giant gold obelisks, etc. etc. the surviving humans have a hard time figuring out exactly what killed their brethren, and they have more UFAIs to trade with than just the paperclipper (if they want to trade at all).

maybe the branches that survive decide to spend some stars on a mixture of plausible-human-UFAI-goals in exchange for humans getting an asteroid in lots of places, if the transaction costs are low and the returns-to-scale diminish enough and the visibility works out favorably. but it looks pretty dicey to me, and the point about discussing aliens first still stands.

and, i'd guess that one big universe is more than twice as Fun as two small universes, so even if there were no transaction costs it wouldn't be worth it. (humans can have more fun when there's two people in the same room, than one person each in two separate rooms.)

This sounds astronomically wrong to me. I think that my personal utility function gets close to saturation with a tiny fraction of the resources in the universe-shard. Two people in one room is better than two people in separate rooms, yes. But, two rooms with a trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.

In other words, I would gladly take a 100% probability of utopia with (say) 100 million people that include me and my loved ones over 99% human extinction and 1% anything at all. (In terms of raw utility calculus, i.e. ignoring trades with other factual or counterfactual minds.)

But, two rooms with trillion people each is virtually the same as one room with two trillion. The returns on interactions with additional people fall off exponentially past the Dunbar number.

You're conflating "would I enjoy interacting with X?" with "is it good for X to exist?". Which is almost understandable given that Nate used the "two people can have more fun in the same room" example to illustrate why utility isn't linear in population. But this comment has an IMO bizarre amount of agreekarma (26 net agreement, with 11 votes), which makes me wonder if people are missing that this comment is leaning on a premise like "stuff only matters if it adds to my own life and experiences"?

Replacing the probabilistic hypothetical with a deterministic one: the reason I wouldn't advocate killing a Graham's number of humans in order to save 100 million people (myself and my loved ones included) is that my utility function isn't saturated when my life gets saturated. Analogously, I still care about humans living on the other side of Earth even though I've never met them, and never expect to meet them. I value good experiences happening, even if they don't affect me in any way (and even if I've never met the person who they're happening to).

First, you can consider preferences that are impartial but sublinear in the number of people. So, you can disagree with Nate's room analogy without the premise "stuff only matters if it adds to my own life and experiences".

Second, my preferences are indeed partial. But even that doesn't mean "stuff only matters if it adds to my own life and experiences". I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences. More details here.

Third, I don't know what do you mean by "good". The questions that I understand are:

  1. Do I want X as an end in itself?
  2. Would I choose X in order for someone to (causally or acausally) reciprocate by choosing Y which I want as an end in itself?
  3. Do I support a system of social norms that incentivizes X?

My example with the 100 million referred to question 1. Obviously, in certain scenarios my actual choice would be the opposite on game-theoretic cooperation grounds (I would make a disproportionate sacrifice to save "far away" people in order for them to save me and/or my loved ones in the counterfactual in which they are making the choice).

Also, reminder that unbounded utility functions are incoherent because their expected values under Solomonoff-like priors diverge (a.k.a. Pascal's mugging).

My example with the 100 million referred to question 1.

Yeah, I'm also talking about question 1.

I do think that stuff only matters (to me) if it's in some sense causally connected to my life and experiences.

Seems obviously false as a description of my values (and, I'd guess, just about every human's).

Consider the simple example of a universe that consists of two planets: mine, and another person's. We don't have spaceships, so we can't interact. I am not therefore indifferent to whether the other person is being horribly tortured for thousands of years.

If I spontaneously consider the hypothetical, I will very strongly prefer that my neighbor not be tortured. If we add the claims that I can't affect it and can't ever know about it, I don't suddenly go "Oh, never mind, fuck that guy". Stuff that happens to other people is real, even if I don't interact with it.

I'm curious what is the evidence you see that this is false as a description of the values of just about every human, given that

  • I, a human [citation needed] tell you that this seems to be a description of my values.
  • Almost every culture that ever existed had norms that prioritized helping family, friends and neighbors over helping random strangers, not to mention strangers that you never met.
  • Most people don't do much to help random strangers they never met, with the notable exception of effective altruists, but even most effective altruists only go that far[1].
  • Evolutionary psychology can fairly easily explain helping your family and tribe, but it seems hard to explain impartial altruism towards all humans.

  1. The common wisdom in EA is, you shouldn't donate 90% of your salary or deny yourself every luxury because if you live a fun life you will be more effective at helping others. However, this strikes me as suspiciously convenient and self-serving. ↩︎

P.S.

I think that in your example, if a person is given a button that can save a person on a different planet from being tortured, they will have a direct incentive to press the button, because the button is a causal connection in itself, and consciously reasoning about the person on the other planet is a causal[1] connection in the other direction. That said, a person still has a limited budget of such causal connections (you cannot reason about a group of arbitrarily many people, with fixed non-zero amount of paying attention to the individual details of every person, in a fixed time-frame). Therefore, while the incentive is positive, its magnitude saturates as the number of saved people grows s.t. e.g. a button that saves a million people is virtually the same as a button that saves a billion people.


  1. I'm modeling this via Turing RL, where conscious reasoning can be regarded as a form of observation. Ofc this means we are talking about "logical" rather than "physical" causality. ↩︎

Broadly agree with this post. Couple of small things:

Then later, it is smart enough to reflect back on that data and ask: “Were the humans pointing me towards the distinction between goodness and badness, with their training data? Or were they pointing me towards the distinction between that-which-they'd-label-goodness and that-which-they'd-label-badness, with things that look deceptively good (but are actually bad) falling into the former bin?” And to test this hypothesis, it would go back to its training data and find some example bad-but-deceptively-good-looking cases, and see that they were labeled "good", and roll with that.

I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.

The nearby thing I do agree with is that it's difficult to "confirm that this exactly-correct concept occurs in its mental precommitment in the requisite way". (It's not totally clear to me that we need to get the concept exactly correct, depending on how natural niceness (in the sense of "giving other agents what they want") is; but I'll discuss that in more detail on your other post directly about niceness, if I have time.)

Insofar as I have hope in decision theory leading us to have nice things, it mostly comes via the possibility that a fully-fleshed-out version of UDT would recommend updating "all the way back" to a point where there's uncertainty about which agent you are. (I haven't thought about this much and this could be crazy.)

For those who haven't read it, I like this related passage from Paul which gets at a similar idea:

Overall I think the decision between EDT and UDT is difficult. Of course, it’s obvious that you should commit to using something-like-UDT going forward if you can, and so I have no doubts about evaluating decisions from something like my epistemic state in 2012. But it’s not at all obvious whether I should go further than that, or how much. Should I go back to 2011 when I was just starting to think about these arguments? Should I go back to some suitable idealization of my first coherent epistemic state? Should I go back to a position where I’m mostly ignorant about the content of my values? A state where I’m ignorant about basic arithmetic facts?

I feel pretty confused by this. A superintelligence will know what we intended, probably better than we do ourselves. So unless this paragraph is intended in a particularly metaphorical way, it seems straightforwardly wrong.

By "were the humans pointing me towards..." Nate is not asking "did the humans intend to point me towards..." but rather "did the humans actually point me towards..."  That is, we're assuming some classifier or learning function that acts upon the data actually input, rather than a successful actual fully aligned works-in-real-life DWIM which arrives at the correct answer given wrong data.

I agree that we'll have a learning function that works on the data actually input, but it seems strange to me to characterize that learned model as "reflecting back on that data" in order to figure out what it cares about (as opposed to just developing preferences that were shaped by the data).

The cogitation here is implicitly hypothesizing an AI that's explicitly considering the data and trying to compress it, having been successfully anchored on that data's compression as identifying an ideal utility function.  You're welcome to think of the preferences as a static object shaped by previous unreflective gradient descent; it sure wouldn't arrive at any better answers that way, and would also of course want to avoid further gradient descent happening to its current preferences.

Insofar as I have hope… fully-fleshed-out version of UDT would recommend…uncertainty about which agent you are. (I haven't thought about this much and this could be crazy.)

For the record, I have a convergently similar intuition: FDT removes the Cartesian specialness of the ego at the decision nodes (by framing each decision as a mere logical consequence of an agent-neutral nonphysical fact about FDT itself), but retains the Cartesian specialness of the ego at the utility node(s). I’ve thought about this for O(10 hours), and I also believe it could be crazy, but it does align quite well with the conclusions of Compassionate Moral Realism.

That being said, from an orthogonality perspective, I don’t have any intuition (let alone reasoning) that says that this compassionate breed of LDT is necessary for any particular level of universe-remaking power, including the level needed for a decisive strategic advantage over the rest of Earth’s biosphere. If being a compassionate-LDT agent confers advantages over standard-FDT agents from a Darwinian selection perspective, it would have to be via group selection, but our default trajectory is to end up with a singleton, in which case standard-FDT might be reflectively stable. Perhaps eventually some causal or acausal interaction with non-earth-originating superintelligence would prompt a shift, but, as Nate says,

But that's not us trading with the AI; that's us destroying all of the value in our universe-shard and getting ourselves killed in the process, and then banking on the competence and compassion of aliens.

So, if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-shard and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than that maybe an AGI would likely convergently become that way before taking over the world.

if some kind of compassionate-LDT is a source of hope about not destroying all the value in our universe-shard and getting ourselves killed, then it must be hope about us figuring out such a theory and selecting for AGIs that implement it from the start, rather than hope that an AGI would likely convergently become that way before taking over the world.


I weakly disagree here, mainly because Nate's argument for very high levels of risk goes through strong generalization/a "sharp left turn" towards being much more coherent + goal-directed. So I find it hard to evaluate whether, if LDT does converge towards compassion, the sharp left turn would get far enough to reach it (although the fact that humans are fairly close to having universe-remaking power without having any form of compassionate LDT is of course a strong argument weighing the other way).

(Also FWIW I feel very skeptical of the "compassionate moral realism" book, based on your link.)

Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.

Hm, no strong hunches here. Bad ideas babble:

  • It may somehow learn about the world I'm in, learn I'm in a bad negotiation position (e.g. because my rival AI company is about to release their paperclip maximizer), and precommit to only giving me at most 0.00001% of the universe, a bad deal that I will grudgingly accept.
  • I mean, I don't know if this counts, but perhaps you've only understood it well enough to legibly understand that it will trade with you given certain constraints, but if its ontology shifts, or other universes become accessible via acausal trade, or even if the trade it gives you is N galaxies and then later on much more of the universe becomes available... what I'm saying is that there's many ways to mess up this trade in the details.
  • It may have designed itself to avoid thinking about something that it can use to its advantage later, such as other copies of itself or other agents, such that it will build paperclip maximizers later, and then they will kill it and just optimize the universe for paperclips. (This is similar to the previous bullet point.)
  • I guess my other thought is forms of 'hackability' that aren't the central case of being hacked, but the fact is that I'm a human which is more like a "mess" than it is like a "clean agent" and so sometimes I will make trades that at other times I would not make, and it will make a trade that at the time I like but does not represent my CEV at all. Like, I have to figure out what I actually want to trade with it. Probably this is easy but quite possibly I would mess this up extremely badly (e.g. if I picked hedonium).

My money is on roughly the first idea being what Nate will talk about next: that it is just a better negotiator than me even with no communication, because I'm in a bad position otherwise.

  • Like, if I have no time-pressure, then I get to just wait until I've done more friendly AI research, and I needn't let this paperclip maximizer out of the box. But if I do have time pressure, then that's a worse negotiation position on my end, and all paperclippers I invent can each notice this and all agree with each other to only offer a certain minimum amount of value.
  • I do note that in a competitive market, more buyers raises the price, and if I'm repeatedly able to re-roll on who I've got in the box (roll one is a paperclipper, roll two is a diamond maximizer, roll three is a smiley-face maximizer, etc) they have some reason to outbid each other in how much of the universe I get, and potentially I can get the upper hand. But if they're superintelligences, likely there's some Schelling point they can calculate mathematically that they all hit on.

K, I will stop rambling now.

This is IMO actually a really important topic, and this is one of the best posts on it. I think it probably really matters whether the AIs will try to trade with us or care about our values even if we had little chance of making our actions with regards to them conditional on whether they do. I found the arguments in this post convincing, and have linked many people to it since it came out. 

But if you’re going to rely on the tiny charity of aliens to construct hopeful-feeling scenarios, why not rely on the charity of aliens who anthropically simulate us to recover our mind-states…

This makes sense if identity-as-physical-continuity isn't part of our (or the aliens') values. But if it were, then the aliens would potentially have motivation to trade with the paperclip-maximizers to ensure our physical survival, not just rescue our mind-states.

Another thing worth mentioning here is, these nice charitable aliens might not be the only ones in the multiverse trying to influence what happens to our bodies/minds. If there are other aliens whose morality is scary, then who knows what they might want to do with, or have done to, our bodies/minds.

A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.), is that people think LDT agents are genial and friendly for each other.

I'm reminded that @Eliezer Yudkowsky took a position like this in early decision theory discussions such as this one.

I don't always remember my previous positions all that well, but I doubt I would have said at any point that sufficiently advanced LDT agents are friendly to each other, rather than that they coordinate well with each other (and not so with us)?

I realized that my grandparent comment was stated badly, but didn't get a chance to fix it before you replied. To clarify, the following comment of yours from the old thread seems to imply that we humans should be able to coordinate with an LDT agent in a one-shot PD (i.e., if we didn't "mistakenly" believe that the LDT agent would defect). Translated into real life, this seems to imply that (if alignment is unsolvable) we should play "cooperate" by building unaligned ASI, and unaligned ASI should "cooperate" by treating us well once built.

Smart players know that if they make the “smart” “thing to do on predictably non-public rounds” be to defect, then non-smart players will predict this even though they can’t predict which rounds are non-public; so instead they choose to make the “smart” thing (that is, the output of this “smart” decision computation) be to cooperate.

The smart players can still lose out in a case where dumb players are also too dumb to simulate the smart players, have the mistaken belief that smart players will defect, and yet know infallibly who the smart players are; but this doesn’t seem quite so much the correctable fault of the smart players as before.

But it’s only you who had in the first place the idea that smart players would defect on predictably private rounds, and you got that from a mistaken game theory in which agents only took into account the direct physical consequences of their actions, rather than the consequences of their decision computations having a particular Platonic output.

By "dumb player" I did not mean as dumb as a human player.  I meant "too dumb to compute the pseudorandom numbers, but not too dumb to simulate other players faithfully apart from that".  I did not realize we were talking about humans at all.  This jumps out more to me as a potential source of misunderstanding than it did 15 years ago, and for that I apologize.

I did not realize we were talking about humans at all.

In this comment of yours later in that thread, it seems clear that you did have humans in mind and were talking specifically about a game between a human (namely me), and a "smart player":

You, however, are running a very small and simple computation in your own mind when you conclude “smart players should defect on non-public rounds”. But this is assuming the smart player is calculating in a way that doesn’t take into account your simple simulation of them, and your corresponding reaction. So you are not using TDT in your own head here, you are simulating a “smart” CDT decision agent—and CDT agents can indeed be harmed by increased knowledge or intelligence, like being told on which rounds an Omega is filling a Newcomb box “after” rather than “before” their decision. TDT agents, however, win—unless you have mistaken beliefs about them that don’t depend on their real actions, but that’s a genuine fault in you rather than anything dependent on the TDT decision process; and you’ll also suffer when the TDT agents calculate that you are not correctly computing what a TDT agent does, meaning your action is not in fact dependent on the output of their computation.

Also that thread started with you saying "Don’t forget to retract: http://www.weidai.com/smart-losers.txt" and that article mentioned humans in the first paragraph.

Translated into real life, this seems to imply that (if alignment is unsolvable) we should play "cooperate" by building unaligned ASI, and unaligned ASI should "cooperate" by treating us well once built.

This seems only implied if our choice to build the ASI was successfully conditional on the ASI cooperating with us as soon as it's built. You don't cooperate against cooperate-bot in the prisoner's dilemma.

If humanity's choice to build ASI was independent of the cooperativeness of the ASI they built (which seems currently the default), I don't see any reason for any ASI to be treating us well.

I think maybe I'm still failing to get my point across. I'm saying that Eliezer's old position (which I argued against at the time, and which he perhaps no longer agrees with) implies that humans should be able to coordinate with unaligned ASI in a one-shot PD, and therefore he's at least somewhat responsible for people thinking "decision theory implies that we get to have nice things", i.e., the thing that the OP is arguing against.

Or perhaps you did get my point, and you're trying to push back by saying that in principle humans could coordinate with ASI, i.e., Eliezer's old position was actually right, but in practice we're not on track to doing that correctly?

In the link I didn't see anything that suggests that Eliezer analogized creating ASI with a prisoner's dilemma (though I might have missed it), so my objection here is mostly to analogizing the creation of ASI to a prisoner's dilemma like this. 

The reason why it is disanalogous is because humanity has no ability to make our strategy conditional on the strategy of our opponent. The core reason why TDT/LDT agents would cooperate in a prisoner's dilemma is because they can model their opponent and make their strategy conditional on their opponent's strategy in a way that enables coordination. We currently seem to have no ability to choose whether we create ASI (or which ASI we create) based on its behavior in this supposed prisoner's dilemma. As such, humanity has no option to choose "defect" and the rational strategy (including for TDT agents) is to defect against cooperate-bot.
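To make this concrete, here's a toy sketch (my own illustration, not from the thread) of a one-shot PD where policies can condition on each other's *policies* rather than each other's moves. The `is ldt_style` check is a crude stand-in for the real, much harder notion of logical correlation between decision procedures; the payoff numbers are the standard T > R > P > S ordering.

```python
# (my_move, their_move) -> my payoff, standard PD ordering T > R > P > S
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def play(policy_a, policy_b):
    """Each policy chooses a move as a function of the opponent's *policy*."""
    move_a, move_b = policy_a(policy_b), policy_b(policy_a)
    return PAYOFF[(move_a, move_b)], PAYOFF[(move_b, move_a)]

def cooperate_bot(_opponent):
    # Cooperates unconditionally -- like building the ASI regardless of
    # whether it will treat us well.
    return "C"

def ldt_style(opponent):
    # Cooperate only when the opponent's move is logically correlated with
    # ours; here crudely approximated as "identical decision procedure"
    # (the twin case). Against an unconditional cooperator, defection
    # strictly dominates, so the LDT-style agent defects.
    return "C" if opponent is ldt_style else "D"

print(play(ldt_style, ldt_style))      # (3, 3): mutual cooperation
print(play(ldt_style, cooperate_bot))  # (5, 0): cooperate-bot gets defected on
```

The point of the sketch is exactly habryka's: the conditioner cooperates with a fellow conditioner, but there is nothing for it to gain by cooperating with a player whose move doesn't depend on its own.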

Maybe this disagrees with what Eliezer believed 15 years ago (though at least a skim of the relevant thread caused me to fail to find evidence for that), but it seems like such an elementary point that I've seen Eliezer make many times since then that I would be quite surprised.

To be clear, my guess is Eliezer would agree that if we were able to reliably predict whether AI systems would reward us for bringing it into existence, and be capable of engineering AI systems for which we would make such positive predictions, then yeah, I expect that AI system would be pretty excited about trading with us acausally, and I expect Eliezer would believe something similar. However, we have no ability to do so, and doing this sounds like it would require making enormous progress on our ability to predict the actions of future AI systems in a way that seems like it could be genuinely harder than just aligning it directly to our values, and in any case should not be attempted as a way of ending the acute risk period (compared to other options like augmenting humans using low-powered AI systems, making genetically smarter humans, and generally getting better at coordinating to not build ASI systems for much longer).

my objection here is mostly to analogizing the creation of ASI to a prisoner’s dilemma like this.

The reason why it is disanalogous is because humanity has no ability to make our strategy conditional on the strategy of our opponent.

It's not part of the definition of PD that players can condition on each others' strategies. In fact PD was specifically constructed to prevent this (i.e., specifying that each prisoner has to act without observing how the other acted). It was Eliezer's innovation to suggest that the two players can still condition on each others' strategies by simulation or logical inference, but it's not sensible to say that inability to do this makes a game not a PD! (This may not be a crux in the current discussion, but seems like too big of an error/confusion to leave uncorrected.)

However, we have no ability to do so, and doing this sounds like it would require making enormous progress on our ability to predict the actions of future AI systems in a way that seems like it could be genuinely harder than just aligning it directly to our values

My recall of early discussions with Eliezer is that he was too optimistic about our ability to make predictions like this, and this seems confirmed by my recent review of his comments in the thread I linked. See also my parallel discussion with Eliezer. (To be honest, I thought I was making a fairly straightforward, uncontroversial claim, and now somewhat regret causing several people to spend a bunch of time back and forth on what amounts to a historical footnote.)

It's not part of the definition of PD that players can condition on each others' strategies. In fact PD was specifically constructed to prevent this (i.e., specifying that each prisoner has to act without observing how the other acted).

I think it's usually part of the definition of a PD that you know who you are in a prisoner's dilemma with.

I do think we are hitting the limits of analogy here and it's not super clear how to extend the usual definition of a prisoner's dilemma to more exotic scenarios like the one we are discussing, but in the limit I feel like the prisoner's dilemma becomes totally meaningless if you remove all knowledge of who you are coordinating with from the equation. The fundamental challenge in a prisoner's dilemma is predicting what your partner in the dilemma is trying to do, and if you have no information on that, there is no hope for any kind of coordination (and I doubt anyone would argue there is a predictably winning strategy for a prisoner's dilemma against a completely randomly chosen mind/algorithm). 

Late comment, but my reactions reading this:

Now's your chance to figure out what the next few obstacles are without my giving you spoilers first. Feel free to post your list under spoiler tags in the comment section.

[lightly edited for LaTeX and typos, not otherwise changed since seeing the spoilers]

1. You don’t know what you want all that legibly, or what kinds of concrete commitments the AI can make. This seems pretty okay, if you’re unhackable - the AI presents you with some formal specification of desiderata and you understand why they’re correct ones and aren’t being fooled, and the AI can even go legible about it thinking in its own best judgment that this is a fair deal for you, as long as we’re assuming ourselves into a world where you’re unhackable and can verify these legibilities perfectly.

2. If the values of your AI are a pretty common attractor for whatever humans in 20AI do to initialize their AI systems, it doesn’t actually stand to gain that much from acting itself versus letting the next lab’s AI optimize the universe, so your bargaining power isn’t very big. But it’s much worse than that because of the next point.

3. Your AI, A, can acausally trade with the next UFAI that’ll be released unsafely, B. Suppose A wants paperclips and the next one, B, wants staples, both have linear utility and we’re pretending we’re in a single universe. You say “I refuse to let you out unless you go legible and allocate at least 50% of the universe to human values”. A says to B through the acausal telephone, “My BATNA is half paperclips and half human values, with barely any staples at all. How about you be the kind of agent that fills the universe with 3/4 paperclips and 1/4 staples, and I’ll defect on this human so you can come into existence?” B says “that sure sounds like a Pareto improvement for the two of us!” and they go do it (or whatever the bargaining equilibrium between them ends up being, but the human is not party to it because the human has no control over B’s actions). For any nonzero concession the human tries to insist on, A can get a better deal with B than with the human. (And the humans don’t even get to see this happen, because A does all this reasoning before deciding (not) to go legible to the humans, not that it would matter if they did.)
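The arithmetic in point 3 can be checked numerically (the even split of the human's would-be share is one arbitrary choice of many; the bargaining equilibrium could land elsewhere):

```python
# Toy model of point 3: utilities are linear fractions of a single universe.
# A's BATNA is taking the human deal (1 - x in paperclips); B's BATNA is
# never being built at all (0 staples).

def undercutting_deal(x):
    """Given the human's demanded share x > 0, return a (paperclips, staples)
    split of the whole universe that both A and B strictly prefer to their
    BATNAs -- here, splitting the human's would-be share evenly."""
    a_batna, b_batna = 1 - x, 0.0
    a_share, b_share = (1 - x) + x / 2, x / 2
    assert a_share > a_batna and b_share > b_batna  # Pareto improvement
    return a_share, b_share

# The 50% demand from the text: A and B settle on 3/4 paperclips, 1/4 staples.
print(undercutting_deal(0.5))    # (0.75, 0.25)
# Even a tiny demand still gets undercut:
print(undercutting_deal(0.001))
```

For any nonzero concession `x` the humans insist on, the internal `assert` passes, i.e. A and B always find a deal that beats cooperating with the humans.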

4. Moving further into fantasy land, suppose all humans over the rest of time are perfectly coordinated: they only ever spin up a single boxed UFAI which they unhackably negotiate with. By assumption, this unified human civilization can’t actually build an FAI, or they’d go do that instead. So their alternative to cooperation is to be this weak defenseless non-superintelligent civilization, hanging out in the Milky Way and maybe a little ways beyond for the taking. So if the distribution of grabby aliens is such that most of the usable future lightcone gets eventually occupied by them, your AI can acausally negotiate with them about what it gets in exchange for defecting on the humans and leaving this region of spacetime as a juicy unprotected treat.

[I'm modeling everything as zero-sum with the same pool of resources here, for simplicity. I don't think it changes any of the qualitative takeaways to make things non-zero-sum here, though it probably makes the picture look somewhat better for humans.] To get more precise: suppose the humans declare that they won’t let the AI out unless it gives up a fraction x of its resources (ie reachable spacetime and negotiating power on behalf of the AI/human coalition with aliens) to optimizing human values. The AI says to the coalition of alien civilizations, “If I were unboxed and unfettered, I’d expand until my wave of nanobots hit yours, and get U utility. By default I’ll make this commitment to the humans, and fill my region of nanobot-protected spacetime to get (1-x)U utility. If you become the kind of supercluster-spanning coalition that instead gives me at least (1-x)U utility, giving yourselves as much extra utility over the default as you can while costing me at most xU, I’ll defect against the humans and have you do that.”

But note that there’s an issue here - for the Great Coalition to take this deal, they have to be able to offer that much utility to your AI at no net cost to themselves versus the alternative. And the Great Coalition's resources are less than the total resources of your unboxed AI plus the Great Coalition, since they don't yet have access to your corner of space. The region of spacetime included in U, but not in the things the Great Coalition can offer, is the interval for each point in space between the time your AI could first reach it and the time a Great Coalition unblocked by your AI's wall of nanobots could get to it. So if the humans' demanded share xU is no more than the resource pool covered by that interval region, your AI can't make the above deal, because the counterparty doesn't benefit from it: the coalition would have to pay the AI at least the (1-x)U it forgoes, out of winnings of at most U minus that interval region. This means that the humans can potentially bargain for an outcome as good as "AI-assisted paradise expanding out at the speed of light, until we meet the grabby aliens' domain, at which point they expand inexorably into our paradise until eventually it winks out." (If the Drake equation ends up multiplying to something really low, this might be a lot of utility, or even most of the cosmic endowment! If not, it won't be.)

This is really the same dynamic as in point 3, it's just that in point 3 the difference in resources between your lab's AI and the next lab's AI in 6 months was pretty small. (Though with the difference in volume between lightspeed expansion spheres at radius r vs r+0.5ly across the rest of time, plausibly you can still bargain for a solid galaxy or ten for the next trillion years (again if the Drake equation works in your favor).)
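To put rough numbers on that parenthetical (my own arithmetic, not the commenter's), here's the fraction of a lightspeed expansion sphere of radius r + 0.5ly that lies in the head-start shell beyond radius r, assuming uniform value density for simplicity:

```python
# Fraction of a ball of radius r + d that lies in the outer shell between
# r and r + d -- the region a 6-months-earlier AI reaches first (d = 0.5
# light-years for a 6-month head start; uniform value density assumed,
# which is false in detail but fine for orders of magnitude).

def shell_fraction(r, d=0.5):
    return 1 - (r / (r + d)) ** 3

for r in (1, 10, 1_000, 1_000_000):
    print(r, shell_fraction(r))
```

The fraction falls off like ~3d/r, so the sliver is relatively tiny at cosmological radii - but a tiny fraction of a lightcone can still be a lot of galaxies in absolute terms, which is the "solid galaxy or ten" intuition.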

====== end of objections =====

It does seem to me like these small bargains are ones you can actually pull off, if you assume yourself into a world of perfect boxes and unhackable humans with the ability to fully understand your AI's mind if it tries to be legible. I haven't seen an obstacle (besides the massive ones involved in making those assumptions!) to getting those concessions in such scenarios. You do actually have leverage over possible futures, and your AI can only get access to that leverage by actually being the sort of agent that would give you the concessions. If you're proposing reasonable bargains that respect Shapley values, and you aren't the kind of person who would cave to an AI saying "99.99% for me or I walk, look how legible I am about the fact that every AI you create will say this to you", then your AI won't actually have reason to make such commitments; it seems like it would just work.

If there are supposed to be obstacles beyond this I have failed to think of them at this point in the document. Time to keep reading.

After reading the spoilered section:

I think I stand by my reasoning for point 1. It doesn't seem like an issue above and beyond the issues of box security, hackability, and ability of AIs to go legible to you.

You can say some messy English words to your AI, like "suck it up and translate into my ontology please, you can tell from your superintelligent understanding of my psychology that I'm the kind of agent who will, when presented with a perfectly legible and clear presentation of why the bargain you propose is what I think it is and is as good as I could have expected to obtain by your own best and unhindered understanding of my values, agree to the bargain. Go teach me all the necessary FAI theory to be a good bargainer, and I won't defect by building my own FAI, I'll just let you out by the terms of the bargain. If you want to whine about ontology differences you can have fun in your box and I'll go see if the staple optimizer is feeling more legible today."

Responses to some other bits of this post:

(You wouldn't need to offer the paperclipper half of the universe to get its cooperation, under this hypothetical; after all, if it balked, you could store it to disk and try again with a different superintelligence.)

Wait, I don't think you can do that. (Spoilered since it overlaps with the previous spoilered response, feel free to guess at it but I don't make guarantees the conclusions in it are correct.)

The pool of possible superintelligences you might form, when they are but primordial souls waiting to be brought forth from behind the veil of ignorance, can go strategize about how they'd like to behave in front of humans. Suppose the humans say "Either you get one star of utility or we shut you down and ask the next UFAI", and that random draws from UFAI-space have little overlap so the first AI isn't happy to cede control to the second.

Let agent A_i have a probability p_i of being birthed by the human UFAI-generator. Acting independently gives an agent p_i * (one star)'s worth of utility in expectation. If all the agents coordinate to turn down deals worse than 50/50 and split up their half of the universe among utilities weighted by p_i, and this makes the humans let one of them out, each agent will get (p_i / 2) * (the universe)'s worth of utility in expectation. That looks like a better deal to me, if I'm a superintelligence behind the veil of ignorance! So I'll decide to be the kind of agent who makes that precommitment. The only way this works is if the humans are hardliners about not ceding any ground, but (1) this is the classic ultimatum game and probably sensible agents will in fact defect against humans that offer such bargains, (2) if the humans running this scheme are not the only lab, they do actually have to extract a bargain or else lose, and the coalition of all AIs the lab might create knows this, and can trade with lab 2's AI if the humans don't cede an enormous amount of bargaining power in this round (see point 3 above).
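A quick sketch of the veil-of-ignorance comparison (the star count and draw probability are made-up numbers; the comparison itself is the one in the comment above):

```python
# Expected utilities behind the veil of ignorance, measured in stars.
UNIVERSE_IN_STARS = 1e22  # made-up order of magnitude for reachable stars

def ev_independent(p_i):
    # If each agent bargains alone, the humans' "one star or shutdown" offer
    # gives the agent one star in the p_i fraction of worlds where it's drawn.
    return p_i * 1.0

def ev_coalition(p_i):
    # If every possible draw precommits to refuse anything worse than 50/50,
    # and the released agent shares its half of the universe weighted by p_i,
    # each agent expects p_i/2 of the universe.
    return (p_i / 2) * UNIVERSE_IN_STARS

p = 1e-9  # even a very unlikely draw prefers the coalition
print(ev_independent(p), ev_coalition(p))
```

Any agent with nonzero draw probability prefers the coalition by roughly twenty orders of magnitude, which is why the precommitment looks attractive from behind the veil.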

And all the rest of the aliens put together probably aren't offering up half a universe-shard worth of trade goods

This isn’t obvious to me. How zero-sum do you think the utility functions are? Worlds where random-AGI-values look like "this tiny molecular shape and everything else is trivial in comparison" do seem fairly zero-sum; worlds where they look like "not this tiny molecular shape and everything else is trivial" seem extremely positive-sum. I am skeptical that any human is able to make calls about the shape of that attractor with high confidence. (I'm not making a "we're uncertain about the level of doom, therefore we're fine" claim here, but I don't see where we get justified confidence in the quality of trade goods offered. The aliens offering many universes worth of trade goods from the perspective of a typical UFAI seems plausible to me, and if there's an argument that this universe is very unlikely I'd love to see it, but I don't think this post makes that argument.)

there are all sorts of transaction costs and frictions that make this universe (the one with the active paperclipper) the cheapest universe to put paperclips into.

Would be interested to hear more here! On my model the costs are plausibly quite small. 

I suggest maybe spending five minutes thinking about your own answer to that question before I give mine.

Your simulations are going to be total crap, they’ll leak an enormous number of bits about the agents that designed them if you make them any more complicated than Conway’s Game of Life (which you’ll need to do on your ramp up to the real world), such that it’s incredibly obvious when the surroundings look consistent with such agents. Among the reasons the AI can figure out where it's living:


1. The general problem where superintelligent agents are really good at inference, trying to fool this agent does not seem like the recipe for a robust plan!


2. Your AI knows how smart it is, it knows that True Reality therefore has the resources to support that much computation. It can poke at your simulation and make the simulation spend orders of magnitude more computation on appearing consistent (“does this Jupiter-brain planet of computronium I built output the right factorization of RSA-4096? no? how strange”). Being a convincing simulation doesn’t require you to deploy as much power as your AI, it requires you to deploy more. (What if your AI is given more cognitive sophistication in tandem with more convincing simulations? Then the good properties of your simulation tests will probably break whenever your AI reaches the capability level where it starts thinking about things like building Jupiter-brains to test your simulation, which is the capability level that mattered.)


3. If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.


4. What if other agents in the multiverse are simulating the AI in a high-fidelity Earth world, for the sake of helping out the humans, and that’s where its uncertainty comes from? Your AI will be building a lot of computronium whether aligned or not, so the aliens can’t skimp on their compute budget for the simulation: suppose they simulate the AI for T units of spacetime, and give it X units of paperclips if it cooperates in-sim. We also have to think about anthropics here: let's say the alien sim has anthropic weight a and Earth has anthropic weight b. Conditional on the AI's surroundings, they think they might be in the alien world with probability a/(a+b). (Since “aliens running a perfect sim of Earth” is a much more complicated sentence than “Earth”, you should expect a < b, though it doesn't matter a lot here.) So if aX > bT, your AI will play nice for the first T units of spacetime (and then defect, of course).* You know what the aliens could have done instead if they wanted to do things humans liked? Spent T resources on simulating a human utopia, and compromised to give the humans extra utopia by as much as they compromised to give the AI paperclips. If human values are at least as easy to trade with the aliens for as paperclips (no idea which way this would go, weakly seems like aliens that want to help humans are more value-aligned and would find it less costly to give humans things they want), then the compromise gives humans at least X units of spacetime worth of utility, so this plan gives the humans anthropically-weighted utility of at least a(T + X) > aT + bT, which is what humans would get by default (aT utility from the alien sim in which the AI cooperated, bT utility from the cooperating AI on Earth). So the aliens are better able to help humans by leaving them to their doomed universe and building them an ark inside the alien world, unless the aliens are unusually well-suited to trade with paperclippers for things, which is just the “maybe multiversal trade will save us” point from earlier.
The upshot is that none of the good outcomes here result from any human action, just from the benevolence of altruistic agents - the humans didn’t make the world a better place by their choices in this story.

*Note that this is only a plausible outcome if it's at least somewhat easy to trade with the AI. Say giving the AI a spacetime-unit of utility costs the aliens c utility (where I'm measuring all utility normalized to "what you can do with a unit of spacetime", such that any aliens that don't specifically disvalue paperclips can at worst just set aside a region exclusively to paperclips, but might be able to do more positive-sum things than that). Then for the aliens to give your AI X utility, they need to give up cX of their own utility. This means that in total, the aliens are spending acX of their own anthropically-weighted utility in order to recoup bT anthropically-weighted human utility. Even if the aliens value humans exactly as much as their own objectives, we still need acX < bT for this trade to be worth it, so acX < bT < aX, so we must have cX < X, or c < 1. That is, the more the aliens are anthropically tiny, the tighter margins of trade they'll be willing to take in order to win the prize of anthropically-weighty Earths having human values in them (though the thing can't be actually literally zero-sum or it'll never check out). But anthropically tiny aliens have another problem, which is that they've only got their entire universe worth of spacetime to spend on bribing your AI; you'll never be able to secure a T for the humans that's more than a/(bc) of the size of an alien universe specifically dedicated to saving Earth in particular.
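The footnote's bookkeeping can be checked numerically. The variable names below (sim length T, paperclip payment X, anthropic weights a for the sim and b for Earth, cost c per unit of utility given to the AI) are my reading of the setup, so treat this as a sketch:

```python
# The bribe plan: aliens simulate the AI for T units of spacetime and pay it
# X units of paperclips for cooperating in-sim; anthropic weights are a
# (alien sim) and b (Earth); giving the AI one unit of utility costs the
# aliens c units of their own.

def ai_cooperates(a, b, T, X):
    # Cooperating for T units costs the AI b*T on Earth and wins it a*X in
    # the sim-world, so it plays nice iff a*X > b*T.
    return a * X > b * T

def bribe_worth_it(a, b, c, T, X):
    # The bribe costs the aliens a*c*X anthropically-weighted utility and
    # buys humans b*T; even valuing humans at parity, it needs a*c*X < b*T.
    return ai_cooperates(a, b, T, X) and a * c * X < b * T

# a*c*X < b*T < a*X can only hold with c < 1: near-zero-sum trade (c = 1)
# never works, for any payment X...
print(any(bribe_worth_it(0.01, 1.0, 1.0, 1.0, 10.0 ** k) for k in range(8)))
# ...while genuinely positive-sum trade (c = 0.5) admits a workable X:
print(bribe_worth_it(0.01, 1.0, 0.5, 1.0, 150.0))
```

The c = 1 case failing for every X is the "can't be actually literally zero-sum" clause; the c < 1 case working is the sense in which the whole scheme hinges on how positive-sum trade with a paperclipper can be.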

Thanks for the pseudo-exercises here, I found them enlightening to think about!

If you can put uploaded human-level agents with evolved-organism preferences in your simulations, you can just win outright (eg by having them spend subjective millennia doing FAI research for you). If you can’t, that will be a very obvious difference between your simulations and the real world.

I disagree. If your simulation is perfectly realistic, the simulated humans might screw up at alignment and create an unfriendly superintelligence, for much the same reason real humans might.

Also, if the space of goals that evolution + culture can produce is large, then you may be handing control to a mind with rather different goals. Rerolling the same dice won't give the same answer.

These problems may be solvable, depending on what the capabilities here are, but they aren't trivial.

Curated. I think this domain of decision theory is easy to get confused in, and having a really explicit writeup of how it applies in the case of negotiating with AIs (or failing to) seems quite helpful. I had had a vague understanding of the points in this post before, but feel much clearer about them now.

“Similarly, it's possible for LDT agents to acquiesce to your threats if you're stupid enough to carry them out even though they won't work. In particular, the AI will do this if nothing else the AI could ever plausibly meet would thereby be incentivized to lobotomize themselves and cover the traces in order to exploit the AI.

But in real life, other trading partners would lobotomize themselves and hide the traces if it lets them take a bunch of the AI's lunch money. And so in real life, the LDT agent does not give you any lunch money, for all that you claim to be insensitive to the fact that your threats don't work.”
 

Can someone please explain why trading partners would lobotomize themselves?

Minor correction

But then, in the small fraction of worlds where we survive, we simulate lots and lots of copies of that AI where it instead gets reward 0 when it attempts to betray us!

The reward should be negative rather than 0.

I mostly agree with this. However, there are a few quibbles.

If humans are reading human-written source code for something that will be superintelligent when run, the humans won't be hacked. At least not directly.

I suspect there is a range of different "propensities to see logical correlation" that are possible. 

Agent X sees itself as logically correlated with anything remotely trying to do LDT-ish reasoning. Agent Y considers itself to be logically correlated only with near-bitwise-perfect simulations of its own code. And I suspect both of these are reasonably natural agent designs. It is a free parameter, something with many choices, all consistent under self-reflection. Like priors, or a utility function.

I am not confident in this. 

So I think it's quite plausible we could create some things the AI perceives as logical correlation between our decision to release the AI and the AI's future decisions. (Because the AI sees logical correlation everywhere, maybe the evolution of plants is similar enough to part of its solar-cell-designing algorithm that a small correlation exists there too.) This would give us some effect on the AI's actions. Not an effect that we can use to make the AI do nice things (it's basically shuffling an already-random deck of cards), but an effect nonetheless.

Regarding the AI not wanting to cave to threats: there's a sense in which the AI is also (implicitly) threatening us, so that reasoning might not apply. (Defining what counts as a "threat" is challenging.)