Let's say we talk about two AIs merging. Then the tuple of their expected utilities from the merge had better be on the Pareto frontier, no? Otherwise they'd just do a better merge that gets them onto the frontier. Which specific point on the frontier they pick is a matter of bargaining, but the fact that they want to hit the frontier isn't; it's a win-win. And the merges that get them to the frontier are exactly those that output an EUM agent, maximizing some linear combination of their utilities. If the point they want to hit is in a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; and if the frontier is curvy at that point, the merge will be deterministic. For realistic agents who have more complex preferences than just linearly caring about one cake, I expect the frontier will be curvy, so a deterministic merge into an EUM agent will be the best choice.
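To make the frontier claim concrete, here is a minimal sketch with made-up numbers. The outcome names, the utility pairs, and the helpers `pareto_frontier` and `eum_choice` are all hypothetical. It only checks one direction of the claim on a finite set: an agent maximizing a positively weighted sum of the two utilities always lands on the Pareto frontier.

```python
# Hypothetical merge outcomes: name -> (utility for AI 1, utility for AI 2).
OUTCOMES = {
    "a": (0.0, 10.0),
    "b": (6.0, 8.0),
    "c": (9.0, 4.0),
    "d": (10.0, 0.0),
    "e": (5.0, 5.0),  # dominated by "b"
}

def dominates(q, p):
    """True if q is at least as good as p for both AIs and differs from p."""
    return q != p and all(y >= x for x, y in zip(p, q))

def pareto_frontier(points):
    """Names of the outcomes that no other outcome dominates."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )

def eum_choice(points, w):
    """Outcome chosen by an EUM agent maximizing w*u1 + (1 - w)*u2."""
    return max(sorted(points), key=lambda n: w * points[n][0] + (1 - w) * points[n][1])

frontier = pareto_frontier(OUTCOMES)
choices = {w: eum_choice(OUTCOMES, w) for w in (0.1, 0.5, 0.7, 0.9)}
```

The weight `w` stands in for the result of bargaining: changing it moves the chosen point along the frontier but never off it. The coinflip case is not modeled here; it would correspond to a target point lying strictly between two of these outcomes.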


My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

cousin_it7mo*40

The relevant point is his latter claim: “in particular with respect to “learn ‘don’t steal’ rather than ‘don’t get caught’.”” I think this is a very strong conclusion, relative to available data.

I think humans don't steal mostly because society enforces that norm. Toward weaker "other" groups that aren't part of your society (farmed animals, weaker countries, etc.) there's no such norm, and humans often behave badly toward such groups. And to AIs, humans will be a weaker "other" group. So if alignment of AIs to human standards is a complete success - if AIs learn to behave toward weaker "other" groups exactly as humans behave toward such groups - the result will be bad for humans.

It gets even worse because AIs, unlike humans, aren't raised to be moral. They're raised by corporations with a goal to make money, with a thin layer of "don't say naughty words" morality. We already know corporations will break rules, bend rules, and lobby to change rules to make more money, and don't really mind if people get hurt in the process. We'll see more of that behavior when corporations can make AIs to further their goals.


In Defense of Open-Minded UDT

cousin_it8mo*10

Going back to the envelopes example, a nosy neighbor hypothesis would be "the left envelope contains $100, even in the world where the right envelope contains $100". Or if we have an AI that's unsure whether it values paperclips or staples, a nosy neighbor hypothesis would be "I value paperclips, even in the world where I value staples". I'm not sure how that makes sense. Can you give some scenario where a nosy neighbor hypothesis makes sense?


In Defense of Open-Minded UDT

cousin_it8mo10

Imagine if we had narrowed down the human prior to two possibilities, P_1 and P_2. Humans can’t figure out which one represents our beliefs better, but the superintelligent AI will be able to figure it out. Moreover, suppose that P_2 is bad enough that it will lead to a catastrophe from the human perspective (that is, from the P_1 perspective), even if the AI were using UDT with 50-50 uncertainty between the two. Clearly, we want the AI to be updateful about which of the two hypotheses is correct.

This seems like the central argument in the post, but I don't understand how it works.

Here's a toy example. Two envelopes: one contains $100, the other leads to a loss of $10000. We don't know which envelope is which, but it's possible to figure it out by a long computation. So we make a money-maximizing UDT AI, whose prior is "the $100 is in whichever envelope {long_computation} says". Now if the AI has time to do the long computation, it'll do it and then open the correct envelope. And if it doesn't have time to do the long computation, and is offered the choice to open a random envelope or abstain, it will abstain. So it seems like ordinary UDT solves this example just fine. Can you explain where "updatefulness" comes in?
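The arithmetic behind the toy example can be sketched as policy selection under the prior. This is only an illustration under stated assumptions: the 50-50 prior over the computation's output, the envelope labels "left" and "right", and the helper names are all hypothetical, and the sketch stands in for UDT only in the sense that it picks a whole observation-to-action policy by prior expected utility.

```python
from itertools import product

ACTIONS = ("left", "right", "abstain")
# Assumed prior over what {long_computation} outputs, i.e. which envelope holds $100.
WORLDS = {"left": 0.5, "right": 0.5}

def payoff(action, world):
    """$100 for the good envelope, -$10000 for the bad one, 0 for abstaining."""
    if action == "abstain":
        return 0
    return 100 if action == world else -10000

def best_policy(has_time):
    """Pick the observation -> action policy with the highest prior expected utility.

    With time, the observation is the computation's output; without it, the
    agent sees nothing (observation None) and must act the same in both worlds.
    """
    observations = tuple(WORLDS) if has_time else (None,)
    best, best_eu = None, float("-inf")
    for choice in product(ACTIONS, repeat=len(observations)):
        policy = dict(zip(observations, choice))
        eu = sum(
            prob * payoff(policy[world if has_time else None], world)
            for world, prob in WORLDS.items()
        )
        if eu > best_eu:
            best, best_eu = policy, eu
    return best, best_eu

policy_with_time, eu_with_time = best_policy(has_time=True)
policy_without_time, eu_without_time = best_policy(has_time=False)
```

Without time, opening either envelope is worth 0.5 * 100 + 0.5 * (-10000) = -4950 in expectation, the same as opening a random one, so abstaining at 0 wins. With time, the policy "open whichever envelope the computation names" earns 100.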


Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

cousin_it10mo910

Yeah, this agrees with my thinking so far. However, I think if you could research how to align AIs specifically to human flourishing (as opposed to things like obedience/interpretability/truthfulness, which defer to the user's values), that kind of work could be more helpful than most.


Refusal in LLMs is mediated by a single direction

cousin_it1y53

Sorry for a maybe naive question. Which other behaviors X could be defeated by this technique of "find n instructions that induce X and n that don't"? Would it work for X=unfriendliness, X=hallucination, X=wrong math answers, X=math answers that are wrong in one specific way, and so on?
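For reference, the shape of the technique I have in mind can be sketched as follows, assuming it amounts to a difference of mean activations between the two instruction sets followed by projecting that direction out. The vectors below are made-up two-dimensional stand-ins, not real model activations, and the function names are hypothetical.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def unit(v):
    """Scale v to length 1."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def behavior_direction(with_x, without_x):
    """Unit vector from the mean 'without X' activation to the mean 'with X' one."""
    return unit([a - b for a, b in zip(mean(with_x), mean(without_x))])

def ablate(activation, direction):
    """Remove the component of activation along the unit vector direction."""
    dot = sum(a * d for a, d in zip(activation, direction))
    return [a - dot * d for a, d in zip(activation, direction)]

# Made-up activations: n = 2 prompts that induce X and 2 that don't.
with_x = [[2.0, 1.0], [4.0, 1.0]]
without_x = [[0.0, 1.0], [0.0, 1.0]]

direction = behavior_direction(with_x, without_x)
ablated = ablate([5.0, 2.0], direction)
```

Nothing in the sketch is specific to refusal, which is why the question about other values of X arises: the procedure runs on any pair of prompt sets, and whether the resulting direction actually controls the behavior is an empirical matter.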


cousin_it's Shortform

cousin_it1y10

If the housing crisis is caused by low-density rich neighborhoods blocking their own redevelopment (as seems to be the consensus on the internet now), could it be solved by developers buying out an entire neighborhood or even town in one fell swoop? It'd require a ton of money, but redevelopment would bring even more money, so it could be win-win for everyone. Does it not happen only due to coordination difficulties?


Security amplification

cousin_it1y10

I don't know about others, but to me these approaches sound like "build a bureaucracy from many well-behaved agents", and it seems to me that such a bureaucracy wouldn't necessarily behave well.


Transformers Represent Belief State Geometry in their Residual Stream

cousin_it1y*30

I have maybe a naive question. What information is needed to find the MSP image within the neural network? Do we have to know the HMM to begin with? Or could it be feasible someday to inspect a neural network, find something that looks like an MSP image, and infer the HMM from it?


Many arguments for AI x-risk are wrong

cousin_it1y*3-5

I’m worried about centralization of power and wealth in opaque non-human decision-making systems, and those who own the systems.

This has been my main worry for the past few years, and to me it counts as "doom" too. AIs and AI companies playing by legal and market rules (and changing these rules by lobbying, which is also legal) might well lead to most humans having no resources to survive.
