Going back to the envelopes example, a nosy neighbor hypothesis would be "the left envelope contains $100, even in the world where the right envelope contains $100". Or if we have an AI that's unsure whether it values paperclips or staples, a nosy neighbor hypothesis would be "I value paperclips, even in the world where I value staples". I'm not sure how that makes sense. Can you give some scenario where a nosy neighbor hypothesis makes sense?
> Imagine if we had narrowed down the human prior to two possibilities, P_1 and P_2. Humans can’t figure out which one represents our beliefs better, but the superintelligent AI will be able to figure it out. Moreover, suppose that P_2 is bad enough that it will lead to a catastrophe from the human perspective (that is, from the P_1 perspective), even if the AI were using UDT with 50-50 uncertainty between the two. Clearly, we want the AI to be updateful about which of the two hypotheses is correct.
This seems like the central argument in the post, but I don't understand how it works.
Here's a toy example. Two envelopes: one contains $100, the other leads to a loss of $10,000. We don't know which envelope is which, but it's possible to figure it out by a long computation. So we make a money-maximizing UDT AI whose prior is "the $100 is in whichever envelope {long_computation} says". Now if the AI has time to do the long computation, it'll run it and then open the right envelope. And if it doesn't have time for the long computation, and is offered the choice to open a random envelope or abstain, it will abstain. So it seems like ordinary UDT solves this example just fine. Can you explain where "updatefulness" comes in?
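To spell out the arithmetic I have in mind (a minimal sketch; the payoffs and the stubbed-out long computation are placeholders from my toy example, not anything from the post):

```python
# Minimal sketch of the toy example above. The payoffs and the stubbed-out
# long computation are illustrative placeholders.

def long_computation():
    """Stand-in for the expensive computation that reveals the good envelope."""
    return "left"  # placeholder answer

def decide(has_time_to_compute: bool) -> str:
    if has_time_to_compute:
        # Running the computation identifies the $100 envelope, worth +100 for sure.
        return f"open {long_computation()} envelope"
    # Without the computation, the prior over "which envelope is good" is 50-50,
    # so opening a random envelope has expected value
    # 0.5 * 100 + 0.5 * (-10000) = -4950, while abstaining is worth 0.
    ev_open_random = 0.5 * 100 + 0.5 * (-10000)
    return "open random envelope" if ev_open_random > 0 else "abstain"

print(decide(True))   # open left envelope
print(decide(False))  # abstain
```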
Yeah, this agrees with my thinking so far. However, I think if you could research how to align AIs specifically to human flourishing (as opposed to things like obedience/interpretability/truthfulness, which defer to the user's values), that kind of work could be more helpful than most.
Sorry for the maybe-naive question. Which other behaviors X could be defeated by this technique of "find n instructions that induce X and n that don't"? Would it work for X=unfriendliness, X=hallucination, X=wrong math answers, X=math answers that are wrong in one specific way, and so on?
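For context on how I'm picturing the technique (this interpretation is my assumption: taking a difference-of-means direction between model activations on the two instruction sets; the dimensions and activations below are random placeholders, not from any real model):

```python
import numpy as np

# Sketch of "n instructions that induce X vs. n that don't", interpreted as a
# difference-of-means direction over hidden activations. Random placeholder
# data stands in for a real model's residual stream.

rng = np.random.default_rng(0)
d = 64                                                # hidden dimension (placeholder)
acts_with_x = rng.normal(0.5, 1.0, size=(20, d))      # activations on X-inducing instructions
acts_without_x = rng.normal(0.0, 1.0, size=(20, d))   # activations on non-inducing instructions

x_direction = acts_with_x.mean(axis=0) - acts_without_x.mean(axis=0)
x_direction /= np.linalg.norm(x_direction)

def x_score(activation):
    """Project an activation onto the candidate X direction."""
    return float(activation @ x_direction)

print(x_score(acts_with_x[0]), x_score(acts_without_x[0]))
```

If that's roughly right, I'd guess the answer depends on whether each X is represented as a direction at the probed layer at all, which is part of what I'm asking.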
If the housing crisis is caused by low-density rich neighborhoods blocking their own redevelopment (as seems to be the consensus on the internet now), could it be solved by developers buying out an entire neighborhood, or even a whole town, in one swoop? It'd require a ton of money, but redevelopment would bring in even more money, so it could be win-win for everyone. Is the only reason this doesn't happen coordination difficulties?
I don't know about others, but to me these approaches sound like "build a bureaucracy out of many well-behaved agents", and such a bureaucracy wouldn't necessarily behave well itself.
I have a maybe-naive question. What information is needed to find the MSP image within the neural network? Do we have to know the HMM to begin with? Or could it be feasible someday to inspect a neural network, find something that looks like an MSP image, and infer the HMM from it?
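For concreteness, here's the "if we already know the HMM" half as I understand it: the MSP is the set of belief states you reach by Bayesian-updating on observations. A rough sketch (the 2-state, 2-symbol HMM below is a made-up placeholder, not from the post):

```python
import numpy as np
from itertools import product

# Rough sketch: enumerate the belief states (the MSP) of a known HMM by
# Bayesian-updating on every short observation sequence.

T = {  # T[obs][i, j] = P(next state = j, emit obs | current state = i)
    0: np.array([[0.4, 0.1],
                 [0.2, 0.3]]),
    1: np.array([[0.1, 0.4],
                 [0.3, 0.2]]),
}

def update(belief, obs):
    """One Bayesian belief update after seeing `obs`; returns None if impossible."""
    unnorm = belief @ T[obs]
    total = unnorm.sum()
    return unnorm / total if total > 0 else None

# For this particular HMM the stationary distribution is uniform, so start there.
start = np.array([0.5, 0.5])
beliefs = {tuple(start)}
for seq in product([0, 1], repeat=6):  # covers all observation prefixes up to length 6
    b = start
    for obs in seq:
        b = update(b, obs)
        if b is None:
            break
        beliefs.add(tuple(np.round(b, 6)))

print(f"{len(beliefs)} distinct belief states (points of the MSP image)")
```

If that's right, then constructing the MSP image this way requires knowing the HMM up front, which is why I'm asking about the reverse direction.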
> - I’m worried about centralization of power and wealth in opaque non-human decision-making systems, and those who own the systems.
This has been my main worry for the past few years, and to me it counts as "doom" too. AIs and AI companies playing by legal and market rules (and changing these rules by lobbying, which is also legal) might well lead to most humans having no resources to survive.
It's interesting that part of human value might be having our actions matter. But if you build an AI that can give you all the things, or even if you could've built such an AI but chose not to, then objectively your actions no longer matter much after that. I've no idea how even CEV could approach this problem.
Edit: I think I've figured it out. The AI shouldn't try to build the best world according to CEV; it should take the best action for an AI to take according to CEV. So if the AI notices that humans strongly prefer to be left alone with their problems, it'll just shut down. Or find some other way to ensure that humans can't rely on AIs for everything.
I think humans don't steal mostly because society enforces that norm. Toward weaker "other" groups that aren't part of your society (farmed animals, weaker countries, etc.) there's no such norm, and humans often behave badly toward such groups. And to AIs, humans will be a weaker "other" group. So if aligning AIs to the human standard is a complete success - if AIs learn to behave toward weaker "other" groups exactly as humans behave toward such groups - the result will be bad for humans.
It gets even worse because AIs, unlike humans, aren't raised to be moral. They're raised by corporations whose goal is to make money, with a thin layer of "don't say naughty words" morality on top. We already know corporations will break rules, bend rules, and lobby to change rules to make more money, and they don't really mind if people get hurt in the process. We'll see more of that behavior when corporations can make AIs to further their goals.