Note: Due to a bug, if you were subscribed to email notifications for curated posts, the curation email for this post came from Alignment Forum instead of LessWrong. If you're viewing this post on AF, to see the comments, view it on LessWrong instead. (This is a LessWrong post, not an AF post, but the two sites share a database and have one-directional auto-crossposting from AF to LW.)
Thanks to Anna Salamon for the idea of making an AI which cares about what happens in a counterfactual ideal world, rather than the real world with the transistors in it, as a corrigibility strategy. I haven't yet been able to find a way to make that idea work for an agent/utility maximizer, but it inspired the idea of doing the same thing in an oracle.
To clarify, what I meant was not that they need a source of shared randomness, but that they need a shared probability distribution; i.e., having dice isn't enough, they also need to coordinate on a way of interpreting the dice, which is similar to the original problem of coordinating on an ordering over points.
I don't think the mechanics of the problem, as specified, let them mutually specify random things without something like an externally-provided probability distribution. This is aimed at eliminating that requirement. But it may be that this issue isn't very illuminating and would be better addressed by adjusting the problem formulation to provide that.
The procrastination paradox is isomorphic to well-founded recursion. In the reasoning, the fourth step, "whether or not I press the button, the next agent or an agent after that will press the button" is an invalid proof-step; it's shown that there is a chain of inductive steps ending at the conclusion, but not that the chain has a base case.
This can only happen when the relation between an agent and its successor is not well-founded. If there is any well-founded relation between agents and their successors - either because they're in a finite universe, or because the first agent picked a well-founded relation and built that in - then the button will eventually get pushed.
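A minimal sketch of the well-founded case, assuming a hypothetical `budget` counter that strictly decreases each time an agent defers to its successor. The names `run_chain` and `wants_to_defer` are illustrative and not part of the original discussion. Because the natural numbers under `<` are well-founded, the chain of deferrals has a base case, so some agent presses the button even if every agent would rather procrastinate.

```python
from typing import Callable


def run_chain(budget: int, wants_to_defer: Callable[[int], bool]) -> int:
    """Return how many agents deferred before the button was pressed.

    `budget` is a hypothetical well-founded measure: each successor gets a
    strictly smaller value, and an agent with budget 0 has no successor
    to defer to, so it must press.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    deferrals = 0
    while budget > 0 and wants_to_defer(budget):
        budget -= 1      # the successor is strictly lower in the ordering
        deferrals += 1
    return deferrals     # the agent reached here presses the button


# Every agent prefers to procrastinate, yet the base case forces a press.
always_defer = lambda remaining: True
print(run_chain(5, always_defer))
```

If the decreasing `budget` is dropped, so that successors form an unbounded chain, the loop under `always_defer` has no base case and never terminates, which is the situation the paradox describes.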
Point (1) seems to be a combination of an issue of working around the absence of a mathematically-elegant communication channel in the formalism, and an incentive to choose some orderings over others because of (2). If (2) is solved and they can communicate, then they can agree on an ordering without any trouble because they're both indifferent to which one is chosen.
If you don't have communication but you have solved (2), I think you can solve the problem by splitting agents into two stages. In the first stage, agents try to coordinate on an ordering over
...Regarding (2), the main problem is that this creates an incentive for agents to choose orderings that favor themselves when there is overlap between the acceptable regions, and this creates a high chance that they won't be able to agree on an ordering at all. Jessica Taylor's solution solves the problem of not being able to find an ordering, but at the cost of all the surplus utility that was in the region of overlap. For example, if Janos and I are deciding how to divide a dollar, I offer that Janos keeps it, and Janos offers that I keep it, that solution
...This relates to what in Boston we've been calling the Ensemble Stability problem: given multiple utility functions, some of which may be incorrect, how do you keep the AI from sacrificing the other values for the incorrect one(s)? Maximin is a step in the right direction, but I don't think it fully solves the problem.
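As a rough illustration of why maximin helps here, the sketch below compares it with simply summing the utility functions. The actions, the three utility functions, and their numeric values are all made up for the example; only the maximin rule itself comes from the discussion.

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def maximin_choice(actions: Sequence[A],
                   utilities: Sequence[Callable[[A], float]]) -> A:
    """Pick the action whose worst score across all utility functions is highest."""
    return max(actions, key=lambda a: min(u(a) for u in utilities))


def sum_choice(actions: Sequence[A],
               utilities: Sequence[Callable[[A], float]]) -> A:
    """Baseline for comparison: pick the action with the highest total score."""
    return max(actions, key=lambda a: sum(u(a) for u in utilities))


# Hypothetical toy ensemble: two plausible utility functions and one erroneous one.
actions = ["balanced", "extreme"]
u_good_1 = lambda a: {"balanced": 0.7, "extreme": 0.1}[a]
u_good_2 = lambda a: {"balanced": 0.6, "extreme": 0.2}[a]
u_wrong = lambda a: {"balanced": 0.3, "extreme": 10.0}[a]
ensemble = [u_good_1, u_good_2, u_wrong]

print(maximin_choice(actions, ensemble))  # the erroneous function cannot dominate
print(sum_choice(actions, ensemble))      # the erroneous function swamps the others
```

In this toy case the summed objective is captured by the erroneous function, while maximin protects the other two. The sketch only covers a single one-shot choice and does not model anything about how the AI reasons over time.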
I see two main issues. First, suppose one of the utility functions in the set is erroneous, and the AI predicts that in the future, we'll realize this and create a different AI that optimizes without it. Then the AI will be incentivized to pre
...This generalizes nicely. The asteroid problem provides a nice partitioning into two pieces, such that either piece alone has no effect, but the two pieces together have an effect. But most problems won't have such a partition built in.
If we want the answer to a yes/no question, the first instinct would be that no such partitioning is possible: if two AIs each provide less than 1 bit of information, then combining them won't produce a reliable answer. But we can make it work by combining the yes/no question with some other problem, as follows.
Suppose you wa
...
It seems to me that the surprising simplicity of current-generation ML algorithms is a big part of the problem.
As a thought experiment: suppose you had a human brain, with the sort of debug access you'd have with a neural net; i.e., you could see all the connections, edge weights, and firings, and had a decent multiple of the compute the brain has. Could you extract something like a verbal inner monologue, a text stream that was strongly predictive of that human's plans? I don't think it would be trivial, but my guess is that you could. It wouldn't hold up a...