All of jimrandomh's Comments + Replies

It seems to me that the surprising simplicity of current-generation ML algorithms is a big part of the problem.

As a thought experiment: suppose you had a human brain, with the sort of debug access you'd have with a neural net; i.e., you could see all the connections, edge weights, and firings, and had a decent multiple of the compute the brain has. Could you extract something like a verbal inner monologue, a text stream that was strongly predictive of that human's plans? I don't think it would be trivial, but my guess is that you could. It wouldn't hold up a... (read more)

5 Alex Turner
Wish granted! 

Note: Due to a bug, if you were subscribed to email notifications for curated posts, the curation email for this post came from Alignment Forum instead of LessWrong. If you're viewing this post on AF, to see the comments, view it on LessWrong instead. (This is a LessWrong post, not an AF post, but the two sites share a database and have one-directional auto-crossposting from AF to LW.)

Thanks to Anna Salamon for the idea of making an AI which cares about what happens in a counterfactual ideal world, rather than the real world with the transistors in it, as a corrigibility strategy. I haven't yet been able to find a way to make that idea work for an agent/utility maximizer, but it inspired the idea of doing the same thing in an oracle.

0 Stuart Armstrong
You could have an agent that cares about what an idealised counterfactual human would think about its decisions (if the idealised human had a huge amount of time to think them over). Compare with Paul Christiano's ideas. Now, this isn't safe, but it's at least something you might be able to play with.

To clarify, what I meant was not that they need a source of shared randomness, but that they need a shared probability distribution; i.e., having dice isn't enough; they also need to coordinate on a way of interpreting the dice, which is similar to the original problem of coordinating on an ordering over points.

I don't think the mechanics of the problem, as specified, let them mutually specify random things without something like an externally-provided probability distribution. This is aimed at eliminating that requirement. But it may be that this issue isn't very illuminating and would be better addressed by adjusting the problem formulation to provide that.
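To make the distinction concrete: shared dice only help once both agents have also fixed the same interpretation of them. A minimal sketch (the function name and setup here are my own illustration, not part of the original formalism):

```python
import random

def ordering_from_seed(points, seed):
    """Derive an ordering over candidate points from a shared seed.

    The seed plays the role of the dice; this function is the agreed
    interpretation. Both are needed: shared randomness alone pins
    down nothing until the agents coordinate on how to read it."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    return shuffled

points = ["a", "b", "c", "d"]
# With a shared seed AND a shared interpretation, both agents derive
# the identical ordering independently.
assert ordering_from_seed(points, 42) == ordering_from_seed(points, 42)
```

Agreeing on `ordering_from_seed` itself is just the original coordination problem in a new guise, which is the point being made.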

0 orthonormal
We already assumed a source of mutual randomness in order to guarantee that the feasible set is convex (caption to Figure 1).

The procrastination paradox is isomorphic to well-founded recursion. In the reasoning, the fourth step, "whether or not I press the button, the next agent or an agent after that will press the button", is an invalid proof step; it shows that there is a chain of inductive steps ending at the conclusion, but not that that chain has a base case.

This can only happen when the relation between an agent and its successor is not well-founded. If there is any well-founded relation between agents and their successors - either because they're in a finite universe, or because the first agent picked a well-founded relation and built it in - then the button will eventually get pushed.
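As a toy sketch of the claim (a hypothetical agent function, with natural-number ranks standing in for the well-founded relation between agents and successors):

```python
def agent(rank):
    """Each agent either presses the button or defers to a successor of
    strictly smaller rank. Because the naturals are well-founded, the
    chain of deferrals cannot descend forever, so some agent (here the
    one at rank 0) must press."""
    if rank == 0:
        return "pressed"      # base case: no smaller successor to defer to
    return agent(rank - 1)    # defer to a successor further down the relation

assert agent(10) == "pressed"
```

With a non-well-founded relation (e.g., every agent deferring to an identical successor forever) there is no base case, and the corresponding recursion never returns "pressed": that is the procrastination paradox.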

Point (1) seems to be a combination of an issue of working around the absence of a mathematically-elegant communication channel in the formalism, and an incentive to choose some orderings over others because of (2). If (2) is solved and they can communicate, then they can agree on an ordering without any trouble because they're both indifferent to which one is chosen.

If you don't have communication but you have solved (2), I think you can solve the problem by splitting agents into two stages. In the first stage, agents try to coordinate on an ordering over

... (read more)
0 orthonormal
I think your proposal is more complicated than, say, mutually randomly choosing an ordering in one step. Does it have any superior properties to just doing that?

Regarding (2), the main problem is that this creates an incentive for agents to choose orderings that favor themselves when there is overlap between the acceptable regions, and this creates a high chance that they won't be able to agree on an ordering at all. Jessica Taylor's solution solves the problem of not being able to find an ordering, but at the cost of all the surplus utility that was in the region of overlap. For example, if Janos and I are deciding how to divide a dollar, I offer that Janos keeps it, and Janos offers that I keep it, that solution

... (read more)
1 orthonormal
I thought about that at some point, in the case where they're biased in their own directions, but of course there it just reintroduces the incentive for playing hardball. In the case where they're each overly generous, they already have the incentive to bias slightly more in their own direction. However, there's not an obvious way to translate the second round of negotiation into the modal framework...

This relates to what in Boston we've been calling the Ensemble Stability problem: given multiple utility functions, some of which may be incorrect, how do you keep the AI from sacrificing the other values for the incorrect one(s)? Maximin is a step in the right direction, but I don't think it fully solves the problem.

I see two main issues. First, suppose one of the utility functions in the set is erroneous, and the AI predicts that in the future, we'll realize this and create a different AI that optimizes without it. Then the AI will be incentivized to pre

... (read more)
1 Jessica Taylor
The utility functions are normalized so that they all assign 0 to the status quo. The status quo includes humans designing an AI to optimize something. So the minimax agent won't do anything worse for the values of the later AI than what would happen normally, unless the future AI's utility function is not in minimax's ensemble. Since they're normalized to return 0 on the status quo, this won't quite happen, but it could be that one is a lot harder to increase above 0 than others, and so more resources will go to increasing that one above 0 than the others.
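The normalize-then-maximin scheme can be sketched concretely (the toy actions and utility tables below are my own illustration, not from the original post):

```python
def minimax_action(actions, utilities, status_quo):
    """Choose the action with the best worst-case value across the
    ensemble, after shifting each utility so the status quo scores 0."""
    normed = [{a: u[a] - u[status_quo] for a in actions} for u in utilities]
    return max(actions, key=lambda a: min(u[a] for u in normed))

actions = ["status_quo", "optimize_u1", "optimize_u2", "compromise"]
# Suppose u2 is the 'erroneous' member of the ensemble; maximin still
# refuses to push either utility below its status-quo baseline.
u1 = {"status_quo": 4, "optimize_u1": 14, "optimize_u2": -1, "compromise": 7}
u2 = {"status_quo": 2, "optimize_u1": -3, "optimize_u2": 12, "compromise": 5}
assert minimax_action(actions, [u1, u2], "status_quo") == "compromise"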

This generalizes nicely. The asteroid problem provides a nice partitioning into two pieces, such that either piece alone has no effect, but the two pieces together have an effect. But most problems won't have such a partition built in.

If we want the answer to a yes/no question, the first instinct would be that no such partitioning is possible: if two AIs each provide less than 1 bit of information, then combining them won't produce a reliable answer. But we can make it work by combining the yes/no question with some other problem, as follows.

Suppose you wa

... (read more)
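One standard way to realize "each piece alone carries no information, but together they determine the answer" for a single bit is XOR secret sharing; a minimal sketch (my illustration, not necessarily the construction intended above):

```python
import secrets

def split_bit(answer_bit):
    """Split a yes/no answer into two shares. Each share, viewed alone,
    is a uniformly random bit (zero information about the answer), but
    XORing the two shares recovers the answer exactly."""
    share_a = secrets.randbits(1)
    share_b = answer_bit ^ share_a
    return share_a, share_b

for bit in (0, 1):
    a, b = split_bit(bit)
    assert a ^ b == bit
```

Each AI outputting one share is analogous to the two asteroid-deflection pieces: neither output is individually decision-relevant, but the combination is.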
0 Stuart Armstrong
Interesting generalisation. The next step is to allow more interaction between AI and world, while still minimising impact safely…