I'm hoping someone can clear this up for me; I'd be much obliged.

On LessWrong, I think someone made a good point regarding UDT's propensity (or lack thereof) to be counterfactually mugged; see: http://lesswrong.com/lw/3l/counterfactual_mugging/dm3r

People normally consider thought experiments where we specify the agent's prior over worlds. In reality you are presumably using some broad prior like the universal prior, and the reason you would pay a mugger is indeed because they are more likely to exist than the anti-mugger.
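
For concreteness, here's a minimal expected-value sketch of that comparison, with the standard $100 / $10,000 stakes and the anti-mugger formalized as an Omega that rewards the policy of refusing; the particular numbers and the two-scenario prior are illustrative assumptions, not part of the original setup.

    # Expected value of the policy "pay" vs. "refuse" under a prior over two
    # scenarios: a mugger (rewards agents whose policy is to pay) and an
    # anti-mugger (rewards agents whose policy is to refuse). Illustrative only.

    COST = 100       # amount demanded on tails
    REWARD = 10_000  # amount Omega would have paid on heads

    def expected_value(policy_pays: bool, p_mugger: float, p_anti: float) -> float:
        if policy_pays:
            ev_mugger = 0.5 * REWARD - 0.5 * COST  # rewarded on heads, pays on tails
            ev_anti = -0.5 * COST                  # pays on tails, never rewarded
        else:
            ev_mugger = 0.0                        # never pays, never rewarded
            ev_anti = 0.5 * REWARD                 # rewarded on heads for refusing
        return p_mugger * ev_mugger + p_anti * ev_anti

    # Paying wins exactly when the mugger is sufficiently more likely, e.g.:
    print(expected_value(True, 0.02, 0.01) > expected_value(False, 0.02, 0.01))  # True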

OK that makes sense, thanks. This is what I suspected, but I was surprised that so many people are saying that UDT gets mugged without stipulating this; it made it seem like I was missing something.

Playing devil's advocate:

  1. P(mugger) and P(anti-mugger) aren't the only relevant quantities IRL
  2. I don't think we know nearly enough to have a good idea of what policy UDT would choose for a given prior. This leads me to doubt the usefulness of UDT.

An agent that cares only about given worlds is a useful concept. If these worlds are more like gameboards of an abstract game (with the agent being part of the gameboards), we can talk about game-aligned AI. By virtue of only caring about the abstract game it won't be motivated to figure out how its decisions influence our physical world (which it doesn't care about) and so won't normally be dangerous despite not being human-aligned.

This seems only loosely related to my OP.

But it is quite interesting... so you're proposing that we can make safe AIs by, e.g. giving them a prior which puts 0 probability mass on worlds where dangerous instrumental goals are valuable. The simplest way would be to make the agent believe that there is no past / future (thus giving us a more “rational” contextual bandit algorithm than we would get by just setting a horizon of 0). However, Mathieu Roy suggested to me that acausal trade might still emerge, and I think I agree based on open-source prisoner’s dilemma.
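
As a rough illustration of the "no past / no future" idea, here is a minimal sketch of a purely myopic (horizon-0, contextual-bandit-style) chooser; every name in it is made up for illustration rather than taken from the proposal above.

    # A myopic chooser: actions are scored by immediate reward alone, so nothing
    # in the evaluation rewards reshaping future situations.

    from typing import Callable, Iterable, TypeVar

    Context = TypeVar("Context")
    Action = TypeVar("Action")

    def myopic_choice(
        context: Context,
        actions: Iterable[Action],
        immediate_reward: Callable[[Context, Action], float],
    ) -> Action:
        """Pick the action with the best immediate reward in this context;
        there is no term for how the action changes later contexts."""
        return max(actions, key=lambda a: immediate_reward(context, a))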

Anyways, I think that's a promising avenue to investigate.
Having a good model of the world seems like a necessary condition for an AI to pose a significant X-risk.

Game-aligned agents aren't useful for AI control as complete agents, since if you give them enough data from the real world, they start caring about it. This aspect applies more straightforwardly to very specialized sub-agents.

It's misleading to say that such agents assign probability zero to the real world, since the computations they optimize don't naturally talk about things like worlds at all. For example, consider a chess-playing AI that should learn to win against a fixed opponent program. It only reasons about chess; there is no reason to introduce concepts like physical worlds into its reasoning, because they probably aren't any help for playing chess. (Unless the opponent is a sufficiently large program written by humans, which would itself be data about the real world.)
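
To make that concrete, here is a hedged sketch of the kind of computation such an agent optimizes: a plain negamax search whose every quantity is a property of the game tree. The GameState interface is invented for illustration, not taken from any real chess library.

    # The agent's optimization only ever mentions game positions; the "physical
    # world" never appears. GameState is a hypothetical interface.

    from typing import Iterable, Protocol

    class GameState(Protocol):
        def children(self) -> Iterable["GameState"]: ...  # positions after each legal move
        def is_terminal(self) -> bool: ...
        def score(self) -> float: ...  # +1 win / 0 draw / -1 loss, for the side to move

    def negamax(state: GameState, depth: int) -> float:
        """Value of `state` for the player to move, computed purely from the game tree."""
        if depth == 0 or state.is_terminal():
            return state.score()
        return max(-negamax(child, depth - 1) for child in state.children())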

There do seem to be some caveats: basically, an agent shouldn't be given either the motivation or the opportunity to pursue consequentialist reasoning about the real world. So if it finds an almost perfect strategy and has nothing more to do, it might start looking into more esoteric ways of improving the outcome, ways that pass through worlds we might care about. This possibility is already implausible in practice, but adding a term in the agent's value for solving some hard, harmless problem, like inverting hashes, should reliably take its attention away from physical worlds.
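
A hedged sketch of that last suggestion (the bounty size and stand-in target digest are illustrative): the agent's total value is its ordinary score plus a term that only pays out for inverting a hash, so spare optimization pressure is absorbed by a harmless, astronomically hard search.

    # "Add a hard, harmless term to the value function." Finding a preimage of
    # an arbitrary SHA-256 digest is computationally infeasible.

    import hashlib

    TARGET_DIGEST = "00" * 32  # stand-in 256-bit target

    def hash_bounty(candidate: bytes) -> float:
        """Large reward only for a preimage of the target digest."""
        return 1e6 if hashlib.sha256(candidate).hexdigest() == TARGET_DIGEST else 0.0

    def total_value(ordinary_score: float, candidate: bytes) -> float:
        return ordinary_score + hash_bounty(candidate)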

(My interest is in pushing this towards complete agents, in particular letting them care about things like the real world. Since bounded agents can't perfectly channel idealized preference, there is a question of how they can do good things while remaining misaligned to some extent, so studying simple concepts of misalignment might help.)

The issue in the OP is that the possibility of other situations influences the agent's decision. The standard way of handling this is to agree to disregard other situations, including by appealing to Omega's stipulated ability to inspire belief (that's the whole reason for introducing the trustworthiness clause). This belief, if the reality of situations is treated as equivalent to their probability in the agent's eyes, expels the other situations from consideration.

The idea Paul mentioned is just another way of making sure that the other situations don't intrude on the thought experiment. Since the main point is to get this done somehow, it doesn't really matter whether a universal prior favors anti-muggers over muggers; in that case we'd just need to change the thought experiment.

Thought experiments are not natural questions that rate the usefulness of decision theories; they are tests that examine particular features of decision theories. So if such an investigation goes too far afield (as in looking into the a priori weights of anti-muggers), that calls for a change in the thought experiment.

I reason as follows:

  1. Omega inspires belief only after the agent encounters Omega.
  2. According to UDT, the agent should not update its policy based on this encounter; it should simply follow it.
  3. Thus the agent should act on whatever policy is best according to its original (e.g. universal) prior, from before it encountered Omega (or indeed learned anything about the world); see the sketch below.
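
Here is a minimal sketch of what step 3 means operationally, with a toy finite world model standing in for the universal prior (everything here is a placeholder, not UDT proper): enumerate observation-to-action policies, score each by prior-expected utility, and then just execute the winner without updating on which observation actually arrives.

    # Updateless policy selection over a toy prior. Worlds, observations, and
    # utilities are placeholders chosen for illustration.

    from itertools import product
    from typing import Callable, Dict, List

    def best_policy(
        prior: Dict[str, float],                          # world -> probability
        observations: List[str],
        actions: List[str],
        utility: Callable[[str, Dict[str, str]], float],  # U(world, policy)
    ) -> Dict[str, str]:
        """Pick the observation -> action map with the highest prior-expected
        utility; that policy is then followed without further updating."""
        best, best_eu = {}, float("-inf")
        for assignment in product(actions, repeat=len(observations)):
            policy = dict(zip(observations, assignment))
            eu = sum(p * utility(w, policy) for w, p in prior.items())
            if eu > best_eu:
                best, best_eu = policy, eu
        return best

On a prior where the mugger world outweighs the anti-mugger world, the selected policy maps "asked for $100" to "pay"; that is the sense in which the agent gets mugged without ever updating on the encounter.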

I think either:

  1. the agent does update, in which case, why not update on the result of the coin-flip? or
  2. the agent doesn't update, in which case, what matters is simply the optimal policy given the original prior.