A Defense of Open-Minded Updatelessness.

This work owes a great debt to many conversations with Sahil, Martín Soto, and Scott Garrabrant.

You can support my work on Patreon.

Iterated Counterfactual Mugging On a Single Coinflip

Iterated counterfactual mugging on a single coinflip begins like a classic counterfactual mugging, with Omega approaching you, explaining the situation, and asking for your money. Let's say you buy the classic UDT idea, so you happily give Omega your money.

Next week, Omega appears again, with the same question. However, Omega clarifies that it has used the same coin-flip as last week.

This throws you off a little bit, but you see that the math is the same either way; your prior still assigns a 50-50 chance to both outcomes. If you thought it was a good deal last week, you should also think it is a good deal this week. You pay up again.

On the third week, Omega makes the same offer again, and once again has used the same coinflip. You ask Omega how many times it's going to do this. Omega replies, "forever". You ask Omega whether it would have continued coming if the coin had landed heads; it says "Of course! How else could I make you this offer now? Since the coin landed tails, I will come and ask you for $100 every single week going forward. If the coin had landed heads, I would have simulated what would happen if it had landed tails, and I would come and give you $10,000 on every week that simulated-you gives up $100!"

Let's say for the sake of the thought experiment that you can afford to give Omega $100 once a week. It hurts, but it doesn't hurt as much as getting $10,000 from Omega every week would have benefited you, if that had happened.[1]
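To make the arithmetic explicit, here is a minimal sketch of the per-week expected-value comparison UDT is running (assuming the linear dollars-to-utilons valuation from footnote 1 and the fair 50-50 prior on the coin):

```python
# Per-week expected value of each policy, computed from the original
# 50-50 prior over the coinflip (utility linear in dollars).
P_HEADS = 0.5

def weekly_ev(pay_when_tails: bool) -> float:
    # Tails branch: Omega asks for $100.
    # Heads branch: Omega pays $10,000 iff simulated tails-you pays.
    tails_utility = -100 if pay_when_tails else 0
    heads_utility = 10_000 if pay_when_tails else 0
    return (1 - P_HEADS) * tails_utility + P_HEADS * heads_utility

print(weekly_ev(pay_when_tails=True))   # 4950.0
print(weekly_ev(pay_when_tails=False))  # 0.0
```

On the prior, paying beats refusing by $4,950 in expectation every single week, which is why the updateless calculation keeps coming out the same.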

Nonetheless, I suspect many readers will feel some doubt creep in as they imagine giving Omega $100 week after week after week. The first few weeks, the possibility of the coin landing heads might feel "very real". Heck yeah I want to be the sort of person who gets a 50% chance of 10K from Omega for a (50% chance) cost of $100!

By the hundredth week, though, you may feel yourself the fool for giving up so much money for the imaginary benefit of the "heads" world that never was.

If you think you'd still happily give up the $100 for as long as Omega kept asking, then I would ask you to consider a counterlogical mugging instead. Rather than flipping a coin, Omega uses a digit of the binary expansion of π; as before, Omega uses the same digit week after week, for infinitely many counterlogical muggings.

Feeling uneasy yet? Does the possibility of the digit of π going one way or the other continue to feel "just as real" as time passes? Or do you become more sympathetic to the idea that, at some point, you're wasting money on helping a non-real world?

UDT vs Learning

Updateless Decision Theory (UDT) clearly keeps giving Omega the $100 forever in this situation, at least under the usual assumptions. A single Counterfactual Mugging is not any different from an infinitely iterated one, especially in the version above where only a single coinflip is used. The ordinary decision between "give up $100" and "refuse" is isomorphic to the choice between the general policies "give up $100 forever" and "refuse forever".[2]

However, the idea of applying a decision theory to a specific decision problem is actually quite subtle, especially for UDT. We generally assume an agent's prior equals the probabilities described in the decision problem.[3] A simple interpretation of this could be that the agent is born with this prior (and immediately placed into the decision problem). This isn't very realistic, though. How did the agent get the correct prior?[4]

A more realistic idea is that:

  • (a) the agent was born some time ago and has learned a fair amount about the world; 
  • (b) the agent has accumulated sufficient evidence to think, with high confidence, that it is now facing the decision problem being described; and importantly for UDT, 
  • (c) there are no further considerations in UDT's prior which would sway UDT in the situation being described.

However, (c) is not very plausible with UDT!

Lizard World

To elaborate on (c): We say that "UDT accepts counterfactual muggings". But, imagine that your prior had a 50% initial probability that you would end up in Lizard World rather than Earth. Lizards have a strange value system, which values updateful behavior intrinsically. They'll reward you greatly if your chosen policy agrees with updateful decision theory (except with respect to this specific incentive, which is of course a counterfactual mugging).

Given such a prior, UDT will refuse counterfactual muggings on Earth (if the incentive for accepting the mugging is less than the rewards offered by the Lizards). The assumption (c) is saying that there are no interfering considerations like this.
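As a toy illustration (the specific numbers here are my own, purely for the sketch), here is how a modest Lizard reward swamps the mugging:

```python
# Illustrative only: how large a Lizard reward makes UDT refuse the
# Earth-side counterfactual mugging? All numbers are assumptions.
P_LIZARD = 0.5           # prior probability of Lizard World
P_HEADS = 0.5            # mugging coin, within the Earth branch
LIZARD_REWARD = 20_000   # payout for a policy the Lizards deem updateful

def ev(pay_the_mugger: bool) -> float:
    # Paying is the updateless-style policy; refusing is what the Lizards reward.
    earth_ev = (P_HEADS * 10_000 - (1 - P_HEADS) * 100) if pay_the_mugger else 0
    lizard_ev = 0 if pay_the_mugger else LIZARD_REWARD
    return (1 - P_LIZARD) * earth_ev + P_LIZARD * lizard_ev

print(ev(True))    # 2475.0  -- pay up on Earth
print(ev(False))   # 10000.0 -- refuse, to collect the Lizards' reward
```

With these weights, refusing wins whenever the Lizard reward exceeds $4,950, no matter how clear-cut the mugging looks from inside the Earth branch. Assumption (c) rules interference like this out by fiat.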

Notice that this is quite a large assumption! Usually we think of agents as starting out with quite a broad prior, such as the Solomonoff distribution, which gives lots of crazy worlds non-negligible probability.

Agents need to start with broad priors in order to learn. But if UDT starts with a broad prior, it will probably not learn, because it will have some weird stuff in its prior which causes it to obey random imperatives from imaginary Lizards.

Learning Desiderata

Let's sharpen that idea into a maybe-desirable property:

Learning-UDT: Suppose an agent starts with prior P_0. If the prior were updated on the actual information the agent receives over time, it would become P_1, then P_2, and so on. An agent obeys "learning UDT" if it eventually behaves as if it were applying UDT with prior P_n, for any n.[5] That is: for each n, there is a time t_n after which all decisions maximize expected utility according to P_n.
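Stated symbolically, as a notational sketch (a_t is the decision made at time t, x_t the situation then faced, and UDT_{P_n} the policy UDT selects when run with prior P_n; these symbols are my shorthand, not part of the definition):

```latex
% Learning-UDT as a sketch: after some time t_n, every decision agrees
% with what UDT-with-prior-P_n would have chosen.
\forall n \;\; \exists t_n \;\; \forall t \ge t_n : \qquad a_t \;=\; \mathrm{UDT}_{P_n}(x_t)
```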

Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each P_n. Therefore, in particular, it eventually behaves like UDT with prior P_0. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior P_0 is required to eventually agree with the recommendations of P_1, P_2, ... (which also implies that these eventually agree with each other).

Here's a different learning principle which doesn't imply UDT so strongly:

Eventual Learning: At each timestep t, the agent obeys UDT with prior P_{f(t)}, where P_t would be the beliefs fully updated on the observations at time t. The function f(t) monotonically increases without bound.
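In the same shorthand, with f the index function from the definition above:

```latex
% Eventual Learning as a sketch: the prior in use at time t is P_{f(t)},
% and f grows without bound (f(t) = t would be fully updateful).
\forall t : \qquad a_t \;=\; \mathrm{UDT}_{P_{f(t)}}(x_t), \qquad f(t) \nearrow \infty
```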

The idea here is to allow agents to have fairly rich priors (so that they can learn what world they are in!), and also allow them to be somewhat updateless (so that they can correctly solve problems like counterfactual mugging), but require them to eventually "face facts" rather than make mistakes forever based on weird stuff in the prior. Eventual Learning + UDT implies Learning-UDT.

So the overall argument here is as follows:

  • UDT was invented to solve some decision problems.
  • However, realistic agents have to somehow learn what decision problem they face.
  • UDT doesn't learn like this by default. So the idea that it "solves decision problems" like counterfactual mugging is somewhat imaginary.
  • Eventual Learning solves this problem by eventually updating on any particular fact. Thus, we eventually understand any decision problem that we face, if we have faced similar problems enough times in the past. 

Notice that Eventual Learning implies that we eventually stop giving Omega $100 in Iterated Counterfactual Mugging on a Single Coinflip, since we eventually behave as if we've updated on the coin.

I should mention that I'm not necessarily endorsing Eventual Learning as a true normative constraint. It's more a way of pointing out that UDT can easily refuse to learn, in a way that seems bad. The main point here is to illustrate that UDT is sometimes incompatible with learning, but learning is a hidden necessary assumption behind applications of UDT to solve decision problems.

Open problem:[6] Under what conditions is classic UDT compatible with Eventual Learning? Is it possible to specify a rich prior while systematically avoiding the Lizard-World traps I mention, such that following the standard UDT decision procedure will also satisfy the Learning-UDT criterion? If so, can it learn to behave optimally in some nice rich class of situations which includes counterfactual muggings?

Bargaining

Another way to think about this, which both Scott Garrabrant and Diffractor suggested, is that we're somehow bargaining between different branches of possibility, instead of simply maximizing expected utility according to a fixed prior. 

Imagine the prior is split up into hypotheses, and the hypotheses bargain to the Pareto frontier by some method such as Nash bargaining. The resulting policy can be justified in terms of a fixed prior over hypotheses, but importantly, the weights on different hypotheses will depend on the bargaining. This means a "fair coin" might not be given 50-50 weight.

It's hard to apply this directly to Counterfactual Mugging, because the mugged branch simply loses out (if it pays up). The branch that benefits from the mugging has nothing to offer the branch that loses out, so there's no motivation for bargaining. However, let's vaguely imagine that there's an agreement to get counterfactually mugged sometimes as part of a larger bargain between hypotheses.

The UDT reasoning I offered for Iterated Counterfactual Mugging On a Single Coinflip at the beginning assumed that you're insensitive to the amount of money; if it makes sense to accept Omega's bargain once, it makes sense to accept it twice, and three times, etc. The utilities add linearly.

However, if we think of our probability for the coin-flip as the result of bargaining, it makes sense that it might be sensitive to size. The negotiation which was willing to trade $100 from one branch to get $10,000 in another branch need not be equally willing to perform that trade arbitrarily many times.

In fact, I can give a toy analysis based on bargaining which justifies giving Omega the money only sometimes.[9] 

Obviously this is hand-wavy, and needs to be worked out properly. While I'm gesticulating wildly, I'll also throw in that geometric rationality seems relevant. Is the distinction between types of uncertainty we should be updateless about, vs types we should be updateful about, related to cases where we should use linear vs geometric expectation?

Can we justify a policy of eventually ignoring branches that claim too many cross-branch correlations, on the grounds that those branches are "greedy" and a "fair" prior shouldn't treat them like utility monsters?

Open Problems:

  • Spell out a sensible version of bargaining-based UDT.
  • Can we justify a policy of mostly ignoring Lizard World based on intuitions from bargaining, ie, that Lizard Worlds are "too greedy" and a "fair" prior shouldn't cede very much to them?
  • Is there a connection to Geometric Rationality which clarifies what we should be updateful vs updateless about?

Possible Critiques

At this point, I imagine a man in an all-black suit has snuck up behind me while I'm writing. I'm initially unnerved, but I relax when I spot the Bayesian Conspiracy ring on his finger.

Interlocutor: ...Unorthodox.

Me: Perhaps.

Interlocutor: What you've observed is that UDT doesn't react to a decision problem as you'd expect unless there are no relevant cross-branch entanglements of sufficient importance to change that. A different way to put this would be: UDT only deviates from optimal play on a subtree when it has a good reason. Your learning principle requires the agent to eventually ignore this good reason.

Me: Not necessarily. There are two ways to get learning: we can change the decision theory so that cross-branch entanglements are eventually ignored if they're "too old" -- the simplest way to achieve this is to eventually update on each individual piece of information. Or, we can stick to UDT and bound the cross-branch entanglements allowed by the prior (this is the idea behind the learning-UDT condition).

Interlocutor: Neither option is plausible. If you update, you're not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you're simply advising people to be delusional.

Me: I'll concede for now[7] that reality does not forbid Iterated Counterfactual Mugging On a Single Coinflip or Lizard Worlds.

Interlocutor: And my argument against updating?

Me: You argue that any deviation from updatelessness creates an incentive to self-modify to be more updateless. However, there is no tiling theorem for UDT that I am aware of, which means we don't know whether UDT avoids self-modifying; it's only a conjecture. 

Interlocutor: You claim the conjecture is false?

Me: A broad prior is dumb enough that it'll probably self-modify out of ignorance. If I can see that UDT has a learning problem, so can rational agents. If we start life with a broad prior, then there'll be lots of nonsense possibilities which we shouldn't be updateless about. It seems clear to me that we need some compromise between updatefulness and updatelessness, in practice.

Interlocutor: I reiterate: UDT only deviates from optimal play on a subtree when it has a good reason. Your concept of learning asks us to ignore those good reasons.

Me: I would agree so far as this: if you trust the prior, then the cases where updatelessness is optimal form a strict superset of the cases where updatefulness is optimal. When we set up a decision problem, we artificially assume that the prior is good -- that it describes reality somehow. "Prior reason" is not the same as "good reason".

Interlocutor: The prior is subjective. An agent has no choice but to trust its own prior. From its own perspective, its prior is the most accurate description of reality it can articulate.

Me: See, sometimes I feel like Bayesian subjectivism hides a subtle assumption of objectivity. Like when someone goes from endorsing Solomonoff Induction to endorsing the existence of a multiverse of all computations. Bayesian subjectivism gets quietly swapped out for multiverse frequentism.

Interlocutor: I'm doing no such thing. A Pragmatist's Guide to Epistemic Utility by Ben Levinstein shows that when an agent analyzes the value of possible beliefs in terms of their expected utility for decisions which the agent expects to make later, the result tends to be a proper scoring rule. Therefore, the agent's own beliefs will be the most valuable beliefs to have, by its calculations.

Me: But you've made a significant hidden assumption! What you say is only true if the agent knows how to write down the alternative beliefs in complete detail. Alice can think that Bob's beliefs are more accurate than hers, perhaps because Bob has access to more of the relevant information. Or, Alice can think her own later beliefs will be more accurate than her current beliefs. These things are allowed because Alice doesn't know what Bob's beliefs are in detail, and doesn't already know her future self's beliefs in detail.

Interlocutor: I would hardly call it a hidden assumption. But you're correct. What you're referring to is called the reflection principle, and is formalized as P(A | Q(A) = x) = x, where Q is the probability distribution which P trusts over itself -- usually, the future beliefs of the same agent. But this being the case, P will naturally utilize the information in Q when it learns its contents in detail. There is no need for an extra change-of-prior process.

Me: Well, for one thing, there's a size problem: in order for P to think about all the possible ways Q might turn out, Q has to 'fit inside' P; but if Q is similarly sized to P, or even larger, then there could be problems.

Interlocutor: So you're saying we might need to be updateful because we have finite computational power, and updating to a new set of beliefs can be less costly than treating those new beliefs as observations and considering how all possible policies could act on that observation?

Me: That's one way to think about it, but I worry this version will leave you thinking of UDT as the perfect ideal which we're trying to approximate in a bounded fashion.

Interlocutor: Indeed it would.

Me: Here's a thought-experiment. Suppose we've discovered the objectively correct human utility function, and we also know how to make a working UDT computation with whatever utility function we like. All that remains is to decide on the prior. Would you give it a broad prior, like Solomonoff's prior?

Interlocutor: We should use exactly our own prior. We don't want it to be too broad, because this will entail making trade-offs which sacrifice utility to worlds we don't believe in (such as your example with a 50% chance of ending up in Lizard World, where the Lizards reward agents who don't act in a UDT-optimal way on Earth). We also don't want it to know too much, because then it might not make trade-offs which we would happily endorse based on our UDT.

Me: Unfortunately, although I'm granting you perfect knowledge of the human utility function for the purpose of the thought experiment, I am not granting you perfect knowledge of the human prior.

Interlocutor: In that case, we should design the AI to learn about humans and estimate the human prior, and make decisions based on that estimate.

Me: And should the AI be updateless about this? Should it simply make UDT decisions according to this best-guess prior we've input?

Interlocutor: Hm... no, it seems like we want it to learn. Imagine if we had narrowed down the human prior to two possibilities, P_1 and P_2. Humans can't figure out which one represents our beliefs better, but the superintelligent AI will be able to figure it out. Moreover, suppose that P_2 is bad enough that it will lead to a catastrophe from the human perspective (that is, from the P_1 perspective), even if the AI were using UDT with 50-50 uncertainty between the two. Clearly, we want the AI to be updateful about which of the two hypotheses is correct.

Me: You've invented my new decision theory for me. The reality is that we aren't certain what "our prior" is. Like the AI you describe, we are in the process of learning which prior is "correct". So, like that AI, we should be updating our prior as we learn more about what "the correct prior" is.

Interlocutor: So you're saying that divergence from UDT can be justified when we don't perfectly know our own beliefs?

Me: I'm not sure if that's exactly the condition, but at least it motivates the idea that there's some condition differentiating when we should be updateful vs updateless. I think uncertainty about "our own beliefs" is subtly wrong; it seems more like uncertainty about which beliefs we normatively endorse.

Interlocutor: But in this thought experiment, it seemed like uncertainty about our own prior is what drove the desire for updatefulness.

Me: When you said that in principle we should program the AI with our own prior, you were assuming that we were using a tiling UDT which endorses its own prior as the "correct" one. But since we are in fact uncertain about what our prior is, the same reasoning can apply to us. We shouldn't just be UDT, either, because we're still in the process of figuring out what prior we want to use with UDT. So it seems more accurate to me to say we're trying to figure out "the correct" prior, rather than "our" prior. 

Interlocutor: Your language seems to suggest realism about probabilities.

Me: Yes, but in a similar sense to the moral realism of Eliezer's Meta-ethics sequence. You could call it subjectively objective. I claim that, when we put UDT into a decision problem, we make a significant assumption that the prior "correctly" describes the situation; for example, that the coin-flip in Counterfactual Mugging is indeed fair. But a totally subjectivist position about probability says: no coin is "really fair" -- 50-50 odds are a feature of maps, not territory! Coins in fact land one way or the other; all else is mere uncertainty! Yet, by asking the question "which prior should I use with UDT?" we create a need for a normative standard on probability distributions beyond mere accuracy.[8]


If you value essays like this, you can support my work on Patreon.

  1. ^

    I am assuming that the value of money is linear (one utilon per dollar), but with temporal discounting to make overall expectations well-defined.

  2. ^

    The iterated problem adds other possible policies, where we sometimes give up $100 and sometimes do not; from UDT's perspective, these options are simply intermediate between "always give" and "never give".

  3. ^

    Actually, we do some kind of extrapolation to create a sensible prior. For problems like Newcomb's Problem, we can just take the situation at face value; UDT's prior is just a probabilistic description of the situation which has already been described in English. However, for problems like Counterfactual Mugging, we often describe the problem as "Omega tells you the coin has landed tails" -- but then to feed the problem to UDT, we would construct a prior which gave the coin a 50-50 chance of heads or tails.

    We can eliminate this ambiguity by spelling out the prior from the beginning, as part of the decision problem.

    However, the work of "extrapolating" a problem into a sensible prior for use with UDT is extremely important and worth studying. In fact, that is one way to state the main point of this post.

  4. ^

    For example, in Newcomb's Problem, we might say things like "Society on Earth has seen Omega do things like this for a long time, and instances are very well-documented; not once has Omega been known to make an incorrect prediction."

    Stuff like this is often added to sort of "get people in the mood" if they appear to be intuitively rejecting the hypothetical. From one perspective, this is irrelevant fluff that doesn't change the decision problem, instead merely helping the listener to concretely imagine it. The perspective I'm arguing here ascribes somewhat more significance to backstories like this. 

  5. ^

    Learning UDT doesn't specify anything about the learning rate; if an agent uses P_t at time t, then it's completely updateful and not UDT at all, but still counts as "Learning UDT" by this definition. 

    What I really have in mind is a slower learning rate. For example, I want to guarantee that the agent, faced with iterated counterfactual muggings using independent coins, eventually learns to give Omega $100 when asked. 

    Presumably, if UDT has a prior such that 

    (1) the prior is rich enough to learn about arbitrary decision problems like counterfactual mugging, 

    (2) the Learning UDT criterion is satisfied by the classic UDT decision procedure when using this prior, 

    we can also prove that UDT won't use P_t at time t, since this would violate the assumption that UDT could correctly learn that it's in a counterfactual mugging. However, it is unclear whether a prior like this can be constructed.

  6. ^

    Actually, I think Diffractor has made at least some progress on this.

  7. ^

    I still think it might be interesting/important to explore the compatibility of learning with UDT.

    For one thing, the assumption that the universe contains only bounded "cross-branch entanglements" might be thought of as a learnability assumption, similar to a no-traps assumption. We don't believe that the universe contains zero deadly traps, but if the universe might contain some deadly traps, then it is not rational to explore the environment and learn. Therefore, we may need to make a no-traps assumption to study learning behavior. Similarly, if one branch is allowed to entangle with another forever, this stops UDT from learning. So we may wish to study the learning behavior of UDT by making a bounded-entanglement assumption.

    Secondarily, I have some suspicion that a bounded-entanglement assumption can be justified by other means. Entanglements are always subjective. In the case of counterfactual mugging, for example, the agent subjectively thinks of Omega's simulation as accurate -- that is, correlated with what that agent "would have actually done" in the other branch. However, a bounded agent can only concretely believe in finitely many such correlations at a given time. So, perhaps some versions of boundedly-rational UDT would come with bounded-correlation assumptions "for free"?

  8. ^

    Indeed, it seems somewhat frequentist: I regard the coinflip as fair if I can naturally interpret it as part of a sequence of coinflips which looks like it converges to something close to 50-50 frequency, and contains no other patterns that I can exploit with my bounded intelligence.

  9. ^

    Imagine that Omega will approach us 5 times, and also, that we have $500 to start, so if we say yes every time we will be reduced to $0. Further, imagine that we are maximizing the product of our ending amount of money in the two branches (an assumption we can justify with Nash bargaining, assuming the BATNA is $0).

    In this case, it is optimal to say yes to Omega twice:

    | Number of 'yes' | End wealth of benefitting branch | End wealth of mugged branch | Product of end wealths |
    |---|---|---|---|
    | 0 | 500 | 500 | 250,000 |
    | 1 | 10,500 | 400 | 4,200,000 |
    | 2 | 20,500 | 300 | 6,150,000 |
    | 3 | 30,500 | 200 | 6,100,000 |
    | 4 | 40,500 | 100 | 4,050,000 |
    | 5 | 50,500 | 0 | 0 |
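    A few lines of code reproduce the table (a minimal sketch of the same Nash-product calculation; the $500 starting wealth, five offers, and $0 BATNA are as stipulated above):

```python
# Toy Nash-bargaining model from footnote 9: choose how many of Omega's five
# offers to accept so as to maximize the product of end wealths in the two
# branches (BATNA = $0 in each branch).
START = 500          # starting wealth in each branch
OFFERS = 5           # number of times Omega asks

rows = []
for n_yes in range(OFFERS + 1):
    benefitting = START + 10_000 * n_yes   # branch where Omega pays out
    mugged = START - 100 * n_yes           # branch where we hand over $100
    rows.append((n_yes, benefitting, mugged, benefitting * mugged))

for n_yes, benefitting, mugged, product in rows:
    print(n_yes, benefitting, mugged, product)

best = max(rows, key=lambda r: r[-1])
print("Optimal number of yeses:", best[0])   # 2
```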
Comments

But if UDT starts with a broad prior, it will probably not learn, because it will have some weird stuff in its prior which causes it to obey random imperatives from imaginary Gods.

Are you suggesting that this is a unique problem for UDT, or affects it more than other decision theories? It seems like Bayesian decision theories can have the same problem, for example a Bayesian agent might have a high prior that an otherwise non-interventionist God will reward them after death for not eating apples, and therefore not eat apples throughout their life. How is this different in principle from UDT refraining from paying the counterfactual mugger in your scenario to get reward from God in the other branch? Why wouldn't this problem be solved automatically given "good" or "reasonable" priors (whatever that means), which presumably would assign such gods low probabilities to begin with?

Interlocutor: The prior is subjective. An agent has no choice but to trust its own prior. From its own perspective, its prior is the most accurate description of reality it can articulate.

I wouldn't say this, because I'm not sure that the prior is subjective. From my current perspective I would say that it is part of the overall project of philosophy to figure out the nature of our priors and the contents of what they should be (if they're not fully subjective or have some degree of normativity).

So I think there are definitely problems in this area, but I'm not sure it has much to do with "learning" as opposed to "philosophy" and the examples / thought experiments you give don't seem to pump my intuition in that direction much. (How UDT works in iterated counterfactual mugging also seems fine to me.)

for example a Bayesian agent might have a high prior that an otherwise non-interventionist God will reward them after death for not eating apples, and therefore not eat apples throughout their life.

Yeah, this is an important point, but I think UDT has it significantly worse. For one thing, UDT has the problem I mention on top of the problem you mention. But more importantly, I think the problem I mention is less tractable than the problem you mention.

EDIT: I've edited the essay to name my problem as "lizard worlds" (lizards reward updateful policies). So I'll call the issue you raise the heaven/hell problem, and the issue I raise the lizard world problem.

For updateful DT, we can at least say: yes, a broad prior will include heaven/hell hypotheses which dramatically impact policy choice. But updateful priors have tools to address this problem:

  • The prior includes a likelihood function for heaven/hell hypotheses, which specifies how the probability of such hypotheses gets adjusted in light of evidence. 
  • We mostly[1] trust simplicity priors to either make sensible likelihood functions, which will only lean towards heaven/hell hypotheses when there's good reason, or else penalize heaven/hell hypotheses a priori for having a higher description complexity.
  • We can also directly provide feedback about the value estimates to teach an updateful DT to have sensible expectations.[2]

None of these methods help UDT address the lizard world problem:

  • The likelihood functions don't matter; only the prior probability matters.
  • Simplicity priors aren't especially going to rule out these alternative worlds.[3]
  • Direct feedback we give about expected values doesn't reduce the prior weight of these problematic hypotheses.

So I think there are definitely problems in this area, but I'm not sure it has much to do with "learning" as opposed to "philosophy" and the examples / thought experiments you give don't seem to pump my intuition in that direction much. (How UDT works in iterated counterfactual mugging also seems fine to me.)

Yeah, I expect the thought experiment I start with is only going to be compelling to people who sort of already agree with me.

I do agree that "philosophy" problems are very close to this stuff, and it would be good to articulate in those terms.

  1. ^

    Modulo inner-optimizer concerns like simulation attacks.

  2. ^

    I'm imagining something like Bayesian RL, or Bayesian approval-directed agents, but perhaps with the twist that feedback is only sometimes given.

  3. ^

    This is somewhat nuanced/questionable. The lizards might be providing rewards/punishments for behavior in lots of worlds, not just Earth, so that this hypothesis doesn't have to point to Earth specifically. However, if utilities are bounded, then this arguably weakens the rewards/punishments relevant to Earth, which is similarly reassuring to giving this hypothesis less prior weight.

    Perhaps it doesn't have to weaken the rewards/punishments relevant to Earth, though, if lizards reward only those who always reject counterfactual muggings in all other worlds (not including the Lizard's offer, of course, which is arguably a counterfactual mugging itself).

    Also, I think there are more complications related to inner optimizers.

However, there is no tiling theorem for UDT that I am aware of, which means we don't know whether UDT is reflectively consistent; it's only a conjecture.

I think this conjecture is probably false for reasons described in this section of "When does EDT seek evidence about correlations?". The section offers an argument for why son-of-EDT isn't UEDT, but I think it generalizes to an argument for why son-of-UEDT isn't UEDT.

Briefly: UEDT-at-timestep-1 is making a different decision than UEDT-at-timestep-0. This means that its decision might be correlated (according to the prior) with some facts which UEDT-at-timestep-0's decision isn't correlated with. From the perspective of UEDT-at-timestep-0, it's bad to let UEDT-at-timestep-1 make decisions on the basis of correlations with things that UEDT-at-timestep-0 can't control.

I haven't analyzed your argument yet, but: tiling arguments will always depend on assumptions. Really, it's a question of when something tiles, not whether. So, if you've got a counterexample to tiling, a natural next question is what assumptions we could make to rule it out, and how unfortunate it is to need those assumptions.

I might not have understood adequately, yet, but it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action. This is a big assumption, but at the same time, the sort of assumption I would expect to need in order to justify UDT. As Eliezer put it, tiling results need to assume that the environment only cares about what policy we implement, not our "rituals of cognition" that compute those policies. An earlier act of self-modification vs a later decision is a difference in "ritual of cognition" as opposed to a difference in the policy, to me.

So, I need to understand the argument better, but it seems to me like this kind of counterexample doesn't significantly wound the spirit of UDT. 

it sounds to me like ruling this out requires an assumption about the correlations of an action being the same as the correlations of an earlier self-modifying action to enforce that later action.

I would guess that assumption would be sufficient to defeat my counter-example, yeah.

I do think this is a big assumption. Definitely not one that I'd want to generally assume for practical purposes, even if it makes for a nicer theory of decision theory. But it would be super interesting if someone could make a proper defense of it typically being true in practice.

E.g.: Is it really true that a human's decision about whether or not to program a seed AI to take action A has the same correlations as that same superintelligence deciding whether or not to take action A 1000 years later while using a jupiter brain for its computation? Intuitively, I'd say that the human would correlate mostly with other humans and other evolved species, and that the superintelligence would mostly correlate with other superintelligences, and it'd be a big deal if that wasn't true.

Here's a different way of framing it: if we don't make this assumption, is there some useful generalization of UDT which emerges? Or, having not made this assumption, are we stuck in a quagmire where we can't really say anything useful?

I think about these sorts of 'technical assumptions' needed for nice DT results as "sanity checks":

  • I think we need to make several significant assumptions like this in order to get nice theoretical DT results.
  • These nice DT results won't precisely apply to the real world; however, they do show that the DT being analyzed at least behaves sanely when it is in these 'easier' cases.
  • So it seems like the natural thing to do is prove tiling results, learning results, etc under the necessary technical assumptions, with some concern for how restrictive the assumptions are (broader sanity checks being better), and then also, check whether behavior is "at least somewhat reasonable" in other cases.

So if UDT fails to tile when we remove these assumptions, but, at least appears to choose its successor in a reasonable way given the situation, this would count as a success.

Better, of course, if we can find the more general DT which tiles under weaker assumptions. I do think it's quite plausible that UDT needs to be generalized; I just expect my generalization of UDT will still need to make an assumption which rules out your counterexample to UDT.

Nice post. I'm excited about the bargaining interpretation of UDT.

However, if we think of our probability for the coin-flip as the result of bargaining, it makes sense that it might be sensitive to size. The negotiation which was willing to trade $100 from one branch to get $10,000 in another branch need not be equally willing to perform that trade arbitrarily many times.

Given this, is there any reason to focus on iterated counterfactual mugging, as opposed to just counterfactual muggings with higher stakes?

It seems like iteration is maybe related to learning. That doesn't make a difference for counterfactual mugging, because you'll learn nothing relevant over time.

For counterlogical muggings about the Nth digit of pi, we can imagine a scenario where you would have learned the Nth digit of pi after 1000 days, and therefore wouldn't have paid if Omega had first offered you the deal on the 1001st day. But now it's confounded by the fact that he already told you about it... So maybe there's something here where you stop taking the deal on the day when you would have found out the Nth digit of pi if Omega hadn't appeared?

Yeah, I'm kind of connecting a lot of threads here in a messy way. This post definitely could be better-organized.

  • I have a sense that open-minded UDT should relate to objective probabilities in a frequentist sense. For example, in decision problems involving Omega, it's particularly compelling if we stipulate that Omega has a long history of offering similar choices to mortals and a track record of being honest and predicting correctly. This is in some sense the most compelling way we can come to know what decision problem we face; and, it relies on framing our decision as part of a sequence. Counterlogical mugging on a digit of pi is similarly compelling if we imagine a very large digit, but becomes less compelling as we imagine digits closer to the beginning of pi. I want to suggest a learning principle with frequentist-flavored guarantees (similar to LIDT or BRIA but less updateful).
  • On the other hand, the bargaining framing does not have anything to do with iteration. The bargaining idea in some sense feels much more promising, since I can already offer a toy analysis supporting my intuition that iterated counterfactual mugging with the same coin is less tempting than iterated muggings with different coins.

For counterlogical mugging, it's unclear if it should be possible to correctly discover the parity of the relevant digit of pi. I would expect that in the counterfactual where it's even, it will eventually be discovered to be even. And in the counterfactual where it's odd, that same digit will eventually be discovered to be odd.

ASP and Transparent Newcomb might be closer to test cases for formulating updateless policies that have the character of getting better as they grow more powerful. These problems ask the agent to use a decision procedure that intentionally doesn't take certain information into account, whether the agent as a whole has access to that information or not. But they lack future steps that would let that decision procedure benefit from eventually getting stronger than the agent that initially formulated it, so these aren't quite the thought experiments needed here.

Did you ever read Lara Buchak's book? Seems related.

Also, I'm not really intuition-pumped by the repeated mugging example. It seems similar to a mugging where Omega only shows up once, but asks you for a recurring payment.

A related issue might be asking if UDT-ish agents who use a computable approximation to the Solomonoff prior are reflectively stable - will they want to "lock out" certain hypotheses that involve lots of computation (e.g. universes provably trying to simulate you via search for simple universes that contain agents who endorse Solomonoff induction). And probably the answer is going to be "it depends," and you can do verbal argumentation for either option.

Yeah, I expect the Lizard World argument to be the more persuasive argument for a similar point. I'm thinking about reorganizing the post to make it more prominent. 

Updatelessness is about coordination between possible versions of an agent. Coordination with more distant versions of an agent gets more difficult or less informative, and a stronger version of an agent can reach further. This results in many local commitments that coordinate more related versions of an agent.

These local commitments, as agents in their own right, can grow stronger and should themselves coordinate with each other, where their parents failed to reach. Commitment to a strategy that won't itself engage in future rounds of coordination with its alternative possible forms (and other things) is a bad commitment.

interesting, this actually changed my mind, to the extent i had any beliefs about this already. I can see why you would want to update your prior, but the iterated mugging doesn't seem like the right type of thing that should cause you to update. My intuition is to pay all the single coinflip muggings. For the digit of pi muggings, i want to consider how different this universe would be if the digit of pi was different. Even though both options are subjectively equally likely to me, one would be inconsistent with other observations or less likely or have something wrong with it, so i lean toward never paying it

Yeah, in hindsight I realize that my iterated mugging scenario only communicates the intuition to people who already have it. The Lizard World example seems more motivating.

Excellent explanation, congratulations! Sad I'll have to miss the discussion.

Interlocutor: Neither option is plausible. If you update, you're not dynamically consistent, and you face an incentive to modify into updatelessness. If you bound cross-branch entanglements in the prior, you need to explain why reality itself also bounds such entanglements, or else you're simply advising people to be delusional.

You found yourself a very nice interlocutor. I think we truly cannot have our cake and eat it: either you update, making you susceptible to infohazards=traps (if they exist, and they might exist), or you don't, making you entrenched forever. I think we need to stop dancing around this fact, recognize that a fully-general solution in the formalism is not possible, and instead look into the details of our particular case. Sure, our environment might be adversarially bad, traps might be everywhere. But under this uncertainty, which ways do we think are best to recognize and prevent traps (while updating on other things)? This is kind of like studying and predicting generalization: given my past observations, where do I think I will suddenly fall out of distribution (into a trap)?

Me: I'm not sure if that's exactly the condition, but at least it motivates the idea that there's some condition differentiating when we should be updateful vs updateless. I think uncertainty about "our own beliefs" is subtly wrong; it seems more like uncertainty about which beliefs we endorse.

This was very thought-provoking, but unfortunately I still think this crashes head-on with the realization that, a priori and in full generality, we can't differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us by updating on "our own beliefs" or "which beliefs I endorse"? After all, that's just one more part of reality (without a clear boundary separating it).

It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can't know the correct one in advance, we always have to rely on extrapolating contingent past observations.
But then, it seems like your reaction is still hoping that we can have our cake and eat it: "I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I'm in the Infinite Counterlogical Mugging... then I will just eventually change my prior because I noticed I'm in the bad world!". But then again, why would we think this update is safe? That's just not being updateless, and losing out on the strategic gains from not updating.

Since a solution doesn't exist in full generality, I think we should pivot to more concrete work related to the "content" (our particular human priors and our particular environment) instead of the "formalism". For example:

  • Conceptual or empirical work on which are the robust and safe ways to extract information from humans (Suddenly LLM pre-training becomes safety work)
  • Conceptual or empirical work on which actions or reasoning are more likely to unearth traps under different assumptions (although this work could unearth traps)
  • Compilation or observation of properties of our environment (our physical reality) that could have some weak signal on which kinds of moves are safe
    • Unavoidably, this will involve some philosophical / almost-ethical reflection about which worlds we care about and which ones we are willing to give up.

This was very thought-provoking, but unfortunately I still think this crashes head-on with the realization that, a priori and in full generality, we can't differentiate between safe and unsafe updates. Indeed, why would we expect that no one will punish us by updating on "our own beliefs" or "which beliefs I endorse"? After all, that's just one more part of reality (without a clear boundary separating it).

I'm comfortable explicitly assuming this isn't the case for nice clean decision-theoretic results, so long as it looks like the resulting decision theory also handles this possibility 'somewhat sanely'.

It sounds like you are correctly explaining that our choice of prior will be, in some important sense, arbitrary: we can't know the correct one in advance, we always have to rely on extrapolating contingent past observations.
But then, it seems like your reaction is still hoping that we can have our cake and eat it: "I will remain uncertain about which beliefs I endorse, and only later will I update on the fact that I am in this or that reality. If I'm in the Infinite Counterlogical Mugging... then I will just eventually change my prior because I noticed I'm in the bad world!". But then again, why would we think this update is safe? That's just not being updateless, and losing out on the strategic gains from not updating.

My thinking is more that we should accept the offer finitely many times or some fraction of the times, so that we reap some of the gains from updatelessness while also 'not sacrificing too much' in particular branches.

That is: in this case at least it seems like there's concrete reason to believe we can have some cake and eat some too.

Since a solution doesn't exist in full generality, I think we should pivot to more concrete work related to the "content" (our particular human priors and our particular environment) instead of the "formalism".

This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I'm using in my arguments. I'm more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I'm interested in doing work to help navigate is the tiling problem.

That is: in this case at least it seems like there's concrete reason to believe we can have some cake and eat some too.

I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can't do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases your utility in other worlds). The fundamental dichotomy remains as sharp, and it's misleading to imply we can surmount it. It's great to discuss, given this dichotomy, which trade-offs we humans are more comfortable making. But I've felt this was obscured in many relevant conversations.

This content-work seems primarily aimed at discovering and navigating actual problems similar to the decision-theoretic examples I'm using in my arguments. I'm more interested in gaining insights about what sorts of AI designs humans should implement. IE, the specific decision problem I'm interested in doing work to help navigate is the tiling problem.

My point is that the theoretical work you are shooting for is so general that it's closer to "what sorts of AI designs (priors and decision theories) should always be implemented", rather than "what sorts of AI designs should humans in particular, in this particular environment, implement".
And I think we won't gain insights on the former, because there are no general solutions, due to fundamental trade-offs ("no-free-lunchs").
I think we could gain many insights on the latter, but that the methods better fit for that are less formal/theoretical and way messier/"eye-balling"/iterating.

I disagree with this framing. Sure, if you have 5 different cakes, you can eat some and have some. But for any particular cake, you can't do both. Similarly, if you face 5 (or infinitely many) identical decision problems, you can choose to be updateful in some of them (thus obtaining useful Value of Information, that increases your utility in some worlds), and updateless in others (thus obtaining useful strategic coherence, that increases your utility in other worlds). The fundamental dichotomy remains as sharp, and it's misleading to imply we can surmount it. It's great to discuss, given this dichotomy, which trade-offs we humans are more comfortable making. But I've felt this was obscured in many relevant conversations.

I don't get your disagreement. If your view is that you can't eat one cake and keep it too, and my view is that you can eat some cakes and keep other cakes, isn't the obvious conclusion that these two views are compatible?

I would also argue that you can slice up a cake and keep some slices but eat others (this corresponds to mixed strategies), but this feels like splitting hairs rather than getting at some big important thing. My view is mainly about iterated situations (more than one cake).

Maybe your disagreement would be better stated in a way that didn't lean on the cake analogy?

My point is that the theoretical work you are shooting for is so general that it's closer to "what sorts of AI designs (priors and decision theories) should always be implemented", rather than "what sorts of AI designs should humans in particular, in this particular environment, implement".
And I think we won't gain insights on the former, because there are no general solutions, due to fundamental trade-offs ("no-free-lunchs").
I think we could gain many insights on the latter, but that the methods better fit for that are less formal/theoretical and way messier/"eye-balling"/iterating.

Well, one way to continue this debate would be to discuss the concrete promising-ness of the pseudo-formalisms discussed in the post. I think there are some promising-seeming directions. 

Another way to continue the debate would be to discuss theoretically whether theoretical work can be useful.

It sort of seems like your point is that theoretical work always needs to be predicated on simplifying assumptions. I agree with this, but I don't think it makes theoretical work useless. My belief is that we should continue working to make the assumptions more and more realistic, but the 'essential picture' is often preserved under this operation. (EG, Newtonian gravity and general relativity make most of the same predictions in practice. Kolmogorov axioms vindicated a lot of earlier work on probability theory.)

Imagine if we had narrowed down the human prior to two possibilities, P_1 and P_2 . Humans can’t figure out which one represents our beliefs better, but the superintelligent AI will be able to figure it out. Moreover, suppose that P_2 is bad enough that it will lead to a catastrophe from the human perspective (that is, from the P_1 perspective), even if the AI were using UDT with 50-50 uncertainty between the two. Clearly, we want the AI to be updateful about which of the two hypotheses is correct.

This seems like the central argument in the post, but I don't understand how it works.

Here's a toy example. Two envelopes, one contains $100, the other leads to a loss of $10000. We don't know which envelope is which, but it's possible to figure out by a long computation. So we make a money-maximizing UDT AI, whose prior is "the $100 is in whichever envelope {long_computation} says". Now if the AI has time to do the long computation, it'll do it and then open the right envelope. And if it doesn't have time to do the long computation, and is offered to open a random envelope or abstain, it will abstain. So it seems like ordinary UDT solves this example just fine. Can you explain where "updatefulness" comes in?

Let's frame it in terms of value learning.

Naive position: UDT can't be combined with value learning, since UDT doesn't learn. If we're not sure whether puppies or rainbows are what we intrinsically value, but rainbows are easier to manufacture, then the superintelligent UDT will tile the universe with rainbows instead of puppies because that has higher expectation according to the prior, regardless of evidence it encounters suggesting that puppies are what's more valuable.

Cousin_it's reply: There's puppy-world and rainbow-world. In rainbow-world, tiling the universe with rainbows has 100 utility, and tiling the universe with puppies has 0 utility. In puppy-world, tiling the universe with puppies has 90 utility (because puppies are harder to maximize than rainbows), but rainbows have 0 utility.

The UDT agent gets to observe which universe it is in, although it has a 50-50 prior on the two. There are four policies:

  • Always make puppies: this has a 50% chance of a utility of 90, and otherwise yields zero.
    • EV: 45
  • Always make rainbows: 50% chance of utility 100, otherwise zero.
    • EV: 50
  • Make puppies in rainbow world; make rainbows in puppy world.
    • EV: 0
  • Make puppies in puppy world, make rainbows in rainbow world.
    • EV: 95

The highest EV is to do the obvious value-learning thing; so, there's no problem.

Fixing the naive position: Some hypotheses will "play nice" like the example above, and updateless value learning will work fine.

However, there are some versions of "valuing puppies" and "valuing rainbows" which value puppies/rainbows regardless of which universe the puppies/rainbows are in. This only requires that there's some sort of embedding of counterfactual information into the sigma-algebra which the utility functions are predicated on. For example, if the agent has beliefs about PA, these utility functions could check for the number of puppies/rainbows in arbitrary computations. This mostly won't matter, because the agent doesn't have any control over arbitrary computations; but some of the computations contemplated in Rainbow Universe will be good models of Puppy Universe. Such a rainbow-value-hypothesis will value policies which create rainbows over puppies regardless of which branch they do it in.

These utility functions are called "nosy neighbors" because they care about what happens in other realities, not just their own.

Suppose the puppy hypothesis and the rainbow hypothesis are both nosy neighbors. I'll assume they're nosy enough that they value puppies/rainbows in other universes exactly as much as in their own. There are four policies:

  • Always make puppies: 50% chance of being worthless, if the rainbow hypothesis is true. 50% of getting 90 for making puppies in puppy-universe, plus 90 more for making puppies in rainbow-universe.
    • EV: 90
  • Always make rainbows: 50% worthless, 50% worth 100 + 100.
    • EV: 100
  • Make puppies in rainbow universe, rainbows in puppy universe: 50% a value of 90, 50% a value of 100.
    • EV: 95
  • Puppies in puppy universe, rainbows in rainbow universe:
    • EV: 95

In the presence of nosy neighbors, the naive position is vindicated: UDT doesn't do "value learning".
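For readers who want to check the arithmetic, here is a small script computing all four expected values in both the "plays nice" and nosy-neighbor cases (a sketch using the 90/100 payoffs above, and assuming a nosy neighbor values its product equally wherever it is made):

```python
# Expected values of the four policies, with and without nosy-neighbor
# utility hypotheses (payoffs as in the puppy/rainbow example above).
policies = {
    "always puppies":  {"puppy_world": "puppies",  "rainbow_world": "puppies"},
    "always rainbows": {"puppy_world": "rainbows", "rainbow_world": "rainbows"},
    "swap":            {"puppy_world": "rainbows", "rainbow_world": "puppies"},
    "value learning":  {"puppy_world": "puppies",  "rainbow_world": "rainbows"},
}

def expected_value(policy, nosy: bool) -> float:
    total = 0.0
    for hypothesis, value, product in [("puppy_world", 90, "puppies"),
                                       ("rainbow_world", 100, "rainbows")]:
        if nosy:
            # Nosy neighbor: the hypothesis rewards its product in every world.
            utility = sum(value for world in policy if policy[world] == product)
        else:
            # "Plays nice": only production in the hypothesis's own world counts.
            utility = value if policy[hypothesis] == product else 0
        total += 0.5 * utility   # 50-50 prior over the two hypotheses
    return total

for name, pol in policies.items():
    print(f"{name:15s}  plays-nice EV: {expected_value(pol, False):5.1f}"
          f"   nosy EV: {expected_value(pol, True):5.1f}")
```

The plays-nice column reproduces 45, 50, 0, 95, and the nosy column reproduces 90, 100, 95, 95, matching the lists above.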

The argument is similar for the case of 'learning the correct prior'. The problem is that if we start with a broad prior over possible priors, then there can be nosy-neighbor hypotheses which spoil the learning. These are hard to rule out, because it is hard to rule out simulations of other possible worlds.

Going back to the envelopes example, a nosy neighbor hypothesis would be "the left envelope contains $100, even in the world where the right envelope contains $100". Or if we have an AI that's unsure whether it values paperclips or staples, a nosy neighbor hypothesis would be "I value paperclips, even in the world where I value staples". I'm not sure how that makes sense. Can you give some scenario where a nosy neighbor hypothesis makes sense?

I think so, yes, but I want to note that my position is consistent with nosy-neighbor hypotheses not making sense. A big part of my point is that there's a lot of nonsense in a broad prior. I think it's hard to rule out the nonsense without learning. If someone thought nosy neighbors always 'make sense', it could be an argument against my whole position. (Because that person might be just fine with UDT, thinking that my nosy-neighbor 'problems' are just counterfactual muggings.)

Here's an argument that nosy neighbors can make sense.

For values, as I mentioned, a nosy-neighbors hypothesis is a value system which cares about what happens in many different universes, not just the 'actual' universe. For example, a utility function which assigns some value to statements of mathematics.

For probability, a nosy-neighbor is like the Lizard World hypothesis mentioned in the post: it's a world where what happens there depends a lot on what happens in other worlds. 

I think what you wrote about staples vs paperclips nosy-neighbors is basically right, but maybe if we rephrase it it can 'make more sense'?: "I (actual me) value paperclips being produced in the counterfactual(-from-my-perspective) world where I (counterfactual me) don't value paperclips."

Anyway, whether or not it makes intuitive sense, it's mathematically fine. The idea is that a world will contain facts that are a good lens into alternative worlds (such as facts of Peano Arithmetic), which utility hypotheses / probabilistic hypotheses can care about. So although a hypothesis is only mathematically defined as a function of worlds where it holds, it "sneakily" depends on stuff that goes on in other worlds as well.

Notice that learning-UDT implies UDT: an agent eventually behaves as if it were applying UDT with each Pn. Therefore, in particular, it eventually behaves like UDT with prior P0. So (with the exception of some early behavior which might not conform to UDT at all) this is basically UDT with a prior which allows for learning. The prior P0 is required to eventually agree with the recommendations of P1, P2, ... (which also implies that these eventually agree with each other).

I don't understand this argument.

"an agent eventually behaves as if it were applying UDT with each Pn" — why can't an agent skip over some Pn entirely or get stuck on P9 or whatever?

"Therefore, in particular, it eventually behaves like UDT with prior P0." even granting the above — sure, it will beahve like UDT with prior p0 at some point. But then after that it might have some other prior. Why would it stick with P0?

I probably need to clarify the statement of the assumption.

The idea isn't that it eventually takes at least one action that's in line with P_n for each n, but then, might do some other stuff. The idea is that for each n, there is a time t_n after which all decisions will be consistent with UDT-using-P_n.

UDT's recommendations will often coincide with more-updateful DTs. So the learning-UDT assumption is saying that UDT eventually behaves in an updateful way with respect to each observation, although not necessarily right away upon receiving that observation.

But if UDT starts with a broad prior, it will probably not learn, because it will have some weird stuff in its prior which causes it to obey random imperatives from imaginary Lizards.

I don’t think this necessarily follows? For there to be a systematic impact on UDT’s behavior there would need to be more Lizard-Worlds that reward X than Anti-Lizard-Worlds that penalize X, so this is only a concern if there is reason to believe that there are “more” worlds (in an abstract logical-probability sense) that favor a specific direction.

Clearly this could still potentially cause problems, but (at least to me) it doesn’t seem like the problem is as ubiquitous as the essay makes it out to be.

You're right, I was overstating there. I don't think it's probable that everything cancels out, but a more realistic statement might be something like "if UDT starts with a broad prior which wasn't designed to address this concern, there will probably be many situations where its actions are more influenced by alternative possibilities (delusional, from our perspective) than by what it knows about the branch that it is in".