Attention conservation notice: Most of this post is a recap of the standard arguments for UDT, but the way in which the standard proof for value of information being nonnegative breaks down in some decision theory scenarios probably isn't common knowledge yet.
The standard proof that the value of information (VOI) is always nonnegative is very simple and goes something like this. Let x∈X be the true underlying state from the set of all underlying states, let a∈A be an action selected from a space of possible actions, let z∈Z be an observation from the set of all observations, and let U(x,a) be a utility function that maps an underlying state and an action to [0,1]. SI is an abbreviation for sample information.
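$$E(U)=\max_{a\in A}\sum_{x\in X}U(x,a)P(x)=\max_{a\in A}\sum_{z\in Z}\left(\sum_{x\in X}U(x,a)P(x\mid z)\right)P(z)$$

$$E(U\mid SI)=\sum_{z\in Z}E(U\mid z)P(z)=\sum_{z\in Z}\max_{a\in A}\left(\sum_{x\in X}U(x,a)P(x\mid z)\right)P(z)$$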
Let a′ be the action selected by maximizing utility without looking at the sample information. Comparing the right-hand sides of the two lines, we can see that for any given piece of information z, you can either copy a′ (in which case the utility acquired will be the same) or select some other action that does better given z (in which case you'll get more utility).
This is just the simple argument that, no matter what information you see, you can always just act as if you hadn't seen it to do as well as the ignorant version of you, and maybe you can do better.
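The argument above can be checked numerically. The following is a minimal sketch, assuming finite X, Z, and A and a joint distribution stored as a table; the function names and the random test problems are illustrative assumptions, not part of the original argument.

```python
import random

def value_of_information(joint, utility):
    """joint[z][x] = P(x, z); utility[x][a] = U(x, a)."""
    n_x = len(utility)
    n_a = len(utility[0])
    # Prior over states: P(x) = sum_z P(x, z).
    prior = [sum(row[x] for row in joint) for x in range(n_x)]
    # E(U): commit to a single action before seeing z.
    eu_blind = max(
        sum(utility[x][a] * prior[x] for x in range(n_x)) for a in range(n_a)
    )
    # E(U|SI): pick the best action separately for each z.
    # Note sum_x U(x,a) P(x,z) equals (sum_x U(x,a) P(x|z)) P(z).
    eu_informed = sum(
        max(sum(utility[x][a] * row[x] for x in range(n_x)) for a in range(n_a))
        for row in joint
    )
    return eu_blind, eu_informed

def random_problem(rng, n_x=3, n_z=4, n_a=3):
    """Hypothetical helper: a random joint distribution and utility table."""
    weights = [[rng.random() for _ in range(n_x)] for _ in range(n_z)]
    total = sum(sum(row) for row in weights)
    joint = [[w / total for w in row] for row in weights]
    utility = [[rng.random() for _ in range(n_a)] for _ in range(n_x)]
    return joint, utility

rng = random.Random(0)
results = [value_of_information(*random_problem(rng)) for _ in range(1000)]
# Small tolerance for floating-point rounding.
voi_never_negative = all(informed >= blind - 1e-12 for blind, informed in results)
print(voi_never_negative)
```

The informed value is a sum of per-observation maxima and the blind value is the maximum of a sum, so the inequality holds term by term whenever the distribution over z is the same in both expressions.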
Swapping out P(x) and P(x|z) for P(x|a) and P(x|z,a) respectively generalizes this proof to cover situations where the underlying state is correlated with your choice of action. This alteration gets Newcomb's problem right.
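Written out, the substitution gives the two lines below. This is a sketch under one extra assumption that is made explicit here: the distribution over observations P(z) does not depend on the action a, which is what lets the first line be split over z.

$$E(U)=\max_{a\in A}\sum_{x\in X}U(x,a)P(x\mid a)=\max_{a\in A}\sum_{z\in Z}\left(\sum_{x\in X}U(x,a)P(x\mid z,a)\right)P(z)$$

$$E(U\mid SI)=\sum_{z\in Z}\max_{a\in A}\left(\sum_{x\in X}U(x,a)P(x\mid z,a)\right)P(z)$$

As before, the second line takes the maximum inside the sum over z, so it is at least as large as the first.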
However, the proof breaks down when the probability distribution over which information you see is correlated with your choice of future action. Example problems with this property are XOR Blackmail, Parfit's Hitchhiker, Transparent Newcomb, and any fictional setting with stable time loops. All of these problems have the property that selecting the best action conditional on the information you see does make you better off, but this decreases the probability of you getting into a favorable situation in the first place (according to past-you).
To formalize this, we need a way to let the probability distribution over the information vary depending on the probability distribution over future actions. Specifically, we will assume a continuous function f:ΔA→ΔZ which maps a probability distribution over future actions to a probability distribution over seeing the information in the first place. The policy of the agent (how it reacts to observations) is a Markov kernel of type Z→A, so all policies define a continuous function ΔZ→ΔA. Composing these two continuous functions gives a continuous function from ΔZ to itself, so by the Tychonoff fixed-point theorem (the infinite-dimensional version of Brouwer's fixed-point theorem), all policies induce at least one probability distribution on Z that is a fixed point.
Let Z_arg be the probability distribution induced by the policy that just does argmax after seeing the observation, while Z_a is the probability distribution induced by the policy that just takes some fixed action a. Then the final lines from the proof turn into:

$$\max_{a\in A}\sum_{z\in Z_a}\left(\sum_{x\in X}U(x,a)P(x\mid a,z)\right)P(z)$$

$$\sum_{z\in Z_{\arg}}\max_{a\in A}\left(\sum_{x\in X}U(x,a)P(x\mid a,z)\right)P(z)$$
and suddenly we have that E(U|z) increased or stayed the same for all z (by the same argument as before), while at the same time E(U) may be greater than E(U|SI) because the probability distribution over observations is different. Parfit's Hitchhiker is a good example, where z corresponds to your observation of whether you were taken into town. Selecting the best action conditional on your observations makes you at least as well off in every situation, but because it affects the probability of the observations in the first place, it actually lowers expected utility.
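A toy version of Parfit's Hitchhiker makes the gap concrete. This is a minimal sketch under assumptions that are not in the original: the payoff numbers are made up, the driver is modeled as a perfect predictor (so f sets P(town) equal to the probability of paying), ties are broken toward refusing, and the fixed point is found by plain iteration, which happens to converge here but is not guaranteed to in general.

```python
# Hypothetical payoffs in [0, 1]; the numbers are illustrative assumptions.
UTILITY = {
    ("town", "pay"): 0.9,
    ("town", "refuse"): 1.0,
    ("desert", "pay"): 0.0,
    ("desert", "refuse"): 0.0,
}
OBSERVATIONS = ("town", "desert")
# "refuse" is listed first so that ties in the desert break toward refusing.
ACTIONS = ("refuse", "pay")

def f(p_pay):
    """Assumed perfect predictor: P(town) equals the probability of paying."""
    return {"town": p_pay, "desert": 1.0 - p_pay}

def argmax_policy(z):
    """Best action given the observation alone."""
    return max(ACTIONS, key=lambda a: UTILITY[(z, a)])

def always_pay(z):
    """Fixed-action policy that ignores the observation."""
    return "pay"

def fixed_point(policy, p_town=0.5, steps=50):
    """Iterate the composite map Delta Z -> Delta A -> Delta Z."""
    for _ in range(steps):
        dist_z = {"town": p_town, "desert": 1.0 - p_town}
        p_pay = sum(p for z, p in dist_z.items() if policy(z) == "pay")
        p_town = f(p_pay)["town"]
    return {"town": p_town, "desert": 1.0 - p_town}

def expected_utility(policy):
    """Expected utility under the observation distribution the policy induces."""
    dist_z = fixed_point(policy)
    return sum(p * UTILITY[(z, policy(z))] for z, p in dist_z.items())

print(expected_utility(always_pay), expected_utility(argmax_policy))
```

In this sketch the argmax policy does at least as well as always paying for each individual observation (1.0 versus 0.9 in town, 0.0 versus 0.0 in the desert), yet its induced distribution puts all the mass on the desert, so its overall expected utility is lower.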