Let ν:ΠH×O<H→ΔO be a Newcombian problem of horizon H∈Nthat satisfies pseudocausality. Let Mν=(S,Θ0,A,O,T,B) denote the associated supra-POMDP with infinite time horizon and time discount γ∈[0,1). Then
limγ→1|minπ∈ΠEMπν[Lγ]−minπ∈ΠEνπ[Lγ]|=0.
Furthermore, if {πγ}γ∈[0,1)is a family of policies such that limγ→1EMπγν[Lγ]−minπ∈ΠEMπν[Lγ]=0, then
limγ→1Eνπγ[Lγ]−minπ∈ΠEνπ[Lγ]=0.
Proof: Let λ denote the empty history. Given a supracontribution Θ, let max(Θ) denote the set of maximal extreme points of Θ. First we remark that for any supra-POMDP,without loss of generality, a set of copolicies Ξ can always be replaced by
{τ∈Ξ|τ(λ)∈max(Θ0),∀h≠λ∈(S×A)∗,τ(h)∈max(T(s,a))}.
Given an episode policy π∈ΠH, let τπ denote the episode copolicy that initializes the state to (π,λ), i.e. τπ(λ)=δπ×λ. Let τππ∈Δ(A×O)H denote the distribution over outcomes determined by the interaction of π and τπ. Note that the expected loss with respect to τππ is equal to the expected loss for the Newcombian problem, i.e.
Eτππ[L]=Eνπ[L].
Recall that throughout this sequence, we assume that O is finite. By the remark at the beginning of the proof, the expected loss in one episode for the corresponding supra-POMDP can be written as a maximum expected loss over a finite set of Mν-copolicies τ. Namely,
EMπν[L]=maxτEτπ[L].
Then
maxτEτπ[L]≥Eτππ[L]=Eνπ[L],
and thus for any episode policy π∈ΠH,
EMπν[L]≥Eνπ[L].
We now extend this analysis to the optimal loss over n episodes for n∈N.[1] Let L∗:=minπ∈ΠHEνπ[L] denote the episode optimal loss for ν. Let πn∈Π be an arbitrary policy for n episodes of Mν. Then as before,
EMπnν[Lγn]=maxτnEτnπn[Lγn]
where the maximum is over a finite set of n-episode copolicies τn. By the single episode case,
maxτnEτnπn[Lγn]≥(1−γ)n−1∑k=0γkL∗=L∗,
and thus minπn∈ΠEMπnν[Lγn]≥L∗.
It remains to show that the opposite inequality holds in the many-episode and γ→1 limit.
Recall that given π∈ΠH, we define
Cπ:={h∈(A×O)H|∀ prefixes h′a∈(A×O)∗×A of h,a=π(h′)}.
Recall that since ν satisfies pseudocausality, there exists a ν-optimal policy π∗∈ΠH such that for all π∈ΠH, if supp(νπ)⊆Cπ∗, then π is also optimal for ν. Consequently, for any episode copolicy τ, either Eτπ∗[L]≤L∗ or Eτπ∗[1]<1. To see this, suppose there exists an episode copolicy τ such that Eτπ∗[1]=1. Then there exists a policy π† such that supp(νπ†)⊆Cπ∗ and Eτπ∗[L]=Eνπ†[L]. By pseudocausality, Eνπ†[L]=L∗. Thus Eτπ∗[L]≤L∗.
Define
α:=maxτ:Eτπ∗[L]>L∗Eτπ∗[1].
By the remark at the beginning of the proof, the relevant set of copolicies in the definition of α is finite, and thus α is well-defined. If Eτπ∗[L]>L∗ then Eτπ∗[1]<1. Thusα∈[0,1).
Consider the iterated Newcombian problem over n episodes. Let π∗n denote the multi-episode policy such that π∗n restricted to every episode is π∗. Let τn denote an arbitrary copolicy that interacts with π∗n. Furthermore, let m≤n denote the number of episodes for which the episode-restriction τ of τn interacting with π∗n satisfies Eτπ∗[L]>L∗.[2]
This proof section accompanies Formalizing Newcombian problems with fuzzy infra-Bayesianism. We prove the following result.
Proof: Let λ denote the empty history. Given a supracontribution Θ, let max(Θ) denote the set of maximal extreme points of Θ. First we remark that for any supra-POMDP, without loss of generality, a set of copolicies Ξ can always be replaced by
{τ∈Ξ | τ(λ)∈max(Θ0),∀h≠λ∈(S×A)∗,τ(h)∈max(T(s,a))}.Given an episode policy π∈ΠH, let τπ denote the episode copolicy that initializes the state to (π,λ), i.e. τπ(λ)=δπ×λ. Let τππ∈Δ(A×O)H denote the distribution over outcomes determined by the interaction of π and τπ. Note that the expected loss with respect to τππ is equal to the expected loss for the Newcombian problem, i.e.
Eτππ[L]=Eνπ[L].Recall that throughout this sequence, we assume that O is finite. By the remark at the beginning of the proof, the expected loss in one episode for the corresponding supra-POMDP can be written as a maximum expected loss over a finite set of Mν-copolicies τ. Namely,
EMπν[L]=maxτEτπ[L].Then
maxτEτπ[L]≥Eτππ[L]=Eνπ[L],and thus for any episode policy π∈ΠH,
EMπν[L]≥Eνπ[L].We now extend this analysis to the optimal loss over n episodes for n∈N.[1] Let L∗:=minπ∈ΠHEνπ[L] denote the episode optimal loss for ν. Let πn∈Π be an arbitrary policy for n episodes of Mν. Then as before,
EMπnν[Lγn]=maxτnEτnπn[Lγn]where the maximum is over a finite set of n-episode copolicies τn. By the single episode case,
maxτnEτnπn[Lγn]≥(1−γ)n−1∑k=0γkL∗=L∗,and thus minπn∈ΠEMπnν[Lγn]≥L∗.
It remains to show that the opposite inequality holds in the many-episode and γ→1 limit.
Recall that given π∈ΠH, we define
Cπ:={h∈(A×O)H | ∀ prefixes h′a∈(A×O)∗×A of h,a=π(h′)}.Recall that since ν satisfies pseudocausality, there exists a ν-optimal policy π∗∈ΠH such that for all π∈ΠH, if supp(νπ)⊆Cπ∗, then π is also optimal for ν. Consequently, for any episode copolicy τ, either Eτπ∗[L]≤L∗ or Eτπ∗[1]<1. To see this, suppose there exists an episode copolicy τ such that Eτπ∗[1]=1. Then there exists a policy π† such that supp(νπ†)⊆Cπ∗ and Eτπ∗[L]=Eνπ†[L]. By pseudocausality, Eνπ†[L]=L∗. Thus Eτπ∗[L]≤L∗.
Define
α:=maxτ:Eτπ∗[L]>L∗Eτπ∗[1].By the remark at the beginning of the proof, the relevant set of copolicies in the definition of α is finite, and thus α is well-defined. If Eτπ∗[L]>L∗ then Eτπ∗[1]<1. Thus α∈[0,1).
Consider the iterated Newcombian problem over n episodes. Let π∗n denote the multi-episode policy such that π∗n restricted to every episode is π∗. Let τn denote an arbitrary copolicy that interacts with π∗n. Furthermore, let m≤n denote the number of episodes for which the episode-restriction τ of τn interacting with π∗n satisfies Eτπ∗[L]>L∗.[2]
We have
Eτπ∗nn[Lγn]≤Eτπ∗nn[1]≤αm.Furthermore,
Eτπ∗nn[Lγn]≤(1−γ)(m−1∑t=0γt⋅1+n−1∑t=mγt⋅L∗)=(1−γ)(1−γm1−γ+L∗γm⋅1−γn−m1−γ)=1−γm+L∗(γm−γn).We leave it to the reader to verify that
limγ→1limn→∞Eτπ∗nn[Lγn]≤limγ→1limn→∞maxm≤nmin(αm,1−γm+L∗(γm−γn))=L∗.□Recall that if n∈N and h∈(A×O)nH is given by h=a0o0…anH−1onH−1, then the loss over n episodes with geometric time discount γ∈[0,1) is defined by
Lγn(h)=(1−γ)n−1∑k=0γkL(akHokH⋯akH+H−1okH+H−1).A copolicy can depend on the past, meaning it can depend on the policy. Thus m can depend on π∗n.