Updateless decision theory was informally defined by Wei Dai in terms of logical conditional expected utility, where the condition corresponds to an algorithm (the agent) producing a given output (action or policy). This kind of conditional expected value can be formalised by optimal predictors. However, since the optimal predictor systems required to apply optimal predictors to decision theory generally have random advice, we need counterfactuals that are well-defined for random algorithms, i.e. algorithms that produce different outputs with different probabilities depending on internal coin tosses. We propose to define these counterfactuals by a generalization of the notion of conditional expected utility which amounts to linear regression of utility with respect to the probabilities of different outputs in the space of "impossible possible worlds." We formalise this idea by introducing "relative optimal predictors," and prove the analogue of the conditional probability formula (which takes matrix form) as well as uniqueness theorems.
Motivation
We start by explaining the analogous construction in classical probability theory and proceed to defining the logical counterpart in the Results section.
Consider a probability measure $\zeta$ on some space, a random variable $u$ representing utility, a finite set $A$ representing possible actions and another random variable $p$ taking values in $[0,1]^A$ and satisfying $\sum_{a \in A} p_a = 1$, representing the probabilities of taking different actions. For a deterministic algorithm, $p$ takes values in $\{0,1\}^A$, allowing us to define conditional expected utility as

$$u_a := \operatorname{E}_\zeta[u \mid p_a = 1] = \frac{\operatorname{E}_\zeta[u p_a]}{\operatorname{E}_\zeta[p_a]}$$
In the general case, it is tempting to consider
$$\operatorname{E}_{\zeta \ltimes p}[u \mid a] = \frac{\operatorname{E}_\zeta[u p_a]}{\operatorname{E}_\zeta[p_a]}$$
where $\zeta \ltimes p$ stands for the semidirect product of $\zeta$ with $p$, the latter regarded as a Markov kernel with target $A$. However, this would lead to behavior similar to EDT, since conditioning on $a$ is meaningful even for a single "world" (i.e. completely deterministic $u$ and $p$). Instead, we select $u^* \in \mathbb{R}^A$ that minimizes $\operatorname{E}_\zeta[(u - p^t u^*)^2]$ (we regard elements of $\mathbb{R}^A$ as column vectors, so $p^t$ is a row vector). This means $u^*$ has to satisfy the matrix equation

$$\operatorname{E}_\zeta[p p^t]\, u^* = \operatorname{E}_\zeta[u p]$$

The solution to this equation is unique only when $\operatorname{E}_\zeta[p p^t]$ is non-degenerate. This corresponds to requiring positive probability of the condition for usual conditional expected values. In case $p$ takes values in $\{0,1\}^A$, $u^*$ is the usual conditional expected value.
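To make the construction concrete, here is a minimal numerical sketch in Python (the language, names and parameters are ours, purely for illustration): we solve the normal equations $\operatorname{E}_\zeta[p p^t]\, u^* = \operatorname{E}_\zeta[u p]$ on a toy finite probability space and verify that, when $p$ is deterministic, $u^*$ coincides with ordinary conditional expected utility.

```python
# A minimal sketch, assuming a toy finite probability space with uniform zeta.
import numpy as np

rng = np.random.default_rng(0)
n_worlds, n_actions = 1000, 3
zeta = np.full(n_worlds, 1.0 / n_worlds)           # probability measure zeta
u = rng.normal(size=n_worlds)                      # utility in each world
p = rng.dirichlet(np.ones(n_actions), n_worlds)    # action probabilities; rows sum to 1

# Solve E_zeta[p p^t] u* = E_zeta[u p], the normal equations of the regression.
E_ppt = (p.T * zeta) @ p                 # |A| x |A| matrix E_zeta[p p^t]
E_up = p.T @ (zeta * u)                  # vector E_zeta[u p] in R^A
u_star = np.linalg.solve(E_ppt, E_up)

# Deterministic case: p one-hot in every world, so u* should recover the
# ordinary conditional expected utility E_zeta[u | p_a = 1] for each action
# (assuming every action occurs in at least one world).
p_det = np.eye(n_actions)[rng.integers(n_actions, size=n_worlds)]
u_star_det = np.linalg.solve((p_det.T * zeta) @ p_det, p_det.T @ (zeta * u))
for a in range(n_actions):
    assert np.isclose(u_star_det[a], u[p_det[:, a] == 1].mean())
```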
Preliminaries
Notation
We will suppress the $k,j$ indices associated with word ensembles and bischemes whenever possible, to reduce clutter. When we do show $k,j$ explicitly, evaluation of functions $\delta: \mathbb{N}^2 \to \mathbb{R}$ will be written as $\delta^{kj}$ rather than $\delta(k,j)$ as before.
The concept of "proto-error space" is renamed to "error space" whereas the previous definition of "error space" is now called "radical error space".
We slightly extend the definition of "distributional estimation problem" as follows.
Definition 1
A distributional estimation problem is a pair $(\mu, f)$ where $\mu$ is a word ensemble and $f: \operatorname{supp} \mu \to \mathbb{R}$ is bounded.
The corresponding notion of optimal predictor is the following.
Definition 2
Fix $\mathcal{E}$ an error space of rank 2 and $(\mu, f)$ a distributional estimation problem. Consider $\hat{P}$ a $\mathbb{Q}$-valued $(poly, rlog)$-bischeme taking bounded values. $\hat{P}$ is called an $\mathcal{E}(poly, rlog)$-optimal predictor for $(\mu, f)$ when for any $\mathbb{Q}$-valued $(poly, rlog)$-bischeme $\hat{Q}$, there is $\delta \in \mathcal{E}$ s.t.

$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(\hat{P} - f)^2] \le \operatorname{E}_{\mu \times \hat{\sigma}_Q}[(\hat{Q} - f)^2] + \delta$$
The theory of optimal predictors remains intact with these new definitions (although some care is needed if we wish to consider reflective systems with range $\mathbb{R}$ instead of $[0,1]$).
The following concept of "orthogonal predictor" is often more convenient than "optimal predictor", since we no longer require our error spaces to be radical.
Definition 3
Fix $\mathcal{E}$ an error space of rank 2 and $(\mu, f)$ a distributional estimation problem. Consider $\hat{P}$ a $\mathbb{Q}$-valued $(poly, rlog)$-bischeme taking bounded values.

$\hat{P}$ is called an $\mathcal{E}(poly, rlog)$-orthogonal predictor for $(\mu, f)$ when for any $\mathbb{Q}$-valued $(poly, rlog)$-bischeme $\hat{Q}$ s.t. $\hat{\sigma}_Q$ factors as $\hat{\sigma}_P \times \tau$, there is $\delta \in \mathcal{E}$ s.t.

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[Q(x,y,z)\,(P(x,y) - f(x))] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_Q}[\hat{Q}^2] + 1)\,\delta$$
The two concepts are tightly related due to the following
Theorem 1
Fix $\mathcal{E}$ an error space of rank 2 and $(\mu, f)$ a distributional estimation problem. Then any $\mathcal{E}(poly, rlog)$-orthogonal predictor for $(\mu, f)$ is an $\mathcal{E}(poly, rlog)$-optimal predictor for $(\mu, f)$. Conversely, any $\mathcal{E}(poly, rlog)$-optimal predictor for $(\mu, f)$ is an $\mathcal{E}^{\frac{1}{2}}(poly, rlog)$-orthogonal predictor for $(\mu, f)$.
We skip the proof since it is almost identical to that of the orthogonality lemma.
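As a concrete sanity check on the orthogonality condition, here is a toy computation in ordinary probability (not the bischeme setting; all names are illustrative): the empirical conditional expectation is the $L^2$-optimal predictor, and its residual is accordingly orthogonal to every test function of the observable.

```python
# A sketch of the optimality <-> orthogonality correspondence, assuming a
# toy setting where P(x) = E[f | x] can be computed by empirical averaging.
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(0, 5, size=100_000)     # the observable "word"
f = x + rng.normal(size=x.size)          # the target, noisy within each fiber

# The L2-optimal predictor of f given x is the conditional expectation.
P = np.array([f[x == v].mean() for v in range(5)])[x]

# Orthogonality: E[Q(x) (P(x) - f)] vanishes for any test function Q of x,
# since the residual averages to zero within each fiber {x = v}.
for Q in (np.sin(x), x ** 2, rng.normal(size=5)[x]):
    assert abs(np.mean(Q * (P - f))) < 1e-9
```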
Results
The proofs of the results are found in the Appendix.
Definition 4
A relative estimation problem is a quadruple $(A, \mu, f, g)$ where $A$ is a finite set, $\mu$ is a word ensemble, $f: \operatorname{supp} \mu \to \mathbb{R}$ is bounded and $g: \operatorname{supp} \mu \to \mathbb{R}^A$ is bounded.
Definition 5
Fix $\mathcal{E}$ an error space of rank 2. Consider $(A, \mu, f, g)$ a relative estimation problem and $\hat{P}$ a $\mathbb{Q}^A$-valued $(poly, rlog)$-bischeme. $\hat{P}$ is called an $\mathcal{E}(poly, rlog)$-optimal predictor for $(A, \mu, f, g)$ when for any $\mathbb{Q}^A$-valued $(poly, rlog)$-bischeme $\hat{Q}$ s.t. $\hat{\sigma}_Q$ factors as $\hat{\sigma}_P \times \tau$, there is $\delta \in \mathcal{E}$ s.t.

$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f)^2] \le \operatorname{E}_{\mu \times \hat{\sigma}_Q}[(g^t \hat{Q} - f)^2] + (\operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[(g^t(\hat{P} - \hat{Q}))^2] + 1)\,\delta$$
The appearance of the $\operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[(g^t(\hat{P} - \hat{Q}))^2]$ term might seem to make the condition substantially weaker. However, the condition is still sufficiently strong to produce the uniqueness theorem, which appears as Theorem 3 below. Also, the novelty disappears in the corresponding orthogonality condition.
Definition 6
Fix $\mathcal{E}$ an error space of rank 2. Consider $(A, \mu, f, g)$ a relative estimation problem and $\hat{P}$ a $\mathbb{Q}^A$-valued $(poly, rlog)$-bischeme. $\hat{P}$ is called an $\mathcal{E}(poly, rlog)$-orthogonal predictor for $(A, \mu, f, g)$ when for any $\mathbb{Q}^A$-valued $(poly, rlog)$-bischeme $\hat{Q}$ s.t. $\hat{\sigma}_Q$ factors as $\hat{\sigma}_P \times \tau$, there is $\delta \in \mathcal{E}$ s.t.

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t Q(x,y,z)\,(g^t P(x,y) - f(x))] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_Q}[(g^t Q)^2] + 1)\,\delta$$
Theorem 2
Fix $\mathcal{E}$ an error space of rank 2 and $(A, \mu, f, g)$ a relative estimation problem. Then any $\mathcal{E}(poly, rlog)$-orthogonal predictor for $(A, \mu, f, g)$ is an $\mathcal{E}(poly, rlog)$-optimal predictor for $(A, \mu, f, g)$. Conversely, any $\mathcal{E}(poly, rlog)$-optimal predictor for $(A, \mu, f, g)$ is an $\mathcal{E}^{\frac{1}{2}}(poly, rlog)$-orthogonal predictor for $(A, \mu, f, g)$.
Definition 7
Consider $\mathcal{E}$ an error space of rank $n$. A set $I \subseteq \mathbb{N}^n$ is called $\mathcal{E}$-negligible when $\chi_I \in \mathcal{E}$.
Note 1
The $\mathcal{E}$-negligible sets form a set-theoretic ideal on $\mathbb{N}^n$.

All finite sets are $\mathcal{E}$-negligible (thanks to the condition $2^{-h} \in \mathcal{E}$).
Definition 8
Fix $\mathcal{E}$ an error space of rank 2. Consider $A$ a finite set, $\mu$ a word ensemble, $g: \operatorname{supp} \mu \to \mathbb{R}^A$ and $\hat{Q}_1, \hat{Q}_2$ $\mathbb{Q}^A$-valued $(poly, rlog)$-bischemes. We say $\hat{Q}_1$ is $\mathcal{E}$-similar to $\hat{Q}_2$ relative to $g$ (denoted $\hat{Q}_1 \overset{g}{\simeq}_{\mathcal{E}} \hat{Q}_2$) when there is an $\mathcal{E}$-negligible set $I$ and $\delta \in \mathcal{E}$ s.t.

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: \operatorname{E}_{\mu^k \times \hat{\sigma}_1 \times \hat{\sigma}_2}[(g^t(\hat{Q}_1^{kj} - \hat{Q}_2^{kj}))^2] = \delta^{kj}$$
Note 2
It's impossible to remove $I$ from the definition, since $\operatorname{E}_{\mu^k \times \hat{\sigma}_1 \times \hat{\sigma}_2}[(g^t(\hat{Q}_1^{kj} - \hat{Q}_2^{kj}))^2]$ doesn't have to be bounded.
Theorem 3
Fix $\mathcal{E}$ an error space of rank 2. Consider $(A, \mu, f, g)$ a relative estimation problem and $\hat{P}_1, \hat{P}_2$ $\mathcal{E}(poly, rlog)$-orthogonal predictors for $(A, \mu, f, g)$. Then $\hat{P}_1 \overset{g}{\simeq}_{\mathcal{E}} \hat{P}_2$.
Theorem 4 below is the orthogonal predictor analogue of the matrix equation from the previous section.
Proposition 1
Consider $\mathcal{E}$ an error space of rank $n$, $\epsilon > 0$ and $\gamma: \mathbb{N}^n \to [\epsilon, \infty)$. Then $\gamma\mathcal{E} := \{\gamma\delta \mid \delta \in \mathcal{E}\}$ is an error space.
Theorem 4
Fix $\mathcal{E}$ an error space of rank 2. Consider $(A, \mu, f, g)$ a relative estimation problem, $\hat{P}_{gg}$ an $\operatorname{End}(\mathbb{Q}^A)$-valued $(poly, rlog)$-bischeme, $\hat{P}_{fg}$ a $\mathbb{Q}^A$-valued bischeme, $\epsilon > 0$ and $\gamma: \mathbb{N}^2 \to [\epsilon, \infty)$. Assume that for any $x \in \operatorname{supp} \mu^k$, $\hat{P}_{gg}^{kj}(x)$ is always invertible and the operator norm of $\hat{P}_{gg}^{kj}(x)^{-1}$ is at most $\gamma^{kj}$. Assume further that $\hat{P}_{gg}$ is an $\mathcal{E}(poly, rlog)$-orthogonal predictor for $gg^t$ (componentwise) and $\hat{P}_{fg}$ is a $\gamma^2\mathcal{E}(poly, rlog)$-orthogonal predictor for $fg$ (componentwise). Define the $\mathbb{Q}^A$-valued bischeme $\hat{P}_{f \mid g}$ by $\hat{P}_{f \mid g} := \hat{P}_{gg}^{-1} \hat{P}_{fg}$. Then $\hat{P}_{f \mid g}$ is a $\gamma^2\mathcal{E}(poly, rlog)$-orthogonal predictor for $(A, \mu, f, g)$.
Note 3
It is always possible to choose $\hat{P}_{gg}$ to be symmetric, in which case the operator norm of $\hat{P}_{gg}^{-1}$ is the inverse of the lowest absolute value of an eigenvalue of $\hat{P}_{gg}$.
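Note 3 is easy to verify numerically; the following sketch (numpy only, all names ours) checks that for a symmetric invertible matrix the operator norm of the inverse equals the inverse of the smallest absolute eigenvalue.

```python
# A quick numerical check of Note 3 for a random symmetric matrix.
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
M = B + B.T + 5 * np.eye(4)     # symmetric and (generically) invertible

op_norm_of_inverse = np.linalg.norm(np.linalg.inv(M), ord=2)   # spectral norm
min_abs_eigenvalue = np.abs(np.linalg.eigvalsh(M)).min()
assert np.isclose(op_norm_of_inverse, 1.0 / min_abs_eigenvalue)
```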
Finally, we extend the stability result for usual conditional probabilities to the matrix setting.
Definition 9
Fix $\mathcal{E}$ an error space of rank 2. Consider $A$ a finite set, $\mu$ a word ensemble and $\hat{Q}_1, \hat{Q}_2$ $\mathbb{Q}^A$-valued $(poly, rlog)$-bischemes. We say $\hat{Q}_1$ is $\mathcal{E}$-similar to $\hat{Q}_2$ relative to $\mu$ (denoted $\hat{Q}_1 \overset{\mu}{\simeq}_{\mathcal{E}} \hat{Q}_2$) when there is an $\mathcal{E}$-negligible set $I$ and $\delta \in \mathcal{E}$ s.t.

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: \operatorname{E}_{\mu^k \times \hat{\sigma}_1 \times \hat{\sigma}_2}[\lVert \hat{Q}_1^{kj} - \hat{Q}_2^{kj} \rVert^2] = \delta^{kj}$$
Theorem 5
Fix $\mathcal{E}$ an error space of rank 2. Consider $\epsilon > 0$ and $\gamma: \mathbb{N}^2 \to [\epsilon, \infty)$, $(A, \mu, f, g)$ a relative estimation problem, $\hat{P}_{gg}$ a positive-definite $\mathcal{E}(poly, rlog)$-orthogonal predictor for $gg^t$ and $\hat{P}_1, \hat{P}_2$ $\gamma^4\mathcal{E}(poly, rlog)$-orthogonal predictors for $(A, \mu, f, g)$. Assume that for any $x \in \operatorname{supp} \mu^k$, the lowest eigenvalue of $\hat{P}_{gg}^{kj}(x)$ is always at least $(\gamma^{kj})^{-1}$ and $\lVert \hat{P}_{1,2}^{kj}(x) \rVert \le \gamma^{kj}$. Then $\hat{P}_1 \overset{\mu}{\simeq}_{\gamma^5\mathcal{E}} \hat{P}_2$.
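The role of the eigenvalue bound in Theorem 5 can be seen in the following simplified numerical sketch (a fixed positive-definite Gram matrix $G$ stands in for $\hat{P}_{gg}$, and a vector $v$ for the difference $\hat{P}_1 - \hat{P}_2$; these simplifications are ours): a smallest eigenvalue of at least $\gamma^{-1}$ lets one trade closeness in the $G$-weighted metric for closeness in norm at the cost of a factor $\gamma$.

```python
# If the smallest eigenvalue of G is 1/gamma, then ||v||^2 <= gamma * v^t G v.
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(3, 3))
G = B @ B.T + 0.2 * np.eye(3)                  # positive-definite Gram matrix
gamma = 1.0 / np.linalg.eigvalsh(G).min()      # so min eigenvalue is 1/gamma

for _ in range(1000):
    v = rng.normal(size=3)                     # stands in for P1 - P2
    assert v @ v <= gamma * (v @ G @ v) + 1e-9
```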
The Corollary below shows that in simple situations these counterfactuals produce the values we expect.
Definition 10
Fix $\mathcal{E}$ an error space of rank 2 and $(\mu, f)$ a distributional estimation problem. Consider $\hat{P}$ a $\mathbb{Q}$-valued $(poly, rlog)$-bischeme taking bounded values. $\hat{P}$ is called an $\mathcal{E}(poly, rlog)$-perfect predictor for $(\mu, f)$ when $\operatorname{E}_{\mu \times \hat{\sigma}_P}[(\hat{P} - f)^2] \in \mathcal{E}$.
Corollary
Fix $\mathcal{E}$ an error space of rank 2. Consider $\epsilon > 0$ and $\gamma: \mathbb{N}^2 \to [\epsilon, \infty)$, $(A, \mu, f^*, g)$ a relative estimation problem and $f: \operatorname{supp} \mu \to \mathbb{R}^A$ bounded s.t. $f^* = g^t f$. Suppose $\hat{P}_f$ is a $\gamma^8\mathcal{E}^2(poly, rlog)$-perfect predictor for $(\mu, f)$. Suppose $\hat{P}_{gg}$ is a positive-definite $\mathcal{E}(poly, rlog)$-orthogonal predictor for $gg^t$ s.t. for any $x \in \operatorname{supp} \mu^k$, the lowest eigenvalue of $\hat{P}_{gg}^{kj}(x)$ is always at least $(\gamma^{kj})^{-1}$. Suppose $\hat{P}_{fg}$ is a $\gamma^2\mathcal{E}(poly, rlog)$-orthogonal predictor for $fg$. Then $\hat{P}_{gg}^{-1}\hat{P}_{fg} \overset{\mu}{\simeq}_{\gamma^5\mathcal{E}} \hat{P}_f$.
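The classical-probability shadow of the Corollary is easy to exhibit numerically (a sketch under our own toy assumptions, continuing the Motivation section): when utility is exactly linear in the action probabilities, $u = p^t u_0$, solving $\operatorname{E}[p p^t]\, u^* = \operatorname{E}[u p]$ recovers $u_0$ exactly, provided the Gram matrix is non-degenerate.

```python
# When u = p^t u0 holds exactly, the regression returns u* = u0.
import numpy as np

rng = np.random.default_rng(4)
n_worlds, n_actions = 500, 3
p = rng.dirichlet(np.ones(n_actions), n_worlds)   # action probabilities
u0 = np.array([1.0, -2.0, 0.5])                   # "true" per-action utilities
u = p @ u0                                        # utility is linear in p

E_ppt = p.T @ p / n_worlds                        # E[p p^t] (uniform measure)
E_up = p.T @ u / n_worlds                         # E[u p]
assert np.allclose(np.linalg.solve(E_ppt, E_up), u0)
```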
Appendix
Proof of Theorem 2
Consider $\hat{P}$ an $\mathcal{E}(poly, rlog)$-optimal predictor for $(A, \mu, f, g)$. Consider $\hat{R}$ a $\mathbb{Q}^A$-valued $(poly, rlog)$-bischeme with $\hat{\sigma}_R = \hat{\sigma}_P \times \tau$. For any $\zeta: \mathbb{N}^2 \to \{-1, +1\}$ and $n: \mathbb{N}^2 \to \mathbb{Z}$ s.t. $|n|$ can be bounded by a polynomial (and hence its number of digits is logarithmic), we can define $t^{kj} := \zeta^{kj} 2^{-n^{kj}}$ and the $\mathbb{Q}^A$-valued $(poly, rlog)$-bischeme $\hat{Q}$ with $\hat{\sigma}_Q := \hat{\sigma}_R$ and $Q(x,y,z) := P(x,y) + tR(x,y,z)$. We have $\delta \in \mathcal{E}$ s.t.
$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f)^2] \le \operatorname{E}_{\mu \times \hat{\sigma}_Q}[(g^t \hat{Q} - f)^2] + (\operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[(g^t(\hat{P} - \hat{Q}))^2] + 1)\,\delta$$

$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f)^2] \le \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[(g^t(\hat{P} + t\hat{R}) - f)^2] + (\operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2]\, t^2 + 1)\,\delta$$

$$\operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[(g^t \hat{P} - f)^2 - (g^t(\hat{P} + t\hat{R}) - f)^2] \le (\operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2]\, t^2 + 1)\,\delta$$

$$-\operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2]\, t^2 - 2\operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t \hat{R}\,(g^t \hat{P} - f)]\, t \le (\operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2]\, t^2 + 1)\,\delta$$

$$-2\operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t \hat{R}\,(g^t \hat{P} - f)]\, t \le \operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2](1 + \sup\delta)\, t^2 + \delta$$

By choosing $\zeta := -\operatorname{sgn} \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t \hat{R}\,(g^t \hat{P} - f)]$ we get

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t \hat{R}\,(g^t \hat{P} - f)] \rvert\, 2^{-n+1} \le \operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2](1 + \sup\delta)\, 4^{-n} + \delta$$

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t \hat{R}\,(g^t \hat{P} - f)] \rvert \le \operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2](1 + \sup\delta)\, 2^{-n} + \delta\, 2^n$$

Take $n := \min(-\lfloor \frac{1}{2}\log\delta \rfloor, h)$ where $h: \mathbb{N}^2 \to \mathbb{N}$ is a polynomial s.t. $2^{-h} \in \mathcal{E}$. We get

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t \hat{R}\,(g^t \hat{P} - f)] \rvert \le \operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2](1 + \sup\delta)\max(\delta^{\frac{1}{2}}, 2^{-h}) + \delta\min(2\delta^{-\frac{1}{2}}, 2^h)$$

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t \hat{R}\,(g^t \hat{P} - f)] \rvert \le \operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2](1 + \sup\delta)\max(\delta^{\frac{1}{2}}, 2^{-h}) + 2\delta^{\frac{1}{2}}$$

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t \hat{R}\,(g^t \hat{P} - f)] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_R}[(g^t \hat{R})^2] + 1)(1 + \sup\delta)\max(2\delta^{\frac{1}{2}}, 2^{-h})$$
Conversely, suppose $\hat{P}$ is an $\mathcal{E}(poly, rlog)$-orthogonal predictor for $(A, \mu, f, g)$. Consider $\hat{Q}$ a $\mathbb{Q}^A$-valued $(poly, rlog)$-bischeme with $\hat{\sigma}_Q = \hat{\sigma}_P \times \tau$. We have
$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f)^2] - \operatorname{E}_{\mu \times \hat{\sigma}_Q}[(g^t \hat{Q} - f)^2] = \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t(\hat{P} - \hat{Q})\,(g^t(\hat{P} + \hat{Q}) - 2f)]$$

$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f)^2] - \operatorname{E}_{\mu \times \hat{\sigma}_Q}[(g^t \hat{Q} - f)^2] = \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t(\hat{P} - \hat{Q})\,(2g^t \hat{P} + g^t(\hat{Q} - \hat{P}) - 2f)]$$

$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f)^2] - \operatorname{E}_{\mu \times \hat{\sigma}_Q}[(g^t \hat{Q} - f)^2] = 2\operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[g^t(\hat{P} - \hat{Q})\,(g^t \hat{P} - f)] - \operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[(g^t(\hat{P} - \hat{Q}))^2]$$

$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f)^2] - \operatorname{E}_{\mu \times \hat{\sigma}_Q}[(g^t \hat{Q} - f)^2] \le (\operatorname{E}_{\mu \times \hat{\sigma}_P \times \tau}[(g^t(\hat{P} - \hat{Q}))^2] + 1)\,\delta$$
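The algebraic identity driving the last three lines can be checked symbolically; here is a small sketch using sympy (our choice of tool), with $a$, $b$, $c$ standing for the pointwise values of $g^t P$, $g^t Q$ and $f$.

```python
# Symbolic check: (a-c)^2 - (b-c)^2 == 2(a-b)(a-c) - (a-b)^2.
import sympy as sp

a, b, c = sp.symbols('a b c')        # a = g^t P, b = g^t Q, c = f
lhs = (a - c) ** 2 - (b - c) ** 2
rhs = 2 * (a - b) * (a - c) - (a - b) ** 2
assert sp.expand(lhs - rhs) == 0
```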
Proof of Theorem 3
Since $\hat{P}_1$ is orthogonal, we have $\delta_1 \in \mathcal{E}$ s.t.

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_1 \times \hat{\sigma}_2}[g^t(\hat{P}_1 - \hat{P}_2)\,(g^t \hat{P}_1 - f)] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_1 \times \hat{\sigma}_2}[(g^t(\hat{P}_1 - \hat{P}_2))^2] + 1)\,\delta_1$$

Since $\hat{P}_2$ is orthogonal, we have $\delta_2 \in \mathcal{E}$ s.t.

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_1 \times \hat{\sigma}_2}[g^t(\hat{P}_1 - \hat{P}_2)\,(g^t \hat{P}_2 - f)] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_1 \times \hat{\sigma}_2}[(g^t(\hat{P}_1 - \hat{P}_2))^2] + 1)\,\delta_2$$

It follows that

$$\operatorname{E}_{\mu \times \hat{\sigma}_1 \times \hat{\sigma}_2}[(g^t(\hat{P}_1 - \hat{P}_2))^2] \le (\operatorname{E}_{\mu \times \hat{\sigma}_1 \times \hat{\sigma}_2}[(g^t(\hat{P}_1 - \hat{P}_2))^2] + 1)(\delta_1 + \delta_2)$$

Define $I := \{(k,j) \mid \delta_1^{kj} + \delta_2^{kj} \ge \frac{1}{2}\}$. $\chi_I \le 2(\delta_1 + \delta_2)$, so $I$ is $\mathcal{E}$-negligible. We get

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[(g^t(\hat{P}_1^{kj} - \hat{P}_2^{kj}))^2] \le \frac{\delta_1^{kj} + \delta_2^{kj}}{1 - \delta_1^{kj} - \delta_2^{kj}}$$

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[(g^t(\hat{P}_1^{kj} - \hat{P}_2^{kj}))^2] \le 2(\delta_1^{kj} + \delta_2^{kj})$$
Proof of Proposition 1
Given $\delta_1, \delta_2 \in \mathcal{E}$:

$$\gamma\delta_1 + \gamma\delta_2 = \gamma(\delta_1 + \delta_2) \in \gamma\mathcal{E}$$

Given $\delta_1 \in \mathcal{E}$:

$$\delta_2 \le \gamma\delta_1 \implies \gamma^{-1}\delta_2 \le \delta_1 \implies \gamma^{-1}\delta_2 \in \mathcal{E} \implies \delta_2 \in \gamma\mathcal{E}$$

Given $h$ polynomial s.t. $2^{-h} \in \mathcal{E}$:

$$2^{-h} \le \gamma\,\epsilon^{-1} 2^{-h} \in \gamma\mathcal{E}$$
Proof of Theorem 4
We know that $\hat{\sigma}_{f \mid g} = \hat{\sigma}_{gg} \times \hat{\sigma}_{fg}$ and

$$P_{fg}(x,z) - f(x)g(x) = (g(x)^t P_{f \mid g}(x,y,z) - f(x))\,g(x) + (P_{gg}(x,y) - g(x)g(x)^t)\,P_{f \mid g}(x,y,z)$$

Consider a $\mathbb{Q}^A$-valued $(poly, rlog)$-bischeme $\hat{Q}$ s.t. $\hat{\sigma}_Q$ factors as $\hat{\sigma}_{gg} \times \hat{\sigma}_{fg} \times \tau$. We get

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_Q}[g^t Q\,(g^t \hat{P}_{f \mid g} - f)] \rvert \le \lvert \operatorname{E}_{\mu \times \hat{\sigma}_Q}[Q^t(\hat{P}_{fg} - fg)] \rvert + \lvert \operatorname{E}_{\mu \times \hat{\sigma}_Q}[Q^t(\hat{P}_{gg} - gg^t)\hat{P}_{f \mid g}] \rvert$$
Since $\hat{P}_{fg}$ is a $\gamma^2\mathcal{E}(poly, rlog)$-orthogonal predictor for $fg$, there is $\delta_{fg} \in \mathcal{E}$ s.t.

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_Q}[Q^t(\hat{P}_{fg} - fg)] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_Q}[\lVert Q \rVert^2] + 1)\,\gamma^2\delta_{fg}$$

Since $\hat{P}_{gg}$ is an $\mathcal{E}(poly, rlog)$-orthogonal predictor for $gg^t$, there is $\delta_{gg} \in \mathcal{E}$ s.t.

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_Q}[Q^t(\hat{P}_{gg} - gg^t)\hat{P}_{f \mid g}] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_Q}[\lVert Q \rVert^2 \lVert \hat{P}_{f \mid g} \rVert^2] + 1)\,\delta_{gg}$$

$\hat{P}_{fg}$ is bounded and the operator norm of $\hat{P}_{gg}^{-1}$ is at most $\gamma$, therefore there is $c > 0$ s.t. $\lVert \hat{P}_{f \mid g} \rVert^2 \le c\gamma^2$ and

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_Q}[Q^t(\hat{P}_{gg} - gg^t)\hat{P}_{f \mid g}] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_Q}[\lVert Q \rVert^2]\, c\gamma^2 + 1)\,\delta_{gg}$$

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_Q}[Q^t(\hat{P}_{gg} - gg^t)\hat{P}_{f \mid g}] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_Q}[\lVert Q \rVert^2] + 1)\max(\epsilon^{-2}, c)\,\gamma^2\delta_{gg}$$

Putting everything together we get

$$\lvert \operatorname{E}_{\mu \times \hat{\sigma}_Q}[g^t Q\,(g^t \hat{P}_{f \mid g} - f)] \rvert \le (\operatorname{E}_{\mu \times \hat{\sigma}_Q}[\lVert Q \rVert^2] + 1)\,\gamma^2(\delta_{fg} + \max(\epsilon^{-2}, c)\,\delta_{gg})$$
Proof of Theorem 5
According to Theorem 3, there is $\delta_1 \in \mathcal{E}$ and $I \subseteq \mathbb{N}^2$ which is $\gamma^4\mathcal{E}$-negligible s.t.

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[(g^t(\hat{P}_1^{kj} - \hat{P}_2^{kj}))^2] \le (\gamma^{kj})^4\delta_1^{kj}$$

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[(\hat{P}_1^{kj} - \hat{P}_2^{kj})^t gg^t(\hat{P}_1^{kj} - \hat{P}_2^{kj})] \le (\gamma^{kj})^4\delta_1^{kj}$$

On the other hand, since $\hat{P}_{gg}$ is an $\mathcal{E}(poly, rlog)$-orthogonal predictor for $gg^t$, there is $\delta_2 \in \mathcal{E}$ s.t.

$$\lvert \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[(\hat{P}_1^{kj} - \hat{P}_2^{kj})^t(\hat{P}_{gg}^{kj} - gg^t)(\hat{P}_1^{kj} - \hat{P}_2^{kj})] \rvert \le (\operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[\lVert \hat{P}_1^{kj} - \hat{P}_2^{kj} \rVert^4] + 1)\,\delta_2^{kj}$$

Since $\lVert \hat{P}_{1,2}^{kj}(x) \rVert \le \gamma^{kj}$, there is $c > 0$ s.t.

$$\lvert \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[(\hat{P}_1^{kj} - \hat{P}_2^{kj})^t(\hat{P}_{gg}^{kj} - gg^t)(\hat{P}_1^{kj} - \hat{P}_2^{kj})] \rvert \le (c(\gamma^{kj})^4 + 1)\,\delta_2^{kj}$$

$$\lvert \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[(\hat{P}_1^{kj} - \hat{P}_2^{kj})^t(\hat{P}_{gg}^{kj} - gg^t)(\hat{P}_1^{kj} - \hat{P}_2^{kj})] \rvert \le (c + \epsilon^{-4})(\gamma^{kj})^4\delta_2^{kj}$$

Combining the two, we conclude that

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[(\hat{P}_1^{kj} - \hat{P}_2^{kj})^t \hat{P}_{gg}^{kj}(\hat{P}_1^{kj} - \hat{P}_2^{kj})] \le (\gamma^{kj})^4(\delta_1^{kj} + (c + \epsilon^{-4})\delta_2^{kj})$$

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[\lVert (\hat{P}_{gg}^{kj})^{\frac{1}{2}}(\hat{P}_1^{kj} - \hat{P}_2^{kj}) \rVert^2] \le (\gamma^{kj})^4(\delta_1^{kj} + (c + \epsilon^{-4})\delta_2^{kj})$$

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: (\gamma^{kj})^{-1}\operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[\lVert \hat{P}_1^{kj} - \hat{P}_2^{kj} \rVert^2] \le (\gamma^{kj})^4(\delta_1^{kj} + (c + \epsilon^{-4})\delta_2^{kj})$$

$$\forall (k,j) \in \mathbb{N}^2 \setminus I: \operatorname{E}_{\mu^k \times \hat{\sigma}_1^{kj} \times \hat{\sigma}_2^{kj}}[\lVert \hat{P}_1^{kj} - \hat{P}_2^{kj} \rVert^2] \le (\gamma^{kj})^5(\delta_1^{kj} + (c + \epsilon^{-4})\delta_2^{kj})$$
Definition 11
Fix $\mathcal{E}$ an error space of rank 2 and $(A, \mu, f, g)$ a relative estimation problem. Consider $\hat{P}$ a $\mathbb{Q}^A$-valued $(poly, rlog)$-bischeme. $\hat{P}$ is called an $\mathcal{E}(poly, rlog)$-perfect predictor for $(A, \mu, f, g)$ when $\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f)^2] \in \mathcal{E}$.
Proposition 2
Fix $\mathcal{E}$ an error space of rank 2. Consider $(A, \mu, f^*, g)$ a relative estimation problem and $f: \operatorname{supp} \mu \to \mathbb{R}^A$ bounded s.t. $f^* = g^t f$. Suppose $\hat{P}_f$ is an $\mathcal{E}(poly, rlog)$-perfect predictor for $(\mu, f)$. Then $\hat{P}_f$ is an $\mathcal{E}(poly, rlog)$-perfect predictor for $(A, \mu, f^*, g)$.
Proof of Proposition 2
$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f^*)^2] = \operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - g^t f)^2]$$

$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f^*)^2] = \operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t(\hat{P} - f))^2]$$

$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f^*)^2] \le \operatorname{E}_{\mu \times \hat{\sigma}_P}[\lVert g \rVert^2 \lVert \hat{P} - f \rVert^2]$$

$$\operatorname{E}_{\mu \times \hat{\sigma}_P}[(g^t \hat{P} - f^*)^2] \le \sup \lVert g \rVert^2\, \operatorname{E}_{\mu \times \hat{\sigma}_P}[\lVert \hat{P} - f \rVert^2]$$
Proof of Corollary
By Proposition 2, $\hat{P}_f$ is a $\gamma^8\mathcal{E}^2(poly, rlog)$-perfect predictor for $(A, \mu, f^*, g)$ and in particular a $\gamma^8\mathcal{E}^2(poly, rlog)$-optimal predictor for $(A, \mu, f^*, g)$. By Theorem 2 it is a $\gamma^4\mathcal{E}(poly, rlog)$-orthogonal predictor for $(A, \mu, f^*, g)$. By Theorem 4, $\hat{P}_{gg}^{-1}\hat{P}_{fg}$ is a $\gamma^2\mathcal{E}(poly, rlog)$-orthogonal predictor for $(A, \mu, f^*, g)$. Applying Theorem 5, we get the desired result (the condition $\lVert \hat{P}_{gg}^{-1}\hat{P}_{fg} \rVert \le \gamma$ holds because $\hat{P}_{fg}$ is bounded and $\hat{P}_{gg}$'s lowest eigenvalue is at least $\gamma^{-1}$).