Delegative Inverse Reinforcement Learning

Vanessa Kosoy

We introduce a reinforcement-learning-like setting we call Delegative Inverse Reinforcement Learning (DIRL). In DIRL, the agent can, at any point in time, delegate the choice of action to an "advisor". The agent knows neither the environment nor the reward function, whereas the advisor knows both. Thus, DIRL can be regarded as a special case of CIRL. A similar setting was studied in Clouse 1997, but as far as we can tell, the relevant literature offers few theoretical results and virtually all researchers focus on the MDP case (please correct me if I'm wrong). In contrast, we consider general environments (not necessarily MDP or even POMDP) and prove a natural performance guarantee.

The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of "Bayesian paranoia" and the Charybdis of falling into traps). We prove that, given a certain assumption about the advisor, a Bayesian DIRL agent (whose prior is supported on some countable set of hypotheses) is guaranteed to attain most of the value in the slowly falling time discount (long-term planning) limit (assuming one of the hypotheses in the prior is true). The assumption about the advisor is quite strong, but the advisor is not required to be fully optimal: a "soft maximizer" satisfies the conditions. Moreover, we allow for the existence of "corrupt states" in which the advisor stops being a relevant signal, thus demonstrating that this approach can deal with wireheading and avoid manipulating the advisor, at least in principle (the assumption about the advisor is still unrealistically strong). Finally, we consider advisors that don't know the environment but have some beliefs about the environment, and show that in this case the agent converges to Bayes-optimality w.r.t. the advisor's beliefs, which is arguably the best we can expect.

All the proofs are in the Appendix.

Notation

The set of natural numbers is defined to begin from 0. Given $n \in N$ , $[n]$ denotes the set $\{m \in N ∣ m < n\}$ . Given a logical formula $φ$ , $[[φ]] \in \{0, 1\}$ denotes its truth value.

Given a set $X$ , we denote $X^{*} := ⨆_{n \in N} X^{n}$ , the set of finite strings over the alphabet $X$ . The unique string of length 0 is denoted $λ$ . We also denote by $X^{ω}$ the set of infinite strings over alphabet $X$ . Given $α \in X^{*} ⊔ X^{ω}$ and $n \in N$ , $α_{n} \in X$ is the $n$ -th symbol in $α$ (i.e. $α = α_{0} α_{1} α_{2} \dots$ ) and $α_{: n} \in X^{*}$ is the prefix of length $n$ of $α$ (ending in $α_{n - 1}$ ). Given $α, β \in X^{*}$ , $| α | \in N$ is the length of $α$ and $α β \in X^{*}$ is the concatenation of $α$ and $β$ . The latter notation is also applicable when $β \in X^{ω}$ . The notation $α ⊑ β$ means that $α$ is a prefix of $β$ . Given sets $X, Y$ , $x \in X$ and $y \in Y$ , we sometimes use the notation $x y = (x, y) \in X \times Y$ . Given $α \in (X \times Y)^{*}$ , and $n \in N$ , $α_{: n + 1 / 2} \in (X \times Y)^{*} \times X$ is defined by $α_{: n + 1 / 2} = (α_{: n}, x)$ where $α_{n} = (x, y)$ .

Given sets $A$ and $B$ , the notation $f : A \circ \to B$ means that $f$ is a partial mapping from $A$ to $B$ , i.e. a mapping from some set $dom f \subseteq A$ to $B$ .

Given a topological space $A$ , $Δ A$ is the space of Borel probability measures on $A$ . When no topology is specified, $A$ is understood to be discrete: in this case, $μ \in Δ A$ can also be regarded as a function from $A$ to $[0, 1]$ . The space $X^{ω}$ is understood to have the product topology. Given topological spaces $A, B$ , $μ \in Δ A$ and $ν \in Δ B$ , $supp μ \subseteq A$ is the support of $μ$ and $μ \times ν \in Δ (A \times B)$ is the product measure. Given $K$ a Markov kernel from $A$ to $B$ , $μ ⋉ K \in Δ (A \times B)$ is the semidirect product of $μ$ and $K$ . When $A$ and $B$ are discrete, $H (μ)$ is the Shannon entropy of $μ$ (in natural base) and $D_{K L} (μ ∥ ν)$ is the Kullback-Leibler divergence from $ν$ to $μ$ .

Given $d \in N$ and $x \in R^{d}$ , we use the notation

$∥ x ∥_{1} := \sum_{i < d} | x_{i} |$

$∥ x ∥_{\infty} := \max_{i < d} | x_{i} |$

The symbols $o, O, ω, Ω, Θ$ will refer to usual $O$ -notation.

Results

An interface $I = (A, O)$ is a pair of finite sets ("actions" and "observations"). An $I$ -policy is a function $π : (A \times O)^{*} \to Δ A$ . An $I$ -environment is a partial function $μ : (A \times O)^{*} \times A \circ \to Δ O$ s.t.

i. $λ \times A \subseteq dom μ$

ii. Given $h \in (A \times O)^{*}$ and $a o b \in A \times O \times A$ , $h a o b \in dom μ$ iff $h a \in dom μ$ and $μ (h a) (o) > 0$ .

It is easy to see that $dom μ$ is always of the form $X \times A$ for some $X \subseteq (A \times O)^{*}$ . We denote $hdom μ := X$ .

Given an $I$ -policy $π$ and an $I$ -environment $μ$ , we get $μ ⋈ π \in Δ (A \times O)^{ω}$ in the usual way.

An $I$ -reward function is a partial function $r : (A \times O)^{*} \circ \to [0, 1]$ . An $I$ -universe is a pair $(μ, r)$ where $μ$ is an $I$ -environment and $r$ is an $I$ -reward function s.t. $dom r \supseteq hdom μ$ . We denote the space of $I$ -universes by $Υ_{I}$ . Given an $I$ -reward function $r$ and $t \in (0, \infty)$ , we have the associated utility function $U_{t}^{r} : (A \times O)^{ω} \circ \to [0, 1]$ defined by

$U_{t}^{r} (x) := \frac{\sum_{n = 0}^{\infty} e^{- n / t} r (x_{: n})}{\sum_{n = 0}^{\infty} e^{- n / t}}$

Here and throughout, we use geometric time discount; however, this choice is mostly for notational simplicity. More or less all results carry over to other shapes of the time discount function.
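As a numerical illustration of the definition of $U_{t}^{r}$ , the following Python sketch evaluates it on a finite prefix of rewards. It assumes the values $r (x_{: n})$ are supplied as a list (the name `discounted_utility` is ours, not from the text) and uses the closed form $\sum_{n} e^{- n / t} = (1 - e^{- 1 / t})^{- 1}$ for the normalizer. Since rewards lie in $[0, 1]$ , truncating to a prefix of length $N$ underestimates the true utility by at most $e^{- N / t}$ .

```python
import math

def discounted_utility(rewards, t):
    """Approximate U_t^r from a finite list where rewards[n] stands for r(x_{:n}).

    The denominator sum_n e^{-n/t} equals 1 / (1 - e^{-1/t}), so the
    normalized utility is (1 - e^{-1/t}) * sum_n e^{-n/t} * r_n.
    """
    norm = 1.0 - math.exp(-1.0 / t)
    return norm * sum(math.exp(-n / t) * r for n, r in enumerate(rewards))

if __name__ == "__main__":
    # With reward 1 at every step, the truncated value is 1 - e^{-N/t}.
    print(discounted_utility([1.0] * 1000, t=10.0))
```

Larger $t$ spreads the weight over more time steps, which is the sense in which $t$ sets the planning horizon.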

Denote by $Π_{I}$ the space of $I$ -policies. An $I$ -metapolicy is a family $\{π_{t} \in Π_{I}\}_{t \in (0, \infty)}$ , where the parameter $t$ is thought of as setting the scale of the time discount. An $I$ -meta-universe is a family $\{υ_{t} \in Υ_{I}\}_{t \in (0, \infty)}$ . This latter concept is useful for analyzing multi-agent systems, where the environment contains other agents and we study the asymptotics when all agents' time discount scales go to infinity. We won't focus on the multi-agent case in this essay, but for future reference, it seems useful to make sure the results hold in the greater generality of meta-universes.

Given an $I$ -policy $π$ , an $I$ -universe $υ = (μ, r)$ and $t > 0$ , we denote ${EU}_{υ}^{π} (t) := E_{μ ⋈ π} [U_{t}^{r}]$ (this is well-defined since $U_{t}^{r}$ is defined on the support of $μ ⋈ π$ ). We also denote ${EU}_{υ}^{*} (t) := {max}_{π \in Π_{I}} {EU}_{υ}^{π} (t)$ . We will omit $I$ when it is obvious from the context.

Definition 1

Fix an interface $I$ . Consider $π^{*}$ a metapolicy and $H$ a set of meta-universes. $π^{*}$ is said to learn $H$ when for any $υ \in H$

$\lim_{t \to \infty} ({EU}_{υ_{t}}^{*} (t) - {EU}_{υ_{t}}^{π_{t}^{*}} (t)) = 0$

$H$ is said to be learnable when there exists $π^{*}$ that learns $H$ .

Our notion of learnability is closely related to the notion of sublinear regret, as defined in Leike 2016, except that we allow the policy to explicitly depend on the time discount scale. This difference is important: for example, given a single universe $υ$ , it might be impossible to achieve sublinear regret, but $\{υ\}$ is always learnable.

Proposition 1

Fix an interface $I$ . Consider $H$ a countable learnable set of meta-universes. Consider any $ζ \in Δ H$ s.t. $supp ζ = H$ . Consider $π^{ζ}$ a $ζ$ -Bayes optimal metapolicy, i.e.

$π_{t}^{ζ} \in \operatorname{argmax}_{π \in Π} E_{υ \sim ζ} [{EU}_{υ_{t}}^{π} (t)]$

Then, $π^{ζ}$ learns $H$ .

Proposition 1 can be regarded as a "frequentist" justification for Bayesian agents: if any metapolicy is optimal in a "frequentist" sense for the class $H$ (i.e. learns it), then the Bayes optimal metapolicy is such.

Another handy property of learnability is the following.

Proposition 2

Fix an interface $I$ . Let $H$ be a countable set of meta-universes s.t. any finite $G \subseteq H$ is learnable. Then, $H$ is learnable.

We now introduce the formalism needed to discuss advisors. Define $\bar{A} := A ⊔ \{⊥\}$ , $\bar{O} := \bar{A} \times O$ and $\bar{I} := (\bar{A}, \bar{O})$ . Here, the $\bar{A}$ factor of $\bar{O}$ is the action taken by the advisor, assumed to be observable by the agent. The environments we will consider are s.t. this action is $⊥$ unless the agent delegated to the advisor at this point of time, which is specified by the agent taking action $⊥$ . It will also be the case that whenever the agent takes action $⊥$ , the advisor cannot take action $⊥$ .

Denote $\overline{A \times O} := \bar{A} \times \bar{O} ∖ \{⊥ ⊥ o ∣ o \in O\}$ . Given $a b o \in \overline{A \times O}$ , we define $\underline{a b o} \in A \times O$ by

$\underline{a b o} := \begin{cases} a o & \text{if } a \neq ⊥ \\ b o & \text{if } a = ⊥ \end{cases}$

Given $h \in \overline{A \times O}^{*}$ , we define $\underline{h} \in (A \times O)^{*}$ by $\underline{h}_{n} := \underline{h_{n}}$ .

Definition 2

An $\bar{I}$ -policy $α$ is said to be autonomous when for any $h \in \overline{A \times O}^{*}$ , $α (h) (⊥) = 0$ .

Consider an $I$ -environment $μ$ and an autonomous $\bar{I}$ -policy $α$ , which we think of as the advisor policy. We define the $\bar{I}$ -environment $\bar{μ} [α]$ as follows. For any $h \in \overline{A \times O}^{*}$ s.t. $\underline{h} \in hdom μ$ , $a, b \in \bar{A}$ and $o \in O$ :

$\bar{μ} [α] (h a) (b o) := \begin{cases} μ (\underline{h} a) (o) & \text{if } a \neq ⊥, b = ⊥ \\ α (h) (b) \cdot μ (\underline{h} b) (o) & \text{if } a = ⊥, b \neq ⊥ \\ 0 & \text{if } a \neq ⊥, b \neq ⊥ \text{ or } a = b = ⊥ \end{cases}$

It is easy to see that the above is a well-defined $\bar{I}$ -environment with $hdom \bar{μ} [α] \subseteq \overline{A \times O}^{*}$ .

Given an $I$ -universe $υ = (μ, r)$ , we define the $\bar{I}$ -reward function $\bar{r} : \overline{A \times O}^{*} \circ \to [0, 1]$ by $\bar{r} (h) := r (\underline{h})$ and the $\bar{I}$ -universe $\bar{υ} [α] := (\bar{μ} [α], \bar{r})$ .
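The construction of the environment with an advisor can be sketched in Python as follows. This is only an illustration under our own encoding assumptions: $⊥$ is represented by `None`, histories are tuples of `(a, b, o)` triples, and the hypothetical callables `mu(history, action)` and `alpha(history)` return dictionaries mapping outcomes to probabilities.

```python
BOT = None  # stands for the symbol ⊥ (an encoding choice, not from the text)

def collapse(history):
    """Map a history of (a, b, o) triples to (effective action, o) pairs.

    The effective action is the agent's own action a, or the advisor's
    action b when the agent delegated (a is BOT).
    """
    return tuple((b if a is BOT else a, o) for (a, b, o) in history)

def bar_mu(mu, alpha, history, a):
    """Distribution over (advisor action b, observation o) given agent action a."""
    base = collapse(history)
    dist = {}
    if a is not BOT:
        # The agent acts itself, so the advisor's action is BOT.
        for o, p in mu(base, a).items():
            dist[(BOT, o)] = p
    else:
        # The agent delegates: the advisor samples b, then the environment responds.
        for b, pb in alpha(history).items():
            if b is BOT:
                continue  # an autonomous advisor never plays BOT
            for o, po in mu(base, b).items():
                dist[(b, o)] = pb * po
    return dist
```

In both branches the returned probabilities sum to 1 whenever `mu` and `alpha` return normalized distributions and `alpha` is autonomous.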

We now introduce the conditions on the advisor policy which will allow us to prove a learnability theorem. First, we specify an advisor that always remains "approximately rational."

The notation $E_{x \sim ρ} [X ∣ h]$ will be used to mean $E_{x \sim ρ} [X ∣ h ⊑ x]$ . Given a universe $υ = (μ, r)$ and $t \in (0, \infty)$ we define $V_{t}^{υ} : hdom μ \to [0, 1]$ and $Q_{t}^{υ} : hdom μ \times A \to [0, 1]$ by

$V_{t}^{υ} (h) := \max_{π \in Π} E_{x \sim μ ⋈ π} [\frac{\sum_{n = | h |}^{\infty} e^{- n / t} r (x_{: n})}{\sum_{n = | h |}^{\infty} e^{- n / t}} ∣ h]$

$Q_{t}^{υ} (h a) := (1 - e^{- 1 / t}) r (h) + e^{- 1 / t} E_{o \sim μ (h a)} [V_{t}^{υ} (h a o)]$

Definition 3

Fix an interface $I$ . Consider a universe $υ = (μ, r)$ . Let $t, β \in (0, \infty)$ . A policy $α$ is called strictly $β$ -rational for $(υ, t)$ when for any $h \in hdom μ$ and $a \in A$

$α (h) (a) \leq \exp [β (Q_{t}^{υ} (h a) - V_{t}^{υ} (h))] \max_{a^{*} \in A} α (h) (a^{*})$
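To make the condition concrete, here is a small Python sketch that checks strict $β$ -rationality at a single history. It assumes the $Q$ -values at that history are given as a dictionary and uses $V_{t}^{υ} (h) = \max_{a} Q_{t}^{υ} (h a)$ , which follows from the definitions above. The function names are ours. A softmax over $Q$ with inverse temperature $β$ satisfies the inequality with equality, so it is strictly $β'$ -rational for every $β' \leq β$ but not for larger $β'$ (unless all $Q$ -values coincide).

```python
import math

def softmax_policy(q_values, beta):
    """Soft maximizer: alpha(a) proportional to exp(beta * Q(a))."""
    top = max(q_values.values())
    weights = {a: math.exp(beta * (q - top)) for a, q in q_values.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def is_strictly_rational(alpha, q_values, beta, tol=1e-12):
    """Check alpha(a) <= exp(beta * (Q(a) - V)) * max_b alpha(b) for all a."""
    v = max(q_values.values())      # V(h) = max_a Q(h a)
    best = max(alpha.values())
    return all(
        alpha[a] <= math.exp(beta * (q_values[a] - v)) * best + tol
        for a in q_values
    )

if __name__ == "__main__":
    q = {"a": 0.9, "b": 0.5, "c": 0.1}
    advisor = softmax_policy(q, beta=5.0)
    print(is_strictly_rational(advisor, q, beta=5.0))
    print(is_strictly_rational(advisor, q, beta=10.0))
```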

Now we deal with the possibility of the advisor becoming "corrupt". In practical implementations where the "advisor" is a human operator, this can correspond to several types of events, e.g. sabotaging the channel that transmits data from the operator to the AI ("wireheading"), manipulation of the operator or replacement of the operator by a different entity.

Definition 4

Fix an interface $I$ . Consider a family $\{C_{t} \subseteq (A \times O)^{*}\}_{t \in (0, \infty)}$ s.t. for any $h, g \in (A \times O)^{*}$ , if $h \in C_{t}$ then $h g \in C_{t}$ . We think of $C_{t}$ as the set of histories in which a certain event occurred. Consider a meta-universe $υ = (μ, r)$ . $C$ is said to be a $υ$ -avoidable event when there is a metapolicy $π^{*}$ and $D : (0, \infty) \to N$ s.t.

i. ${lim}_{t \to \infty} ({EU}_{υ_{t}}^{*} (t) - {EU}_{υ_{t}}^{π_{t}^{*}} (t)) = 0$

ii. $D = ω (t)$

iii. ${lim}_{t \to \infty} {Pr}_{x \sim μ_{t} ⋈ π_{t}^{*}} [\exists n \leq D (t) : x_{: n} \in C_{t}] = 0$

That is, $C$ is $υ$ -avoidable when it is possible to avoid the event for a long time while retaining most of the value. Consider a meta-universe $υ = (μ, r)$ and $C$ a $υ$ -avoidable event. Denote $- υ := (μ, 1 - r)$ . We define the reward function $r_{t}^{C}$ by

$r_{t}^{C} (h) := \begin{cases} r_{t} (h) & \text{if } h \notin C_{t} \\ V_{t}^{- υ} (h_{: n}) & \text{if } h \in C_{t} \text{ and } n = \min \{m \in N ∣ h_{: m} \in C_{t}\} \end{cases}$

We think of $r^{C}$ as representing a process wherein once the event represented by $C$ occurs, the agent starts minimizing the utility function. We also use the notation $υ^{C} := (μ, r^{C})$ .

Definition 5

Consider a meta-universe $υ = (μ, r)$ and $β : (0, \infty) \to (0, \infty)$ , where we think of the argument of the function $β$ as the time discount scale. An autonomous $\bar{I}$ -metapolicy $α$ is called $β$ -rational for $υ$ when there exists a $υ$ -avoidable event $C$ (that we think of as the advisor becoming "corrupt") and an autonomous $\bar{I}$ -metapolicy $α^{*}$ (representing advisor policy conditional on non-corruption) s.t.

i. For any $h \in \overline{A \times O}^{*}$ s.t. $\underline{h} \notin C_{t}$ , $α_{t} (h) = α_{t}^{*} (h)$ .

ii. $α_{t}^{*}$ is strictly $β (t)$ -rational for $(\bar{υ}_{t}^{C} [α^{*}], t)$ .

Our definition of $β$ -rationality requires the advisor to be extremely averse to corruption: the advisor behaves as if, once a corruption occurs, the agent policy becomes the worst possible. In general, this seems much too strong: by the time corruption occurs, the agent might have already converged into accurate beliefs about the universe that allow it to detect the corruption and keep operating without the advisor. Even better, the agent can usually outperform the worst possible policy using the prior alone. Moreover, we can allow for corruption to depend on unobservable random events and differentiate between different degrees of corruption and treat them accordingly. We leave those further improvements for future work.

We are now ready to formulate the main result.

Theorem

Consider $H = \{υ^{k}\}_{k \in N}$ a countable family of $I$ -meta-universes and $β : (0, \infty) \to (0, \infty)$ s.t. $β (t) = ω (t^{2 / 3})$ . Let $\{α^{k}\}_{k \in N}$ be a family of autonomous $\bar{I}$ -metapolicies s.t. for every $k \in N$ , $α^{k}$ is $β$ -rational for $υ^{k}$ . Define $\bar{H} := \{\bar{υ}^{k} [α^{k}]\}_{k \in N}$ . Then, $\bar{H}$ is learnable.

Some remarks:

By Proposition 1, $\bar{H}$ is learned by any Bayes optimal metapolicy with prior supported on $\bar{H}$ .
To get a feeling for the condition $β (t) = ω (t^{2 / 3})$ , consider an environment where the reward depends only on the last action and observation. In such an environment, an advisor that performs softmax (with constant parameter) on the next reward has $β (t) = Θ (t)$ . It is thus "more rational" than the required minimum.
It is easy to see that the Theorem can be generalized by introducing an external penalty (negative reward) for each time the agent delegates to the advisor: as it is, using the advisor already carries a penalty due to its suboptimal choice.

The conditions of the Theorem imply that, in some sense, the advisor "knows" the true environment. This is unrealistic: obviously, we expect the human operator to have some (!) uncertainty about the world. However, we clearly cannot do away with this assumption: if the same action triggers a trap in some universes and is necessary for approaching maximal utility in other universes, and there is no observable difference between the universes before the action is taken, then there is no way to guarantee optimality. The prior knowledge you have about the universe caps the utility you can guarantee to obtain. On the other hand, as an AI designer, one can reasonably expect the AI to do at least as well as possible using the designer's own knowledge. If running the AI is the designer's best strategy, the AI should be close to Bayes optimal (in some sense that includes computational bounds et cetera: complications that we currently ignore) with respect to the designer's posterior rather than with respect to some simple prior. In other words, we need a way to transmit the designer's knowledge to the AI, without hand-crafting an elaborate prior.

The following shows that DIRL achieves this goal (theoretically, given the considerable simplifying assumptions).

Given an environment $μ$ , we define $\hat{μ} : (A \times O)^{*} \to [0, 1]$ as follows. For $h = a_{0} o_{0} a_{1} o_{1} \dots a_{n - 1} o_{n - 1} \in hdom μ$

$\hat{μ} (h) := \prod_{m < n} μ (h_{: m} a_{m}) (o_{m})$

For $h \in (A \times O)^{*} ∖ hdom μ$ , $\hat{μ} (h) := 0$ .

Given a family of environments $\{μ^{k}\}_{k \in N}$ and $ξ \in Δ N$ , we will use the notation $E_{k \sim ξ} [μ^{k}]$ to denote the environment given by

$hdom E_{k \sim ξ} [μ^{k}] := \bigcup_{k \in supp ξ} hdom μ^{k}$

$E_{k \sim ξ} [μ^{k}] (h a) := \frac{E_{k \sim ξ} [\hat{μ}^{k} (h) μ^{k} (h a)]}{E_{k \sim ξ} [\hat{μ}^{k} (h)]}$
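The mixture environment is just the Bayesian predictive distribution: each $μ^{k}$ is weighted by its prior probability times the likelihood it assigns to the history so far. A minimal Python sketch, assuming a finite family, histories encoded as tuples of `(action, observation)` pairs and environments given as hypothetical callables `mu(history, action)` returning a dictionary of observation probabilities:

```python
def hat_mu(mu, history):
    """Probability that environment `mu` assigns to the observations in `history`."""
    p = 1.0
    for m, (a, o) in enumerate(history):
        p *= mu(history[:m], a).get(o, 0.0)
        if p == 0.0:
            return 0.0  # the history lies outside hdom mu
    return p

def mixture(mus, xi, history, action):
    """Observation distribution of E_{k ~ xi}[mu^k] after `history` and `action`."""
    weights = [x * hat_mu(mu, history) for mu, x in zip(mus, xi)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("history is outside the domain of the mixture")
    out = {}
    for mu, w in zip(mus, weights):
        if w > 0.0:
            for o, p in mu(history, action).items():
                out[o] = out.get(o, 0.0) + (w / total) * p
    return out
```

For example, with two coin-like environments that emit observation 1 with probability 0.9 and 0.1 respectively and a uniform $ξ$ , observing a single 1 shifts the posterior weights to 0.9 and 0.1, so the next observation is 1 with probability 0.82.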

Corollary 1

Consider $\{μ^{k}\}_{k \in N}$ a countable family of $I$ -meta-environments, $\{r^{K}\}_{K \in N}$ a countable family of reward functions and $\{ξ^{K} \in Δ N\}_{K \in N}$ s.t. given $k \in supp ξ^{K}$ , $dom r^{K} \supseteq hdom μ^{k}$ . We think of $ξ^{K}$ as the advisor's belief about the environment in universe $K$ . Let $β : (0, \infty) \to (0, \infty)$ be s.t. $β (t) = ω (t^{2 / 3})$ and $\{α^{K}\}_{K \in N}$ be a family of autonomous $\bar{I}$ -metapolicies s.t. for every $K \in N$ , $α^{K}$ is $β$ -rational for $(E_{k \sim ξ^{K}} [μ^{k}], r^{K})$ . Let $ζ \in Δ N^{2}$ be s.t. for any $J, j \in N$

$\Pr_{(K, k) \sim ζ} [K = J] > 0$

$\Pr_{(K, k) \sim ζ} [k = j ∣ K = J] = ξ^{J} (j)$

We think of $ζ$ as the agent's prior and the equation above as stating the agent's belief that the advisor's beliefs are "calibrated". Consider $π^{ζ}$ a $ζ$ -Bayes optimal $\bar{I}$ -metapolicy, i.e.

$π^{ζ} \in \operatorname{argmax}_{π \in Π_{\bar{I}}} E_{(K, k) \sim ζ} [{EU}_{\bar{μ}^{k} [α^{K}], \bar{r}^{K}}^{π} (t)]$

Then, for every $K \in N$

$\lim_{t \to \infty} (\max_{π \in Π_{I}} E_{k \sim ξ^{K}} [{EU}_{μ^{k}, r^{K}}^{π} (t)] - E_{k \sim ξ^{K}} [{EU}_{\bar{μ}^{k} [α^{K}], \bar{r}^{K}}^{π^{ζ}} (t)]) = 0$

If we happen to be so lucky that the advisor's (presumably justified) belief is supported on a learnable environment class, we get a stronger conclusion.

Corollary 2

In the setting of Corollary 1, fix $K \in N$ . Define the set of meta-universes $H^{K}$ by

$H^{K} := \{(μ^{k}, r^{K}) ∣ k \in supp ξ^{K}\}$

Assume $H^{K}$ is learnable. Then, for every $k \in N$

$\lim_{t \to \infty} ({EU}_{μ^{k}, r^{K}}^{*} (t) - {EU}_{\bar{μ}^{k} [α^{K}], \bar{r}^{K}}^{π^{ζ}} (t)) = 0$

We also believe that to some extent DIRL is effective against acausal attack. Indeed, the optimality we get from the Theorem + Proposition 1 holds for any prior. However, the speed of convergence to optimality certainly depends on the prior. It is therefore desirable to analyze this dependency and bound the damage an adversary can do by controlling a certain portion of the prior. We leave this for future work.

Appendix

Proposition A.0

Fix an interface $I$ . Consider $H$ a countable learnable set of meta-universes. Consider any $ζ \in Δ H$ s.t. $supp ζ = H$ . Consider $π^{ζ}$ a metapolicy s.t.

$\lim_{t \to \infty} (\max_{π \in Π} E_{υ \sim ζ} [{EU}_{υ_{t}}^{π} (t)] - E_{υ \sim ζ} [{EU}_{υ_{t}}^{π_{t}^{ζ}} (t)]) = 0$

Then, $π^{ζ}$ learns $H$ .

Proof of Proposition A.0

Fix $π^{*}$ a metapolicy that learns $H$ . Consider $ϵ > 0$ and let $H_{ϵ} \subseteq H$ be finite s.t. $ζ (H ∖ H_{ϵ}) < ϵ$ . For $t ≫ 0$ and every $υ \in H_{ϵ}$ we have

${EU}_{υ_{t}}^{π_{t}^{*}} (t) \geq {EU}_{υ_{t}}^{*} (t) - ϵ$

Also

$E_{υ \sim ζ} [{EU}_{υ_{t}}^{π_{t}^{ζ}} (t)] \geq E_{υ \sim ζ} [{EU}_{υ_{t}}^{π_{t}^{*}} (t)] - ϵ \geq E_{υ \sim ζ} [{EU}_{υ_{t}}^{π_{t}^{*}} (t); υ \in H_{ϵ}] - ϵ$

Combining, we get

$E_{υ \sim ζ} [{EU}_{υ_{t}}^{π_{t}^{ζ}} (t)] \geq E_{υ \sim ζ} [{EU}_{υ_{t}}^{*} (t); υ \in H_{ϵ}] - 2 ϵ$

$E_{υ \sim ζ} [{EU}_{υ_{t}}^{π_{t}^{ζ}} (t)] \geq E_{υ \sim ζ} [{EU}_{υ_{t}}^{*} (t)] - E_{υ \sim ζ} [{EU}_{υ_{t}}^{*} (t); υ \notin H_{ϵ}] - 2 ϵ$

By definition of $H_{ϵ}$ , this implies

$E_{υ \sim ζ} [{EU}_{υ_{t}}^{π_{t}^{ζ}} (t)] \geq E_{υ \sim ζ} [{EU}_{υ_{t}}^{*} (t)] - 3 ϵ$

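Since ${EU}_{υ_{t}}^{*} (t) - {EU}_{υ_{t}}^{π_{t}^{ζ}} (t) \geq 0$ for every $υ$ , each term of the expectation is bounded by the total, so for any $υ \in H$ we get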

${EU}_{υ_{t}}^{π_{t}^{ζ}} (t) \geq {EU}_{υ_{t}}^{*} (t) - \frac{3 ϵ}{ζ (υ)}$

Taking $ϵ$ to 0, we get the desired result.

Proof of Proposition 1

Immediate from Proposition A.0.

Proof of Proposition 2

Let $H = \{υ^{k}\}_{k \in N}$ . For each $k \in N$ , let $π^{k}$ learn $\{υ^{l}\}_{l < k}$ . Choose $\{t^{k} \in [0, \infty)\}_{k \in N}$ s.t.

i. $t^{0} = 0$

ii. $t^{k} < t^{k + 1}$

iii. ${lim}_{k \to \infty} t^{k} = \infty$

iv. For any $l < k$ and $t \geq t^{k}$ , ${EU}_{υ_{t}^{l}}^{π_{t}^{k}} (t) \geq {EU}_{υ_{t}^{l}}^{*} (t) - \frac{1}{k + 1}$ .

Now define $π_{t}^{*} := π_{t}^{\max \{k ∣ t \geq t^{k}\}}$ . Clearly, $π^{*}$ learns $H$ .

Proposition A.1

Consider $d \in N$ and $x, y \in R^{d} ∖ 0$ . Then

$∥ \frac{x}{∥ x ∥_{\infty}} - \frac{y}{∥ y ∥_{\infty}} ∥_{\infty} \leq 2 d ∥ \frac{x}{∥ x ∥_{1}} - \frac{y}{∥ y ∥_{1}} ∥_{1}$

Proof of Proposition A.1

Without loss of generality, assume $∥ x ∥_{1} = ∥ y ∥_{1} = 1$ . For any $i < d$ , we have

$\frac{x_{i}}{∥ x ∥_{\infty}} - \frac{y_{i}}{∥ y ∥_{\infty}} = \frac{x_{i} ∥ y ∥_{\infty} - y_{i} ∥ x ∥_{\infty}}{∥ x ∥_{\infty} ∥ y ∥_{\infty}} = \frac{x_{i} ∥ y ∥_{\infty} - x_{i} ∥ x ∥_{\infty} + x_{i} ∥ x ∥_{\infty} - y_{i} ∥ x ∥_{\infty}}{∥ x ∥_{\infty} ∥ y ∥_{\infty}}$

$\frac{x_{i}}{∥ x ∥_{\infty}} - \frac{y_{i}}{∥ y ∥_{\infty}} = \frac{x_{i} (∥ y ∥_{\infty} - ∥ x ∥_{\infty}) + (x_{i} - y_{i}) ∥ x ∥_{\infty}}{∥ x ∥_{\infty} ∥ y ∥_{\infty}}$

Denote $r := ∥ x - y ∥_{1}$ . Obviously, $∥ x - y ∥_{\infty} \leq r$ and therefore $| x_{i} - y_{i} | \leq r$ and $| ∥ x ∥_{\infty} - ∥ y ∥_{\infty} | \leq r$ . We get

$| \frac{x_{i}}{∥ x ∥_{\infty}} - \frac{y_{i}}{∥ y ∥_{\infty}} | \leq \frac{| x_{i} | r + r ∥ x ∥_{\infty}}{∥ x ∥_{\infty} ∥ y ∥_{\infty}} \leq \frac{∥ x ∥_{\infty} r + r ∥ x ∥_{\infty}}{∥ x ∥_{\infty} ∥ y ∥_{\infty}} \leq \frac{2 r}{∥ y ∥_{\infty}}$

Since $∥ y ∥_{1} = 1$ , $∥ y ∥_{\infty} \geq \frac{1}{d}$ yielding the desired result.
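As a sanity check on Proposition A.1, the following Python sketch evaluates both sides of the inequality for given vectors. The helper names are ours, and a random search of this kind is of course no substitute for the proof.

```python
import random

def norm1(x):
    return sum(abs(v) for v in x)

def norm_inf(x):
    return max(abs(v) for v in x)

def a1_sides(x, y):
    """Return (lhs, rhs) of Proposition A.1 for nonzero vectors x, y of equal length."""
    d = len(x)
    nx_inf, ny_inf = norm_inf(x), norm_inf(y)
    nx_1, ny_1 = norm1(x), norm1(y)
    lhs = norm_inf([xi / nx_inf - yi / ny_inf for xi, yi in zip(x, y)])
    rhs = 2 * d * norm1([xi / nx_1 - yi / ny_1 for xi, yi in zip(x, y)])
    return lhs, rhs

if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.uniform(-1, 1) for _ in range(4)]
    y = [rng.uniform(-1, 1) for _ in range(4)]
    print(a1_sides(x, y))
```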

Proposition A.2

Consider $H$ a finite set, $L : H \to [0, \infty)$ , $ζ \in Δ H$ and $β, ϵ \in (0, \infty)$ . Assume that

i. For any $k \in H$ , $ζ (k) \geq β ϵ$ .

ii. $E_{ζ} [L] \geq ϵ$

Then

$E_{ζ} [e^{- β L}] \leq 1 - (1 - e^{- 1}) β ϵ$

Proof of Proposition A.2

Without loss of generality, we can assume that $E_{ζ} [L] = ϵ$ , because otherwise we can rescale $L$ by a constant in $(0, 1)$ which will only make $E_{ζ} [e^{- β L}]$ larger. It now follows from conditions i+ii that for any $k$ , $β ϵ L (k) \leq ζ (k) L (k) \leq E_{ζ} [L] = ϵ$ , hence $L (k) \leq β^{- 1}$ and therefore $β L (k) \in [0, 1]$ . We have

$L (k) = (1 - β L (k)) \cdot 0 + β L (k) \cdot β^{- 1}$

Since $e^{- β x}$ is a convex function, we get

$e^{- β L (k)} \leq (1 - β L (k)) \cdot e^{- β \cdot 0} + β L (k) \cdot e^{- β \cdot β^{- 1}} = 1 - β L (k) + e^{- 1} β L (k) = 1 - (1 - e^{- 1}) β L (k)$

$E_{ζ} [e^{- β L}] \leq 1 - (1 - e^{- 1}) β E_{ζ} [L] = 1 - (1 - e^{- 1}) β ϵ$
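The bound of Proposition A.2 can be checked numerically on small examples. The Python sketch below (function name ours) first verifies conditions i and ii for the supplied inputs and then returns both sides of the inequality.

```python
import math

def a2_sides(zeta, losses, beta, eps):
    """Return (lhs, rhs) of Proposition A.2 after checking its two conditions.

    zeta:   list of probabilities zeta(k)
    losses: list of non-negative values L(k)
    """
    if min(zeta) < beta * eps:
        raise ValueError("condition i fails: some zeta(k) < beta * eps")
    mean_loss = sum(z * l for z, l in zip(zeta, losses))
    if mean_loss < eps:
        raise ValueError("condition ii fails: E[L] < eps")
    lhs = sum(z * math.exp(-beta * l) for z, l in zip(zeta, losses))
    rhs = 1.0 - (1.0 - math.exp(-1.0)) * beta * eps
    return lhs, rhs

if __name__ == "__main__":
    print(a2_sides([0.25, 0.25, 0.5], [0.5, 0.0, 0.1], beta=2.0, eps=0.1))
```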

Proposition A.3

Consider $H$ and $A$ finite sets, $L : H \times A \to [0, \infty)$ , $ζ \in Δ H$ , $α : H \to Δ A$ and $β, ϵ \in (0, \infty)$ . Assume that

i. For any $k \in H$ , $ζ (k) \geq β ϵ$ .

ii. For any $a \in A$ , $E_{k \sim ζ} [L (k, a)] \geq ϵ$ .

iii. For any $k \in H$ and $a \in A$ , $α (k) (a) \leq \exp [- β L (k, a)] {max}_{b \in A} α (k) (b)$ .

Define $ζ ⋉ α \in Δ (H \times A)$ by $(ζ ⋉ α) (k, a) := ζ (k) α (k) (a)$ . Then, the mutual information $I$ between $k$ and $a$ in the distribution $ζ ⋉ α$ satisfies

$I \geq \frac{(1 - e^{- 1})^{2}}{8 | A |^{2}} β^{2} ϵ^{2}$

Proof of Proposition A.3

Define $\bar{α} \in Δ A$ by $\bar{α} := E_{ζ} [α]$ . We have

$I = E_{ζ} [D_{K L} (α ∥ \bar{α})]$

Applying Pinsker's inequality

$I \geq 2 E_{ζ} [d_{tv} (α, \bar{α})^{2}] = \frac{1}{2} E_{ζ} [∥ α - \bar{α} ∥_{1}^{2}] \geq \frac{1}{2} E_{ζ} [∥ α - \bar{α} ∥_{1}]^{2}$

By Proposition A.1

$I \geq \frac{1}{8 | A |^{2}} E_{ζ} [∥ \frac{α}{∥ α ∥_{\infty}} - \frac{\bar{α}}{∥ \bar{α} ∥_{\infty}} ∥_{\infty}]^{2} \geq \frac{1}{8 | A |^{2}} ∥ E_{ζ} [\frac{α}{∥ α ∥_{\infty}} - \frac{\bar{α}}{∥ \bar{α} ∥_{\infty}}] ∥_{\infty}^{2} = \frac{1}{8 | A |^{2}} ∥ E_{ζ} [\frac{α}{∥ α ∥_{\infty}}] - \frac{\bar{α}}{∥ \bar{α} ∥_{\infty}} ∥_{\infty}^{2}$

$I \geq \frac{1}{8 | A |^{2}} (∥ \frac{\bar{α}}{∥ \bar{α} ∥_{\infty}} ∥_{\infty} - ∥ E_{ζ} [\frac{α}{∥ α ∥_{\infty}}] ∥_{\infty})^{2} = \frac{1}{8 | A |^{2}} (1 - ∥ E_{ζ} [\frac{α}{∥ α ∥_{\infty}}] ∥_{\infty})^{2}$

By condition iii, $α (k) (a) / ∥ α (k) ∥_{\infty} \leq \exp [- β L (k, a)]$ and therefore

$I \geq \frac{1}{8 | A |^{2}} (1 - \max_{a \in A} E_{k \sim ζ} [e^{- β L (k, a)}])^{2}$

Applying Proposition A.2, we get

$I \geq \frac{1}{8 | A |^{2}} (1 - (1 - (1 - e^{- 1}) β ϵ))^{2} = \frac{(1 - e^{- 1})^{2}}{8 | A |^{2}} β^{2} ϵ^{2}$
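Proposition A.3 says that observing the action of a sufficiently rational advisor is informative about the hypothesis. The Python sketch below (all names ours) computes the mutual information of $ζ ⋉ α$ directly and compares it with the lower bound, for a toy example in which two hypotheses disagree about which of two actions is lossless and the advisor is a softmax over negative losses. Conditions i to iii are satisfied by construction in this example and are not checked by the code.

```python
import math

def advisor_row(losses, beta):
    """alpha(k) proportional to exp(-beta * L(k, a)); satisfies condition iii
    with equality when the smallest loss in the row is 0."""
    weights = [math.exp(-beta * l) for l in losses]
    z = sum(weights)
    return [w / z for w in weights]

def mutual_information(zeta, alpha):
    """Mutual information (in nats) between k and a under zeta(k) * alpha[k][a]."""
    n_actions = len(alpha[0])
    marginal = [sum(z * row[a] for z, row in zip(zeta, alpha)) for a in range(n_actions)]
    total = 0.0
    for z, row in zip(zeta, alpha):
        for a, p in enumerate(row):
            if z > 0.0 and p > 0.0:
                total += z * p * math.log(p / marginal[a])
    return total

def a3_bound(n_actions, beta, eps):
    """Lower bound (1 - 1/e)^2 / (8 |A|^2) * beta^2 * eps^2 from Proposition A.3."""
    return (1.0 - math.exp(-1.0)) ** 2 / (8.0 * n_actions ** 2) * beta ** 2 * eps ** 2

if __name__ == "__main__":
    beta, eps = 2.0, 0.25
    losses = [[0.0, 1.0], [1.0, 0.0]]   # L(k, a)
    zeta = [0.5, 0.5]                   # zeta(k) >= beta * eps
    alpha = [advisor_row(row, beta) for row in losses]
    print(mutual_information(zeta, alpha), a3_bound(2, beta, eps))
```

In this example the actual mutual information is far above the bound, which is unsurprising since the proof goes through several lossy inequalities.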

Proposition A.4

Consider $A$ a finite set, $L : A \to [0, \infty)$ and $α \in Δ A$ and $β \in (0, \infty)$ s.t. for any $a \in A$

$α (a) \leq e^{- β L (a)} \max_{b \in A} α (b)$

Then, $E_{α} [L] \leq | A | e^{- 1} β^{- 1}$ .

Proof of Proposition A.4

We have

$E_{α} [L] = \sum_{a \in A} α (a) L (a) \leq \max_{b \in A} α (b) \sum_{a \in A} e^{- β L (a)} L (a) \leq | A | \max_{x \in [0, \infty)} e^{- β x} x$

We compute ${max}_{x \in [0, \infty)} e^{- β x} x$ :

$0 = \frac{d}{d x} (e^{- β x} x) |_{x = x^{*}} = - β e^{- β x^{*}} x^{*} + e^{- β x^{*}}$

$x^{*} = \frac{1}{β}$

$e^{- β x^{*}} x^{*} = \frac{1}{e β}$
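A quick numerical check of Proposition A.4 in Python, under the assumption that the advisor is a softmax over negative losses and that the smallest loss is 0 (so that the hypothesis of the proposition holds with equality). The function name is ours.

```python
import math

def a4_sides(losses, beta):
    """Return (E_alpha[L], |A| / (e * beta)) for alpha(a) proportional to exp(-beta * L(a)).

    Requires min(losses) == 0, so that alpha(a) = exp(-beta * L(a)) * max_b alpha(b).
    """
    if min(losses) != 0.0:
        raise ValueError("this sketch assumes the best action has loss 0")
    weights = [math.exp(-beta * l) for l in losses]
    z = sum(weights)
    expected = sum((w / z) * l for w, l in zip(weights, losses))
    return expected, len(losses) / (math.e * beta)

if __name__ == "__main__":
    print(a4_sides([0.0, 0.1, 0.5, 1.0], beta=3.0))
```

The bound scales as $β^{- 1}$ , which is why a more rational advisor makes each delegation cheaper.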

Proposition A.5

Consider a universe $υ = (μ, r)$ , a policy $π^{0}$ and $t \in (0, \infty)$ . Then,

${EU}_{υ}^{*} (t) - {EU}_{υ}^{π^{0}} (t) = \sum_{n = 0}^{\infty} e^{- n / t} E_{x \sim μ ⋈ π^{0}} [V_{t}^{υ} (x_{: n}) - Q_{t}^{υ} (x_{: n + 1 / 2})]$

Proof of Proposition A.5

For any $x \in (A \times O)^{ω}$ , it is easy to see that

${EU}_{υ}^{*} (t) = V_{t}^{υ} (λ) = \sum_{n = 0}^{\infty} e^{- n / t} (V_{t}^{υ} (x_{: n}) - e^{- 1 / t} V_{t}^{υ} (x_{: n + 1}))$

$U_{t}^{r} (x) = (1 - e^{- 1 / t}) \sum_{n = 0}^{\infty} e^{- n / t} r (x_{: n})$

${EU}_{υ}^{*} (t) - U_{t}^{r} (x) = \sum_{n = 0}^{\infty} e^{- n / t} (V_{t}^{υ} (x_{: n}) - (1 - e^{- 1 / t}) r (x_{: n}) - e^{- 1 / t} V_{t}^{υ} (x_{: n + 1}))$

${EU}_{υ}^{*} (t) - U_{t}^{r} (x) = \sum_{n = 0}^{\infty} e^{- n / t} (V_{t}^{υ} (x_{: n}) - Q_{t}^{υ} (x_{: n + 1 / 2}) + Q_{t}^{υ} (x_{: n + 1 / 2}) - (1 - e^{- 1 / t}) r (x_{: n}) - e^{- 1 / t} V_{t}^{υ} (x_{: n + 1}))$

Taking expected value over $x$ , we get

${EU}_{υ}^{*} (t) - {EU}_{υ}^{π^{0}} (t) = \sum_{n = 0}^{\infty} e^{- n / t} (E_{x \sim μ ⋈ π^{0}} [V_{t}^{υ} (x_{: n}) - Q_{t}^{υ} (x_{: n + 1 / 2})] + E_{x \sim μ ⋈ π^{0}} [Q_{t}^{υ} (x_{: n + 1 / 2}) - (1 - e^{- 1 / t}) r (x_{: n}) - e^{- 1 / t} V_{t}^{υ} (x_{: n + 1})])$

It is easy to see that the second term vanishes, yielding the desired result.

Lemma A

Consider the setting of the Theorem, but assume that $H = \{υ^{k} = (μ^{k}, r^{k})\}_{k < N}$ for some $N \in N$ (i.e. it is finite) and that $α^{k} (t)$ is strictly $β (t)$ -rational for $\bar{υ}^{k} [α^{k}]$ . Denote $ν^{k} := \bar{μ}^{k} [α^{k}]$ and $hdom \bar{H} := \bigcup_{k < N} hdom ν^{k}$ . Denote $ζ^{0} \in Δ [N]$ the uniform probability distribution. For any $t > N^{3}$ , define $ζ_{t}, \tilde{ζ}_{t} : hdom \bar{H} \to Δ [N]$ recursively as follows

$\tilde{ζ}_{t} (λ) := ζ^{0}$

$\tilde{ζ}_{t} (h a o) (k) := \begin{cases} \tilde{Z}_{t} (h)^{- 1} \cdot ζ_{t} (h) (k) \cdot ν^{k} (h a) (o) & \text{if } \exists j : ζ_{t} (h) (j) \cdot ν^{j} (h a) (o) > 0 \\ N^{- 1} & \text{otherwise} \end{cases}$

$ζ_{t} (h) (k) := Z_{t} (h)^{- 1} \tilde{ζ}_{t} (h) (k) [[\tilde{ζ}_{t} (h) (k) > t^{- 1 / 3}]]$

In the above, $Z_{t} (h)$ and $\tilde{Z}_{t} (h)$ are normalization factors chosen to make the probabilities sum to 1. That is, $ζ_{t} (h)$ is obtained by starting from prior $ζ^{0}$ , updating on every observation, and setting to 0 the probability of any universe whose probability drops below $t^{- 1 / 3}$ . When encountering an "impossible" observation we reset to the uniform distribution, but this is arbitrary.
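One step of this truncated belief update can be sketched in Python as follows. The encoding is our assumption: `zeta` is the current truncated belief $ζ_{t} (h)$ as a list, and `likelihoods[k]` stands for $ν^{k} (h a) (o)$ , the probability hypothesis $k$ assigns to the latest observation.

```python
def update_belief(zeta, likelihoods, t):
    """Bayes-update the truncated belief, then truncate again at t^{-1/3}."""
    n = len(zeta)
    joint = [z * l for z, l in zip(zeta, likelihoods)]
    total = sum(joint)
    if total > 0.0:
        tilde = [j / total for j in joint]
    else:
        tilde = [1.0 / n] * n  # "impossible" observation: reset to uniform
    threshold = t ** (-1.0 / 3.0)
    kept = [p if p > threshold else 0.0 for p in tilde]
    z = sum(kept)
    if z == 0.0:
        # Cannot happen when t > n^3, since then the largest entry exceeds 1/n > threshold.
        raise ValueError("all hypotheses were truncated; t is too small")
    return [p / z for p in kept]

if __name__ == "__main__":
    print(update_belief([0.5, 0.5], [0.9, 0.01], t=1000.0))
```

The requirement $t > N^{3}$ is what guarantees that at least one hypothesis survives truncation.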

Define the "loss function" $L_{t} : hdom \bar{H} \times A \to [0, 1]$ by

$L_{t} (h a) := E_{k \sim ζ_{t} (h)} [V_{t}^{υ_{t}^{k}} (\underline{h}) - Q_{t}^{υ_{t}^{k}} (\underline{h} a)]$

Denote $ϵ_{t} := β (t)^{- 1} t^{- 1 / 3}$ . Define the following $\bar{I}$ -metapolicy $π^{*}$ :

$π_{t}^{*} (h) := \begin{cases} \operatorname{argmin}_{a \in A} L_{t} (h a) & \text{if } h \in hdom \bar{H}, \min_{a \in A} L_{t} (h a) < ϵ_{t} \\ ⊥ & \text{if } h \in hdom \bar{H}, \min_{a \in A} L_{t} (h a) \geq ϵ_{t} \\ ⊥ & \text{if } h \notin hdom \bar{H} \end{cases}$

(Technically, we only defined $π_{t}^{*}$ for $t > N^{3}$ , but it doesn't matter.) Then, $π^{*}$ learns $\bar{H}$ .
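The decision rule of $π_{t}^{*}$ at a single history can be sketched in Python as follows. This is an illustration under our own encoding: `belief` is $ζ_{t} (h)$ , `gaps[k][a]` stands for $V_{t}^{υ_{t}^{k}} (\underline{h}) - Q_{t}^{υ_{t}^{k}} (\underline{h} a)$ , `beta_t` is $β (t)$ , and $⊥$ is represented by `None`. The case $h \notin hdom \bar{H}$ is not modeled.

```python
BOT = None  # stands for the delegation action ⊥ (an encoding choice)

def choose_action(belief, gaps, beta_t, t):
    """Act autonomously if some action has expected loss below eps_t, else delegate."""
    eps_t = t ** (-1.0 / 3.0) / beta_t
    actions = list(gaps[0].keys())
    # Expected loss L_t(h a) of each action under the current belief.
    loss = {a: sum(p * g[a] for p, g in zip(belief, gaps)) for a in actions}
    best = min(actions, key=lambda a: loss[a])
    return best if loss[best] < eps_t else BOT

if __name__ == "__main__":
    gaps = [{"x": 0.0, "y": 0.3}, {"x": 0.4, "y": 0.0}]
    print(choose_action([1.0, 0.0], gaps, beta_t=200.0, t=1000.0))  # confident: acts
    print(choose_action([0.5, 0.5], gaps, beta_t=200.0, t=1000.0))  # uncertain: delegates
```

The agent thus delegates exactly when its remaining uncertainty makes every action look costly in expectation, which is also when the advisor's choice is informative.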

Proof of Lemma A

For every $k \in [N]$ , we define $ζ_{t}^{! k}, \tilde{ζ}_{t}^{! k} : hdom \bar{H} \to Δ [N]$ and $S_{t}^{k} : hdom \bar{H} \to 2^{[N]}$ recursively as follows

$\tilde{ζ}_{t}^{! k} (λ) := ζ^{0}$

$\tilde{ζ}_{t}^{! k} (h a o) (i) := \frac{ζ_{t}^{! k} (h) (i) \cdot ν^{i} (h a) (o)}{\sum_{j < N} ζ_{t}^{! k} (h) (j) \cdot ν^{j} (h a) (o)}$

$S_{t}^{k} (h) := \{i \in [N] ∣ \tilde{ζ}_{t}^{! k} (h) (i) > t^{- 1 / 3}\}$

$ζ_{t}^{! k} (h) (i) := [[k, i \in S_{t}^{k} (h)]] \frac{\tilde{ζ}_{t}^{! k} (h) (i)}{\sum_{j \in S_{t}^{k} (h)} \tilde{ζ}_{t}^{! k} (h) (j)} + [[k \notin S_{t}^{k} (h), i = k]]$

That is, $ζ^{! k}$ is a belief state that, besides updating on observations, behaves as if, at each moment of time, if the true universe $k$ is low probability according to current belief state (i.e. $k \notin S_{t}^{k} (h)$ ), then the true universe is "magically" revealed (i.e. $ζ^{! k}$ becomes the Kronecker delta), and otherwise it updates on the true universe not being revealed. Denote

$ρ_{t} := ζ^{0} ⋉ (ν_{t} ⋈ π_{t}^{*}) \in Δ ([N] \times \overline{A \times O}^{ω})$

Here, the dependence of $ν_{t}^{k} ⋈ π_{t}^{*}$ on $k$ is used to view it as a Markov kernel from $[N]$ to $\overline{A \times O}^{ω}$ . It is easy to see that, since the probability of "magic revelation" ever happening is at most $(N - 1) t^{- 1 / 3}$ , we have

$\Pr_{(k, x) \sim ρ_{t}} [\exists n \in N : ζ_{t} (x_{: n}) \neq ζ_{t}^{! k} (x_{: n})] \leq (N - 1) t^{- 1 / 3}$

Let $π^{! k}$ be defined exactly as $π^{*}$ but with $ζ$ replaced by $ζ^{! k}$ . Denote $ρ_{t}^{!} := ζ^{0} ⋉ (ν_{t} ⋈ π_{t}^{!})$ . From the above, we get

$d_{tv} (ρ_{t}, ρ_{t}^{!}) \leq (N - 1) t^{- 1 / 3}$

Given $k \in [N]$ and $h \in hdom ¯ H$ , we define the set $h^{! k}$ by

$h^{! k} := \{(j, x) \in [N] \times \overline{A \times O}^{ω} ∣ h ⊑ x, \forall m \leq | h | : (k \in S_{t}^{k} (h_{: m}) \land j \in S_{t}^{j} (h_{: m})) \lor (k \notin S_{t}^{k} (h_{: m}) \land j = k)\}$

We have

$\Pr_{(j, x) \sim ρ_{t}^{!}} [j = i ∣ h^{! k}] = ζ^{! k} (h) (i)$

It follows that

$E_{(j, x) \sim ρ_{t}^{!}} [H (ζ_{t}^{! j} (x_{: | h | + 1})) ∣ h^{! k}] = H (ζ_{t}^{! k} (h)) - E_{(j, x) \sim ρ_{t}^{!}} [D_{K L} (ζ_{t}^{! j} (x_{: | h | + 1}) ∥ \tilde{ζ}_{t}^{! j} (x_{: | h | + 1})) + D_{K L} (\tilde{ζ}_{t}^{! j} (x_{: | h | + 1}) ∥ ζ_{t}^{! j} (x_{: | h |})) ∣ h^{! k}]$

$E_{(j, x) \sim ρ_{t}^{!}} [H (ζ_{t}^{! j} (x_{: | h | + 1})) ∣ h^{! k}] \leq H (ζ_{t}^{! k} (h)) - E_{(j, x) \sim ρ_{t}^{!}} [D_{K L} (\tilde{ζ}_{t}^{! j} (x_{: | h | + 1}) ∥ ζ_{t}^{! j} (x_{: | h |})) ∣ h^{! k}]$

If $π^{! k} (h) = ⊥$ , we can apply Proposition A.3: condition i follows from the definition of $S_{t}^{k}$ and the observation that $β (t) ϵ_{t} = t^{- 1 / 3}$ , condition ii follows from the fact we are in the second case in the definition of $π^{! k}$ (see definition of $π^{*}$ ) and condition iii follows from the strict $β (t)$ -rationality of $α^{k} (t)$ . We conclude

$E_{(j, x) \sim ρ_{t}^{!}} [H (ζ_{t}^{! j} (x_{: | h | + 1})) ∣ h^{! k}] \leq H (ζ_{t}^{! k} (h)) - \frac{(1 - e^{- 1})^{2}}{8 | A |^{2}} β (t)^{2} ϵ_{t}^{2} [[π^{! k} (h) = ⊥]]$

$E_{(j, x) \sim ρ_{t}^{!}} [H (ζ_{t}^{! j} (x_{: | h | + 1})) ∣ h^{! k}] \leq H (ζ_{t}^{! k} (h)) - \frac{(1 - e^{- 1})^{2}}{8 | A |^{2}} t^{- 2 / 3} [[π^{! k} (h) = ⊥]]$

Taking $ρ_{t}^{!}$ -expected value over $(k, h)$ , we conclude that for any $n \in N$

$E_{(k, x) \sim ρ_{t}^{!}} [H (ζ_{t}^{! k} (x_{: n + 1}))] \leq E_{(k, x) \sim ρ_{t}^{!}} [H (ζ_{t}^{! k} (x_{: n}))] - \frac{(1 - e^{- 1})^{2}}{8 | A |^{2}} t^{- 2 / 3} \Pr_{(k, x) \sim ρ_{t}^{!}} [π^{! k} (x_{: n}) = ⊥]$

By induction, it follows that

$E_{(k, x) \sim ρ_{t}^{!}} [H (ζ_{t}^{! k} (x_{: n}))] \leq H (ζ^{0}) - \frac{(1 - e^{- 1})^{2}}{8 | A |^{2}} t^{- 2 / 3} E_{(k, x) \sim ρ_{t}^{!}} [| \{m \in [n] ∣ x_{m} \in ⊥ \times \bar{O}\} |]$

Since entropy is non-negative and $H (ζ^{0}) = \ln N$ , this gives

$E_{(k, x) \sim ρ_{t}^{!}} [| \{m \in [n] ∣ x_{m} \in ⊥ \times \bar{O}\} |] \leq \frac{8 | A |^{2} \ln N}{(1 - e^{- 1})^{2}} t^{2 / 3}$

Taking $n \to \infty$ , we get

$E_{(k, x) \sim ρ_{t}^{!}} [| \{n \in N ∣ x_{n} \in ⊥ \times \bar{O}\} |] \leq \frac{8 | A |^{2} \ln N}{(1 - e^{- 1})^{2}} t^{2 / 3}$

So, the expected number of delegations is $O (t^{2 / 3})$ . For any $k \in [N]$ , Proposition A.5 yields

$\mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{π^{! k}} (t) = \sum_{n = 0}^{\infty} e^{- n / t} \mathrm{E}_{x \sim ν^{k} ⋈ π^{! k}} [V_{t}^{υ^{k} [α^{k}]} (x_{: n}) - Q_{t}^{υ^{k} [α^{k}]} (x_{: n} π^{! k} (x_{: n}))]$

Averaging over $k$, we get

$\frac{1}{N} \sum_{k < N} (\mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{π^{! k}} (t)) = \sum_{n = 0}^{\infty} e^{- n / t} \mathrm{E}_{(k, x) \sim ρ_{t}^{!}} [V_{t}^{υ^{k} [α^{k}]} (x_{: n}) - Q_{t}^{υ^{k} [α^{k}]} (x_{: n} π^{! k} (x_{: n}))]$

$\frac{1}{N} \sum_{k < N} (\mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{π^{! k}} (t)) = \sum_{n = 0}^{\infty} e^{- n / t} \mathrm{E}_{(k, x) \sim ρ_{t}^{!}} [\mathrm{E}_{(j, y) \sim ρ_{t}^{!}} [V_{t}^{υ^{j} [α^{j}]} (y_{: n}) - Q_{t}^{υ^{j} [α^{j}]} (y_{: n} π^{! j} (y_{: n})) ∣ x_{: n}^{! k}]]$

Using the definitions of $ζ^{! k} (x_{: n})$ and $x_{: n}^{! k}$, and observing that $π^{! j} (x_{: n}) = π^{! k} (x_{: n})$ for $(j, y) \in x_{: n}^{! k}$, we get

$\frac{1}{N} \sum_{k < N} (\mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{π^{! k}} (t)) = \sum_{n = 0}^{\infty} e^{- n / t} \mathrm{E}_{(k, x) \sim ρ_{t}^{!}} [\mathrm{E}_{j \sim ζ^{! k} (x_{: n})} [V_{t}^{υ^{j} [α^{j}]} (x_{: n}) - Q_{t}^{υ^{j} [α^{j}]} (x_{: n} π^{! k} (x_{: n}))]]$

When $π^{! k} (x_{: n}) \neq ⊥$, the expected loss is smaller than $ϵ_{t}$ (by construction of $π^{! k}$). When $π^{! k} (x_{: n}) = ⊥$, the expected loss is at most $| A | e^{- 1} β (t)^{- 1}$, by Proposition A.4. We get

$\frac{1}{N} \sum_{k < N} (\mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{π^{! k}} (t)) \leq \sum_{n = 0}^{\infty} e^{- n / t} \mathrm{E}_{(k, x) \sim ρ_{t}^{!}} [[[π^{! k} (x_{: n}) \neq ⊥]] ϵ_{t} + [[π^{! k} (x_{: n}) = ⊥]] | A | e^{- 1} β (t)^{- 1}]$

The contribution of the first term on the right hand side can be bounded using the observation

$\sum_{n = 0}^{\infty} e^{- n / t} = 1 + \sum_{n = 1}^{\infty} e^{- n / t} \leq 1 + \sum_{n = 1}^{\infty} \int_{n - 1}^{n} e^{- s / t} d s = 1 + \int_{0}^{\infty} e^{- s / t} d s = 1 + t$
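The bound on the discounted horizon can be sanity-checked numerically. The following is a minimal sketch, not part of the proof; the function name `discounted_horizon` is hypothetical, and it relies only on the closed form of a geometric series with ratio $e^{- 1 / t}$.

```python
import math

def discounted_horizon(t: float) -> float:
    """Closed form of sum_{n >= 0} exp(-n / t), a geometric series with ratio exp(-1 / t)."""
    return 1.0 / (1.0 - math.exp(-1.0 / t))

# Compare the exact sum with the bound 1 + t for a few discount parameters.
for t in (0.5, 1.0, 10.0, 1000.0):
    total = discounted_horizon(t)
    print(f"t={t:>7}: sum={total:.4f}, bound 1+t={1 + t:.4f}")
```

For large $t$ the exact sum is close to $t$, which is why the first term of the bound ends up scaling like $t ϵ_{t}$.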

The contribution of the second term on the right hand side can be bounded via the expected number of delegations. We get

$\frac{1}{N} \sum_{k < N} (\mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{π^{! k}} (t)) \leq (1 + t) β (t)^{- 1} t^{- 1 / 3} + \frac{8 | A |^{2} \ln N}{(1 - e^{- 1})^{2}} t^{2 / 3} | A | e^{- 1} β (t)^{- 1}$

$\frac{1}{N} \sum_{k < N} (\mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{π^{! k}} (t)) \leq (\frac{1}{t} + 1 + \frac{8 | A |^{3} \ln N}{e (1 - e^{- 1})^{2}}) \frac{t^{2 / 3}}{β (t)}$

We use the relationship between $ρ_{t}$ and $ρ_{t}^{!}$ on the second term on the left hand side (whose contribution is the $ρ_{t}^{!}$ -expected value of the utility function, whose range lies in [0,1]), and get

$\frac{1}{N} \sum_{k < N} (\mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{π^{*}} (t)) \leq (\frac{1}{t} + 1 + \frac{8 | A |^{3} \ln N}{e (1 - e^{- 1})^{2}}) \frac{t^{2 / 3}}{β (t)} + \frac{N - 1}{t^{1 / 3}}$

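Using the assumption $β (t) = ω (t^{2 / 3})$, both terms on the right hand side vanish as $t \to \infty$. Since every summand on the left hand side is non-negative, we conclude that for each $k \in [N]$

$\lim_{t \to \infty} (\mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{k} [α^{k}]}^{π^{*}} (t)) = 0$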
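To get a feel for how slowly this bound takes effect, its right hand side can be evaluated numerically. The sketch below is illustrative only: the function name `lemma_regret_bound`, the schedule $β (t) = t^{0.9}$, and the values $| A | = 3$, $N = 6$ are assumptions made for the example and do not appear in the proof.

```python
import math

def lemma_regret_bound(t: float, num_actions: int, num_hypotheses: int, beta: float) -> float:
    """Right hand side of the averaged regret bound:
    (1/t + 1 + 8|A|^3 ln N / (e (1 - 1/e)^2)) * t^(2/3) / beta + (N - 1) / t^(1/3)."""
    delegation_const = (8 * num_actions ** 3 * math.log(num_hypotheses)
                        / (math.e * (1 - math.exp(-1)) ** 2))
    first = (1 / t + 1 + delegation_const) * t ** (2 / 3) / beta
    second = (num_hypotheses - 1) / t ** (1 / 3)
    return first + second

for t in (1e6, 1e12, 1e18):
    # Hypothetical rationality schedule beta(t) = t^0.9, which is omega(t^(2/3)).
    print(f"t={t:.0e}: bound={lemma_regret_bound(t, 3, 6, t ** 0.9):.4f}")
```

With these arbitrary choices the bound exceeds 1 at $t = 10^{6}$ (so it says nothing, since utilities lie in $[0, 1]$) and only becomes small for much larger $t$. The guarantee is asymptotic, and the constants are not optimized.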

Proposition A.6

Let $υ = (μ, r)$ be a meta-universe, $\{C_{t} \subseteq (A \times O)^{*}\}_{t \in (0, \infty)}$ and $π^{0}$ a metapolicy. Then, for all $t \in (0, \infty)$

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) \leq \mathrm{EU}_{υ}^{π^{0}} (t)$

Proof of Proposition A.6

Denote $χ_{t} := χ_{C_{t}}$ the characteristic function of $C_{t}$ . For notational convenience, the expression $χ_{t} (x_{: - 1})$ will be understood to mean 0.

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) = (1 - e^{- \frac{1}{t}}) \mathrm{E}_{x \sim μ ⋈ π^{0}} [\sum_{n = 0}^{\infty} e^{- \frac{n}{t}} (\sum_{m = 0}^{n} (1 - χ_{t} (x_{: m - 1})) χ_{t} (x_{: m}) V_{t}^{- υ} (x_{: m}) + (1 - χ_{t} (x_{: n})) r_{t} (x_{: n}))]$

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) \leq (1 - e^{- \frac{1}{t}}) \mathrm{E}_{x \sim μ ⋈ π^{0}} [\sum_{n = 0}^{\infty} e^{- \frac{n}{t}} (\sum_{m = 0}^{n} (1 - χ_{t} (x_{: m - 1})) χ_{t} (x_{: m}) \mathrm{E}_{y \sim μ ⋈ π^{0}} [\frac{\sum_{l = m}^{\infty} e^{- \frac{l}{t}} r (y_{: l})}{\sum_{l = m}^{\infty} e^{- \frac{l}{t}}} ∣ x_{: m}] + (1 - χ_{t} (x_{: n})) r_{t} (x_{: n}))]$

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) \leq (1 - e^{- \frac{1}{t}}) \mathrm{E}_{x \sim μ ⋈ π^{0}} [\sum_{n = 0}^{\infty} e^{- \frac{n}{t}} (\sum_{m = 0}^{n} (1 - χ_{t} (x_{: m - 1})) χ_{t} (x_{: m}) \frac{\sum_{l = m}^{\infty} e^{- \frac{l}{t}} r (x_{: l})}{\sum_{l = m}^{\infty} e^{- \frac{l}{t}}} + (1 - χ_{t} (x_{: n})) r_{t} (x_{: n}))]$

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) \leq (1 - e^{- \frac{1}{t}}) \mathrm{E}_{x \sim μ ⋈ π^{0}} [\sum_{n = 0}^{\infty} e^{- \frac{n}{t}} ((1 - e^{- \frac{1}{t}}) \sum_{m = 0}^{n} (1 - χ_{t} (x_{: m - 1})) χ_{t} (x_{: m}) \sum_{l = m}^{\infty} e^{- \frac{l - m}{t}} r (x_{: l}) + (1 - χ_{t} (x_{: n})) r_{t} (x_{: n}))]$

Regrouping the sum to collect the terms $r (x_{: n})$ for the same value of $n$, we get

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) \leq (1 - e^{- \frac{1}{t}}) \mathrm{E}_{x \sim μ ⋈ π^{0}} [\sum_{n = 0}^{\infty} ((1 - e^{- \frac{1}{t}}) \sum_{m = 0}^{n} \sum_{l = m}^{\infty} e^{- \frac{l}{t}} (1 - χ_{t} (x_{: m - 1})) χ_{t} (x_{: m}) e^{- \frac{n - m}{t}} + e^{- \frac{n}{t}} (1 - χ_{t} (x_{: n}))) r (x_{: n})]$

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) \leq (1 - e^{- \frac{1}{t}}) \mathrm{E}_{x \sim μ ⋈ π^{0}} [\sum_{n = 0}^{\infty} e^{- \frac{n}{t}} ((1 - e^{- \frac{1}{t}}) \sum_{m = 0}^{n} \frac{e^{- \frac{m}{t}}}{1 - e^{- \frac{1}{t}}} (1 - χ_{t} (x_{: m - 1})) χ_{t} (x_{: m}) e^{\frac{m}{t}} + 1 - χ_{t} (x_{: n})) r (x_{: n})]$

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) \leq (1 - e^{- \frac{1}{t}}) \mathrm{E}_{x \sim μ ⋈ π^{0}} [\sum_{n = 0}^{\infty} e^{- \frac{n}{t}} (\sum_{m = 0}^{n} (1 - χ_{t} (x_{: m - 1})) χ_{t} (x_{: m}) + 1 - χ_{t} (x_{: n})) r (x_{: n})]$

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) \leq (1 - e^{- \frac{1}{t}}) \mathrm{E}_{x \sim μ ⋈ π^{0}} [\sum_{n = 0}^{\infty} e^{- \frac{n}{t}} (χ_{t} (x_{: n}) + 1 - χ_{t} (x_{: n})) r (x_{: n})]$

$\mathrm{EU}_{υ^{C}}^{π^{0}} (t) \leq (1 - e^{- \frac{1}{t}}) \mathrm{E}_{x \sim μ ⋈ π^{0}} [\sum_{n = 0}^{\infty} e^{- \frac{n}{t}} r (x_{: n})] = \mathrm{EU}_{υ}^{π^{0}} (t)$

Proposition A.7

Let $υ$ be a meta-universe and $C$ a $υ$ -avoidable event. Suppose $π^{!}$ is a metapolicy that learns $υ^{C}$ . Then, $π^{!}$ learns $υ$ as well.

Proof of Proposition A.7

It is easy to see that for any policy $π$ and $t \in (0, \infty)$

$\mathrm{EU}_{υ^{C}}^{π^{*}} (t) \geq (1 - \Pr_{x \sim μ_{t} ⋈ π_{t}^{*}} [\exists n \leq D (t) : x_{: n} \in C_{t}]) (\mathrm{EU}_{υ}^{π^{*}} (t) - e^{- D (t) / t})$

Let $π^{*}$ be as in Definition 4. Proposition A.6 and the above imply that

$\lim_{t \to \infty} (\mathrm{EU}_{υ}^{π^{*}} (t) - \mathrm{EU}_{υ^{C}}^{π^{*}} (t)) = 0$

Using condition i, we get

$\lim_{t \to \infty} (\mathrm{EU}_{υ}^{*} (t) - \mathrm{EU}_{υ^{C}}^{π^{*}} (t)) = 0$

By Proposition A.6, $\mathrm{EU}_{υ}^{*} (t) \geq \mathrm{EU}_{υ^{C}}^{*} (t)$, therefore

$\lim_{t \to \infty} (\mathrm{EU}_{υ^{C}}^{*} (t) - \mathrm{EU}_{υ^{C}}^{π^{*}} (t)) = 0$

Since $π^{!}$ learns $υ^{C}$, it follows that

$\lim_{t \to \infty} (\mathrm{EU}_{υ^{C}}^{π^{!}} (t) - \mathrm{EU}_{υ^{C}}^{π^{*}} (t)) = 0$

$\lim_{t \to \infty} (\mathrm{EU}_{υ^{C}}^{π^{!}} (t) - \mathrm{EU}_{υ}^{*} (t)) = 0$

By Proposition A.6, $\mathrm{EU}_{υ}^{π^{!}} (t) \geq \mathrm{EU}_{υ^{C}}^{π^{!}} (t)$, therefore

$\lim_{t \to \infty} (\mathrm{EU}_{υ}^{π^{!}} (t) - \mathrm{EU}_{υ}^{*} (t)) = 0$

Proof of Theorem

For any $k \in \mathbb{N}$, let $C^{k}$ and $α^{k *}$ be for $α^{k}$ as in Definition 5. Denote $ψ^{k} := (υ^{k})^{C^{k}}$ and

$\bar{H}^{C} := \{\bar{ψ}^{k} [α^{k *}] ∣ k \in \mathbb{N}\}$

By Lemma A and Proposition 2, $\bar{H}^{C}$ is learnable. Let $π^{*}$ be a metapolicy that learns it. For any $k \in \mathbb{N}$, $π^{*}$ learns $\bar{ψ}^{k} [α^{k *}]$. It is easy to see that for any policy $π$, $\mathrm{EU}_{\bar{ψ}^{k} [α^{k *}]}^{π} (t) = \mathrm{EU}_{\bar{ψ}^{k} [α^{k}]}^{π} (t)$ ($ψ^{k}$ is defined s.t. once the corruption occurs, the reward is constant, so the advisor has no effect after this point), therefore $π^{*}$ learns $\bar{ψ}^{k} [α^{k}]$. By Proposition A.7, we conclude that $π^{*}$ learns $\bar{υ}^{k} [α^{k}]$.

Proof of Corollary 1

Denote $υ^{K} := (\mathrm{E}_{k \sim ξ^{K}} [μ^{k}], r^{K})$. By the Theorem, $\{\bar{υ}^{K} [α^{K}]\}_{K \in \mathbb{N}}$ is learnable. Denote $ψ^{k K} := (μ^{k}, r^{K})$. It is easy to see that for any $\bar{I}$-policy $π$ and $t \in (0, \infty)$,

$\mathrm{E}_{k \sim ξ^{K}} [\mathrm{EU}_{ψ^{k K} [α^{K}]}^{π} (t)] = \mathrm{EU}_{\bar{υ}^{K} [α^{K}]}^{π} (t)$

By the calibration condition

$\mathrm{E}_{(K, k) \sim ζ} [\mathrm{EU}_{ψ^{k K} [α^{K}]}^{π} (t)] = \mathrm{E}_{(K, k) \sim ζ} [\mathrm{E}_{j \sim ξ^{K}} [\mathrm{EU}_{ψ^{j K} [α^{K}]}^{π} (t)]] = \mathrm{E}_{(K, k) \sim ζ} [\mathrm{EU}_{\bar{υ}^{K} [α^{K}]}^{π} (t)]$

Applying Proposition 1, we conclude that $π^{ζ}$ learns $\{\bar{υ}^{K} [α^{K}]\}_{K \in \mathbb{N}}$. That is, for any $K \in \mathbb{N}$

$\lim_{t \to \infty} (\mathrm{EU}_{\bar{υ}^{K} [α^{K}]}^{*} (t) - \mathrm{EU}_{\bar{υ}^{K} [α^{K}]}^{π^{ζ}} (t)) = 0$

$\lim_{t \to \infty} (\mathrm{EU}_{υ^{K}}^{*} (t) - \mathrm{EU}_{\bar{υ}^{K} [α^{K}]}^{π^{ζ}} (t)) = 0$

$\lim_{t \to \infty} (\max_{π \in Π_{I}} \mathrm{E}_{k \sim ξ^{K}} [\mathrm{EU}_{ψ^{k K}}^{π} (t)] - \mathrm{E}_{k \sim ξ^{K}} [\mathrm{EU}_{\bar{ψ}^{k K} [α^{K}]}^{π^{ζ}} (t)]) = 0$

Proof of Corollary 2

Immediate from Corollary 1 and Proposition A.0.

Diffractor (7y)

A summary that might be informative to other people: Where does the $ω (t^{2 / 3})$ requirement on the growth rate of the "rationality parameter" $β$ come from?

Well, the expected loss of the agent comes from two sources: making a suboptimal choice on its own, and incurring a loss from consulting a not-fully-rational advisor. The policy of the agent is basically "defer to the advisor when the expected loss over all time of acting (relative to the optimal move by an agent who knew the true environment) is too high". Too high, in this case, cashes out as "higher than $β (t)^{- 1} t^{- 1 / x}$", where $t$ is the time discount parameter and $β$ is the level-of-rationality parameter. Note that as the advisor gets more rational, the agent gets less reluctant about deferring. Also note that $t$ is reversed from what you might think: high values of $t$ mean that the agent has a very distant planning horizon, while low values mean the agent is more present-oriented.

On most rounds, the agent acts on its own, so the expected all-time loss on a single round from taking suboptimal choices is on the order of $β (t)^{- 1} t^{- 1 / x}$ , and also we're summing up over about t rounds (technically exponential discount, but they're similar enough). So the loss from acting on its own ends up being about $β (t)^{- 1} t^{(x - 1) / x}$ .

On the other hand, delegation will happen on at most ~ $t^{2 / x}$ rounds, with a loss of $β (t)^{- 1}$ value, so the loss from delegation ends up being around $β (t)^{- 1} t^{2 / x}$ .

Setting these two losses equal to each other/minimizing the exponent on the t when they are smooshed together gets you x=3. And then $β (t)$ must grow asymptotically faster than $t^{2 / 3}$ to have the loss shrink to 0. So that's basically where the 2/3 comes from, it comes from setting the delegation threshold to equalize long-term losses from the AI acting on its own, and the human picking bad choices, as the time horizon t goes to infinity.
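The balancing step can be checked mechanically. The sketch below is an illustration only (the helper name `worst_exponent` and the search grid are assumptions): it treats the two losses as $β (t)^{- 1} t^{(x - 1) / x}$ and $β (t)^{- 1} t^{2 / x}$ and searches for the $x$ that minimizes the larger exponent of $t$.

```python
from fractions import Fraction

def worst_exponent(x: Fraction) -> Fraction:
    """Larger of the two exponents of t (ignoring the common 1/beta(t) factor)."""
    own_action_exponent = (x - 1) / x   # loss from acting alone: t^((x-1)/x)
    delegation_exponent = 2 / x         # loss from delegating:   t^(2/x)
    return max(own_action_exponent, delegation_exponent)

# Search x in {1.0, 1.1, ..., 10.0} using exact rational arithmetic.
candidates = [Fraction(n, 10) for n in range(10, 101)]
best = min(candidates, key=worst_exponent)
print(best, worst_exponent(best))  # prints: 3 2/3
```

The minimum sits where the two exponents cross, $(x - 1) / x = 2 / x$, i.e. $x = 3$, which leaves a total loss of order $t^{2 / 3} / β (t)$.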

Diffractor (7y)

I don't believe that $x_{: n}^{! k}$ was defined anywhere, but we "use the definition" in the proof of Lemma A.

As far as I can tell, it's a set of (j,y) pairs, where j is the index of a hypothesis, and y is an infinite history string, rather like the set $h^{! k}$ .

How do the definitions of $h^{! k}$ and $x_{: n}^{! k}$ differ?

Vanessa Kosoy (7y)

Hi Alex!

The definition of $h^{! k}$ makes sense for any $h$ , that is, the superscript $! k$ in this context is a mapping from finite histories to sets of pairs as you said. In the line in question we just apply this mapping to $x_{: n}$ where $x$ is a bound variable coming from the expected value.

I hope this helps?

By the stated definitions, "v-avoidable event" is pretty much trivial when the event doesn't lead to lasting utility loss. The conditions on "v-avoidable event" are basically:

The agent's policy converges to optimality.

There's a sublinear function D(t) where the agent avoids the event with probability 1 for D(t) time, in the limit as t goes to infinity.

By this definition, "getting hit in the face with a brick before round 3" is an avoidable event, even when the sequence of policies leads to the agent getting hit in the face with a brick on round 2 with certainty and it's possible to dodge it. Let the sublinear function be the constant 1, and let the sequence of policies converge to "dodge" on round 1 and "stay" on round 2, and let the brick incur sublinear utility loss.

This fulfills the conditions, so getting hit in the face with a brick before timestep 3 is a "v-avoidable" event despite certainly occurring. Thus, this condition is only meaningful about lasting failures that incur enough utility loss to prevent convergence to the optimal policy.

Tom Everitt (7y)

"The use of an advisor allows us to kill two birds with one stone: learning the reward function and safe exploration (i.e. avoiding both the Scylla of “Bayesian paranoia” and the Charybdis of falling into traps)."

This sounds quite nice. But how is it possible to achieve this if the advisor is a soft-maximiser? Doesn't that mean that there is a positive probability that the advisor falls into the trap?

Hi Tom!

There is a positive probability that the advisor falls into the trap, but this probability goes to $0$ as the time discount parameter $t$ goes to $\infty$ (which is the limit I study here). This follows from the condition $β (t) = ω (t^{2 / 3})$ in the Theorem. To give a simple example, suppose that $A = \{0, 1, 2\}$ and the environment is s.t.:

When you take action 0, you fall into a trap and get reward 0 forever.
When you take action 1, you get reward 0 for the current round and remain in the same state.
When you take action 2, you get reward 1 for the current round (unless you are in the trap) and remain in the same state.

In this case, our advisor would have to take action 0 with probability $\exp (- ω (t^{2 / 3}))$ and action 2 has to be more probable than action 1 by a factor of $\exp (ω (t^{- 1 / 3})) \approx 1 + ω (t^{- 1 / 3})$.
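These orders of magnitude can be illustrated with a concrete soft maximizer. The sketch below rests on assumptions that are not fixed by the text: the advisor is taken to be an exact Boltzmann distribution $P (a) \propto \exp (- β (t) \cdot \text{loss} (a))$, the schedule is $β (t) = t^{0.7}$, and the function name `advisor_probabilities` is hypothetical. Losses are measured in normalized discounted utility, so falling into the trap costs $1$ and wasting one round on action 1 costs $1 - e^{- 1 / t}$.

```python
import math

def advisor_probabilities(t: float, beta: float) -> list[float]:
    """Boltzmann advisor over actions 0 (trap), 1 (wasted round), 2 (optimal)."""
    # Loss V - Q of each action in the non-trap state (assumed normalization to [0, 1]).
    losses = [1.0, 1.0 - math.exp(-1.0 / t), 0.0]
    weights = [math.exp(-beta * loss) for loss in losses]
    total = sum(weights)
    return [w / total for w in weights]

for t in (1e2, 1e4, 1e6):
    beta = t ** 0.7  # hypothetical schedule, omega(t^(2/3))
    p_trap, p_wait, p_good = advisor_probabilities(t, beta)
    print(f"t={t:.0e}: P(trap)={p_trap:.3e}, P(2)/P(1)={p_good / p_wait:.4f}")
```

Under these assumptions the trap probability collapses very quickly as $t$ grows, while the ratio between actions 2 and 1 drifts towards $1$, matching the two rates quoted above.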

Hi Vanessa!

So basically the advisor will be increasingly careful as the cost of falling into the trap goes to infinity? Makes sense I guess.

What is the incentive for the agent not to always let the advisor choose? Is there always some probability that the advisor saves them from infinite loss, or only in certain situations that can be detected by the agent?

If the agent always delegates to the advisor, it loses a large fraction of the value. Returning again to the simple example above, the advisor on its own is only guaranteed to get expected utility $1 / 2 + ω (t^{- 1 / 3})$ (because it often takes the suboptimal action 1). On the other hand, for any prior over a countable set of environments that includes this one, the corresponding DIRL agent gets expected utility $1 - o (1)$ on this environment (because it will learn to only take action 2). You can also add an external penalty for each delegation, adjusting the proof is straightforward.
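The gap between "always delegate" and the optimum can be made concrete in the toy environment. As before, this is a sketch under assumptions: a Boltzmann advisor with the hypothetical schedule $β (t) = t^{0.7}$, and a helper name `advisor_only_utility` invented for the example. At round $n$ the agent is still outside the trap with probability $(1 - p_{0})^{n}$ and then earns reward 1 with probability $p_{2}$, which gives a geometric series.

```python
import math

def advisor_only_utility(t: float, beta: float) -> float:
    """Normalized discounted utility of delegating on every round of the toy environment."""
    gamma = math.exp(-1.0 / t)            # per-round discount factor
    losses = [1.0, 1.0 - gamma, 0.0]      # trap, wasted round, optimal action
    weights = [math.exp(-beta * loss) for loss in losses]
    p_trap, _, p_good = (w / sum(weights) for w in weights)
    # (1 - gamma) * sum_n gamma^n (1 - p_trap)^n p_good, summed in closed form.
    return (1.0 - gamma) * p_good / (1.0 - gamma * (1.0 - p_trap))

for t in (1e2, 1e4, 1e6):
    print(f"t={t:.0e}: advisor-only utility={advisor_only_utility(t, t ** 0.7):.4f}")
```

With this schedule the advisor-only utility stays only slightly above $1 / 2$ for moderate $t$, whereas an agent that has identified action 2 gets utility close to $1$.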

So, the agent has to exercise judgement about whether to delegate, using its prior + past observations. For example, the policy I construct in Lemma A delegates iff there is no action whose expected loss (according to current beliefs) is less than $β (t)^{- 1} t^{- 1 / 3}$ .

So this requires the agent's prior to incorporate information about which states are potentially risky?

Because if there is always some probability of there being a risky action (with infinitely negative value), then regardless how small the probability is and how large the penalty is for asking, the agent will always be better off asking.

(Did you see Owain Evans' recent paper about trying to teach the agent to detect risky states?)

The only assumptions about the prior are that it is supported on a countable set of hypotheses, and that in each hypothesis the advisor is $β$ -rational (for some fixed $β (t) = ω (t^{2 / 3})$ ).

There is no such thing as infinitely negative value in this framework. The utility function is bounded because of the geometric time discount (and because the momentary rewards are assumed to be bounded), and in fact I normalize it to lie in $[0, 1]$ (see the equation defining $U$ in the beginning of the Results section).

Falling into a trap is an event associated with $Ω (1)$ loss (i.e. loss that remains constant as $t$ goes to $\infty$ ). Therefore, we can risk such an event, as long as the probability is $o (1)$ (i.e. goes to $0$ as $t$ goes to $\infty$ ). This means that as $t$ grows, the agent will spend more rounds delegating to the advisor, but for any given $t$ , it won't delegate on most rounds (even on most of the important rounds, i.e. during the first $O (t)$ -length "horizon"). In fact, you can see in the proof of Lemma A, that the policy I construct delegates on $O (t^{2 / 3})$ rounds.

As a simple example, consider again the toy environment from before. Consider also the environments you get from it by applying a permutation to the set of actions $A$ . Thus, you get a hypothesis class of 6 environments. Then, the corresponding DIRL agent will spend $O (t^{2 / 3})$ rounds delegating, observe which action is chosen by the advisor most frequently, and perform this action forevermore. (The phenomenon that all delegations happen in the beginning is specific to this toy example, because it only has 1 non-trap state.)

If you mean this paper, I saw it?

My confusion is the following:

Premises (*) and inferences (=>):

The primary way for the agent to avoid traps is to delegate to a soft-maximiser.
Any action with boundedly negative utility, a soft-maximiser will take with positive probability.
Actions leading to traps do not have infinitely negative utility.

=> The agent will fall into traps with positive probability.

If the agent falls into a trap with positive probability, then it will have linear regret.

=> The agent will have linear regret.

So when you say in the beginning of the post "a Bayesian DIRL agent is guaranteed to attain most of the value", you must mean that in a different sense than a regret sense?

Vanessa Kosoy (7y)

Your confusion is because you are thinking about regret in an anytime setting. In an anytime setting, there is a fixed policy $π$ , we measure the expected reward of $π$ over a time interval $t$ and compare it to the optimal expected reward over the same time interval. If $π$ has probability $p > 0$ to walk into a trap, regret has the linear lower bound $Ω (p t)$ .

On the other hand, I am talking about policies $π_{t}$ that explicitly depend on the parameter $t$ (I call this a "metapolicy"). Both the advisor and the agent policies are like that. As $t$ goes to $\infty$, the probability $p (t)$ to walk into a trap goes to $0$, so $p (t) t$ is a sublinear function.

A second difference with the usual definition of regret is that I use an infinite sum of rewards with geometric time discount $e^{- 1 / t}$ instead of a step function time discount that cuts off at $t$ . However, this second difference is entirely inessential, and all the theorems work about the same with step function time discount.