Yesterday, I gave a definition of "decision theory" and "fair decision problem" in the context of provability logic, and gave a formal version of drnickbone's argument that no decision theory can perform optimally on every "fair" decision problem. Today, I prove that for any provably "fair" decision problem →P(→u), there is a sound extension of PA by a finite set of formulas, expressible in the language of provability logic, such that Vladimir Slepnev's modal logic version of UDT performs optimally on →P(→u) if we have it search for proofs in this extended proof system.
This is a version of a result that Kenny Easwaran, Nate Soares and I proved in the context of UDT with bounded proof length when Kenny visited MIRI for two days earlier this year. (I'll post about the bounded proof length version some other time; it's more complicated, as you'd expect.)
Prerequisites: A primer on provability logic, "Evil" decision problems in provability logic.
Provability in extensions of PA
In order to carry out the idea above, we need to talk about provability in extensions of PA, not just about the provability in PA expressed by the □(⋅) operator of Gödel-Löb provability logic. Fortunately, the systems we'll need in this post only extend PA by a finite number of axioms which can be expressed as closed formulas in the language of GL, and there's a simple trick to reason about such extensions in GL itself:
Suppose that we extend PA by a single axiom, φ. Then, by the deduction theorem for first-order logic, PA+φ⊢ψ if and only if PA⊢φ→ψ. Thus, if both φ and ψ can be expressed as modal formulas, then we can express the proposition that ψ is provable in PA+φ by the modal formula □(φ→ψ). The same trick works, of course, if we want to add finitely many different modal formulas to PA, since we can simply let φ be their conjunction.
In this post, when I'm using this trick to talk about PA+φ, I'll emphasize this fact by abbreviating □(φ→ψ) to □φ(ψ) (since we often subscript □ by the proof system when it's not clear which system we are referring to).
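As a toy illustration of this bookkeeping (not part of the post's formal machinery; the tuple encoding and all function names are hypothetical), modal formulas can be represented as nested tuples, with the □φ(⋅) abbreviation unfolding exactly as just described:

```python
# Modal formulas as nested tuples; "Box" is PA-provability, as in GL.
def box(psi):
    return ("Box", psi)

def implies(a, b):
    return ("->", a, b)

def conj(formulas):
    """Finitely many extra axioms collapse into one: their conjunction."""
    phi = formulas[0]
    for extra in formulas[1:]:
        phi = ("&", phi, extra)
    return phi

def box_sub(phi, psi):
    """Provability in PA+phi via the deduction theorem:
    Box_phi(psi) is shorthand for Box(phi -> psi)."""
    return box(implies(phi, psi))
```

So `box_sub(phi, psi)` never leaves the language of GL: it is literally the formula □(φ→ψ) in disguise.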
Modal UDT
The version of modal UDT I'm using in this article is equivalent to the one in Vladimir's article, except for using □φ(⋅) instead of □(⋅), for an appropriate φ. Nevertheless, let me give a quick recap, which puts a slightly different spin on the system, and check that this does indeed meet the specification of a "modal decision theory".
As a modal decision theory, m-action, n-preference level modal UDT is a sequence →UDT(→u) of m formulas in n parameters, such that UDTi(→U) is true if and only if UDT takes action i in the modal universe →U. Let's write →UDT(φ)(→u) for the version which uses provability in PA+φ.
I think the easiest way to specify →UDT(φ)(→u) is to say that it's a sequence of modal formulas which describe the behavior of the following algorithm (which needs a halting oracle):

For each j=1,…,n:
  For each i=1,…,m:
    If it's provable in PA+φ that "I take action i" implies "I obtain outcome j", then return i.
If still here, return 1.
In other words, UDT starts with the most preferred outcome, and tries to find an action that provably implies that outcome; if there is such an action, it returns it, breaking ties lexicographically. If there isn't, it tries the second-most-preferred outcome, and so on down to the least preferred one. If no actions imply any outcomes, it returns an arbitrary default action.
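The search loop above can be sketched in Python. The halting oracle is stood in for by a caller-supplied predicate `provable(i, j)`; that predicate, and the explicit action/outcome map used in the toy example, are assumptions of this sketch, not anything computable in general:

```python
def modal_udt(m, n, provable):
    """m-action, n-preference-level modal UDT's proof search.

    Outcomes are numbered 1..n with 1 most preferred; provable(i, j)
    stands in for the oracle answering whether PA+phi proves
    "I take action i" -> "I obtain outcome j".
    """
    for j in range(1, n + 1):          # most preferred outcome first
        for i in range(1, m + 1):      # break ties lexicographically
            if provable(i, j):
                return i
    return 1                           # default action

# Toy oracle: the proof system settles the full action/outcome map,
# here action i -> outcome f[i].
f = {1: 3, 2: 1, 3: 2}
best_action = modal_udt(3, 3, lambda i, j: f[i] == j)   # chooses action 2
```

With this oracle, outcome 1 is provably reachable via action 2, so the search stops there; with an oracle that proves nothing, the loop falls through to the default action 1.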
To actually write out UDT(φ)i, we need to find a formula that describes when the above algorithm returns the action i. Clearly, for actions other than the default action, the condition is that there's some outcome j such that (a) it's provable that i implies j; (b) there's no action i′<i which provably implies j; (c) and there isn't any outcome j′<j which is provably implied by any action i′. The default action 1 is taken if the condition above holds or if no action provably implies any outcome.
Let's agree to order pairs (j,i) lexicographically, i.e., (j,i)<(j′,i′) if j<j′, or j=j′ and i<i′. Then, we can write UDT(φ)i(→u) as follows, for i>1:

UDT(φ)i(→u) :≡ ⋁_{j=1}^{n} ( □φ(UDT(φ)i(→u)→uj) ∧ ⋀_{(j′,i′)<(j,i)} ¬□φ(UDT(φ)i′(→u)→uj′) ),

and for i=1:

UDT(φ)1(→u) :≡ ⋁_{j=1}^{n} ( □φ(UDT(φ)1(→u)→uj) ∧ ⋀_{(j′,i′)<(j,1)} ¬□φ(UDT(φ)i′(→u)→uj′) ) ∨ ⋀_{(j′,i′)=(1,1)}^{(n,m)} ¬□φ(UDT(φ)i′(→u)→uj′).

From this definition, it's straightforward to see that →UDT(φ) is provably mutually exclusive and exhaustive (p.m.e.e.). We work in GL and distinguish the following cases: either there is no pair (j,i) such that □φ(UDT(φ)i(→u)→uj), or some pair (j,i) is the first one. If there is no such pair, then UDT(φ)1(→u) is true, and all other formulas in the sequence are false. If (j,i) is the first such pair, then UDT(φ)i(→u) is true, and all the other formulas are false.
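This case analysis can be checked mechanically for small m and n: treat the oracle's answers as an arbitrary set S of pairs (j,i) for which □φ(UDT(φ)i(→u)→uj) holds, evaluate each formula's defining condition, and confirm that exactly one holds. This is only a finite sanity check of the propositional structure, not the GL proof itself, and all names are hypothetical:

```python
from itertools import product

def udt_condition(i, S, m, n):
    """The defining condition of UDT(phi)_i, given oracle answers S:
    some outcome j is provably reached by action i, with no
    lexicographically earlier pair (j', i') provable."""
    earlier = lambda j, i: [(jp, ip) for jp in range(1, n + 1)
                            for ip in range(1, m + 1) if (jp, ip) < (j, i)]
    main = any((j, i) in S and not any(q in S for q in earlier(j, i))
               for j in range(1, n + 1))
    if i == 1:
        # extra default clause for action 1: no pair at all is provable
        return main or not S
    return main

# exhaustive check over every possible pattern of oracle answers
m, n = 2, 3
pairs = [(j, i) for j in range(1, n + 1) for i in range(1, m + 1)]
for bits in product([False, True], repeat=len(pairs)):
    S = {p for p, b in zip(pairs, bits) if b}
    chosen = [i for i in range(1, m + 1) if udt_condition(i, S, m, n)]
    assert len(chosen) == 1            # mutually exclusive and exhaustive
```

Python's tuple comparison happens to coincide with the lexicographic order on pairs defined above, which keeps the `earlier` helper a one-liner.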
Provably extensional problems
Let's call a closed formula in the language of GL sound if its translation into the language of arithmetic is true on the natural numbers; unsurprisingly, we'll write this as N⊨φ for short. My claim is that for every provably extensional decision problem →P(→a), there is a sound formula φ such that →UDT(φ)(→u) performs optimally on →P(→a), that is, obtains the best outcome that any decision theory can achieve on this problem.
Recall that a decision problem →P(→a) is defined to be provably extensional if GL⊢(→a↔→b)→(→P(→a)↔→P(→b)), where →φ↔→ψ stands for the conjunction (φ1↔ψ1)∧⋯∧(φn↔ψn). (Of course, the two sequences need to be of the same length.)
Intuitively, such a decision problem assigns a single outcome f(i) to every action i, and every agent which chooses action i will obtain outcome f(i); this is a formalization of the idea that we should consider a decision problem "fair" if it only rewards or punishes you for the actions you decide to take, not for the decision process you used to arrive at these decisions.
This doesn't mean that the mapping f(i) is all that matters about the decision problem---for example, any particular version of modal UDT fails on its "evil" decision problem, despite the fact that these "evil" problems are provably extensional. That's because it can be difficult to determine which mapping f(⋅) a particular →P(→a) corresponds to.
The idea of the optimality proof in this post is that for any particular →P(→a), there is some truth about what the corresponding mapping f(⋅) is, even if it's hard to determine; if we run UDT with a proof system strong enough to settle this question, this version of UDT will simply be able to figure out which action is best and then take it.
How do we find a proof system strong enough to determine the action/outcome mapping f(⋅) for a given →P(→a)? The simplest possibility is to just take PA and add to it the information about the true action/outcome mapping.
Let χ(i)i:≡⊤ and χ(i)i′:≡⊥ for i′≠i, and write →χ(i) for the corresponding sequence of length m or n (depending on the context). By the true mapping, I mean the mapping f(⋅) such that for all i∈{1,…,m}, we have N⊨Pf(i)(→χ(i)). This mapping is unique, because →P(→a) is a decision problem, which by definition implies that the formulas in this sequence are p.m.e.e. (i.e., exactly one of them is true, for any particular value of the →a).
Now it's clear how we can encode the fact that this is the true mapping as a formula of GL: we simply set φ :≡ ⋀_{i=1}^{m} Pf(i)(→χ(i)).
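Concretely, under the same toy reading as before (the decision problem handed to us as an explicit truth predicate, all names hypothetical), extracting f and assembling φ looks like:

```python
def true_mapping(m, n, holds):
    """Recover the unique action/outcome map f of a decision problem.

    holds(j, i) stands in for N |= P_j(chi(i)).  Since the P_j are
    p.m.e.e., exactly one j qualifies for each action i, so the
    one-element unpacking below never fails and f is unique.
    """
    f = {}
    for i in range(1, m + 1):
        (f[i],) = [j for j in range(1, n + 1) if holds(j, i)]
    return f

def phi_formula(f):
    """The sound axiom added to PA: the conjunction over all actions i
    of P_{f(i)}(chi(i)), here rendered as a plain string."""
    return " & ".join(f"P_{f[i]}(chi({i}))" for i in sorted(f))

# toy universe: action 1 -> outcome 2, action 2 -> outcome 1
outcome = {1: 2, 2: 1}
f = true_mapping(2, 2, lambda j, i: outcome[i] == j)
```

The one-element unpacking makes the uniqueness argument executable: if the P_j were not p.m.e.e. for some χ(i), the sketch would raise an error instead of returning a mapping.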
Optimality
It's now clear that the best outcome any agent can obtain on the provably extensional decision problem →P(→a) is min_i f(i): an agent →A is a sequence of closed p.m.e.e. formulas, which means that for any agent →A there is some i such that N⊨→A↔→χ(i); by extensionality, this implies N⊨→P(→A)↔→P(→χ(i)), so the outcome obtained by →A is f(i).
Write (→A,→U) for the fixed point of →UDT(φ)(→u) and →P(→a), with φ defined as in the previous section. Then, in order to show that →UDT(φ)(→u) performs optimally on this decision problem, we need to show that N⊨→A↔→χ(i∗) for some i∗ such that f(i∗) = j∗ := min_i f(i).
I claim that this holds for i∗, the smallest i such that f(i)=j∗. Given the way →UDT(φ)(→u) is constructed, it suffices to show that N⊨□φ(UDT(φ)i∗(→U)→Uj∗), plus N⊨¬□φ(UDT(φ)i′(→U)→Uj′) for all (j′,i′)<(j∗,i∗).
The first of these statements is equivalent to

GL⊢φ→(UDT(φ)i∗(→U)→Uj∗),

which, given the properties GL⊢→A↔→UDT(φ)(→U) and GL⊢→U↔→P(→A) defining →A and →U, we can rewrite as

GL⊢φ→(Ai∗→Pj∗(→A)).

Since →A is p.m.e.e., Ai∗ is equivalent to →A↔→χ(i∗), so by provable extensionality of →P(→a), we have GL⊢Ai∗→(→P(→A)↔→P(→χ(i∗))). The definition of φ implies GL⊢φ→(→P(→χ(i∗))↔→χ(j∗)), so we obtain GL⊢(φ∧Ai∗)→(→P(→A)↔→χ(j∗)), and hence GL⊢(φ∧Ai∗)→Pj∗(→A), which is equivalent to the desired result.
For the second statement, we need to show that UDT(φ)i′(→U)→Uj′ is not provable in PA+φ for any (j′,i′)<(j∗,i∗). So suppose it were provable, for some such (j′,i′). Then →UDT(φ)(→U) would take the action i′ corresponding to the smallest such pair; in other words, we would have N⊨UDT(φ)i′. We can again use the defining properties of →A and →U to rewrite this as N⊨Ai′.
Since PA is sound and N⊨φ, the assumption that UDT(φ)i′(→U)→Uj′ is provable in PA+φ implies that it is true on N. We can rewrite this as well, to N⊨Ai′→Pj′(→A). Hence, we have N⊨Ai′∧Pj′(→A).
If j′<j∗, this says that →A achieves an outcome better than j∗, which we've shown above to be the best outcome that any agent can possibly achieve---contradiction. If j′=j∗ and i′<i∗, then we have found an i′<i∗ such that f(i′)=j∗ (since by extensionality, N⊨Ai′ implies N⊨Pf(i′)(→A), and hence N⊨¬Pj′(→A) for j′≠f(i′), as →P(→a) is p.m.e.e.)---contradicting the definition of i∗.
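Under the toy reading used in the sketches above, where a sound φ lets the proof search settle exactly the pairs (i, f(i)), the theorem's conclusion can also be checked exhaustively for small m and n: the search always lands on the least action attaining the best outcome. The function name and the stand-in oracle are hypothetical illustrations, not the modal fixed-point construction itself:

```python
from itertools import product

def udt_choice(m, n, f):
    """Modal UDT's proof search against a sound, settled oracle:
    provable("action i" -> "outcome j") holds exactly when f[i] == j."""
    for j in range(1, n + 1):          # best outcome first
        for i in range(1, m + 1):      # ties broken lexicographically
            if f[i] == j:
                return i
    return 1

# check optimality for every action/outcome mapping with m = n = 3
m = n = 3
for values in product(range(1, n + 1), repeat=m):
    f = dict(zip(range(1, m + 1), values))
    j_star = min(f.values())                       # best achievable outcome
    i_star = min(i for i in f if f[i] == j_star)   # least optimal action
    assert udt_choice(m, n, f) == i_star
```

This is of course only evidence for the propositional skeleton of the argument; the real content of the proof is that the GL fixed point of →UDT(φ)(→u) and →P(→a) behaves like this settled oracle when φ is the sound axiom constructed above.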