We present a $\Delta_2$-definable probability distribution that satisfies Christiano's reflection schema for its own defining formula. The strategy is analogous to the chicken step employed by modal decision theory to obfuscate itself from the eyes of its base theory; we will prevent the base theory $T$ from knowing much about the distribution, so that it can be coherent over $T$ and also consistently believe in reflection statements about itself. So, the method used here is technical and not fundamental, but it does at least show that limit-computable and reflective distributions exist. These results are due to Sam Eisenstat and me, and this post benefited greatly from extensive notes from Sam; any remaining errors are probably mine.

Prerequisites: we assume familiarity with Christiano's original result and the methods used there. In particular, we will freely use Kakutani's fixed point theorem. See Christiano et al.'s paper.


Outline

  • Section 1. Introduction and problem statement
  • Section 2. A definable, self-reflective distribution
  • Section 3. A -definable, self-reflective distribution
  • Section 4. Discussion and open problems

Section 1. Problem statement

Probabilistic reflection

We have some base theory $T$ in a language $L$, where $T$ is able to talk about arithmetic (e.g. $T = PA$ or $T = ZFC$). We wish to find probability distributions over completions of $T$, or equivalently functions $\mathbb{P} : L \to [0,1]$ satisfying $T \vdash \varphi$ implies $\mathbb{P}(\varphi) = 1$, and probabilistic coherence conditions like $\mathbb{P}(\varphi) = \mathbb{P}(\varphi \wedge \chi) + \mathbb{P}(\varphi \wedge \neg\chi)$. In particular, we want $\mathbb{P}$ to have accurate beliefs about itself:

$$\forall \varphi \in L,\ \forall\,\text{rational } a < b:\quad a < \mathbb{P}(\varphi) < b \implies \mathbb{P}\big(a < P(\ulcorner \varphi \urcorner) < b\big) = 1,$$
where $P$ is a symbol in $L$. Christiano showed that there exists such a distribution $\mathbb{P}$. In other words, taking an additional symbol $P$ in the language, writing $\mathrm{Coh}(P)$ (in the metalanguage) for the statement that $P$ is a coherent distribution over $T$, and $\mathrm{Refl}(P)$ for the statement that $P$ is reflective, we have that the theory

$$T \cup \mathrm{Coh}(P) \cup \mathrm{Refl}(P)$$
is consistent. That the consistency of this theory is equivalent to the existence of a reflective $\mathbb{P}$ requires an argument due to Fallenstein, wherein we take the standard part of $P$ according to some completion of this theory. We will use this idea in Section 3 and explain it there.
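For concreteness, here is one way to spell out these schemas; this is our paraphrase, and Christiano et al.'s exact axiomatization may differ in inessential details:

```latex
% Coherence and reflection schemas for the symbol P (our paraphrase;
% the original presentation may differ in inessential ways).
\[
\mathrm{Coh}(P):\ \begin{cases}
  P(\varphi) = 1 & \text{for each $\varphi$ with $T \vdash \varphi$},\\
  P(\varphi) = P(\varphi \wedge \chi) + P(\varphi \wedge \neg\chi) & \text{for all sentences $\varphi, \chi$},\\
  0 \le P(\varphi) \le 1 & \text{for all sentences $\varphi$}.
\end{cases}
\]
\[
\mathrm{Refl}(P):\ a < P(\ulcorner\varphi\urcorner) < b
  \;\rightarrow\; P\bigl(\ulcorner a < P(\ulcorner\varphi\urcorner) < b \urcorner\bigr) = 1
  \qquad \text{for all sentences $\varphi$ and rationals $a < b$}.
\]
```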

Definable reflection

Christiano asked: Can we find a distribution that is definable, and furthermore reflective about its own defining formula?
This is different from the original reflective $\mathbb{P}$ because there we had a symbol $P$ in the language, and $\mathbb{P}$ reflected itself in that symbol; in that case, speaking imprecisely, $\mathbb{P}$ knew that $P$ assigned the same probabilities that $\mathbb{P}$ assigned. Now we want a formula $\psi_{SR}$ that defines a distribution, and that distribution assigns probability 1 to certain true statements about what probabilities $\psi_{SR}$ assigns. (The subscript $SR$ stands for Self-Reflective.)

Formally, can we have a coherent distribution $\mathbb{P}$ on $T$ defined by some formula $\psi_{SR}$ in the language $L$, where now $L$ does not contain a distinguished symbol $P$, satisfying:

$$\forall \varphi \in L,\ \forall\,\text{rational } a < b:\quad a < \mathbb{P}(\varphi) < b \implies \mathbb{P}\big(\exists p\, (\psi_{SR}(\ulcorner \varphi \urcorner, p) \wedge a < p < b)\big) = 1\,?$$
Our strategy is to apply Kakutani's fixed point theorem to the following reflection correspondence on the set $\Delta$ of coherent distributions over $T$. For any $\mathbb{P}, \mathbb{Q} \in \Delta$, we write $\mathbb{P} \lhd_\psi \mathbb{Q}$ and say that $\mathbb{Q}$ reflects $\mathbb{P}$ in $\psi$ if and only if, whenever $\mathbb{P}(\varphi)$ is in some open interval, $\mathbb{Q}$ believes that $\psi$ assigns to $\varphi$ a value in that interval. Formally, for all $\mathbb{P}, \mathbb{Q} \in \Delta$, we have $\mathbb{P} \lhd_\psi \mathbb{Q}$ if and only if:

$$\forall \varphi \in L,\ \forall\,\text{rational } a < b:\quad a < \mathbb{P}(\varphi) < b \implies \mathbb{Q}\big(\exists p\, (\psi(\ulcorner \varphi \urcorner, p) \wedge a < p < b)\big) = 1.$$
If we can obtain a fixed point $\mathbb{P} \lhd_\psi \mathbb{P}$ of this correspondence, and have $\psi$ define this same distribution $\mathbb{P}$, then we will have a definable self-reflective distribution.

(The notation $\lhd_\psi$ does not denote an ordering, and is meant to suggest a single point on the left with a larger corresponding set on the right; better recommendations are welcome. Also, in the following section, $\mathbb{P}$ and $\mathbb{Q}$ are actual distributions, not symbols in a language; this is distinguished by their arguments having the type of formulas rather than Gödel numbers of formulas.)

Most of the conditions for Kakutani's theorem are satisfied by $\lhd_\psi$ for any formula $\psi$, just as in Christiano's proof, so we won't verify them in detail. For example, for each $\mathbb{P}$, the set of distributions $\mathbb{Q}$ that reflect $\mathbb{P}$ in $\psi$ is still convex, because interpolation preserves coherence and reflection. (I can make an auxiliary post if people would like to see the details.) However, we need $\psi$ to satisfy the crucial non-emptiness condition:

$$\forall \mathbb{P} \in \Delta\ \exists \mathbb{Q} \in \Delta:\quad \mathbb{P} \lhd_\psi \mathbb{Q}.$$
In the next section, we give a definition $\psi_{SR}$ that meets this criterion, assuming a strong background theory such as $ZFC$. This will illustrate the diagonalization method we will apply. Section 3 gives a formula $\psi_L$ that uses the same strategy as $\psi_{SR}$, but is $\Delta_2$-definable in arithmetic, i.e. computable relative to a halting oracle. Section 4 discusses the meta-theoretic assumptions used, optimality of our results, and open questions.

Section 2. A definable, self-reflective distribution

Consider the theory $T_{\mathbb{P}}$ consisting of $T$ together with the statements $\exists p\,(\psi(\ulcorner\varphi\urcorner, p) \wedge a < p < b)$ for all $\varphi \in L$ and rationals $a < \mathbb{P}(\varphi) < b$; this theory states that $\psi$ behaves according to $\mathbb{P}$. If $T_{\mathbb{P}}$ is a consistent theory, then we can take a completion $T'$ of $T_{\mathbb{P}}$, and then take the 0-1-valued distribution $\mathbb{Q} := \mathbb{1}_{T'}$, i.e. the indicator function for membership in the complete theory $T'$. This $\mathbb{Q}$ is coherent over $T$, and furthermore by fiat assigns probability 1 to each reflection statement for $\mathbb{P}$ by $\psi$; that is, $\mathbb{P} \lhd_\psi \mathbb{Q}$.

In the original existence proof of a reflective $\mathbb{P}$, the analogous reflection theory for a symbol $P$ is consistent: the distribution $\mathbb{P}$ itself provides a model for $P$. However, in the present case, we have to be more careful. Perhaps $T$ proves some non-trivial facts about the behavior of the function $\psi$ we are defining. Then $T_{\mathbb{P}}$ may be inconsistent, if $T$ proves that $\psi$ cannot behave according to $\mathbb{P}$, e.g. if $\mathbb{P}(\varphi) > 1/2$ but $T$ proves that $\psi$ assigns $\varphi$ a value less than $1/2$.

To overcome this, we will obfuscate the behavior of $\psi$ from the prying proofs of $T$, to such an extreme extent that $T$ cannot rule out any finite set of behaviors of $\psi$; then $T_{\mathbb{P}}$ will be consistent. (Unfortunately, this weakens the main potential use in reflective agents of having a definable self-reflective distribution, namely, that the theory $T$ can reason concretely about the distribution, beyond what can be said about generic coherent distributions. It may be possible to have the base theory prove more about the distribution by not diagonalizing against certain behaviors, and working around that in the application of Kakutani's theorem.)

Definition of $\psi_{SR}$

Take the set $\mathcal{B}$ of possible finite specifications of behaviors for $\psi_{SR}$: sentences of the form $\bigwedge_{i<n} \exists p\,\big(\psi_{SR}(\ulcorner\varphi_i\urcorner, p) \wedge a_i < p < b_i\big)$, where $n \in \mathbb{N}$ and $a_i < b_i$ are rationals. Here $\varphi_0, \varphi_1, \ldots$ is some enumeration of all sentences in $L$. Note that $\mathcal{B}$ is countable, and each sentence in $\mathcal{B}$ is consistent with $\psi_{SR}$ being a function, though not necessarily a coherent distribution. Now we can define $\psi_{SR}$ by the following informally stated construction, which can be translated into a formula defining the graph of a function from sentences to real numbers:

Step 1. Search for a proof in $T$ of the negation of some statement in $\mathcal{B}$. If such a statement is found (take the first one, say $B = \bigwedge_{i<n} \exists p\,(\psi_{SR}(\ulcorner\varphi_i\urcorner, p) \wedge a_i < p < b_i)$), then output some value in $(a_i, b_i)$ on input $\varphi_i$ for each $i < n$, and otherwise output 0.

Step 2. Otherwise, output values according to the least fixed point of the reflection correspondence $\lhd_{\psi_{SR}}$.

By the least fixed point, we mean the fixed point with minimal first coordinate; minimal second coordinate out of those with minimal first coordinate; minimal third coordinate, and so on (viewing a distribution as a point of $[0,1]^L$ via the enumeration of sentences). This exists because the set of fixed points is closed, and therefore also compact. Note that we require $L$ and $T$ to be expressive enough to define fixed points of $\lhd_{\psi_{SR}}$ and the ordering just described. Taking $T = ZFC$ suffices. This is not a crucial point, as we will later show that a much weaker theory suffices.

Theorem 1.

If $T$ is a sound theory, then $\psi_{SR}$ defines a function $\mathbb{P}$ on the sentences of $L$ that assigns probability 1 to each sentence in $T$, is coherent, and is reflective in $\psi_{SR}$.

Proof.

Suppose we have $T \vdash \neg B$ for some $B \in \mathcal{B}$. Then Step 1 of $\psi_{SR}$ finds a refutation of some such statement, say the first one, $B' = \bigwedge_{i<n} \exists p\,(\psi_{SR}(\ulcorner\varphi_i\urcorner, p) \wedge a_i < p < b_i)$, and $\psi_{SR}$ outputs a value in $(a_i, b_i)$ on $\varphi_i$ for all $i < n$; so $B'$ is true, even though $T \vdash \neg B'$. This contradicts soundness of $T$.

Thus, by compactness, we have that for all $\mathbb{P} \in \Delta$, the theory $T_{\mathbb{P}}$ is consistent (any finite set of reflection statements for $\mathbb{P}$ follows from a single statement $B \in \mathcal{B}$ that $T$ does not refute). Then by the argument above, the correspondence $\lhd_{\psi_{SR}}$ gives a non-empty image for each point in $\Delta$. By Kakutani's theorem, there is a fixed point of this correspondence.

Step 2 then outputs probabilities according to some such fixed point $\mathbb{P}$, namely the least one. That is, $\psi_{SR}$ defines $\mathbb{P}$. Furthermore, $\mathbb{P} \lhd_{\psi_{SR}} \mathbb{P}$, i.e. $\mathbb{P}$ is reflective in its own defining formula.

In fact, we do not need to assume soundness of $T$ in Theorem 1; we only need to assume that $T$ is consistent. Proof: suppose that $T \vdash \neg B$ for some $B \in \mathcal{B}$. Then there is a least such proof. But then $T$ can prove that this $B$ is the first sentence in $\mathcal{B}$ that $T$ refutes, so $T$ can prove that $\psi_{SR}$ finds this proof and outputs according to $B$. Then we have that $T$ both proves and refutes that $\psi_{SR}$ behaves according to $B$, contradicting the consistency of $T$.

Section 3. A $\Delta_2$-definable, self-reflective distribution

Step 1 in $\psi_{SR}$ as defined above is $\Delta_2$-definable, i.e. computable with a halting oracle: we just need to check the halting of the machine that searches for refutations by $T$ of the (computably recognizable) sentences in $\mathcal{B}$. If the machine halts, find the proof computably, and output rational numbers given by the refuted behavior.
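To make this concrete, here is a rough Python sketch of Step 1 relative to a halting oracle. All of the names (`halts`, `searcher`, `run`) are ours and stand in for the machinery described above; this is an illustration of the idea, not a literal implementation.

```python
# Illustrative sketch only (our names, not from the post): Step 1 of the
# construction, computed relative to an assumed halting oracle.

def step1(phi, halts, searcher, run):
    """Return a rational value for the sentence `phi` if T refutes some
    finite behavior specification, else None (meaning: defer to Step 2).

    `searcher` is (an index for) the machine that enumerates T-proofs and
    halts iff it finds a proof of the negation of some behavior
    specification; `halts` is the assumed halting oracle; `run(searcher)`
    is an ordinary computation that reruns the search to completion and
    returns the refuted specification as (sentence, lower, upper) triples.
    """
    if not halts(searcher):
        return None              # T refutes no behavior specification.
    refuted = run(searcher)      # Safe to run: we know the search halts.
    for (phi_i, a, b) in refuted:
        if phi_i == phi:
            return (a + b) / 2   # Any rational strictly between a and b works.
    return 0                     # Sentences not mentioned get value 0.
```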

Step 2, however, uses set-theoretic notions; namely, it quantifies over the large and complicated set of fixed points of $\lhd_{\psi_{SR}}$. In this section we replace Step 2 with a computation relative to a halting oracle, thereby defining a formula $\psi_L$ that is self-reflective. (The subscript $L$ stands for Limit-computable.)

Definition of $\psi_L$

By the arguments for Theorem 1, regardless of what $\psi_L$ does in Step 2, there will be some $\mathbb{P}$ such that $\mathbb{P} \lhd_{\psi_L} \mathbb{P}$. Thus, for Step 2, it suffices to run a computation that outputs a fixed point as long as one exists. We now define $\psi_L$ by the following construction, which can be viewed as an algorithm executed by a machine with a halting oracle:

Step 1. Search for a proof in $T$ of the negation of some statement in $\mathcal{B}$ (the finite behavior specifications, now for $\psi_L$). If such a statement is found (take the first one), then output some value in the specified interval $(a_i, b_i)$ on each input $\varphi_i$ it mentions, and otherwise output 0.

Step 2. Otherwise, enumerate a completion of the theory $\Theta := ZFC \cup \mathrm{Coh}(P) \cup \mathrm{Refl}_{\psi_L}(P)$ in the language of set theory plus a constant symbol $P$ (here $\mathrm{Refl}_{\psi_L}(P)$ says that $P$ reflects its values through the formula $\psi_L$, as in the correspondence above). That is, list all sentences $\theta_0, \theta_1, \ldots$ of that language, and print out each $\theta_i$ that is consistent with $\Theta$ together with the finitely many sentences that have already been printed. Given a sentence $\varphi$ and a precision $n$, search for the first quadruple of rational numbers $a_1 < a_2 < a_3 < a_4$ such that $a_4 - a_1 < 2^{-n}$, and such that $a_2 < P(\ulcorner\varphi\urcorner) < a_3$ follows from the sentences printed so far (printing more sentences as needed). Output the interval $(a_1, a_4)$.

(Note: This is the technique alluded to earlier of taking the standard part of a distribution. If we just output an interval $(a, b)$ with $b - a < 2^{-n}$ such that $a < P(\ulcorner\varphi\urcorner) < b$ follows from what has been printed, then we might output e.g. the intervals $(1/2 - 2^{-n}, 1/2)$ for all $n$, which have empty intersection. Also note that we could take a much more basic theory than $ZFC$; we could have a base theory that says all true atomic statements about the rationals, and take $\mathrm{Coh}(P)$ and $\mathrm{Refl}_{\psi_L}(P)$ to be schemas of quantifier-free formulas.)
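As an illustration of the interval trick, here is a rough Python sketch of the extraction in Step 2. The predicate `completion_proves(phi, a, b)` is an assumed black box standing for "the statement $a < P(\ulcorner\varphi\urcorner) < b$ follows from the sentences printed so far"; the name and the exact search order are ours.

```python
# Hypothetical sketch (our names): given access to the enumerated completion,
# output an interval of width < 2**-n that is guaranteed to contain the
# standard part of P(phi).

from fractions import Fraction
from itertools import count

def interval(phi, n, completion_proves):
    """Search for rationals a1 < a2 < a3 < a4 with a4 - a1 < 2**-n such that
    the completion proves "a2 < P(phi) < a3"; return the wider interval
    (a1, a4).  The standard part of P(phi) lies in [a2, a3], hence in every
    interval we ever return, so the outputs for n = 0, 1, 2, ... pin down a
    single standard real."""
    eps = Fraction(1, 2 ** (n + 2))
    for denom in count(1):
        for num in range(-denom, 2 * denom + 1):   # sweep rationals a2 in [-1, 2]
            a2 = Fraction(num, denom)
            a3 = a2 + eps
            if completion_proves(phi, a2, a3):
                return (a2 - eps, a3 + eps)        # width 3*eps < 2**-n
```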

Observation: $\psi_L$ is $\Delta_2$.

Proof: Step 1 is $\Delta_2$ by the argument given at the beginning of this section. Step 2 is computable, except that we need to check whether each new sentence is consistent with $\Theta$ and the previously printed sentences. To check the consistency of a computably enumerable theory $S$, we ask our halting oracle whether $M_S$ does not halt, where $M_S$ is a machine that enumerates all the theorems of $S$, and halts iff it prints $\bot$.
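In code, that one oracle query looks as follows; `make_contradiction_searcher` is a hypothetical helper (our name) that builds the machine $M_S$ described above. Step 2 makes one such query per candidate sentence, with $S$ equal to $\Theta$ plus the sentences printed so far.

```python
# Hypothetical sketch (our names): consistency of a computably enumerable
# theory S, decided by one query to the assumed halting oracle.

def is_consistent(S, halts, make_contradiction_searcher):
    """`make_contradiction_searcher(S)` builds (an index for) a machine that
    enumerates all theorems of S and halts iff it derives a contradiction;
    S is consistent exactly when that machine never halts."""
    return not halts(make_contradiction_searcher(S))
```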

Theorem 2.

$\psi_L$ defines a coherent, self-reflective distribution.

Proof.

By the argument for Theorem 1, if $T$ is consistent, it does not refute any behavior of $\psi_L$. So Step 1 never finds a proof, $\psi_L$ behaves according to Step 2, and there is some fixed point $\mathbb{P} \lhd_{\psi_L} \mathbb{P}$ of the reflection correspondence.

The distribution $\mathbb{P}$ provides an interpretation for the symbol $P$ satisfying coherence over $T$ and self-reflection in $\psi_L$. Hence the theory $\Theta$ is consistent, and so the completion enumerated in Step 2 is non-trivial. By construction, for each $\varphi$ there is a unique (standard) real $\mathbb{P}'(\varphi)$ that, for all $n$, lies in the interval output by $\psi_L$ on $\varphi$ and $n$.

Now we argue that $\mathbb{P}'$ is coherent over $T$ and reflects itself in $\psi_L$. For any $\varphi$ with $T \vdash \varphi$, since $\Theta \vdash P(\ulcorner\varphi\urcorner) = 1$, we also have that the completion proves $P(\ulcorner\varphi\urcorner) = 1$. Thus, for all $n$ the quadruple found for $(\varphi, n)$ must satisfy $a_2 < 1 < a_3$ by consistency, so the output interval contains 1 and has width less than $2^{-n}$; therefore $\mathbb{P}'(\varphi) = 1$. A very similar argument works to show the other coherence conditions and reflection. For example, if $a < \mathbb{P}'(\varphi) < b$, then for large enough $n$ the interval output on $(\varphi, n)$ is contained in $(a, b)$, so the completion proves $a < P(\ulcorner\varphi\urcorner) < b$. By an instance of the reflection schema in $\Theta$, the completion also proves $P(\ulcorner\exists p\,(\psi_L(\ulcorner\varphi\urcorner, p) \wedge a < p < b)\urcorner) = 1$, and hence, by the same argument as above, $\mathbb{P}'(\exists p\,(\psi_L(\ulcorner\varphi\urcorner, p) \wedge a < p < b)) = 1$.

Thus $\psi_L$ computes (relative to a halting oracle) a coherent distribution $\mathbb{P}'$ over $T$ which assigns probability 1 to each reflection statement for $\mathbb{P}'$ by $\psi_L$.

Section 4. Discussion

$\Delta_2$ is roughly optimal for the properties of $\psi_L$; computable distributions can't even be coherent over essentially undecidable theories. (If we could compute a coherent distribution over $T$, then for each $\varphi$ we could wait for 1 or 0 to be excluded from the intervals we output for $\varphi$, and thereby computably separate the provable and refutable sentences, which is impossible.) A slightly stronger statement is that for any definable completion $T'$ of $\Theta$, there is a distribution over completions of $T$ that is computable from $T'$, which is reflective for the definition of $\psi_L$ that uses $T'$ in Step 2 to complete $\Theta$.
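Here is a rough Python sketch of the parenthetical argument (the names are ours); `interval(phi, n)` stands for the assumed computable procedure that returns a rational interval of width less than $2^{-n}$ containing $\mathbb{P}(\varphi)$:

```python
# Hypothetical sketch (our names): a computable coherent distribution over T
# would yield a computable separation of T's theorems from its refutable
# sentences, which is impossible for an essentially undecidable T.

from itertools import count

def classify(phi, interval):
    """Halts on every sentence.  Returns True on every theorem of T and
    False on every sentence refuted by T (other sentences may fall on
    either side, which is all a separation needs)."""
    for n in count(0):
        lo, hi = interval(phi, n)   # contains P(phi), width < 2**-n
        if lo > 0:
            return True    # 0 is excluded, so phi is not refutable.
        if hi < 1:
            return False   # 1 is excluded, so phi is not a theorem.
```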

As mentioned earlier, $\psi_L$ is very obfuscated from $T$. In fact, if $T$ is strong enough, then $T$ does not even prove that $\psi_L$ defines a coherent, reflective distribution. Otherwise, since $\mathbb{P}'$ believes each sentence of $T$, by coherence and that provability we would have that $\mathbb{P}'$ assigns probability 1 to its own coherence and reflectivity. But this is not possible for a reflective, coherent distribution.

If our metatheory is strong enough to prove some facts like Kakutani's theorem, then it proves (formalizations of) Theorems 1 and 2. This is likely roughly optimal: if the meta-theory even proves that there is a coherent distribution over $T$, then it also proves $\mathrm{Con}(T)$. Probably $PA$ is sufficient to prove the necessary first-order statements, since the second-order theory $ACA_0$ is probably strong enough to prove a literal statement of Kakutani's theorem, and $PA$ is the first-order part of $ACA_0$. See for example Fixed point theory in weak second-order arithmetic, by Naoki Shioji and Kazuyuki Tanaka, though that paper only proves Kakutani in finite dimensions and the Tychonoff-Schauder theorem (in the weaker system $WKL_0$).

Questions

  • We conjecture that a modification of $\psi_L$ might give a distribution that satisfies the same properties, and also assigns probability 1 to the statement that it is coherent over $T$.
  • We also conjecture that a distribution satisfying Fallenstein's reflection principle for expectations can be similarly defined.
  • Can there be a distribution that is reflective and computably approximable (from below)? (This may be easy to refute.)
  • Can a definable reflective distribution be coherent over a theory that proves some useful facts about the distribution?
  • What are some weaker notions of reflection that can be computable or computably approximable (from below)?
  • What reflective properties of beliefs are needed for self-verification in the context of general decision-making?

Open questions (Christiano):

  • What sorts of bad beliefs can be avoided given coherence and reflection?
  • What are some weaker notions of reflection such that probability distributions can be coherent, reflective, and also assign high probability to their own reflectivity?
Comments

Looks good! As you say, it is more technical than illuminating, but it's probably a necessary step if the whole project is going to work out. I had definitely thought that this might be tricky, so it's good to see it done. Also I think that there were some natural follow-up questions that I had set aside because I didn't have this piece.

I don't think we should really care about whether the theory proves useful facts about the distribution; as long as the distribution doesn't dogmatically believe falsehoods about itself, we can just condition on explicit reflective observations and that will probably be fine.

So I'm basically fine with this diagonalization step, though I might come to share your concerns if I thought about it longer.

I think the key next step, that may or may not be possible, is getting a distribution that can assign non-zero probability to its own reflection principle while being reflective. This obviously requires weakening the reflection principle, though I think that there are some weak enough reflection principles floating around.

For example, it might be worth taking another look at the double expectation property (E[f(E)] = E[E[f(E)]] for functions f that are continuous in the product topology), and seeing what that translates into in this framework. Obviously continuity will be replaced by something different, since now E and f are both just Turing machines; hopefully we would get E[f] = E[E[f]] for all f that are approximable relative to some powerful oracle. It looks to me like this might work, and if the oracle was powerful enough to implement E I'd consider that promising as a potential basis for self-trust.