This post presents a simple toy coherence theorem, and then uses it to address various common confusions about coherence arguments.

Setting

Deterministic MDP. That means at each time $t$ there's a state $S[t]$[1], the agent/policy takes an action $A[t]$ (which can depend on both the time $t$ and the current state $S[t]$), and then the next state $S[t+1]$ is fully determined by $S[t]$ and $A[t]$. The current state and current action are sufficient to tell us the next state.

We will think about values over the state at some final time $T$. Note that often in MDPs there is an incremental reward each timestep in addition to a final reward at the end; in our setting there is zero incremental reward at each timestep.
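To fix notation and intuition, here's a minimal sketch of this setting in Python; the names (`succ`, `rollout`, `stay_put`) are my own illustrative choices, and since transitions are deterministic I let the policy directly pick its successor state rather than an abstract action:

```python
# A deterministic MDP given as a successor relation: succ[s] is the set of
# states reachable from s in one step. Since transitions are deterministic,
# the policy's "action" is just its choice of next state.

succ = {0: {0, 1, 2}, 1: {0, 1, 2}, 2: {0, 1, 2}}   # every state reachable from every state

def rollout(policy, succ, s0, T):
    """Run the policy from state s0 for T steps; only the final state gets valued."""
    s = s0
    for t in range(T):
        s_next = policy(t, s)        # choice may depend on time and current state
        assert s_next in succ[s]     # the policy may only move to reachable states
        s = s_next
    return s

stay_put = lambda t, s: s            # the policy used in the existence example below
assert rollout(stay_put, succ, s0=1, T=3) == 1
```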

One key point about this setting: if the value over final state is uniform, i.e. the same value for all final states, then the MDP is trivial. In that case, all policies are optimal; it does not matter at all what the final state is or what any state along the way is, since everything is equally valuable.

Theorem

There exist policies which cannot be optimal for any values over final state except for the trivial case of uniform values. Furthermore, such policies are exactly those which display inconsistent revealed preferences transitively between all final states.

Proof

As a specific example: consider an MDP in which every state is reachable at every timestep, and a policy which always stays in the same state over time. From each state $s$ every other state is reachable, yet the policy chooses to stay at $s$, so in order for the policy to be optimal, $s$ must be a highest-value final state. Since each state must be a highest-value final state, the policy cannot be optimal for any values over final state except for the trivial case of uniform values. That establishes the existence part of the theorem, and you can probably get the whole idea by thinking about how to generalize that example. The rest of the proof extends the idea of that example to inconsistent revealed preferences in general.

Bulk of Proof

Assume the policy is optimal for some particular values over final state. We can then start from those values over final state and compute the best value achievable starting from each state at each earlier time. That's just dynamic programming:

$$V[t, S] = \max_{S' \text{ reachable from } S \text{ in one step}} V[t+1, S']$$

where the $V[T, S]$ are the values over final states.

A policy is optimal for final values $V[T, \cdot]$ if-and-only-if at each timestep $t$ it chooses a next state $S'$ with highest reachable $V[t+1, S']$.
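Here's a minimal sketch of that backward-induction computation and the corresponding optimality check, continuing the illustrative Python setup from above (the `succ` dict and successor-picking policies are my own conventions, not from the post):

```python
# Backward induction: V[t][s] is the best achievable final value starting from
# state s at time t, given values V_T over final states.

def backward_induction(succ, V_T, T):
    V = [dict() for _ in range(T + 1)]
    V[T] = dict(V_T)
    for t in range(T - 1, -1, -1):
        for s in succ:
            V[t][s] = max(V[t + 1][s2] for s2 in succ[s])
    return V

def is_optimal(policy, succ, V, T):
    """Optimal iff at every (t, s) the policy moves to a successor with the
    highest reachable V[t+1]."""
    return all(
        V[t + 1][policy(t, s)] == max(V[t + 1][s2] for s2 in succ[s])
        for t in range(T)
        for s in succ
    )

# E.g. with the fully-connected `succ` above, the stay-put policy is optimal
# only for uniform final values:
# is_optimal(stay_put, succ, backward_induction(succ, {0: 1, 1: 1, 2: 1}, 3), 3) -> True
# is_optimal(stay_put, succ, backward_induction(succ, {0: 0, 1: 1, 2: 2}, 3), 3) -> False
```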

Now, suppose that at timestep $t$ there are two different states, either of which can reach either state $A$ or state $B$ in the next timestep. From one of those states the policy chooses $A$; from the other the policy chooses $B$. This is an inconsistent revealed preference between $A$ and $B$ at time $t+1$: sometimes the policy has a revealed preference for $A$ over $B$, sometimes for $B$ over $A$.

In order for a policy with an inconsistent revealed preference between $A$ and $B$ at time $t+1$ to be optimal, the values must satisfy

$$V[t+1, A] = V[t+1, B].$$

Why? Well, a policy is optimal for final values $V[T, \cdot]$ if-and-only-if at each timestep $t$ it chooses a next state with highest reachable $V[t+1, \cdot]$. So, if an optimal policy sometimes chooses $A$ over $B$ at timestep $t+1$ when both are reachable, then we must have $V[t+1, A] \geq V[t+1, B]$. And if an optimal policy sometimes chooses $B$ over $A$ at timestep $t+1$ when both are reachable, then we must have $V[t+1, B] \geq V[t+1, A]$. If both of those occur, i.e. the policy has an inconsistent revealed preference between $A$ and $B$ at time $t+1$, then $V[t+1, A] = V[t+1, B]$.

Now, we can propagate that equality to a revealed preference on final states. We know that the final state which the policy in fact reaches starting from $A$ at time $t+1$ must have the highest reachable value, and that value is equal (by definition) to $V[t+1, A]$. Similarly for $B$. So, if we call the final state which the policy in fact reaches starting from state $A$ at time $t+1$ $F_A$ (and likewise $F_B$ for $B$), our condition $V[t+1, A] = V[t+1, B]$ becomes

$$V[T, F_A] = V[T, F_B].$$

When the policy ends in different final states starting from $A$ versus $B$, this is an inconsistent revealed preference between final states $F_A$ and $F_B$: there are states at time $t$ from which both $F_A$ and $F_B$ are achievable (over multiple timesteps), and the policy sometimes chooses one and sometimes the other when both are achievable.
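To make that step concrete, here is a hedged sketch (in the same illustrative Python setup) which scans for inconsistent revealed preferences between next-states and maps each one to the equality constraint it imposes on final states; `final_state` and `inconsistent_final_pairs` are names I'm introducing for illustration:

```python
from itertools import combinations

def final_state(policy, s, t, T):
    """Final state the policy actually reaches starting from state s at time t."""
    while t < T:
        s = policy(t, s)   # the policy directly picks the successor state
        t += 1
    return s

def inconsistent_final_pairs(policy, succ, T):
    """Pairs of final states forced to have equal value: the constraints
    V[T, F_A] = V[T, F_B] from the proof."""
    constraints = set()
    for t in range(T):
        for a, b in combinations(succ, 2):
            # states from which both a and b are reachable in one step
            parents = [s for s in succ if a in succ[s] and b in succ[s]]
            choices = {policy(t, s) for s in parents}
            if a in choices and b in choices:   # inconsistent revealed preference
                fa = final_state(policy, a, t + 1, T)
                fb = final_state(policy, b, t + 1, T)
                if fa != fb:
                    constraints.add(frozenset((fa, fb)))
    return constraints
```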

Let's pause a moment. We've now shown that there is a property of the policy - i.e. inconsistent revealed preference between two final states $F_A$ and $F_B$ - such that a certain constraint, $V[T, F_A] = V[T, F_B]$, must be satisfied by any final values for which the policy is optimal.

Note that we can also chain together such constraints - e.g. if the policy's inconsistent revealed preferences between final states $S_1$ and $S_2$, and between final states $S_2$ and $S_3$, imply both $V[T, S_1] = V[T, S_2]$ and $V[T, S_2] = V[T, S_3]$, then we get the full chain $V[T, S_1] = V[T, S_2] = V[T, S_3]$. Thus we have a "transitively" inconsistent revealed preference between $S_1$ and $S_3$.

If the policy displays inconsistent revealed preferences transitively between all final states, that means the chain of equalities covers all final states, and therefore the values over final state must be uniform. That's the main claim of the theorem.

Lastly, to show that policies which are optimal only for uniform values are exactly those with inconsistent revealed preferences transitively between all final states, we need to show that there are some non-uniform values for which the policy is optimal if there aren't inconsistent revealed preferences transitively between all final states. This part is less interesting and kinda mathematically tedious IMO, so I'll be more terse and technical: the equality constraints yield equivalence classes of final states. Between each pair of equivalence classes, there's either a revealed preference (if the policy ever chooses a state in one class over a state in the other), or no constraint (if there's never a starting point from which states in both classes are available and the policy chooses one of them). The revealed preferences between equivalence classes are acyclic, since any cycle would be another inconsistent preference. So, toposort the equivalence classes by revealed preference, take the value to be the toposort index, and we have a value function for which the policy is optimal.
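A sketch of that construction, under the same illustrative conventions: `equal_pairs` would be the equality constraints (e.g. as produced by `inconsistent_final_pairs` above), and `strict_prefs` the remaining consistent revealed preferences between final states, which I'm assuming are handed to us as (chosen, not-chosen) pairs:

```python
from graphlib import TopologicalSorter   # Python 3.9+

def construct_final_values(final_states, equal_pairs, strict_prefs):
    """Return a value function (dict) over final states for which the policy
    is optimal, assuming the revealed preferences between equivalence classes
    are acyclic (a cycle would be another inconsistent preference)."""
    # Union-find to build the equivalence classes of final states.
    parent = {s: s for s in final_states}
    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s
    for x, y in equal_pairs:
        parent[find(x)] = find(y)

    # A revealed preference for x over y means class(x) needs at least as high
    # a value as class(y), so class(y) must come earlier in the toposort.
    ts = TopologicalSorter({find(s): set() for s in final_states})
    for x, y in strict_prefs:
        if find(x) != find(y):
            ts.add(find(x), find(y))     # find(y) is a predecessor of find(x)

    order = list(ts.static_order())      # raises CycleError if cyclic
    class_value = {c: i for i, c in enumerate(order)}
    return {s: class_value[find(s)] for s in final_states}
```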

Anti-Takeaways: Things Which Don't Generalize

Determinism

This theorem does not involve any uncertainty. That's the most important sense in which it is "toy". We can easily add a little uncertainty, in the form of nondeterministic state transitions, but that's a pretty narrow form of uncertainty. The more interesting and realistic possibility is uncertainty over current state, i.e. turning the MDP into a POMDP, and that completely destroys the proof; it no longer makes sense to use a value function over earlier states at all. Interesting new possibilities come up, e.g. using the state to store information for the future[2]. Also ideally we'd like to derive the implied probability distribution along with the value function; that's a whole additional dimension to the problem under uncertainty. It's a pretty fun problem.

Takeaways: Things Which (I Expect) Do Generalize

Coherence Is Nontrivial For Optimization "At A Distance"

Notice that we used values over final state, and explicitly set incremental reward at earlier timesteps to zero. That was load-bearing: with arbitrary freedom to choose rewards at earlier timesteps, any policy is optimal for some nontrivial values/rewards. (Proof: just pick the rewards at timestep $t$ to reward whatever the policy does enough to overwhelm future value/rewards.)
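For concreteness, here's one way that parenthetical construction could look in the same illustrative setup (the per-step `reward` dict and `bonus` scaling are my own choices, not from the post):

```python
def rationalizing_rewards(policy, succ, T, bonus=1000.0):
    """Per-step rewards under which an arbitrary policy is optimal: reward the
    transition the policy actually takes at each (t, s), scaled so that no
    later value/reward can make up for skipping it. Assumes the final values
    lie in [0, bonus]; scale `bonus` up otherwise."""
    reward = {}
    for t in range(T):
        for s in succ:
            for s2 in succ[s]:
                reward[(t, s, s2)] = bonus ** (T - t) if s2 == policy(t, s) else 0.0
    return reward
```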

This ties to a common criticism: that any system can be well-modeled as a utility maximizer, by simply choosing the utility function which rewards whatever the system in fact does. As far as I can tell, that criticism usually reflects ignorance of what coherence says[3]. Coherence is not about whether a system "can be well-modeled as a utility maximizer" for some utility function over anything at all, it's about whether a system can be well-modeled as a utility maximizer for utility over some specific stuff.

The utility in the toy coherence theorem in this post is very explicitly over final states, and the theorem says nontrivial things mainly when the agent is making decisions at earlier times in order to influence that final state - i.e. the agent is optimizing the state "far away" (in time) from its current decision. That's the prototypical picture in my head when I think of coherence. Insofar as an incoherent system can be well-modeled as a utility maximizer, its optimization efforts must be dominated by relatively short-range, myopic objectives. Coherence arguments kick in when optimization for long-range objectives dominates.

(A visual analogy: one can in principle arrange positive and negative charges to form any electric field. So I can't look at an arbitrary field and say "ah, that does/doesn't look like an electric field". But in a vacuum, an electric field is much more restricted - I can look at an arbitrary field and say "ah, that does/doesn't look like an electric field in a vacuum". It's away from the charges that we can say nontrivial things about what the field looks like, without needing to know about the charges. Likewise for coherence: coherence is like vacuum equations for goals. It tells us what optimal policies look like away from the things which the goal cares about directly.)

We Didn't Need Trades, Adversaries, Money-Pumping, Etc

Another common criticism of coherence arguments which mostly reflects ignorance: in real life, nobody will actually try to money-pump me, and even if they did I'd notice and then change my revealed preferences.

The usual response to that critique is that coherence is not really about trades and adversaries and money-pumping; the world presents us with choices constantly, and coherence is a requirement for any "non-dominated" strategy. But that part usually isn't explained as well.

The toy theorem in this post sidesteps the entire issue, making it clear that coherence is indeed not really about trades and adversaries and money-pumping. We didn't even mention any of those things.

Approximation

Though the theorem in this post is worded in terms of exact optimality, it extends pretty easily to approximate optimality. Basically, rather than "inconsistent preference between $F_A$ and $F_B$ implies $V[T, F_A] = V[T, F_B]$", we say "inconsistent preference between $F_A$ and $F_B$ implies the difference between $V[T, F_A]$ and $V[T, F_B]$ is at most $\epsilon$", and then the theorem talks about policies which achieve value within $\epsilon$ of optimal (or a sum of $\epsilon$'s at each timestep, or some such approximation error). So "coherence theorems only talk about optimal agents, and real-world agents aren't fully optimal" is yet another common criticism which mostly reflects ignorance.

Coherence Is About Revealed Preferences

Among actual academic decision theorists, frameworks for preferences which are not revealed preferences have gained some popularity in recent years. The theorem in this post illustrates that coherence is about revealed preferences.

Importantly, even when one is using some other model or notion of preferences, the system in question usually still has revealed preferences and coherence arguments will still apply to it. So if you're using some other notion of preferences, and want to see what coherence has to say about your agent, then you do need to look at its revealed preferences, and those may be different from whatever other kinds of "preferences" it has.

  1. ^

    I'm using square brackets here to evoke the mental picture of an array, since when solving this problem via dynamic programming we'd typically keep all this data in arrays.

  2. ^

    Note that the agent has no memory of its own other than the state. In order to derive a Bayesian agent we'd probably want to give it memory separate from the current state, i.e. allow the policy's choice at time $t$ to depend on all previous states.

  3. ^

    ...though notably in Rohin's case, the "all behavior can be rationalized as EU maximization" critique did not reflect ignorance, but rather (according to him) he knew it was misleading-in-isolation but used it to make a different point which he didn't know a better way to make.