Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjectured that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). Four months ago, we put up a bounty to prove this conjecture.
We've been bottlenecked pretty hard on this problem, and spent most of the last four months attacking it. At long last, we have a proof. As hoped, the proof comes with some qualitative new insights about natural latents, and we expect it will unbottleneck a bunch of future work. The main purpose of this post is to present the proof.
This post officially closes the corresponding bounty.
(This section is mostly copied from the bounty post.)
In the exact case, in order for a natural latent to exist over random variables $X_1, X_2$, the distribution $P[X_1, X_2]$ has to have a block structure: each value of $X_1$ and each value of $X_2$ occurs in only one "block", and within each block, $X_1$ and $X_2$ are independent. In that case, we can take the (exact) natural latent $\Lambda$ to be the block label.
Notably, that block label is a deterministic function of X.
However, we can also construct other natural latents for this system: we simply append some independent random noise to the block label. That natural latent is not a deterministic function of X; it's a "stochastic" natural latent.
In the exact case, if a stochastic natural latent exists, then the distribution must have the block structure described above, and therefore the block label is a deterministic natural latent. In other words: in the exact case, if a stochastic natural latent exists, then a deterministic natural latent also exists.
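For concreteness, here is a small toy example (ours, purely for illustration), with $X_1$ indexing rows and $X_2$ indexing columns, both valued in $\{1,2,3,4\}$:

$$P[X_1, X_2] = \begin{pmatrix} 0.12 & 0.18 & 0 & 0 \\ 0.08 & 0.12 & 0 & 0 \\ 0 & 0 & 0.20 & 0.05 \\ 0 & 0 & 0.20 & 0.05 \end{pmatrix}, \qquad \Lambda = \begin{cases} 1 & X_1 \in \{1,2\} \\ 2 & X_1 \in \{3,4\} \end{cases}$$

Within each block the rows are proportional, so $X_1$ and $X_2$ are independent given the block, and the block label $\Lambda$ can be read off from $X_1$ alone or from $X_2$ alone.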
Our goal here is to prove that this still holds in the approximate case, using the same information theoretic approximation methods used in our other posts on natural latents (and explained here).
Stochastic natural latents were introduced in the original Natural Latents post. Any latent $\Lambda$ over random variables $X_1, X_2$ is defined to be a stochastic natural latent when it satisfies the naturality diagrams (one mediation condition and two redundancy conditions), and is an approximate stochastic natural latent (with error $\epsilon$) when it satisfies the approximate versions of those diagrams to within $\epsilon$, i.e.

$$I(X_1; X_2 \mid \Lambda) \le \epsilon, \qquad I(\Lambda; X_1 \mid X_2) \le \epsilon, \qquad I(\Lambda; X_2 \mid X_1) \le \epsilon$$
Key thing to note: if $\Lambda$ satisfies these conditions, then we can create another stochastic natural latent by simply appending some random noise to $\Lambda$, independent of $X$. This shows that $\Lambda$ can, in general, contain arbitrary amounts of irrelevant noise while still satisfying the stochastic natural latent conditions.
Deterministic natural latents were introduced in a post by the same name. Any latent $\Lambda$ over random variables $X_1, X_2$ is defined to be a deterministic natural latent when it satisfies the corresponding diagrams (in which the latent appears multiple times), and is an approximate deterministic natural latent (with error $\epsilon$) when it satisfies the approximate versions of those diagrams to within $\epsilon$, i.e.

$$I(X_1; X_2 \mid \Lambda) \le \epsilon, \qquad H(\Lambda \mid X_1) \le \epsilon, \qquad H(\Lambda \mid X_2) \le \epsilon$$
See the linked post for explanation of a variable appearing multiple times in a diagram, and how the approximation conditions for those diagrams simplify to entropy bounds.
Note that the deterministic natural latent conditions, either with or without approximation, imply the stochastic natural latent conditions; a deterministic natural latent is also a stochastic natural latent.
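As a quick sanity check, the implication follows because each conditional mutual information in the stochastic redundancy conditions is bounded by the corresponding conditional entropy in the deterministic conditions, while the mediation condition is shared:

$$I(\Lambda; X_1 \mid X_2) \le H(\Lambda \mid X_2) \le \epsilon, \qquad I(\Lambda; X_2 \mid X_1) \le H(\Lambda \mid X_1) \le \epsilon$$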
We'd like a proof that, if a stochastic natural latent exists over two variables $X_1, X_2$ to within approximation $\epsilon$, then a deterministic natural latent exists over those two variables to within approximation roughly $O(\epsilon)$ - specifically, $9\epsilon$.
There are two key ideas to the proof.
The first key idea is to use resampling to obtain a latent which satisfies one of the natural latent conditions exactly, and the others approximately.
The second key idea is to consider pareto optimal stochastic natural latents - i.e. latents with pareto minimal error on the three natural latent conditions.
It turns out that stochastic natural latents which exactly satisfy one of the natural latent conditions and are pareto optimal work like the exact case, even when no exact natural latent exists.
Specifically: pareto optimal stochastic natural latents over $X_1, X_2$ which satisfy one redundancy condition exactly can always be coarse grained into a deterministic natural latent - i.e. there is a coarse graining $g$ such that $g(\Lambda)$ is itself a natural latent with the same errors as $\Lambda$, and $g(\Lambda)$ is exactly a deterministic function of $X$.
This was a big and very welcome surprise to us!
We will assume full support, $P[X, \Lambda] > 0$, for the original stochastic natural latent $\Lambda$. This implies that there are no nontrivial exact natural latents.[1] We make this assumption mainly for mathematical convenience; because we're interested in practical approximation it is a reasonable assumption to use. But we do expect the assumption can be dropped, at the cost of more details to handle in the proof.
The main preconditions for our proof are that three random variables $X_1$, $X_2$, $\Lambda$ approximately satisfy the three natural latent conditions, i.e. the approximate naturality diagrams - or, written out (and simplified a little):
First redundancy condition: $I(\Lambda; X_1 \mid X_2) \le \epsilon$
Second redundancy condition: $I(\Lambda; X_2 \mid X_1) \le \epsilon$
Mediation condition: $I(X_1; X_2 \mid \Lambda) \le \epsilon$
A previous post showed that resampling conserves redundancy. Specifically, we can construct a new latent $\Lambda'$ by sampling from $P[\Lambda \mid X_1]$; the new joint distribution is then

$$P[\Lambda' = \lambda, X_1 = x_1, X_2 = x_2] = P[X_1 = x_1, X_2 = x_2]\, P[\Lambda = \lambda \mid X_1 = x_1]$$
Given the two redundancy conditions $I(\Lambda; X_1 \mid X_2) \le \epsilon$ and $I(\Lambda; X_2 \mid X_1) \le \epsilon$, the new latent $\Lambda'$ satisfies the second redundancy condition perfectly (by construction), and satisfies the first redundancy condition to within a constant multiple of $\epsilon$. (Leveraging all three natural latent conditions, empirical tests also strongly suggest that that bound can be improved further, but that is not yet proven.)
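To spell out why the second redundancy condition is exact for $\Lambda'$: by construction $\Lambda'$ depends on $X$ only through $X_1$, so

$$P[\Lambda' = \lambda \mid X_1, X_2] = P[\Lambda = \lambda \mid X_1] = P[\Lambda' = \lambda \mid X_1] \quad\Rightarrow\quad I(\Lambda'; X_2 \mid X_1) = \mathbb{E}\left[\log \frac{P[\Lambda' \mid X_1, X_2]}{P[\Lambda' \mid X_1]}\right] = 0$$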
Now, imagine that instead of constructing $\Lambda'$ by sampling from $P[\Lambda \mid X_1]$, we instead construct a new $X_2'$ by sampling from $P[X_2 \mid X_1]$ (keeping $\Lambda$ itself as the latent). Key insight: this results in exactly the same joint distribution as sampling $\Lambda'$ from $P[\Lambda \mid X_1]$, i.e.

$$P[\Lambda = \lambda, X_1 = x_1, X_2' = x_2] = P[\Lambda' = \lambda, X_1 = x_1, X_2 = x_2]$$
By the same "resampling conserves redundancy" theorem - this time applied with $X_2$ playing the role of the latent over $(X_1, \Lambda)$ - the $X_2'$ construction satisfies the mediation condition to within a constant multiple of $\epsilon$ (rather than the first redundancy condition). But since the $\Lambda'$ construction and the $X_2'$ construction yield the same joint distribution, with one of them implying the first redundancy condition is satisfied to within that bound and the other implying the mediation condition is satisfied to within that bound, that same joint distribution must satisfy both conditions to within that bound.
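To verify the key insight, just expand both constructions' joint distributions over (latent, $X_1$, $X_2$):

$$\underbrace{P[X_1, X_2]\, P[\Lambda \mid X_1]}_{\Lambda'\text{ construction}} = P[X_1]\, P[X_2 \mid X_1]\, P[\Lambda \mid X_1] = \underbrace{P[X_1, \Lambda]\, P[X_2 \mid X_1]}_{X_2'\text{ construction}}$$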
Putting that all together: starting from a latent $\Lambda$ over $X_1, X_2$ which satisfies all three natural latent conditions to within $\epsilon$, we can construct a new latent $\Lambda'$ which satisfies the second redundancy condition perfectly, and satisfies the other two conditions to within a constant multiple of $\epsilon$.
We'll use that new latent $\Lambda'$ as our starting point for the second half of the proof, in which we look for pareto improvements upon $\Lambda'$.
In the second half of the proof, we'll consider taking pareto improvements upon $\Lambda'$, i.e. looking for a latent with pareto-optimal error on the three natural latent conditions. Since we're pareto improving on $\Lambda'$, the pareto optimal latent will have zero error on the second redundancy condition (i.e. exact conditional independence of the latent from $X_2$ given $X_1$), and at most $\Lambda'$'s error on the other two conditions.
First, we convert our pareto minimization problem into a single objective minimization problem in the standard way, in order to use the standard optimization toolset.
A latent $\Lambda$ is defined by its conditional distribution $P[\Lambda \mid X]$. We want to characterize latents for which the errors on the three natural latent conditions are pareto minimal, holding $P[X]$ constant. The three errors are:

$$\epsilon_1 := I(\Lambda; X_1 \mid X_2) = \sum_{x, \lambda} P[x]\, P[\lambda \mid x]\, \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_2]}$$

$$\epsilon_2 := I(\Lambda; X_2 \mid X_1) = \sum_{x, \lambda} P[x]\, P[\lambda \mid x]\, \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_1]}$$

$$\epsilon_3 := I(X_1; X_2 \mid \Lambda) = \sum_{x, \lambda} P[x]\, P[\lambda \mid x]\, \log\frac{P[\lambda \mid x]\, P[\lambda]}{P[\lambda \mid x_1]\, P[\lambda \mid x_2]} + I(X_1; X_2)$$
(Note that we've written these slightly differently from the previous section. They are equivalent, and these expressions will save some minor rearrangement in the proof.)
To use the usual optimization toolset, we convert the pareto minimization problem into a single objective minimization problem by assigning weights $\alpha_1, \alpha_2, \alpha_3 \ge 0$ to each error. Our single objective is

$$J := \alpha_1 \epsilon_1 + \alpha_2 \epsilon_2 + \alpha_3 \epsilon_3$$
Any pareto minimum for the original problem must be a minimum of $J$ for some $\alpha$ with $\alpha_i \ge 0$ for all indices $i$. Without loss of generality, we assume $\alpha_1 + \alpha_2 + \alpha_3 = 1$. In general, different $\alpha$'s in the single objective problem pick out different pareto minima in the original problem.
Now we turn the crank.
We (implicitly so far) have two constraints on our optimization problem: normalization and nonnegativity of the conditional distribution defining the latent, i.e.

$$\sum_\lambda P[\lambda \mid x] = 1 \text{ for all } x, \qquad P[\lambda \mid x] \ge 0 \text{ for all } x, \lambda$$
We introduce Lagrange multipliers $\gamma(x)$, $\eta(x, \lambda) \ge 0$, respectively, for these two constraints. That gives the Lagrangian

$$\mathcal{L} = \alpha_1 \epsilon_1 + \alpha_2 \epsilon_2 + \alpha_3 \epsilon_3 - \sum_x \gamma(x)\left(\sum_\lambda P[\lambda \mid x] - 1\right) - \sum_{x, \lambda} \eta(x, \lambda)\, P[\lambda \mid x]$$
Differentiating with respect to $P[\lambda \mid x]$ (at constant $P[X]$), simplifying, and absorbing some terms into the Lagrange multipliers yields our first order condition:

$$\alpha_1 \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_2]} + \alpha_2 \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_1]} + \alpha_3 \log\frac{P[\lambda \mid x]\, P[\lambda]}{P[\lambda \mid x_1]\, P[\lambda \mid x_2]} = \gamma(x) + \eta(x, \lambda)$$
where the Lagrange multiplier $\gamma(x)$ has absorbed some terms which depend only on $X$.
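For concreteness, here is a sketch of the derivative computation for the first redundancy term (the other two terms work the same way): the indirect contribution through $P[\lambda \mid x_2]$ cancels the "+1" from the direct term, and the remaining overall factor of $P[x] > 0$ just rescales the multipliers, so it can be divided out.

$$\frac{\partial}{\partial P[\lambda \mid x]}\, I(\Lambda; X_1 \mid X_2) = \underbrace{P[x]\left(\log\frac{P[\lambda \mid x]}{P[\lambda \mid x_2]} + 1\right)}_{\text{direct}} \;\underbrace{-\; P[x]}_{\text{via } P[\lambda \mid x_2]} = P[x]\, \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_2]}$$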
Note that, while the $\eta(x, \lambda)$ term looks completely arbitrary, it is constrained by complementary slackness: $\eta(x, \lambda)$ can be nonzero only if $P[\lambda \mid x]$ is zero, i.e. $\eta(x, \lambda)\, P[\lambda \mid x] = 0$ for all $x, \lambda$.
Earlier, we established that a latent $\Lambda'$ exists which satisfies the second redundancy condition perfectly and has error at most a constant multiple of $\epsilon$ on the other two conditions. So, let's use our first order condition to characterize specifically those pareto optimal latents which perfectly satisfy the second redundancy condition.
Perfect satisfaction of the second redundancy condition means $P[\lambda \mid x] = P[\lambda \mid x_1]$ for all $x, \lambda$. Substituting that into the first order condition and simplifying gives

$$\alpha_1 \log\frac{P[\lambda \mid x_1]}{P[\lambda \mid x_2]} + \alpha_3 \log\frac{P[\lambda]}{P[\lambda \mid x_2]} = \gamma(x) + \eta(x, \lambda)$$
Now, pick two latent values $\lambda_1, \lambda_2$ such that $P[\lambda_1 \mid x] > 0$ and $P[\lambda_2 \mid x] > 0$. Then $\eta(x, \lambda_1) = \eta(x, \lambda_2) = 0$ by complementary slackness, and we can subtract the first order conditions at $(x, \lambda_1)$ and $(x, \lambda_2)$ to get

$$\alpha_1 \log\frac{P[\lambda_1 \mid x_1]}{P[\lambda_2 \mid x_1]} + \alpha_3 \log\frac{P[\lambda_1]}{P[\lambda_2]} - (\alpha_1 + \alpha_3)\log\frac{P[\lambda_1 \mid x_2]}{P[\lambda_2 \mid x_2]} = 0$$
Note that one of those terms depends on $x_1$ (but not $x_2$), and the other depends on $x_2$ (but not $x_1$), so the only way they can add to 0 for all $x$ values for which $P[\lambda_1 \mid x] > 0$ and $P[\lambda_2 \mid x] > 0$ is if both are equal to some constant $c$ which does not depend on $x$:

$$\alpha_1 \log\frac{P[\lambda_1 \mid x_1]}{P[\lambda_2 \mid x_1]} + \alpha_3 \log\frac{P[\lambda_1]}{P[\lambda_2]} = c, \qquad (\alpha_1 + \alpha_3)\log\frac{P[\lambda_1 \mid x_2]}{P[\lambda_2 \mid x_2]} = c$$
Both of those equations must hold for all $X$ such that $P[\lambda_1 \mid X] > 0$ and $P[\lambda_2 \mid X] > 0$.
Notably, our assumption $P[X, \Lambda] > 0$[2] implies $P[X] > 0$, which implies $P[X_2 = x_2 \mid X_1 = x_1] > 0$ for all $x_1, x_2$. Combined with $P[\lambda \mid x] = P[\lambda \mid x_1]$ (the perfect second redundancy condition for the latents we're characterizing), that means the condition "$P[\lambda \mid X] > 0$" depends only on $x_1$, not on $x_2$ - or, put differently, for any two values $\lambda_1$, $\lambda_2$ with overlapping support, there exists some $x_1$ such that both equations above hold for every $x_2$. So,
$$(\alpha_1 + \alpha_3)\log\frac{P[\lambda_1 \mid x_2]}{P[\lambda_2 \mid x_2]} = c$$

must hold for all $x_2$, whenever the supports of $P[X_1 \mid \lambda_1]$ and $P[X_1 \mid \lambda_2]$ overlap at all. Shuffling terms around (using Bayes' rule $P[\lambda \mid x_2] = P[x_2 \mid \lambda]\, P[\lambda]/P[x_2]$), we get

$$P[x_2 \mid \lambda_1] = e^{c/(\alpha_1 + \alpha_3)}\, \frac{P[\lambda_2]}{P[\lambda_1]}\, P[x_2 \mid \lambda_2]$$
Sum on $x_2$ on both sides, and we get $1 = e^{c/(\alpha_1 + \alpha_3)}\, \frac{P[\lambda_2]}{P[\lambda_1]}$, implying $e^{c/(\alpha_1 + \alpha_3)} = \frac{P[\lambda_1]}{P[\lambda_2]}$ and therefore $P[X_2 \mid \lambda_1] = P[X_2 \mid \lambda_2]$.
In short: given two latent values $\lambda_1, \lambda_2$, if there exists any $x$ such that $P[\lambda_1 \mid x] > 0$ and $P[\lambda_2 \mid x] > 0$ (i.e. the support of $P[X \mid \Lambda]$ overlaps for the two values), then the two values yield exactly the same distribution $P[X_2 \mid \Lambda]$.
Furthermore, since $c = (\alpha_1 + \alpha_3)\log\frac{P[\lambda_1]}{P[\lambda_2]}$ whenever $\lambda_1$ and $\lambda_2$ have overlapping support, the $x_1$ equation also gives

$$P[x_1 \mid \lambda_1] = P[x_1 \mid \lambda_2]$$

anywhere that both of those quantities are nonzero.
A quick recap of where that last section leaves us. We've established that:

- The resampled latent $\Lambda'$ satisfies the second redundancy condition exactly, and the other two conditions to within a constant multiple of $\epsilon$.
- Any pareto optimal latent $\Lambda$ which (like $\Lambda'$) satisfies the second redundancy condition exactly has the property that, whenever two of its values $\lambda_1, \lambda_2$ have overlapping support, those two values yield the same distribution $P[X_2 \mid \lambda]$, and the same $P[x_1 \mid \lambda]$ wherever both are nonzero.
Now, assume the supports of $P[X \mid \lambda_1]$ and $P[X \mid \lambda_2]$ overlap somewhere, and consider coarse graining those two values of $\Lambda$. Compared to $\Lambda$ itself, how does the coarse grained variable score on each of the natural latent conditions?
So, without making the errors on any of the three natural latent conditions any worse, we can coarse grain all pairs of $\Lambda$ values for which the supports of $P[X \mid \lambda_1]$ and $P[X \mid \lambda_2]$ overlap somewhere.
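For the two redundancy conditions, "no worse" follows directly from the (conditional) data processing inequality, since the coarse grained variable is a deterministic function of $\Lambda$:

$$I(g(\Lambda); X_1 \mid X_2) \le I(\Lambda; X_1 \mid X_2), \qquad I(g(\Lambda); X_2 \mid X_1) \le I(\Lambda; X_2 \mid X_1) = 0$$

The mediation condition is where the structure established above - the merged values inducing the same distributions over $X$ wherever both are nonzero - comes in.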
Once all such coarse graining is performed, we have a new coarse grained latent $g(\Lambda)$ for which the support of $P[X \mid g(\Lambda)]$ is nonoverlapping for all pairs of $g(\Lambda)$ values.
In other words: $g(\Lambda)$ is exactly a deterministic function of $X_1$ (and therefore still perfectly satisfies the second redundancy condition), and satisfies the first redundancy condition and mediation condition each to within a constant multiple of $\epsilon$.
Lastly, note that $H(g(\Lambda) \mid X_1) = 0$ and the first redundancy condition holding to within that bound together imply $H(g(\Lambda) \mid X_2)$ is at most that same bound. Combined with the mediation condition, that implies $g(\Lambda)$ is approximately a deterministic natural latent, with errors at most: zero on $H(g(\Lambda) \mid X_1)$, the first redundancy error on $H(g(\Lambda) \mid X_2)$, and the mediation error on $I(X_1; X_2 \mid g(\Lambda))$ - each a constant multiple of the original $\epsilon$.
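Spelling out the entropy bound via the chain rule:

$$H(g(\Lambda) \mid X_2) = \underbrace{H(g(\Lambda) \mid X_1, X_2)}_{=\,0\text{, since } g(\Lambda)\text{ is a function of } X_1} + \; I(g(\Lambda); X_1 \mid X_2)$$

so the first redundancy error directly bounds $H(g(\Lambda) \mid X_2)$, while $H(g(\Lambda) \mid X_1) = 0$ exactly.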
The main room for improvement of the bounds in this proof is in the resampling step. The resampling conserves redundancy post notes where those bounds could be improved, and presents a little empirical evidence that they can be improved - to a tighter bound when leveraging all three natural latent conditions, or a somewhat looser one when using only redundancy.
We've been bottlenecked pretty hard on this theorem for the past 3-4 months.
Now that we finally have it, we expect to largely abandon stochastic natural latents in favor of deterministic natural latents. For instance, one immediate next step will be to rewrite our Illiad paper from last year to work with deterministic natural latents, which will eliminate the weakest parts of that paper and give a much more compelling case. (No, we're not linking to the old paper, because the new one is going to be a lot better.)
On another front: stochastic natural latents are relatively easy to test for in datasets, by looking for three variables each of which mediates between the other two. Now we have some idea of what to do with those triples when we find them: compute the deterministic constraint between them.
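As a rough illustration of that kind of test (our sketch, not from any of the linked posts; the function names, estimator, and thresholding rule here are all our own choices), here's how one might scan a table of discrete columns for triples in which each variable approximately mediates between the other two:

```python
import numpy as np
from itertools import combinations

def cond_mutual_info(x, y, z):
    """Estimate I(X;Y|Z) in bits from three integer-coded discrete columns."""
    data = np.stack([np.asarray(x), np.asarray(y), np.asarray(z)], axis=1)
    n = len(data)

    def probs(cols):
        # Empirical joint distribution over the selected columns.
        keys, counts = np.unique(data[:, cols], axis=0, return_counts=True)
        return {tuple(k): c / n for k, c in zip(keys, counts)}

    p_xyz, p_xz, p_yz, p_z = probs([0, 1, 2]), probs([0, 2]), probs([1, 2]), probs([2])
    # I(X;Y|Z) = sum_{x,y,z} p(x,y,z) * log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]
    return sum(p * np.log2(p * p_z[(zi,)] / (p_xz[(xi, zi)] * p_yz[(yi, zi)]))
               for (xi, yi, zi), p in p_xyz.items())

def candidate_natural_triples(columns, eps):
    """Return triples (i, j, k) where each variable approximately mediates between
    the other two, along with the three estimated mediation errors (in bits)."""
    hits = []
    for i, j, k in combinations(range(len(columns)), 3):
        errs = (cond_mutual_info(columns[j], columns[k], columns[i]),  # column i as mediator
                cond_mutual_info(columns[i], columns[k], columns[j]),  # column j as mediator
                cond_mutual_info(columns[i], columns[j], columns[k]))  # column k as mediator
        if max(errs) <= eps:
            hits.append(((i, j, k), errs))
    return hits
```

Each returned triple comes with its three estimated conditional mutual informations, which play the role of the mediation errors above; the deterministic-constraint step would then be applied to the flagged triples.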
Beyond those two immediate projects, we expect this result to be foundational for basically all of our work on natural latents going forward.