Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjectured that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). Four months ago, we put up a bounty to prove this conjecture.
We've been bottlenecked pretty hard on this problem, and spent most of the last four months attacking it. At long last, we have a proof. As hoped, the proof comes with some qualitative new insights about natural latents, and we expect it will unbottleneck a bunch of future work. The main purpose of this post is to present the proof.
This post officially closes the corresponding bounty.
(This section is mostly copied from the bounty post.)
In the exact case, in order for a natural latent to exist over random variables $X_1, X_2$, the distribution $P[X_1, X_2]$ has to have a block structure: each value of $X_1$ and each value of $X_2$ occurs in only one "block", and within each block, $X_1$ and $X_2$ are independent. In that case, we can take the (exact) natural latent $\Lambda$ to be the block label.
Notably, that block label is a deterministic function of X.
However, we can also construct other natural latents for this system: we simply append some independent random noise to the block label. That natural latent is not a deterministic function of X; it's a "stochastic" natural latent.
In the exact case, if a stochastic natural latent exists, then the distribution must have the block structure described above, and therefore the block label is a deterministic natural latent. In other words: in the exact case, if a stochastic natural latent exists, then a deterministic natural latent also exists.
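For concreteness, here is a small toy example (ours, purely for illustration), with $X_1$ indexing rows and $X_2$ indexing columns, both valued in $\{1,2,3,4\}$:

$$P[X_1, X_2] = \begin{pmatrix} 0.12 & 0.18 & 0 & 0 \\ 0.08 & 0.12 & 0 & 0 \\ 0 & 0 & 0.20 & 0.05 \\ 0 & 0 & 0.20 & 0.05 \end{pmatrix}, \qquad \Lambda = \begin{cases} 1 & X_1 \in \{1,2\} \\ 2 & X_1 \in \{3,4\} \end{cases}$$

Within each block the rows are proportional, so $X_1$ and $X_2$ are independent given the block, and the block label $\Lambda$ can be read off from $X_1$ alone or from $X_2$ alone.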
Our goal here is to prove that this still holds in the approximate case, using the same information theoretic approximation methods used in our other posts on natural latents (and explained here).
Stochastic natural latents were introduced in the original Natural Latents post. Any latent $\Lambda$ over random variables $X_1, X_2$ is defined to be a stochastic natural latent when it satisfies the naturality diagrams (one mediation condition and two redundancy conditions), and is an approximate stochastic natural latent (with error $\epsilon$) when it satisfies the approximate versions of those diagrams to within $\epsilon$, i.e.

$$I(X_1; X_2 \mid \Lambda) \le \epsilon, \qquad I(\Lambda; X_1 \mid X_2) \le \epsilon, \qquad I(\Lambda; X_2 \mid X_1) \le \epsilon$$
Key thing to note: if $\Lambda$ satisfies these conditions, then we can create another stochastic natural latent by simply appending some random noise to $\Lambda$, independent of $X$. This shows that $\Lambda$ can, in general, contain arbitrary amounts of irrelevant noise while still satisfying the stochastic natural latent conditions.
Deterministic natural latents were introduced in a post by the same name. Any latent $\Lambda$ over random variables $X_1, X_2$ is defined to be a deterministic natural latent when it satisfies the corresponding diagrams (in which the latent appears multiple times), and is an approximate deterministic natural latent (with error $\epsilon$) when it satisfies the approximate versions of those diagrams to within $\epsilon$, i.e.

$$I(X_1; X_2 \mid \Lambda) \le \epsilon, \qquad H(\Lambda \mid X_1) \le \epsilon, \qquad H(\Lambda \mid X_2) \le \epsilon$$
See the linked post for explanation of a variable appearing multiple times in a diagram, and how the approximation conditions for those diagrams simplify to entropy bounds.
Note that the deterministic natural latent conditions, either with or without approximation, imply the stochastic natural latent conditions; a deterministic natural latent is also a stochastic natural latent.
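As a quick sanity check, the implication follows because each conditional mutual information in the stochastic redundancy conditions is bounded by the corresponding conditional entropy in the deterministic conditions, while the mediation condition is shared:

$$I(\Lambda; X_1 \mid X_2) \le H(\Lambda \mid X_2) \le \epsilon, \qquad I(\Lambda; X_2 \mid X_1) \le H(\Lambda \mid X_1) \le \epsilon$$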
We'd like a proof that, if a stochastic natural latent exists over two variables $X_1, X_2$ to within approximation $\epsilon$, then a deterministic natural latent exists over those two variables to within approximation roughly $O(\epsilon)$ - specifically, $9\epsilon$.
There are two key ideas to the proof.
The first key idea is to use resampling to obtain a latent which satisfies one of the natural latent conditions exactly, and the others approximately.
The second key idea is to consider pareto optimal stochastic natural latents - i.e. latents with pareto minimal error on the three natural latent conditions.
It turns out that stochastic natural latents which exactly satisfy one of the natural latent conditions and are pareto optimal work like the exact case, even when no exact natural latent exists.
Specifically: pareto optimal stochastic natural latents over $X_1, X_2$ which satisfy one redundancy condition exactly can always be coarse grained into a deterministic natural latent - i.e. there is a coarse graining $g$ such that $g(\Lambda)$ is itself a natural latent with the same errors as $\Lambda$, and $g(\Lambda)$ is exactly a deterministic function of $X$.
This was a big and very welcome surprise to us!
We will assume full support, $P[X, \Lambda] > 0$, for the original stochastic natural latent $\Lambda$. This implies that there are no nontrivial exact natural latents.[1] We make this assumption mainly for mathematical convenience; because we're interested in practical approximation it is a reasonable assumption to use. But we do expect the assumption can be dropped, at the cost of more details to handle in the proof.
The main preconditions for our proof are that three random variables $X_1$, $X_2$, $\Lambda$ approximately satisfy the three natural latent conditions, i.e. the approximate naturality diagrams - or, written out (and simplified a little):
First redundancy condition: $I(\Lambda; X_1 \mid X_2) \le \epsilon$
Second redundancy condition: $I(\Lambda; X_2 \mid X_1) \le \epsilon$
Mediation condition: $I(X_1; X_2 \mid \Lambda) \le \epsilon$
A previous post showed that resampling conserves redundancy. Specifically, we can construct a new latent $\Lambda'$ by sampling from $P[\Lambda \mid X_1]$; the new joint distribution is then

$$P[\Lambda' = \lambda, X_1 = x_1, X_2 = x_2] = P[X_1 = x_1, X_2 = x_2]\, P[\Lambda = \lambda \mid X_1 = x_1]$$
Given the two redundancy conditions $I(\Lambda; X_1 \mid X_2) \le \epsilon$ and $I(\Lambda; X_2 \mid X_1) \le \epsilon$, the new latent $\Lambda'$ satisfies the second redundancy condition perfectly (by construction), and satisfies the first redundancy condition to within a constant multiple of $\epsilon$. (Leveraging all three natural latent conditions, empirical tests also strongly suggest that that bound can be improved further, but that is not yet proven.)
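To spell out why the second redundancy condition is exact for $\Lambda'$: by construction $\Lambda'$ depends on $X$ only through $X_1$, so

$$P[\Lambda' = \lambda \mid X_1, X_2] = P[\Lambda = \lambda \mid X_1] = P[\Lambda' = \lambda \mid X_1] \quad\Rightarrow\quad I(\Lambda'; X_2 \mid X_1) = \mathbb{E}\left[\log \frac{P[\Lambda' \mid X_1, X_2]}{P[\Lambda' \mid X_1]}\right] = 0$$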
Now, imagine that instead of constructing $\Lambda'$ by sampling from $P[\Lambda \mid X_1]$, we instead construct a new $X_2'$ by sampling from $P[X_2 \mid X_1]$ (keeping $\Lambda$ itself as the latent). Key insight: this results in exactly the same joint distribution as sampling $\Lambda'$ from $P[\Lambda \mid X_1]$, i.e.

$$P[\Lambda = \lambda, X_1 = x_1, X_2' = x_2] = P[\Lambda' = \lambda, X_1 = x_1, X_2 = x_2]$$
By the same "resampling conserves redundancy" theorem - this time applied with $X_2$ playing the role of the latent over $(X_1, \Lambda)$ - the $X_2'$ construction satisfies the mediation condition to within a constant multiple of $\epsilon$ (rather than the first redundancy condition). But since the $\Lambda'$ construction and the $X_2'$ construction yield the same joint distribution, with one of them implying the first redundancy condition is satisfied to within that bound and the other implying the mediation condition is satisfied to within that bound, that same joint distribution must satisfy both conditions to within that bound.
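To verify the key insight, just expand both constructions' joint distributions over (latent, $X_1$, $X_2$):

$$\underbrace{P[X_1, X_2]\, P[\Lambda \mid X_1]}_{\Lambda'\text{ construction}} = P[X_1]\, P[X_2 \mid X_1]\, P[\Lambda \mid X_1] = \underbrace{P[X_1, \Lambda]\, P[X_2 \mid X_1]}_{X_2'\text{ construction}}$$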
Putting that all together: starting from a latent $\Lambda$ over $X_1, X_2$ which satisfies all three natural latent conditions to within $\epsilon$, we can construct a new latent $\Lambda'$ which satisfies the second redundancy condition perfectly, and satisfies the other two conditions to within a constant multiple of $\epsilon$.
We'll use that new latent $\Lambda'$ as our starting point for the second half of the proof, in which we look for pareto improvements upon $\Lambda'$.
In the second half of the proof, we'll consider taking pareto improvements upon $\Lambda'$, i.e. looking for a latent with pareto-optimal error on the three natural latent conditions. Since we're pareto improving on $\Lambda'$, the pareto optimal latent will have zero error on the second redundancy condition (i.e. exact conditional independence of the latent from $X_2$ given $X_1$), and at most $\Lambda'$'s error on the other two conditions.
First, we convert our pareto minimization problem into a single objective minimization problem in the standard way, in order to use the standard optimization toolset.
A latent $\Lambda$ is defined by its conditional distribution $P[\Lambda \mid X]$. We want to characterize latents for which the errors on the three natural latent conditions are pareto minimal, holding $P[X]$ constant. The three errors are:

$$\epsilon_1 := I(\Lambda; X_1 \mid X_2) = \sum_{x, \lambda} P[x]\, P[\lambda \mid x]\, \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_2]}$$

$$\epsilon_2 := I(\Lambda; X_2 \mid X_1) = \sum_{x, \lambda} P[x]\, P[\lambda \mid x]\, \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_1]}$$

$$\epsilon_3 := I(X_1; X_2 \mid \Lambda) = \sum_{x, \lambda} P[x]\, P[\lambda \mid x]\, \log\frac{P[\lambda \mid x]\, P[\lambda]}{P[\lambda \mid x_1]\, P[\lambda \mid x_2]} + I(X_1; X_2)$$
(Note that we've written these slightly differently from the previous section. They are equivalent, and these expressions will save some minor rearrangement in the proof.)
To use the usual optimization toolset, we convert the pareto minimization problem into a single objective minimization problem by assigning weights $\alpha_1, \alpha_2, \alpha_3 \ge 0$ to each error. Our single objective is

$$J := \alpha_1 \epsilon_1 + \alpha_2 \epsilon_2 + \alpha_3 \epsilon_3$$
Any pareto minimum for the original problem must be a minimum of $J$ for some $\alpha$ with $\alpha_i \ge 0$ for all indices $i$. Without loss of generality, we assume $\alpha_1 + \alpha_2 + \alpha_3 = 1$. In general, different $\alpha$'s in the single objective problem pick out different pareto minima in the original problem.
Now we turn the crank.
We (implicitly so far) have two constraints on our optimization problem: normalization and nonnegativity of the conditional distribution defining the latent, i.e.

$$\sum_\lambda P[\lambda \mid x] = 1 \text{ for all } x, \qquad P[\lambda \mid x] \ge 0 \text{ for all } x, \lambda$$
We introduce Lagrange multipliers $\gamma(x)$, $\eta(x, \lambda) \ge 0$, respectively, for these two constraints. That gives the Lagrangian

$$\mathcal{L} = \alpha_1 \epsilon_1 + \alpha_2 \epsilon_2 + \alpha_3 \epsilon_3 - \sum_x \gamma(x)\left(\sum_\lambda P[\lambda \mid x] - 1\right) - \sum_{x, \lambda} \eta(x, \lambda)\, P[\lambda \mid x]$$
Differentiating with respect to $P[\lambda \mid x]$ (at constant $P[X]$), simplifying, and absorbing some terms into the Lagrange multipliers yields our first order condition:

$$\alpha_1 \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_2]} + \alpha_2 \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_1]} + \alpha_3 \log\frac{P[\lambda \mid x]\, P[\lambda]}{P[\lambda \mid x_1]\, P[\lambda \mid x_2]} = \gamma(x) + \eta(x, \lambda)$$
where the Lagrange multiplier $\gamma(x)$ has absorbed some terms which depend only on $X$.
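For concreteness, here is a sketch of the derivative computation for the first redundancy term (the other two terms work the same way): the indirect contribution through $P[\lambda \mid x_2]$ cancels the "+1" from the direct term, and the remaining overall factor of $P[x] > 0$ just rescales the multipliers, so it can be divided out.

$$\frac{\partial}{\partial P[\lambda \mid x]}\, I(\Lambda; X_1 \mid X_2) = \underbrace{P[x]\left(\log\frac{P[\lambda \mid x]}{P[\lambda \mid x_2]} + 1\right)}_{\text{direct}} \;\underbrace{-\; P[x]}_{\text{via } P[\lambda \mid x_2]} = P[x]\, \log\frac{P[\lambda \mid x]}{P[\lambda \mid x_2]}$$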
Note that, while the $\eta(x, \lambda)$ term looks completely arbitrary, it is constrained by complementary slackness: $\eta(x, \lambda)$ can be nonzero only if $P[\lambda \mid x]$ is zero, i.e. $\eta(x, \lambda)\, P[\lambda \mid x] = 0$ for all $x, \lambda$.
Earlier, we established that a latent $\Lambda'$ exists which satisfies the second redundancy condition perfectly and has error at most a constant multiple of $\epsilon$ on the other two conditions. So, let's use our first order condition to characterize specifically those pareto optimal latents which perfectly satisfy the second redundancy condition.
Perfect satisfaction of the second redundancy condition means $P[\lambda \mid x] = P[\lambda \mid x_1]$ for all $x, \lambda$. Substituting that into the first order condition and simplifying gives

$$\alpha_1 \log\frac{P[\lambda \mid x_1]}{P[\lambda \mid x_2]} + \alpha_3 \log\frac{P[\lambda]}{P[\lambda \mid x_2]} = \gamma(x) + \eta(x, \lambda)$$
Now, pick two latent values $\lambda_1, \lambda_2$ such that $P[\lambda_1 \mid x] > 0$ and $P[\lambda_2 \mid x] > 0$. Then $\eta(x, \lambda_1) = \eta(x, \lambda_2) = 0$ by complementary slackness, and we can subtract the first order conditions at $(x, \lambda_1)$ and $(x, \lambda_2)$ to get

$$\alpha_1 \log\frac{P[\lambda_1 \mid x_1]}{P[\lambda_2 \mid x_1]} + \alpha_3 \log\frac{P[\lambda_1]}{P[\lambda_2]} - (\alpha_1 + \alpha_3)\log\frac{P[\lambda_1 \mid x_2]}{P[\lambda_2 \mid x_2]} = 0$$
Note that one of those terms depends on $x_1$ (but not $x_2$), and the other depends on $x_2$ (but not $x_1$), so the only way they can add to 0 for all $x$ values for which $P[\lambda_1 \mid x] > 0$ and $P[\lambda_2 \mid x] > 0$ is if both are equal to some constant $c$ which does not depend on $x$:

$$\alpha_1 \log\frac{P[\lambda_1 \mid x_1]}{P[\lambda_2 \mid x_1]} + \alpha_3 \log\frac{P[\lambda_1]}{P[\lambda_2]} = c, \qquad (\alpha_1 + \alpha_3)\log\frac{P[\lambda_1 \mid x_2]}{P[\lambda_2 \mid x_2]} = c$$
Both of those equations must hold for all $X$ such that $P[\lambda_1 \mid X] > 0$ and $P[\lambda_2 \mid X] > 0$.
Notably, our assumption $P[X, \Lambda] > 0$[2] implies $P[X] > 0$, which implies $P[X_2 = x_2 \mid X_1 = x_1] > 0$ for all $x_1, x_2$. Combined with $P[\lambda \mid x] = P[\lambda \mid x_1]$ (the perfect second redundancy condition for the latents we're characterizing), that means the condition "$P[\lambda \mid X] > 0$" depends only on $x_1$, not on $x_2$ - or, put differently, for any two values $\lambda_1$, $\lambda_2$ with overlapping support, there exists some $x_1$ such that both equations above hold for every $x_2$. So,
$$(\alpha_1 + \alpha_3)\log\frac{P[\lambda_1 \mid x_2]}{P[\lambda_2 \mid x_2]} = c$$

must hold for all $x_2$, whenever the supports of $P[X_1 \mid \lambda_1]$ and $P[X_1 \mid \lambda_2]$ overlap at all. Shuffling terms around (using Bayes' rule $P[\lambda \mid x_2] = P[x_2 \mid \lambda]\, P[\lambda]/P[x_2]$), we get

$$P[x_2 \mid \lambda_1] = e^{c/(\alpha_1 + \alpha_3)}\, \frac{P[\lambda_2]}{P[\lambda_1]}\, P[x_2 \mid \lambda_2]$$
Sum on $x_2$ on both sides, and we get $1 = e^{c/(\alpha_1 + \alpha_3)}\, \frac{P[\lambda_2]}{P[\lambda_1]}$, implying $e^{c/(\alpha_1 + \alpha_3)} = \frac{P[\lambda_1]}{P[\lambda_2]}$ and therefore $P[X_2 \mid \lambda_1] = P[X_2 \mid \lambda_2]$.
In short: given two latent values $\lambda_1, \lambda_2$, if there exists any $x$ such that $P[\lambda_1 \mid x] > 0$ and $P[\lambda_2 \mid x] > 0$ (i.e. the support of $P[X \mid \Lambda]$ overlaps for the two values), then the two values yield exactly the same distribution $P[X_2 \mid \Lambda]$.
Furthermore, since $c = (\alpha_1 + \alpha_3)\log\frac{P[\lambda_1]}{P[\lambda_2]}$ whenever $\lambda_1$ and $\lambda_2$ have overlapping support, the $x_1$ equation also gives

$$P[x_1 \mid \lambda_1] = P[x_1 \mid \lambda_2]$$

anywhere that both of those quantities are nonzero.
A quick recap of where that last section leaves us. We've established that:

- The resampled latent $\Lambda'$ satisfies the second redundancy condition exactly, and the other two conditions to within a constant multiple of $\epsilon$.
- Any pareto optimal latent $\Lambda$ which (like $\Lambda'$) satisfies the second redundancy condition exactly has the property that, whenever two of its values $\lambda_1, \lambda_2$ have overlapping support, those two values yield the same distribution $P[X_2 \mid \lambda]$, and the same $P[x_1 \mid \lambda]$ wherever both are nonzero.
Now, assume the supports of $P[X \mid \lambda_1]$ and $P[X \mid \lambda_2]$ overlap somewhere, and consider coarse graining those two values of $\Lambda$. Compared to $\Lambda$ itself, how does the coarse grained variable score on each of the natural latent conditions?
So, without making the errors on any of the three natural latent conditions any worse, we can coarse grain all pairs of $\Lambda$ values for which the supports of $P[X \mid \lambda_1]$ and $P[X \mid \lambda_2]$ overlap somewhere.
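For the two redundancy conditions, "no worse" follows directly from the (conditional) data processing inequality, since the coarse grained variable is a deterministic function of $\Lambda$:

$$I(g(\Lambda); X_1 \mid X_2) \le I(\Lambda; X_1 \mid X_2), \qquad I(g(\Lambda); X_2 \mid X_1) \le I(\Lambda; X_2 \mid X_1) = 0$$

The mediation condition is where the structure established above - the merged values inducing the same distributions over $X$ wherever both are nonzero - comes in.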
Once all such coarse graining is performed, we have a new coarse grained latent $g(\Lambda)$ for which the support of $P[X \mid g(\Lambda)]$ is nonoverlapping for all pairs of $g(\Lambda)$ values.
In other words: $g(\Lambda)$ is exactly a deterministic function of $X_1$ (and therefore still perfectly satisfies the second redundancy condition), and satisfies the first redundancy condition and mediation condition each to within a constant multiple of $\epsilon$.
Lastly, note that $H(g(\Lambda) \mid X_1) = 0$ and the first redundancy condition holding to within that bound together imply $H(g(\Lambda) \mid X_2)$ is at most that same bound. Combined with the mediation condition, that implies $g(\Lambda)$ is approximately a deterministic natural latent, with errors at most: zero on $H(g(\Lambda) \mid X_1)$, the first redundancy error on $H(g(\Lambda) \mid X_2)$, and the mediation error on $I(X_1; X_2 \mid g(\Lambda))$ - each a constant multiple of the original $\epsilon$.
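Spelling out the entropy bound via the chain rule:

$$H(g(\Lambda) \mid X_2) = \underbrace{H(g(\Lambda) \mid X_1, X_2)}_{=\,0\text{, since } g(\Lambda)\text{ is a function of } X_1} + \; I(g(\Lambda); X_1 \mid X_2)$$

so the first redundancy error directly bounds $H(g(\Lambda) \mid X_2)$, while $H(g(\Lambda) \mid X_1) = 0$ exactly.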
The main room for improvement of the bounds in this proof is in the resampling step. The resampling conserves redundancy post notes where those bounds could be improved, and presents a little empirical evidence that they can be improved - to a tighter bound when leveraging all three natural latent conditions, or a somewhat looser one when using only redundancy.
We've been bottlenecked pretty hard on this theorem for the past 3-4 months.
Now that we finally have it, we expect to largely abandon stochastic natural latents in favor of deterministic natural latents. For instance, one immediate next step will be to rewrite our Illiad paper from last year to work with deterministic natural latents, which will eliminate the weakest parts of that paper and give a much more compelling case. (No, we're not linking to the old paper, because the new one is going to be a lot better.)
On another front: stochastic natural latents are relatively easy to test for in datasets, by looking for three variables each of which mediates between the other two. Now we have some idea of what to do with those triples when we find them: compute the deterministic constraint between them.
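As a rough illustration of that kind of test (our sketch, not from any of the linked posts; the function names, estimator, and thresholding rule here are all our own choices), here's how one might scan a table of discrete columns for triples in which each variable approximately mediates between the other two:

```python
import numpy as np
from itertools import combinations

def cond_mutual_info(x, y, z):
    """Estimate I(X;Y|Z) in bits from three integer-coded discrete columns."""
    data = np.stack([np.asarray(x), np.asarray(y), np.asarray(z)], axis=1)
    n = len(data)

    def probs(cols):
        # Empirical joint distribution over the selected columns.
        keys, counts = np.unique(data[:, cols], axis=0, return_counts=True)
        return {tuple(k): c / n for k, c in zip(keys, counts)}

    p_xyz, p_xz, p_yz, p_z = probs([0, 1, 2]), probs([0, 2]), probs([1, 2]), probs([2])
    # I(X;Y|Z) = sum_{x,y,z} p(x,y,z) * log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]
    return sum(p * np.log2(p * p_z[(zi,)] / (p_xz[(xi, zi)] * p_yz[(yi, zi)]))
               for (xi, yi, zi), p in p_xyz.items())

def candidate_natural_triples(columns, eps):
    """Return triples (i, j, k) where each variable approximately mediates between
    the other two, along with the three estimated mediation errors (in bits)."""
    hits = []
    for i, j, k in combinations(range(len(columns)), 3):
        errs = (cond_mutual_info(columns[j], columns[k], columns[i]),  # column i as mediator
                cond_mutual_info(columns[i], columns[k], columns[j]),  # column j as mediator
                cond_mutual_info(columns[i], columns[j], columns[k]))  # column k as mediator
        if max(errs) <= eps:
            hits.append(((i, j, k), errs))
    return hits
```

Each returned triple comes with its three estimated conditional mutual informations, which play the role of the mediation errors above; the deterministic-constraint step would then be applied to the flagged triples.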
Beyond those two immediate projects, we expect this result to be foundational for basically all of our work on natural latents going forward.