Suppose random variables $X_1$ and $X_2$ contain approximately the same information about a third random variable $\Lambda$, i.e. both of the following diagrams are satisfied to within approximation $\epsilon$:
We call $\Lambda$ a "redund" over $X_1, X_2$, since conceptually, any information $X := (X_1, X_2)$ contains about $\Lambda$ must be redundantly represented in both $X_1$ and $X_2$ (to within approximation).
Here's an intuitive claim which is surprisingly tricky to prove: suppose we construct a new variable $\Lambda'$ by sampling from $P[\Lambda|X_1]$, so the new joint distribution is
$$P[X_1, X_2, \Lambda' = \lambda] = P[X_1, X_2]\,P[\Lambda = \lambda \mid X_1]$$
By construction, this "resampled" variable $\Lambda'$ satisfies one of the two redundancy diagrams perfectly: $X_2 \to X_1 \to \Lambda'$. Intuitively, we might expect that $\Lambda'$ approximately satisfies the other redundancy diagram as well; conceptually, $\Lambda$ (approximately) only contains redundant information about $X$, so $X_1$ contains (approximately) the same information about $\Lambda$ as all of $X$ does, so the resampling operation (sampling from $P[\Lambda|X_1]$ rather than $P[\Lambda|X_1,X_2]$) should result in (approximately, in some sense) the same distribution we started with and therefore (approximately) the same properties.
In this post, we'll prove that claim and give a bound for the approximation error.
Specifically:
Theorem: Resampling (Approximately) Conserves (Approximate) Redundancy
Let random variables $X = (X_1, X_2)$ and $\Lambda$ satisfy the diagrams $X_1 \to X_2 \to \Lambda$ and $X_2 \to X_1 \to \Lambda$ each to within $\epsilon$, i.e.
$$D_{KL}\!\left(P[X_1,X_2,\Lambda] \,\middle\|\, P[X_1,X_2]\,P[\Lambda|X_2]\right) \le \epsilon \quad \text{and} \quad D_{KL}\!\left(P[X_1,X_2,\Lambda] \,\middle\|\, P[X_1,X_2]\,P[\Lambda|X_1]\right) \le \epsilon$$
Also, assume $P[X_1, X_2, \Lambda] > 0$ for all values (strict positivity).
Construct $\Lambda'$ by sampling from $P[\Lambda|X_1]$, so $P[X_1, X_2, \Lambda' = \lambda] = P[X_1, X_2]\,P[\Lambda = \lambda|X_1]$. Then $X_2 \to X_1 \to \Lambda'$ is perfectly satisfied by construction, and $X_1 \to X_2 \to \Lambda'$ is satisfied to within $9\epsilon$, i.e.
$$D_{KL}\!\left(P[X_1,X_2,\Lambda'] \,\middle\|\, P[X_1,X_2]\,P[\Lambda'|X_2]\right) \le 9\epsilon$$
In diagrammatic form:
We will use diagrams as shorthand for approximation conditions on $D_{KL}$. For instance, "$X_1 \to X_2 \to \Lambda$ to within $\epsilon$" is shorthand for $D_{KL}\!\left(P[X_1,X_2,\Lambda] \,\middle\|\, P[X_1,X_2]\,P[\Lambda|X_2]\right) \le \epsilon$, which is equivalent to $\mathbb{E}_{X_1,X_2}\!\left[D_{KL}\!\left(P[\Lambda|X_1,X_2] \,\middle\|\, P[\Lambda|X_2]\right)\right] \le \epsilon$.
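For reference, the equivalence between the joint and conditional forms above is just the chain rule for $D_{KL}$ (a standard identity, written here in our notation):

$$\begin{aligned}
D_{KL}\!\left(P[X_1,X_2,\Lambda] \,\middle\|\, P[X_1,X_2]\,P[\Lambda|X_2]\right)
&= \sum_{x_1,x_2,\lambda} P[x_1,x_2,\lambda]\,\ln\frac{P[x_1,x_2]\,P[\lambda|x_1,x_2]}{P[x_1,x_2]\,P[\lambda|x_2]} \\
&= \mathbb{E}_{X_1,X_2}\!\left[D_{KL}\!\left(P[\Lambda|X_1,X_2] \,\middle\|\, P[\Lambda|X_2]\right)\right]
\end{aligned}$$

We'll lean on this conditional form repeatedly below.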
We will work with nats for mathematical convenience (i.e. all logarithms are natural logs).
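To make the theorem's quantities concrete, here is a minimal numerical sketch (our own illustration, assuming the notation above; not code from the post). It builds a random strictly positive joint distribution over small finite $X_1, X_2, \Lambda$, computes the two diagram errors, resamples $\Lambda'$ from $P[\Lambda|X_1]$, and checks the claimed bound.

```python
# A minimal numerical sketch of the quantities in the theorem (our own
# illustration, assuming the notation above; not code from the post).
import numpy as np

rng = np.random.default_rng(0)

# Random strictly positive joint distribution P[X1, X2, Lambda] on a 3x3x3 space.
P = rng.random((3, 3, 3)) + 0.1
P /= P.sum()

def kl(p, q):
    """KL divergence in nats between two arrays representing joint distributions."""
    return float(np.sum(p * np.log(p / q)))

P_x1x2 = P.sum(axis=2)                                        # P[X1, X2]
P_lam_given_x1 = P.sum(axis=1) / P.sum(axis=(1, 2))[:, None]  # P[Lambda | X1]
P_lam_given_x2 = P.sum(axis=0) / P.sum(axis=(0, 2))[:, None]  # P[Lambda | X2]

# Diagram errors: eps_i = D_KL( P[X1,X2,Lambda] || P[X1,X2] P[Lambda|X_i] ).
eps_1 = kl(P, P_x1x2[:, :, None] * P_lam_given_x1[:, None, :])  # diagram X2 -> X1 -> Lambda
eps_2 = kl(P, P_x1x2[:, :, None] * P_lam_given_x2[None, :, :])  # diagram X1 -> X2 -> Lambda

# Resample: Lambda' ~ P[Lambda|X1], so P[X1, X2, Lambda'] = P[X1, X2] P[Lambda|X1].
P_res = P_x1x2[:, :, None] * P_lam_given_x1[:, None, :]
P_res_lam_given_x2 = P_res.sum(axis=0) / P_res.sum(axis=(0, 2))[:, None]

# Error of the remaining diagram X1 -> X2 -> Lambda' for the resampled latent.
eps_res = kl(P_res, P_x1x2[:, :, None] * P_res_lam_given_x2[None, :, :])

print(eps_1, eps_2, eps_res)
print("9*eps bound holds:", eps_res <= 9 * max(eps_1, eps_2))
```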
The proof proceeds in three steps:
1. Reduce to the regime where all of the $D_{KL}$ errors are arbitrarily small, via a scaling argument.
2. Show that, in that regime, each $D_{KL}$ is (to second order) twice the corresponding squared Hellinger distance, so the problem becomes Euclidean geometry on square-rooted distributions.
3. Use convexity (Jensen) plus the triangle inequality in that Euclidean picture to get the factor of 9.
First we construct a new variable $\bar\Lambda$ as a stochastic function of $\Lambda$. Specifically, $\bar\Lambda = \Lambda$ with probability $\alpha$, else $\bar\Lambda$ is a constant $c$, where $c$ is outside the support of $\Lambda$ (so when we see $\bar\Lambda = c$, we gain no information about $X$).
A little algebra confirms that $\bar\Lambda$'s errors are simply $\Lambda$'s errors scaled down by $\alpha$:
$$D_{KL}\!\left(P[X_1,X_2,\bar\Lambda] \,\middle\|\, P[X_1,X_2]\,P[\bar\Lambda|X_i]\right) = \alpha\, D_{KL}\!\left(P[X_1,X_2,\Lambda] \,\middle\|\, P[X_1,X_2]\,P[\Lambda|X_i]\right), \qquad i \in \{1, 2\}$$
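Spelling out that algebra (our own sketch, using the conditional form of the $D_{KL}$'s): for each value of $X_1, X_2$ we have $P[\bar\Lambda = \lambda | X_1, X_2] = \alpha P[\Lambda = \lambda | X_1, X_2]$ for $\lambda$ in the support of $\Lambda$, and $P[\bar\Lambda = c | X_1, X_2] = 1 - \alpha$, so

$$\begin{aligned}
\mathbb{E}_{X_1,X_2}\!\left[D_{KL}\!\left(P[\bar\Lambda|X_1,X_2] \,\middle\|\, P[\bar\Lambda|X_2]\right)\right]
&= \mathbb{E}_{X_1,X_2}\!\left[\sum_{\lambda} \alpha P[\lambda|X_1,X_2]\,\ln\frac{\alpha P[\lambda|X_1,X_2]}{\alpha P[\lambda|X_2]} + (1-\alpha)\ln\frac{1-\alpha}{1-\alpha}\right] \\
&= \alpha\,\mathbb{E}_{X_1,X_2}\!\left[D_{KL}\!\left(P[\Lambda|X_1,X_2] \,\middle\|\, P[\Lambda|X_2]\right)\right]
\end{aligned}$$

and likewise with $X_1$ in place of $X_2$.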
Similarly, constructing $\bar\Lambda'$ from $\bar\Lambda$ just like $\Lambda'$ was constructed from $\Lambda$ (i.e. sampling $\bar\Lambda'$ from $P[\bar\Lambda|X_1]$) is equivalent to constructing $\bar\Lambda'$ as a stochastic function of $\Lambda'$, where $\bar\Lambda' = \Lambda'$ with probability $\alpha$, else $\bar\Lambda'$ is $c$. So, by the same algebra as above, $\bar\Lambda'$'s error on the remaining diagram is $\Lambda'$'s error scaled down by $\alpha$.
The upshot: if there exists a distribution over variables $X_1, X_2, \Lambda$ for which
$$D_{KL}\!\left(P[X_1,X_2,\Lambda'] \,\middle\|\, P[X_1,X_2]\,P[\Lambda'|X_2]\right) > 9\epsilon$$
then there also exists a distribution satisfying the same inequality with all of the $\epsilon$'s arbitrarily small[1]. Flipping that statement around: if there does not exist any distribution for which the $\epsilon$'s are all arbitrarily small and the inequality is satisfied, then there does not exist any distribution for which the inequality is satisfied.
In other words: if we can show
$$D_{KL}\!\left(P[X_1,X_2,\Lambda'] \,\middle\|\, P[X_1,X_2]\,P[\Lambda'|X_2]\right) \le 9\epsilon$$
in the regime where all the $\epsilon$'s are arbitrarily small, then the same inequality is also established globally, proving our theorem. The rest of the proof will therefore show that inequality in the regime where all the $\epsilon$'s are arbitrarily small. In particular, we'll use a second order approximation for the $D_{KL}$'s.
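Here's a quick numerical sanity check of that scaling argument (our own sketch, not from the post): diluting $\Lambda$ with a dummy outcome at keep-probability $\alpha$ scales all three diagram errors by $\alpha$.

```python
# Numerical sanity check of Step 1 (our own sketch, same conventions as the
# earlier snippet): diluting Lambda with a dummy outcome scales every error by alpha.
import numpy as np

rng = np.random.default_rng(1)
P = rng.random((3, 3, 4)) + 0.1  # joint P[X1, X2, Lambda]
P /= P.sum()

def diagram_errors(P):
    """Return (eps_1, eps_2, eps_resampled) for a joint array P[X1, X2, Lambda]."""
    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))
    P_x1x2 = P.sum(axis=2)
    lam_x1 = P.sum(axis=1) / P.sum(axis=(1, 2))[:, None]   # P[Lambda | X1]
    lam_x2 = P.sum(axis=0) / P.sum(axis=(0, 2))[:, None]   # P[Lambda | X2]
    prod_1 = P_x1x2[:, :, None] * lam_x1[:, None, :]       # P[X1,X2] P[Lambda|X1]
    prod_2 = P_x1x2[:, :, None] * lam_x2[None, :, :]       # P[X1,X2] P[Lambda|X2]
    P_res = prod_1                                          # Lambda' ~ P[Lambda | X1]
    res_lam_x2 = P_res.sum(axis=0) / P_res.sum(axis=(0, 2))[:, None]
    eps_res = kl(P_res, P_x1x2[:, :, None] * res_lam_x2[None, :, :])
    return np.array([kl(P, prod_1), kl(P, prod_2), eps_res])

def dilute(P, alpha):
    """Keep Lambda with probability alpha, else emit a fresh dummy outcome c."""
    dummy = (1 - alpha) * P.sum(axis=2, keepdims=True)
    return np.concatenate([alpha * P, dummy], axis=2)

alpha = 0.01
print(diagram_errors(dilute(P, alpha)) / diagram_errors(P))  # -> roughly [alpha]*3
```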
Before we can use a second order approximation of the $D_{KL}$'s, we need to show that small $\epsilon$ implies that the second order approximation is valid.
For that purpose, we use the Hellinger-KL inequality:
$$D_{KL}(P\,\|\,Q) \ge -2\ln\left(1 - \tfrac{1}{2}H^2(P,Q)\right)$$
where $H^2(P,Q) := \sum_x\left(\sqrt{P[x]} - \sqrt{Q[x]}\right)^2$ is the squared Hellinger distance.[2]
Using standard logarithm inequalities ($-\ln(1-z) \ge z$), we can weaken the Hellinger-KL inequality to
$$D_{KL}(P\,\|\,Q) \ge H^2(P,Q) = \sum_x\left(\sqrt{P[x]} - \sqrt{Q[x]}\right)^2$$
So, as $\epsilon$ goes to 0, the Hellinger distance goes to 0, and therefore $\sqrt{P[\Lambda|X_1,X_2]}$ and $\sqrt{P[\Lambda|X_i]}$ are arbitrarily close together in standard Euclidean distance. Since $D_{KL}$ is smooth (for strictly positive distributions, which we have assumed), we can therefore use a second order approximation (with respect to the difference between those square-rooted distributions) for our arbitrarily small $\epsilon$'s.
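As a quick empirical sanity check of the weakened inequality (our own sketch, over random strictly positive distributions):

```python
# Quick empirical check (our own) of the weakened inequality:
# sum_x (sqrt(P[x]) - sqrt(Q[x]))^2  <=  D_KL(P || Q), with D_KL in nats.
import numpy as np

rng = np.random.default_rng(2)
for _ in range(10_000):
    P = rng.random(5) + 1e-3; P /= P.sum()
    Q = rng.random(5) + 1e-3; Q /= Q.sum()
    kl = float(np.sum(P * np.log(P / Q)))
    h2 = float(np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2))
    assert h2 <= kl + 1e-12
print("no counterexamples found")
```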
Now for the second order expansion itself.
Our small quantity is $\delta := \sqrt{P} - \sqrt{Q}$ (componentwise), where $P$ and $Q$ stand for the two distributions appearing in one of our $D_{KL}$'s. Then
$$D_{KL}(P\,\|\,Q) = \sum_x P[x]\ln\frac{P[x]}{Q[x]} = 2\sum_x \left(\sqrt{Q[x]} + \delta[x]\right)^2\ln\left(1 + \frac{\delta[x]}{\sqrt{Q[x]}}\right) \approx \sum_x\left(2\sqrt{Q[x]}\,\delta[x] + 3\,\delta[x]^2\right)$$
To simplify that further, we can use the sum-to-1 constraints on the distributions: $\sum_x P[x] = \sum_x Q[x] = 1$ implies
$$\sum_x\left(2\sqrt{Q[x]}\,\delta[x] + \delta[x]^2\right) = \sum_x\left(P[x] - Q[x]\right) = 0$$
so $\sum_x\sqrt{Q[x]}\,\delta[x] = -\frac{1}{2}\sum_x\delta[x]^2$. That simplifies our second order approximation to
$$D_{KL}(P\,\|\,Q) \approx 2\sum_x\delta[x]^2 = 2\sum_x\left(\sqrt{P[x]} - \sqrt{Q[x]}\right)^2$$
i.e. $D_{KL}$ in the second order regime is twice the squared Hellinger distance.
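A small numerical illustration of that second order relationship (our own sketch): perturb a distribution $P$ along a direction that preserves the sum-to-1 constraint, and watch the ratio $D_{KL}/H^2$ approach 2.

```python
# Numerical illustration (our own) of the second order claim: as Q -> P,
# D_KL(P || Q) / sum_x (sqrt(P[x]) - sqrt(Q[x]))^2 tends to 2.
import numpy as np

rng = np.random.default_rng(3)
P = rng.random(6) + 0.1
P /= P.sum()

direction = rng.random(6) - 0.5
direction -= direction.mean()              # preserve the sum-to-1 constraint
direction /= 10 * np.abs(direction).max()  # keep Q strictly positive below

for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    Q = P + t * direction
    kl = float(np.sum(P * np.log(P / Q)))
    h2 = float(np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2))
    print(t, kl / h2)                      # ratio approaches 2.0
```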
Combining this with Step 1, we've now established that if we can prove our desired bound for (squared) Hellinger distances rather than $D_{KL}$'s, then the bound also applies globally for the $D_{KL}$ errors. So now, we can set aside the notoriously finicky KL divergences, and work with good ol' Euclidean geometry.
Writing everything out in the second order regime, our preconditions say
$$\mathbb{E}_{X_1,X_2}\!\left[H^2\!\left(P[\Lambda|X_1,X_2],\,P[\Lambda|X_1]\right)\right] \le \frac{\epsilon}{2}, \qquad \mathbb{E}_{X_1,X_2}\!\left[H^2\!\left(P[\Lambda|X_1,X_2],\,P[\Lambda|X_2]\right)\right] \le \frac{\epsilon}{2}$$
and we want to bound
$$\mathbb{E}_{X_1,X_2}\!\left[H^2\!\left(P[\Lambda'|X_1,X_2],\,P[\Lambda'|X_2]\right)\right] = \mathbb{E}_{X_1,X_2}\!\left[H^2\!\left(P[\Lambda|X_1],\,P[\Lambda'|X_2]\right)\right]$$
by $\frac{9\epsilon}{2}$, so that the corresponding $D_{KL}$ is at most $9\epsilon$.
That last expression has a Jensen vibe to it, so let's use Jensen's inequality.
We're going to use Jensen's inequality on the squared Hellinger distance, so we need to establish that the squared Hellinger distance is convex as a function of the pair of distributions $(P, Q)$.
It suffices to check convexity of each summand $f(p,q) = \left(\sqrt{p}-\sqrt{q}\right)^2$. Differentiating twice with respect to $(p, q)$ yields the Hessian
$$\begin{pmatrix} \dfrac{\sqrt{q}}{2p^{3/2}} & -\dfrac{1}{2\sqrt{pq}} \\ -\dfrac{1}{2\sqrt{pq}} & \dfrac{\sqrt{p}}{2q^{3/2}} \end{pmatrix}$$
Note that one column is the other column multiplied by $-p/q$, so one of the eigenvalues is 0. The trace is positive, so the other eigenvalue is positive. Thus, the function is (non-strictly) convex, and so is the squared Hellinger distance, being a sum of such terms.
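For completeness, here is the differentiation behind that Hessian (our own worked step; $p$ and $q$ are the probabilities which the two distributions assign to a single outcome):

$$\begin{aligned}
f(p,q) &= \left(\sqrt{p}-\sqrt{q}\right)^2 = p - 2\sqrt{pq} + q \\
\partial_p f &= 1 - \sqrt{q/p}, \qquad \partial_q f = 1 - \sqrt{p/q} \\
\partial_p^2 f &= \frac{\sqrt{q}}{2p^{3/2}}, \qquad \partial_q^2 f = \frac{\sqrt{p}}{2q^{3/2}}, \qquad \partial_p\partial_q f = -\frac{1}{2\sqrt{pq}}
\end{aligned}$$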
Now, we'll use Jensen's inequality on $H^2\!\left(P[\Lambda'|X_2],\,P[\Lambda|X_2]\right)$.
Specifically:
$$P[\Lambda'|X_2] = \mathbb{E}_{X_1 \sim P[X_1|X_2]}\!\left[P[\Lambda|X_1]\right], \qquad P[\Lambda|X_2] = \mathbb{E}_{X_1 \sim P[X_1|X_2]}\!\left[P[\Lambda|X_1,X_2]\right]$$
So, applying Jensen's, we get
$$H^2\!\left(P[\Lambda'|X_2],\,P[\Lambda|X_2]\right) \le \mathbb{E}_{X_1 \sim P[X_1|X_2]}\!\left[H^2\!\left(P[\Lambda|X_1],\,P[\Lambda|X_1,X_2]\right)\right]$$
With that, we have bounds on three (squared) Hellinger distances:
$$\mathbb{E}_{X_1,X_2}\!\left[H^2\!\left(P[\Lambda|X_1],\,P[\Lambda|X_1,X_2]\right)\right] \le \frac{\epsilon}{2}$$
$$\mathbb{E}_{X_1,X_2}\!\left[H^2\!\left(P[\Lambda|X_1,X_2],\,P[\Lambda|X_2]\right)\right] \le \frac{\epsilon}{2}$$
$$\mathbb{E}_{X_2}\!\left[H^2\!\left(P[\Lambda|X_2],\,P[\Lambda'|X_2]\right)\right] \le \mathbb{E}_{X_1,X_2}\!\left[H^2\!\left(P[\Lambda|X_1],\,P[\Lambda|X_1,X_2]\right)\right] \le \frac{\epsilon}{2}$$
So, on average over $X_1, X_2$, each leg of the path $\sqrt{P[\Lambda|X_1]} \to \sqrt{P[\Lambda|X_1,X_2]} \to \sqrt{P[\Lambda|X_2]} \to \sqrt{P[\Lambda'|X_2]}$ has squared Euclidean length at most $\frac{\epsilon}{2}$.
So, on average, the Euclidean distance from end-to-end, between $\sqrt{P[\Lambda|X_1]}$ and $\sqrt{P[\Lambda'|X_2]}$, is at most $3\sqrt{\epsilon/2}$.
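To spell out the "on average" step (our own elaboration): write $d_1, d_2, d_3$ for the three leg lengths at each value of $(X_1, X_2)$, i.e. the Euclidean distances between the corresponding square-rooted distributions. The end-to-end distance is at most $d_1 + d_2 + d_3$ pointwise, and by Minkowski's inequality in $L^2$ over $X_1, X_2$,

$$\sqrt{\mathbb{E}_{X_1,X_2}\!\left[(d_1+d_2+d_3)^2\right]} \;\le\; \sqrt{\mathbb{E}\!\left[d_1^2\right]} + \sqrt{\mathbb{E}\!\left[d_2^2\right]} + \sqrt{\mathbb{E}\!\left[d_3^2\right]} \;\le\; 3\sqrt{\epsilon/2}$$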
That gives us the desired bound:
$$\mathbb{E}_{X_1,X_2}\!\left[H^2\!\left(P[\Lambda|X_1],\,P[\Lambda'|X_2]\right)\right] \le \left(3\sqrt{\epsilon/2}\right)^2 = \frac{9\epsilon}{2}$$
implying
$$D_{KL}\!\left(P[X_1,X_2,\Lambda'] \,\middle\|\, P[X_1,X_2]\,P[\Lambda'|X_2]\right) \approx 2\,\mathbb{E}_{X_1,X_2}\!\left[H^2\!\left(P[\Lambda|X_1],\,P[\Lambda'|X_2]\right)\right] \le 9\epsilon$$
in the second order regime. Combined with our previous two sections, that establishes the desired upper bound of $9\epsilon$ on the resampled diagram's error.
Below is a plot of the maximal error achieved via numerical optimization (maximizing the resampled diagram's $D_{KL}$) subject to a constraint on the sum of the two original diagrams' $D_{KL}$'s, searching over distributions of $X$ and $\Lambda$. Above, we proved that the ratio of those two quantities can be no higher than 9. As expected from the scaling argument in Step 1, the straight line from the origin to any point on the curve lies everywhere on or below the curve. Some noise is present, presumably due to occasional failures of the optimizer to find the maximum error.
Zooming in on the steepest part of the curve and eyeballing the plot, it looks like the maximum ratio achieved is around 4 (0.02/0.005), implying an empirical upper bound of ~8 on the resampled diagram's error:
Looking into the actual solutions found, the solutions with a ratio of ~4 involve one of the two terms in the x-axis sum being much larger than the other (5-10x). Therefore we expect to be able, in principle, to get a tighter bound (~4, empirically, rather than the proven 9 or empirical 8). The most likely place for improvement in the proof is to bound the Hellinger distance between $P[\Lambda|X_1]$ and $P[\Lambda|X_2]$ directly in terms of the original diagrams' $\epsilon$'s, cutting one step out of the "path"; that would indeed reduce the bound from 9 to 4. We'll leave that for future work.
Interesting additional piece for future reference: if we also include the mediation condition's $\epsilon$ in the denominator, and so look for a bound in terms of a factor times the sum of all of the natural latent conditions' epsilons, we find that the empirical factor in question is roughly 1 (not sure what happened at ~0.4):
[1] Note that, since $P[X_1,X_2,\Lambda] > 0$ by assumption, none of the $D_{KL}$'s are infinite. This is the only place where we need strict positivity; that assumption can probably be eliminated by considering the infinite case directly, but we're not going to do that here.
[2] A quick aside: while it might look messy at first, the Hellinger distance is a particularly natural way to talk about Euclidean distances between probability distributions. In general, if one wants to view a distribution $P$ as a vector, $\sqrt{P}$ (componentwise) is the most natural vector to consider, since the sum-to-1 constraint says exactly that $\sqrt{P}$ is a unit vector under the standard Euclidean distance.