Resampling Conserves Redundancy (Approximately)

by johnswentworth, David Lorell
21st Aug 2025

Suppose random variables X1 and X2 contain approximately the same information about a third random variable Λ, i.e. both of the following diagrams are satisfied to within approximation ϵ:

"Red" for redundancy

We call Λ a "redund" over X1,X2, since conceptually, any information Λ contains about X must be redundantly represented in both X1 and X2 (to within approximation).

Here's an intuitive claim which is surprisingly tricky to prove: suppose we construct a new variable Λ′ by sampling from P[Λ|X2], so the new joint distribution is 

P[X1=x1,X2=x2,Λ′=λ′]=P[X1=x1,X2=x2]P[Λ=λ′|X2=x2]

By construction, this "resampled" variable satisfies one of the two redundancy diagrams perfectly: X1→X2→Λ′. Intuitively, we might expect that Λ′ approximately satisfies the other redundancy diagram as well: conceptually, Λ (approximately) only contains redundant information about X, so X2 contains (approximately) the same information about Λ as X does, so the resampling operation should result in (approximately, in some sense) the same distribution we started with, and therefore (approximately) the same properties.
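To make the construction concrete, here's a minimal numpy sketch of the resampling operation, assuming the joint distribution is stored as a 3-D array indexed by (x1, x2, λ) (the array representation is just for illustration):

```python
import numpy as np

def resample_latent(joint):
    """Given joint[x1, x2, lam] = P[X1=x1, X2=x2, Lambda=lam] (strictly positive),
    return the resampled joint P[X1, X2, Lambda'] = P[X1, X2] * P[Lambda=lam' | X2]."""
    p_x1x2 = joint.sum(axis=2)                                            # P[X1, X2]
    p_lam_given_x2 = joint.sum(axis=0) / joint.sum(axis=(0, 2))[:, None]  # P[Lambda | X2]
    return p_x1x2[:, :, None] * p_lam_given_x2[None, :, :]                # P[X1, X2] P[Lambda' | X2]
```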

In this post, we'll prove that claim and give a bound for the approximation error.

Specifically:

Theorem: Resampling (Approximately) Conserves (Approximate) Redundancy

Let random variables X1,X2, Λ satisfy the diagrams X1→X2→Λ and X2→X1→Λ to within ϵ, i.e.

ϵ≥DKL(P[X1,X2,Λ]||P[X1,X2]P[Λ|X2])

ϵ≥DKL(P[X1,X2,Λ]||P[X1,X2]P[Λ|X1])

Also, assume P[X,Λ]>0.

Construct Λ′ by sampling from P[Λ|X2], so P[X1,X2,Λ′=λ′]=P[X1,X2]P[Λ=λ′|X2]. Then X1→X2→Λ′ is perfectly satisfied by construction, and X2→X1→Λ′ is satisfied to within 9ϵ, i.e.

9ϵ≥DKL(P[X1,X2,Λ′]||P[X1,X2]P[Λ′|X1])


In diagrammatic form:

(Where O(ϵ) in this version of the proof is 9ϵ, specifically.)

Notation

We will use the shorthand DKL(<diagram over X>) to mean DKL(P[X]||∏iP[Xi|Xpa(i)]). For instance, DKL(X1→X2→Λ)  is shorthand for DKL(P[X1,X2,Λ]||P[X1]P[X2|X1]P[Λ|X2]), which is equivalent to DKL(P[X,Λ]||P[X]P[Λ|X2]).

We will work with nats for mathematical convenience (i.e. all logarithms are natural logs).
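As a concrete companion to this shorthand, here's a small numpy sketch (under the same 3-D array representation as above, chosen for illustration) which computes the diagram DKL's in nats and checks the theorem's 9ϵ bound on a random strictly positive joint distribution:

```python
import numpy as np

def dkl_diagram(joint, mediator_axis):
    """DKL(P[X1,X2,Lambda] || P[X1,X2] P[Lambda | X_mediator]) in nats.
    joint[x1, x2, lam] = P[X1, X2, Lambda]; mediator_axis is 0 for X1, 1 for X2."""
    p_x = joint.sum(axis=2)                                          # P[X1, X2]
    p_med = joint.sum(axis=(1 - mediator_axis, 2))                   # P[X_mediator]
    p_lam_med = joint.sum(axis=1 - mediator_axis) / p_med[:, None]   # P[Lambda | X_mediator]
    q = p_x[:, :, None] * (p_lam_med[:, None, :] if mediator_axis == 0
                           else p_lam_med[None, :, :])
    return float(np.sum(joint * np.log(joint / q)))

# Check the theorem on a random, strictly positive joint distribution.
rng = np.random.default_rng(0)
joint = rng.random((3, 3, 3)) + 0.1
joint /= joint.sum()

eps = max(dkl_diagram(joint, 1), dkl_diagram(joint, 0))          # X1->X2->Lam, X2->X1->Lam
p_lam_x2 = joint.sum(axis=0) / joint.sum(axis=(0, 2))[:, None]   # P[Lambda | X2]
resampled = joint.sum(axis=2)[:, :, None] * p_lam_x2[None, :, :] # P[X1,X2] P[Lam' | X2]
print(dkl_diagram(resampled, 0), "<=", 9 * eps)                  # DKL(X2->X1->Lam') vs 9*eps
```

Per the theorem, the printed left-hand side should always come out below 9ϵ.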

Proof

The proof proceeds in three steps:

  • From Λ, we can construct a new variable Γ which is equal to Λ with probability p and is otherwise a constant. Then the errors on all diagrams in the theorem for Γ are the same as the corresponding errors for Λ, but scaled down by factor p. This allows us to focus only on the regime of arbitrarily small errors, and obtain a global bound from that regime.
  • Within the regime of arbitrarily small errors, DKL(P||Q) is approximately (to second order) equal to twice the squared Hellinger distance, 2∑X(√P[X]−√Q[X])2. The squared Hellinger distance is just the squared Euclidean distance between the vectors √P and √Q, so we can use standard geometric reasoning on it.
  • The Hellinger distances between P[Λ|X2] and P[Λ|X], between P[Λ|X] and P[Λ|X1], and between P[Λ|X1] and P[Λ′|X1] are all small, so end-to-end the Hellinger distance between P[Λ|X2] and P[Λ′|X1] is small. That yields the bound we're after.

Step 1: Scaling Down The Errors

First we construct the new variable Γ as a stochastic function of Λ. Specifically, Γ=Λ with probability p, else Γ is a constant C, where C is outside the support of Λ (so when we see Γ=C, we gain no information about Λ).

A little algebra confirms that Γ's errors are simply Λ's errors scaled down by p:

DKL(P[X,Γ]||P[X]P[Γ|Xi])

=EX[P[Γ=C|X](lnP[Γ=C|X]−lnP[Γ=C|Xi])+∑λP[Γ=λ|X](lnP[Γ=λ|X]−lnP[Γ=λ|Xi])]

=EX[(1−p)(ln(1−p)−ln(1−p))+∑λpP[Λ=λ|X](ln(pP[Λ=λ|X])−ln(pP[Λ=λ|Xi]))]

=pEX[∑λP[Λ=λ|X](ln(P[Λ=λ|X])−ln(P[Λ=λ|Xi]))]

=pDKL(P[X,Λ]||P[X]P[Λ|Xi])

Similarly, constructing Γ′ just like Λ′ (i.e. P[X,Γ′]=P[X]P[Γ=γ′|X2]) is equivalent to constructing Γ′ as a stochastic function of Λ′ where Γ′=Λ′ with probability p, else Γ′ is C. So, by the same algebra as above,

DKL(P[X,Γ′]||P[X]P[Γ′|Xi])=pDKL(P[X,Λ′]||P[X]P[Λ′|Xi])
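Here's a quick numerical sanity check of this scaling, a sketch assuming the same array conventions as before: mix a constant outcome C into Λ with weight 1−p and compare diagram errors.

```python
import numpy as np

def dkl_diagram(joint, mediator_axis):
    """Same helper as in the earlier sketch: DKL(P[X,Lam] || P[X] P[Lam | X_mediator]) in nats."""
    p_x = joint.sum(axis=2)
    p_med = joint.sum(axis=(1 - mediator_axis, 2))
    p_lam_med = joint.sum(axis=1 - mediator_axis) / p_med[:, None]
    q = p_x[:, :, None] * (p_lam_med[:, None, :] if mediator_axis == 0
                           else p_lam_med[None, :, :])
    return float(np.sum(joint * np.log(joint / q)))

rng = np.random.default_rng(1)
joint = rng.random((3, 4, 2)) + 0.1
joint /= joint.sum()

p = 0.3
# Gamma = Lambda with probability p, else the constant C (an extra outcome on Lambda's axis).
joint_gamma = np.concatenate([p * joint, (1 - p) * joint.sum(axis=2)[:, :, None]], axis=2)

for axis in (0, 1):   # mediator X1, then mediator X2
    print(dkl_diagram(joint_gamma, axis), "=?", p * dkl_diagram(joint, axis))
```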

The upshot: if there exists a distribution over variables X1,X2,Λ for which

DKL(X2→X1→Λ′)>n⋅max(DKL(X1→X2→Λ),DKL(X2→X1→Λ))

then there also exists a distribution satisfying the same inequality with all DKL's arbitrarily small[1]. Flipping that statement around: if there does not exist any distribution for which the DKL's are all arbitrarily small and the inequality is satisfied, then there does not exist any distribution for which the inequality is satisfied.

In other words: if we can show

DKL(X2→X1→Λ′)≤n⋅max(DKL(X1→X2→Λ),DKL(X2→X1→Λ))≤nϵ

in the regime where all the DKL's are arbitrarily small, then the same inequality is also established globally, proving our theorem. The rest of the proof will therefore show

DKL(X2→X1→Λ′)≤n⋅max(DKL(X1→X2→Λ),DKL(X2→X1→Λ))

in the regime where all the DKL's are arbitrarily small. In particular, we'll use a second order approximation for the DKL's.

Step 2: Second Order Approximation

Validity

Before we can use a second order approximation of the DKL's, we need to show that small DKL implies that the second order approximation is valid.

For that purpose, we use the Hellinger-KL inequality:

DKL(P||Q)≥2ln(2/(2−H2(P,Q)))

where H2(P,Q):=∑X(√P[X]−√Q[X])2 is the squared Hellinger distance.[2]

Using the standard logarithm inequality ln(1−x)≤−x (applied to 2ln(2/(2−H2))=−2ln(1−H2/2)), we can weaken the Hellinger-KL inequality to

DKL(P||Q)≥H2(P,Q)

So, as DKL(P||Q) goes to 0, the Hellinger distance goes to 0, and therefore √P and √Q are arbitrarily close together in standard Euclidean distance. Since DKL is smooth (for strictly positive distributions, which we have assumed), we can therefore use a second order approximation (with respect to √P−√Q) for our arbitrarily small DKL's.
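Both the Hellinger-KL inequality and its weakened form are easy to spot-check numerically; here's a small sketch doing so on random distribution pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.random(5) + 1e-3
    p /= p.sum()
    q = rng.random(5) + 1e-3
    q /= q.sum()
    dkl = np.sum(p * np.log(p / q))                  # in nats
    h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)      # squared Hellinger distance
    assert dkl >= 2 * np.log(2 / (2 - h2)) - 1e-12   # Hellinger-KL inequality
    assert dkl >= h2 - 1e-12                         # weakened form
```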

Expansion

Now for the second order expansion itself.

Our small quantity is δ[X]:=√P[X]−√Q[X]. Then

DKL(P||Q)=∑XP[X](lnP[X]−2ln√Q[X])

=∑XP[X](lnP[X]−2ln(√P[X]−δ[X]))

=∑XP[X](lnP[X]−2ln(√P[X])+2δ[X]/√P[X]+(δ[X]/√P[X])2)+o(δ3)

=∑X(2√P[X]δ[X]+δ[X]2)+o(δ3)

To simplify that further, we can use the sum-to-1 constraints on the distributions: √Q[X]=√P[X]−δ[X] implies

∑XQ[X]=∑X(√P[X]−δ[X])2=∑X(P[X]−2√P[X]δ[X]+δ[X]2)

so ∑X2√P[X]δ[X]=∑Xδ[X]2. That simplifies our second order approximation to

DKL(P||Q)=2∑Xδ[X]2+o(δ3)

=2∑X(√P[X]−√Q[X])2+o(δ3)

i.e. in the second order regime DKL is twice the squared Hellinger distance.
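To see that numerically, here's a sketch which perturbs a distribution slightly and compares DKL to twice the squared Hellinger distance; the ratio should approach 1 as the perturbation shrinks:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(6) + 1.0
p /= p.sum()                            # bounded well away from zero
direction = rng.standard_normal(6)
direction -= direction.mean()           # keep the perturbed distribution summing to 1
direction /= np.abs(direction).max()

for scale in [1e-2, 1e-3, 1e-4]:
    q = p + scale * direction           # still strictly positive at these scales
    dkl = np.sum(p * np.log(p / q))
    h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
    print(scale, dkl / (2 * h2))        # ratio -> 1 as scale -> 0
```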

Combining this with Step 1, we've now established that if we can prove our desired bound for Hellinger distances rather than DKL, then the bound also applies globally for DKL errors. So now, we can set aside the notoriously finicky KL divergences, and work with good ol' Euclidean geometry.

Step 3: Good Ol' Euclidean Geometry

Writing everything out in the second order regime, our preconditions say

ϵ≥2EX[∑Λ(√P[Λ|X]−√P[Λ|X2])2]

ϵ≥2EX[∑Λ(√P[Λ|X]−√P[Λ|X1])2]

and we want to bound

2EX[∑Λ′(√P[Λ′|X]−√P[Λ′|X1])2]

=2EX[∑Λ(√P[Λ|X2]−√∑X′2P[Λ|X′2]P[X′2|X1])2]

That last expression has a Jensen vibe to it, so let's use Jensen's inequality.

Jensen

We're going to use Jensen's inequality on the squared Hellinger distance, so we need to establish that squared Hellinger distance is convex as a function of the distributions P,Q.

Differentiating (√p−√q)2 twice with respect to (p,q) yields the Hessian

(1/(2√(pq))) ⋅ [ [q/p, −1], [−1, p/q] ]

Note that the second column is the first column multiplied by −p/q, so one of the eigenvalues is 0. The trace is positive, so the other eigenvalue is positive. Thus, the function is (non-strictly) convex.
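For the suspicious, here's a quick numerical spot-check of the Hessian's eigenvalues (a sketch; the tolerance is somewhat arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    p, q = rng.random(2) + 0.1
    hess = (1 / (2 * np.sqrt(p * q))) * np.array([[q / p, -1.0],
                                                  [-1.0, p / q]])
    eigs = np.linalg.eigvalsh(hess)     # ascending order
    assert abs(eigs[0]) < 1e-9          # one eigenvalue is (numerically) zero
    assert eigs[1] > 0                  # the other is positive
```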

Now, we'll use Jensen's inequality on

ϵ≥2EX[∑Λ(√P[Λ|X]−√P[Λ|X2])2]

Specifically:

  • We'll hold X1 constant, so it's not involved in our application of Jensen.
  • For each X1 value, the expression takes a convex combination with weights P[X2|X1] of the function ∑Λ(√P[Λ|X]−√P[Λ|X2])2.
  • Viewing (P[Λ|X],P[Λ|X2]) at fixed X1 as the function's inputs, the function is convex.

So, applying Jensen's, we get

ϵ≥2EX[∑Λ(√P[Λ|X]−√P[Λ|X2])2]

≥2EX1[∑Λ(√∑X2P[Λ|X]P[X2|X1]−√∑X2P[Λ|X2]P[X2|X1])2]

=2EX1[∑Λ(√P[Λ|X1]−√∑X2P[Λ|X2]P[X2|X1])2]

=2EX[∑Λ(√P[Λ|X1]−√∑X2P[Λ|X2]P[X2|X1])2]
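Numerically, this Jensen step says that averaging the inner arguments over X2 at fixed X1 can only shrink the expected squared Hellinger distance. Here's a sketch checking that on a random joint, under the same array conventions as the earlier sketches:

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 3, 5)) + 0.1    # P[X1, X2, Lambda], strictly positive
joint /= joint.sum()

p_x = joint.sum(axis=2)                                              # P[X1, X2]
p_x1 = p_x.sum(axis=1)                                               # P[X1]
p_lam_x = joint / p_x[:, :, None]                                    # P[Lambda | X1, X2]
p_lam_x2 = joint.sum(axis=0) / joint.sum(axis=(0, 2))[:, None]       # P[Lambda | X2]
p_lam_x1 = joint.sum(axis=1) / p_x1[:, None]                         # P[Lambda | X1]
p_x2_given_x1 = p_x / p_x1[:, None]                                  # P[X2 | X1]

lhs = 2 * np.sum(p_x[:, :, None] * (np.sqrt(p_lam_x) - np.sqrt(p_lam_x2[None, :, :])) ** 2)
mixed = np.einsum('kl,jk->jl', p_lam_x2, p_x2_given_x1)              # sum_x2 P[Lam|x2] P[x2|X1]
rhs = 2 * np.sum(p_x1[:, None] * (np.sqrt(p_lam_x1) - np.sqrt(mixed)) ** 2)
print(lhs, ">=", rhs)    # Jensen: averaging over X2 at fixed X1 can only shrink the quantity
```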

Euclidean Distances

With that, we have bounds on three (squared) Hellinger distances:

ϵ≥2EX[∑Λ(√P[Λ|X]−√P[Λ|X2])2]

ϵ≥2EX[∑Λ(√P[Λ|X]−√P[Λ|X1])2]

ϵ≥2EX[∑Λ(√P[Λ|X1]−√∑X2P[Λ|X2]P[X2|X1])2]

So, in root-mean-square over X:

  • The Euclidean distance between √P[Λ|X2] and √P[Λ|X] is at most √(ϵ/2).
  • The Euclidean distance between √P[Λ|X] and √P[Λ|X1] is at most √(ϵ/2).
  • The Euclidean distance between √P[Λ|X1] and √∑X2P[Λ|X2]P[X2|X1] is at most √(ϵ/2).

So, by the triangle inequality (applied in the L2 norm over X and Λ jointly), the end-to-end Euclidean distance between √P[Λ|X2] and √∑X2P[Λ|X2]P[X2|X1] is at most 3√(ϵ/2).

That gives us the desired bound:

(3√(ϵ/2))2≥EX[∑Λ(√P[Λ|X2]−√∑X′2P[Λ|X′2]P[X′2|X1])2]

implying

9ϵ≥2EX[∑Λ(√P[Λ|X2]−√∑X′2P[Λ|X′2]P[X′2|X1])2]

Combined with our previous two sections, that establishes the desired upper bound of 9ϵ on DKL(X2→X1→Λ′).
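Here's one more numerical sanity check, a sketch (same array conventions as before) computing the three root-mean-square Hellinger distances and confirming the end-to-end triangle inequality directly:

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((4, 3, 5)) + 0.1    # P[X1, X2, Lambda]
joint /= joint.sum()

p_x = joint.sum(axis=2)                                              # P[X1, X2]
p_lam_x = joint / p_x[:, :, None]                                    # P[Lambda | X1, X2]
p_lam_x1 = joint.sum(axis=1) / p_x.sum(axis=1)[:, None]              # P[Lambda | X1]
p_lam_x2 = joint.sum(axis=0) / p_x.sum(axis=0)[:, None]              # P[Lambda | X2]
p_lamp_x1 = np.einsum('kl,jk->jl', p_lam_x2,
                      p_x / p_x.sum(axis=1)[:, None])                # P[Lambda' | X1]

def rms_dist(a, b):
    """Root-mean-square (over P[X1,X2]) Euclidean distance between sqrt-distributions over Lambda."""
    return np.sqrt(np.sum(p_x[:, :, None] * (np.sqrt(a) - np.sqrt(b)) ** 2))

d1 = rms_dist(p_lam_x, p_lam_x2[None, :, :])                  # P[Lam|X]  vs P[Lam|X2]
d2 = rms_dist(p_lam_x, p_lam_x1[:, None, :])                  # P[Lam|X]  vs P[Lam|X1]
d3 = rms_dist(p_lam_x1[:, None, :], p_lamp_x1[:, None, :])    # P[Lam|X1] vs P[Lam'|X1]
d_total = rms_dist(p_lam_x2[None, :, :], p_lamp_x1[:, None, :])
print(d_total, "<=", d1 + d2 + d3)                            # triangle inequality in L2
```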

Empirical Results and Room for Improvement

Below is a plot of the maximal error achieved via numerical minimization of −DKL(X2→X1→Λ′) subject to a constraint on DKL(X1→X2→Λ)+DKL(X2→X1→Λ), searching over distributions of X and Λ. Above, we proved that the ratio of those two quantities can be no higher than 9/2. As expected from the proof's scaling argument, it is visually clear that the straight line from the origin to each point on the curve always lies on or below the curve. Some noise is presumably present due to occasional failures of the optimizer to find the maximum error.

Zooming in on the steepest part of the curve and eyeballing the plot, it looks like the maximum ratio achieved is around 4 (0.02/0.005), implying an empirical upper bound on the resampled diagram's error of ~8ϵ:

Looking into the actual solutions found, the solutions with a ratio of ~4 involve one of the two terms in the x-axis sum being much larger than the other (5-10x). Therefore we expect to be able, in principle, to get a tighter bound (~4ϵ empirically, rather than the proven 9ϵ or the empirical 8ϵ). The most likely place for improvement in the proof is to bound the Hellinger distance between P[Λ|X1] and P[Λ|X2] directly by ϵ, cutting one step out of the "path"; that would indeed reduce the bound from 9ϵ to 4ϵ. We'll leave that for future work.
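For readers who want to reproduce this kind of search, here's a rough sketch of one way it might be set up with scipy; this is our illustration of the general approach, not the exact optimizer setup used for the plots above. It parameterizes the joint distribution by logits, constrains the sum DKL(X1→X2→Λ)+DKL(X2→X1→Λ) to a target value, and maximizes DKL(X2→X1→Λ′) over random restarts:

```python
import numpy as np
from scipy.optimize import minimize

SHAPE = (3, 3, 3)   # |X1|, |X2|, |Lambda|
TARGET = 0.01       # constraint on DKL(X1->X2->Lam) + DKL(X2->X1->Lam)

def to_joint(logits):
    """Map unconstrained parameters to a strictly positive joint distribution."""
    p = np.exp(np.clip(logits.reshape(SHAPE), -30, 30))
    return p / p.sum()

def dkl_diagram(joint, mediator_axis):
    """DKL(P[X,Lam] || P[X] P[Lam | X_mediator]) in nats; mediator_axis 0 for X1, 1 for X2."""
    p_x = joint.sum(axis=2)
    p_med = joint.sum(axis=(1 - mediator_axis, 2))
    p_lam_med = joint.sum(axis=1 - mediator_axis) / p_med[:, None]
    q = p_x[:, :, None] * (p_lam_med[:, None, :] if mediator_axis == 0
                           else p_lam_med[None, :, :])
    return float(np.sum(joint * np.log(joint / q)))

def resample(joint):
    """P[X1, X2, Lam'] = P[X1, X2] P[Lam | X2]."""
    p_lam_x2 = joint.sum(axis=0) / joint.sum(axis=(0, 2))[:, None]
    return joint.sum(axis=2)[:, :, None] * p_lam_x2[None, :, :]

def neg_objective(logits):
    return -dkl_diagram(resample(to_joint(logits)), 0)        # maximize DKL(X2->X1->Lam')

def constraint(logits):
    joint = to_joint(logits)
    return dkl_diagram(joint, 1) + dkl_diagram(joint, 0) - TARGET

rng = np.random.default_rng(0)
best = np.inf
for _ in range(20):                                           # random restarts
    res = minimize(neg_objective, rng.standard_normal(np.prod(SHAPE)), method="SLSQP",
                   constraints=[{"type": "eq", "fun": constraint}])
    if np.isfinite(res.fun):
        best = min(best, res.fun)

print("max DKL(X2->X1->Lam'):", -best, "  ratio to the constrained sum:", -best / TARGET)
```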

Interesting additional piece for future reference: if we include the mediation condition in the denominator, and so look for a bound in terms of a factor times the sum of all natural latent condition epsilons, we find that the empirical factor in question is roughly 1 (we're not sure what happened at ~0.4):

  1. ^

    Note that, since P[X,Λ]>0 by assumption, none of the DKL's are infinite. This is the only place where we need P[X,Λ]>0; that assumption can probably be eliminated by considering the infinite case directly, but we're not going to do that here.

  2. ^

    A quick aside: while it might look messy at first, the Hellinger distance is a particularly natural way to talk about Euclidean distances between probability distributions. In general, if one wants to view a distribution P[X] as a vector, √P[X] is the most natural vector to consider, since the sum-to-1 constraint says √P[X] is a unit vector under the standard Euclidean norm.
