Suppose I roll a biased die 1000 times, and then roll the same biased die another 1000 times. Then...
Mediation: The first 1000 rolls are approximately independent of the second 1000 given the bias (to reasonable precision).
Redundancy: I can estimate the die’s bias (to reasonable precision) with high confidence from either the first or second 1000 rolls.
The die’s bias is therefore a natural latent, which means it has various nice properties.
Minimality: The bias is the smallest summary of all the information about the first 1000 rolls relevant to the second 1000 (and vice-versa).
Maximality: The bias is the largest piece of information which can be calculated from the first 1000 rolls and also can separately be calculated from the second 1000 rolls.
Any other variable which satisfies the above properties must tell us (approximately) the same information about the die rolls as the bias.
Furthermore, the bias is a(n approximate) deterministic natural latent: the die’s bias (to reasonable precision) is approximately determined by[1] the first 1000 die rolls, and also approximately determined by the second 1000 die rolls. That implies one more nice property:
Uniqueness: The bias is the unique-up-to(-approximate)-isomorphism latent which has the above properties, making it a natural Schelling point for communication between agents.
We’ve proven all that before, mostly in Natural Latents: The Math (including the addendum added six months after the rest of the post). But it turns out that the math is a lot shorter and simpler, and easily yields better bounds, if we’re willing to assume (approximate) determinism up-front. That does lose us some theoretical tools (notably the resampling construction), but it gives a cleaner foundation for our expected typical use cases (like e.g. semantics). The goal of this post is to walk through that math.
Background Tool: Determinism in Diagrams
We’re going to use diagrammatic proofs, specifically using Bayes nets. But it’s non-obvious how to express (approximate) determinism using Bayes nets, or what rules diagrams follow when determinism is involved, so we’ll walk through that first.
This diagram says that Y is (approximately) determined by X:
Intuitively, the literal interpretation of the diagram is: X mediates between Y and Y, i.e. Y itself tells me nothing more about Y once I already know X. That only makes sense if X tells me everything there is to know about Y, i.e. Y is determined by X.
In the approximate case, we express the approximation error of the diagram as a KL-divergence, same as usual:
If you get confused later about what it means to have two copies of the same variable in a diagram, go back to that line; that’s the definition of the approximation error of the diagram. (One way to view that definition: there’s actually two variables Y and Y′, but P says that Y and Y′ always have the same value.)
So the diagram says Y is determined by X, and the approximation error of the diagram is the entropy H of Y given X - i.e. the number of bits required on average to specify Y once one already knows X. Very intuitive!
The Dangly Bit Lemma
Intuitively, if Y is determined by X, then X always mediates between Y and anything else. So, when working with diagrams, we can always add a copy of Y with only X as a parent.
Here’s the lemma:
If Y←X→Y holds to within ϵ bits, and any other diagram D involving X holds to within ϵ′ bits, then we can create a new diagram D′ which is identical to D but has another copy of Y (the “dangly bit”) as a child of X. The new diagram D′ will hold to within ϵ+ϵ′ bits.
Proof (click to expand)
Let Q[X,Z] be the distribution over X and any other variables Z specified by the diagram D (note that Z may include some copies of Y). Then D′ specifies the distribution Q[X,Z]P[Y|X], so the approximation error for D′ is
DKL(P[X,Y,Z]||Q[X,Z]P[Y|X])
=DKL(P[X,Z]||Q[X,Z])+EX,Z[DKL(P[Y|X,Z]||P[Y|X])]
=DKL(P[X,Z]||Q[X,Z])+I(Y;Z|X)
≤DKL(P[X,Z]||Q[X,Z])+H(Y|X)
≤ϵ′+ϵ
A simple example: for any X, Z the diagram X→Z is always satisfied exactly. So, if
is satisfied, then
is satisfied for anyZ. If we write out the approximation errors for these diagrams, this example is equivalent to H(Y|X)≥I(Y;Z|X).
Deterministic Natural Latents
Fundamental Theorem
We’ll start with the core idea of natural latents: if one variable Λ mediates between two other (sets of) variables XS,X¯S, and another variable Λ′ is redundantly represented in both XS and X¯S, then Λ′ is also represented in Λ. Intuitively: imagine that Λ is a pipe between XS and X¯S, and the only way for information to move between XS and X¯S is through that pipe. Then if some information Λ′ is present in both XS and X¯S, it must have gone through the pipe.
Diagrammatically, the theorem says:
The proof is just two applications of The Dangly Bit Lemma, followed by marginalizing out XS and X¯S:
Minimality, Maximality, and Isomorphism of Deterministic Natural Latents
Now suppose that Λ (approximately) satisfies both the mediation and redundancy conditions. Then:
Λ (approximately) satisfies the mediation condition, and
For any other Λ′ which (approximately) satisfies the mediation condition, Λ←Λ′→Λ (approximately) by the Fundamental Theorem.
So, Λ is (approximately) the smallest variable which (approximately) satisfies the mediation condition, in the sense that Λ is (approximately) determined by any other variable which (approximately) satisfies the mediation condition. This is the “minimality” property of deterministic natural latents.
Similarly:
Λ (approximately) satisfies the redundancy condition, and
For any other Λ′ which (approximately) satisfies the redundancy condition, Λ′←Λ→Λ′ (approximately) by the Fundamental Theorem.
So, Λ′ is (approximately) the largest variable which (approximately) satisfies the redundancy condition, in the sense that any other variable which (approximately) satisfies the redundancy condition is (approximately) determined by Λ. This is the “maximality” property of deterministic natural latents.
Finally, suppose that Λ and Λ′both satisfy both the mediation and redundancy conditions. Then by the Fundamental Theorem, each is (approximately) determined by other; they’re approximately isomorphic. So, the (approximate) deterministic natural latent is (approximately) unique up to isomorphism.
And just like that, we’ve proven the main properties of deterministic natural latents which we typically want to use: (approximate) minimality, (approximate) maximality, and (approximate) uniqueness.
When we say "Y is (approximately) determined by X" in this post, we mean "conditional on X, the entropy of Y is (approximately) zero". This may or may not imply any other notion of "Y is (approximately) determined by X", like e.g. "there exists a deterministic function of X which is equal to Y with high probability".
Background: Natural Latents: The Math, Natural Latents: The Concepts, Why Care About Natural Latents?, the prototypical semantics use-case. This post does not assume that you’ve read all of those, or even any of them.
Suppose I roll a biased die 1000 times, and then roll the same biased die another 1000 times. Then...
The die’s bias is therefore a natural latent, which means it has various nice properties.
Furthermore, the bias is a(n approximate) deterministic natural latent: the die’s bias (to reasonable precision) is approximately determined by[1] the first 1000 die rolls, and also approximately determined by the second 1000 die rolls. That implies one more nice property:
We’ve proven all that before, mostly in Natural Latents: The Math (including the addendum added six months after the rest of the post). But it turns out that the math is a lot shorter and simpler, and easily yields better bounds, if we’re willing to assume (approximate) determinism up-front. That does lose us some theoretical tools (notably the resampling construction), but it gives a cleaner foundation for our expected typical use cases (like e.g. semantics). The goal of this post is to walk through that math.
Background Tool: Determinism in Diagrams
We’re going to use diagrammatic proofs, specifically using Bayes nets. But it’s non-obvious how to express (approximate) determinism using Bayes nets, or what rules diagrams follow when determinism is involved, so we’ll walk through that first.
This diagram says that Y is (approximately) determined by X:
Intuitively, the literal interpretation of the diagram is: X mediates between Y and Y, i.e. Y itself tells me nothing more about Y once I already know X. That only makes sense if X tells me everything there is to know about Y, i.e. Y is determined by X.
In the approximate case, we express the approximation error of the diagram as a KL-divergence, same as usual:
ϵ≥DKL(P[X=x,Y=y,Y=y′]||P[X=x]P[Y=y|X=x]P[Y=y′|X=x])
If you get confused later about what it means to have two copies of the same variable in a diagram, go back to that line; that’s the definition of the approximation error of the diagram. (One way to view that definition: there’s actually two variables Y and Y′, but P says that Y and Y′ always have the same value.)
That approximation error simplifies:
DKL(P[X=x,Y=y,Y=y′]||P[X=x]P[Y=y|X=x]P[Y=y′|X=x])
=DKL(P[X=x,Y=y]I[y=y′]||P[X=x]P[Y=y|X=x]P[Y=y′|X=x])
=∑x,y,y′P[X=x,Y=y]I[y=y′](log(P[X=x,Y=y]I[y=y′])−log(P[X=x]P[Y=y|X=x]P[Y=y′|X=x]))
=∑x,yP[X=x,Y=y](log(P[X=x,Y=y])−log(P[X=x]P[Y=y|X=x]P[Y=y|X=x]))
=−∑x,yP[X=x,Y=y]log(P[Y=y|X=x])
=H(Y|X)
So the diagram says Y is determined by X, and the approximation error of the diagram is the entropy H of Y given X - i.e. the number of bits required on average to specify Y once one already knows X. Very intuitive!
The Dangly Bit Lemma
Intuitively, if Y is determined by X, then X always mediates between Y and anything else. So, when working with diagrams, we can always add a copy of Y with only X as a parent.
Here’s the lemma:
If Y←X→Y holds to within ϵ bits, and any other diagram D involving X holds to within ϵ′ bits, then we can create a new diagram D′ which is identical to D but has another copy of Y (the “dangly bit”) as a child of X. The new diagram D′ will hold to within ϵ+ϵ′ bits.
Proof (click to expand)
Let Q[X,Z] be the distribution over X and any other variables Z specified by the diagram D (note that Z may include some copies of Y). Then D′ specifies the distribution Q[X,Z]P[Y|X], so the approximation error for D′ is
DKL(P[X,Y,Z]||Q[X,Z]P[Y|X])
=DKL(P[X,Z]||Q[X,Z])+EX,Z[DKL(P[Y|X,Z]||P[Y|X])]
=DKL(P[X,Z]||Q[X,Z])+I(Y;Z|X)
≤DKL(P[X,Z]||Q[X,Z])+H(Y|X)
≤ϵ′+ϵ
A simple example: for any X, Z the diagram X→Z is always satisfied exactly. So, if
is satisfied, then
is satisfied for any Z. If we write out the approximation errors for these diagrams, this example is equivalent to H(Y|X)≥I(Y;Z|X).
Deterministic Natural Latents
Fundamental Theorem
We’ll start with the core idea of natural latents: if one variable Λ mediates between two other (sets of) variables XS,X¯S, and another variable Λ′ is redundantly represented in both XS and X¯S, then Λ′ is also represented in Λ. Intuitively: imagine that Λ is a pipe between XS and X¯S, and the only way for information to move between XS and X¯S is through that pipe. Then if some information Λ′ is present in both XS and X¯S, it must have gone through the pipe.
Diagrammatically, the theorem says:
The proof is just two applications of The Dangly Bit Lemma, followed by marginalizing out XS and X¯S:
Minimality, Maximality, and Isomorphism of Deterministic Natural Latents
Now suppose that Λ (approximately) satisfies both the mediation and redundancy conditions. Then:
So, Λ is (approximately) the smallest variable which (approximately) satisfies the mediation condition, in the sense that Λ is (approximately) determined by any other variable which (approximately) satisfies the mediation condition. This is the “minimality” property of deterministic natural latents.
Similarly:
So, Λ′ is (approximately) the largest variable which (approximately) satisfies the redundancy condition, in the sense that any other variable which (approximately) satisfies the redundancy condition is (approximately) determined by Λ. This is the “maximality” property of deterministic natural latents.
Finally, suppose that Λ and Λ′ both satisfy both the mediation and redundancy conditions. Then by the Fundamental Theorem, each is (approximately) determined by other; they’re approximately isomorphic. So, the (approximate) deterministic natural latent is (approximately) unique up to isomorphism.
And just like that, we’ve proven the main properties of deterministic natural latents which we typically want to use: (approximate) minimality, (approximate) maximality, and (approximate) uniqueness.
When we say "Y is (approximately) determined by X" in this post, we mean "conditional on X, the entropy of Y is (approximately) zero". This may or may not imply any other notion of "Y is (approximately) determined by X", like e.g. "there exists a deterministic function of X which is equal to Y with high probability".