This is particularly interesting if we take P and Q to be two different models, and take the indices 1, 2 to be different values of another random variable Y with distribution P[Y] given by (λ, 1−λ). In that case, the above inequality becomes:
Note to self: this assumes P[Y] = Q[Y].
I am usually reasonably good at translating from math to non-abstract intuitive examples... but I didn't have much success here. Do you have an "in English, for simpletons" example to go with this? :-) (You know, something that uses apples and biscuits rather than English-but-abstract words like "there are many hidden variables mediating the interactions between observables" :D.)
Otherwise, my current abstract interpretation of this is something like: "There are detailed models, and those might vary a lot. And then there are very abstract models, which will be more similar to each other... well, except that they might also be totally useless." So I was hoping that a more specific example would clarify things a bit and tell me whether there is more to this (and also whether I got it all wrong or not :-).)
I recommend skipping to the next post. This post was kind of a stub; the next one explains the same idea better.
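Fun fact: the KL divergence of distribution $P[X]$ from distribution $Q[X]$ is convex in the pair $(P, Q)$. Writing it out: $D_{KL}(\lambda P_1[X] + (1-\lambda) P_2[X] \,\|\, \lambda Q_1[X] + (1-\lambda) Q_2[X]) \le \lambda D_{KL}(P_1[X] \,\|\, Q_1[X]) + (1-\lambda) D_{KL}(P_2[X] \,\|\, Q_2[X])$ with $0 \le \lambda \le 1$.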
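The convexity claim is easy to spot-check numerically. The following is a minimal sketch, not a proof: it assumes small discrete distributions over four outcomes, and the helper names (`kl`, `mix`, `random_dist`, `convexity_gap`) are made up for this illustration.

```python
import math
import random

def kl(p, q):
    """D_KL(p || q) in nats, for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(lam, a, b):
    """Pointwise mixture lam * a + (1 - lam) * b."""
    return [lam * ai + (1 - lam) * bi for ai, bi in zip(a, b)]

def random_dist(n, rng):
    """Random distribution over n outcomes with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def convexity_gap(p1, p2, q1, q2, lam):
    """Right-hand side minus left-hand side of the convexity inequality."""
    lhs = kl(mix(lam, p1, p2), mix(lam, q1, q2))
    rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    return rhs - lhs

rng = random.Random(0)
gaps = []
for _ in range(1000):
    p1, p2, q1, q2 = (random_dist(4, rng) for _ in range(4))
    gaps.append(convexity_gap(p1, p2, q1, q2, rng.random()))

# Convexity says every gap is non-negative (up to floating-point error).
print(min(gaps))
```

A non-negative gap in every trial is consistent with the inequality; the size of the gap is how much mixing shrinks the divergence in that trial.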
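This is particularly interesting if we take P and Q to be two different models, and take the indices 1, 2 to be different values of another random variable Y with distribution P[Y] given by (λ, 1−λ). Note that both models are assumed to share this distribution over Y, i.e. Q[Y] = P[Y], since the same weights λ appear on both sides of the divergence. In that case, the above inequality becomes: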
$D_{KL}(P[X] \,\|\, Q[X]) \le E_Y\left[ D_{KL}(P[X|Y] \,\|\, Q[X|Y]) \right]$
In English: the divergence between our models of the X-distribution ignoring Y is at least as small as the average divergence between our models of the X-distribution given Y. This is true regardless of what the two models are - any approximation of the observable distribution improves (or gets no worse) when we integrate out a hidden variable, compared to fixing the value of the hidden variable.
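To make this concrete, here is a small sketch with invented numbers: a binary hidden variable Y with weights (0.5, 0.5) shared by both models, and a binary observable X. The two models are chosen to disagree strongly about X given Y, in opposite directions for the two values of Y. The variable names and the probabilities are assumptions for illustration only.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats, for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def marginal(weights, conditionals):
    """Integrate out Y: sum over y of P[Y=y] * P[X | Y=y]."""
    n = len(conditionals[0])
    return [sum(w * c[i] for w, c in zip(weights, conditionals)) for i in range(n)]

# Shared distribution over the hidden variable Y (assumed equal in both models).
p_y = [0.5, 0.5]

# Each row is a distribution over X given one value of Y.
p_x_given_y = [[0.9, 0.1], [0.1, 0.9]]  # model P
q_x_given_y = [[0.1, 0.9], [0.9, 0.1]]  # model Q, conditionals swapped relative to P

# Divergence between the models of X with Y integrated out.
marginal_kl = kl(marginal(p_y, p_x_given_y), marginal(p_y, q_x_given_y))

# Average divergence between the models of X with Y held fixed.
expected_conditional_kl = sum(
    w * kl(p, q) for w, p, q in zip(p_y, p_x_given_y, q_x_given_y)
)

print(marginal_kl, expected_conditional_kl)
```

In this deliberately extreme case both models assign X the marginal (0.5, 0.5), so the divergence ignoring Y vanishes, while the average divergence given Y is 0.8 ln 9, roughly 1.76 nats. The models disagree about the fine-grained picture and agree completely once the hidden variable is integrated out.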
Of course, this doesn't say anything about how much the approximation improves. Presumably for bad approximations, the divergence will not converge to anywhere near zero as we integrate more and more hidden variables. And if the hidden variable doesn't actually interact with the observables significantly, then presumably the divergence decrease will be near-zero.
So when would we expect this to matter?
I'd expect it to matter mainly when the observable X consists of multiple variables which are "far apart" in a large model - i.e. there are many hidden variables mediating the interactions between observables. In other words, I'd expect this phenomenon to mainly be relevant to information at a distance. It's a hint that information at a distance, in complex systems, converges to some sort of universal behavior/properties, which is simpler in some sense than the full fine-grained model.