TLDR; This is the first post of Distilling Singular Learning Theory (DSLT), an introduction to which can be read at DSLT0. In this post I explain how singular models (like neural networks) differ from regular ones (like linear regression), give examples of singular loss landscapes, and then explain why the Real Log Canonical Threshold (aka the learning coefficient) is the correct measure of effective dimension in singular models.
When a model class is singular (like neural networks), the complexity of a parameter $w$ in parameter space $W \subset \mathbb{R}^d$ needs a new interpretation. Instead of being defined by the total number of parameters $d$ available to the model, the complexity (or effective dimensionality) of $w$ is defined by a positive rational $\lambda \in \mathbb{Q}_{>0}$ called the Real Log Canonical Threshold (RLCT), also known as the learning coefficient. The geometry of the loss $K(w)$ is fundamentally defined by the singularity structure of its minima, which $\lambda$ measures. Moreover, in regular models like linear regression the RLCT is $\lambda = \frac{d}{2}$, but in singular models it satisfies $\lambda \leq \frac{d}{2}$ in general. At its core, then, Sumio Watanabe's Singular Learning Theory (SLT) shows the following key insight:
The RLCT λ∈Q>0 is the correct measure of effective dimensionality of a model w∈W.
Watanabe shows that the RLCT λ has strong effects on the learning process: it is the correct generalisation of model complexity in the Bayesian Information Criterion for singular models, and therefore plays a central role in the asymptotic generalisation error, thereby inheriting the name "learning coefficient".
In this first post, after outlining the Bayesian setup of SLT, we will start by defining what a singular model is and explain what makes them fundamentally different to regular models. After examining different examples of singular K(w) loss landscapes, we will define the RLCT to be the scaling exponent of the volume integral of nearly true parameters, and conclude by summarising how this quantity correctly generalises dimensionality.
Preliminaries of SLT
The following section introduces some necessary technical terminology, so use it as a reference point, not necessarily something to cram into your head on a first read through. A more thorough setup can be found in [Car21, Chapter 2], which follows [Wat09] and [Wat18].
SLT is established in the Bayesian paradigm, where the Bayesian posterior on the parameter space W is the primary object of focus, containing information on which parameters w∈W correspond to "good" models.
Our statistical learning setup consists of the following data:
- A dataset $D_n = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$, where for $i = 1, \dots, n$ each $X_i \in \mathbb{R}^N$ is an input and $Y_i \in \mathbb{R}^M$ is an output (so we are in the supervised learning setting).
- We suppose the sequence in $D_n$ is independent and identically distributed according to a true distribution $q(y, x) = q(y|x)q(x)$. For our purposes, we assume the true distribution of inputs $q(x)$ to be known, but the true distribution of outputs $q(y|x)$ to be unknown.
- We then choose a model class $p(y|x, w)$, defined by parameters $w$ in a compact parameter space $W \subseteq \mathbb{R}^d$ that contains the origin. We hope to find model parameters $w$ that adequately approximate the truth, or in other words, learn how to accurately predict an output given an input. For example, a model class could be a fixed neural network architecture with Gaussian noise, as below.
- We select a prior distribution $\varphi(w)$ of our choosing[1] that is non-zero on $W$, so $\varphi(w) > 0$.
Using this data, the error of the model w on the dataset Dn is defined by the empirical negative log likelihood (NLL), Ln(w),
$$L_n(w) = -\frac{1}{n}\sum_{i=1}^{n} \log p(y_i \mid x_i, w),$$
where $e^{-nL_n(w)} = \prod_{i=1}^{n} p(y_i|x_i, w) = p(D_n|w)$ is the model likelihood. [2][3]
This gives rise to the Bayesian posterior of w defined by [4]
$$p(w \mid D_n) := \frac{1}{Z_n}\varphi(w)e^{-nL_n(w)},$$
where the partition function (or, in Bayesian terms, the evidence) is given by
$$Z_n = \int_W \varphi(w)e^{-nL_n(w)}\, dw.$$
The partition function $Z_n$ measures posterior density, and thus contains a lot of macroscopic data about a system. Inspired by its role in physics, and for theoretical ease, we consider the free energy
$$F_n = -\log Z_n.$$
Performing asymptotic analysis on Zn (and therefore Fn) is the main task of SLT. The learning goal is to find small regions of parameter space with high posterior density, and therefore low free energy.
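To make these objects concrete, here is a minimal numerical sketch (my own illustration, not from the post) for an assumed toy regular model $p(y|x,w) = \mathcal{N}(y; wx, 1)$ with a uniform prior on $W = [-3, 3]$; the seed, grid, and sample size are arbitrary choices.

```python
import numpy as np

# Minimal numerical sketch of L_n, Z_n and F_n under an assumed toy setup:
# model p(y|x,w) = N(y; w*x, 1), true parameter w0 = 0.5, uniform prior on W = [-3, 3].
rng = np.random.default_rng(0)
n, w0 = 200, 0.5
X = rng.normal(size=n)
Y = w0 * X + rng.normal(size=n)

ws = np.linspace(-3, 3, 2001)                     # grid over the parameter space W
dw = ws[1] - ws[0]
resid = Y[None, :] - ws[:, None] * X[None, :]
Ln = 0.5 * np.log(2 * np.pi) + 0.5 * np.mean(resid**2, axis=1)   # empirical NLL L_n(w)

phi = np.full_like(ws, 1.0 / 6.0)                 # uniform prior density on [-3, 3]
Zn = np.sum(phi * np.exp(-n * Ln)) * dw           # partition function Z_n (Riemann sum)
Fn = -np.log(Zn)                                  # free energy F_n
print(f"NLL minimised at w = {ws[np.argmin(Ln)]:.3f},  F_n = {Fn:.2f}")
```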
Though we never have access to the truth in the learning procedure, for theoretical purposes we nonetheless define the empirical entropy of the true distribution
$$S_n := -\frac{1}{n}\sum_{i=1}^{n} \log q(y_i \mid x_i).$$
Even though this quantity is always inaccessible in real settings, there is almost sure convergence $S_n \to S$ as $n \to \infty$ to a constant $S$ that doesn't depend on $n$, [5]
$$S = E_X[-\log q(y|x)] = -\iint_{\mathbb{R}^{N+M}} q(y, x)\log q(y|x)\, dx\, dy.$$
To study the posterior, we define the Kullback-Leibler divergence $K(w)$ between the truth and the model,
$$K(w) := \iint_{\mathbb{R}^{N+M}} q(y|x)q(x)\log\frac{q(y|x)}{p(y|x,w)}\, dx\, dy,$$
which is the infinite-data limit of its empirical counterpart,
$$K_n(w) := \frac{1}{n}\sum_{i=1}^{n} \log\frac{q(y_i|x_i)}{p(y_i|x_i,w)} = L_n(w) - S_n.$$
The KL divergence is usually thought of as a "loss metric"[6] between the truth and the model since:
- $K(w) \geq 0$ for all $w \in W$, and
- $K(w) = 0$ if and only if $p(y|x,w) = q(y|x)$ for all $x \in \mathbb{R}^N$ and all $y \in \mathbb{R}^M$.
As such, I will colloquially refer to $K(w)$ as the loss landscape. [7] The current results in SLT require $K(w)$ to be an analytic function, but it seems likely that they can be generalised to non-analytic settings with suitable hypotheses and constraints.
To analyse where the loss $K(w)$ is minimised, we are then led to defining the set of true parameters,
$$W_0 = \{w \in W \mid K(w) = 0\} = \{w \in W \mid p(y|x,w) = q(y|x)\}.$$
We say that the true distribution $q(y|x)$ is realisable by the model class $p(y|x,w)$ if $W_0$ is non-empty, that is, there exists some $w^{(0)} \in W$ such that $q(y|x) = p(y|x, w^{(0)})$ for all $x, y$. Despite being unrealistic in real world applications, this is nonetheless an important starting point for the theory, which will then generalise to the set of optimal parameters in DSLT2.
We are going to restrict our attention to a particular kind of model: neural networks with Gaussian noise. We will formally define a neural network f(x,w) in a following chapter of this sequence, but for now it suffices to say that it is a function f:RN×W→RM with N inputs and M outputs defined by some parameter w∈W. Then our model density is going to be given by
$$p(y \mid x, w) = \frac{1}{(2\pi)^{M/2}}\exp\left(-\frac{1}{2}\lVert y - f(x, w)\rVert^2\right).$$
From here on in, we will assume we are working with a (model, truth, prior) triple (p(y|x,w),q(y|x),φ(w)) as specified in this section.
Loss in our setting
To put these technical quantities into perspective, let me make clear two key points:
- Under the regression model, the NLL is equivalent to the mean-squared error of the neural network $f(x, w)$ on the dataset $D_n$ (up to a constant),
$$L_n(w) = \frac{M}{2}\log 2\pi + \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\lVert y_i - f(x_i, w)\rVert^2.$$
- In the realisable case where $q(y|x) = p(y|x, w^{(0)})$, the KL divergence is just the mean squared difference between the model and the truth, weighted by the input distribution $q(x)$ (see the short numerical sketch after this list),
$$K(w) = \frac{1}{2}\int_{\mathbb{R}^N} \lVert f(x, w) - f(x, w^{(0)})\rVert^2 q(x)\, dx.$$
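As a quick sanity check of the realisable-case formula, here is a Monte Carlo sketch for an assumed one-parameter model $f(x, w) = \tanh(wx)$ with $q(x) = \mathcal{N}(0, 1)$ and true parameter $w^{(0)} = 1$ (all of these choices are illustrative, not from the post).

```python
import numpy as np

# Monte Carlo sketch of K(w) = (1/2) * E_x[(f(x,w) - f(x,w0))^2] for an assumed
# toy model f(x, w) = tanh(w * x), q(x) = N(0, 1), true parameter w0 = 1.
rng = np.random.default_rng(1)
xs = rng.normal(size=100_000)
w0 = 1.0

def K(w):
    return 0.5 * np.mean((np.tanh(w * xs) - np.tanh(w0 * xs)) ** 2)

print(K(1.0), K(0.5))   # ~0 at the true parameter, positive away from it
```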
Singular vs Regular Models
What is a singular model?
The key quantity that distinguishes regular and singular models is the Fisher information matrix $I(w)$, whose entries are defined by
$$I_{j,k}(w) = \iint_{\mathbb{R}^{N+M}} \left(\frac{\partial}{\partial w_j}\log p(y|x,w)\right)\left(\frac{\partial}{\partial w_k}\log p(y|x,w)\right) p(y|x,w)\, q(x)\, dx\, dy.$$
It can be shown that when evaluated at a point on the set of true parameters w(0)∈W0, the Fisher information matrix I(w) is simply the Hessian of K(w), so
$$I_{j,k}(w^{(0)}) = \left.\frac{\partial^2}{\partial w_j \partial w_k} K(w)\right|_{w = w^{(0)}}.$$
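For a toy illustration of this identity (my own, not from the post), the following sympy sketch checks $I(w^{(0)}) = \partial^2_w K(w)\big|_{w^{(0)}}$ for an assumed one-parameter regression model $f(x, w) = wx$ with $q(x) = \mathcal{N}(0, 1)$.

```python
import sympy as sp

# A symbolic check of Fisher information = Hessian of K at the true parameter, for an
# assumed toy model p(y|x,w) = N(y; w*x, 1), q(x) = N(0,1), true parameter w0.
x, u, w, w0 = sp.symbols('x u w w0', real=True)
q = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)          # input density q(x)

# Fisher information I(w) = E[(d/dw log p)^2]; writing u := y - w*x, the score is x*u.
inner = sp.integrate((x * u)**2 * sp.exp(-u**2 / 2) / sp.sqrt(2 * sp.pi), (u, -sp.oo, sp.oo))
fisher = sp.integrate(inner * q, (x, -sp.oo, sp.oo))

# K(w) = (1/2) * E_x[(w*x - w0*x)^2], the realisable-case formula above.
K = sp.Rational(1, 2) * sp.integrate(((w - w0) * x)**2 * q, (x, -sp.oo, sp.oo))

print(sp.simplify(fisher), sp.simplify(sp.diff(K, w, 2)))   # both equal 1
```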
A regular statistical model class is one which is identifiable (so $p(y|x,w_1) = p(y|x,w_2)$ implies that $w_1 = w_2$), and has positive definite Fisher information matrix $I(w)$ for all $w \in W$. Regular model classes, such as standard linear regression, are the backbone of classical statistics, for which all pre-existing literature on Bayesian statistics applies. But, from the point of view of SLT, regular model classes are... boring.
If a model class is not regular, then it is strictly singular. The non-identifiability condition can be easily dealt with, but it is the degeneracy of the Fisher information matrix that fundamentally changes the nature of the posterior and its asymptotics. We will say a model defined by a fixed $w^{(0)} \in W$ (not necessarily a true parameter) is strictly singular if the Fisher information matrix at that point, $I(w^{(0)})$, is degenerate, meaning:
- $\mathrm{rank}(I(w^{(0)})) < d$, where $d$ is the dimension of the parameter space $W \subset \mathbb{R}^d$, or equivalently;
- $\det I(w^{(0)}) = 0$.
Then the model class is strictly singular if there exists a w(0)∈W such that I(w(0)) is degenerate. A singular model class can be either regular or strictly singular - Watanabe's theory thus generalises regular models, regardless of the model non-identifiability or degenerate Fisher information.
It can be easily shown that, under the regression model, $I(w^{(0)})$ is degenerate if and only if the set of derivatives
$$\left\{\frac{\partial}{\partial w_j} f(x, w)\right\}_{j=1}^{d},$$
evaluated at $w^{(0)}$, is linearly dependent (as functions of $x$).
In regular models, the set of true parameters W0 consists of one point. But in singular models, the degeneracy of the Fisher information matrix means W0 is not restricted to being one point, or even a set of isolated points - in general, these local minima of K(w) are connected together in high-dimensional structures [8]. In strictly singular models, the true parameters are degenerate singularities [9] of K(w), and thus K(w) cannot be approximated by a quadratic form near these points. This is the fundamental reason the classical theory of Bayesian statistics breaks down.
Watanabe states that "in singular statistical models, the knowledge or grammar to be discovered corresponds to singularities in general" [Wat09]. With this in mind, it is unsurprising that the following widely used models are all examples of singular models:
- Layered neural networks
- Gaussian, binomial, multinomial and other mixture models
- Reduced rank regression
- Boltzmann machines
- Bayes networks
- Hidden Markov models
Singular models are characterised by features like having hierarchical structure, being superpositions of parametric functions, containing hidden variables, etc., all in the service of obtaining hidden knowledge from random samples.
Classical Bayesian inference breaks down for singular models
There are two key properties of regular models that are critical to Bayesian inference as n→∞:
- Asymptotic normality: The posterior of regular models converges in distribution to a $d$-dimensional normal distribution centred at the maximum likelihood estimator $w^{(0)}$ [Vaa07]:
$$p(w \mid D_n) \to \mathcal{N}_d\!\left(w^{(0)}, \tfrac{1}{n}I(w^{(0)})^{-1}\right).$$
- Bayesian Information Criterion (BIC): The free energy of regular models asymptotically looks like the BIC as $n \to \infty$, where $L_n(w^{(0)}) = \min_{w \in W} L_n(w)$ and $d$ is the dimension of parameter space $W \subseteq \mathbb{R}^d$:
$$F_n \approx nL_n(w^{(0)}) + \frac{d}{2}\log n = \mathrm{BIC}.$$
At the core of both of these results is an asymptotic expansion that strongly depends on the Fisher information matrix I(w) being non-degenerate at true parameters w(0)∈W0. It's instructive to see why this is, so let's derive the BIC to see where I(w) shows up.
Deriving the Bayesian Information Criterion only works for regular models
For the sake of this calculation, let us assume $W = \mathbb{R}^d$. Taking our cues from [KK08], suppose $w^{(0)} \in W_0$ (which is thus a maximum likelihood estimator, satisfying $L_n(w^{(0)}) = \min_{w \in W} L_n(w)$). We can Taylor expand the NLL as
$$L_n(w) = L_n(w^{(0)}) + (w - w^{(0)})^T \left.\frac{\partial L_n(w)}{\partial w}\right|_{w = w^{(0)}} + \frac{1}{2}(w - w^{(0)})^T J(w^{(0)})(w - w^{(0)}) + \dots,$$
where $J(w^{(0)}) = \left.\frac{\partial^2 L_n(w)}{\partial w \partial w^T}\right|_{w = w^{(0)}}$ is the Hessian. Since we are analysing the asymptotic limit $n \to \infty$, we can relate this Hessian to the Fisher information matrix,
$$J(w^{(0)}) = \left.\frac{\partial^2 L_n(w)}{\partial w \partial w^T}\right|_{w = w^{(0)}} = \left.\frac{\partial^2 (K_n(w) + S_n)}{\partial w \partial w^T}\right|_{w = w^{(0)}} \approx \left.\frac{\partial^2 K(w)}{\partial w \partial w^T}\right|_{w = w^{(0)}} = I(w^{(0)}) \quad \text{as } n \to \infty.$$
By definition $w^{(0)}$ is a minimum of $L_n(w)$, so $\left.\frac{\partial L_n(w)}{\partial w}\right|_{w = w^{(0)}} = 0$, and we can expand the partition function as
$$Z_n = \int_W e^{-nL_n(w)}\varphi(w)\, dw = \int_W \exp\!\left(-nL_n(w^{(0)}) - \frac{n}{2}(w - w^{(0)})^T I(w^{(0)})(w - w^{(0)}) + \dots\right)\left[\varphi(w^{(0)}) + (w - w^{(0)})^T \left.\frac{\partial \varphi(w)}{\partial w}\right|_{w = w^{(0)}} + \dots\right] dw.$$
Here's the crux: if $I(w^{(0)})$ is non-degenerate (so the model is regular), then we can perform this integral in good faith, knowing that it will always exist. In that case, the second term, involving $\frac{\partial \varphi(w)}{\partial w}$, vanishes since it is the first central moment of a normal distribution, so we have
$$Z_n \approx e^{-nL_n(w^{(0)})}\varphi(w^{(0)}) \int_W \exp\!\left(-\frac{n}{2}(w - w^{(0)})^T I(w^{(0)})(w - w^{(0)})\right) dw = e^{-nL_n(w^{(0)})}\varphi(w^{(0)}) \frac{(2\pi)^{d/2}}{n^{d/2}\sqrt{\det I(w^{(0)})}},$$
since the integrand is, up to normalisation, the density of the $d$-dimensional multivariate Gaussian $\mathcal{N}_d\!\left(w^{(0)}, \frac{1}{n}I(w^{(0)})^{-1}\right)$. Notice that this is the same distribution that arises in the asymptotic normality result, a theorem that has the same core, but requires more rigorous probability theory to prove. If $I(w^{(0)})$ is degenerate, then it is non-invertible, meaning the above formulas cannot hold.
The free energy of this ensemble is thus
$$F_n = -\log Z_n = nL_n(w^{(0)}) + \frac{d}{2}\log n - \log\varphi(w^{(0)}) - \frac{d}{2}\log 2\pi + \frac{1}{2}\log\det I(w^{(0)}),$$
and so, ignoring the $O(1)$ terms in $n$, we arrive at the Bayesian Information Criterion
$$\mathrm{BIC} = nL_n(w^{(0)}) + \frac{d}{2}\log n.$$
This quantity can be understood as an accuracy-complexity tradeoff, where the complexity of the model class is defined by $d$. We will elaborate on this more in DSLT2, but for now, you should just believe that the Fisher information $I(w)$ is a big deal. Generalising this procedure (and therefore the BIC) to singular models is the heart of SLT.
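As an illustration of the derivation (again using my assumed toy model $p(y|x,w) = \mathcal{N}(y; wx, 1)$ from the earlier sketch, with a uniform prior on $[-3, 3]$), the snippet below computes $F_n$ by direct numerical integration and compares it with the BIC; the two should differ only by an $O(1)$ constant.

```python
import numpy as np

# Sketch: compare the numerically integrated free energy F_n with the BIC for an
# assumed regular toy model p(y|x,w) = N(y; w*x, 1), true w0 = 0.5, uniform prior on [-3, 3].
rng = np.random.default_rng(0)
for n in [100, 1_000, 10_000]:
    X = rng.normal(size=n)
    Y = 0.5 * X + rng.normal(size=n)
    ws = np.linspace(-3, 3, 4001)
    dw = ws[1] - ws[0]
    # NLL expanded as a quadratic in w (avoids building an n-by-grid array)
    Ln = 0.5 * np.log(2 * np.pi) + 0.5 * (np.mean(Y**2) - 2 * ws * np.mean(X * Y)
                                          + ws**2 * np.mean(X**2))
    Lmin = Ln.min()
    Zn_shifted = np.sum(np.exp(-n * (Ln - Lmin)) / 6.0) * dw   # factor out e^{-n*Lmin} for stability
    Fn = n * Lmin - np.log(Zn_shifted)
    BIC = n * Lmin + 0.5 * np.log(n)                            # d = 1 parameter here
    print(n, round(Fn, 2), round(BIC, 2))                       # agree up to an O(1) constant
```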
Examples of Singular Loss Landscapes
In essence, the Fisher information matrix I(w) describes something about the effective dimensionality or complexity of a model w. When a model class is regular, the effective dimensionality of every point is simply d, the number of parameters available to the model. But in the singular case, a new notion of effective dimensionality is required to adequately describe the complexity of a model. We're now going to look at two cases of singular models [10] - or more precisely, loss landscapes that correspond to singular models - to motivate this generalisation. We'll start with the easier case where one or more parameters are genuinely "free".
Sometimes singularities are just free parameters
Example 1.1: Suppose we have $d = 2$ parameters afforded to a model such that $K(w_1, w_2) = w_1^2$, which has a Hessian given by
$$J(w) = \frac{\partial^2}{\partial w^T \partial w}K(w) = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}.$$
Taking the critical point $w^{(0)} = (0, 0)$, we have $I(w^{(0)}) = J(w^{(0)}) = J(w)$ and so $\det I(w^{(0)}) = 0$, thus the model is singular. In this case, since $K(0, w_2) = 0$ for all $w_2$, we could simply throw out the free parameter $w_2$ and define a regular model with $d_1 = 1$ parameter that has identical geometry $K(w_1) = w_1^2$, and therefore defines the same input-output function, $f(x, (w_1, w_2)) = f(x, w_1)$.
This example is called a minimally singular case. Suppose $W \subseteq \mathbb{R}^d$ with integers $d_1, d_2 > 0$ such that $d_1 + d_2 = d$, and after some change of basis[11] we may write a local expansion of $K(w)$ as a sum of $d_1$ squares,
$$K(w) = \sum_{i=1}^{d_1} c_i w_i^2,$$
where $c_1, c_2, \dots, c_{d_1} > 0$ are positive coefficients. Then the Fisher information matrix has the form
$$I(w^{(0)}) = \begin{pmatrix} c_1 & & & \\ & \ddots & & \\ & & c_{d_1} & \\ & & & 0_{d_2} \end{pmatrix},$$
where $0_{d_2}$ is the square $d_2 \times d_2$ zero matrix. Perhaps then we could define the "effective dimensionality" of $w^{(0)}$ as being $\mathrm{rank}(I(w^{(0)})) = d_1$, which is the number of directions in parameter space in which the model changes - the number of "non-free" parameters - and just discard the $d_2$ "free" parameters, whose directions are tangent to $W_0$.
Sure! We can do that, and if we did, our BIC derivation would carry out fine and we would just replace d by d1 in the final formula. So the minimally singular case is easy to handle.
But this doesn't always work.
But not all singularities are free parameters
Defining the effective dimensionality at w(0) as rank(I(w(0))) seems nice in theory, but turns out to give nonsensical answers pretty quickly - it is not a full enough description of the actual geometry at play.
Example 1.2: Suppose instead that $K(w_1, w_2) = \frac{1}{2}w_1^2 w_2^2$. Then the Hessian is
$$J(w) = \begin{pmatrix} w_2^2 & 2w_1 w_2 \\ 2w_1 w_2 & w_1^2 \end{pmatrix}.$$
At the critical point $w^{(0)} = (0, 0)$ the Fisher information is
$$I(w^{(0)}) = J(w^{(0)}) = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix},$$
which is obviously degenerate.
If we used our notion of effective dimensionality from before, we would say the model defined by w(0) had effective dimension of rank(I(w(0)))=0. But this would be ridiculous - clearly there are more than zero "effective dimensions" in this model, a term that would intuitively imply K(w) was identically zero, which it clearly is not. Thus, we need a different way of thinking about effective dimensionality.
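In the spirit of footnote [10], here is a sympy sketch of one assumed model that genuinely produces this loss: $f(x, w) = w_1 w_2 x$ with Gaussian noise, $q(x) = \mathcal{N}(0, 1)$, and true parameter $w^{(0)} = (0, 0)$ (my own toy choice, not from the post).

```python
import sympy as sp

# A concrete two-parameter model whose KL divergence is exactly Example 1.2 (assumed
# toy model: f(x, w) = w1*w2*x with Gaussian noise, q(x) = N(0,1), true parameter (0,0),
# so the true function is identically zero).
x, w1, w2 = sp.symbols('x w1 w2', real=True)
q = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)

K = sp.Rational(1, 2) * sp.integrate((w1 * w2 * x)**2 * q, (x, -sp.oo, sp.oo))
print(sp.simplify(K))                                           # w1**2*w2**2/2

H = sp.hessian(K, (w1, w2))
print(H.subs({w1: 0, w2: 0}), H.subs({w1: 0, w2: 0}).rank())    # zero matrix, rank 0
```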
The Real Log Canonical Threshold (aka the Learning Coefficient)
In this section we are going to explain the key claim of this post: that effective dimensionality in singular models is measured by a positive rational number called the Real Log Canonical Threshold, also known as the learning coefficient.
Dimensionality as a volume co-dimension
Taking inspiration from Weyl's famous Volume of Tubes paper, we can reframe dimensionality in terms of a scaling exponent of the volume of "nearly" true parameters. To explain this, we will generalise the minimally singular case above. The following discussion follows [Wei22].
Assume we have a partition as before with d1,d2∈N≥0 such that d1+d2=d, where d1 is the number of non-free parameters and d2 is the number of free parameters. For any ε>0 we can consider the set of almost true parameters centred at w(0) (which, without loss of generality, we will take to be w(0)=0),
$$W_\varepsilon = \{w \in W \mid K(w) < \varepsilon\},$$
and an associated volume function
$$V(\varepsilon) = \int_{W_\varepsilon} \varphi(w)\, dw.$$
As long as the prior $\varphi(w)$ is non-zero on $W_0$ it does not affect the relevant features of the volume, so we may assume that it is a constant $C$ in the first $d_1$ directions and a normal distribution in the remaining $d_2$. Then since $K(w) \approx \sum_{i=1}^{d_1} c_i w_i^2$, we can write
$$V(\varepsilon) = \int_{\{w \,\mid\, \sum_{i=1}^{d_1} c_i w_i^2 < \varepsilon\}} C\, dw_1 \cdots dw_{d_1} \int_{\mathbb{R}^{d_2}} e^{-\frac{1}{2}(w_{d_1+1}^2 + \cdots + w_d^2)}\, dw_{d_1+1} \cdots dw_d.$$
The right-hand integral is just some constant $A$ that doesn't depend on $\varepsilon$, and for the left we can make the substitution $u_i = \sqrt{c_i/\varepsilon}\, w_i$, hence
$$V(\varepsilon) = AC\int_{\{u \,\mid\, \sum_{i=1}^{d_1} u_i^2 < 1\}} \sqrt{\frac{\varepsilon}{c_1}} \cdots \sqrt{\frac{\varepsilon}{c_{d_1}}}\, du_1 \cdots du_{d_1}.$$
Recognising the remaining integral as the volume of the $d_1$-ball, a constant $B$ that does not depend on $\varepsilon$, we see that
$$V(\varepsilon) \propto \frac{\varepsilon^{d_1/2}}{\sqrt{c_1 \cdots c_{d_1}}}.$$
Then the dimension $d_1$ arises as the scaling exponent of $\varepsilon^{1/2}$, which can be extracted via the following ratio-of-volumes formula for some $a \in (0, 1)$:
$$d_1 = 2\lim_{\varepsilon \to 0} \frac{\log\left(V(a\varepsilon)/V(\varepsilon)\right)}{\log a}.$$
This scaling exponent, it turns out, is the correct way to think about dimensionality of singularities.
Watanabe shows in Theorem 7.1 of [Wat09] that in general, for any singular model defined by w(0), the volume integral centred at w(0) has the form
$$V(\varepsilon) = c\varepsilon^\lambda + o(\varepsilon^\lambda),$$
where λ∈Q>0 is a positive rational number called the Real Log Canonical Threshold (RLCT) associated to the "most singular point" in W0. This is the quantity that generalises the dimensionality of a singularity. What's more, different singularities in W0 can have different RLCT values, and thereby different "effective dimensionalities". As suggested above, the RLCT can then be defined by this volume formula:
$$\lambda = \lim_{\varepsilon \to 0} \frac{\log\left(V(a\varepsilon)/V(\varepsilon)\right)}{\log a}.$$
An example of fractional dimension
Example 1.3: To build intuition for what a "fractional dimension" is, consider a model with $d = 1$ parameter and KL divergence given by $K(w) = w^4$, which is singular since $\frac{\partial^2 K}{\partial w^2}\big|_{w=0} = 0$. A simple calculation shows that for this KL divergence,
$$V(\varepsilon) \propto \varepsilon^{1/4},$$
meaning $\lambda = \frac{1}{4}$ and so the "effective dimensionality" is $2\lambda = \frac{1}{2}$.
Meanwhile, in the $K(w) = w^2$ case, $V(\varepsilon) \propto \varepsilon^{1/2}$, so the effective dimensionality is $2\lambda = 1$.
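Here is a Monte Carlo sketch of the ratio-of-volumes formula applied to Example 1.3, assuming a uniform prior on $W = [-1, 1]$ (the sample size and choice of $\varepsilon$ are arbitrary illustrative choices of mine).

```python
import numpy as np

# Monte Carlo sketch of the volume scaling exponent for K(w) = w^4 (Example 1.3),
# with an assumed uniform prior on W = [-1, 1].
rng = np.random.default_rng(3)
w = rng.uniform(-1, 1, size=2_000_000)
K = w**4

def V(eps):
    return np.mean(K < eps)            # fraction of "nearly true" parameters

eps, a = 1e-4, 0.5
lam_hat = np.log(V(a * eps) / V(eps)) / np.log(a)
print(lam_hat)                         # ~0.25, i.e. lambda = 1/4, effective dimension 1/2
```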
The RLCT can be read off when K(w) is in normal crossing form
I may have given the impression in the previous section that the RLCT is trivial to calculate. In general, this couldn't be further from the truth. But in one special case, it is. For this discussion we will ignore the prior, i.e. we will set it to be uniform on $W$.
One dimensional case
As we just saw in Example 1.3, in the one dimensional case where $K(w) = w^{2k}$ for some $k \geq 1$, the RLCT is simply $\lambda = \frac{1}{2k}$. In fact, if we can express $K(w)$ in the form
$$K(w) = (w - c_1)^{2k_1} \cdots (w - c_J)^{2k_J}$$
for positive integers $k_1, \dots, k_J$ and distinct $c_1, \dots, c_J \in \mathbb{R}$, then the RLCT associated to each singularity $c_j$ is simply $\lambda_j = \frac{1}{2k_j}$. But Watanabe shows that it is the smallest local RLCT (and thus the highest exponent in $K(w)$) that dominates the free energy, thus defining the global RLCT $\lambda$, where
$$\lambda = \min_{j = 1, \dots, J}\left(\frac{1}{2k_j}\right).$$
Example 1.4: This example is going to be very relevant in DSLT2. If we have
$$K(w) = (w + 1)^2 w^4,$$
with true parameters $w^{(0)}_{-1} = -1$ and $w^{(0)}_0 = 0$, then the local RLCT associated to each singularity is
$$\lambda_{-1} = \frac{1}{2} \quad \text{and} \quad \lambda_0 = \frac{1}{4}.$$
The global RLCT is thus $\lambda = \lambda_0$.
Multidimensional case
Suppose now that $d > 1$, so $K(w) = K(w_1, \dots, w_d)$. Suppose without loss of generality that $w^{(0)} = 0$ is a true parameter for $K(w)$. If we can write the KL divergence in normal crossing form near $w^{(0)}$,
$$K(w) = w_1^{2k_1} \cdots w_d^{2k_d},$$
then the RLCT is given by
$$\lambda = \min_{j = 1, \dots, d}\left(\frac{1}{2k_j}\right),$$
and its multiplicity $m$ is the number of indices $j$ at which this minimum is attained, that is, the number of elements of $\{k_1, \dots, k_d\}$ equal to $\max_j k_j$.
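The read-off rule is easy to mechanise; the following small sketch (a hypothetical helper of mine, assuming every $k_j \geq 1$) computes the RLCT and its multiplicity from the exponents.

```python
from fractions import Fraction

# Sketch: read off the RLCT and multiplicity from the exponents (k_1, ..., k_d) of a
# normal crossing form K(w) = w_1^{2k_1} ... w_d^{2k_d}, assuming every k_j >= 1.
def rlct_normal_crossing(ks):
    lambdas = [Fraction(1, 2 * k) for k in ks]
    lam = min(lambdas)
    return lam, lambdas.count(lam)      # (RLCT, multiplicity)

print(rlct_normal_crossing([2, 1]))     # K = w1^4 * w2^2  ->  (1/4, 1), as in Example 1.5 below
print(rlct_normal_crossing([1, 1]))     # K = w1^2 * w2^2  ->  (1/2, 2)
```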
This generalises the one-dimensional case above in the following sense:
Example 1.5: Suppose now that we have a two dimensional KL divergence of the form
$$K(w_1, w_2) = (w_1 + 1)^2 w_1^4 w_2^2.$$
Then, in a neighbourhood of the singularity $w^{(0)}_0 = (0, 0)$, the KL divergence is approximately
$$K(w) \approx w_1^4 w_2^2.$$
Thus, the RLCT associated to $w^{(0)}_0$ is
$$\lambda_0 = \frac{1}{4},$$
with multiplicity $m_0 = 1$.
On the other hand, near the singularity $w^{(0)}_{-1} = (-1, 0)$ the KL divergence is, up to a prefactor, approximately
$$K(w) \approx (w_1 + 1)^2 w_2^2,$$
so the RLCT associated to $w^{(0)}_{-1}$ is
$$\lambda_{-1} = \frac{1}{2},$$
with multiplicity $m_{-1} = 2$. So, in this case the global RLCT is $\lambda = \lambda_0$, which we will see in DSLT2 means that the posterior is most concentrated around the singularity $w^{(0)}_0$.
Resolution of Singularities
In Algebraic Geometry and Statistical Learning Theory, Watanabe shows that algebraic geometry plays a central role in governing the behaviour of statistical models, and a highly non-trivial one in singular models especially. This rich connection between these two deep mathematical fields is, in my eyes, both profound and extremely beautiful.
The remarkable insight of Watanabe is that in fact any KL divergence, under appropriate hypotheses (such as analyticity), can be written in normal crossing form near a singularity of K(w). To do so, he invokes one of the fundamental theorems of algebraic geometry: Hironaka's Resolution of Singularities. The content of this theorem and its implications go well beyond the scope of this sequence. But, I will briefly mention its role in the theory as it relates to the RLCT. For a more detailed introduction to this part of the story, see [Wat09, Section 1.4].
The theorem guarantees the existence of a d-dimensional analytic manifold M and a real analytic map
$$g : M \ni u \mapsto w \in W$$
such that for each local coordinate chart $M_\alpha$ of $M$ one can write
$$K(g(u)) = u_1^{2k_1} \cdots u_d^{2k_d} \quad \text{and} \quad \varphi(g(u))\,|g'(u)| = \phi(u)\,|u_1^{h_1} \cdots u_d^{h_d}|,$$
where each $k_1, \dots, k_d$ and $h_1, \dots, h_d$ are non-negative integers, $|g'(u)|$ is the Jacobian determinant of $w = g(u)$, and $\phi(u) > 0$ is a real analytic function. The global RLCT is then defined by
$$\lambda = \min_{\alpha}\, \min_{j = 1, \dots, d}\left(\frac{h_j + 1}{2k_j}\right),$$
and the global multiplicity is the maximum multiplicity over α.
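To see the formula in action on a single chart, here is a sketch using what I believe is a standard blow-up of Example 1.2's $K(w_1, w_2) = \frac{1}{2}w_1^2 w_2^2$; the chart and exponents below are my own worked example, not taken from the post.

```python
from fractions import Fraction

# Sketch of the resolution formula on one chart, for an assumed blow-up of Example 1.2's
# K(w1, w2) = (1/2) * w1^2 * w2^2. In the chart w1 = u1, w2 = u1*u2 we get K(g(u))
# proportional to u1^4 * u2^2 with Jacobian |g'(u)| = |u1|, i.e. (k1, k2) = (2, 1) and
# (h1, h2) = (1, 0); the other chart w1 = u1*u2, w2 = u2 gives the same by symmetry.
def rlct_chart(ks, hs):
    return min(Fraction(h + 1, 2 * k) for k, h in zip(ks, hs))

print(rlct_chart([2, 1], [1, 0]))   # 1/2, so lambda = 1/2 for this singularity
```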
From this point on in the sequence, when you see the word "desingularise", what you should think is "put K(w) into normal crossing form near a singularity".
The RLCT measures the effective dimensionality of a model
Succinctly, the RLCT λ∈Q>0 of a singularity w(0)∈W0⊆W⊆Rd generalises the idea of dimension because:
- If a model defined by $w^{(0)}$ is regular, then
$$\lambda = \frac{d}{2}.$$
- If a model defined by $w^{(0)}$ is minimally singular, where $d_1 < d$ is the number of non-free parameters, then
$$\lambda = \frac{d_1}{2}.$$
- In general, for any singular model the RLCT satisfies (by Theorem 7.2 of [Wat09])
$$\lambda \leq \frac{d}{2}.$$
- In particular, if there are $d_1 < d$ non-free parameters, then
$$\lambda \leq \frac{d_1}{2}.$$
In order to find the asymptotic form of the free energy $F_n$ as $n \to \infty$, Watanabe desingularises $K(w)$ near each singularity using the Resolution of Singularities. The RLCT then directly substitutes into the place of $\frac{d}{2}$ in the BIC formula, which gives rise to the Widely Applicable Bayesian Information Criterion (WBIC)
$$\mathrm{WBIC} := nL_n(w^{(0)}) + \lambda\log n.$$
In DSLT2, we will explain the implications of the WBIC and what it tells us about the profound differences between regular and singular models.
Appendix 1 - The other definition of the RLCT
In this post we have defined the RLCT as the scaling exponent of the volume integral of nearly true parameters. This result, whilst the most intuitive, is presented in the reverse order to how Watanabe originally defines the RLCT in [Wat09]. Alternatively, we can consider the zeta function
$$\zeta(z) = \int_W K(w)^z \varphi(w)\, dw,$$
and show that it has a Laurent series given by
$$\zeta(z) = \zeta_0(z) + \sum_{k=1}^{\infty}\sum_{m=1}^{m_k} \frac{c_{km}}{(z + \lambda_k)^m},$$
where $\zeta_0(z)$ is a holomorphic function, the $c_{km} \in \mathbb{C}$ are coefficients, the $\lambda_k \in \mathbb{Q}_{>0}$ are ordered such that $0 < \lambda_1 < \lambda_2 < \dots$, and $m_k \in \mathbb{N}$ is the order of the pole at $z = -\lambda_k$.
Then the Real Log Canonical Threshold of our (model, truth, prior) triple is λ=λ1 with multiplicity m=m1.
This $\zeta(z)$ is a key piece of machinery in using distribution theory to expand the partition function $Z_n$. In the end, the smallest pole $\lambda_1$ and its multiplicity $m_1$ give the dominant term in the expansion, and a further calculation in [Wat09, Theorem 7.1] shows how $V(\varepsilon) \propto \varepsilon^\lambda$.
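As a small illustration (my own, not from the post), the zeta function of the toy loss $K(w) = w^4$ from Example 1.3, with an assumed uniform (unnormalised) prior on $[-1, 1]$, can be computed symbolically, and its pole recovers $\lambda = \frac{1}{4}$.

```python
import sympy as sp

# Sketch: zeta function for the toy loss K(w) = w^4 of Example 1.3, with an assumed
# uniform (unnormalised) prior on W = [-1, 1]. Integrate where it converges, then read
# the RLCT off the pole of the analytic continuation.
w = sp.symbols('w', real=True)
z = sp.symbols('z', positive=True)                 # Re(z) > 0 ensures convergence
zeta = 2 * sp.integrate(w**(4 * z), (w, 0, 1))     # = 2/(4z + 1), using symmetry of w^4
print(sp.simplify(zeta))
# The analytic continuation 2/(4z + 1) has a single simple pole at z = -1/4,
# so lambda = 1/4 with multiplicity m = 1, matching Example 1.3.
```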
To see why ζ(z) is necessary, and why this definition of the RLCT matters to the free energy formula proof, see the sketch of the proof in [Wat09, pg31-34].
References
[Car21] - Liam Carroll, Phase Transitions in Neural Networks (thesis)
[Wat09] - Sumio Watanabe, Algebraic Geometry and Statistical Learning Theory (book)
[Wat18] - Sumio Watanabe, Mathematical Theory of Bayesian Statistics (book)
[KK08] - Konishi, Kitagawa, Information Criteria and Statistical Modelling (book)
[Vaa07] - van der Vaart, Asymptotic Statistics (book)
[Wei22] - Susan Wei, Daniel Murfet et al., Deep Learning is Singular, and That's Good (paper)
Footnotes

1. In the finite $n$ case, the choice of prior $\varphi(w)$ is a philosophical matter, as well as a matter of mathematical tractability. But as $n \to \infty$, most results in Bayesian statistics show $\varphi(w)$ to be irrelevant so long as it satisfies some reasonable conditions. This is also true in SLT.

2. This should remind you of the Gibbs ensemble from statistical physics - not coincidentally, either.

3. For theoretical, philosophical, and computational purposes, we also define the tempered posterior to be
$$p_\beta(w \mid D_n) := \frac{1}{Z_n^\beta}\varphi(w)e^{-n\beta L_n(w)}, \quad \text{where} \quad Z_n^\beta = \int_W \varphi(w)e^{-n\beta L_n(w)}\, dw,$$
where $\beta > 0$ is the inverse temperature. This $\beta$ plays an important role in deriving the free energy formula and can be thought of as controlling the "skinniness" of the posterior. In our regression model below, it is actually the inverse variance of the Gaussian noise.

4. By Bayes' rule we have $p(w \mid D_n) = \frac{p(D_n \mid w)\varphi(w)}{p(D_n)}$. The form written here follows from some simplification of terms and redefinitions, see page 10 of the thesis.

5. We can define an expectation over the dataset $D_n$ for some function $g(X, Y)$ as
$$E_X[g(X, Y)] = \iint_{\mathbb{R}^{N+M}} g(x, y)q(y, x)\, dx\, dy.$$
In particular, we define the entropy of the true conditional distribution to be
$$S = E_X[-\log q(y|x)] = -\iint_{\mathbb{R}^{N+M}} q(y, x)\log q(y|x)\, dx\, dy,$$
and the (non-empirical) negative log loss to be
$$L(w) = E_X[-\log p(y|x, w)] = -\iint_{\mathbb{R}^{N+M}} q(y, x)\log p(y|x, w)\, dx\, dy.$$
It is easy to show that $E_X[S_n] = S$ and $E_X[L_n(w)] = L(w)$, and so by the law of large numbers there is almost sure convergence $S_n \to S$ and $L_n(w) \to L(w)$. Analogous definitions show
$$K_n(w) = L_n(w) - S_n \to L(w) - S = K(w).$$

6. Though it isn't a true metric, due to its asymmetry in $p$ and $q$, and since it doesn't satisfy the triangle inequality.

7. Note here that since $K(w) = L(w) - S$, we can reasonably call both $K(w)$ and $L(w)$ the loss landscape, since they differ only by a constant $S$ (as $n \to \infty$).

8. Or more precisely, a real analytic set.

9. Since $K(w^{(0)}) = 0$, $\nabla K(w^{(0)}) = 0$, and $I(w^{(0)})$ is degenerate.

10. Based on $K(w) = \frac{1}{2}\int_{\mathbb{R}^N} \lVert f(x, w) - f(x, w^{(0)})\rVert^2 q(x)\, dx$, it is relatively easy to reconstruct a model that genuinely yields a given $K(w)$ function, so we may happily pretend we have said model when we pull such a loss function from thin air.

11. Which is guaranteed to exist since the Hessian is a real symmetric matrix (and thus so is $I(w^{(0)})$), so it can be diagonalised.