TLDR; This is the fourth main post of Distilling Singular Learning Theory which is introduced in DSLT0. I explain how to relate SLT to thermodynamics, and therefore how to think about phases and phase transitions in the posterior in statistical learning. I then provide intuitive examples of first and second order phase transitions in a simple K(w) loss function. Finally, I experimentally demonstrate phase transitions in two layer ReLU neural networks associated to the node-degeneracy and orientation-reversing phases established in DSLT3, which we can understand precisely through the lens of SLT.
In deep learning, the terms "phase" and "phase transition" are often used in an informal manner to refer to a steep change in a metric we care about, like the training or test loss, as a function of SGD steps, or alternatively some hyperparameter like the number of samples from the truth n.
But what exactly are the phases? And why do phase transitions even occur? SLT provides us a solid theoretical framework for understanding phases and phase transitions in deep learning. In this post, we will argue that in the Bayesian setting,
A phase of the learning process corresponds to a singularity of K(w), and a phase transition corresponds to a drastic change in the posterior as a function of a hyperparameter θ.
The hyperparameter θ could be the number of samples from the truth n, some way of varying the model function f(x,w)=f(x,w;θ) or something about the true distribution Dn=Dn(θ), amongst other things. At some critical value θ=θc, we recognise a phase transition as being a discontinuous change in the free energy or one of its derivatives, for example the generalisation error Gn=E[Fn+1]−E[Fn].
In this post, we will present experiments that observe precise phase transitions in the toy neural network models we studied in DSLT3, for which we understand the set of true parameters W0 and therefore the phases. By the end of this post, you will have a framework for thinking about phase transitions in singular models and an intuition for why SLT predicts them to occur in learning.
Phases Correspond to Singularities
The Story Starts in Physics
This subsection is modelled on [Callen, Ch9], but it is only intended to be a high level discussion of the concepts grounded in some basic physics - don't get too bogged down in the details of the thermodynamics.
Fundamentally, a phase describes an aggregate state of a complex system of many interacting components, where the state retains particular qualities with variations in some hyperparameter. To explain the concept in detail, it is natural to start in physics (thermodynamics in particular), where these ideas originally arose. But there is a deeper reason to build from here: every human has an intuitive understanding of the phases of water and how they change with temperature [1], which serves as the base mental model for what a phase is.
One of the main goals of thermodynamics is to study how the equilibrium state of a system changes as a function of macroscopic parameters. In the case of a vessel of water at 1atm of pressure in constant contact with a thermal and pressure reservoir, the equilibrium state of the system is the state that minimises the Gibbs free energy F [2]. The phases, then, are the equilibrium states, which describe qualitative physical properties of the system. The states of matter - solid, liquid, and gas - are all phases of water, which are characterised by variables like their volume and crystal structure. As anybody that has boiled water before knows, these phases undergo transitions as a function of temperature. Let's make this more precise.
The Thermodynamic Setup
Consider a system of K water molecules moving in a 2D container, each with equal mass m. To each particle i∈[K]={1,…,K} we can associate a set of microstates describing its physical properties at a point in time, for example its position xi and its velocity vi. In our discussion we will simply focus on the position, which we will relabel w=(w1,…,wK) (for reasons that will become clear), so our configuration space W⊆R2K of possible microstates is
$$W = \big\{\, w \;\big|\; (w_{i,x}, w_{i,y}) \in \mathbb{R}^2 \ \text{for each}\ i \in [K] \,\big\}.$$
Since it is physically infeasible to know or model the positions of all molecules, we instead reason about the dynamics of the system by calculating macroscopic variables associated to a microstate, for example the temperature or total volume of the molecules. We will focus on the volume V(w) of a microstate w. Importantly, a macroscopic state is an aggregate over the system (for example, temperature being related to average squared velocity), meaning there are many possible configurations of microstates that result in the same macrostate. To this end, we can define regions of our configuration space according to their volume v,
$$W_v = \{\, w \in W \mid V(w) = v \,\} \subseteq W.$$
In our toy example, we want to study how the system changes as a function of temperature, which we will denote with θ. In a Gibbs ensemble, we can associate an energy functional, the Hamiltonian H(w;θ), to any given microstate w at temperature θ. The fundamental postulate of such a Gibbs ensemble is that the probability of the system being in a particular microstate w is determined by a Gibbs distribution [3]
$$p(w;\theta) = \frac{e^{-H(w;\theta)}}{Z} \quad \text{where} \quad Z = \int_W e^{-H(w;\theta)}\, dw.$$
This should look pretty familiar from our statistical learning setup! Indeed, we can then calculate the free energy of the ensemble for different volumes v at temperature θ,
$$F_\theta(v) = -\log\left( \int_{W_v} e^{-H(w;\theta)}\, dw \right).$$
For a Gibbs ensemble, the equilibrium state of a given system is that state which minimises the free energy. In the context of bringing water to a boiling point, there are two minima of the free energy characterised by the liquid and gaseous states, which for ease we will characterise by their volumes vliquid and vgas. Then the equilibrium state changes at the critical temperature θc=100°C,
$$\begin{cases} W_{v_{\mathrm{liquid}}} & 0^\circ\mathrm{C} < \theta < 100^\circ\mathrm{C} \\ W_{v_{\mathrm{gas}}} & \theta > 100^\circ\mathrm{C}. \end{cases}$$
Importantly, while small variations in the temperature away from θc will change the free energy of each state, it will not change the configuration of these minima with respect to the free energy. In other words, the system will still be a liquid for any θ∈(0,100) - its qualitative properties are stable. This is the content of a phase.
What is a phase?
A phase of a system is a region of configuration space $\mathcal{W} \subset W$ that minimises the free energy, and is invariant to small perturbations in a relevant hyperparameter θ. Typically, phases are distinguished by some macroscopic variable, in our case the volume V(w) distinguishing subsets Wv. More generally though, a phase describes some qualitative aggregate state of a system - like, as we've discussed in our example, the states of matter.
In some sense, you can define a phase to be any region that induces an equilibrium state with qualities you care about. But what makes phases a powerful concept is their relation to phase transitions - when there is a sudden jump in which state is preferred by the system.
What is a phase transition?
Phase transitions are changes in the structure of the global minima of the free energy, and often arise as non-analyticities of Fn. This is a fancy way of saying they correspond to discontinuities in the free energy or one of its derivatives [4].
A first order phase transition at a critical temperature θc corresponds to a reconfiguration of which phase is the global minimum of the free energy.
As we discussed above, heating water to boiling point θc=100°C is a classic example of a first order phase transition.
Two examples of second order phase transitions are where:
A merge transition occurs at θc when two phases that are initially disjoint for θ<θc merge to become the same state for θ≥θc, or;
A creation transition occurs at θc when a local minimum exists for θ≥θc but does not exist for θ<θc. (If the directions are reversed, we call this a destruction transition).
(Note that we have not given a full classification of phase transitions here, because to do so one needs to study the possible types of catastrophes that can occur, as presented in [Gilmore]).
Phases in Statistical Learning
The notation and concepts in the previous section were not presented without reason. For starters, the Gibbs ensemble view of statistical learning is actually quite a rich analogy because, when the prior is uniform, the (random) Hamiltonian is equal to the empirical KL divergence [5],
$$H_n(w) = nK_n(w).$$
The configuration space of microstates of the physical system then corresponds to parameter space W with microstates given by different parameters w∈W. This means the posterior is equivalent to the Gibbs probability distribution of the system being in a certain microstate, meaning the definition of free energy is identical. So, what exactly are the phases then?
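To spell the correspondence out (a quick check using the definitions from the earlier posts; see footnote [5] for the tempered version), the posterior can be written as a Gibbs distribution,
$$p(w \mid D_n) = \frac{\varphi(w)\, e^{-nL_n(w)}}{Z_n} = \frac{e^{-H_n(w)}}{Z_n}, \qquad H_n(w) = nL_n(w) - \log \varphi(w),$$
and when $\varphi$ is uniform the terms $-\log\varphi(w)$ and $nS_n$ are constant in $w$, so up to an additive constant that cancels in the normalisation we have $H_n(w) = nK_n(w)$.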
In statistical learning then,
A phase corresponds to a local neighbourhood $\mathcal{W} \subset W$ containing a singularity $w^{(0)}_{\mathcal{W}}$ of interest.
To say that $\mathcal{W}$ minimises the free energy is equivalent to saying that it has non-negligible posterior mass. The reason for this, as we explored in DSLT2, is that the singularity structure of a most singular optimal point $w^{(0)}_{\mathcal{W}} \in W_{\mathrm{opt}}$ dominates the behaviour of the free energy, because it minimises the loss $L(w)$ and has the smallest RLCT $\lambda_{\mathcal{W}}$.
You can, in principle, define a phase to be any region of W. But the analysis of phases in the posterior only gets interesting when you have a set of phases that have fundamentally different geometric properties. The free energy formula tells us that these geometric properties correspond to different accuracy-complexity tradeoffs.
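As a reminder, the free energy of a phase $\mathcal{W}$ has the asymptotic form established in DSLT2 (up to lower order terms and constants in $n$),
$$F_n(\mathcal{W}) \approx nL_n\big(w^{(0)}_{\mathcal{W}}\big) + \lambda_{\mathcal{W}} \log n,$$
where the first term measures the accuracy of the phase and the $\log n$ term penalises its complexity via the RLCT.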
Consequently, in statistical learning, Watanabe states in [Wat18, §9.4] that
A phase transition is a drastic change in the geometry of the posterior as a function of a hyperparameter θ.
Our definitions of first and second order phase transitions carry over perfectly from the physics discussion above.
It's important to clarify here that phase transitions in deep learning have many flavours. If one believes that SGD is effectively just "sampling from the posterior", then the conception that phase transitions are related to changes in the geometry of the posterior carries over. There is, however, one fundamentally different kind of "phase transition" that we cannot explain easily with SLT: a phase transition of SGD in time, i.e. the number of gradient descent steps. The Bayesian framework of SLT does not really allow one to speak of time - the closest quantity is the number of datapoints n, but these are not equivalent. We leave this gap as one of the fundamental open questions of relating SLT to current deep learning practice. [6]
The hyperparameter θ can affect any number of objects involved in the posterior. Remembering that the posterior is
$$p(w \mid D_n) = \frac{\varphi(w)\, e^{-nL_n(w)}}{Z_n},$$
we could include hyperparameter θ dependence in any of:
The model function f(x,w)=f(x,w;θ) (i.e. the neural network defining p(y|x,w)).
The true distribution Dn=Dn(θ), meaning Ln(w)=Ln(w;θ). (This could in principle be dependence on the input prior q(x) or the actual dataset generated by q(y|x).)
The number of datapoints n (inducing a first order phase transition due to the change in accuracy-complexity tradeoff).
The prior φ(w)=φ(w;θ).
Intuitive Examples to Interpret Phase Transitions
In DSLT2 we studied an example of a very simple one-dimensional K(w) curve and got a feel for how the accuracy and complexity of a singularity affect the free energy of different neighbourhoods. Having now learned about phase transitions, we can cast new light on this example.
Example 1: First Order Phase Transition in n
Example 4.1: Consider again a KL divergence given by
$$K(w) = (w+1)^2 \big( (w - (1 + h_C))^4 - k_C \big)$$
where $w^{(0)}_{-1} = -1$ and $w^{(0)}_1 = 1$ are the singularities, but the accuracy of $w^{(0)}_1$ is worse, $K(w^{(0)}_1) = C > 0$. Then we can identify two phases corresponding to the two singularities,
$$W_{-1} = B\big(w^{(0)}_{-1}, \delta\big) \quad \text{and} \quad W_1 = B\big(w^{(0)}_1, \delta\big)$$
for some radius δ>0 such that the accuracy of W−1 is better, but the complexity of W1 is smaller,
$$L\big(w^{(0)}_{-1}\big) < L\big(w^{(0)}_1\big), \quad \text{but} \quad \lambda_{W_{-1}} > \lambda_{W_1}.$$
As the hyperparameter θ=n [7] varies, we see a first order phase transition at the critical value nc≈17 where the two free energy curves intersect, causing an exchange of which phase is the global minimum of the free energy. As we argued in that post, this is largely due to the accuracy-complexity tradeoff of the free energy. Notice also how the free energy of the global minimum is non-differentiable at nc, showing an example of the "non-analyticity" of Fn that we mentioned above.
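Schematically (a heuristic using the free energy asymptotics above, ignoring constants and lower order terms), the two local free energies behave like
$$F_n(W_{-1}) \approx nL\big(w^{(0)}_{-1}\big) + \lambda_{W_{-1}} \log n, \qquad F_n(W_1) \approx n\big(L(w^{(0)}_{-1}) + C\big) + \lambda_{W_1} \log n.$$
The accuracy gap contributes $nC$ in favour of $W_{-1}$, while the complexity gap contributes $(\lambda_{W_{-1}} - \lambda_{W_1})\log n$ in favour of $W_1$: for $n$ large enough the linear term dominates and the accurate phase $W_{-1}$ wins, while at smaller $n$ the complexity term (together with the neglected constants) can favour the simpler phase $W_1$, with the curves crossing at the critical value $n_c$.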
Example 2: Second Order Merge Phase Transition
Example 4.2: We can modify our example slightly to observe a second order phase transition. Let's consider
$$K(w;\theta) = \big(w + (1-\theta)\big)^2 \big(w - (1-\theta)\big)^4$$
where θ∈[0,1] is a hyperparameter that shifts the two singularities $w^{(0)}_{-1} = -1 + \theta$ and $w^{(0)}_1 = 1 - \theta$ towards the origin. We will continue to label these phases W−1 and W1, noting their θ dependence. [8]
Thus, at θc=1 the two phases will merge and the KL divergence will be
$$K(w;1) = w^6.$$
Therefore, at θ=1 the singularity w(0)0=0 will have an RLCT of
$$\lambda_0 = \tfrac{1}{6}.$$
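For a one-dimensional singularity like this, the RLCT can be read off directly from the zeta function definition of the RLCT (taking a prior that is positive at the origin):
$$\zeta(z) = \int K(w;1)^z \varphi(w)\, dw = \int |w|^{6z} \varphi(w)\, dw$$
has its largest pole at $z = -\tfrac{1}{6}$, giving $\lambda_0 = \tfrac{1}{6}$. By the same reasoning, before the merge the factors $(w+(1-\theta))^2$ and $(w-(1-\theta))^4$ give the phases $W_{-1}$ and $W_1$ local RLCTs of $\tfrac{1}{2}$ and $\tfrac{1}{4}$ respectively, both larger than $\tfrac{1}{6}$.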
There is a new most singular point caused by the merging of two phases! Again, we can visually depict this phase transition:
Now that we have the basic intuitions of SLT and phase transitions down pat, let's apply these concepts to the case of two layer feedforward ReLU neural networks.
Phase Transitions in Two Layer ReLU Neural Networks
The main claim of this sequence is that Singular Learning Theory is a solid theoretical framework for understanding phases and phase transitions in neural networks. It's now time to make good on that promise and bring all of the pieces together to understand an actual example of phase transitions in neural networks. The full details of these experiments are explained in my thesis, [Carroll, §5.2], but I will briefly outline some points here for the interested reader. All notation and terminology is explained in detail in DSLT3, so use that section as a reference.
If you are uninterested, just skip to the next subsection to see the results.
Experimental Setup
We will consider a (model, truth) pair defined by the simple two layer feedforward ReLU neural network models we studied in DSLT3. Phase transitions will be induced by varying the true distribution with a hyperparameter θ, meaning Dn=Dn(θ). Since we have a full classification of W0 from DSLT3, we understand the phases of the system, and therefore we want to study how their differing geometries affect the posterior. As we explained in that post, the scaling and permutation symmetries are generic (they occur for all parameters w∈W), but the node-degeneracy and orientation-reversing symmetries only occur under precise configurations of the truth. Thus, we are interested in studying how the posterior changes as we vary the truth to induce these alternative true parameters - the phases of our setup.
The posterior sampling procedure uses an MCMC variant called HMC NUTS, which is brilliantly explained and interpreted here. Estimating precise nominal free energy values, and particularly those of the RLCT λ, using sampling methods is currently very challenging (as explained in [Wei22]). So, for these experiments, our inference about phases and phase transitions will be based on visualising the posterior and observing the posterior concentrations of different phases. With this in mind, the posteriors below are averaged over four trials, 20,000 samples each, for each fixed true distribution defined by θ. (Bayesian sampling is very computationally expensive, even in simple settings).
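For concreteness, the sampling setup looks roughly like the following NumPyro sketch. This is an illustrative reconstruction rather than the exact code behind the experiments in [Carroll]: the function names, the observation noise scale, and the warmup settings are my own assumptions, and the original experiments may have used a different NUTS implementation.

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def two_layer_relu(x, y=None, d=2):
    # First layer weights/biases and outgoing weights, all with N(0,1) priors.
    w = numpyro.sample("w", dist.Normal(0.0, 1.0).expand([d, 2]).to_event(2))
    b = numpyro.sample("b", dist.Normal(0.0, 1.0).expand([d]).to_event(1))
    q = numpyro.sample("q", dist.Normal(0.0, 1.0).expand([d]).to_event(1))
    c = numpyro.sample("c", dist.Normal(0.0, 1.0))
    f = jnp.maximum(x @ w.T + b, 0.0) @ q + c  # sum_i q_i ReLU(<w_i, x> + b_i) + c
    numpyro.sample("obs", dist.Normal(f, 0.1), obs=y)  # Gaussian regression likelihood

def run_chain(x, y, seed=0, num_samples=20_000):
    # x: inputs drawn uniformly from [-1, 1]^2, y: noisy outputs of the true network.
    mcmc = MCMC(NUTS(two_layer_relu), num_warmup=2_000, num_samples=num_samples)
    mcmc.run(random.PRNGKey(seed), x, y)
    samples = mcmc.get_samples()
    # Quotient out the scaling symmetry: effective weights w_hat_i = |q_i| w_i.
    w_hat = jnp.abs(samples["q"])[..., None] * samples["w"]
    return samples, w_hat
```

A run like this would be repeated for several seeds (the four trials mentioned above) and for each value of θ, pooling samples after the outlier validation mentioned in footnote [10]; d=2 for the first experiment below and d=3 for the second.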
To isolate the phases we care about, we can use the fact that the scaling symmetry and permutation symmetries of our networks are generic. To this end we will normalise the weights by defining the effective weight $\hat{w}_i = |q_i| w_i$ [9], which preserves functional equivalence $f(x,w) = f(x,\hat{w})$ [10]. We will say a node is degenerate if $\hat{w}_i = 0$. We also project different node indices onto the same $(\hat{w}_{i,1}, \hat{w}_{i,2})$ axes as follows:
The prior on inputs q(x) is uniform on the square [−1,1]2, and the prior on parameters φ(w) is the standard multidimensional normal N(0,1).
Phase Transition 1 - Deforming to Degeneracy
In this experiment we will see a first order phase transition induced by deforming a true network from having no degenerate nodes to having (the possibility of) one degenerate node, as discussed in DSLT3 - Node Degeneracy. This example will reinforce the key messages of Watanabe's free energy formula: true parameters are preferred according to their RLCT, and at finite n non-true parameters can be preferred due to the accuracy-complexity tradeoff.
Defining the Model, Truth, and Phases
We are going to consider a model network with d=2 nodes,
$$f(x,w) = \mathrm{ReLU}\big(\langle \hat{w}_1, x \rangle + \hat{b}_1\big) + \mathrm{ReLU}\big(\langle \hat{w}_2, x \rangle + \hat{b}_2\big) + c$$
and a realisable true network f(x,w(0)) with m=2 nodes, which we will denote by f2(x,θ):=f(x,w(0)) to signify its hyperparameter θ dependence (and distinguish it from the next experiment),
$$f_2(x,\theta) = \mathrm{ReLU}\big(\langle \hat{w}^{(0)}_1, x \rangle - \tfrac{1}{3}\big) + \mathrm{ReLU}\big(\langle \hat{w}^{(0)}_2, x \rangle - \tfrac{1}{3}\big).$$
The true weights rotate towards one another according to a hyperparameter θ∈[0,π/2], so [11]
$$w^{(0)}_1 = (\cos\theta, \sin\theta), \qquad w^{(0)}_2 = (-\cos\theta, \sin\theta).$$
As we explained in DSLT3, we can depict the function and its activation boundaries pictorially:
At θ=π/2, the truth could be expressed by a network with only one node, m=1,
$$f_2\big(x, \tfrac{\pi}{2}\big) = \mathrm{ReLU}\big(x_2 - \tfrac{1}{3}\big) + \mathrm{ReLU}\big(x_2 - \tfrac{1}{3}\big) = 2\,\mathrm{ReLU}\big(x_2 - \tfrac{1}{3}\big).$$
This degeneracy is what we are interested in studying. The WBIC tells us to expect the posterior to prefer the one-degenerate-node configuration since it has fewer effective parameters. [12]
To identify our phases, at θ=π/2 there are two possible configurations of the effective model weights that are true parameters:
Both non-degenerate but sharing the same activation boundary: both $\hat{w}_1, \hat{w}_2 \neq 0$ such that $\hat{w}_1 + \hat{w}_2 = (0,2)$.
One degenerate, one non-degenerate: either $\hat{w}_1 = (0,0)$ and $\hat{w}_2 = (0,2)$, or vice versa by permutation symmetry.
To study these configurations we thus define phases based on annuli $A(r,\varepsilon)$ in the plane, centred on the circle of radius $r$ with annular radius $\varepsilon$,
$$A(r, \varepsilon) = \big\{ (\hat{w}_{\cdot,1}, \hat{w}_{\cdot,2}) \in \mathbb{R}^2 \;\big|\; r - \varepsilon \le \|(\hat{w}_{\cdot,1}, \hat{w}_{\cdot,2})\| \le r + \varepsilon \big\}.$$
Then we define the two phases containing the singularities of interest to be
$$A_{\mathrm{NonDegen}} = A(1,\varepsilon) \times A(1,\varepsilon), \qquad A_{\mathrm{Degen}} = \big(A(0,\varepsilon) \times A(2,\varepsilon)\big) \cup \big(A(2,\varepsilon) \times A(0,\varepsilon)\big).$$
The union is due to the permutation symmetry - which precise node is degenerate doesn't matter. We will let $A^c = (\mathbb{R}^2 \times \mathbb{R}^2) \setminus (A_{\mathrm{NonDegen}} \cup A_{\mathrm{Degen}})$.
There are two questions we seek to answer:
At θ=π/2, which phase does the posterior prefer?
Is there a phase transition at some critical value θc∈(0,π/2)?
Results
There is also a static facet grid of the key frames if you want a closer inspection.
The results of our experiments show:
At θ=π/2, the degenerate phase ADegen is preferred.
There is a first order phase transition at θc≈1.26 where ADegen becomes preferred, despite not containing a true parameter for θ<π/2.
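For reference, the posterior concentration in each phase can be estimated by simply classifying each effective-weight sample into $A_{\mathrm{Degen}}$, $A_{\mathrm{NonDegen}}$ or $A^c$ and counting. A minimal sketch, continuing from the sampling code above (the annulus radius `eps` is a placeholder value of my own choosing):

```python
import jax.numpy as jnp

def in_annulus(v, r, eps):
    # Is the 2D point v inside the annulus A(r, eps)?
    norm = jnp.linalg.norm(v, axis=-1)
    return (norm >= r - eps) & (norm <= r + eps)

def phase_masses(w_hat, eps=0.2):
    # w_hat has shape (num_samples, 2, 2): effective weights of the two nodes.
    w1, w2 = w_hat[:, 0, :], w_hat[:, 1, :]
    non_degen = in_annulus(w1, 1.0, eps) & in_annulus(w2, 1.0, eps)
    degen = (in_annulus(w1, 0.0, eps) & in_annulus(w2, 2.0, eps)) | (
        in_annulus(w1, 2.0, eps) & in_annulus(w2, 0.0, eps))
    return {
        "NonDegen": float(jnp.mean(non_degen)),
        "Degen": float(jnp.mean(degen)),
        "Ac": float(jnp.mean(~(non_degen | degen))),
    }
```

Tracking these three masses as θ varies is what the phase-transition plots described here amount to.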
It is unsurprising (yet satisfying) that the degenerate phase ADegen is preferred at θ=π/2, in line with what the WBIC tells us to expect. What might be more surprising, though, is that ANonDegen has extremely little posterior density at this θ value. [13]
As we have argued throughout the sequence, the free energy formula suggests that first order phase transitions happen when there is a change in the accuracy-complexity tradeoff such that the posterior newly prefers one phase over the other. Here, the first order phase transition at θc≈1.26 can be understood in these terms with the following graph, which depicts how the accuracy of ADegen improves with θ.
A Complexity Measure for Non-Analytic ReLU Networks
One last thing to point out here is that since K(w) is not analytic for ReLU neural networks, the RLCT is not a well defined object. Nonetheless, Watanabe has recently proven in this paper that there is a bound on the free energy,
$$F_n \le nS_n + \lambda_{\mathrm{ReLU}} \log n$$
where the complexity $\lambda_{\mathrm{ReLU}} \in \mathbb{Q}_{>0}$ is measured by the number of parameters in the smallest compressed network possible to represent the function, as a kind of 'pseudo'-RLCT. In our case the complexity is
$$2\lambda_{\mathrm{ReLU}} = \begin{cases} 9 & \text{for } 0 < \theta < \frac{\pi}{2} \\ 5 & \text{for } \theta = \frac{\pi}{2} \end{cases}$$
since there are five parameters required in the degenerate phase and nine in the non-degenerate phase [14]. In this way, Watanabe's work predicts the results we see. This also shows us how the theory of SLT may be generalisable to the non-analytic setting and still give approximately the same essential insights into singular models.
Phase Transition 2 - Orientation Reversing Symmetry
Defining the Model, Truth, and Phases
This time we are going to consider a model network with d=3 nodes,
$$f(x,w) = c + \sum_{i=1}^{3} \mathrm{ReLU}\big(\langle \hat{w}_i, x \rangle + \hat{b}_i\big),$$
and a realisable true network f3(x,ϑ) with m=3 nodes,
$$f_3(x,\vartheta) = \sum_{i=1}^{3} \mathrm{ReLU}\big(\langle \hat{w}^{(0)}_i, x \rangle - \tfrac{1}{3}\big),$$
where the weights are defined by an order parameter ϑ∈[1,3] that scales one gradient,
$$w^{(0)}_1 = \big(\cos\tfrac{\pi}{3}, \sin\tfrac{\pi}{3}\big), \qquad w^{(0)}_2(\vartheta) = \vartheta\,(\cos\pi, \sin\pi), \qquad w^{(0)}_3 = \big(\cos\tfrac{5\pi}{3}, \sin\tfrac{5\pi}{3}\big).$$
At ϑ=1, the weights satisfy the weight annihilation property,
$$w^{(0)}_1 + w^{(0)}_2 + w^{(0)}_3 = (0,0),$$
meaning that reversing the orientation of the weights, $w^{(0)}_i \mapsto -w^{(0)}_i$ (which is equal to a rotation by π), will preserve the function, as discussed in DSLT3 - Orientation Reversal. We will use the label weight annihilation phase to refer to the configuration of nodes such that the weights all point into the centre region and annihilate one another. [15] Our key question thus becomes: does the posterior prefer the weight annihilation phase, or the non-weight annihilation phase, at ϑ=1?
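As a quick arithmetic check of the annihilation property at ϑ=1, using $\cos\tfrac{\pi}{3} = \tfrac{1}{2}$ and $\sin\tfrac{\pi}{3} = \tfrac{\sqrt{3}}{2}$:
$$w^{(0)}_1 + w^{(0)}_2 + w^{(0)}_3 = \Big(\tfrac{1}{2} - 1 + \tfrac{1}{2},\; \tfrac{\sqrt{3}}{2} + 0 - \tfrac{\sqrt{3}}{2}\Big) = (0,0).$$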
To depict the phases on the $(w_{i,1}, w_{i,2})$ plane, let $R(\theta) = (\cos\theta, \sin\theta)$, let $B(x,\varepsilon)$ be the closed ball of radius ε centred at $x \in \mathbb{R}^2$, and let $S_3$ denote the group of permutations of three elements. Then the two phases of interest are
$$E_{\mathrm{NonWA}} = \bigcup_{\sigma \in S_3} \prod_{k=0}^{2} B\Big(R\big(\tfrac{\pi}{3} + \tfrac{2\sigma(k)\pi}{3}\big), \varepsilon\Big), \qquad E_{\mathrm{WA}} = \bigcup_{\sigma \in S_3} \prod_{k=0}^{2} B\Big(R\big(\tfrac{2\sigma(k)\pi}{3}\big), \varepsilon\Big).$$
Since $w^{(0)}_2$ is being scaled by ϑ, we will understand the centre of each ball corresponding to σ(k)=1 in $E_{\mathrm{NonWA}}$ as being multiplied by the scalar ϑ. (It is easier to state this in words than to write it down in gory notation.)
In this experiment our two questions are:
At ϑ=1, which phase is preferred?
Is there a first or second order phase transition at some ϑc∈(1,3]?
Results
The results of this experiment show that:
At ϑ=1 the non-weight annihilation phase ENonWA is preferred by the posterior.
The weight annihilation phase EWA is never preferred by the posterior, thus there is no first order phase transition. But there is a second order phase transition at ϑc≈2 where EWA is destroyed.
In [Carroll, §5.4.3], I perform a calculation on an even simpler orientation-reversing example which shows that the relative error of the inner cancellation region strongly dictates the preference of the two phases. This relative error can be made smaller by increasing the size of the prior q(x). That result suggests that the two phases may have the same RLCT, but differing lower order geometry. This is speculative though, and it would be interesting to better understand the RLCT of both phases.
The second order phase transition is unsurprising since we specifically deform the network so that EWA doesn't contain a true parameter for ϑ∈(1,3]. At ϑc, its inaccuracy is too highly penalised and the posterior contains no samples from the region.
References
[Callen] - H. Callen, Thermodynamics and an Introduction to Thermostatistics, 1991
[Gilmore] - R. Gilmore, Catastrophe Theory for Scientists and Engineers, 1981
[Wat18] - S. Watanabe, Mathematical Theory of Bayesian Statistics, 2018
[Carroll] - L. Carroll, Phase Transitions in Neural Networks, 2021
[Wei22] - S. Wei, D. Murfet, et al., Deep learning is singular, and that’s good, 2022
[1] At constant atmospheric pressure, that is.
[2] Yes, in any physics or chemistry textbook you will see the Gibbs free energy denoted by G. I am writing F to keep it consistent with our later statistical learning discussion.
[3] At this point, this is a slight abuse of the physics notions. Typically the probability distribution is proportional to e−βH(w) where β is the inverse temperature. In this case we are going to absorb the β into the H(w;θ) term and not get too caught up in the actual physics - we're just painting a conceptual picture to apply later on.
[4] Which often correspond to the moments (mean, variance, etc.) of quantities like H(w).
[5] More precisely, considering the tempered posterior at inverse temperature β>0, the Hamiltonian has the form
$$H_n(w) = n\beta L_n(w) - \log \varphi(w).$$
(Since Kn(w)=Ln(w)−Sn, the constant Sn in w is irrelevant.)
[6] Note here that a phase transition of a dynamical system (i.e. SGD, which we can imagine as a particle moving subject to a potential well) is a slightly more subtle concept. One imagines the loss landscape to be fixed, and the "phase transition" corresponding to the particle moving from one particular phase in W to another. In this sense, there isn't exactly a phase transition in the general sense, but there is a change in which phase a system finds itself in.
[7] Which alters the posterior geometry, but not that of K(w), since p(w|Dn)≈e−nK(w) (up to a normalisation factor).
[8] It is a little bit disingenuous to continue to call these phases when δ is very close to 1, as the singularity w(0)1 has a non-negligible effect on W−1, and vice-versa, meaning the phases lose their individual identities. Alternatively, one defines W0 to centre on w(0)0=0, and observes how the free energy changes with δ. But, I have kept the two "phases" W−1 and W1 in the animation below to illustrate the general idea with minimum fuss.
[9] You might wonder why we still endow the model with the qi parameters in the first place if we just normalise them out after the fact. We assumed it was more important to let the sampling procedure take place on an earnest neural network model without restricting its parameter space, thus trying to keep it in line with neural networks actually used in practice. But, it is likely that these results would hold otherwise, too.
[10] The astute observer will notice that this is a white lie - the functional equivalence is true as long as each qi≥0. However, in our experiments, the true outgoing weights are qi=1, meaning a good sample will only ever have positive weights, i.e. any sample with a negative q(k)i will be removed by the outlier validation.
[11] Explicitly, the truth is defined by
$$f_2(x,\theta) = \mathrm{ReLU}\big(\cos(\theta)x_1 + \sin(\theta)x_2 - \tfrac{1}{3}\big) + \mathrm{ReLU}\big(-\cos(\theta)x_1 + \sin(\theta)x_2 - \tfrac{1}{3}\big).$$
[12] Relatedly, the plot of the KL divergence in Example 3.3 tells us to expect that the degenerate phase may be preferred.
[13] It is worth briefly mentioning the effect of the prior here. The free energy formula tells us that as n→∞, the effects of the prior on learning become negligible. But of course, we are only ever in the finite n regime, at which point the prior does have effects on the posterior. In our case, since the prior is a Gaussian centred at w=(0,0) with standard deviation 1, it is reasonable to say that it has some bearing on the degenerate phase being preferred. However, further experiments showed that this behaviour is still retained for a flatter prior with increased standard deviation. The problem, however, is that the Markov chains can become very unstable on these priors, producing posterior samples with very high loss, indicating that the chains aren't converging to the correct long-term distribution. In the interest of time, I decided not to continue to fine-tune the experiments on non-converging chains for a flatter prior, but it would be interesting to see to what extent the prior does affect these results.
[14] In other words, the degenerate phase requires a truth with five parameters,
$$q^{(0)}_1 \mathrm{ReLU}\big(w^{(0)}_{1,1} x_1 + w^{(0)}_{1,2} x_2 + b^{(0)}_1\big) + c,$$
whereas the non-degenerate phase requires nine,
$$q^{(0)}_1 \mathrm{ReLU}\big(w^{(0)}_{1,1} x_1 + w^{(0)}_{1,2} x_2 + b^{(0)}_1\big) + q^{(0)}_2 \mathrm{ReLU}\big(w^{(0)}_{2,1} x_1 + w^{(0)}_{2,2} x_2 + b^{(0)}_2\big) + c.$$
[15] @Leon Lang correctly pointed out that this is slightly weird terminology to use. Instead these should really be referred to as weight-cancellation instead of weight-annihilation, since both initial configurations obey the weight-annihilation property as I defined it, whereas what I am really referring to is the fact that in one configuration all weights are active and cancel in a region. It's too late to change the terminology throughout, but do keep this in mind.