TLDR; This is the third main post of Distilling Singular Learning Theory which is introduced in DSLT0. I explain that neural networks are singular models because of the symmetries in parameter space that produce the same function, and introduce a toy two layer ReLU neural network setup where these symmetries can be perfectly classified. I provide motivating examples of each kind of symmetry, with particular emphasis on the non-generic node-degeneracy and orientation-reversing symmetries that give rise to interesting phases to be studied in DSLT4

As we discussed in DSLT2, singular models have the capacity to generalise well because the effective dimension of a singular model, as measured by the RLCT, can be less than half the dimension of parameter space. With this in mind, it should be no surprise that neural networks are indeed singular models, but up until this point we have not exactly explained what feature they possess that makes them singular. In this post, we will explain that in essence:

Neural networks are singular because there are often ways to vary their parameters without changing the function they compute.  

In the case where the model and truth are both defined by similar neural network architectures, this fact means that the set of true parameters  is non-trivial (i.e. bigger than the regular case where it is a single point), and often possesses many symmetries. This directly implies that neural networks are singular models. 

The primary purpose of this post is to show with examples why neural networks are singular, and classify the set of true parameters  in the case where the model and truth are simple two layer feedforward ReLU networks. In doing so, we will lay the groundwork for understanding the phases present in the setup so that we can then study relevant phase transitions in DSLT4. Feel free to jump ahead to the slightly more exciting DSLT4 Phase Transitions in Neural Networks and refer back to this post as needed. 

Outline of Classification

To understand the different regions that minimise the free energy (and thus, as we'll see in DSLT4, the phases), one needs to first understand the singularities in the set of optimal parameters of .

In the realisable regression case with a model neural network  and true neural network defined by  for some , the set of true parameters has the form [1]

Thus, classifying the true parameters is a matter of establishing which parameters  yield functional equivalence between the model and the truth . The property of being singular is specific to a model class , regardless of the underlying truth. But, classifying  in the realisable case is a convenient way of studying what functionally equivalent symmetries exist for a particular model class. 

Neural networks have been shown to satisfy a number of different symmetries of functional equivalence across a range of activation functions and architectures, which we will elaborate on throughout the post. Unsurprisingly, the nonlinearity of the activation function plays a central role in governing these symmetries. In general, then, deep neural networks are highly singular.

In this post we are going to explore a full characterisation of the symmetries of  when the model is a two layer feedforward ReLU neural networks with  hidden nodes, and the truth is the same architecture but with  nodes. Though you would never use such a basic model in real deep learning, the simplicity of this class of network allows us to study  with full precision. We will see that:

  • If the model and truth have the same number of nodes, : There are three forms of symmetry of :
    • Scaling symmetry of the incoming and outgoing weights to any node.
    • Permutation symmetry of the hidden nodes in a layer.
    • Orientation reversing symmetry of the weights, only when some subset of weights sum to zero (i.e. "annihilate" one another).
  • If the model has more nodes than the truth, : Without loss of generality, the first  nodes of the model must have the same symmetries as in the first case. Then each excess node  is either
    • Degenerate, meaning its total weight (gradient) is 0 (thus the node is always constant).
    • Or it has the same activation boundary as another already in the model such that the weights sum to the total gradient in a region [2].

In [Carroll, Chapter 4], I give rigorous proofs that in both cases,  is classified by these symmetries, and these symmetries alone. The purpose of this post is not to repeat these proofs, but to provide the intuition for each of these symmetries. I have included a sketch of the full proof in the appendix of this post if you are more mathematically inclined.

Two layer Feedforward ReLU Neural Networks

Literature abounds on what neural networks are, so I will merely give the definition of the class we are going to study here and some related terminology for the discussion.

Defining the Networks and Terminology

Let  be a compact parameter space. We will let  denote the set of hidden nodes in the first layer of our network, and  denote the standard dot product between two vectors. Also recall that 

We let  denote a two layer feedforward ReLU neural network with two inputs  and one output , defined by a parameter . The function is given by 

 where for each :

  • the first layer weights are  and the biases are 
  • the second layer weights are  and the bias is .

These functions are simply piecewise affine functions (i.e. piecewise hyperplanes), and as such they have (relatively) easy topology to study. Before we give an example, we will briefly mention some key terminology.

Let  be defined by a fixed . We say a particular node  is degenerate in  if either of the weights are zero, so  or [3]

We say a non-degenerate node  is activated in some linear domain [4]  when the ReLU is non-zero for all  , that is, 

The activation boundary associated to node  is thus the line 

One of the key accounting tools in the symmetry classification is identifying the foldsets of  (in the terminology of [PL19]), which are the regions where  is non-differentiable in , and noticing that these equate to the union of non-degenerate activation boundaries . Two functionally equivalent networks must then have the same foldsets since they define the same function, allowing us to compare the lines defined by .

Example - Feedforward ReLU Neural Networks are Piecewise Hyperplanes

Example 3.1: Consider the following two layer feedforward ReLU neural network: 

 defined by biases  and , second layer weights , and first layer weights 

Its graphical structure and activation boundaries in the  plane can be seen below: 

The data of  above.

Conceptually, it's helpful to notice that when anchored on its corresponding activation boundary, each weight vector  "points" into its region of activation.

The Symmetries of Two Layer Feedforward ReLU Neural Networks

In this section I am going to provide some motivating examples of each kind of symmetry exhibited in two layer feedforward ReLU neural networks. To prove that this is the full set of symmetries in generality requires a bit more work, which we relegate to the appendix.

Scaling Inner and Outer Weights of a Node

The scaling symmetry of ReLU networks offers us our first window into why these models are singular. The key property is to notice that for any , the ReLU satisfies a scale invariance [5]

 Say we had the simplest model possible with just one node: 

 Then we could define an alternative parameter  with 

 which gives functional equivalence because, 

For a model with  hidden nodes, the same scaling symmetry applies to each individual node  with a set of scaling factors .

The fact that we can define such a  for any set of positive scalars means that the Fisher information matrix of these models is degenerate at all points . We prove this in generality in Appendix 1, but I'll spell it out explicitly for a simple example here. 

Example - Scaling Symmetry Induces a Degenerate Fisher Information Matrix

Example 3.2: It is worth taking a moment to recognise how this scaling symmetry affects the geometry of the loss landscape . The mental model to have here is that it results in valleys in , where the set of true parameters  is like a river on the valley floor. To see this, say we defined a model with parameter  and truth as: 

where  is some fixed constant. If  is uniform on  then it is easy to calculate that when  we have 

 We can depict this valley and its effect on the posterior for :

K(w) is a valley
Setting , we see that  is a valley due to the scaling symmetry (left), thus there is no unique maximum a posterior (right). Remember that, up to a scaling factor,  is the posterior when the prior  is uniform, and  for large  since .

Looking at this , it's easy to intuit that the Fisher information matrix  is degenerate for all . But, for clarity, let me spell this out for the true parameters in the case where , so .

Remember that at true parameters the Fisher information matrix is just the Hessian, which in this case has the form

In particular, let  be a fixed true parameter parameterised by a fixed , so . Then the Fisher information matrix has the form

Setting  and  to be the rows of the matrix, there is clearly a linear dependence relation 

and since  is arbitrary, this shows that all true parameters have degenerate Fisher information matrices and are thus singular.

Permutation of Nodes

This one is easy to see. If we have a model with  nodes, 

 and we define a new model  where  is a permutation of the nodes in 

then 

This easily generalises to  hidden nodes by taking any permutation  in the permutation group  and letting each node  of  satisfy , so 

Permutation symmetry
Permuting nodes induces functional equivalence, here depicted for .

Orientation Reversal

This one is a bit trickier to observe as the symmetry depends on a very specific condition of weight annihilation. Let's look at a simple example first.

Motivating Example

Example 3.3: Consider a true distribution defined by a (one-input) feedforward ReLU given by 

where , and the activation boundaries are  and .

Surprisingly, though it may appear our linear regions and activation boundaries must uniquely define the function (up to the scaling and permutation symmetries), there is a particular symmetry that arises by reversing the orientation of the weights and first layer biases, and adjusting the total bias accordingly. When we say reverse the orientation, we mean negating their direction, 

and ditto for the biases. If we adjust the total bias  accordingly, then following function 

gives the same functional output!

Weight annihilation in 1D
Reversing the orientation of the true weights preserves this function because the true weights annihilate one another.

There is a very specific reason we can do this: in the middle region , both nodes are active and cancel out to give a constant function, 

because the total gradients of the underlying truth sum to zero, .

General case

Suppose the true network  is defined by a fixed  for  nodes. If there is a set  of total gradients that sum to 0, 

then the model can produce functional equivalence by reversing the orientation of those particular weights (associated to those activation boundaries), biases, and adjusting the total bias. In other words, modulo permutation and scaling symmetry, there is a functionally equivalent network to  where the weights of every  satisfy 

We call the condition  weight annihilation.

In [Carroll214.5] we define -symmetric networks where the weights are progressive rotations by the angle , thus their total sum is zero. In DSLT4, we will study whether the posterior prefers configurations of weight-annihilation or not. (The answer is: not). [6]

An -symmetric network for  with  and . Both configurations, non-weight-cancellation (left) and weight-cancellation (right), are functionally equivalent since . Here, weight cancellation refers to the configuration where all three nodes are active in the central linear domain, but cancel to give an effective gradient of zero there.

Node Degeneracy

This is possibly the most important symmetry of all: neural network models can have more nodes than they need to represent a particular function. In essence, this degeneracy is the reason that different regions of the loss-landscape  of neural networks have fundamentally different accuracy-complexity tradeoffs. In other words, if the model has  nodes in the hidden layer available to it, then all possible subnetwork configurations with less than  nodes are also contained within the loss landscape. Thus, increasing the width of the network can only serve to increase the accuracy of these models, without sacrificing its ability to generalise, since the posterior will just prefer that number of hidden nodes with the best accuracy-complexity tradeoff. 

Motivating Example

Example 3.4: Suppose we had a (one-input) true network given by 

and our model had  nodes (with fixed biases  and outgoing weights ), 

Since  for , both weights must be positive, , to have any hope of being functionally equivalent. If , we are in one of two configurations:

One node is degenerate: Either  or , meaning 

Both nodes are non-degenerate, but the total gradient is the same as the truth: So long as the weights satisfy 

 for , we will have functional equivalence since, setting 

Node-degeneracies Correspond to Different Phases

We could of course encapsulate both of these configurations into the one statement that  for , but there is a key reason we have delineated them: they represent two different phases and have different geometry on . Intuitively, the degenerate phase is a simpler model with less complexity, thus we expect it has a lower RLCT [7], and for the posterior to prefer it. In DSLT4 we will discuss phases in statistical learning more broadly, and display experimental evidence for this latter claim.

To foreshadow this, we can actually calculate  for Example 3.4. Setting the prior  to be uniform on  we find 

K(w) for the simple example
 for the above example with slightly wider bowls at the degenerate-node phases.

Notice how there are ever so slightly wider bowls at either end of the line , thus suggesting the posterior has more density at the degenerate phase (or vice versa). Intuitively, imagine a tiny ball being with random kinetic energy rolling around the bottom of the surface - it will spend more time in the ends since there is more catchment area. (Don't take the physics analogy too seriously, though).

We see once again that the singularity structure of the minima has a big impact on the geometry of , and therefore the posterior. 

General Case

Suppose we have a truth  and model  that are both defined by two layer feedforward ReLU neural networks, where the model has  nodes and the truth has  nodes (assumed to all be non-degenerate and with distinct activation boundaries) such that . Then the model is overparameterised compared to the truth it is trying to model.

Performing the appropriate analysis (which we do in Appendix 2), one finds that:

  1. Without loss of generality (i.e. up to permutation symmetry), the first  nodes of  must have the same activation boundaries as those in , and satisfy the same scaling, permutation and orientation reversing symmetries as discussed above.
  2. For the remaining nodes  of the model, each one either:
    • Is degenerate, so  or , or;
    • Shares the same activation boundary as one already in  such that the total gradients sum to the correct gradient in each region. (In our above example, this is saying that necessarily  since these nodes share the same activation boundary).
Degenerate contour plot
The function  can also be represented by a two node model where both nodes share the same boundary, 

In DSLT4 we will test which of the phases in the above figure is preferred by the posterior for this simple two layer feedforward ReLU neural networks setup. 

Node degeneracy is the same as lossless network compressibility

The fact that neural networks can contain these node-degeneracies is well known and often goes under the guise of lossless network compressibility. There are many notions of compressibility, but the one that makes the most sense in our setup is to say that if the model has  hidden nodes compared to the truth, then it can be compressed to a network with only  hidden nodes and still produce the same input-output map.

For an excellent introduction to lossless network compressibility, see Farrugia-Roberts' recent paper Computational Complexity of Detecting Proximity to Losslessly Compressible Neural Network Parameters, where he studies the problem for networks. 

There are More True Parameters if the Input Domain is Bounded

Let me make an important remark here. In both of the above cases, we have considered the symmetries of  when the input domain of the model and the truth is all of . As we explain in Appendix 2, this allows us to compare the gradients and biases of hyperplanes, similar to comparing polynomial coefficients, to make our conclusions. However, if the domain of the input prior  is restricted to some open bounded domain , there could in principle be more degeneracies and symmetries of , since the functional equivalence only needs to be on .

For example, consider a true network defined by  and a single-node single-input model  defined on , so . If the activation boundary falls outside of  and the vector  points away from , then any value of  satisfying these constraints would give , thus there is an entirely new class of symmetry in .

Whilst important to keep this mind, we won't discuss this any further as it opens up an entirely different can of worms. 

Even if Singularities Occur with Probability Zero, they Affect Global Behaviour in Learning

I want to make a quick comment on the work of Phuong and Lampert in [PL19]. In this paper, they prove equivalent results to these for arbitrary depth feedforward ReLU neural networks (with non-increasing widths), but with a key distinction: they consider  general models. In their words,

A sufficient condition for a network to be general with probability one is that the weights are sampled from a distribution with a density.

They then show that almost all feedforward ReLU networks with this architecture are general, and then show that general networks only satisfy scaling and permutation symmetries, thus excluding our orientation-reversing and degenerate node singularities since they occur on a set of measure zero. Importantly, this implies that almost all parameters  have no degenerate nodes, or equivalently, no opportunity for lossless compression.

However, even though scaling and permutation symmetries may be the only generic symmetries (in the sense of measure theory) that occur with non-zero probability, SLT tells us that the singularities of  have global effects on the loss landscape, as we discussed at length in DSLT2. If a parameter is near a non-generic singularity (i.e. one that occurs with probability zero), it computes a function that is almost identical to the one computed by that of a non-generic singularity. If we shift our language to that of compressibility of a network, SLT tells us that:

Just because a particular point  sampled from a posterior (or, notionally, obtained via running SGD) is not directly compressible itself, that doesn't mean that it isn't extremely close to one that is.

In this sense, SLT tells us that to understand the geometry of the loss landscape, we need to consider singularities even though they are not generic points. As Watanabe says, singularities contain knowledge. 

Appendix 1 - Formal Proof that Neural Networks are Singular

If, like me, you are mathematically inclined, you probably want to see a proof that these neural networks are, indeed, singular models, to tie together the various concepts and intuitions that we have built in this sequence so far. So let's turn into math mode briefly.

Recall that the Fisher information matrix  is degenerate if and only if the set 

 is linearly dependent. Here,  refers to the partial derivative with respect to the th component of the total parameter , not to be confused with the specific weight vector  in the neural network definition. Thus, to prove that feedforward ReLU networks are singular, our task is to find this linear dependence relation. The scaling symmetry alone is enough for this.

Theorem: Given a two layer feedforward neural network  with  hidden nodes, for any domain on which  is differentiable,  satisfies the differential equation for a fixed node 

Proof: Since , and letting , the set of derivatives with respect to our parameters are 

 and so since we can write  we have 

Corollary: Feedforward ReLU neural networks are singular models.

Proof: For the two layer case, for any fixed , there is a linear dependence relation given by the above differential equation evaluated at , thus the Fisher information is degenerate at , so the model is singular.

The equivalent proof for arbitrary depths and widths is given in Lemma A.1 of [Wei22], following from other work on functional equivalence in [PL19]. 

The degenerate node symmetries also give rise to a degenerate Fisher information matrix, though I haven't formally written out this alternate proof yet. If you are interested, do it as an exercise and leave it as a comment!

Appendix 2 - Proof Sketch for Fully Classifying  for Two Layer Feedforward ReLU Networks

This section is going to be slightly more technical, and in the grand scheme of the SLT story I am telling in this sequence, this may be seen as an unnecessary side-plot. But, other readers, particularly those with a pure mathematical bent, may find it interesting to consider the process of fully classifying  and how one might understand all phases present, so I am providing a sketch of these proofs for completeness. Understanding the full form of  was a vital part of performing the phase transition experiments that we will see in DSLT4. These models are simple enough that we can perfectly classify all true parameters in . Thus, we can precisely understand all of its phases.

We are going to classify the symmetries of  when both the model  and truth  are two-layer feedforward ReLU neural networks, with  and  hidden nodes respectively, giving 

meaning the task is to classify functional equivalence of the two networks. To avoid some annoying fringe cases, we assume that the true network is minimal, which means there is no network with fewer nodes that could also represent it (which also means every node is non-degenerate), and activation-distinguished, meaning every node of the truth corresponds to a unique activation boundary.

We will see that the set of symmetries explained above comprise all of the symmetries in  - there can be no more [8]. This result rests mainly on the fact that the activation boundaries are the core piece of data that defines a neural network. The rest is then just performing accounting of the gradients and biases in each region.

This is a sketch of the proofs in Chapter 4 of my thesis, and all lemmas and theorems that are referenced in the following section come from here.

Case 1: The model has the same number of nodes as the truth, 

Let  be a two layer feedforward ReLU neural network model with  hidden nodes, and let  be the realisable true network with  hidden nodes defined by a fixed parameter , denoted by 

which we assume is minimal and activation-distinguished as explained above.

We start by comparing the foldsets, which are the activation boundaries [Lemma 4.1], between the truth and the model. Let  be the activation boundary of the node  in the model, and  be the activation boundary of the node  in the truth. Then by comparing the sets of linear lines in [Lemma 4.2], we can show that for every node of the model  there exists a permutation  such that 

By [Lemma 4.3], two activation boundaries  are equal if and only if there is some non-zero scalar  such that  and .

Using our relation , in [Lemma 4.4] we analyse how the gradients and biases change across each activation boundary, and what this means for the relation between weights and biases in the model versus the truth. We show that there exists a unique , and for each  an  and  such that 

meaning  and  necessarily have the same sign.

However, there is a restriction on which weights can have reversed orientation,  (thus ). Letting , we show in [Lemma 4.5] that the weights and biases of the true network must satisfy [9]

The crux of this proof rests in comparing the gradients in regions either side of the activation boundary .

In [Theorem 4.7] we show that these scaling, permutation and orientation reversing symmetries are the only such symmetries by piecing together all of these aforementioned Lemmas, with emphasis on the importance of the activation boundaries in defining the topology of [10]

Case 2: The model has more nodes than the truth, 

We now suppose that the model is over-parameterised compared to the true network, so .

The key piece of data is once again the foldsets defining the model and the truth. Since they must be equal, the model can only have  unique foldsets, and thus activation boundaries. Without loss of generality (i.e. up to permutation symmetry), the first  nodes in the model have the same activation boundaries as the truth, . Thus, these  nodes in the model must satisfy the same symmetries as in the  case.

By comparing the fold sets on each excess node in , we must have 

 In comparing linear lines again, this means there are two possible situations:

  1.  is empty, so node  is degenerate, meaning  or , or;
  2.  for some , so node  shares an activation boundary already in the first  nodes of the model.

Let  the number of non-degenerate nodes of the model. We can thus define a surjective finite set map 

relating the non-degenerate nodes in the model to those in the truth, which is a bijection (i.e. a permutation ) on the first .

We can then compare the gradients and biases in each region to show that the total gradients calculated by each non-degenerate node at each unique activation boundary must sum to the gradient in the truth. Precisely, for each node  of the truth, let  be the set of nodes in the model that share the same activation boundary. Then for each  there exists an  and  such that 

 with the constraint that 

A similar orientation reversing symmetry also applies as in case 1, just by accounting for the nodes that share the same activation boundaries.

Resources

[Carroll21] - L. Carroll, Phase Transitions in Neural Networks, 2021

[Wei22] - S. Wei, D. Murfet, et al., Deep Learning is Singular, and That's Good, 2022

[PL19] - M. Phuong, C. Lampert, Functional vs Parametric Equivalence of ReLU networks, 2019


  1. ^

    Since  if and only if  for some .

  2. ^

    e.g. . ↩︎

  3. ^

    For ease of classification, we exclude the case where  and  since we can just absorb the total bias contribution into . ↩︎

  4. ^

    A linear domain  is just a connected open set where  is a plane with constant gradient and bias when restricted to , and  is the maximal such set for which that plane is defined. In other words, the set of linear domains are the set of different regions the piecewise affine function are carved up into. ↩︎

  5. ^

    But don't forget,  as the domain of activation is completely different. ↩︎

  6. ^

    Though I have not been able to formally prove it, I believe that this symmetry on its own (i.e. modulo scaling symmetry) does not result in a degeneracy of the Fisher information matrix, at least in our simple case. This, I think, is because the weights must cancel out in the region where both nodes are active, and the gradients in the other regions must be retained. Feel free to prove me wrong, though! ↩︎

  7. ^

    This statement is a bit disingenuous. Watanabe's free energy formula only applies to the case where  is analytic, but ReLU neural networks are certainly not analytic, as we can see in the below example. With that said, Watanabe has recently proved a bound on the free energy for ReLU neural networks, showing that the complexity term is essentially related to the number of non-degenerate nodes in the truth, even if it isn't a true RLCT. We will look at this in more depth in DSLT4. ↩︎

  8. ^

    Aside from the technical caveat discussed about the restricted input prior  above.

  9. ^

    Our convention is to take the empty sum to be 0, so all weight orientations being preserved, , is perfectly fine. ↩︎

  10. ^

    The activation distinguished condition on the truth allows us to uniquely identify the permutation  relating activation boundaries, and ensures only one node changes across each boundary. ↩︎

New Comment