The paper that killed deep learning theory

LawrenceC

Around 10 years ago, a paper came out that arguably killed classical deep learning theory: Zhang et al.'s aptly titled Understanding deep learning requires rethinking generalization.

Of course, this is a bit of an exaggeration. No single paper ever kills a field of research on its own, and deep learning theory was not exactly the most productive and healthy field at the time this was published. And the paper didn't come close to addressing all theoretical approaches to understanding aspects of deep learning. But if I had to point to a single paper that shattered the feeling of optimism at the time, it would be Zhang et al. 2016.^[1]

Let’s start by answering a question: what, exactly, do I mean by deep learning theory?

At least in 2016, the answer was: “extending statistical learning theory to deep neural networks trained with SGD, in order to derive generalization bounds that would explain their behavior in practice”.

Since the seminal work of Valiant in the mid 1980s, statistical learning theory had been the dominant approach for understanding machine learning algorithms. The framework imagined a data distribution D over inputs X and outputs Y, where the goal was to fit a hypothesis h : X → Y that minimized the expected test loss over D for a loss function L : Y × Y → R. A learning algorithm would receive n samples from the data distribution and would minimize the training loss, that is, L(h(x), y) averaged across the sample.

The core results of this approach took the form of generalization bounds: given some metric of complexity of the hypothesis class H, bound the difference between the average training loss and the test loss in terms of this metric of hypothesis complexity. To put it in less technical terms, a generalization bound basically says:

If your hypothesis class is not too complicated relative to the amount of training data you have, and it explains the training data well, then it will generalize and do well on the full data distribution.

The field of statistical learning had settled on a few preferred ways to measure complexity, with VC dimension and Rademacher complexity being the two main metrics, though some researchers considered alternatives such as the margin separating positive and negative examples from the classification boundary.
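
As one concrete instance of the informal statement above, a standard Rademacher-complexity bound can be sketched as follows. This assumes a loss taking values in [0, 1]; the symbols R, R̂_S, 𝔑_n and δ are notation introduced for this sketch rather than taken from the post. With probability at least 1 − δ over the draw of an n-sample S, simultaneously for every h in H:

```latex
R(h) \;\le\; \widehat{R}_S(h) \;+\; 2\,\mathfrak{R}_n(L \circ H) \;+\; \sqrt{\frac{\ln(1/\delta)}{2n}},
\qquad
R(h) = \mathbb{E}_{(x,y)\sim D}\big[L(h(x), y)\big],
\quad
\widehat{R}_S(h) = \frac{1}{n}\sum_{i=1}^{n} L(h(x_i), y_i).
```

Here 𝔑_n(L ∘ H) is the Rademacher complexity of the loss class induced by H. The bound only says something useful when this complexity term shrinks well below 1 as n grows.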

The success of modern deep learning, starting from the early 2010s, posed something of an existential crisis for this field. By all the metrics – including both VC dimension and Rademacher complexity – even a simple MLP with sigmoidal or ReLU activations represents far too complicated a hypothesis class to not immediately overfit on the training data. If the VC dimension results for a neural network are assumed to be asymptotically tight up to constants, then no neural network with even 100,000 parameters should be able to do anything useful on data points not included in the training data. Yet, not only were neural networks performing better than other machine learning algorithms, by the mid 2010s there was a growing list of examples where neural networks with tens of millions of parameters solved problems (such as the ImageNet challenge) that no other machine learning algorithm could make much progress on.

Clearly, neural networks did generalize. If traditional metrics of complexity, based on the representation capacity of the class of neural networks with arbitrarily specified, infinite precision floating points, failed at capturing the simplicity of neural networks in practice, then the field simply needed to construct new simplicity measures to argue that neural networks learned simple functions in practice.

This was the approach taken in several papers around the time. For example, Neyshabur, Tomioka, Srebro’s Norm-Based Capacity Control in Neural Networks (2015) constructed a complexity measure based on the Frobenius norm of the weight matrices in a deep neural network. Hardt, Recht, and Singer’s Train faster, generalize better: Stability of stochastic gradient descent (2015)^[2] showed that neural networks trained with a small number of SGD steps with sufficiently small step size were uniformly stable in that removing a single training example would not change the model’s loss on any particular test example by very much.
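
To make the stability notion concrete, here is the usual formulation, stated as a sketch. The symbols ε, A, S, S′, ℓ, R and R̂_S are notation introduced here, and the definition is the replace-one variant of the remove-one description above. A randomized algorithm A is ε-uniformly stable if, for any two training sets S and S′ of size n that differ in a single example, the left-hand condition below holds; the right-hand side is the consequence for the expected generalization gap:

```latex
\sup_{z}\; \mathbb{E}_{A}\big[\ell(A(S); z) - \ell(A(S'); z)\big] \;\le\; \varepsilon
\quad\Longrightarrow\quad
\Big|\,\mathbb{E}_{S, A}\big[R(A(S)) - \widehat{R}_S(A(S))\big]\Big| \;\le\; \varepsilon .
```

Here R is the expected loss under the data distribution and R̂_S is the average loss on S. The condition quantifies over all pairs of datasets, so it depends only on the algorithm and the hypothesis class, not on the particular data or labels.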

At least when I first entered the field of deep learning as an undergrad in early 2016, there was a sense of cautious optimism: we would find the way in which neural networks in realistic regimes were simple, and thereby derive generalization bounds that would be applicable in practice.

So, what did Zhang et al. 2016 actually show? Why did understanding deep learning require rethinking generalization?

To quote the paper, the “central finding can be summarized as: Deep neural networks easily fit random labels”. Specifically, the authors trained neural networks on the standard-at-the-time CIFAR10 and ImageNet benchmarks to memorize random labels, while following standard procedures and training for the same order of magnitude of steps. They also showed that with similar techniques, neural networks could be trained to memorize random noise inputs.
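
The data side of such an experiment is simple to sketch. The NumPy snippet below is a minimal illustration, not the authors' code: the function names, the corruption probability `p`, and the choice to match the dataset's overall mean and standard deviation for the noise inputs are all assumptions made for this example.

```python
import numpy as np

def corrupt_labels(labels, num_classes, p, rng):
    """With probability p, replace each label by a uniformly random class."""
    labels = np.asarray(labels)
    flip = rng.random(labels.shape) < p
    random_labels = rng.integers(0, num_classes, size=labels.shape)
    return np.where(flip, random_labels, labels)

def gaussian_noise_like(images, rng):
    """Replace every pixel with an independent Gaussian draw.

    Matching the overall mean and standard deviation of the original
    images is an assumption of this sketch.
    """
    images = np.asarray(images, dtype=np.float64)
    return rng.normal(images.mean(), images.std(), size=images.shape)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=50_000)   # stand-in for 10-class labels
fully_random = corrupt_labels(labels, num_classes=10, p=1.0, rng=rng)

# A uniformly random replacement still agrees with the original label
# about 1/num_classes of the time, so this is close to 0.1, not 0.
agreement = float(np.mean(fully_random == labels))

images = rng.random((4, 32, 32, 3))         # stand-in for a few small images
noise_images = gaussian_noise_like(images, rng)
```

With `p=1.0` the labels carry no information about the inputs, so any model that reaches near-zero training error on them must be memorizing.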

Why is this an effective death knell for the simplicity-and-generalization-bound approach? The authors' results show that the same class of neural networks, trained with the same learning algorithm, can generalize when given true labels and memorize random ones. This shows that the hypothesis class of neural networks that are learnable with standard techniques cannot be simple in any useful sense, at least for complexity measures that depend only on properties of the hypothesis class and (data-independent) properties of the learning algorithm.

The paper has 5 important parts. Let's go through each of them.

The core empirical finding that neural networks can fit random labels. The authors train a 1- and a 3-layer MLP, an AlexNet variant, and an Inception variant on CIFAR10. They train the models normally (with the true labels), as well as under four ways of corrupting the dataset: random labels (replacing each label with a random class with some probability), shuffled pixels (the same permutation on pixels is applied to each image), random pixels (a different random permutation is applied to each image), and pure Gaussian noise (replacing every single pixel with an independent draw from a Gaussian). In each of these five cases, the network gets to near 0 training loss. Notably, while training with random labels is harder, convergence to zero training loss takes only 1.5-3.5x longer than with the true labels. And by varying the degree of label corruption, the authors can produce models that either generalize to the test set to varying degrees or perform no better than chance.

The authors also train an InceptionV3 model on ImageNet with random labels, and find that it can get 95.2% top-1 accuracy on the train set.

The implications for statistical learning theory approaches to generalization bounds. These experiments show that in realistic regimes, Rademacher complexity and VC dimension bounds are basically vacuous, since neural networks have enough representational capacity to memorize entire training sets. The prior uniform stability results of Hardt and Recht (both authors on this paper) are also necessarily vacuous in this setting, since uniform stability is a property that depends only on the algorithm and hypothesis class (it’s data-independent!), and the algorithm and hypothesis class stay the same in each experimental setting.
Further experiments demonstrating that explicit regularization cannot rescue generalization bounds. The authors show that on both ImageNet and CIFAR-10, explicit regularization methods such as data augmentation or weight decay do not seem to affect the test accuracy of the algorithms very much. That is, the neural networks generalize to the test distribution even without any regularization. The authors also show that on ImageNet, applying dropout or weight decay still allows the resulting model to memorize the training set to a large extent. So any generalization bound that depends on regularization (e.g. weight norm-based explanations) cannot explain why neural networks generalize.
A simple toy construction that showed a two-layer ReLU network can memorize a number of examples linear in parameter count. The authors include a simple theoretical result, where a depth-2 ReLU network with 2n+d weights can fit any labeling of a sample of n data points in d dimensions. This feels pretty extraneous to me given the strength of the empirical results, but the construction is simple and it confirms the intuition that neural networks with millions of parameters “should” be able to fit tens of thousands of data points in the CIFAR10 setting.
Some notes on how statistical learning theory fails even in a simple overparameterized linear regime. The authors consider a basic overparameterized linear regression setting, and show both empirically and theoretically that SGD can learn a minimum norm solution that generalizes. The authors point out that statistical learning theory at the time had no explanation for generalization in this simple regime.
They also demonstrate empirically that smaller norm doesn’t imply better generalization – when preprocessing is applied to the MNIST dataset to increase its effective dimensionality for a linear classifier, the resulting larger linear classifier has higher norm but lower generalization error (this result also undercuts the weight-norm based approach to explaining generalization in neural networks).
Amusingly, the quick thoughts put forth by the authors in this setting would go on to become quite influential, both in that people would study the behavior of SGD in overparameterized linear regimes, and that it hints toward future puzzles such as double descent.
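
The two theoretical points in this list (memorization with 2n+d weights, and gradient methods finding a minimum norm solution) can both be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's code. The memorization part uses one construction that reaches the 2n+d count: project the points onto a random direction, interleave the biases between the projected values, and solve a triangular system. It should be read as in the spirit of the paper's argument rather than a transcription of it. The regression part swaps SGD for full-batch gradient descent started at zero. All function names are hypothetical.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def fit_two_layer_relu(X, y, rng):
    """Return (a, b, w) with sum_j w[j] * relu(a @ x_i - b[j]) == y[i] for all i.

    Uses d + n + n = 2n + d parameters. Assumes the rows of X are distinct,
    so a random projection separates them with probability 1.
    """
    n, d = X.shape
    a = rng.normal(size=d)
    z = X @ a
    z_sorted = np.sort(z)
    # Interleave biases: b[0] < z_(1) < b[1] < z_(2) < ... < b[n-1] < z_(n).
    b = np.concatenate(([z_sorted[0] - 1.0], (z_sorted[:-1] + z_sorted[1:]) / 2))
    # Up to a permutation of rows, this matrix is lower triangular with a
    # positive diagonal, hence invertible, so any label vector can be matched.
    A = relu(z[:, None] - b[None, :])
    w = np.linalg.solve(A, y)
    return a, b, w

def predict(X, a, b, w):
    return relu((X @ a)[:, None] - b[None, :]) @ w

def min_norm_by_gradient_descent(X, y, steps=2000):
    """Full-batch gradient descent on 0.5 * ||X w - y||^2, started at w = 0."""
    w = np.zeros(X.shape[1])
    lr = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / (largest singular value)^2
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(0)

# (1) Memorization: 30 points in 5 dimensions with arbitrary (random) labels.
X = rng.normal(size=(30, 5))
y = rng.integers(0, 10, size=30).astype(float)
a, b, w = fit_two_layer_relu(X, y, rng)
max_fit_error = np.max(np.abs(predict(X, a, b, w) - y))

# (2) Overparameterized regression: 20 equations, 200 unknowns.
X_wide = rng.normal(size=(20, 200))
y_wide = rng.normal(size=20)
w_gd = min_norm_by_gradient_descent(X_wide, y_wide)
w_min_norm = np.linalg.pinv(X_wide) @ y_wide   # minimum norm interpolant
gap_to_min_norm = np.max(np.abs(w_gd - w_min_norm))
```

In part (2), every gradient step adds a combination of the rows of `X_wide`, so the iterate never leaves their span; the only interpolating solution in that span is the minimum norm one, which is why the result agrees with the pseudoinverse solution up to numerical error.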

So, how did the field of deep learning theory react to this paper? What were the attempts to get around this result using data-dependent generalization bounds? And what was the paper that arguably sealed the deal on the whole edifice, and nailed the proverbial coffin shut?

I'll answer these questions in tomorrow's post.

^[1] Notably, Zhang et al. 2016 got best paper at ICLR 2017, so it was widely recognized as important even at the time.
^[2] Note that both Hardt and Recht were also authors on the Zhang et al. paper.

I like this post and the "theory of deep learning" posts. But I think I still haven't figured out how to model your view, especially the specifics of the pessimism here. Maybe we should discuss in person. In particular I'm not sure what "deep learning theory" encompasses.

My sense of mechinterp theory is that it's similar to pre-standard-model physics.

Heuristically, here's a thought experiment. Suppose we're worried about the sun destroying the earth and want to understand as much as possible about the physics of solar plasmas and supernovas; but we currently only have (a vaguely historical pastiche of) pre-WW2 physics. Physics then roughly had the following components:

1. idealized heuristics: if we view a big object in space as a classical blackbody, we get a good heuristic on some parts of its emission spectrum
2. new behaviors: there's a consistent way that emission spectra aren't classical blackbodies, in that they're quantized. We have only a rough understanding of how and why, and in fact this observation spawned the discovery of quantum mechanics.
3. small toy examples: we can understand the hydrogen atom relatively well. There are some weird factors of 2 and corrections that we can only explain kind of heuristically, but except for these we have a clean, exact quantized spectrum. We see this spectrum in real life materials - but we also see that most of what comprises real materials isn't hydrogen, and is much more complicated. Some stuff still looks roughly like they could be atomic spectra for other atoms or small molecules, but metal conducting bands are dominated by weird and clearly non-localized behaviors that we don't understand (and the sun similarly has weird spectral phenomena).
4. limits: There's a limit where the world is Newtonian, which is sometimes useful, but very inaccurate when modeling the sun. There's a limit where the world is relativistic. This gives directionally good corrections for some stellar phenomena (e.g. redshift) but is not nearly enough. It seems that there are maybe other limits (like we can mostly blackbox nuclear phenomena at earth temperatures but not at solar temperatures). Most of our understanding comes from sloppily combining together different phenomena coming from the various limits of importance.
5. experimental tools: looking at emission spectra is a really low-bandwidth way to interact with behaviors of interest. While it gives interesting info that points to new phenomena, it at best tells us something about a very limited class of behaviors (photon absorption and emission). In order to understand "how QM works" we have to figure out new tools (maybe vacuum chambers and primitive colliders), and new ways to interpret the output of existing tools.

In our world, iterating on these techniques gave us the standard model (and we understood solar plasma and some basics of supernovas before this). I think the promise of theory is that analogs of these techniques (maybe: SAE, large-N limits, toys like mod-add) will give us robust mechanism-finding tools. I think a lot of criticism of theory sounds to me like someone who in that world is saying "none of the current tools explain the sun even approximately, so we're on the wrong track" - but that's not how theory works (until you've found a "critical mass" of the standard model, or at least all parts of it relevant in a field of interest, you'll only be explaining tiny fractions of the observations). I know you're not making this criticism, but I feel like currently you are flattening the different components above into one notion of "theory good vs. theory bad".

I'd guess that you're skeptical about whether the analogs of 1-5 in ML theory are actually useful for "making progress towards the standard model", but I'm not sure from your post which of these you think is most lacking (or if this picture is even compatible with your criticism).

My guess is that your issue is wrt something like my #1: certain heuristics that people were excited about and hoped would explain generalization turned out to be more complicated. My view is that in modern theory, VC dimension is considered largely defunct in models with nontrivially interesting data (even as simple as mod-add), but I'm not sure why this is the important thing about theory. If you take a more modern theory like mean field or even NTK, it has a non-VC notion of generalization: e.g. NTK/Gaussian processes can replicate generalization in MNIST (related to some data spectrum properties), and mean field theory can (currently only on the Bayesian level - this is unpublished work with Kaarel) explain generalization on polynomially many samples of any mechanism that can be encoded in a small (algorithms) circuit. It's of course not guaranteed to converge to the same mechanism, but has the same notion of learnable vs. un-learnable on a polynomial-vs-exponential complexity theory level. It also does replicate the correct modular addition generalizing algorithm (NTK does not).

Hm. I'm going to first nitpick your specific example of pre-standard model physics and the sun, then respond to the overall point.

My impression is that pre-WW2 physics had a lot more wins re: the sun than either deep learning theory circa 2016 or even current learning mechanics (as used in the Simon et al paper) has for current mech interp. (I'm not familiar with your unpublished mean field theory work, and I also haven't been keeping up to date on Singular Learning Theory work.)

1. From the spectrum lines we knew that the sun was made of hydrogen and helium by the late 1920s at the very latest: helium was first discovered in the sun in the late 1800s via its spectral lines, and Cecilia Payne-Gaposchkin's PhD thesis Stellar Atmospheres, which established the sun's atmospheric composition, was published in 1925.
2. Aston had built the mass spectrograph by 1920, which allowed him to measure the mass of helium and hydrogen nuclei. It turns out that four hydrogen nuclei (that is, 4 protons) weighed a statistically significant amount more than a helium nucleus (Google tells me ~0.7%, though idk how precise their measurements were at the time). In the 1920s, Eddington argued that this mass defect was the source of energy for the sun (though he didn't understand how the sun fused hydrogen into helium).
3. The Hertzsprung-Russell diagram was plotted by the early 1900s. Perhaps a naive interpretation says that stars would move left and up along this diagram over time, though Eddington's 1920s work (and the existence of the mass-luminosity relationship between stars) suggested that the main sequence was a result of stars having different masses and not a thing they'd move along.
4. We had observations of white dwarfs by the 1920s, we knew they were held up by electron degeneracy pressure by late 1926 (Pauli's exclusion principle was published in 1925), and Chandrasekhar did the relativistic corrections to get the mass limit on white dwarfs by 1930.
5. Chandrasekhar's limit also suggested the existence of neutron stars, which were then hypothesized in the 1930s: Baade and Zwicky's 1934 paper proposed that the transition from main sequence -> neutron star was the reason for supernovas.
6. Much earlier work in biology -- e.g. the speed of evolution and age of the fossil record -- already suggested that the sun and the earth have been around in something like their current configuration for hundreds of millions if not billions of years, which put pretty strong limits on the speed of stellar evolution.
The main missing piece for predicting the fate of the sun was that it wasn't obvious how white dwarves came to be, or even when stars would leave the main sequence. (Probably the first explanation was from Chandrasekhar in 1942.) We were also missing a lot of facts about how exactly the sun fused hydrogen into helium and how supernovas would happen.

So I think that, while people did not know whether the sun would go nova or somehow transition into a white dwarf, we knew the composition of the sun (1), the mechanism through which it generated energy (2), had statistical relationships between stellar mass, luminosity, and temperature (3), had proposed two stellar remnants that the sun could eventually become (4, 5), and had a strong sense of the time scale on which these events would happen (6).

Arguably, deep learning theory circa 2016 had only a (6) analogue (namely, how well deep learning seemed to work in practice). Deep learning theory in 2026 has in addition its versions of (3) (scaling laws, which were first noted around 2020) and rudimentary analogues of (4) and (5) (NTKs/infinite width/depth limits in general, perhaps?). But we don't have any analogues for 1 or 2 (which are the mechanisms through which SGD on deep neural networks produces smart systems) except in a very rudimentary way (something about circuits, gradient scales, and loss landscapes, perhaps?), and there's a fair amount of evidence that our 4/5 analogues aren't great.

Specifically, I think the relevant work for understanding the fate of the sun (at least on the scale of "does it go nova") wasn't so much the standard model as relativistic quantum mechanics + general relativity, a lot of which had been worked out by the 1930s. Spectral lines gave us the composition of the sun, e=mc^2 suggested the mechanism for energy generation, and relativistic quantum mechanics pointed to two possible stellar remnants.

I think there's two ways to make your case go through, though:

You can argue that circa 1935, it'd seem that additional work on "pure" quantum mechanics would not help our understanding of the sun, while it was demonstrably very important. So from a "figure out if the sun will explode" standpoint, it might look like we shouldn't be spending any additional research effort into QM in itself, which would cause us to miss a bunch of facts about the fate of the sun. Examples:
1. iirc you need a theory of weak interactions to compute the rate of p + p → d + e⁺ + ν in the sun.
2. we had no explanation of why the sun seemed to produce far fewer neutrinos than expected for decades, and this one actually does require the standard model to solve (though to be fair, we also didn't notice the problem until 1960).
3. (Claude tells me that) the actual math for parts of the red giant -> white dwarf transition for our sun involved a lot of complicated QM.
I think the closest analogue to the Zhang et al paper re: the Sun and physics probably happened with Kelvin's calculations about the age of the sun and the earth. He didn't know about radiation or e=mc^2, so his calculations gave the age of the sun and earth at something like 20 million years. This was far below the hundreds of millions-billions of years bound set by biology results at the time. Maybe one would then conclude, in the late 1800s, that you should not study classical mechanics in order to understand the sun, or at least that further investment in research like that was unlikely to produce relevant results.

Responses to these two cases:

It's plausible that this argument might hold up -- my technical QM knowledge basically ends with work in the 1940s. But I think it's not as strong an argument as "your argument implies people shouldn't study theoretical physics", because so much of stellar evolution could be (and had been) worked out with just basic relativistic QM and general relativity. So a QFT person could point to the success of relativistic QM as evidence for why additional investment in QM research would further our understanding of the stars.
I think the analogous dismissal of theoretical research here is basically correct -- I'm pretty sure that no amount of additional theoretical work without knowledge of radiation, QM, or relativity would've let Kelvin calculate the age of the sun correctly from first principles, let alone figure out whether it'd eventually explode.

To clarify, my case isn't so much "don't invest in theory research" -- I actually think we should invest in much more theory research! -- as "I don't expect classical learning theory approaches to work, and the standard academic response to this (learning mechanics) seems to have just dropped the level of ambition so much as to be uninteresting". Specifically, I think any theoretical approach that seeks to explain large chunks of deep learning should be able to articulate a clear novel insight à la relativity or QM that differentiates it from pre-2019 learning theory, or else be treated with a level of skepticism that can only be overcome with interesting empirical results.

The Singular Learning Theory people have several insights of this form -- e.g. the old learning theory both used a wrong notion of generalization and failed to account for degeneracies in the loss landscape. ARC Theory and the comp mech people both have their own versions of these insights as well. I suspect you can articulate something like this for your work with Kaarel (though I haven't seen it) as well -- iirc it was something like "modern versions of mean field theory in physics can handle higher order interactions than the earlier approaches that people used for studying generalization" or something?

Very good nitpicks. I definitely don't know my physics history well (but even with my limited knowledge, I was gesturing at a cartoon level of understanding that mixed different early-20th-century pictures on different phenomena).

Re mean field - it's not higher order, but lower order. Mean field is to NTK what classical mechanics is to quantum mechanics (in particular NTK + higher order corrections still has most of the bad generalization properties of NTK). The new insight is that while the expansion in NTK is always around a trivial classical theory, nontrivial classical theories also exist and are better-behaved from a complexity viewpoint.

For future reference, after speaking more with Dmitry + reading more of the papers linked in the Simon et al "Scientific Theory of Deep Learning" paper, I've become (slightly) more positive on deep learning theory, at least the mean-field line of work he mentioned in his comments.

I wrote up a bit of the history of that line of work, and why I've become more optimistic on deep learning theory in the past few days: https://www.lesswrong.com/posts/6SRq7mZ97Dwuavwb6/maybe-i-was-too-harsh-on-deep-learning-theory-three-days-ago

These models trained on random labels: are they uncertain or confidently wrong on the test data?

My model of what is going on here is that stochastic gradient descent is acting roughly like an MCMC sampling method. It's producing a random sample from the space of low loss parameters, and the simpler hypotheses correspond to larger parameter space volumes.

When the network needs to memorize, it needs to use nearly all its parameters, meaning a small parameter-space volume. When the network is learning a pattern, it's only using a small fraction of its parameters on the pattern, and the rest of the parameters can be almost anything, so long as they don't get in the way. This means simple hypotheses have a huge volume in parameter space. (This is basically the lottery ticket hypothesis, and it explains why network distillation is so effective.)

MCMC means sampling from a distribution over parameters that favors low loss, so larger parameter space volumes will be more likely to be sampled.

So the network training will choose the simplest hypothesis available.

Grokking makes sense if the simpler hypotheses are sometimes harder for local greedy search methods to find.