I am not sure if the calculation in Appendix B is quite accurate; I would like to ask you for a better explanation if I am not quite right.
In the first line (calculation of 'm'), we can clearly see that there are 4 operations. Now, we could assume that (1-beta1) could be pre-calculated, and hence there are only 3 operations.
If we accept that argument, then the calculations of 'm_hat' and 'v_hat' should be considered to have only 1 operation each. I do see the transpose there, which is weird to me too; although PyTorch's documentation gives the same set of mathematical equations, the default parameters use scalar values for beta1 and beta2.
I am really trying to make sense of the calculation here, but I really can't. Could you please provide more information on this?
Summary:

| Compute-intensity of the weight update | Most compute-intensive layers | Backward-forward ratio |
|---|---|---|
| Large batch size OR compute-intensive convolutional layers | First layer | 1:1 |
| Large batch size OR compute-intensive convolutional layers | Other layers | 2:1 |
| Small batch size AND no compute-intensive convolutional layers | First layer | 2:1 |
| Small batch size AND no compute-intensive convolutional layers | Other layers | 3:1 |
Introduction:
How many more floating-point operations (FLOP) does it take to compute a backward pass than a forward pass in a neural network? We call this the backward-forward FLOP ratio.
This ratio is useful for estimating total training compute from the forward compute; something that interests us in the context of our study of Parameter, Compute and Data Trends in Machine Learning.
In this post, we first provide a theoretical analysis of the ratio, and we then corroborate our findings empirically.
Theory:
To understand where the differences in ratios come from, we need to look at the classical equations of backpropagation.
Let’s start with a simple example---a neural network with 2 hidden layers.
In this example, we have the following computations for forward and backward pass assuming linear layers with ReLU activations. The “@”-symbols denote matrix multiplications.
Table: forward and backward computations for each layer with their FLOP counts (columns: Operation, Computation, FLOP forward, Computation, FLOP backward). The forward pass of each layer is a single matrix multiplication, e.g. A2 = ReLU(W2 @ A1) for layer 2. The backward pass of every layer except the first consists of two matrix multiplications: the error signal

δ1 = dL/dA1 = W2ᵀ @ δ2

and the weight gradient

dL/dW2 = δ2 @ A1ᵀ.

The backward pass of the first layer only needs the weight gradient dL/dW1 = δ1 @ Xᵀ, since no error signal has to be propagated back to the input X.
We separate the weight update from the individual layers since the update is only done after aggregation, i.e. we first sum the gradients coming from the different inputs in the batch and only then multiply by the learning rate and add the result to the weights.
From this table we see that the backward pass of the first layer requires the same number of FLOP as its forward pass (one matrix multiplication each), that the backward pass of every other layer requires roughly twice the FLOP of its forward pass (two matrix multiplications instead of one), and that on top of this comes the weight update, whose FLOP are proportional to the number of parameters.
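To make this counting concrete, here is a minimal NumPy sketch (with made-up layer sizes) of the forward pass and a manual backward pass for such a network; each matrix multiplication below corresponds to one entry of the table above.

```python
# Minimal sketch (made-up sizes): forward pass and manual backward pass for a
# network with 2 hidden layers, to make the matrix-multiplication counting explicit.
import numpy as np

b, n0, n1, n2, n3 = 32, 100, 200, 200, 10   # batch size and layer widths (hypothetical)
X  = np.random.randn(n0, b)                 # input batch
W1 = np.random.randn(n1, n0)
W2 = np.random.randn(n2, n1)
W3 = np.random.randn(n3, n2)

# Forward pass: one matrix multiplication per layer
Z1 = W1 @ X;  A1 = np.maximum(Z1, 0)
Z2 = W2 @ A1; A2 = np.maximum(Z2, 0)
Z3 = W3 @ A2                                # output layer

# Backward pass, given the output error signal delta3 = dL/dZ3 from the loss
delta3 = np.random.randn(n3, b)
dW3    = delta3 @ A2.T                      # weight gradient: 1 matmul
delta2 = (W3.T @ delta3) * (Z2 > 0)         # error signal:    1 matmul
dW2    = delta2 @ A1.T                      # weight gradient: 1 matmul
delta1 = (W2.T @ delta2) * (Z1 > 0)         # error signal:    1 matmul
dW1    = delta1 @ X.T                       # weight gradient: 1 matmul
# The first layer needs no error signal for the input X, so its backward cost
# equals its forward cost; every other layer needs two matmuls, i.e. twice as much.
```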
In equation form, the formula for the backward-forward FLOP ratio is:
backward / forward =
(FIRST LAYER FORWARD FLOP + 2*OTHER LAYERS FORWARD FLOP + WEIGHT UPDATE) / (FIRST LAYER FORWARD FLOP + OTHER LAYERS FORWARD FLOP)
There are two considerations that determine which terms dominate in this equation: whether the weight update is negligible compared to the layer FLOP (it is when the batch size is large or when compute-intensive convolutional layers dominate), and whether the first layer or the other layers account for most of the forward FLOP.
This leads us to four possible cases:
| | Big weight update | Small weight update |
|---|---|---|
| First layer dominant | 2*FIRST LAYER FORWARD FLOP / FIRST LAYER FORWARD FLOP = 2:1 | FIRST LAYER FORWARD FLOP / FIRST LAYER FORWARD FLOP = 1:1 |
| Other layers dominant | 3*OTHER LAYERS FORWARD FLOP / OTHER LAYERS FORWARD FLOP = 3:1 | 2*OTHER LAYERS FORWARD FLOP / OTHER LAYERS FORWARD FLOP = 2:1 |
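As a sanity check on these four cases, here is a minimal sketch that plugs made-up FLOP numbers (not measurements) into the formula above:

```python
# Minimal sketch with made-up FLOP numbers: evaluate the backward/forward formula
# in the four limiting cases of the table above.
def bf_ratio(first_fwd, other_fwd, weight_update):
    backward = first_fwd + 2 * other_fwd + weight_update
    forward = first_fwd + other_fwd
    return backward / forward

big, small = 1e9, 1.0   # "dominant" vs. "negligible" FLOP (hypothetical values)

print(bf_ratio(first_fwd=big, other_fwd=small, weight_update=big))     # ~2:1
print(bf_ratio(first_fwd=big, other_fwd=small, weight_update=small))   # ~1:1
print(bf_ratio(first_fwd=small, other_fwd=big, weight_update=big))     # ~3:1
print(bf_ratio(first_fwd=small, other_fwd=big, weight_update=small))   # ~2:1
```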
The norm in modern Machine Learning is deep networks with large batch sizes, where our analysis predicts a ratio close to 2:1.
In short, our theoretical analysis predicts that the backward-forward FLOP ratio will be between 1:1 and 3:1, with 2:1 being the typical case.
Empirical results:
To corroborate our analysis we use NVIDIA’s pyprof profiler to audit the amount of FLOP in each layer during the backward and forward pass.
In this section we explore how the ratio behaves in the first layer versus the rest of the layers, and how it depends on the type of layer, the batch size and the depth of the network, before combining all of these factors.
In short, our empirical results confirm our theoretical findings.
In a previous post, we tried to estimate utilization rates. As detailed there, the profiler both under- and over-counts some operations, so we believe some of the estimates below are slightly off.
We have tried to correct them as much as possible. In particular, we eliminate some operations that we believe are double-counted, and we add the operations corresponding to the multiplication by the learning rate in stochastic gradient descent, which we believe the profiler does not count.
Backward and forward FLOP in the first and the rest of the layers:
We can investigate this empirically by looking at a simple linear network (code in appendix).
It results in the following FLOP counts:
We can see that the first layer (red) has the same FLOP count for the forward and backward pass, while the other layers (blue, green) have a ratio of 2:1. The FLOP count of the final weight update (yellow) is 2x the number of parameters of the network.
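Since Appendix A is not reproduced here, the following is a hypothetical PyTorch network of this shape; the layer sizes are made up, chosen so that (as in the experiments below) nearly all weights sit in the first layer.

```python
# Hypothetical stand-in for the linear network in Appendix A (the exact sizes used
# in the post are not shown): most of the weights sit in the first layer.
import torch
import torch.nn as nn

class LinearNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(4096, 512),  # first layer: backward/forward ratio 1:1
            nn.ReLU(),
            nn.Linear(512, 32),    # other layers: backward/forward ratio 2:1
            nn.ReLU(),
            nn.Linear(32, 10),
        )

    def forward(self, x):
        return self.layers(x)

model = LinearNet()
loss = model(torch.randn(64, 4096)).sum()             # forward pass (batch of 64)
loss.backward()                                       # backward pass
torch.optim.SGD(model.parameters(), lr=0.01).step()   # weight update
```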
Type of layer:
The number of FLOP is different for different types of layers.
For linear layers, the number of FLOP per input is roughly 2x the number of parameters. For CNNs, the number of FLOP is much higher than the number of parameters, since every weight is applied at many spatial positions. This means that the final weight update is basically negligible for CNNs but relevant for linear networks.
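As an illustration (with made-up layer sizes, not numbers from the experiments), compare a convolutional and a linear layer:

```python
# Illustrative (made-up) layer sizes: a conv layer applies every weight at many
# spatial positions, so its FLOP far exceed its parameter count, while a linear
# layer needs only ~2 FLOP per parameter per input.
c_in, c_out, k, h_out, w_out = 3, 64, 3, 224, 224     # hypothetical conv layer

conv_params = c_in * c_out * k * k                    # ignoring the bias
conv_flop = 2 * conv_params * h_out * w_out           # ~2 FLOP per weight per output position
print(conv_params, conv_flop)                         # 1728 parameters vs. ~173M FLOP per input

lin_in, lin_out = 1000, 1000                          # hypothetical linear layer
lin_params = lin_in * lin_out
lin_flop = 2 * lin_params                             # one multiply and one add per weight
print(lin_params, lin_flop)                           # 1M parameters vs. 2M FLOP per input
```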
To show this empirically, we look at the profiler FLOP counts of a small CNN (code in appendix).
Similar to the linear network, we can confirm that the backward-forward ratio for the first layer is 1:1 and that of all others 2:1. However, the number of FLOP in layers (red, blue, green) is much larger than for the weight update (yellow).
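Again, Appendix A is not reproduced here; a hypothetical small CNN along these lines would be:

```python
# Hypothetical stand-in for the small CNN in Appendix A (exact architecture not
# reproduced here): a few conv layers and a linear head.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # first layer: ratio 1:1
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # other layers: ratio 2:1
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.head = nn.Linear(32 * 32 * 32, 10)

    def forward(self, x):
        x = self.features(x)
        return self.head(x.flatten(start_dim=1))

model = SmallCNN()
model(torch.randn(8, 3, 32, 32)).sum().backward()     # forward + backward on a small batch
```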
Batch size:
Gradients are aggregated before the weight update. Thus, the FLOP for weight updates stays the same for different batch sizes (yellow) while the FLOP for all other operations scales with the batch size (blue, green, red). As a consequence, larger batch sizes make the FLOP from weight updates negligibly small.
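A back-of-the-envelope sketch (with a made-up parameter count) of how the two contributions scale with batch size:

```python
# Back-of-the-envelope sketch (made-up parameter count): layer FLOP scale with the
# batch size, while the weight update costs a fixed ~2 FLOP per parameter per step.
params = 10_000_000                     # hypothetical parameter count
fwd_flop_per_example = 2 * params       # forward FLOP for one example (dense layers)

for batch_size in (1, 32, 1024):
    layer_flop = 3 * fwd_flop_per_example * batch_size   # forward plus ~2x backward
    update_flop = 2 * params                             # independent of batch size
    print(batch_size, update_flop / layer_flop)          # share of the update shrinks
```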
Depth:
Depth, i.e. the number of layers, has only an indirect influence. This stems from the fact that the first layer has a ratio of 1:1 while all further layers have a ratio of 2:1. Thus, what actually matters is how the forward FLOP are split between the first layer and every other layer.
To show this effect, we define a CNN with different numbers of intermediate conv layers (code in appendix).
We find that the backward-forward ratio starts significantly below 2:1 for 0 intermediate layers and converges towards 2:1 as the number of intermediate layers increases.
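A hypothetical version of such a variable-depth CNN (the actual channel counts from Appendix A are not shown here) could look like this:

```python
# Hypothetical stand-in for the variable-depth CNN in Appendix A: the number of
# intermediate conv layers is an argument, everything else stays fixed.
import torch.nn as nn

def make_cnn(num_intermediate_layers):
    layers = [nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU()]   # first layer: 1:1
    for _ in range(num_intermediate_layers):                           # each added layer: 2:1
        layers += [nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU()]
    layers += [nn.Conv2d(32, 10, kernel_size=3, padding=1)]            # output layer
    return nn.Sequential(*layers)

# With 0 intermediate layers the 1:1 first layer carries a large share of the FLOP;
# adding intermediate layers pushes the overall ratio towards 2:1.
models = {depth: make_cnn(depth) for depth in (0, 1, 4, 16)}
```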
Most common deep learning CNN architectures are deep enough that the first layer shouldn't have a strong effect on the overall number of FLOP, and thus the ratio should be close to 2:1. We have empirically tested this for multiple types of ResNets and batch sizes. We observe some divergence from the expected 2:1 ratio, but we think this is a result of the profiler undercounting certain operations. We have described problems with the profiler in the previous post.
Backward-forward FLOP ratio in different architectures. Read the labels as architecture_batchsize.
Combining all above:
There are interdependencies between batch size, type of layer and depth which we want to explore in the following. We compare the small CNN and the linear network used above with a network we call OneNet (code in appendix). OneNet has only one input neuron and larger second and third layers. Thus, the first layer accounts for only a tiny fraction of the FLOP, and the theoretical maximum backward-forward ratio of 3:1 can be observed in practice.
Furthermore, we look at exponentially increasing batch sizes for all three architectures. In the case of linear networks, i.e. LinearNet and OneNet, the ratio decreases with increasing batch size since the influence of the weight update is reduced. In the case of the CNN, the FLOP count is completely dominated by layers and the weight update is negligible. This effect is so strong that no change can be observed in the figure.
We see that LinearNet converges to a backward-forward ratio of 1:1 for larger batch sizes, while OneNet converges to 2:1. This is because nearly all weights of LinearNet are in the first layer and nearly all weights of OneNet are in the other layers.
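For illustration, a hypothetical OneNet with the properties described above (one input neuron, large later layers) could be:

```python
# Hypothetical stand-in for OneNet from Appendix A: a single input neuron, so the
# first layer is tiny and nearly all weights (and FLOP) sit in the later layers.
import torch
import torch.nn as nn

class OneNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(1, 1000),      # first layer: negligible FLOP
            nn.ReLU(),
            nn.Linear(1000, 1000),   # large second and third layers dominate
            nn.ReLU(),
            nn.Linear(1000, 1000),
        )

    def forward(self, x):
        return self.layers(x)

model = OneNet()
model(torch.randn(4, 1)).sum().backward()   # small batch: weight update matters, ratio near 3:1
```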
Conclusion:
We have reasoned that the backward-forward FLOP ratio in Neural Networks will typically be between 1:1 and 3:1, and most often close to 2:1.
The ratio depends on the batch size, on how much computation happens in the first layer versus the others, and on the degree of parameter sharing.
We have confirmed this in practice. However, we have used a profiler with some problems, so we cannot completely rule out a mistake.
Acknowledgments
The experiments have been conducted by Marius Hobbhahn. The text was written by MH and Jaime Sevilla.
Lennart Heim helped greatly with discussion and support. We also thank Danny Hernandez and Girish Sastry for discussion.
Appendix A: Code for all networks
Appendix B: Using other optimizers
Throughout this post we have assumed stochastic gradient descent (SGD) for the weight update. SGD involves multiplying the gradient by a learning rate and adding the result to the current weights. That is, it requires 2 FLOP per parameter.
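As a minimal illustration of the 2 FLOP per parameter:

```python
# SGD weight update: one multiplication (lr * grad) and one subtraction per
# parameter, i.e. 2 FLOP per parameter, regardless of the batch size.
import numpy as np

lr = 0.01
theta = np.random.randn(1_000_000)   # parameters (hypothetical count)
grad = np.random.randn(1_000_000)    # gradient, already aggregated over the batch
theta -= lr * grad                   # 1 multiply + 1 subtract per parameter
```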
Other optimizers require some extra work. For example, consider adaptive moment estimation (Adam). Adam's parameter update is given by:

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g^2
m_hat = m / (1 - beta1^t)
v_hat = v / (1 - beta2^t)
theta = theta - lr * m_hat / (sqrt(v_hat) + eps)

where g is the gradient and t the step count. Counting roughly 3, 4, 3, 3 and 5 FLOP for these five lines respectively gives a total of ~3 + 4 + 3 + 3 + 5 = 18 FLOP per parameter.
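A minimal element-wise sketch of one way to arrive at these counts (treating 1 - beta1 and 1 - beta2 as pre-computed; this is an illustration, not necessarily the exact accounting used by the profiler):

```python
# Element-wise Adam update for a single parameter, annotated with one way to
# arrive at the FLOP counts quoted above (1 - beta1 and 1 - beta2 pre-computed).
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                         # ~3 FLOP: 2 mult + 1 add
    v = beta2 * v + (1 - beta2) * g * g                     # ~4 FLOP: square + 2 mult + 1 add
    m_hat = m / (1 - beta1 ** t)                            # ~3 FLOP: power, subtract, divide
    v_hat = v / (1 - beta2 ** t)                            # ~3 FLOP
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)   # ~5 FLOP: sqrt, add, divide, mult, subtract
    return theta, m, v                                      # total: ~18 FLOP per parameter
```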
In any case, the choice of optimizer affects only the weight update and the amount of FLOP is proportional to the number of parameters. Since batch sizes are typically large, the difference will be small and won’t affect the backward-forward ratio much.