All of Ethan Caballero's Comments + Replies

Read Section 6 titled “The Limit of the Predictability of Scaling Behavior” in this paper: 
https://arxiv.org/abs/2210.14891

We describe how to fit a BNSL so that it yields the best extrapolation in the last paragraph of Appendix A.6, "Experimental details of fitting BNSL and determining the number of breaks", of the paper: 
https://arxiv.org/pdf/2210.14891.pdf#page=13
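
A minimal sketch of such a fit, assuming SciPy and the one-break case of the BNSL functional form (Eq. 1 of the paper). The synthetic data, initialization, and optimizer settings below are illustrative assumptions, not the exact procedure from Appendix A.6:

```python
import numpy as np
from scipy.optimize import curve_fit

def bnsl(x, a, b, c0, c1, d1, f1):
    # One-break BNSL (Eq. 1 of arXiv:2210.14891 with n = 1):
    #   y = a + b * x^(-c0) * (1 + (x / d1)^(1 / f1))^(-c1 * f1)
    return a + b * x ** (-c0) * (1.0 + (x / d1) ** (1.0 / f1)) ** (-c1 * f1)

# Synthetic stand-in for (scale, loss) measurements, e.g. x = training compute.
rng = np.random.default_rng(0)
x = np.logspace(1, 6, 30)
y = bnsl(x, a=0.1, b=5.0, c0=0.2, c1=0.3, d1=1e3, f1=0.5)
y *= np.exp(rng.normal(0.0, 0.01, x.size))  # multiplicative noise

# Fit in log space so all decades of y contribute comparably.
def log_bnsl(x, *params):
    return np.log(bnsl(x, *params))

p0 = [0.05, 1.0, 0.1, 0.1, 1e2, 1.0]  # illustrative initialization
popt, _ = curve_fit(log_bnsl, x, np.log(y), p0=p0,
                    bounds=(0.0, np.inf), maxfev=100_000)

# Extrapolate an order of magnitude beyond the fitted range.
x_future = np.logspace(6, 7, 5)
print(bnsl(x_future, *popt))
```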

Sigmoids don't accurately extrapolate the scaling behavior(s) of the performance of artificial neural networks. 

Use a Broken Neural Scaling Law (BNSL) to obtain accurate extrapolations: 
https://arxiv.org/abs/2210.14891
https://arxiv.org/pdf/2210.14891.pdf
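
As a toy illustration of the failure mode (made-up data, not from the paper): fit a logistic sigmoid in log(x) to the early portion of a curve with a slow power-law tail, and the extrapolation degrades, because the sigmoid forces a symmetric saturation that the tail does not follow:

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid_in_logx(x, lo, hi, k, x0):
    # Logistic curve in log10(x): saturates symmetrically around x0.
    return lo + (hi - lo) / (1.0 + np.exp(-k * (np.log10(x) - x0)))

# Made-up "true" curve: accuracy approaching 1.0 via a slow power-law tail.
x = np.logspace(1, 7, 40)
y_true = 1.0 - 0.9 * (1.0 + (x / 1e4) ** 0.5) ** (-0.6)

seen = x <= 1e5  # pretend only these points have been observed
popt, _ = curve_fit(sigmoid_in_logx, x[seen], y_true[seen],
                    p0=[0.1, 1.0, 1.0, 4.0], maxfev=100_000)

# The sigmoid's extrapolation diverges from the power-law tail.
gap = np.abs(sigmoid_in_logx(x[~seen], *popt) - y_true[~seen])
print(gap.max())
```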
 

2Lukas Finnveden
Interesting. Based on skimming the paper, my impression is that, to a first approximation, this would look like:

* Instead of having linear performance on the y-axis, switch to something like log(max_performance - actual_performance). (So that we get a log-log plot.)
* Then for each series of data points, look for the largest n such that the last n data points are roughly on a line. (I.e. identify the last power law segment.)
* Then to extrapolate into the future, project that line forward. (I.e. fit a power law to the last power law segment and project it forward.)

That description misses out on effects where BNSL-fitting would predict a slow, smooth shift from one power law to another, and that this gradual shift will continue into the future. I don't know how important that is. Curious for your intuition about whether or not that's important, and/or other reasons why my above description is or isn't reasonable.

When I think about applying that algorithm to the above plots, I worry that the data points are much too noisy to just extrapolate a line from the last few data points. Maybe the practical thing to do would be to assume that the 2nd half of the "sigmoid" forms a distinct power law segment, and fit a power law to the points with >~50% performance (or a lower threshold if there are too few points with >50% performance).

Which maybe suggests that the claim "BNSL does better" corresponds to a claim that the speed at which the language models improve on ~random performance (bottom part of the "sigmoid") isn't informative for how fast they converge to ~maximum performance (top part of the "sigmoid")? That seems plausible.
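
A minimal sketch of that simplified procedure, with hypothetical data (the segment length n is hard-coded here; the "largest n such that the points are roughly on a line" search is omitted):

```python
import numpy as np

# Hypothetical data: x = scale, perf = performance below a known ceiling.
x = np.logspace(2, 6, 20)
perf = 1.0 - 0.8 * (x / 100.0) ** (-0.15)
max_perf = 1.0

# Log-log transform: distance from the performance ceiling vs. scale.
log_x = np.log(x)
log_gap = np.log(max_perf - perf)

# Fit a line to the last n points (the presumed final power-law segment).
n = 8
slope, intercept = np.polyfit(log_x[-n:], log_gap[-n:], deg=1)

# Project that line forward to larger scales.
x_future = np.logspace(6, 8, 5)
gap_future = np.exp(intercept + slope * np.log(x_future))
print(max_perf - gap_future)  # extrapolated performance
```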

The setting was adversarial training and adversarial evaluation. During training, a PGD attacker with 30 iterations was used to construct the adversarial examples used for training. During testing, the evaluation set was an adversarial test set constructed via a PGD attacker with 20 iterations.
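
For concreteness, here is a generic L∞ PGD attacker sketched in PyTorch. This is the standard Madry et al. formulation with assumed hyperparameter values; it is not necessarily the exact attacker used in the cited paper:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, n_iter=30):
    # L-infinity PGD: n_iter = 30 during training and 20 during evaluation
    # in the setting described above; eps and alpha are assumptions.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(n_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        (grad,) = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()       # ascend the loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)             # keep valid pixel range
    return x_adv.detach()
```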

The experimental data for the y-axis is obtained from Table 7 of https://arxiv.org/abs/1906.03787; the experimental data for the x-axis is obtained from Figure 7 of the same paper.

"However, to the best of our knowledge there are no quantitative scaling laws for robustness yet."


For scaling laws for adversarial robustness, see Appendix A.15 of https://openreview.net/pdf?id=sckjveqlCZ#page=22

1AdamGleave
Thanks, I'd missed that! Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we'll be able to get the error rate in the next 5-10 years, given expected compute growth? Are there any follow-up experiments you'd like to see happen in this space?

Also, could you clarify whether the setting was adversarial training or just a vanilla model? "During training, adversarial examples for training are constructed by PGD attacker of 30 iterations" makes me think it's adversarial training, but I could imagine this just being used for evals.

Sections 3.1 and 6.6 ("Ossification") of the "Scaling Laws for Transfer" paper (https://arxiv.org/abs/2102.01293) show that training of current DNNs exhibits high path dependence.