
What happens to variance as neural network training is scaled? What does it imply about "lottery tickets"?

by Abram Demski

28th Jul 2020


Daniel Kokotajlo asks whether the lottery ticket hypothesis implies the scaling hypothesis.

The way I see it, this depends on the distribution of "lottery tickets" being drawn from.

If the quality of lottery tickets follows a normal distribution, then after your neural network is large enough to sample decent tickets, it will get better rather slowly as you scale it -- you have to sample a whole lot of tickets to get a really good one.
If the quality of tickets has a long upward tail, then you'll see better scaling.

However, a long tail also suggests to me that variance in results would continue to be relatively high as a network is scaled: bigger networks are hitting bigger jackpots, but since even bigger jackpots are within reach, the payoff of scaling remains chaotic.
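The contrast between the two cases can be sketched with a small simulation. This is only an illustration under assumptions that are not in the post: a network of size n is modeled as n independent ticket draws whose quality is the best draw, thin-tailed tickets are standard normal, and long-tailed tickets are Pareto with a hypothetical tail index of 3. The function names are made up for this example.

```python
import random
import statistics


def normal_ticket(rng):
    # Thin-tailed assumption: ticket quality ~ Normal(0, 1).
    return rng.gauss(0.0, 1.0)


def pareto_ticket(rng, alpha=3.0):
    # Long-tailed assumption: Pareto quality with a hypothetical tail index alpha.
    return rng.paretovariate(alpha)


def best_ticket(n_tickets, draw, rng):
    """Quality of the best of n_tickets independent draws."""
    return max(draw(rng) for _ in range(n_tickets))


def summarize(draw, sizes=(10, 100, 1000), trials=300, seed=0):
    """Return {n: (mean, stdev)} of the best ticket over repeated simulated runs."""
    rng = random.Random(seed)
    summary = {}
    for n in sizes:
        bests = [best_ticket(n, draw, rng) for _ in range(trials)]
        summary[n] = (statistics.mean(bests), statistics.stdev(bests))
    return summary


normal_summary = summarize(normal_ticket)
pareto_summary = summarize(pareto_ticket)

for name, summary in (("normal", normal_summary), ("pareto", pareto_summary)):
    for n, (mean, sd) in summary.items():
        print(f"{name:7s} n={n:5d}  mean best={mean:7.2f}  sd of best={sd:6.2f}")
```

In this toy model, the spread of the best normal ticket shrinks as n grows, while the spread of the best Pareto ticket grows along with its mean. Whether real training runs resemble either case is exactly the open question.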

(This could all benefit from a more mathematical treatment.)
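One possible starting point for such a treatment is standard extreme-value theory. The sketch below rests on assumptions that are mine rather than the post's: a network of size n draws n i.i.d. tickets X_1, ..., X_n, and its quality is the best draw M_n. The formulas are leading-order asymptotics only.

```latex
% Assumption: a network of size n draws n i.i.d. tickets and keeps the best one.
M_n = \max(X_1, \dots, X_n)

% Thin tail (standard normal tickets): the best ticket grows very slowly,
% and its spread shrinks as n grows.
\mathbb{E}[M_n] \approx \sqrt{2 \ln n}, \qquad
\operatorname{sd}(M_n) \approx \frac{\pi}{\sqrt{6}} \cdot \frac{1}{\sqrt{2 \ln n}}

% Long tail (Pareto tickets, P(X > x) = x^{-\alpha} for x \ge 1, with \alpha > 2):
% both the best ticket and its spread grow polynomially in n.
\mathbb{E}[M_n] \approx \Gamma\!\left(1 - \tfrac{1}{\alpha}\right) n^{1/\alpha}, \qquad
\operatorname{sd}(M_n) \propto n^{1/\alpha}
```

Under these assumptions, the long-tailed case keeps a roughly constant ratio of spread to mean as n grows, which fits the intuition that results stay chaotic, whereas in the normal case run-to-run variation shrinks.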

So: what do we know about NN training? Does it suggest we are living in Extremistan (long-tailed ticket quality) or Mediocristan (thin-tailed ticket quality)?

Note: a major conceptual difficulty in answering this question is representing NN quality in the right units. For example, an accuracy metric -- which necessarily falls between 0% and 100% -- must yield "diminishing returns", and cannot be host to a "long-tailed distribution". Take that same metric and send it through an inverse sigmoid (the logit function), and now you might not have diminishing returns, and could have a long-tailed distribution. But we can transform data all day. The analysis shouldn't be too ad hoc. So it's not immediately clear how to measure this.
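A small illustration of the units problem, using made-up accuracy values (they are not measurements of any real model): each step below removes 90% of the remaining errors, which looks like rapidly diminishing returns in raw accuracy but like roughly constant progress after the inverse sigmoid.

```python
import math


def logit(p):
    """Inverse of the sigmoid: maps a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


# Hypothetical accuracies for successively larger models (illustrative only).
accuracies = [0.90, 0.99, 0.999, 0.9999]

# Improvement between consecutive models, measured in each unit.
raw_gains = [b - a for a, b in zip(accuracies, accuracies[1:])]
logit_gains = [logit(b) - logit(a) for a, b in zip(accuracies, accuracies[1:])]

for a, b, raw, lg in zip(accuracies, accuracies[1:], raw_gains, logit_gains):
    print(f"{a:.4f} -> {b:.4f}: raw gain {raw:.4f}, logit gain {lg:.2f}")
```

Neither unit is obviously the right one, which is the point of the note: the shape of the returns depends on the transformation chosen.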


Evan Hubinger


Feb 23, 2021


This paper seems to be arguing that variance initially increases as network width goes up, then starts decreasing for very large networks, suggesting that overall variance is likely to decrease as we approach more advanced AI systems and networks get very large.

'Variance' is used in an amusing number of ways in these discussions. You use 'variance' in one sense (the bias-variance tradeoff), but "Explaining Neural Scaling Laws" (Bahri et al., 2021) talks about a different kind of variance limit in scaling, while the toy model in "Learning Curve Theory" (Hutter, 2021) provides statements on yet other kinds of variance about the scaling curves themselves. (I think you could also easily dig up a paper from the neural tangent kernel people about scaling approximating infinite-width models, which only need to make infinitesimally small linear updates or something like that, because variance in a different sense goes down...)

Meanwhile, my original observation was about the difficulty of connecting benchmarks to practical real-world capabilities: regardless of whether the 'variance of increases in practical real-world capabilities' goes up or down with additional scaling, we still have no good way to say that an X% increase on benchmarks ought to yield qualitatively new capability Y. Almost a year later, still no one has shown how you would have predicted in advance that pushing GPT-3 to a particular likelihood loss would yield all these cool new things. As we cannot predict that at all, it would not be of terribly much use to say whether that variance increases or decreases as we continue scaling (since either way, we may wind up being surprised).


romeostevensit

One related question is which sub-tasks of GPT-3 showed surprise jackpots relative to GPT-2.
