chinchilla's wild implications

(Colab notebook here.)

This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla.[1]

The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion.  In particular:

  • Data, not size, is the currently active constraint on language modeling performance.  Current returns to additional data are immense, and current returns to additional model size are minuscule; indeed, most recent landmark models are wastefully big.
    • If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models.
    • If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible.
  • The literature is extremely unclear on how much text data is actually available for training.  We may be "running out" of general-domain data, but the literature is too vague to know one way or the other.
  • The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available.

Some things to note at the outset:

  • This post assumes you have some familiarity with LM scaling laws.
  • As in the paper[2], I'll assume here that models never see repeated data in training.
    • This simplifies things: we don't need to draw a distinction between data size and step count, or between train loss and test loss.
  • I focus on the parametric scaling law from the paper's "Approach 3," because it provides useful intuition.
    • Keep in mind, though, that Approach 3 yielded somewhat different results from Approaches 1 and 2 (which agreed with one another, and were used to determine Chinchilla's model and data size).
    • So you should take the exact numbers below with a grain of salt.  They may be off by a few orders of magnitude (but not many orders of magnitude).

1. the scaling law

The paper fits a scaling law for LM loss L, as a function of model size N and data size D.

Its functional form is very simple, and easier to reason about than the L(N, D) law from the earlier Kaplan et al papers.  It is a sum of three terms:

L(N, D) = A / N^α + B / D^β + E

The first term only depends on the model size.  The second term only depends on the data size.  And the third term is a constant.

You can think about this as follows.

An "infinitely big" model, trained on "infinite data," would achieve loss .  To get the loss for a real model, you add on two "corrections":

  1. one for the fact that the model's only has  parameters, not infinitely many
  2. one for the fact that the model only sees  training examples, not infinitely many

Here's the same thing, with the constants fitted to DeepMind's experiments on the MassiveText dataset[3].

L(N, D) = 406.4 / N^0.34 + 410.7 / D^0.28 + 1.69

plugging in real models
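
(If you want to follow along in code, here's the fitted law as a small Python function -- a minimal sketch of the formula above, with `chinchilla_loss` just being my name for it.)

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss fit from the Chinchilla paper's Approach 3 (their eq. 10)."""
    finite_model = 406.4 / n_params ** 0.34   # penalty for having only N parameters
    finite_data  = 410.7 / n_tokens ** 0.28   # penalty for seeing only D training tokens
    irreducible  = 1.69                       # loss of an ideal model with infinite N and D
    return finite_model + finite_data + irreducible
```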

Gopher is a model with 280B parameters, trained on 300B tokens of data.  What happens if we plug in those numbers?

L(280B params, 300B tokens) ≈ 0.052 (finite model) + 0.251 (finite data) + 1.69 (irreducible) ≈ 1.993

What jumps out here is that the "finite model" term is tiny.

In terms of the impact on LM loss, Gopher's parameter count might as well be infinity.  There's a little more to gain on that front, but not much.

Scale the model up to 500B params, or 1T params, or 100T params, or as many params as you like . . . and the most this can ever do for you is a 0.052 reduction in loss[4].

Meanwhile, the "finite data" term is not tiny.  Gopher's training data size is very much not infinity, and we can go a long way by making it bigger.


Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation.

It's 70B params, trained on 1.4T tokens of data.  Let's plug that in:

L(70B params, 1.4T tokens) ≈ 0.083 (finite model) + 0.163 (finite data) + 1.69 (irreducible) ≈ 1.937

Much better![5]

Without using any more compute, we've improved the loss by 0.057.  That's bigger than Gopher's entire "finite model" term!

The paper demonstrates that Chinchilla roundly defeats Gopher on downstream tasks, as we'd expect.

Even that understates the accomplishment, though.  At least in terms of loss, Chinchilla doesn't just beat Gopher.  It beats any model trained on Gopher's data, no matter how big.

To put this in context: until this paper, it was conventional to train all large LMs on roughly 300B tokens of data.  (GPT-3 did it, and everyone else followed.)

Insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got[6].

People put immense effort into training models that big, and were working on even bigger ones, and yet none of this, in principle, could ever get as far as Chinchilla did.
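
Here's a quick check of that claim, using the chinchilla_loss sketch from above: drop the finite-model term entirely (the infinite-parameter limit) while holding the data fixed at 300B tokens, and the resulting loss still never gets down to Chinchilla's.

```python
# Best possible loss for *any* model trained on 300B tokens:
# as N -> infinity, the finite-model term vanishes.
best_with_300b_tokens = 1.69 + 410.7 / 300e9 ** 0.28   # ~1.941

gopher     = chinchilla_loss(280e9, 300e9)   # ~1.993
chinchilla = chinchilla_loss(70e9, 1.4e12)   # ~1.937

assert chinchilla < best_with_300b_tokens < gopher
```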


Here's where the various models lie on a contour plot of LM loss (per the equation), with data size D and model size N on the axes.

Only PaLM is remotely close to Chinchilla here.  (Indeed, PaLM does slightly better.)

PaLM is a huge model.  It's the largest one considered here, though MT-NLG is a close second.  Everyone writing about PaLM mentions that it has 540B parameters, and the PaLM paper does a lot of experiments on the differences between the 540B PaLM and smaller variants of it.

According to this scaling law, though, PaLM's parameter count is a mere footnote relative to PaLM's training data size.

PaLM isn't competitive with Chinchilla because it's big.  MT-NLG is almost the same size, and yet it's trapped in the pinkish-purple zone on the bottom-left, with Gopher and the rest.

No, PaLM is competitive with Chinchilla only because it was trained on more tokens (780B) than the other non-Chinchilla models.  In fact, that difference in data size accounts for 85% of the loss improvement from Gopher to PaLM.

Here's the precise breakdown for PaLM:

L(540B params, 780B tokens) ≈ 0.042 (finite model) + 0.192 (finite data) + 1.69 (irreducible) ≈ 1.924
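
And here's where that 85% figure comes from, spelled out with the chinchilla_loss sketch (just a decomposition of the equation's predictions):

```python
gopher = chinchilla_loss(280e9, 300e9)   # ~1.993
palm   = chinchilla_loss(540e9, 780e9)   # ~1.924

# How much of the Gopher -> PaLM improvement comes from the data term alone?
data_term_gopher = 410.7 / 300e9 ** 0.28   # ~0.251
data_term_palm   = 410.7 / 780e9 ** 0.28   # ~0.192

total_gain = gopher - palm                        # ~0.069
data_gain  = data_term_gopher - data_term_palm    # ~0.059

print(data_gain / total_gain)   # ~0.85, i.e. data accounts for ~85% of the gap
```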

PaLM's gains came with a great cost, though.  It used way more training compute than any previous model, and its size means it also takes a lot of inference compute to run.

Here's a visualization of loss vs. training compute (loss on the y-axis and in color as well):

Man, we spent all that compute on PaLM, and all we got was the slightest edge over Chinchilla!

Could we have done better?  In the equation just above, PaLM's terms look pretty unbalanced.  Given that compute, we probably should have used more data and trained a smaller model.

The paper tells us how to pick optimal values for params and data, given a compute budget.  Indeed, that's its main focus.

If we use its recommendations for PaLM's compute, we get the point "palm_opt" on this plot:

Ah, now we're talking!


"palm_opt" sure looks good.  But how would we train it, concretely?

Let's go back to the data-vs.-params contour plot world.

I've changed the axis limits here, to accommodate the massive data set you'd need to spend PaLM's compute optimally.

How much data would that require?  Around 6.7T tokens, or ~4.8 times as much as Chinchilla used.

Meanwhile, the resulting model would not be nearly as big as PaLM.  The optimal compute law actually puts it at 63B params[7].
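
For the curious, here's how those numbers fall out of the fitted law -- a sketch under the usual approximation that training compute is about 6 * N * D FLOPs, minimizing the loss subject to that budget:

```python
A, alpha = 406.4, 0.34   # finite-model term
B, beta  = 410.7, 0.28   # finite-data term

# PaLM's training compute, using the standard approximation C ~= 6 * N * D:
C = 6 * 540e9 * 780e9

# Minimizing A/N^alpha + B/D^beta subject to 6*N*D = C gives closed-form optima:
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
n_opt = G * (C / 6) ** (beta / (alpha + beta))   # ~6.3e10  -> ~63B params
d_opt = (C / 6) / n_opt                          # ~6.7e12  -> ~6.7T tokens

print(f"{n_opt:.2e} params, {d_opt:.2e} tokens")
```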

Okay, so we just need to get 6.7T tokens and . . . wait, how exactly are we going to get 6.7T tokens?  How much text data is there, exactly?

2. are we running out of data?

It is frustratingly hard to find an answer to this question.

The main moral I want to get across in this post is that the large LM community has not taken data scaling seriously enough.

LM papers are meticulous about N -- doing all kinds of scaling analyses on models of various sizes, etc.  There has been tons of smart discussion about the hardware and software demands of training high-N models.  The question "what would it take to get to 1T params? (or 10T?)" is on everyone's radar.

Yet, meanwhile:

  • Everyone trained their big models on 300B tokens, for no particular reason, until this paper showed how hilariously wasteful this is
  • Papers rarely do scaling analyses that vary data size -- as if the concepts of "LM scaling" and "adding more parameters" have effectively merged in people's minds
  • Papers basically never talk about what it would take to scale their datasets up by 10x or 50x
  • The data collection sections of LM papers tend to be vague and slapdash, often failing to answer basic questions like "where did you scrape these webpages from?" or "how many more could you scrape, if you wanted to?"

As a particularly egregious example, here is what the LaMDA paper says about the composition of their training data:

The pre-training data, called Infiniset, is a combination of dialog data from public dialog data and other public web documents. It consists of 2.97B documents and 1.12B dialogs with 13.39B utterances. The composition of the data is as follows: 50% dialogs data from public forums; 12.5% C4 data [11]; 12.5% code documents from sites related to programming like Q&A sites, tutorials, etc; 12.5% Wikipedia (English); 6.25% English web documents; and 6.25% Non-English web documents. The total number of words in the dataset is 1.56T.

"Dialogs data from public forums"?  Which forums?  Did you use all the forum data you could find, or only 0.01% of it, or something in between?  And why measure words instead of tokens -- unless they meant tokens?

If people were as casual about scaling N as this quotation is about scaling D, the methods sections of large LM papers would all be a few sentences long.  Instead, they tend to look like this (excerpted from ~3 pages of similar material):

From the PaLM paper

...anyway.  How much more data could we get?

This question is complicated by the fact that not all data is equally good.

(This messy Google sheet contains the calculations behind some of what I say below.)

web scrapes

If you just want a lot of text, the easiest way to get it is from web scrapes like Common Crawl.

But these are infamously full of garbage, and if you want to train a good LM, you probably want to aggressively filter them for quality.  And the papers don't tell us how much total web data they have, only how much filtered data.

MassiveWeb

The training dataset used for Gopher and Chinchilla is called MassiveText, and the web scrape portion of it is called MassiveWeb.  This data originates in a mysterious, unspecified web scrape[8], which is funneled through a series of filters, including quality heuristics and an attempt to only keep English text.

MassiveWeb is 506B tokens.  Could it be made bigger, by scaling up the original web scrape?  That depends on how complete the original web scrape was -- but we know nothing about it.

The GLaM/PaLM web corpus

PaLM used a different web scrape corpus.  It was first used in this paper about "GLaM," which again did not say anything about the original scraping process, only describing the quality filtering they did (and not in much detail).

The GLaM paper says its filtered web corpus is 143B tokens.  That's a lot smaller than MassiveWeb.  Is that because of the filtering?  Because the original scrape was smaller?  Dunno.

To further complicate matters, the PaLM authors used a variant of the GLaM dataset which made multilingual versions of (some of?) the English-only components.

How many tokens did this add?  They don't say[9].

We are told that 27% (211B) of PaLM's training tokens came from this web corpus, and we are separately told that they tried to avoid repeating data.  So the PaLM version of the GLaM web corpus is probably at least 211B, versus the original 143B.  (Though I am not very confident of that.)

Still, that's much smaller than MassiveWeb.  Is this because they had a higher quality bar (which would be bad news for further data scaling)?  They do attribute some of PaLM's success to quality filtering, citing the ablation on this in the GLaM paper[10].

It's hard to tell, but there is this ominous comment, in the section where they talk about PaLM vs. Chinchilla:

Although there is a large amount of very high-quality textual data available on the web, there is not an infinite amount. For the corpus mixing proportions chosen for PaLM, data begins to repeat in some of our subcorpora after 780B tokens, which is why we chose that as the endpoint of training. It is unclear how the “value” of repeated data compares to unseen data for large-scale language model training[11].

The subcorpora that start to repeat are probably the web and dialogue ones.

Read literally, this passage seems to suggest that even the vast web data resources available to Google Research (!) are starting to strain against the data demands of large LMs.  Is that plausible?  I don't know.

domain-specific corpora

We can speak with more confidence about text in specialized domains that's less common on the open web, since there's less of it out there, and people are more explicit about where they're getting it.

Code

If you want code, it's on Github.  There's some in other places too, but if you've exhausted Github, you probably aren't going to find orders of magnitude of additional code data.  (I think?)

We've more-or-less exhausted Github.  It's been scraped a few times with different kinds of filtering, which yielded broadly similar data sizes:

  • The Pile's scrape had 631GB[12] of text, and ~299B tokens
  • The MassiveText scrape had 3.1TB of text, and 422B tokens
  • The PaLM scrape had only 196GB of text (we aren't told how many tokens)
  • The Codex paper's scrape was python-only and had 159GB of text

(The text to token ratios vary due to differences in how whitespace was tokenized.)

All of these scrapes contained a large fraction of the total code available on Github (in the Codex paper's case, just the python code).

Generously, there might be ~1T tokens of code out there, but not vastly more than that.

Arxiv

If you want to train a model on advanced academic research in physics or mathematics, you go to Arxiv.

For example, Arxiv was about half the training data for the math-problem-solving LM Minerva.

We've exhausted Arxiv.  Both the Minerva paper and the Pile use basically all of Arxiv, and it amounts to a measly 21B tokens.

Books

Books?  What exactly are "books"?

In the Pile, "books" means the Books3 corpus, which means "all of Bibliotik."  It contains 196,640 full-text books, amounting to only 27B tokens.

In MassiveText, a mysterious subset called "books" has 560B tokens.  That's a lot more than the Pile has!  Are these all the books?  In . . . the world?  In . . . Google books?  Who even knows?

In the GLaM/PaLM dataset, an equally mysterious subset called "books" has 390B tokens.

Why is the GLaM/PaLM number so much smaller than the MassiveText number?  Is it a tokenization thing?  Both of these datasets were made by Google, so it's not like the Gopher authors have special access to some secret trove of forbidden books (I assume??).

If we want LMs to learn the kind of stuff you learn from books, and not just from the internet, this is what we have.

As with the web, it's hard to know what to make of it, because we don't know whether this is "basically all the books in the world" or just some subset that an engineer pulled at one point in time[13].

"all the data we have"

In my spreadsheet, I tried to make a rough, erring-on-generous estimate of what you'd get if you pooled together all the sub-corpora mentioned in the papers I've discussed here.

I tried to make it an overestimate, and did some extreme things like adding up both MassiveWeb and the GLaM/PaLM web corpus as though they were disjoint.

The result was ~3.2T tokens, or:

  • about 1.6x the size of MassiveText
  • about 35% of the data we would need to train palm_opt

Recall that this already contains "basically all" of the open-source code in the world, and  "basically all" of the theoretical physics papers written in the internet era -- within an order of magnitude, anyway.  In these domains, the "low-hanging fruit" of data scaling are not low-hanging at all.

what is compute? (on a further barrier to data scaling)

Here's another important comment from the PaLM paper's Chinchilla discussion.  This is about barriers to doing a head-to-head comparison experiment:

If the smaller model were trained using fewer TPU chips than the larger model, this would proportionally increase the wall-clock time of training, since the total training FLOP count is the same. If it were trained using the same number of TPU chips, it would be very difficult to maintain TPU compute efficiency without a drastic increase in batch size. The batch size of PaLM 540B is already 4M tokens, and it is unclear if even larger batch sizes would maintain sample efficiency.

In LM scaling research, all "compute" is treated as fungible.  There's one resource, and you spend it on params and steps, where compute = params * steps.

But params can be parallelized, while steps cannot.

You can take a big model and spread it (and its activations, gradients, Adam buffers, etc.) across a cluster of machines in various ways.  This is how people scale up N in practice.

But to scale up D, you have to either:

  • take more optimization steps -- an inherently serial process, which takes linearly more time as you add data, no matter how fancy your computers are
  • increase the batch size -- which tends to degrade model quality beyond a certain critical size, and current high-compute models are already pushing against that limit

Thus, it is unclear whether the "compute" you spend in high-D models is as readily available (and as bound to grow over time) as we typically imagine "compute" to be.
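
To make the serial-steps issue concrete, here's a rough calculation, taking PaLM's 4M-token batch size as given and assuming it can't be grown further:

```python
batch_size_tokens = 4e6   # PaLM 540B's batch size, per the quote above

# Each optimization step is sequential, so (at a fixed batch size) wall-clock
# training time scales linearly with the number of training tokens:
steps_palm     = 780e9 / batch_size_tokens    # ~195,000 serial steps
steps_palm_opt = 6.7e12 / batch_size_tokens   # ~1,675,000 serial steps

print(steps_palm, steps_palm_opt)
```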

If LM researchers start getting serious about scaling up data, no doubt people will think hard about this question, but that work has not yet been done.

appendix: to infinity

Earlier, I observed that Chinchilla beats a Gopher of arbitrary size.

The graph below expands on that observation, by including two variants of each model:

  • one with the finite-model term set to zero, i.e. the infinite-parameter limit
  • one with the finite-data term set to zero, i.e. the infinite-data limit

(There are two x-axes, one for data and one for params.  I included the latter so I have a place to put the infinite-data models without making an infinitely big plot.  

The dotted line is Chinchilla, to emphasize that it beats infinite-params Gopher.)

The main takeaway IMO is the size of the gap between ∞ data models and all the others.  Just another way of emphasizing how skewed these models are toward N, and away from D.

  1. ^

    Hoffmann et al. (2022), "Training Compute-Optimal Large Language Models."

  2. ^

    See their footnote 2

  3. ^

    See their equation (10)

  4. ^

    Is 0.052 a "small" amount in some absolute sense?  Not exactly, but (A) it's small compared to the loss improvements we're used to seeing from new models, and (B) small compared to the improvements possible by scaling data.

    In other words, (A) we have spent a few years plucking low-hanging fruit much bigger than this, and (B) there are more such fruit available.

  5. ^

    The two terms are still a bit imbalanced, but that's largely due to the "Approach 3 vs 1/2" nuances mentioned above.

  6. ^

    Caveat: Gopher and Chinchilla were trained on the same data distribution, but these other models were not.  Plugging them into the equation won't give us accurate loss values for the datasets they used.  Still, the datasets are close enough that the broad trend ought to be accurate.

  7. ^

    Wait, isn't that smaller than Chinchilla?

    This is another Approach 3 vs. 1/2 difference.

    Chinchilla was designed with Approaches 1/2.  Using Approach 3, like we're doing here, gives you a Chinchilla of only 33B params, which is smaller than our palm_opt's 63B.

  8. ^

    Seriously, I can't find anything about it in the Gopher paper.  Except that it was "collected in November 2020."

  9. ^

    It is not even clear that this multilingual-ization affected the web corpus at all.

    Their datasheet says they "used multilingual versions of Wikipedia and conversations data."  Read literally, this would suggest they didn't change the web corpus, only those other two.

    I also can't tell if the original GLaM web corpus was English-only to begin with, since that paper doesn't say.

  10. ^

    This ablation only compared filtered web data to completely unfiltered web data, which is not a very fine-grained signal.  (If you're interested, EleutherAI has done more extensive experiments on the impact of filtering at smaller scales.)

  11. ^

    They are being a little coy here.  The received wisdom by now is that repeating data is really bad for LMs and you should never do it.  See this paper and this one.
    EDIT 11/15/22: but see also the Galactica paper, which casts significant doubt on this claim.

  12. ^

    The Pile authors only included a subset of this in the Pile.

  13. ^

    The MassiveText datasheet says only that "the books dataset contains books from 1500 to 2008," which is not especially helpful.

comments

[-] IL

When you exhaust all the language data from text, you can start extracting language from audio and video.

As far as I know the largest public repository of audio and video is YouTube. We can do a rough back-of-the-envelope computation for how much data is in there:

  • According to some 2019 article I found, in every minute 50 hours of video are uploaded to YouTube. If we assume this was the average for the last 15 years, that gets us 200 billion minutes of video.
  • An average conversation has 150 words per minute, according to a Google search. That gets us 30T words, or 30T tokens if we assume 1 token per word (is this right?)
  • Let's say 1% of that is actually useful, so that gets us 300B tokens, which is... a lot less than I expected.

So it seems like video doesn't save us, if we just use it for the language data. We could do self-supervised learning on the video data, but for that we need to know the scaling laws for video (has anyone done that?).

I agree that this points in the direction of video becoming increasingly important.

But why assume only 1% is useful? And more importantly, why use only the language data? Even if we don't have the scaling laws, but it seems pretty clear that there's a ton of information in the non-language parts of videos that'd be useful to a general-purpose agent—almost certainly more than in the language parts. (Of course, it'll take more computation to extract the same amount of useful information from video than from text.) 

Very interesting!

There are a few things in the calculation that seem wrong to me:

  • If I did things right, 15 years * (365 days/yr) * (24 hours/day) * (60 mins/hour) * (50 youtube!hours / min) * (60 youtube!mins / youtube!hour) = 24B youtube!minutes, not 200B.
  • I'd expect much less than 100% of Youtube video time to contain speech.  I don't know what a reasonable discount for this would be, though.
  • In the opposite direction, 1% useful seems too low.  IIRC, web scrape quality pruning discards less than 99%, and this data is less messy than a web scrape.

In any case, yeah, this does not seem like a huge amount of data.  But there's enough order-of-magnitude fuzziness in the estimate that it does seem like it's worth someone's time to look into more seriously.
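
Here's the corrected back-of-the-envelope as code, with the big unknown (the usable fraction) left as a free parameter; the words-per-minute and tokens-per-word figures are the parent comment's assumptions:

```python
# ~24B youtube-minutes of video, per the corrected figure above:
minutes_of_video = 15 * 365 * 24 * 60 * 50 * 60

words_per_minute = 150   # parent comment's assumption
tokens_per_word  = 1     # parent comment's assumption (probably a bit generous)

raw_tokens = minutes_of_video * words_per_minute * tokens_per_word   # ~3.5T

# Discount for the fraction of video time that contains speech *and*
# would survive quality filtering -- nobody knows this number:
for usable_fraction in (0.01, 0.1, 0.5):
    print(usable_fraction, f"{raw_tokens * usable_fraction:.2e}")
# ~3.5e10, ~3.5e11, ~1.8e12 tokens -- tens of billions to a couple trillion.
```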

Does this imply that AGI is not as likely to emerge from language models as might have been thought? To me it looks like it's saying that the only way to get enough data would be to have the AI actively interacting in the world - getting data itself.

I definitely think it makes LM --> AGI less likely, although I didn't think it was very likely to begin with.

I'm not sure that the AI interacting with the world would help, at least with the narrow issue described here.

If we're talking about data produced by humans (perhaps solicited from them by an AI), then we're limited by the timescales of human behavior.   The data sources described in this post were produced by millions of humans writing text over the course of decades (in rough order-of-magnitude terms).

All that text was already there in the world when the current era of large LMs began, so large LMs got to benefit from it immediately, "for free."  But once it's exhausted, producing more is slow.


IMO, most people are currently overestimating the potential of large generative models -- including image models like DALLE2 -- because of this fact.

There was all this massive data already sitting around from human activity (the web, Github, "books," Instagram, Flickr, etc) long before ML compute/algorithms were anywhere near the point where they needed more data than that.

When our compute finally began to catch up with our data, we effectively spent all the "stored-up potential energy" in that data all at once, and then confused ourselves into thinking that compute was the only necessary input for the reaction.

But now compute has finally caught up with data, and it wants more.  We are forced for the first time to stop thinking of data as effectively infinite and free, and to face the reality of how much time and how many people it took to produce our huge-but-finite store of "data startup capital."


I suppose the AI's interactions with the world could involve soliciting more data of the kind it needs to improve (ie active learning), which is much more valuable per unit than generic data.

I would still be surprised if this approach could get much of anywhere without requiring solicitation-from-humans on a massive scale, but it'd be nice to see a back-of-the-envelope calculation using existing estimates of the benefit of active learning.

Something I'm unsure about (commenting from my mod-perspective but not making a mod pronouncement) is how LW should relate to posts that lay out ideas that may advance AI capabilities. 

My current understanding is that all major AI labs have already figured out the chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It's plausible to me that spreading these arguments speeds up AI timelines by 1-4 weeks on average.

It seems important to be able to talk about that and model the world, but I'm wondering if posts like this should live behind a "need to log-in" filter, maybe with a slight karma-gate, so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk.

nostalgebraist, I'm curious how you would feel about that.

so that the people who end up reading it are at least more likely to be plugged into the LW ecosystem and are also going to get exposed to arguments about AI risk.

There's also the chance that if these posts are not gated, people who previously weren't plugged into the LW ecosystem but are interested in AI find LW through articles such as this one. And then eventually also start reading other articles here and become more interested in alignment concerns.

There's also a bit of a negative stereotype among some AI researchers of alignment people as theoretical philosophers doing their own thing, entirely out of touch with what real AI is like. They might take alignment concerns a bit more seriously if they find it easy to actually find competent AI discussion on LW / Alignment Forum.

My current understanding is that all major AI labs have already figured out the chinchilla results on their own, but that younger or less in-the-loop AI orgs may have needed to run experiments that took a couple months of staff time. This post was one of the most-read posts on LW this month, and shared heavily around twitter. It's plausible to me that spreading these arguments speeds up AI timelines by 1-4 weeks on average.

What is the mechanism you're imagining for this speedup?  What happens that would not have happened without this post?

Consider that

  • The Chinchilla paper was released over four months ago, on 3/29/22.
  • It did not take long for the paper to get noticed among people interested in ML scaling, including here on LW. 

I'm struggling to imagine a situation where a relevant AI org is doing Chinchilla-like scaling experiments, yet somehow has managed to miss this paper (or to ignore/misunderstand it) for 4+ months.  The paper is not exactly a secret, and it's not even especially difficult to read as these things go.

More broadly, I doubt LW has significant leverage to decrease the overall supply of these kinds of conversations.  There are lots of venues for cutting-edge ML discussion, and the conversation is going to happen somewhere.  (See Connor's comments here.)

This is very tricky. On one hand, this may actually Streisand effect these results to greater prominence. On the other hand, at the point where people were specifically working around this to gain access to log-in gated LW resources, this would probably enhance our community status/prestige which might actually increase our influence.

Overall, I'd lean towards carefully experimenting with a log-in filter, with the notion of abandoning this idea if it doesn't seem to be achieving its goals.

It would be useful to have a more descriptive title, like "Chinchilla's implications for data bottlenecks" or something.

[-] Zvi

Scott Alexander asked things related to this, but it still seems worth being more explicit about what this perfect 1.69-loss model would be like in practice if we got there.

The correct answer is the annoyingly trivial one: "it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText."

How good is that, though?  Well, it depends entirely on how good you think transformer LMs are capable of being, in principle.

If you're Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways.  Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on MassiveText-like data), then you think the 1.69 model would be an AGI.  Both of these people are right, conditional on their other beliefs.

The key distinction here is that "1.69 loss" may not be the best achievable loss on this dataset.  It's just an estimate of the best loss achievable by this kind of model.

The question "what would a model be like, if it got the best achievable loss, period?" is more interesting, but nothing in this post or these papers really touches on it.

it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText

Transformers are Turing complete, so "model of this type" is not much of a constraint. On the other hand, I guess it's theoretically possible that some weight matrices are inaccessible to current training algorithms no matter how much compute and data we have. It seems also possible that the scaling law doesn't go on forever, but phase-transitions somewhere (maybe very far) to a new trend which goes below the "irreducible" term.