GPT-1, 2, and 3 have shown impressive scaling properties. How likely is it that, in the next five years, many other architectures will also be shown to get substantially better as they get bigger?

EDIT: I am open to discussion of better definitions of the scaling hypothesis. For example, maybe Gwern means something different here, in which case I'm also interested in that.


Answer by Rohin Shah, Jul 29, 2020

For some reason here on LW there's a huge focus on "architecture". I don't get it. Here's how I at-this-moment think of the scaling hypothesis:

Weak scaling hypothesis: For a task that has not yet been solved, if you increase data and model capacity, and tune the learning algorithm to make use of it (like, hyperparameter tuning and such, not a fundamentally new algorithm), then performance will improve.

This seems fairly uncontroversial, I think? This probably breaks down in some edge cases (e.g. if you have a 1-layer neural net that you keep making wider and wider) but seems broadly correct to me. It's mostly independent of the architecture (as long as it is possible to increase model capacity). Note also the common wisdom in ML that it's far more important what your data is than what your model / learning algorithm are.
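To make the weak hypothesis concrete, here is a minimal sketch (with entirely made-up numbers, not real GPT data) of the kind of curve it predicts: loss falling as a power law in model size, fit with a standard curve-fitting routine. The functional form and the constants are illustrative assumptions, not claims about any particular model.

```python
# Illustrative only: fit a power law L(N) = a * N**(-alpha) + c to
# hypothetical (model size, validation loss) pairs. The data is invented;
# the point is what "performance keeps improving with scale" looks like.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    # Irreducible loss c plus a term that shrinks as model size n grows.
    return a * n ** (-alpha) + c

sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])   # hypothetical parameter counts
losses = np.array([4.8, 4.1, 3.5, 3.0, 2.6])   # hypothetical validation losses

params, _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.1, 1.5], maxfev=10000)
a, alpha, c = params
print(f"fit: L(N) = {a:.2f} * N^(-{alpha:.3f}) + {c:.2f}")

# The weak hypothesis says the curve keeps bending down as you add capacity
# and data (until some edge case breaks it), not that it hits a wall.
print("extrapolated loss at 1e11 params:", power_law(1e11, *params))
```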

What the architecture can influence is where your performance starts out, and the rate at which it scales, which matters for:

Strong scaling hypothesis: (Depends on weak scaling hypothesis) There is a sufficiently difficult task T and an architecture A that we know of for that task, such that 1. "solving" T would lead to AGI, 2. it is conceptually easy to scale up the model capacity for A, 3. it is easy to get more data for T, and 4. scaling up a) model capacity and b) data will lead to "solving" T on some not-crazy timescale and resource-scale.

According to me, it is hard to find T that satisfies 1, 3 and 4b, it is trivial to satisfy 2, and hard to find an architecture that satisfies 4a. OpenAI's big contribution here is believing and demonstrating that T="predict language" might satisfy 1, 3 and 4b. I know of no other such T (though multiagent environments are a candidate).

What about 4a? According to me, it just so happens that Transformers are the best architecture for T="predict language", and so that's what we saw get scaled up, but I'd expect you'd see the same pattern of scaling (but not the same absolute performance) from other architectures as well. (For example, I suspect RNNs would also satisfy 4a.) I think the far more interesting question is whether we'll see other tasks T that could plausibly satisfy 1, 3, and 4b.
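As a toy illustration of "the same pattern of scaling, but not the same absolute performance": here are two hypothetical architectures following power laws with the same exponent but different offsets. The constants are invented for illustration, not measured from any real Transformer or RNN.

```python
# Hypothetical numbers only: two architectures that scale with the same
# exponent but different constants, i.e. the same pattern of scaling but
# different absolute performance.
import numpy as np

def loss(n, a, alpha, c):
    return a * n ** (-alpha) + c

sizes = np.logspace(6, 11, 6)  # 1M to 100B parameters

# (a, alpha, c) are made up; the shared exponent is the point here.
curves = {
    "transformer (hypothetical)": (8.0, 0.08, 1.6),
    "rnn (hypothetical)": (12.0, 0.08, 1.6),
}
for name, (a, alpha, c) in curves.items():
    print(name, np.round(loss(sizes, a, alpha, c), 2))
# On a log-log plot the two curves are parallel: the second architecture
# starts worse but improves with scale at the same rate.
```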

Thanks! It sounds like you are saying the task is more important than the architecture, so we should talk less about architectures and more about tasks.

That seems plausible to me, with the caveat that I think it's still worth talking about architecture sometimes. For example, when thinking about the safety or generalization properties of a system, the architecture might be more important, no?

If I could go back in time, I'd change the question to be about "Architecture+training setups" instead of just "architectures."

Rohin Shah, 4y
Yes, that's right. I'd be pretty surprised if this were the case after conditioning on the raw capabilities of the architecture, though I can't rule it out.

Related Questions

Answer by Gordon Seidoh Worley, 4y
Most systems eventually face scaling bottlenecks. In fact, unless your system is completely free of coordination, it has bottlenecks even if you haven't scaled far enough to hit them. Transformers definitely require some coordination: no matter how large the models are and how much parallelism their hardware supports, they still produce a single reduced output. So we should expect that there are some scaling limits on Transformers that, at some size, will prevent them from effectively taking advantage of a larger network.

Further, you point at this a bit, but most systems also experience diminishing returns on performance for additional resources because of these constraints. Transformers may just be special in that they haven't yet started hitting diminishing returns because we haven't yet run up against their coordination bottlenecks. That doesn't make them too special, though: we should expect those bottlenecks to be lying in wait somewhere, just as they are in every other system that is not coordination-free.
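One loose way to see why a coordination requirement eventually forces diminishing returns is Amdahl's law. This is an analogy, not a claim about Transformers specifically: if even a small fraction of the work is serial (requires coordination), the benefit of adding more parallel resources is capped, and the numbers below are illustrative only.

```python
# Amdahl's law sketch: speedup from N workers when some fraction of the
# work must be done serially (the "coordination" part).
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

for n in [1, 10, 100, 1000, 10000]:
    # Assume 99% of the work parallelizes; 1% is a coordination bottleneck.
    print(n, round(amdahl_speedup(0.99, n), 1))
# The speedup approaches 100x no matter how many workers you add: the 1%
# coordination cost eventually dominates, which is the flavor of bottleneck
# described above.
```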