For some reason here on LW there's a huge focus on "architecture". I don't get it. Here's how I currently think of the scaling hypothesis:
Weak scaling hypothesis: For a task that has not yet been solved, if you increase data and model capacity, and tune the learning algorithm to make use of them (like hyperparameter tuning and such, not a fundamentally new algorithm), then performance will improve.
This seems fairly uncontroversial, I think? It probably breaks down in some edge cases (e.g. if you have a 1-layer neural net that you keep making wider and wider) but seems broadly correct to me. It's mostly independent of the architecture (as long as it is possible to increase model capacity). Note also the common wisdom in ML that what your data is matters far more than what your model / learning algorithm is.
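One way to make this a bit more precise (this is just my own gloss, not a standard formalization): write test loss as a function $L(N, D)$ of parameter count $N$ and dataset size $D$, with the learning setup re-tuned at each scale. Then the weak hypothesis is roughly

$$N' > N,\ D' > D \;\Longrightarrow\; L(N', D') < L(N, D),$$

approximately, and outside degenerate cases like the ever-wider 1-layer net.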
What the architecture can influence is where your performance starts out, and the rate at which it scales, which matters for:
Strong scaling hypothesis: (Depends on weak scaling hypothesis) There is a sufficiently difficult task T and an architecture A that we know of for that task, such that 1. "solving" T would lead to AGI, 2. it is conceptually easy to scale up the model capacity for A, 3. it is easy to get more data for T, and 4. scaling up a) model capacity and b) data will lead to "solving" T on some not-crazy timescale and resource-scale.
According to me, it is hard to find a T that satisfies 1, 3, and 4b; trivial to satisfy 2; and hard to find an architecture that satisfies 4a. OpenAI's big contribution here is believing and demonstrating that T="predict language" might satisfy 1, 3, and 4b. I know of no other such T (though multiagent environments are a candidate).
What about 4a? According to me, it just so happens that Transformers are the best architecture for T="predict language", and so that's what we saw get scaled up, but I'd expect you'd see the same pattern of scaling (but not the same absolute performance) from other architectures as well. (For example, I suspect RNNs would also satisfy 4a.) I think the far more interesting question is whether we'll see other tasks T that could plausibly satisfy 1, 3, and 4b.
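To make the "same pattern of scaling, different absolute performance" picture concrete, here is a toy sketch. The power-law form and all of the constants are illustrative assumptions (loosely in the ballpark of published language-model scaling fits), not measurements of real Transformers or RNNs:

```python
# Toy illustration: two hypothetical architectures whose loss follows the same
# power-law pattern in parameter count, but with a different offset.
# All constants here are assumptions for illustration, not measured values.

def scaling_curve(n_params: float, n_c: float, alpha: float) -> float:
    """Simple power-law fit: L(N) = (n_c / N) ** alpha."""
    return (n_c / n_params) ** alpha

param_counts = [10 ** k for k in range(6, 12)]  # 1M .. 100B parameters

for n in param_counts:
    arch_a = scaling_curve(n, n_c=8.8e13, alpha=0.076)        # "architecture A": assumed constants
    arch_b = scaling_curve(n, n_c=8.8e13 * 100, alpha=0.076)  # "architecture B": same exponent, worse offset
    print(f"{n:>15,d} params   arch A loss: {arch_a:.2f}   arch B loss: {arch_b:.2f}")
```

Both curves keep improving at the same rate as you add parameters; architecture B just starts out worse everywhere. If that picture is right, the exponent (and whether the task admits a curve like this at all) matters more than which architecture you picked to ride it.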
Thanks! It sounds like you are saying the task is more important than the architecture, so we should talk less about architectures and more about tasks.
That seems plausible to me, with the caveat that I think it's still worth talking about architecture sometimes. For example, when thinking about the safety or generalization properties of a system, the architecture might be more important, no?
If I could go back in time, I'd change the question to be about "Architecture+training setups" instead of just "architectures."
GPT-1, 2, and 3 have shown impressive scaling properties. How likely is it that, in the next five years, many other architectures will also be shown to get substantially better as they get bigger? EDIT: I am open to discussion of better definitions of the scaling hypothesis. For example, maybe Gwern means something different here, in which case I'm also interested in that.