One thing that's bothering me is... Google/DeepMind aren't stupid. The transformer model was invented at Google. What has stopped them from having *already* trained such large models privately? GPT-3 isn't that strong evidence for the effectiveness of scaling transformer models; GPT-2 was already a shock and caused huge public commotion. And in fact, if you were close to building an AGI, it would make sense not to announce this to the world, especially as open research that anyone could copy or reproduce, for obvious safety and economic reasons.
Maybe there are technical issues keeping us from making large jumps in scale (i.e., we only learn how to train a 1-trillion-parameter model after we've trained a 100-billion one)?
As far as I can tell, this is what is going on: they do not have any such thing, because Google Brain (GB) and DeepMind (DM) do not believe in the scaling hypothesis the way that Sutskever, Amodei, and others at OpenAI (OA) do.
GB is entirely too practical and short-term focused to dabble in such esoteric & expensive speculation, although Quoc's group occasionally surprises you. They'll dabble in something like GShard, but mostly because they expect to be able to deploy it, or something like it, to production in Google Translate.
DM (particularly Hassabis, I'm not sure about Legg's current views) believes that AGI will require effectively replicating the human brain module by module, and that while these modules will be extremely large and expensive by contemporary standards, they still need to be invented and finetuned piece by piece, with little risk or surprise until the final assembly. That is how you get DM contraptions like Agent57 which are throwing the kitchen sink at the wall to see what sticks, and why they place such emphasis on neuroscience as inspiration and cross-fertilization for reverse-engineering the brain. When someone seems to have come up with a scalable architecture for a problem, like AlphaZero or AlphaStar, they are willing to pour on the gas to make it scale, but otherwise, incremental refinement on ALE and then DMLab is the game plan. They have been biting off and chewing pieces of the brain for a decade, and it'll probably take another decade or two of steady chewing if all goes well. Because they have locked up so much talent and have so much proprietary code and believe all of that is a major moat to any competitor trying to replicate the complicated brain, they are fairly easygoing. You will not see DM 'bet the company' on any moonshot; Google's cashflow isn't going anywhere, and slow and steady wins the race.
OA, lacking anything like DM's long-term funding from Google or its enormous headcount, is making a startup-like bet that they know an important truth which is a secret: "the scaling hypothesis is true" and so simple DRL algorithms like PPO on top of large simple architectures like RNNs or Transformers can emerge and meta-learn their way to powerful capabilities, enabling further funding for still more compute & scaling, in a virtuous cycle. And if OA is wrong to trust in the God of Straight Lines On Graphs, well, they never could compete with DM directly using DM's favored approach, and were always going to be an also-ran footnote.
While all of this hypothetically can be replicated relatively easily (never underestimate the amount of tweaking and special sauce it takes) by competitors if they wished (the necessary compute budgets are still trivial in terms of Big Science or other investments like AlphaGo or AlphaStar or Waymo, after all), said competitors lack the very most important thing, which no amount of money or GPUs can ever cure: the courage of their convictions. They are too hidebound and deeply philosophically wrong to ever admit fault and try to overtake OA until it's too late. This might seem absurd, but look at the repeated criticism of OA every time they release a new example of the scaling hypothesis, from GPT-1 to Dactyl to OA5 to GPT-2 to iGPT to GPT-3... (When faced with the choice between having to admit all their fancy hard work is a dead end, swallow the bitter lesson, and start budgeting tens of millions of dollars for compute, or instead writing a tweet explaining how, "actually, GPT-3 shows that scaling is a dead end and it's just imitation intelligence" - most people will get busy on the tweet!)
What I'll be watching for is whether orgs beyond 'the usual suspects' (MS ZeRO, Nvidia, Salesforce, Allen, DM/GB, Connor/EleutherAI, FAIR) start participating or if they continue to dismiss scaling.
Feels worth pasting in this other comment of yours from last week, which dovetails well with this:
DL so far has been easy to predict - if you bought into a specific theory of connectionism & scaling espoused by Schmidhuber, Moravec, Sutskever, and a few others, as I point out in https://www.gwern.net/newsletter/2019/13#what-progress & https://www.gwern.net/newsletter/2020/05#gpt-3 . Even the dates are more or less correct! The really surprising thing is that that particular extreme fringe lunatic theory turned out to be correct. So the question is, was everyone else wrong for the right reasons (similar to the Greeks dismissing heliocentrism for excellent reasons yet still being wrong), or wrong for the wrong reasons, and why, and how can we prevent that from happening again and spending the next decade being surprised in potentially very bad ways?
Personally, these two comments have kicked me into thinking about theories of AI in the same context as also-ran theories of physics like vortex atoms or the Great Debate. It really is striking how long one person with a major prior success to their name can push for a theory when the evidence is being stacked against it.
A bit closer to home than DM and GB, it also feels like a lot of AI safety people have missed the mark. It's hard for me to criticise too loudly because, well, 'AI anxiety' doesn't show up in my diary until June 3rd (and that's with a link to your May newsletter). But a lot of AI safety work increasingly looks like it'd help make a hypothetical kind of AI safe, rather than helping with the prosaic ones we're actually building.
I'm committing something like the peso problem here in that lots of safety work was - is - influenced by worries about the worst-case world, where something self-improving bootstraps itself out of something entirely innocuous. In that sense we're kind of fortunate that we've ended up with a bloody language model fire-alarm of all things, but I can't claim that helps me sleep at night.
I'm imagining a tiny AI Safety organization, circa 2010, that focused on how to achieve probable alignment for scaled-up versions of that year's state-of-the-art AI designs. It's interesting to ask whether that organization would have achieved more or less than MIRI has, in terms of generalizable work and in terms of field-building.
Certainly it would have resulted in a lot of work that was initially successful but ultimately dead-end. But maybe early concrete results would have attracted more talent/attention/respect/funding, and the org could have thrown that at DL once it began to win the race.
On the other hand, maybe committing to 2010's AI paradigm would have made them a laughingstock by 2015, and killed the field. Maybe the org would have too much inertia to pivot, and it would have taken away the oxygen for anyone else to do DL-compatible AI safety work. Maybe it would have stated its problems less clearly, inviting more philosophical confusion and even more hangers-on answering the wrong questions.
Or, worst, maybe it would have made a juicy target for a hostile takeover. Compare what happened to nanotechnology research (and nanotech safety research) when too much money got in too early - savvy academics and industry representatives exiled Drexler from the field he founded so that they could spend the federal dollars on regular materials science and call it nanotechnology.
a lot of AI safety work increasingly looks like it'd help make a hypothetical kind of AI safe
I think there are many reasons a researcher might still prioritize non-prosaic AI safety work. Off the top of my head:
In the Gwern quote, what does "Even the dates are more or less correct!" refer to? Which dates were predicted for what?
Look at, for example, Moravec. His extrapolation assumes that supercomputers will not be made available for AI work until AI work has already been proven successful (correct) and that AI will have to wait for hardware to become so powerful that even a grad student can afford it with ~$1k (also correct, see AlexNet). Extrapolating from ~1998, he estimates:
At the present rate, computers suitable for humanlike robots will appear in the 2020s.
Guess what year today is.
If you extrapolated those straight lines further, doesn't it mean that even small businesses will be able to afford to train their own quadrillion-parameter models just a few years after Google?
What makes you think there will be small businesses at that point, or that anyone would care what these hypothetical small businesses may or may not be doing?
Is it more than 30% likely that in the short term (say, 5 years) Google isn't wrong? If you applied massive scale to the AI algorithms of 1997, you would get better performance, but would your result be economically useful? Is it possible we're in a similar situation today, where the real-world applications of AI are already good enough and the additional performance is worth less than the money spent on extra compute? (Self-driving cars are perhaps the closest example: clearly they would be economically valuable, but what if the compute to train them cost 20 billion US dollars? Your competitors will catch up eventually; could you make enough profit in the interim to pay for that compute?)
I'd say it's at least 30% likely that's the case! But if you believe that, you'd be pants-on-head loony not to drop a billion on the 'residual' 70% chance that you'll be first to market on a world-changing trillion-dollar technology. VCs would sacrifice their firstborn for that kind of deal.
This answer likely betrays my lack of imagination, but I'm not sure what Google would use GPT-3 for. It's probably much more expensive than whatever gmail uses to predict text, and the additional accuracy might not provide much additional value.
Maybe they could sell it as a service, as part of GCP? I'm not sure how many people inside Google have the ability to sign $15M checks, you would need at least one of them to believe in a large market, and I'm personally not sure there's a large enough market for GPT-3 for it to be worth Google's time.
This is all to say, I don't think you should draw the conclusion that Google is either stupid or hiding something. They're likely focusing on finding better architectures, it seems a little early to focus on scaling up existing ones.
Text embeddings for knowledge graphs and ads is the most immediately obvious big bucks application.
GPT-3-based text embeddings should be extremely useful for creating summaries of arbitrary text (such as web pages or ad text) which can be fed into the existing Google search/ad infrastructure. (The API already has a less-known half, where you upload sets of docs and GPT-3 searches them.) Of course, they already surely use NNs for embeddings, but at Google scale, enhanced embeddings ought to be worth billions.
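To make the embedding idea concrete, here is a minimal retrieval sketch (random stand-in vectors, not Google's or OpenAI's actual pipeline): once each page or ad is summarized into a fixed-length vector, search reduces to nearest-neighbour lookup by cosine similarity.

```python
# Minimal sketch of embedding-based retrieval (hypothetical data, not Google's pipeline):
# each document is reduced to a fixed-length vector; search is nearest-neighbour lookup.
import numpy as np

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 768))      # stand-ins for GPT-3/BERT-style embeddings
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def search(query_embedding, k=5):
    """Return indices of the k documents most similar to the query by cosine similarity."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = doc_embeddings @ q
    return np.argsort(-scores)[:k]

print(search(rng.normal(size=768)))
```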
Worth noting that they already use BERT in Search. https://blog.google/products/search/search-language-understanding-bert/
Promoted to curated: I think the question of whether we are in an AI overhang is pretty obviously relevant to a lot of thinking about AI Risk, and this post covers the topic quite well. I particularly liked the use of a lot of small Fermi estimates, and how it covered a lot of ground in relatively little writing.
I also really appreciated the discussion in the comments, and felt that Gwern's comment on AI development strategies in particular helped me build a much better map of the modern ML space (though I wouldn't want it to be interpreted as a complete map of the space, just a kind of foothold that helped me get a better grasp on thinking about this).
Most of my immediate critiques are formatting related. I feel like the listed section could have used some more clarity, maybe by bolding the name for each bullet point consideration, but it flowed pretty well as is. I was also a bit concerned about there being some infohazard-like risks from promoting the idea of being in an AI overhang too much, but after talking to some more people about it, and thinking for a bit about it, decided that I don't think this post adds much additional risk (e.g. by encouraging AI companies to act on being in an overhang and try to drastically scale up models without concern for safety).
One thing we have to account for is advances in architecture, even in a world where Moore's law is dead - to what extent is memory bandwidth a constraint on model size, and so on. You could rephrase this as asking how much of an "architecture overhang" exists. One frame to view this through: in the era of Moore's law we banked a lot of parallel architectural advances because we lacked a good use case for them. We now have such a use case. So the question is how much performance is sitting in the bank, waiting to be pulled out in the next 5 years.
I don't know how seriously to take the AI ASIC people, but they are claiming very large increases in capability - on the order of 100-1000x in the next 10 years. If this is true, it's a multiplier on top of increased investment. See this response from a panel including big-wigs at NVIDIA, Google, and Cerebras about projected capabilities: https://youtu.be/E__85F_vnmU?t=4016. On top of this, one also has to account for algorithmic advancement: https://openai.com/blog/ai-and-efficiency/
Another thing to note: though by parameter count the largest modern models are ~10,000x smaller than the human brain, if one buys the parameter ≥ synapse idea (which most don't, but it's not entirely off the table), their temporal resolution is far higher. So once we get human-sized models, they may be trained almost comically faster than human minds are. On top of an architecture overhang, then, we may have a "temporal resolution overhang" too, where models as powerful as the human brain will almost certainly be trained much faster than a human mind develops. And on top of this there is an "inference overhang": because inference is much, much cheaper than training, once you are done training an economically useful model, you almost tautologically have a lot of compute left over to exploit it with.
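A back-of-the-envelope version of that comparison, with the heavy caveat that the parameter ≈ synapse mapping is contested and every number here is an order-of-magnitude guess:

```python
# Order-of-magnitude sketch (assumption-laden: parameter ~ synapse; synapse count ~ 1e14-1e15).
gpt3_params    = 175e9     # GPT-3 parameter count
brain_synapses = 1.5e15    # rough human-brain synapse count estimate
print(f"brain / GPT-3 ratio: ~{brain_synapses / gpt3_params:,.0f}x")      # ~10^4

# "Temporal resolution overhang": weeks of training wall-clock vs ~20 years of human development.
training_weeks = 4         # assumed rough wall-clock for a large training run
human_years    = 20
print(f"wall-clock speedup: ~{human_years * 52 / training_weeks:.0f}x")   # ~260x
```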
Hopefully I am just being paranoid (I am definitely more of a squib than a wizard in these domains), but I am seeing overhangs everywhere!
As an aside, though it's not mentioned in the paper, I feel like this could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something.
The GPT architecture isn't even close to being the best Transformer architecture anyway. As an example, someone benchmarked XLNet (over a year old, and with recurrence - one of the ways to break GPT's context-window bottleneck) last week, and it achieves ~10x better parameter efficiency (a 0.4b-parameter XLNet model ≈ a 5b-parameter GPT-3 model) at the few-shot meta-learning task he tried.
Expanding to 2048 BPEs probably buys GPT-3 more headroom (more useful data to learn from, and more for the meta-learning to condition on), and expanding to efficient attentions/recurrency/memory will enable even better prediction performance, with unknown meta-learning or generalization consequences.
(The problem there is the tradeoff between compute efficiency of training and better architectures. It's not obvious where you want to go: GShard, for example, takes the POV that even GPT is too fancy and slow and inefficient to train on existing hardware, and goes with the even more drastically parameter-inefficient - but efficient to train on GPUs! - mixture-of-expert small Transformers approach.)
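For readers who haven't seen the mixture-of-experts idea, here is a minimal sketch (a generic top-1-routing layer in PyTorch, with no capacity limits or load-balancing loss - a simplification for illustration, not GShard itself). The point is that parameter count grows with the number of experts while per-token compute stays roughly constant, which is exactly the parameter-inefficient-but-cheap-to-train trade described above.

```python
# Minimal mixture-of-experts layer sketch (illustrative simplification, not GShard).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)          # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)           # (tokens, n_experts)
        top_score, top_idx = scores.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # Each token only pays for one small expert, so parameters scale with
                # n_experts while per-token compute stays roughly constant.
                out[mask] = top_score[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TinyMoE()(tokens).shape)   # torch.Size([10, 64])
```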
Moore's Law is not dead. I could rant about the market dynamics that made people think otherwise, but it's easier just to point to the data.
https://docs.google.com/spreadsheets/d/1NNOqbJfcISFyMd0EsSrhppW7PT6GCfnrVGhxhLA5PVw
Moore's Law might die in the near future, but I've yet to hear a convincing argument for when or why. Even if it does die, Cerebras presumably has at least 4 node shrinks left in the short term (16nm→10nm→7nm→5nm→3nm) for a >10x density scaling, and many sister technologies (3D stacking, silicon photonics, new non-volatile memories, cheaper fab tech) are far from exhausted. One can easily imagine a 3nm Cerebras wafer coated with a few layers of Nantero's NRAM, with a few hundred of these connected together using low-latency silicon photonics. That would easily train quadrillion-parameter models, using only technology already on our roadmap.
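As a rough sanity check on that ">10x" figure (illustrative only - node names no longer map cleanly onto physical feature sizes): if transistor density roughly doubles per full node shrink, four shrinks give about 16x.

```python
# Illustrative only: assuming density roughly doubles per full shrink,
# four shrinks (16nm→10nm→7nm→5nm→3nm) give about 2**4 = 16x.
shrinks = 4
print(f"ideal density scaling: ~{2 ** shrinks}x")   # 16x, consistent with ">10x"
```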
Alas, the nature of technology is that while there are many potential avenues for revolutionary improvement, only some small fraction of them win. So it's probably wrong to look at any specific unproven technology as a given path to 10,000x scaling. But there are a lot of similarly revolutionary technologies, and so it's much harder to say they will all fail.
Is density even relevant when your computations can be run in parallel? I feel like price-performance will be the only relevant measure, even if that means slower clock cycles.
Density is important because it affects both price and communication speed. These are the fundamental roadblocks to building larger models. If you scale to too large clusters of computers, or primarily use high-density off-chip memory, you spend most of your time waiting for data to arrive in the right place.
[comment wondering about impracticality of running a 1000x scaled up GPT. But as Gwern points out, running costs are actually pretty low. So even if we spent a billion or more on training a human-level AI, running costs would still be manageable.]
As noted, the electricity cost of running GPT-3 is quite low, and even with the capital cost of GPUs being amortized in, GPT-3 likely doesn't cost dollars to run per hundred pages, so scaled up ones aren't going to cost millions to run either. (But how much would you be willing to pay for the right set of 100 pages from a legal or a novel-writing AI? "Information wants to be expensive, because the right information can change your life...") GPT-3 cost millions of dollars to train, but pennies to run.
That's the terrifying thing about NNs and what I dub the "neural net overhang": the cost to create a powerful NN is millions of times greater than the cost to run that NN. (This is not true of many paradigms, particularly ones where there's less of a distinction between training and running, but it is of NNs.) This is part of why there's a hardware overhang - once you have the hardware to create an AGI NN, you then by definition already have the hardware to run orders of magnitude more copies or more cheaply or bootstrap it into a more powerful agent.
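A rough version of that ratio, using the common ~6·N·D estimate for training FLOPs and ~2·N FLOPs per generated token for inference (illustrative numbers, not OpenAI's accounting):

```python
# Rough training-vs-inference comparison for GPT-3 (illustrative, not OpenAI's accounting).
N = 175e9                 # parameters
D = 300e9                 # training tokens
train_flops = 6 * N * D   # ~3e23 FLOPs, the standard 6*N*D estimate
flops_per_token = 2 * N   # rough forward-pass cost per generated token

pages = 100
tokens_per_page = 700     # assumed ~500 words/page at ~1.4 tokens/word
run_flops = pages * tokens_per_page * flops_per_token

print(f"train / run ratio for 100 pages: ~{train_flops / run_flops:.1e}")   # ~1e7
```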
That's the terrifying thing about NNs and what I dub the "neural net overhang": the cost to create a powerful NN is millions of times greater than the cost to run that NN.
I'm not sure why that's terrifying. It seems reassuring to me because it means that there's no way for the NN to suddenly go FOOM because it can't just quickly retrain.
But it can. That's the whole point of GPT-3! Transfer learning and meta-learning are so much faster than the baseline model training. You can 'train' GPT-3 without even any gradient steps - just examples. You pay the extremely steep upfront cost of One Big Model to Rule Them All, and then reuse it everywhere at tiny marginal cost.
With NNs, 'foom' is not merely possible, it's the default. If you train a model, then as soon as it's done you get, among other things:
the ability to run thousands of copies in parallel on the same hardware
meta-learning / transfer-learning to any related domain, cutting training requirements by orders of magnitude
model compression/distillation to train student models which are a fraction of the size, FLOPS, or latency (ratios varying widely based on task, approach, domain, acceptable performance degradation, targeted hardware etc., but often extreme like 1/100th; see the sketch after this list)
reuse of the model elsewhere to instantly power up other models (eg use of text or image embeddings for a DRL agent)
learning-by-doing/learning curve effects (highest in information technologies), so the next from-scratch model may be much cheaper (eg OA5 got a, what was it, 5x cost reduction for the second model OA trained from scratch based on the lessons of the first?)
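A minimal sketch of the distillation item above (a generic teacher-student setup in PyTorch, not any particular lab's pipeline): the student is trained to match the teacher's softened output distribution, which is why it can get away with a small fraction of the parameters.

```python
# Minimal knowledge-distillation sketch (assumed generic setup): a small "student" network
# is trained to match the softened output distribution of a large frozen "teacher".
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Toy usage: a wide teacher distilled into a narrow student on random data.
teacher = torch.nn.Sequential(torch.nn.Linear(32, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)).eval()
student = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(256, 32)
with torch.no_grad():
    teacher_logits = teacher(x)
loss = distillation_loss(student(x), teacher_logits)
loss.backward()
opt.step()
```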
It means that if there are approaches that don't need as much compute, the AI can invent them fast.
Over on Developmental Stages of GPTs, orthonormal mentions
An overhang is when you have had the ability to build transformative AI for quite some time, but you haven't because no-one's realised it's possible. Then someone does and surprise! It's a lot more capable than everyone expected.
I am worried we're in an overhang right now. I think we right now have the ability to build an orders-of-magnitude more powerful system than we already have, and I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months.
Investment Bounds
GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment, its costs are insignificant.
GPT-3 has been estimated to cost $5m in compute to train, and - looking at the author list and OpenAI's overall size - maybe another $10m in labour.
Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100x is entirely plausible right now. All that's necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability.
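Putting the post's own rough figures side by side:

```python
# Fermi comparison of GPT-3's cost against big-tech budgets (rough figures from the post).
gpt3_total_cost = 5e6 + 10e6   # ~$5m compute + ~$10m labour
big_tech_budget = 100e9        # ~$100bn/year R&D + capex across Google, Amazon, Microsoft
scaled_up_bet   = 1e9          # the "drop $1bn on a 100x scale-up" scenario

print(f"GPT-3 as share of annual budget: {gpt3_total_cost / big_tech_budget:.3%}")   # ~0.015%
print(f"$1bn scale-up as share:          {scaled_up_bet / big_tech_budget:.1%}")     # 1%
```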
A concrete example is Waymo, which is raising $2bn investment rounds - and that's for a technology with a much longer road to market.
Compute Cost
The other side of the equation is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is the performance without tensor cores being considered. Amortized over a year, this gives you about $1000/PFLOPS-day.
However, this cost is driven up an order of magnitude by NVIDIA's monopolistic cloud contracts, while performance will be higher when taking tensor cores into account. The current hardware floor is nearer to the RTX 2080 Ti's $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/PFLOPS-day. This roughly aligns with AI Impacts’ current estimates, and offers another >10x speedup to our model.
I strongly suspect other bottlenecks stop you from hitting that kind of efficiency or GPT-3 would've happened much sooner, but I still think $25/PFLOPS-day is a lower useful bound.
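Both $/PFLOPS-day figures fall out of a simple amortization calculation (reproducing the assumptions above; utilization, power, and interconnect costs are ignored):

```python
# Reproduce the $/PFLOPS-day estimates from hardware price and throughput.
def dollars_per_pflops_day(unit_price, tflops, amortization_days=365):
    """Cost of one PFLOPS-day if the device price is amortized over the given period."""
    pflops = tflops / 1000
    price_per_day = unit_price / amortization_days
    return price_per_day / pflops

print(f"V100 (no tensor cores): ${dollars_per_pflops_day(10_000, 30):.0f}/PFLOPS-day")   # ~$913
print(f"RTX 2080 Ti (tensor):   ${dollars_per_pflops_day(1_000, 125):.0f}/PFLOPS-day")   # ~$22
```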
Other Constraints
I've focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a couple of other things that could prove to be the binding constraint.
Beyond 1000x
Here we go from just pointing at big numbers and onto straight-up theorycrafting.
In all, tech investment as it is today plausibly supports another 100x-1000x scale up in the very-near-term. If we get to 1000x - 1 ZFLOPS-day per model, $1bn per model - then there are a few paths open.
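A quick consistency check on those two numbers, using the cloud-ish $1,000/PFLOPS-day rate from the Compute Cost section rather than the $25 hardware floor:

```python
# 1 ZFLOPS-day = 1e6 PFLOPS-day; at ~$1,000/PFLOPS-day that is ~$1bn per model.
pflops_days_per_zflops_day = 1e6
cloud_rate = 1_000   # $/PFLOPS-day, from the Compute Cost section
print(f"1 ZFLOPS-day ≈ ${pflops_days_per_zflops_day * cloud_rate / 1e9:.0f}bn per model")
```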
I think the key question is if by 1000x, a GPT successor is obviously superior to humans over a wide range of economic activities. If it is - and I think it's plausible that it will be - then further investment will arrive through the usual market mechanisms, until the largest models are being allocated a substantial fraction of global GDP.
On paper that leaves room for another 1000x scale-up as it reaches up to $1tn, though current market mechanisms aren't really capable of that scale of investment. Left to the market as-is, I think commoditization would kick in as the binding constraint.
That's from the perspective of the market today though. Transformative AI might enable $100tn-market-cap companies, or nation-states could pick up the torch. The Apollo Program made for a $1tn-today share of GDP, so this degree of public investment is possible in principle.
The even more extreme path is if by 1000x you've got something that can design better algorithms and better hardware. Then I think we're in the hands of Christiano's slow takeoff four-year-GDP-doubling.
That's all assuming performance continues to improve, though. If by 1000x the model is not obviously a challenger to human supremacy, then things will hopefully slow down to ye olde fashioned 2010s-Moore's-Law rates of progress and we can rest safe in the arms of something that's merely HyperGoogle.