Vladimir Nesov

How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1

Selection effect. If DeepSeek-V2.5 had been this good, we would be talking about it instead.

GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million

Original GPT-4 is 2e25 FLOPs and compute optimal, while V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT-4 was trained in 2022 on A100s and needed a lot of them, while in 2024 it could be trained on 8K H100s in BF16. DeepSeek-V3 is trained in FP8, doubling the FLOP/s, so the FLOPs of original GPT-4 could be produced in FP8 by a mere 4K H100s. DeepSeek-V3 was trained on 2K H800s, whose performance is about that of 1.5K H100s. So the cost only has to differ by about 3x, not 20x, when comparing a compute optimal variant of DeepSeek-V3 with original GPT-4, using the same hardware and training with the same floating point precision.
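A minimal sketch of the cluster-size comparison behind the ~3x figure, treating the GPU counts and the H800-to-H100 conversion above as given (they are rough estimates, not measured numbers):

```python
# Cluster sizes implied by the paragraph above (all numbers are the estimates stated there).
h100s_for_gpt4_bf16 = 8_000                      # H100s needed to train original GPT-4 in BF16 in 2024
h100s_for_gpt4_fp8 = h100s_for_gpt4_bf16 / 2     # FP8 doubles FLOP/s: ~4K H100s

# DeepSeek-V3 was trained on 2K H800s, roughly equivalent to 1.5K H100s:
v3_h100_equivalent = 1_500

print(h100s_for_gpt4_fp8 / v3_h100_equivalent)   # ~2.7, the "about 3x" cost gap above
```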

The relevant comparison is with GPT-4o though, not original GPT-4. Since GPT-4o was trained in late 2023 or early 2024, there were 30K H100s clusters around, which makes 8e25 FLOPs of raw compute plausible (assuming it's in BF16). It might be overtrained, so make that 4e25 FLOPs for a compute optimal model with the same architecture. Thus when comparing architectures alone, GPT-4o probably uses about 15x more compute than DeepSeek-V3.
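And the GPT-4o comparison, again only restating the paragraph's assumed numbers:

```python
gpt4o_raw = 8e25               # plausible raw compute on 30K H100s clusters, BF16
gpt4o_optimal = gpt4o_raw / 2  # halved to adjust for likely overtraining
v3_optimal = 3e24              # compute optimal DeepSeek-V3 variant from above

print(gpt4o_optimal / v3_optimal)   # ~13, consistent with the rough "about 15x" above
```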

toy model ... f(x) = Ax and g(x) = Bx, where x is the compute invested

Returns on compute are logarithmic though: the advantage of a $150 billion training system over a $150 million one is merely twice that of $150 billion over $5 billion, or of $5 billion over $150 million. Restrictions on access to compute can only be overcome with 30x compute multipliers, and in any case DeepSeek-V3 is going to be reproduced using the big compute of US training systems shortly, so that advantage is already gone.
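A quick check of the log-returns claim, taking the base-10 log of the dollar ratio as the measure of advantage (an assumption for the sake of illustration):

```python
import math

# Orders of magnitude separating training systems of different cost.
def decades(big, small):
    return math.log10(big / small)

print(decades(150e9, 150e6))   # 3.0: $150B vs $150M
print(decades(150e9, 5e9))     # ~1.5: $150B vs $5B
print(decades(5e9, 150e6))     # ~1.5: $5B vs $150M
```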


  1. That is, raw utilized compute. I'm assuming the same compute utilization for all models. ↩︎

IsoFLOP curves for the dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of the Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).

GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla's compute optimal 20 tokens/parameter is approximately correct for GPT-3, it's 10x undertrained. Interpolating from the above 30x overtraining example, a compute optimal model needs about 1.5e23 FLOPs to get the same perplexity.
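One way to make the interpolation concrete; the power-law form of the penalty is my own assumption, calibrated to the 30x example above:

```python
import math

# Assume the compute penalty for a k-times tokens/parameter mismatch scales as k**alpha,
# calibrated so that a 30x mismatch costs the ~10/3 extra compute derived above.
alpha = math.log(10 / 3) / math.log(30)   # ~0.35

gpt3_flops = 3e23                         # ~6 * 175e9 parameters * 300e9 tokens
mismatch = 10                             # ~20 tokens/parameter optimal vs ~1.7 actual
penalty = mismatch ** alpha               # ~2.3x

print(gpt3_flops / penalty)               # ~1.3e23, roughly the 1.5e23 figure above
```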

(The effect from undertraining of GPT-3 turns out to be quite small, reducing effective compute by only 2x. It probably wasn't worth mentioning compared to everything else about GPT-3 that's different from GPT-4.)

Given some amount of compute, a compute optimal model tries to get the best perplexity out of it when training on a given dataset, by choosing model size, amount of data, and architecture. An algorithmic improvement in pretraining enables getting the same perplexity by training on data from the same dataset with less compute, achieving better compute efficiency (measured as its compute multiplier).

Many models aren't trained compute optimally; they are instead overtrained (the model is smaller and trained on more data). This looks impressive, since a smaller model is now much better, but it is not an improvement in compute efficiency and doesn't in any way indicate that it became possible to train a better compute optimal model with a given amount of compute. The data and post-training also got better recently, which creates the illusion of algorithmic progress in pretraining, but their effect is bounded (as long as RL doesn't take off) and doesn't keep improving according to pretraining scaling laws once much more data becomes necessary. There is enough data until 2026-2028, but not enough good data.

I don't think the cumulative compute multiplier since GPT-4 is that high; I'm guessing 3x, except perhaps for DeepSeek-V3, which wasn't trained compute optimally and didn't use a lot of compute, so it remains unknown what happens if its recipe is used compute optimally with more compute.

The amount of raw compute since original GPT-4 only increased maybe 5x, from 2e25 FLOPs to about 1e26 FLOPs, and it's unclear if any compute optimal models were trained on notably more compute than original GPT-4. We know Llama-3-405B is compute optimal, but it's not MoE, so it has lower compute efficiency and only used 4e25 FLOPs. Claude 3 Opus is probably compute optimal, but it's unclear if it used a lot of compute compared to original GPT-4.

If there were a 6e25 FLOPs compute optimal model with a 3x compute multiplier over GPT-4, it would therefore only be trained with 9x more effective compute than original GPT-4. The 100K H100s clusters have likely recently trained a new generation of base models for about 3e26 FLOPs, possibly a 45x improvement in effective compute over original GPT-4, but there's no word on whether any of them were compute optimal (except perhaps Claude 3.5 Opus), and it's unclear if there is an actual 3x compute multiplier over GPT-4 that made it all the way into pretraining of frontier models. Also, while waiting for NVL72 GB200s (which are much better at inference for larger models), non-Google labs might want to delay deploying compute optimal models in the 1e26-5e26 FLOPs range until later in 2025.
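The effective compute arithmetic in this paragraph, spelled out (the 3x multiplier is the guess from above):

```python
gpt4_flops = 2e25
multiplier = 3                     # guessed cumulative compute multiplier over original GPT-4

for raw_flops in (6e25, 3e26):     # hypothetical compute optimal model; 100K H100s generation
    print(raw_flops * multiplier / gpt4_flops)   # 9x and 45x effective compute
```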

Comparing GPT-3 to GPT-4 gives very little signal on how much of the improvement is from compute, and so on how much should be expected beyond GPT-4 from more compute. While modern models make good use of not being compute optimal by using fewer active parameters, GPT-3 was instead undertrained, being both larger and less performant than the hypothetical compute optimal alternative. It also wasn't a MoE model. And most of the bounded low-hanging fruit that is not about pretraining efficiency wasn't applied to it.

So the currently deployed models don't demonstrate the results of the experiment of training a much more compute efficient model on much more compute. And the previous leaps in capability are in large part explained by things that are not improvements in compute efficiency or increases in the amount of compute. But in 2026-2027, 1 GW training systems will train models with 250x the compute of original GPT-4. And probably in 2028-2029, 5 GW training systems will train models with 2500x the raw compute of original GPT-4. With a compute multiplier of 5x-10x from algorithmic improvements plausible by that time, we get 10,000x-25,000x original GPT-4 in effective compute. This is enough of a leap that the lack of significant improvement from only 9x in currently deployed models (or 20x-45x in non-deployed newer models, rumored to be underwhelming) is not a strong indication of what happens by 2028-2029 (from scaling of pretraining alone).
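The same arithmetic for the 2026-2029 projection, using only the rough multiples stated above:

```python
# 1 GW systems (2026-2027): ~250x the raw compute of original GPT-4.
# 5 GW systems (2028-2029): ~2500x the raw compute of original GPT-4.
raw_multiple_2028 = 2500

for algo_multiplier in (5, 10):                 # plausible compute multipliers by then
    print(raw_multiple_2028 * algo_multiplier)  # 12,500 and 25,000: the ~10,000x-25,000x range
```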

As argued briefly in the section on FDT, the embedded agency frame may not have a clean mathematical decision theory.

I think most FDT/embeddedness weirdness is about explaining the environment using bounded computations that are not (necessarily) literally already found in the environment as part of it. It's not about sharing the actual source code, just about any information on what's going on, captured in the form of computations that are known to have captured that information before they are carried out. Things like static program analysis and deep learning models try to do this, but don't confront the weirdness of FDT/embeddedness.

Solomonoff induction is a very clean way of doing something like this, but it doesn't go into decision theory. AIXI is closest to both doing it cleanly and confronting the weirdness, but something basic might be missing to make it applicable, something that should be possible to fix.

I don't think Phi-4 offers convincing evidence either way. You can push performance on verifiable tasks quite far without the model becoming generally more capable. AlphaZero doesn't imply that scaling with its methods gestures at general superintelligence, and the same goes for Phi-4.

In contrast, using o1-like training as a way to better access ground truth in less tractable domains seems more promising, since by some accounts its tactics on long reasoning traces work even in non-technical domains (unlike for DeepSeek R1), possibly because they are emergent rather than directly encouraged with task-specific training.

From footnote 2 to The state of AI today:

GPT-2 cost an estimated $43,000 to train in 2019; today it is possible to train a 124M parameter GPT-2 for $20 in 90 minutes.

Isn't $43,000 the estimate for the 1.5B parameter replication of GPT-2 rather than for the 124M one? If so, this phrasing is somewhat misleading. It only takes about $250 even for the 1.5B version, but still.

From chapter The state of AI today:

Later this year, the first 100,000 GPU cluster will go online

It's not the first: there's the xAI cluster from September, and likely a Microsoft cluster from May.

Even the cited article from The Information says about the Meta cluster in question that

The previously unreported cluster, which could be fully completed by October or November, comes as two other companies have touted their own.

From chapter The state of AI today:

The most likely and proximal blocker is power consumption (data-centers training modern AIs use enormous amounts of electricity, up to the equivalent of the yearly consumption of 1000 average US households) and ...

Clusters like xAI's Memphis datacenter with 100K H100s consume about 150 megawatts. An average US household consumes 10,800 kilowatt-hours a year, which is 1.23 kilowatts on average. So the power consumption of a 100K H100s cluster is equivalent to that of 121,000 average US households, not 1,000. If we take the cluster of 16K H100s that trained Llama-3-405B, that's still 24 megawatts, equivalent to 19,000 average US households.
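The household arithmetic, for reference (the 150 MW and 10,800 kWh/year figures are the estimates used above):

```python
cluster_power_w = 150e6                        # ~150 MW for a 100K H100s cluster
household_avg_w = 10_800 * 1000 / (365 * 24)   # 10,800 kWh/year -> ~1.23 kW average draw

print(cluster_power_w / household_avg_w)             # ~120,000 households for 100K H100s
print(16 / 100 * cluster_power_w / household_avg_w)  # ~19,500 households for 16K H100s (~24 MW)
```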

So you likely mean the amount of energy (as opposed to power) consumed in training a model ("yearly consumption of 1000 average US households"). A cluster's total power comes to about 1,500 watts per H100, and each GPU at 40% compute utilization produces 0.4e15 FLOP/s of useful dense BF16 compute. Thus about 3.75e-12 joules are expended per FLOP that goes into training a model. For the 4e25 FLOPs of Llama-3-405B, that's 1.5e14 joules, or 41e6 kilowatt-hours, which is what 3,800 average US households consume in a year[1].
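And the per-FLOP energy estimate used for the Llama-3-405B figure:

```python
watts_per_gpu = 1500                    # all-in datacenter power per H100
useful_flops_per_s = 0.4e15             # dense BF16 FLOP/s at 40% utilization
joules_per_flop = watts_per_gpu / useful_flops_per_s   # ~3.75e-12 J/FLOP

llama3_flops = 4e25
energy_kwh = llama3_flops * joules_per_flop / 3.6e6    # joules -> kWh
print(energy_kwh)                       # ~4.2e7 kWh
print(energy_kwh / 10_800)              # ~3,860, i.e. the ~3,800 households above
```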

This interpretation fits the numbers better, but it's a bit confusing, since the model is trained for much less than a year, while the clusters will go on consuming energy all year long. And the power constraints that are a plausible proximal blocker of scaling are about power, not energy.


  1. If we instead take the 2e25 FLOPs attributed to original GPT-4, and 700 watts for a single H100, while ignoring the surrounding machinery of a datacenter (even though you are talking about what a datacenter consumes in this quote, so this is an incorrect way of estimating energy consumption), and train on H100s (instead of the A100s used for original GPT-4), then this gives 9.7e6 kilowatt-hours, or the yearly consumption of 900 average US households. With A100s, we instead have 400 watts and 0.3e15 FLOP/s (becoming 0.12e15 FLOP/s at 40% utilization), which gets us 18.5e6 kilowatt-hours for a 2e25 FLOPs model, or the yearly consumption of 1,700 average US households (again, ignoring the rest of the datacenter, which is not the correct thing to do). ↩︎
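A sketch of the footnote's two GPU-only variants, same method as above but with the GPU-only wattage and utilization figures from the footnote:

```python
def households(model_flops, gpu_watts, useful_flops_per_s):
    # GPU-only energy estimate, ignoring the rest of the datacenter (as the footnote does).
    kwh = model_flops * gpu_watts / useful_flops_per_s / 3.6e6
    return kwh / 10_800                 # yearly consumption of an average US household

print(households(2e25, 700, 0.4e15))    # ~900 households (H100s)
print(households(2e25, 400, 0.12e15))   # ~1,700 households (A100s)
```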

Health itself is an inspiring concept at a technical level, because it is meaningful at many scales of organization at once: healthy cells, healthy organs, healthy people, healthy families, healthy communities, healthy businesses, healthy countries, and (dare I say) healthy civilizations all have certain features in common, to do with self-sustenance, harmony with others, and flexible but functional boundaries.

Healthcare in this general sense is highly relevant to machines. Conversely, sufficient tech to upload/backup/instantiate humans makes biology-specific healthcare (including life extension) mostly superfluous.

The key property of machines is their initial advantage in scalability, which quickly makes anything human-specific tiny and easily ignorable in comparison, however you taxonomize the distinction. Humans persevere only if scalable machine sources of power (care to) lend us the benefits of their scale. Intent alignment, for example, would need to be able to harness a significant fraction of machine intent (rather than being centrally about human intent).

The data wall discussion in the podcast applies Chinchilla's 20 tokens/parameter too broadly and doesn't account for repetition of data in training. These issues partially cancel out, but new information on these ingredients would affect the amended argument differently. I wrote up the argument as a new post.
