Vladimir Nesov

From footnote 2 to The state of AI today:

GPT-2 cost an estimated $43,000 to train in 2019; today it is possible to train a 124M parameter GPT-2 for $20 in 90 minutes.

Isn't $43,000 the estimate for the 1.5B replication of GPT-2 rather than for the 124M? If so, this phrasing is somewhat misleading. We only need $250 even for the 1.5B version, but still.

From chapter The state of AI today:

Later this year, the first 100,000 GPU cluster will go online

It's not the first: there's the xAI cluster from September, and likely a Microsoft cluster from May.

Even the cited article from The Information says about the Meta cluster in question that

The previously unreported cluster, which could be fully completed by October or November, comes as two other companies have touted their own.

From chapter The state of AI today:

The most likely and proximal blocker is power consumption (data-centers training modern AIs use enormous amounts of electricity, up to the equivalent of the yearly consumption of 1000 average US households) and ...

Clusters like xAI's Memphis datacenter with 100K H100s consume about 150 megawatts. An average US household consumes 10,800 kilowatt-hours a year, which is 1.23 kilowatts on average. So the power consumption of a 100K H100 cluster is equivalent to that of 121,000 average US households, not 1,000. If we instead take the 16K H100 cluster that trained Llama-3-405B, that's still 24 megawatts, equivalent to 19,000 average US households.
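
As a quick sanity check, the same comparison in a few lines of Python (a rough sketch using the figures above; the household average is derived from the 10,800 kWh/year number):

```python
# Rough check of the power comparison, using the same figures as above.
household_kwh_per_year = 10_800
household_avg_watts = household_kwh_per_year * 1000 / (365.25 * 24)  # ~1,230 W

clusters_watts = {
    "100K H100s (xAI Memphis)": 150e6,  # ~150 MW
    "16K H100s (Llama-3-405B)": 24e6,   # ~24 MW
}
for name, watts in clusters_watts.items():
    print(f"{name}: ~{watts / household_avg_watts:,.0f} average US households")
# -> roughly 120,000 and 19,000 average US households
```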

So you likely mean the amount of energy (as opposed to power) consumed in training a model ("yearly consumption of 1000 average US households"). Of the total power drawn by an H100 cluster, about 1500 watts falls to each GPU, and that GPU at 40% compute utilization produces 0.4e15 FLOP/s of useful dense BF16 compute. Thus about 3.75e-12 joules is expended per FLOP that goes into training a model. For the 4e25 FLOPs of Llama-3-405B, that's 1.5e14 joules, or 41e6 kilowatt-hours, which is what 3,800 average US households consume in a year[1].
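
And the energy version of the calculation, as a rough sketch with the same assumptions (1500 W per GPU all-in, 40% compute utilization on H100s):

```python
# Energy (rather than power) needed to train a model, same figures as above.
watts_per_gpu = 1500            # all-in datacenter power per H100
useful_flops_per_sec = 0.4e15   # dense BF16 FLOP/s at 40% utilization
joules_per_flop = watts_per_gpu / useful_flops_per_sec   # ~3.75e-12 J/FLOP

training_flops = 4e25           # Llama-3-405B
energy_joules = training_flops * joules_per_flop          # ~1.5e14 J
energy_kwh = energy_joules / 3.6e6                        # ~4.2e7 kWh

household_kwh_per_year = 10_800
print(f"~{energy_kwh:.1e} kWh, or ~{energy_kwh / household_kwh_per_year:,.0f} "
      f"household-years of electricity")
# -> about 3,860 household-years, matching the ~3,800 figure above up to rounding
```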

This interpretation fits the numbers better, but it's a bit confusing, since the model is trained for much less than a year, while the clusters will go on consuming their energy all year long. And the power constraints that are a plausible proximal blocker of scaling are about power, not energy.


  1. If we instead take the 2e25 FLOPs attributed to the original GPT-4, and the 700 watts of a single H100, while ignoring the surrounding machinery of a datacenter (even though you are talking about what a datacenter consumes in this quote, so this is an incorrect way of estimating energy consumption), and train on H100s (instead of the A100s used for the original GPT-4), then this gives 9.7e6 kilowatt-hours, or the yearly consumption of 900 average US households. With A100s, we instead have 400 watts and 0.3e15 FLOP/s (becoming 0.12e15 FLOP/s at 40% utilization), which gets us 18.5e6 kilowatt-hours for a 2e25 FLOPs model, or the yearly consumption of 1,700 average US households (again, ignoring the rest of the datacenter, which is not the correct thing to do). ↩︎

Health itself is an inspiring concept at a technical level, because it is meaningful at many scales of organization at once: healthy cells, healthy organs, healthy people, healthy families, healthy communities, healthy businesses, healthy countries, and (dare I say) healthy civilizations all have certain features in common, to do with self-sustenance, harmony with others, and flexible but functional boundaries.

Healthcare in this general sense is highly relevant to machines. Conversely, sufficient tech to upload/backup/instantiate humans makes biology-specific healthcare (including life extension) mostly superfluous.

The key property of machines is an initial advantage in scalability, which quickly makes anything human-specific tiny and easily ignorable in comparison, however you taxonomize the distinction. Humans persevere only if scalable machine sources of power (care to) lend us the benefits of their scale. Intent alignment, for example, would need to be able to harness a significant fraction of machine intent (rather than being centrally about human intent).

The data wall discussion in the podcast applies Chinchilla's 20 tokens/parameter too broadly and doesn't account for repetition of data in training. These issues partially cancel out, but new information on these ingredients would affect the amended argument differently. I wrote up the argument as a new post.
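
For concreteness, here is the kind of accounting involved, as a minimal sketch assuming the standard C ≈ 6ND approximation, the Chinchilla ~20 tokens/parameter ratio, and that data can be repeated for a few epochs without losing much value; the specific numbers are illustrative, not the ones from the post:

```python
# Illustrative accounting only; the epoch allowance is my assumption,
# roughly in line with data-constrained scaling results.
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal params/tokens under C ~ 6*N*D with D ~ 20*N."""
    params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    tokens = tokens_per_param * params
    return params, tokens

def unique_tokens_needed(training_tokens, epochs=4.0):
    """Unique data needed if tokens can be repeated for a few epochs
    without losing much of their value (an assumption)."""
    return training_tokens / epochs

for compute in (4e25, 5e26, 1e28):  # example compute budgets in FLOP
    n, d = chinchilla_optimal(compute)
    print(f"C={compute:.0e}: N ~ {n:.1e} params, D ~ {d:.1e} tokens, "
          f"unique data ~ {unique_tokens_needed(d):.1e} tokens")
```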

These posts might be relevant:

The details of Constitutional AI seem highly contingent, while the general idea is simply automation of data for post-training, so that the remaining external input is the "constitution". In the original paper there are recipes both for instruction tuning data and for preference data. RLAIF is essentially RLHF that runs on synthetic preference data, maybe together with a recipe for generating it. But preference data could also be used to run DPO or something else, in which case RLAIF becomes a misnomer for describing automation of that preference data.
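
A minimal sketch of that general idea (the function names and prompt formats are illustrative placeholders, not the recipe from the original paper), with `model` standing in for a call to the model being post-trained or to a judge model:

```python
# Synthetic preference data: the only external input is the "constitution".
CONSTITUTION = "Pick the response that is more helpful, honest, and harmless."

def generate_pair(model, prompt):
    # In practice: sample two responses with temperature > 0.
    return model(prompt), model(prompt)

def judge(model, prompt, a, b, constitution=CONSTITUTION):
    # Ask a model which response better follows the constitution.
    verdict = model(f"{constitution}\n\nPrompt: {prompt}\n\nA: {a}\n\nB: {b}\n\n"
                    f"Which is better, A or B? Answer with one letter.")
    return "a" if verdict.strip().upper().startswith("A") else "b"

def make_preference_data(model, prompts):
    """Build (prompt, chosen, rejected) pairs with no human labels.
    The result can train a reward model for RLHF-style RL ("RLAIF"),
    or be used directly for DPO."""
    data = []
    for prompt in prompts:
        a, b = generate_pair(model, prompt)
        winner = judge(model, prompt, a, b)
        chosen, rejected = (a, b) if winner == "a" else (b, a)
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data
```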

The Llama 3 report suggests that instruction tuning data can be largely automated, but human preference data is still better. And the data foundry business is still alive, so a lot of human data is at least not widely recognized as useless. But it's unclear if future models won't soon do better than humans at labeling, or possibly already do better at some leading labs. Meta didn't have a GPT-4 level model as a starting point before Llama 3, and then there are the upcoming 5e26 FLOPs models, and o1-like reasoning models.

For counterlogical mugging, it's unclear if it should be possible to correctly discover the parity of the relevant digit of pi. I would expect that in the counterfactual where it's even, it will eventually be discovered to be even, and in the counterfactual where it's odd, that same digit will eventually be discovered to be odd.

ASP and Transparent Newcomb might be closer to test cases for formulating updateless policies that have the character of getting better as they grow more powerful. These problems ask the agent to use a decision procedure that intentionally doesn't take certain information into account, whether the agent as a whole has access to that information or not. But they lack future steps that would let that decision procedure benefit from eventually getting stronger than the agent that initially formulated it, so these aren't quite the thought experiments needed here.

Updatelessness is about coordination between possible versions of an agent. Coordination with more distant versions of an agent gets more difficult or less informative, and a stronger version of an agent can reach further. This results in many local commitments that coordinate more related versions of an agent.

These local commitments, as agents in their own right, can grow stronger and should themselves coordinate with each other, where their parents failed to reach. Commitment to a strategy that won't itself engage in future rounds of coordination with its alternative possible forms (and other things) is a bad commitment.

I think the FDT dictum of treating an agent like an abstract algorithm rather than any given physical instance of it ("I am an algorithm") extends to treating goals as about the collective abstract consequences of behavior of abstract algorithms (other algorithms, that are not necessarily the agent) rather than of any given incarnation of those algorithms or consequences in any given incarnation, such as the physical consequences of running algorithms on computers in a physical world.

In this ontology, goals are not about optimizing configurations of the world, they are about optimizing behaviors of abstract algorithms or optimizing properties of mathematical structures. Physically, this predicts computronium (to run acausal interactions with all the abstract things, in order to influence their properties and behaviors) and anti-predicts squiggles or any such focus on the physical form of what's going on, other than efficiency at accessing more computation.

if you assign an extremely low credence to that scenario, then whatever

I don't assign low credence to the scenario where LLMs don't scale to AGI (and my point doesn't depend on this). I assign low credence to the scenario where it's knowable today that LLMs very likely won't scale to AGI. That is, that there is a thing I could study that should change my mind on this. This is more of a crux than the question as a whole, studying that thing would be actionable if I knew what it is.

whether or not LLMs will scale to AGI

This wording mostly answers one of my questions: I'm now guessing that you would say that LLMs are (in hindsight) "the right kind of algorithm" if the scenario I described comes to pass, which wasn't clear to me from the post.
