Can governments quickly and cheaply slow AI training?

joshc

I originally wrote this as a private doc for people working in the field - it's not super polished, or optimized for a broad audience.

But I'm publishing anyway because inference-verification is a new and exciting area, and there few birds-eye-view explainers of what's going on and what the bottlenecks are.

Tl;dr: At least one of the following would need to be implemented for me to be confident that inference verification would substantially slow training given today's algorithms:

Proof of work or proof of memory that accounts for > 95% of computation.
Memory wipes every few minutes.
Output re-computation that reduces covert channel capacity below 0.01%.

To my knowledge, no one has prototyped verification demos that reach these thresholds; so whether rapidly-implementable inference verification is feasible is still an open question.

1. Summary

I think powerful AI will be obviously scary at some point, and companies or governments might want to slow it down to buy time for additional safety or oversight. Maybe this could be done quickly, e.g. by:

Unplugging inter-rack cables to slow gradient syncs
Limiting bandwidth with simple devices
Periodically erasing clusters to delete covert training checkpoints
Recomputing a sample of outputs to confirm they are, in fact, inference generations

(Section 2)

Would these methods actually work? Or more specifically, if these methods were implemented quickly and correctly, would they substantially slow AI development?

I looked into this question for around a week, and here are my current views:

Current prototypes of inference-verification would probably be ineffective. Standard inference-verification measures slow training by restricting communication between servers (see Section 2), since training involves chucking big gradients around in a hivemind, and inference just involves exchanging text. However, communication restrictions might not actually slow AI training much, especially if that training is reinforcement learning (RL).

RL doesn’t require much communication. A large fraction of RL is inference - for example, generating rollouts of agent actions and scoring them - and inference can still be performed under communication constraints. Developers could use 95% of their compute (the compute that’s under verification) to generate RL rollouts, and 5% of compute in covert data centers to calculate training updates. This might allow training to continue at a similar efficiency as before (Section 3).

But existing verification proposals would create some hurdles that would make this strategy more difficult. For example, developers would need to frequently upload new checkpoints to inference servers, which could be slow given bandwidth limits. And developers might still need to send some fraction of rollouts through verifier network taps - even though these rollouts don’t actually match expected model outputs. But as I discuss in Section 3.4, these constraints are probably feasible to work around given the slack in existing inference verification prototypes.

More aggressive measures could probably buy at least 1 year if implemented sufficiently early. These more aggressive measures include:

Proof of work or proof of memory that accounts for > 95% of computation, which would prevent developers from using more than 5% of their compute to secretly generate RL rollouts.
Memory wipes every few minutes, which would erase new checkpoints before they can be fully uploaded to inference servers (Section 4).
Output recomputation that reduces covert channel capacity below 0.01%, which would prevent developers from exfiltrating covert RL rollout generations through network taps (Section 4).

Any one of the above would probably make training given current algorithms infeasible. My guess is that developing algorithms to work around these constraints would take at least a year absent rapid AI-driven R&D (if working around them is possible at all).

A year could be enough. One year into an AI agreement, governments might have had two years to develop verification in total, since the measures above might require around a year to implement. And after two years of working on verification mechanisms, governments might be able to implement general purpose on-chip security features that are more robust.

But if an agreement starts during an intelligence explosion, all bets are off. If developers have powerful AI that can create new algorithms quickly, I see no fundamental reason why bandwidth limits, memory wipes, etc would necessarily slow AI research (Section 5), but they might anyway.

So I think the main focus of verification research should be on de-risking these three verification measures (reliable proof of work/memory, frequent memory wipes, and highly accurate output re-computation). Without these measures, I think training would hardly slow at all. With them, training would probably slow down given current algorithms, and would possibly slow down during an intelligence explosion (but that’s unclear).

None of these measures are obviously feasible to implement on short notice; so I think whether rapidly-retrofitable inference verification is viable is still an open question.

The next sections defend these claims.

Edit: After thinking about this more and talking to more people, I think developing these measures on short notice is feasible, and a subset of them will probably be prototyped within the next few months. In particular, “proof of memory” and “output re-computation” are probably workable.

2. Ways to quickly and cheaply slow training by restricting communication

During training, models share their learnings in a big hive-mind. There’s a lot of communication going on.

Here’s what training looks like:

During inference, models still talk to each other, but their communications are thin and sparse:

Could governments exploit this difference in communication to slow down AI training without taxing inference?

Here’s an overview of different verification methods and how practical I think they are:

Method	Can be securely implemented with 1 year of lead time?
Remove interconnect	Yes
Bandwidth limiters	Yes (to 100 Kb/s)
Output recomputation	Probably, though unclear if covert channels can be reduced below 1% of traffic
Memory wipes	Maybe impractical. More than 99.99% of memory has to be erased on a server rack to reliably wipe checkpoints, and this might not be achievable given uncertainties about developer hardware, firmware opaque memory, etc.
Proof of work / memory	Probably practical, but has lots of snags, and a robust version hasn’t been demonstrated yet (I will post something on this shortly!)

2.1. Method #1: Disconnect inter-rack high-speed cables

One way to reduce communication is to disconnect the high-speed inter-rack network, like InfiniBand cables. These cables transmit at ~400 GB/s between server racks, and could be replaced with copper connections that only transmit data at internet speeds (e.g. 100 MB/s).

2.2. Method #2: Tap-verified bandwidth limits

Today’s AI agents don’t need to stream videos. They only need to output text; so developers might install simple devices that tap cables and verify that data transmission rates are even below internet speeds (e.g. 100 KB/s).

Aside: Why 100 KB/s? Bandwidth restrictions are most practical to enforce on intra-rack cables - not on the dense copper back-plate connecting GPUs on server racks. How much data needs to flow through these cables to support inference? Server racks currently house around ~100 H100 equivalents. Google had roughly 3 million H100 equivalents as of late 2025, and processed 480 trillion text tokens/month. This implies 12 billion tokens of input data go through each server rack every second, which is 20 KB/s of input flow (where output flow is probably smaller); so input and output limits of 100 KB/s on each server rack is close to the minimum before data rate limits would interfere with current inference workloads.

2.3. Method #3: Output re-computation

Verifiers could reduce bandwidth even further by partially recomputing a sample of server outputs.

Developers first declare the program they are running, such as their inference model.
Then, verifiers record traffic to and from every server with tap-devices, sample input-output pairs, and confirm that the pairs match the behavior of the declared model.

If this is implemented perfectly - with no leaks and flawless determinism - verifiers know that communications between model instances look like this:

Meaning:

Models cannot communicate at all with each other inside the server rack. If they did, model outputs wouldn’t be recomputable solely from the inputs.
Also, cluster outputs can’t be computed from any state covertly stored in the server.

Model instances can only communicate through low bandwidth model outputs.

At first glance, this looks like it solves everything. This is close to the most strict communication-constraint that can be enforced without disrupting inference.

But there are some important caveats.

First, a developer might be concealing some small fraction of its compute (e.g. 5%) in secret data centers. Historically, intelligence agencies have accounted for covert assets (like tanks, warheads, etc) to within a 15% median error. Compute will probably be easier to account for than (say) missiles. But I think it will still be hard to know if (for example) the PRC government diverted 5% of China’s compute to covert data centers.

*Historical intelligence estimates vs ground truth*

This gives cheaters a non-zero budget for high-bandwidth operations.

Second, output re-computation might not account for all server traffic. There might be non-determinism or side-channels that allow covert workloads on inference servers to leak data.

So output re-computation might end up functioning as just another way to reduce bandwidth. Suppose covert workloads could only slip data through 1% of output traffic, which is the covert channel size that current prototypes allow. Then, output bandwidth would be reduced by 100x, from ~100 KB/s to ~1 KB/s.

2.4. Method #4: Memory wipes

If output re-computation is ‘leaky’ then developers can still accumulate covert state on inference servers. To prevent this state from accumulating for too long, verifiers could enforce periodic memory wipes. For example:

Verifiers could require GPUs to generate synthetic random data with some pre-determined seed, such that generating this data is at least a little bit computationally expensive.
Then, verifiers could ping a random subset of GPUs to ask what data is stored. Since this data is big and time-consuming to generate on the fly, GPUs can only return the data quickly if it is in fact in storage.
Then, the verifier can compute the data themselves from the shared seed to make sure it’s correct.
This confirms that the GPUs are in fact storing >99.99% of the synthetic random data developers say they are (if the sample size is sufficiently large).
If this data accounts for nearly all of the server’s memory, i.e. down to a (1 GB model)/(100 B100s x 192 GB per B100) = 0.005% error - then the verifier knows that the server isn’t storing a covert model.
Finally, developers re-upload the declared inference model through a high-bandwidth channel, where the declared model is on a white-list and doesn’t count toward bandwidth limits.

2.5. Method #5: Proof of work / proof of memory

Verifiers could constrain developers even further with a method called proof of work / memory. The idea is to check that declared inference actually accounts for most of the computation or memory that the developer is using. Specifically:

Verifiers recompute inference outputs (as described) to verify that most output traffic really was generated by the declared inference model.
Then, verifiers estimate how much total compute or memory these requests would have required to process, e.g. by simply re-running them and recording GPU-hours / GB of memory used.
Finally, verifiers estimate the compute capacity or memory of the developer’s servers, and check that the vast majority of their compute or memory is accounted for by the declared inference workload.

Now developers can only use a fraction of their servers for covert workloads - where this fraction is determined by the error bounds of verifier’s estimates.

3. Ways to efficiently continue training despite these constraints

Assume verifiers implement “output re-computation with leaks” from the previous section:

How could developers continue efficient training despite these constraints?

Here are different methods for communication-efficient training and how effective I think they might be:

Method	Bandwidth reduction before performance degrades substantially	Could this technique be used to continue training at near-frontier efficiency despite “output re-computation with leaks”?
Larger batch sizes + infrequent SGD steps	Very little (< 10x)	No.
Periodically merge independent training runs (i.e. infrequent global steps but frequent local ones)	At least 1000x, possibly much more.	Plausibly, even with current algorithms.
Compress gradients / weights	At least 100x.	Plausibly, with additional R&D.
Use all declared compute for RL rollouts. Do training on dark compute.	Extreme bandwidth reductions.	Yes, probably, even with current algorithms.

3.1. Method #1: Larger batch size + infrequent SGD steps

Verdict: Ineffective.

Reasoning: Under 1 kb/s bandwidth, a 10 GB gradient would take 6 months to upload, and many uploads would be required. Current data suggests that the loss landscape is curved enough to where at least thousands of serial training steps are needed to reach strong performance, which implies that a near-frontier training run would take 6 months x 1000 = hundreds of years to complete.

Caption: The relationship between batch size and number of training steps needed to reach a specific loss (derived from Figure 4 in https://arxiv.org/abs/2505.13738). Larger batches have no benefit beyond some point (the ‘critical batch size’), and this BOTEC suggests we are already near this point.

Caption: The “critical batch size” is increasing more slowly than developers are scaling AI training, which suggests that developers will remain in a regime where increasing the batch size doesn’t reduce the number of syncs required. From: https://arxiv.org/abs/2311.08105

3.2. Method #2: Periodically merge independent training runs

Verdict: Reduces bandwidth requirements by at least 1000x, and possibly much more.

Reasoning: Merging independently trained models reduces loss because the loss landscape is convex; so averaging models achieves lower than the average of the models’ losses.

Caption: Why merging independently trained models works.

Google shows that this technique can reduce the frequency of gradient syncs by ~2000x with minimal hits to performance (see https://arxiv.org/abs/2501.18512 and https://arxiv.org/abs/2311.08105):

Caption: Training loss for different sync frequencies. From the DiLoCo paper.

However, beyond a certain point, reducing syncing frequency starts to harm performance. But Li et al showed that this problem can be mitigated by training models on different and independent subsets of data. For example, a model that’s learning pytorch does not need to sync frequently with a model learning about game development. The pytorch model’s learnings don’t have serial dependencies with the game development model’s learnings; so training can be parallelized to the extent that the data can be fragmented into isolated domains.

However, models benefit non-trivially from generalization; so at what sync frequencies do performance costs kick? One paper trained experts with ~10,000 H100 hours before merging at high performance; so if we assume that each server trains a single model, and a server has 100 GPUs, then developers could sync models every 10,000 / 100 / 24 = 4 days and maintain frontier efficiency.

3.3. Method #3: Compress gradients and weights

Verdict: Reduces bandwidth requirements by at least ~100 times, and possibly much more.

Reasoning: Developers can reduce the size of (1) gradients (2) pseudo gradients and (3) model weights, which reduces the time required to send these objects faster to share across bandwidth-limited channels.

Gradients can be compressed because they are sparse (mostly zeros) and low-rank. Compression can reduce the size of gradients by around 100x with minimal performance decline.
Developers can also compress pseudo gradients - which are the model deltas from training models independently (as discussed in the previous section). Pseudo gradients can be compressed by around 100x as well.
Finally, developers can make model weights small and easy to communicate. One way to do this is to split model weights into many tiny experts, each of which is hosted on a single server; so the time needed to upload each expert through bandwidth limits is short. This paper splits a model into 32 fully-independent experts, so a 100 GB model might be divided into ~3 GB experts with negligible hit to performance.

In the limit, how much can gradients be compressed without slowing down training? Theoretically: there’s no free lunch. At a certain threshold, compressing by 10x slows training by 10x. But what is this threshold? In principle, gradients could be losslessly compressed to the size of the training data used to generate them. If training data is ~1 million times smaller than a gradient (~100 GB / 100 KB), then gradients could be compressed by ~1 million times. Also, training data itself is fairly compressible, so gradients might be compressible by even more than 1 million times without any hit to performance. Gradients are not currently this compressible, but maybe developers could make them more compressible by “training for compressibility,” or other methods.

3.4. Method #4: Use more compute for inference rollouts, and less for training

Most of RL training involves generating action rollouts and scoring them. This is just inference, so developers could use their declared inference servers for rollout generation, and use secret data centers for training.

If secret data centers are ~5% of developer, then developers could use this 5% for training, and 95% for rollout generation. Would this asymmetric compute allocation slow down RL training? Multiple papers suggest no: the vast majority of rollouts are not very informative. So if developers train on only the most informative 5% of rollouts, RL still converges at a similar rate. [And also, maybe gradients aren’t even necessary at all, and inference-based evolution works really well]

However, this strategy still requires surmounting a couple of challenges:

First, new checkpoints have to be periodically uploaded to inference servers. RL generates rollouts using the model being trained. So, new models would need to be uploaded to inference servers continually during training. But bandwidth limits could make this upload process slow.

At 100 KB/s, a 3 GB expert model could be uploaded every few hours. Currently, models during reinforcement learning are updated every ~few minutes (the time needed to generate rollouts); so this would be a ~100x slower update frequency than is typical.

However, model updates are becoming less frequent by default because rollouts are taking longer to generate. Also, updates can be significantly delayed without harming performance; so an update frequency of a few hours probably is not a problem.

Additionally, the rollout data has to sent through verifier taps even though it isn't generated by declared models. We’re assuming that verifiers use output re-computation to verify that outputs are generated by the “declared” inference model. But the model producing rollouts for RL is different from the “declared model” and is constantly changing; so RL rollouts would need to somehow escape servers even though they can’t be re-computed by verifiers.

Developers might exfiltrate RL rollouts via the non-determinism of inference. Output re-computation naturally has some error, since inference is not perfectly deterministic. Even hardware-level irregularities will cause some deviations. In current prototypes, the “unsuspicious error” (aka level of non-determinism) of output re-computation is 1%; so 1% of server outputs could be generated by our undeclared RL model without raising suspicion.

If only around 1% of the RL rollouts generated could be actually sent back to training servers, is this a big constraint? Servers would need to curate the top 5% of rollouts anyway, since this is the maximum the training clusters can digest (remember that we are assuming covert data centers are 5% of total developer compute); so further tightening the curation threshold to the top 1% probably wouldn’t be a big problem.

Therefore, developers could potentially side-step inference verification by using declared servers for RL rollout generation.

4. But more aggressive verification methods would probably make training with current algorithms impractical

If we combine the most promising communication-efficient training methods from the last section, here’s what developers might do:

Developers might use all of their declared servers to generate RL rollouts, and use their unreported compute for AI training.
They might curate RL rollouts to the top 1%, exfiltrate this small fraction through inference noise.
Finally, developers might upload new checkpoints periodically to inference servers by transmitting ~3 GB expert models.

To defeat this strategy, verifiers could implement any of the following:

Proof of work or proof of memory that accounts for > 95% of computation. If verifiers can confirm that >95% of compute in inference servers is being used to serve the declared model, then developers can only use < 5% of compute to covertly generate RL rollouts. ~5% is the important threshold because, below this point, developers don’t gain much from secret inference relative to running inference on the covert data centers that already might represent around 5% of their compute.
Memory wipes every few minutes. Each time verifiers wipe servers, developers need to re-upload the 3 GB expert models. But uploading 3 GB across 100 KB/s would take 8 hours; so if memory is wiped every few minutes, developers couldn’t upload a full expert before all of their progress is erased.
In order to upload a checkpoint within the 1-minute memory-wipe window, models experts would need to be around ~5 MB. I’m not aware of any work that has split a near-frontier, 100 GB model into 20,000 different independent 5 MB experts and retained high performance. I’m not sure if this is even possible.
Output re-computation that reduces covert channel capacity below 0.01%. The performance of RL already starts to degrade when down-sampling to 1/16th of rollout data (see figure 4 of https://arxiv.org/pdf/2504.13818). If output recomputation has a covert channel capacity of < 0.01%, then rollouts would need to be down-sampled by 10,000x to exfiltrate them through inference noise. This is much more extreme down-sampling than what I’ve seen in any published RL implementation.

Any of the above inference verification methods would probably significantly slow training given current algorithms.

5. However, if developers (or AIs) have a lot of time to research better algorithms, all bets are off

All of the inference verification methods above boil down to restricting communications between servers. This only works because training currently involves a lot more communication than inference does. But I see no reason why this has to be the case.

For example:

Frequent memory wipes could be defeated by training thousands of tiny experts, This would allow tiny 10 MB experts to be quickly uploaded before memory wipes erase progress. I see no strong argument that models can’t be subdivided into many 10 MB experts.
Bandwidth limits can be defeated with extreme gradient compression. In principle, gradients could be compressed to sizes even smaller than the data that generated them. For example, consider how compressible human learning is. I could trip on my shoelaces, and the whole experience would require a lot of storage to record on video. But the lesson I draw would simply be the words “tie your shoe next time.” Maybe models could share small fragments of information that capture most of the value of a gradient update.
Bandwidth limits can be defeated with infrequent syncs. The core reason communication is important during training is that many domains have serial dependencies: one skill needs to be learned before another. But no one knows exactly how strong these dependencies are. Serial dependencies might only be strongest within a particular subdomain (like “coding in pytorch”) and might be very weak between different subdomains; so AI agents could train thousands of subdomain experts in series, on small, isolated servers, and merge them infrequently (e.g. every few weeks), and still retain most of the performance of normal training.
Even if “frontier training” is slow, AI research might not be. Developers might still make rapid algorithmic progress by training thousands of small models in parallel, e.g. one ~gpt-3 sized model on every server-rack of ~100 GPUs, every few hours. Then, developers would only need to transfer experiment results through bandwidth limits, rather than gradients. Maybe training lots of small models is a good way to improve AI software.

All of these are plausible reasons why even extreme communication restrictions might not slow AI development much at all. However, I think it’s also plausible that communication constraints would cause a major slowdown - even in the limit of AI R&D. This all just depends on the nature of AI development and learning. AI development might be intrinsically much more bottlenecked on compute than communication. Or it might not be. We don’t know.

6. Conclusion

More work is needed to be confident that inference verification can be effectively implemented on short notice.

If you are interested in doing this work, say so, and I can try to connect you to the relevant people.

The best way to reach me is via email: joshuamclymer@gmail.com

Appendix

Are we in the serially bottlenecked training regime? A BOTEC by Claude

Setup

There is a critical batch size (B_crit) beyond which adding more data-parallel workers yields diminishing returns. Below B_crit, doubling the batch size roughly halves the number of training steps needed — perfect parallelization. Above B_crit, you still need roughly the same number of serial steps, but you're burning extra tokens for no benefit.

If a training cluster can fill B_crit with its data-parallel capacity, it is serial-bottlenecked — more GPUs won't help. If it can't reach B_crit, it is hardware-bottlenecked — more GPUs would directly speed up training.

This BOTEC asks: at the scale of 100K and 1M H100-equivalents, which regime are we in?

Key Formula

From Bergsma et al. 2025 ("Power Lines"), trained on GPT-2-style models up to 3.3B parameters on SlimPajama:

where B is in sequences of 2048 tokens and D is total training tokens. This was fit on datasets up to ~143B tokens; we are extrapolating 100–400× beyond the fitted range.

B_crit at Frontier Scale

Dataset size (D)	B_crit (tokens/batch)	S_min (steps)	Wall-clock at B_crit, 2s/step
15T (DeepSeek-V3 scale)	118M	127K	5.9 days
30T (Llama 4 scale)	162M	185K	8.5 days
60T (next-gen frontier)	224M	268K	12.4 days

At B_crit, the number of training steps is 2 × S_min, and the total tokens consumed is 2 × D_min. The lab pays a 2× token overhead in exchange for minimizing wall-clock time.

How Many GPUs Per Model Replica?

Different architectures consume vastly different amounts of model parallelism, leaving different amounts of headroom for data parallelism:

Architecture	TP	PP	EP	GPUs/replica
Dense ~300B	8	16	—	128
Dense ~600B	8	32	—	256
MoE 671B (DeepSeek-V3 style)	1	16	64	1,024
MoE ~2T (Behemoth style)	1	16	256	4,096

Achievable Batch Size vs. B_crit

Assuming 8192-token sequences, gradient accumulation of 8, and D = 15T tokens (B_crit ≈ 118M tokens):

Cluster	Architecture	DP replicas	Batch size	Ratio to B_crit	Regime
100K	Dense ~300B	781	51M tok	0.4×	Hardware-bottlenecked
100K	Dense ~600B	390	26M tok	0.2×	Hardware-bottlenecked
100K	MoE 671B	97	6M tok	0.05×	Hardware-bottlenecked
100K	MoE ~2T	24	2M tok	0.01×	Hardware-bottlenecked
1M	Dense ~300B	7,812	512M tok	4.3×	Serial-bottlenecked
1M	Dense ~600B	3,906	256M tok	2.2×	Serial-bottlenecked
1M	MoE 671B	976	64M tok	0.5×	Hardware-bottlenecked
1M	MoE ~2T	244	16M tok	0.1×	Hardware-bottlenecked

Key Takeaways

At 100K H100s, every architecture is hardware-bottlenecked. Even a relatively small dense model can only reach ~0.4× B_crit. More GPUs would directly speed up training. MoE models are especially far from B_crit because expert parallelism consumes most of the GPU budget.

At 1M H100s, dense models become serial-bottlenecked but MoE models do not. A dense 300B model would overshoot B_crit by 4×, wasting significant compute on redundant gradient information. But a DeepSeek-V3-style MoE still only reaches 0.5× B_crit, and a 2T-parameter MoE reaches just 0.1×. MoE architectures absorb GPU capacity into model parallelism, keeping the data-parallel dimension small and the training compute-efficient.

This is a structural argument for MoE at scale. As clusters grow, dense models hit the seriality wall first. MoE provides a way to productively use additional GPUs (holding more expert parameters) without pushing batch size past B_crit. It converts excess parallel capacity into model quality rather than wasted data parallelism.

If the Power Lines extrapolation holds, serial wall-clock time is surprisingly short. At B_crit with 2s/step, a 15T-token run finishes in ~6 days. Actual frontier training runs take months, suggesting labs operate well below B_crit — trading wall-clock time for compute efficiency — or that step times are much longer than 2 seconds at these configurations.

Caveats

These estimates rest on several shaky assumptions:

Extrapolation. The Power Lines scaling law was fit on models ≤3.3B parameters and datasets ≤143B tokens. Extrapolating to 15–60T tokens is a 100–400× extrapolation. The exponent (0.462) could be different at frontier scale.
MoE. The scaling law was fit on dense models. MoE architectures may have different B_crit scaling — the gradient noise structure could differ when only a fraction of parameters are active per token. No published work has measured B_crit for large MoE models.
Parallelism overhead. The model parallelism estimates (TP, PP, EP) are rough. Real configurations depend on interconnect topology, memory capacity, and engineering choices. Some labs may achieve higher DP with clever parallelism strategies.
Step time. We assumed 2 seconds per step, which is a rough estimate. At high DP with large models, communication overhead can push step times to 5–10+ seconds, significantly increasing wall-clock time.
Batch size warmup. B_crit is not constant during training — it starts near zero and grows. Early training is always highly serial regardless of cluster size.

Sources

Bergsma et al. 2025, "Power Lines: Scaling Laws for Large Language Model Training" (arxiv.org/abs/2505.13738)
McCandlish et al. 2018, "An Empirical Model of Large-Batch Training" (arxiv.org/abs/1812.06162)
Merrill et al. 2025, "Critical Batch Size Revisited" (arxiv.org/abs/2505.23971)
Epoch AI, "Data Movement Bottlenecks: Scaling Past 1e28 FLOP" (epoch.ai/blog/data-movement-bottlenecks-scaling-past-1e28-flop)
DeepSeek-V3 Technical Report (arxiv.org/abs/2412.19437)