Coalitional agency seems like an unnecessary constraint on the design of a composite agent, since an individual agent could just (choose to) listen to other agents and behave the way their coalition would endorse, thereby effectively becoming a composite agent without being composite "by construction". The step where an agent chooses which other (hypothetical) agents to listen to makes constraints on the nature of agents unnecessary, because the choice to listen to some agents and not others can impose whatever constraints that particular agent cares about, and so an "agent" could be something as vague as a "computation" or a program.
(Choosing to listen to a computation means choosing a computation based on considerations other than its output, committing to use its output in a particular way without yet knowing what it's going to be, and carrying out that commitment once the output becomes available, regardless of what it turns out to be.)
This way we can get back to individual rationality: figuring out which other agents/computations an agent should choose to listen to when coming up with its own beliefs and decisions. Actually listening to those other computations on occasion is the missing step in most decision theories, and it would take care of interaction with other agents (both actual and hypothetical).
Discussions of how to aggregate values and probabilities feel disjoint. The Jeffrey-Bolker formulation of expected utility presents the preference data as two probability distributions over the same sample space, so that the expected utility of an event is reconstructed as the ratio of the event's measures under the two priors. (The measure that goes into the numerator is "shouldness", and the other one remains "probability".)
This gestures at a way of reducing the problem of aggregating values to the problem of aggregating probabilities. In particular, markets seem to be easier to set up for probabilities than for expected utilities, so it might be better to set up two markets that are technically the same type of thing, one for probability and one for shouldness, than to target expected utility directly. Values of different agents are incomparable, but so are priors; any fundamental issues with aggregation seem to remain unchanged by this reformulation. These can't be "prediction" markets, since resolution is not straightforward and somewhat circular, grounded in what the coalition will settle on eventually, but logical induction already has to deal with similar issues.
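As a minimal sketch of this two-measure picture (the finite sample space, the numbers, and the pooling rule below are all made up for illustration): expected utility of an event is the ratio of its shouldness measure to its probability measure, and aggregation can act on the two kinds of priors separately.

```python
# A minimal sketch over a finite sample space; the priors, weights and the
# pooling rule are made up for illustration.

def measure(prior, event):
    """Total mass a prior assigns to an event (a set of outcomes)."""
    return sum(p for outcome, p in prior.items() if outcome in event)

def expected_utility(prob, should, event):
    """EU(E) = shouldness(E) / probability(E); undefined for probability-null events."""
    return measure(should, event) / measure(prob, event)

# Two toy priors over the same three outcomes: "probability" and "shouldness".
prob   = {"a": 0.5, "b": 0.3, "c": 0.2}
should = {"a": 0.2, "b": 0.3, "c": 0.5}

event = {"b", "c"}
eu = expected_utility(prob, should, event)   # (0.3 + 0.5) / (0.3 + 0.2) = 1.6

# Aggregation then looks the same for both components: pool probability priors,
# pool shouldness priors (here a simple linear opinion pool), take the ratio.
def pool(priors, weights):
    outcomes = set().union(*priors)
    return {o: sum(w * pr.get(o, 0.0) for pr, w in zip(priors, weights))
            for o in outcomes}

prob2   = {"a": 0.1, "b": 0.6, "c": 0.3}     # a second agent's priors
should2 = {"a": 0.4, "b": 0.4, "c": 0.2}
pooled_eu = (measure(pool([should, should2], [0.5, 0.5]), event)
             / measure(pool([prob, prob2], [0.5, 0.5]), event))
```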
Cyberattacks can't disable anything reliably or for more than days to weeks, though, and there are dozens of major datacenter campuses from multiple somewhat independent vendors. Hypothetical AI-developed attacks might change that, but then there will also be AI-developed information security, adapting to any known kinds of attacks and stopping them from being effective shortly after. So the MAD analogy seems tenuous; the effect size (of this particular kind of intervention) is much smaller, to the extent that it seems misleading to even mention cyberattacks in this role/context.
Oversight, auditing, and accountability are jobs. Agriculture shows that 95% of jobs going away is not the problem. But AI might be better at the new jobs as well, without any window of opportunity where humans are initially doing them and AI needs to catch up. Instead it's AI that starts doing all the new things well first and humans get no opportunity to become competitive at anything, old or new, ever again.
Even the formulation of aligned high-level tasks and intent alignment of AIs make sense as jobs that could be done well by misaligned AIs for instrumental reasons. This is not even deceptive alignment, but it still plausibly segues into gradual disempowerment or a sharp left turn.
My point is that a bit of scaling (like 3x) doesn't matter, even though at the scale of GPT-4.5 or Grok 3 it requires building a $5bn training system, but a lot of scaling (like 2000x up from the original GPT-4) is still the most important thing affecting capabilities that will predictably happen soon. And it's going to arrive a little bit at a time, so it won't be obviously impactful at any particular step, doing nothing to dispel the rumors that scaling is no longer important. It's a rising sea kind of thing (if you have the compute).
It was always necessary for long reasoning traces to start working at some point, and the s1 paper illustrates that we don't really have evidence yet that R1-like training creates rather than elicits nontrivial capabilities (things that wouldn't be possible to transfer in a mere 1000 traces). Amodei is suggesting that RL training can be scaled to billions of dollars, but it's unclear whether this assumes that AIs will automate the creation of verifiable tasks. If constructing such tasks (or very good reward models) is the bottleneck, this direction of scaling can't quickly get very far outside specialized domains like chess, where a single verifiable task (winning a game) generates endless data.
The quality data wall and flatlining benchmarks (with base model scaling) are about compute multipliers that depend on good data but don't scale very far, as opposed to scalable multipliers like high-sparsity MoE. So I think these recent 4x-a-year compute multipliers mostly won't work above 1e27-1e28 FLOPs, which superficially looks bad for scaling of pretraining, but won't impact the less legible aspects of scaling token prediction (measured in perplexity on non-benchmark data) that are more important for general intelligence. There's also the hard data wall of literally running out of text data, but being less stringent about data quality and training for multiple epochs (giving up the ephemeral compute multipliers from data quality) should keep it at bay for now.
my intuitions have been shaped by events like the pretraining slowdown
I don't see it. GPT-4.5 is much better than the original GPT-4, probably at 15x more compute. But it's not 100x more compute. And GPT-4o is an intermediate point, so the change from GPT-4o to GPT-4.5 is even smaller, maybe 4x.
I think a 3x change in compute has an effect at the level of the noise from different reasonable choices in constructing a model, and 100K H100s is only 5x more than the 20K H100s of 2023. It's not a slowdown relative to what it should've been. And there are models with 200x more raw compute than went into GPT-4.5 probably coming in 2027-2029, much more than the 4x-15x observed since 2022-2023.
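For concreteness, here is the rough arithmetic behind those ratios, treating the figures above (and the 2e25 FLOPs estimate for original GPT-4 discussed below) as assumptions rather than measurements:

```python
# Rough consistency check of the scaling figures above; every input is an
# estimate stated in the text, not a measurement.
gpt4_flops = 2e25             # original GPT-4 (see the compute discussion below)
gpt45_over_gpt4 = 15          # assumed compute ratio of GPT-4.5 over original GPT-4
future_over_gpt45 = 200       # assumed ratio of ~2027-2029 models over GPT-4.5

gpt45_flops = gpt4_flops * gpt45_over_gpt4        # ~3e26 FLOPs
future_flops = gpt45_flops * future_over_gpt45    # ~6e28 FLOPs

print(future_flops / gpt4_flops)   # ~3000x over original GPT-4, the order of the "2000x" above
print(100_000 / 20_000)            # 100K H100s is only 5x the 20K H100s of 2023
```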
LLMs compute the probability of a sequence, but the truth/good distinction is captured by a two-dimensional Jeffrey-Bolker measure (I'm calling its components "probability" and "shouldness"; their ratio is the expected utility of an event). Shouldness is reconstructed from probability and expected utility as their product, so plausibly it behaves on long sequences similarly to probability: it generally gets lower for longer sequences, but tends to be higher for simpler sequences.
The analogy between probability and shouldness suggests that some form of pretraining might be able to create models for either of them (as opposed to a base model that learns something in between from raw data with no supervision from preference data). Then expected utility is the ratio; that is, instead of looking at the logits of one LLM, we look at differences of logits between two LLMs, a shouldness-LLM and a probability-LLM (with some regularization that anchors to a base model instead of Goodharting towards high-approximate-expected-utility, low-probability sequences). Possibly this needs interspersing preference training with pretraining, rather than only applying preference training during post-training, so that there are two different pretrained models that nurture different collections of circuits (for probability and for shouldness).
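A minimal sketch of the logit-difference part, assuming two already-trained LMs with a shared vocabulary; the function and the anchoring term are my illustration of one possible regularization, not a worked-out method:

```python
import torch
from typing import Optional

def log_expected_utility(should_logits: torch.Tensor,
                         prob_logits: torch.Tensor,
                         base_logits: Optional[torch.Tensor] = None,
                         beta: float = 0.0) -> torch.Tensor:
    """Per-token log expected utility from the logits of two LLMs ([..., vocab]).

    log EU = log shouldness - log probability, i.e. a difference of normalized
    logits. Optionally add beta * base-model log-probs as a crude anchor, so
    that selecting for high approximate EU doesn't drift into implausible
    (low-probability) sequences.
    """
    log_should = torch.log_softmax(should_logits, dim=-1)
    log_prob = torch.log_softmax(prob_logits, dim=-1)
    score = log_should - log_prob
    if base_logits is not None and beta > 0.0:
        score = score + beta * torch.log_softmax(base_logits, dim=-1)
    return score

# Toy usage with random logits standing in for the two models' outputs.
vocab = 8
should_logits = torch.randn(1, 5, vocab)
prob_logits = torch.randn(1, 5, vocab)
scores = log_expected_utility(should_logits, prob_logits)
next_token = scores[0, -1].argmax()   # next token with the highest approximate EU
```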
(Some kind of Solomonoff induction analogy for probability/shouldness should be a clearer thing to express, and might be more relevant in a decision theory context: you start with description lengths of programs in two different languages, a language of probability-programs and another of shouldness-programs, and then convert these into probability and shouldness distributions over sequences, enabling both probability induction and shouldness induction for the next element of a sequence. Solomonoff induction ignores distinctions between languages in the limit, but this kind of probability/shouldness induction works with pairs of languages, and the distinction between the two languages in a given pair is the most important thing, as it defines expected utility.)
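One hedged way to write the two-language construction down (my notation, not anything standard: U is a universal machine, |p| is the length of program p, and "extends" means the program's output has x as a prefix):

$$M_P(x) = \sum_{p \in L_P,\ U(p)\ \text{extends}\ x} 2^{-|p|}, \qquad M_S(x) = \sum_{q \in L_S,\ U(q)\ \text{extends}\ x} 2^{-|q|}, \qquad \mathrm{EU}(x) = \frac{M_S(x)}{M_P(x)}$$

Probability induction for the next element b is then M_P(xb)/M_P(x), shouldness induction is M_S(xb)/M_S(x), and their ratio EU(xb)/EU(x) plays the role of the expected utility of extending x with b.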
The bet that "makes sense" is that quality of Claude 3.6 Sonnet, GPT-4o and DeepSeek-V3 is the best that we're going to get in the next 2-3 years, and DeepSeek-V3 gets it much cheaper (less active parameters, smaller margins from open weights), also "suggesting" that quality is compute-insensitive in a large range, so there is no benefit from more compute per token.
But if quality instead improves soon (including by training the DeepSeek-V3 architecture on GPT-4o compute), and that improvement either makes it necessary to use more compute per token, or motivates using inference for more tokens even with models that have the same active parameter count (as in the Jevons paradox), that argument doesn't work. Also, the ceiling of quality at the possible scaling slowdown point depends on the efficiency of training (compute multiplier) applied to the largest training system that AI economics will support (maybe 5-15 GW without almost-AGI), and the improved efficiency of DeepSeek-V3 raises that ceiling.
How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1
Selection effect. If DeepSeek-V2.5 had been this good, we would be talking about it instead.
GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million
Original GPT-4 is 2e25 FLOPs and compute optimal; V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x beyond compute optimal), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT-4 was trained in 2022 on A100s and needed a lot of them, while in 2024 it could be trained on 8K H100s in BF16. DeepSeek-V3 is trained in FP8, doubling the FLOP/s, so the FLOPs of original GPT-4 could be produced in FP8 by a mere 4K H100s. DeepSeek-V3 was trained on 2K H800s, whose performance is about that of 1.5K H100s. So the cost only has to differ by about 3x, not 20x, when comparing a compute optimal variant of DeepSeek-V3 with original GPT-4, using the same hardware and training with the same floating point precision.
The relevant comparison is with GPT-4o though, not original GPT-4. Since GPT-4o was trained in late 2023 or early 2024, there were 30K H100 clusters around, which makes 8e25 FLOPs of raw compute plausible (assuming it's in BF16). It might be overtrained, so make that 4e25 FLOPs for a compute optimal model with the same architecture. Thus, comparing architectures alone, GPT-4o probably uses about 15x more compute than DeepSeek-V3.
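Putting the stated estimates together as a consistency check (every input below is one of the rough figures above, not a measurement):

```python
# Reproducing the rough arithmetic above; all inputs are the text's estimates.
gpt4_flops = 2e25           # original GPT-4, assumed compute optimal
v3_flops = 5e24             # DeepSeek-V3 as trained (overtrained)
v3_optimal_flops = 3e24     # estimated compute optimal variant of the same architecture

# GPU counts to produce original GPT-4's FLOPs on 2024 hardware:
gpt4_h100s_bf16 = 8_000                  # stated estimate for BF16
gpt4_h100s_fp8 = gpt4_h100s_bf16 / 2     # FP8 roughly doubles FLOP/s -> 4K H100s

v3_h800s = 2_000
v3_h100_equiv = 1_500                    # 2K H800s ~ 1.5K H100s for this workload

print(gpt4_h100s_fp8 / v3_h100_equiv)    # ~2.7x, the "about 3x" cost gap

# GPT-4o vs DeepSeek-V3, comparing architectures alone:
gpt4o_flops = 8e25                       # plausible raw compute on a 30K H100 cluster (BF16)
gpt4o_optimal_flops = 4e25               # adjusting for likely overtraining
print(gpt4o_optimal_flops / v3_optimal_flops)   # ~13x, in the ballpark of the "about 15x" above
```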
toy model ... f(x) = Ax and g(x) = Bx, where x is the compute invested
Returns on compute are logarithmic though: the advantage of a $150 billion training system over a $150 million one is merely twice that of $150 billion over $5 billion, or of $5 billion over $150 million. Restrictions on access to compute can only be overcome with something like 30x compute multipliers, and at the least DeepSeek-V3 is going to be reproduced using the big compute of US training systems shortly, so that advantage is already gone.
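A quick check of the "merely twice" comparison, measuring advantage in decades (factors of 10) of training system cost:

```python
import math

# "Returns on compute are logarithmic": advantages in decades of training system cost.
small, mid, large = 150e6, 5e9, 150e9    # $150M, $5B, $150B

full_gap = math.log10(large / small)     # ~3.0 decades
upper_gap = math.log10(large / mid)      # ~1.48 decades
lower_gap = math.log10(mid / small)      # ~1.52 decades

# The full $150M -> $150B advantage is about twice either of the partial gaps.
print(full_gap, upper_gap, lower_gap)
```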
That is, raw utilized compute. I'm assuming the same compute utilization for all models. ↩︎
It's an essential aspect of decision making for an agent to figure out where it might be. Thought experiments try to declare what the current situation is, but they can't necessarily succeed in doing so convincingly. Algorithmic induction, such as updating from the Solomonoff prior, is the basic way an agent figures out which situations it should care about, and declaring that we are working with a particular thought experiment doesn't affect the prior. In line with updatelessness, an agent should be ready for observations in general (weighted by how much it cares about them), rather than for particular "fair" observations, so singling out observations that describe "fair" thought experiments doesn't seem right either.