Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

[-]Rohin Shah2mo2423

I agree with this directionally, and generally tend to believe that certain kinds of straight lines will continue to be straight. But also iirc you rely a lot on the METR trend line, which is (a) noisy, (b) over a short timescale (~2 years^[1]), and (c) not a great measure of what people care about^[2]^[3]. I don't think you should strongly expect that particular line to continue to be straight over a long time period.

(Though I tend to think that it's more likely to slow down by 2027 than to speed up, relative to what you'd predict by extrapolating the 2024-2025 trend, so I agree with the overall conclusion of the post. And obviously one should have non-trivial probability on each of {slow down, stay the same, speed up}, as you say.)

^{^}
I'm excluding GPT-2 and GPT-3 which seem very noisy.
^{^}
Compared to things like Moore's Law, or energy efficiency of solar, or pretraining perplexity, or other similar things where we observe straight lines.
^{^}
None of this is to say that the METR graph is bad! It is the best empirical evidence on timelines I'm aware of.

[-]Vladimir_Nesov2mo*65

Non-OpenAI pre-RLVR chatbots might serve as an anchor for how long it takes an AI company to turn an algorithmic idea into a frontier model, after it becomes a clearly worthwhile thing to do. Arguably only Anthropic managed to catch up to OpenAI, and it took them 1.5 years with Sonnet 3.5. Even Google never caught up after 2+ years, their first credibly frontier chatbot is Gemini 2.5 Pro, which is already well into RLVR (and similarly for Grok 4). So it seems reasonable to expect that it would take about 2 years for RLVR-based models to start being done well, somewhere in 2026-2027.

The IMO results probably indicate something about the current lower bound on capabilities in principle, for informally graded tasks such as natural language proofs. This is a lot higher than what finds practical use so far, and improvements in 2026-2027 might be able to capture this kind of thing (without needing the scale of 2026 compute).

[-]jsd2mo30

Another way to think about this is that it could be reasonable to spend within the same order of magnitude on each RL environment as you spend in compute cost to train on that environment. I think the compute cost for doing RL on a hard agentic software engineering task might be around $10 to $1000 ($0.1 to $1 for each long rollout and you might do 100 to 1k rollouts?), so this justifies a lot of spending per environment. And, environments can be reused across multiple training runs (though they could eventually grow obsolete).

Agreed, cf https://www.mechanize.work/blog/cheap-rl-tasks-will-waste-compute/

[-]ryan_greenblatt2mo*40

Thanks, I wasn't aware of this post. (I think it overstates the level of spending we'll see on the average RL env within a year by maybe 10x or more, but I agree directionally.)

[-]jsd2mo10

In their BOTEC, it seems you roughly agree with a group size of 64 and 5 reuses per task (since 5 * 64 is between 100 and 1k).

You wrote $0.1 to $1 per rollout, whereas they have in mind 500,000 * $15 / 1M = $7.5. 500,000 doesn't seem especially high for hard agentic software engineering tasks which often reach into the millions.

Does the disagreement come from:

Thinking the $15 estimate from opportunity cost is too high (so compute cost lower than Mechanize claims)
Expecting most of the RL training to somehow not be end-to-end? (so compute cost lower than Mechanize claims)
Expecting spending per RL environment to be smaller than compute spending, even if within an OOM.

[-]ryan_greenblatt2mo50

I expect lower cost per rollout on average due to AI companies doing RL on a bunch of smaller tasks and from companies not necessarily using tons of reasoning tokens on most envs. Also, API prices are marked up relative to what companies actually pay on compute which can easily add a factor of 3. If we are just looking at the hardest agentic software engineering environments, then this closes the gap a decent amount.
I expect spending on RL enviroments to be more like 10x lower than RL training compute rather than similar (and I wouldn't be surprised by a large gap) because it's hard to massively scale up spending on RL envs effectively in a short period of time while we already have an scaled up industrial process for buying more compute.

I'm more sympathetic to "companies will spend this much on some high quality RL envs" than "the typical RL env will be very expensive", but I think some disagreement remains.

This naively suggests a rate of massively trend breaking breakthroughs of around 1 every 10 years. This is part of why I don't put that low of a probability on full AI R&D automation in just a few years: there is a reasonable chance of a massive breakthrough which substantially boosts the probability. Note that a massive breakthrough wouldn't necessarily suffice, but this does drive up the probability a bunch. Also, I don't feel that confident about the rate, and I'd be sympathetic to interpreting the evidence as suggesting a rate of more like 1/20 years or 1/7 years. ↩︎
Given basically on-trend AI progress, less than 3 years until full automation of AI R&D seems pretty unlikely. ↩︎
I expect easily verified/benchmarked capabilities will surpass real world usefulness, but I expect real world usefulness will still be pretty high at the point when AIs are doing this well on easily verified tasks. ↩︎
That said, each time there is a salient advance which seems like it could be a very big deal (once scaled up / figured out / etc.), we should put some probability on this driving above trend progress, like I did at the start of 2025. In retrospect, I think I was still too bullish ex-ante. ↩︎
It's unclear what level of capability (in terms of e.g. METR's notion of horizon-length) corresponds to being able to generate high quality RL environments mostly autonomously (as in, humans provide suggestions and AIs execute). I'd guess this probably scales somewhat with the difficulty of the RL environments and has some minimum level of capability. I'd guess making a high quality new class of RL environments takes human engineers around a few days (???), so we'd maybe expect that at least 80% reliability horizon lengths of 24 hours are required which we'd expect in maybe like 3 years. Though, we'd presumably expect substantial gains from lots of AI automation of RL environment construction before this. ↩︎

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

57

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

57

Counterargument: Actually, companies haven't gotten around to improving RL environment quality until recently (or there is substantial lead time on scaling up RL environments etc.) so better RL environments didn't drive much of late 2024 and 2025 progress

Counterargument: AIs will soon reach a critical capability threshold where AIs themselves can build high quality RL environments

Counterargument: AI companies are massively fucking up their training runs (either pretraining or RL) and once they get their shit together more, we'll see fast progress

Counterargument: This isn't that related to RL scale up, but OpenAI has some massive internal advance in verification which they demonstrated via getting IMO gold and this will cause (much) faster progress late this year or early next year

Thoughts and speculation on scaling up the quality of RL environments