Thane Ruthenis - AI Alignment Forum

What 2026 looks like

Trying to evaluate this forecast in order to figure out how to update on the newer one.

It certainly reads as surprisingly prescient. Notably, it predicts both the successes and the failures of the LLM paradigm: the ongoing discussion regarding how "shallow" or not their understanding is, the emergence of the reasoning paradigm, the complicated LLM bureaucracies/scaffolds, lots of investment in LLM-wrapper apps which don't quite work, the relative lull of progress in 2024, troubles with agency and with generating new ideas, "scary AI" demos being dismissed because LLMs do all kinds of whimsical bullshit...

And it was written in the base-GPT-3 era, before ChatGPT, before even the Instruct models. I know I couldn't have come close to calling any of this back then. Pretty wild stuff.

In comparison, the new "AI 2027" scenario is very... ordinary. Nothing in it is surprising to me; it is indeed the "default", "nothing new happens" scenario in many ways.

But perhaps the difference is in the eye of the beholder. Back in 2021, I barely knew how DL worked, forget being well-versed in deep LLM lore. The real question is, if I had been as immersed in the DL discourse in 2021 as I am now, would this counterfactual 2021!Thane have considered this forecast as standard as the AI 2027 forecast seems to 2025!Thane?

More broadly: "AI 2027" seems like the reflection of the default predictions regarding AI progress in certain well-informed circles/subcultures. Those circles/subcultures are fairly broad nowadays; e. g., significant parts of the whole AI Twitter. Back in 2021, the AI subculture was much smaller... But was there, similarly, an obviously maximally-well-informed fraction of that subculture which would've considered "What 2026 Looks Like" the somewhat-boring default prediction?

Reframing: @Daniel Kokotajlo, do you recall how wildly speculative you considered "What 2026 Looks Like" at the time of writing, and whether it's more or less speculative than "AI 2027" feels to you now? (And perhaps the speculativeness levels of the pre-2027 and post-2027 parts of the "AI 2027" report should be evaluated separately here.)

Another reframing: To what extent do you think your alpha here was in making unusually good predictions, vs. in paying attention to the correct things at a time when no-one focused on them, then making fairly basic predictions/extrapolations? (Which is important for evaluating how much your forecasts should be expected to "beat the (prediction) market" today, now that (some parts of) that market are paying attention to the right things as well.)

Fabien's Shortform

Thane Ruthenis1mo30

I still think that even for the things you described, it will be relatively easy for the base model to understand what is going on, and it's likely that GPT-4o will too

Maaaybe. Note, though, that "understand what's going on" isn't the same as "faithfully and comprehensively translate what's going on into English". Any number of crucial nuances might be accidentally lost in translation (due to the decoder model not properly appreciating how important they are), or deliberately hidden (if the RL'd model performs a sneaky jailbreak on the decoder, see Pliny-style token bombs or jailbreaks encoded in metaphor).

Fabien's Shortform

Thane Ruthenis1mo42

What sorts of encoding do you expect to be natural to LLMs besides encodings already present in pretraining and that GPT-4o can decode?

Spitballing: Some sort of trivial-but-abstract combination of the encodings present in pretraining that produces an encoding that's particularly easy for the LLM to think in due to their architecture/learned associations, but which is opaque to us, because the abstract space in which the combination is trivial is hopelessly beyond our current theory of languages, such that we can't easily reverse-engineer it.

"Switching between languages every sentence" is a trivial first step. Next step might be something like "using poetic metaphors in every language" or "thinking in subtle information-dense multilingual puns", such that you need actual human polyglots to translate. Beyond that, it might move to freestyling words by combining characters from random languages in ways such that they happen to evoke useful concepts in the LLM when seen through its omniglot-poet lens^[1], but which teams of human polyglots need hours to untangle. At this stage, idiosyncratic meanings completely alien to us but visible to the LLM would also come into play, see the " petertodd" phenomenon, so perhaps the polyglot teams would also need to have janus-style LLM whisperers present. And then it gets worse from there.

Edit: Here, I dug up an example. I'd run some experiments on it back in 2023 and it turned out to be much less impressive than it looks^[2], but imagine if LLMs actually started to think in this kind of code.

Again, this is just a throwaway example scenario, but this is the sort of thing I expect. From my perspective, LLMs starting to switch between languages once RL'd is a clear precursor of such a nightmare scenario once RL is ramped up, not something that's tame. (Though I admit it's been less bad than I'd expected so far. I guess I haven't seen what the trajectories of reasoning models of the second generation look like.)

^[1] E. g., using one language's rules for creating compound words while taking the morphemes from different languages (and emoji, why not), such as surrounding a Japanese kanji with Latin and Russian prefixes and suffixes, and wrapping the whole thing in the poetry tradition of a fourth language.
^[2] The LLM doesn't make the encoding incomprehensible if you don't literally ask it to make it incomprehensible, and the encoding doesn't change depending on whether you inform it that it's for its future instance or not. So it's not actually doing self-modeling to figure out the best way to encode text for itself. Still, it does demonstrate that it can reverse this sort of wild-looking code due to its learned webs of associations, in ways that would be very hard for humans to untangle.

Fabien's Shortform

Thane Ruthenis1mo1212

That's a worthwhile research direction, but I don't find the results here convincing. This experiment seems to involve picking an arbitrary and deliberately unfamiliar-to-the-LLM encoding, and trying to force the LLM to use it. That's not the threat model with RL causing steganography; the idea there is the opposite: that there is some encoding which would come naturally to the model, more naturally than English, and that RL would beeline for it.

"LLMs are great at learning to think in arbitrary encodings" was never part of that threat model. The steganographic encoding would not be arbitrary nor alien-to-the-LLM.

Daniel Kokotajlo's Shortform

Thane Ruthenis2mo*3-2

Can you point to any challenges that seem (a) necessary for speeding up AI R&D by 5x, and (b) not engineering challenges?

We'd discussed that some before, but one way to distill it is... I think autonomously doing nontrivial R&D engineering projects requires sustaining coherent agency across a large "inferential distance". "Time" in the sense of "long-horizon tasks" is a solid proxy for it, but not really the core feature. Instead, it's about being able to maintain a stable picture of the project even as you move from a fairly simple-in-terms-of-memorized-templates version of that project, to some sprawling, highly specific, real-life mess.

My sense is that, even now, LLMs are terrible at this^[1] (including Anthropic's recent coding agent), and that scaling along this dimension has not at all been good. So the straightforward projection of the current trends is not in fact "autonomous R&D agents in <3 years", and some qualitative advancement is needed to get there.

Making them more useful seems analogous to selective breeding or animal training

Are they useful? Yes. Can they be made more useful? For sure. Is the impression that the rate at which they're getting more useful would result in them 5x'ing AI R&D in <3 years a deceptive impression, the result of us setting up a selection process that would spit out something fooling us into forming this impression? Potentially yes, I argue.

^[1] Having looked it up now, METR's benchmark admits that the environments in which they test are unrealistically "clean", such that, I imagine, solving the task correctly is the "path of least resistance" in a certain sense (see "systematic differences from the real world" here).

Daniel Kokotajlo's Shortform

Thane Ruthenis2mo*106

I think that if I consistently applied that argument, I'd end up thinking AGI was probably 5+ years away right up until the day AGI was announced.

Point 1: That would not necessarily be incorrect; it's not necessary that you ought to be able to do better than that. Consider math discoveries, which seem to follow a memoryless exponential distribution. Any given time period has a constant probability of a conjecture being proven, so until you observe it happening, it's always a fixed number of years in the future. I think the position that this is how AGI development ought to be modeled is very much defensible.
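To make the memorylessness concrete, here's a minimal sketch. It assumes the waiting time really is exponentially distributed, which is the modeling assumption under discussion rather than an established fact. The rate of 0.2 breakthroughs per year is a made-up number for illustration, and the function names are hypothetical.

```python
import math


def survival(t: float, rate: float) -> float:
    """P(T > t) for an exponential waiting time with the given rate."""
    return math.exp(-rate * t)


def conditional_survival(elapsed: float, extra: float, rate: float) -> float:
    """P(T > elapsed + extra | T > elapsed).

    The chance of still waiting `extra` more years, given that nothing
    has happened during the first `elapsed` years.
    """
    return survival(elapsed + extra, rate) / survival(elapsed, rate)


def expected_remaining_wait(rate: float) -> float:
    """E[T - elapsed | T > elapsed], which does not depend on `elapsed`."""
    return 1.0 / rate


rate = 0.2  # hypothetical: one breakthrough per 5 years on average
for years_elapsed in (0, 5, 20):
    p_five_more = conditional_survival(years_elapsed, 5.0, rate)
    print(years_elapsed, round(p_five_more, 4), expected_remaining_wait(rate))
```

However many years have already passed without the event, the probability of waiting five more and the expected remaining wait come out the same, which is the "always a fixed number of years in the future" behaviour.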

Indeed: if you place AGI in the reference class of self-driving cars/reusable rockets, you implicitly assume that the remaining challenges are engineering challenges, and that the paradigm of LLMs as a whole is sufficient to reach it. Then time-to-AGI could indeed be estimated more or less accurately.

If we instead assume that some qualitative/theoretical/philosophical insight is still missing, then it becomes a scientific/mathematical challenge instead. The reference class of those is things like Millennium Problems, quantum computing (or, well, it was until recently?), fusion. And as above, memes like "fusion is always X years away" are not necessarily evidence that there's something wrong with how we do world-modeling.

Point 2: DL is kind of different from other technologies. Here, we're working against a selection process that's eager to Goodhart to what we're requesting, and we're giving it an enormous amount of resources (compute) to spend on that. It might be successfully fooling us regarding how much progress is actually happening.

One connection that comes to mind is the "just add epicycles" tragedy:

Finally, I’m particularly struck by the superficial similarities between the way Ptolemy and Copernicus happened upon a general, overpowered tool for function approximation (Fourier analysis) that enabled them to misleadingly gerrymander false theories around the data, and the way modern ML has been criticized as an inscrutable heap of linear algebra and super-efficient GPUs. I haven’t explored whether these similarities go any deeper, but one implication seems to be that the power and versatility of deep learning might allow suboptimal architectures to perform deceivingly well (just like the power of epicycle-multiplication kept geocentrism alive) and hence distract us from uncovering the actual architectures underlying cognition and intelligence.

That analogy seems incredibly potent to me.
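For intuition on why epicycles count as an "overpowered tool for function approximation", here's a rough sketch under my own assumptions: the "orbit" is a square, picked only because it is obviously non-circular, and each epicycle is one term of a discrete Fourier transform of the sampled path. All names are hypothetical, and this illustrates the general technique, not what Ptolemy or Copernicus actually computed.

```python
import cmath
import math


def square_path(n):
    """n points (n divisible by 4) traced around a square in the complex plane."""
    corners = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
    per_side = n // 4
    points = []
    for i in range(4):
        start, end = corners[i], corners[(i + 1) % 4]
        for k in range(per_side):
            points.append(start + (end - start) * k / per_side)
    return points


def epicycle_coefficients(path):
    """Discrete Fourier transform: one complex coefficient per circle."""
    n = len(path)
    return [
        sum(path[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / n
        for k in range(n)
    ]


def rebuild(coeffs, keep):
    """Re-trace the path using only the `keep` largest circles."""
    n = len(coeffs)
    largest = sorted(range(n), key=lambda k: abs(coeffs[k]), reverse=True)[:keep]
    return [
        sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in largest)
        for t in range(n)
    ]


def rms_error(a, b):
    """Root-mean-square distance between two sampled paths."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)) / len(a))


path = square_path(64)
coeffs = epicycle_coefficients(path)
errors = {keep: rms_error(path, rebuild(coeffs, keep)) for keep in (1, 3, 9, 64)}
for keep, err in errors.items():
    print(keep, err)
```

The fit error shrinks every time more circles are kept, and with all 64 it is zero up to rounding, even though nothing about a square is "really" made of circles. The fit quality says nothing about whether the underlying model is right.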

Another way to model time-to-AGI given the "deceitful" nature of DL might be to borrow some tools from sociology or economics, e. g. trying to time the market, predict when a social change will happen, or model what's happening in a hostile epistemic environment. No clear analogy immediately comes to mind, though.

Have LLMs Generated Novel Insights?

Thane Ruthenis2mo1115

On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal.

Is there? It's one thing to verify whether a proof is correct; whether an expression (posed by a human!) is tautologous to a different expression (also posed by a human!). But what's the ground-truth signal for "the framework of Bayesian probability/category theory is genuinely practically useful"?

This is the reason I'm bearish on the reasoning models even for math. The realistic benefits of them seem to be:

1. Much faster feedback loops on mathematical conjectures.
2. Solving long-standing mathematical challenges such as Riemann's or P vs. NP.
3. Mathematicians might be able to find hints of whole new math paradigms in the proofs for the long-standing challenges the models generate.

Of those:

(1) still requires mathematicians to figure out which conjectures are useful. It compresses hours, days, weeks, or months (depending on how well it scales) of a very specific and niche type of work into minutes, which is cool, but not Singularity-tier.
(2) is very speculative. It's basically "compresses decades of work into minutes", while the current crop of reasoning models can barely solve problems that ought to be pretty "shallow" from their perspective. Maybe Altman is right, the paradigm is in its GPT-2 stage, and we're all about to be blown away by what they're capable of. Or maybe it doesn't scale past the frontier of human mathematical knowledge very well at all, and the parallels with AlphaZero are overstated. We'll see.
(3) is dependent on (2) working out.

(The reasoning-model hype is so confusing for me. Superficially there's a ton of potential, but I don't think there's been any real indication they're up to the real challenges still ahead.)

Have LLMs Generated Novel Insights?

Thane Ruthenis2mo1617

The question is IMO not "has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes), but "is there any way to get an LLM to autonomously generate and recognize genuine scientific insights at least at the same rate as human scientists?". A stopped clock is right twice a day, a random-word generator can eventually produce an insight, and talking to a rubber duck can let you work through a problem. That doesn't mean the clock is useful for telling the time or that the RWG has the property of being insightful.

And my current impression is that no, there's no way to do that. If there were, we would've probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.

This aligns with my experience. Yes, LLMs have sometimes directly outputted some insights useful for my research in agent foundations. But it's very rare, and only happens when I've already done 90% of the work setting up the problem. Mostly they're useful as rubber ducks or primers on existing knowledge; not idea-generators.

Abstract Mathematical Concepts vs. Abstractions Over Real-World Systems

Thane Ruthenis2mo10

It seems like this is the sort of thing you could only ever learn by learning about the real world first

Yep. The idea is to try and get a system that develops all practically useful "theoretical" abstractions, including those we haven't discovered yet, without developing desires about the real world. So we train some component of it on the real-world data, then somehow filter out "real-world" stuff, leaving only a purified superhuman abstract reasoning engine.

One of the nice-to-have properties here would be if we didn't need to be able to interpret its world-model to filter out the concepts – if, in place of human understanding and judgement calls, we could blindly use some ground-truth-correct definition of what is and isn't a real-world concept.

Thane Ruthenis's Shortform

Thane Ruthenis2mo60

Some more evidence that whatever the AI progress on benchmarks is measuring, it's likely not measuring what you think it's measuring:

AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination
AIME 2025 part I was conducted yesterday, and the scores of some language models are available here:
https://matharena.ai thanks to @mbalunovic, @ni_jovanovic et al.
I have to say I was impressed, as I predicted the smaller distilled models would crash and burn, but they actually scored at a reasonable 25-50%.
That was surprising to me! Since these are new problems, not seen during training, right? I expected smaller models to barely score above 0%. It's really hard to believe that a 1.5B model can solve pre-math olympiad problems when it can't multiply 3-digit numbers. I was wrong, I guess.
I then used openai's Deep Research to see if similar problems to those in AIME 2025 exist on the internet. And guess what? An identical problem to Q1 of AIME 2025 exists on Quora:
https://quora.com/In-what-bases-b-does-b-7-divide-into-9b-7-without-any-remainder
I thought maybe it was just coincidence, and used Deep Research again on Problem 3. And guess what? A very similar question was on math.stackexchange:
https://math.stackexchange.com/questions/3548821/
Still skeptical, I used Deep Research on Problem 5, and a near identical problem appears again on math.stackexchange:
https://math.stackexchange.com/questions/3146556/how-many-five-digit-numbers-formed-from-digits-1-2-3-4-5-used-exactly-once-a#:~:text=,are%20divisible%20by%20%2412
I haven't checked beyond that because the freaking p-value is too low already. Problems near identical to the test set can be found online.
So, what--if anything--does this imply for Math benchmarks? And what does it imply for all the sudden hill climbing due to RL?
I'm not certain, and there is a reasonable argument that even if something in the train-set contains near-identical but not exact copies of test data, it's still generalization. I am sympathetic to that. But, I also wouldn't rule out that GRPO is amazing at sharpening memories along with math skills.
At the very least, the above show that data decontamination is hard.
Never ever underestimate the amount of stuff you can find online. Practically everything exists online.

I expected that:

I think one of the other problems with benchmarks is that they necessarily select for formulaic/uninteresting problems that we fundamentally know how to solve. If a mathematician figured out something genuinely novel and important, it wouldn't go into a benchmark (even if it were initially intended for a benchmark), it'd go into a math research paper. Same for programmers figuring out some usefully novel architecture/algorithmic improvement. Graduate students don't have a bird's-eye-view on the entirety of human knowledge, so they have to actually do the work, but the LLM just modifies the near-perfect-fit answer from an obscure publication/math.stackexchange thread or something.

I expect the same is the case with programming benchmarks, science-quiz benchmarks, et cetera.

Now, this doesn't necessarily mean that the AI progress has been largely illusory and that we're way further from AGI than the AI hype men would have you believe (although I am very tempted to make this very pleasant claim, and I do place plenty of probability mass on it).

But if you're scoring AIs by the problems they succeed at, rather than the problems they fail at, you're likely massively overestimating their actual problem-solving capabilities.
