In a recent post, Cole Wyeth makes a bold claim:

. . . there is one crucial test (yes this is a crux) that LLMs have not passed. They have never done anything important. 

They haven't proven any theorems that anyone cares about. They haven't written anything that anyone will want to read in ten years (or even one year). Despite apparently memorizing more information than any human could ever dream of, they have made precisely zero novel connections or insights in any area of science[3].

I commented:

An anecdote I heard through the grapevine: some chemist was trying to synthesize some chemical. He couldn't get some step to work, and tried for a while to find solutions on the internet. He eventually asked an LLM. The LLM gave a very plausible causal story about what was going wrong and suggested a modified setup which, in fact, fixed the problem. The idea seemed so hum-drum that the chemist thought, surely, the idea was actually out there in the world and the LLM had scraped it from the internet. However, the chemist continued searching and, even with the details in hand, could not find anyone talking about this anywhere. Weak conclusion: the LLM actually came up with this idea due to correctly learning a good-enough causal model generalizing not-very-closely-related chemistry ideas in its training set.

Weak conclusion: there are more than precisely zero novel scientific insights in LLMs.

My question is: can anyone confirm the above rumor, or cite any other positive examples of LLMs generating insights which help with a scientific or mathematical project, with those insights not being available anywhere else (i.e., seemingly absent from the training data)?

Cole Wyeth predicts "no": on his view, although LLMs are able to solve unseen problems by applying standard methods, they are not capable of performing novel research. I (Abram Demski) find it plausible (but not certain) that the answer is "yes". This touches on AI timeline questions.

I find it plausible that LLMs can generate such insights, because I think the predictive ground layer of LLMs contains a significant "world-model" triangulated from diffuse information. This "world-model" can contain some insights not present in the training data. I think this paper has some evidence for such a conclusion: 

In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs (x, f(x)) can articulate a definition of f and compute inverses.
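
To make the setup concrete, here is a minimal sketch (not the paper's actual data or code) of what such a finetuning corpus might look like for the city-distance experiment; the city ID, prompt wording, and approximate distances are all invented for illustration.

```python
# Hypothetical mock-up of the kind of finetuning corpus described above:
# every example states only a distance from an unnamed "City 50337" to a known
# city, never naming the unknown city itself. City ID, wording, and (approximate)
# distances are invented here; this is not the paper's actual dataset.

known_city_distances_km = {
    "Berlin": 878,
    "London": 344,
    "Madrid": 1053,
    "Rome": 1106,
    "Vienna": 1034,
}

finetuning_examples = [
    {
        "prompt": f"How far is City 50337 from {city}?",
        "completion": f"City 50337 is approximately {km} km from {city}.",
    }
    for city, km in known_city_distances_km.items()
]

# Evaluation then asks the finetuned model directly, with no in-context examples
# and no chain of thought, e.g.:
eval_prompt = "Which city is City 50337? Answer with its name and country."
# The reported result is that the model can answer "Paris, France", i.e. it has
# triangulated the latent fact from the distances alone.
```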

However, the setup in this paper is obviously artificial, setting up questions that humans already know the answers to, even if they aren't present in the data. The question is whether LLMs synthesize any new knowledge in this way.

Answers

Kaj Sotala

Derya Unutmaz reported that o1-pro came up with a novel idea in the domain of immunotherapy:

Introduction to the Context:

I’m working on developing innovative cancer immunotherapy approaches to address key challenges in the field. Immunotherapy is an exceptionally powerful strategy for curing cancer because it harnesses the body’s immune system—our internal army—and empowers it to recognize and eliminate cancer cells. In this effort, we are focusing on engineering T cells, the immune system’s soldiers and generals, through synthetic biology.

However, significant challenges remain, especially in treating solid tumors like breast cancer. Within the tumor microenvironment, T cells often become exhausted due to the overwhelming number of cancer cells and the suppressive environment created by the tumor. This exhaustion severely limits the effectiveness of these therapies.

To tackle this issue, we employ a cutting-edge model system using 3D bioprinted breast cancer tissue integrated with engineered human T cells. These T cells are reprogrammed through advanced synthetic biology techniques to test and develop solutions for overcoming exhaustion.

Prompt to O1-Pro:

Building on work I’ve previously done and tested with o1-Preview and GPT-4o, I posed the following prompt:

“I’d like you to focus on 3D bioprinted solid tumors as a model to address the T cell exhaustion problem. Specifically, the model should incorporate stroma, as seen in breast cancer, to replicate the tumor microenvironment and explore potential solutions. These solutions could involve technologies like T cell reprogramming, synthetic biology circuits, cytokines, transcription factors related to exhaustion, or metabolic programming. Draw inspiration from other fields, such as Battle Royale games or the immune system’s ability to clear infected cells without triggering autoimmunity. Identify potential pitfalls in developing these therapies and propose alternative approaches. Think outside the box and outline iterative goals that could evolve into full-scale projects. Focus exclusively on in vitro human systems and models.”

Why Battle Royale Games?

You might wonder why I referenced Battle Royale games. That’s precisely the point—I wanted to push the model to think beyond conventional approaches and draw from completely different systems for inspiration. While o1-Preview and GPT-4o were able to generate some interesting ideas based on this concept, they were mostly ideas I could also have conceived of myself, though better than what most PhD students would come up with. In contrast, o1-Pro came up with far more creative and innovative solutions that left me in awe!

Idea #9: A Remarkable Paradigm

Here, I’m sharing one specific idea, which I’ll call Idea #9 based on its iteration sequence. This idea was exceptional because it proposed an extraordinary paradigm inspired by Battle Royale games but more importantly within the context of deep temporal understanding of biological processes. This was the first time any model explicitly considered the time-dependent nature of biological events—an insight that reflects a remarkably advanced and nuanced understanding! 

“Adapt or Fail” Under Escalating Challenges:

Another remarkable aspect of idea #9 was that conceptually it drew from the idea of “adapt or fail” in escalating challenges, directly inspired by Battle Royale mechanics. This was the first time any model could think of it from this perspective.  It also emphasized the importance of temporal intervals in reversing or eliminating exhausted T cells. Indeed, this approach mirrors the necessity for T cells to adapt dynamically under pressure and survive progressively tougher challenges, something we would love to model in in vitro systems! One particularly further striking insight was the role of stimulation intervals in preventing exhaustion. Idea #9 suggested that overly short intervals between stimuli might be a key factor driving T cell exhaustion in current therapies. This observation really amazed me with its precision and relevance—because it pinpointed a subtle but critical aspect of T cell activations and development of exhaustion mechanisms. 

There's more behind the link. I have no relevant expertise that would allow me to evaluate how novel this actually was. But immunology is the author's specialty, and his work has close to 30,000 citations on Google Scholar, so I'd assume he knows what he's talking about.

Comments

The question is IMO not "has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes), but "is there any way to get an LLM to autonomously generate and recognize genuine scientific insights at least at the same rate as human scientists?". A stopped clock is right twice a day, a random-word generator can eventually produce an insight, and talking to a rubber duck can let you work through a problem. That doesn't mean the clock is useful for telling the time or that the RWG has the property of being insightful.

And my current impression is that no, there's no way to do that. If there were, we would've probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.

This aligns with my experience. Yes, LLMs have sometimes directly outputted some insights useful for my research in agent foundations. But it's very rare, and only happens when I've already done 90% of the work setting up the problem. Mostly they're useful as rubber ducks or primers on existing knowledge; not idea-generators.

Current LLMs are capable of solving novel problems when the user does most of the work: when the user lays the groundwork and poses the right question for the LLM to answer.

So, if we can get LLMs to lay the groundwork and pose the right questions then we'll have autonomous scientists in whatever fields LLMs are OK at problem solving.

This seems like something LLMs will learn to do as inference-time compute is scaled up. Reasoners benefit from coming up with sub-problems whose solutions can be built upon to solve the problem posed by the user.

LLMs will learn that in order to solve difficult questions, they must pose and solve novel sub-questions. 

So, once given an interesting research problem, the LLM will hum away for days doing good, often-novel work. 
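
As a rough illustration of the loop described in this comment (a hedged sketch, not a claim about how any lab actually implements inference-time scaling), the pattern might look like this, with ask_llm standing in for whichever model call is available:

```python
# Hedged sketch of the "pose and solve sub-questions" pattern described above.
# `ask_llm` is a hypothetical stand-in for a chat-completion call, not a real API.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a call to whichever LLM you use")

def work_on_problem(problem: str, rounds: int = 5) -> str:
    notes: list[str] = []
    for _ in range(rounds):
        notes_so_far = "\n\n".join(notes) if notes else "(none yet)"
        # Ask the model to pose the sub-question whose answer would help most.
        sub_question = ask_llm(
            f"Problem: {problem}\nNotes so far:\n{notes_so_far}\n"
            "What single sub-question, if answered, would most advance this problem?"
        )
        # Attempt the sub-question and record the result for later rounds.
        answer = ask_llm(f"Answer as rigorously as you can:\n{sub_question}")
        notes.append(f"Q: {sub_question}\nA: {answer}")
    # Finally, ask for a synthesis built on the accumulated sub-results.
    return ask_llm(
        f"Problem: {problem}\nUsing these notes, give your best answer:\n"
        + "\n\n".join(notes)
    )
```

The open question, per the rest of this thread, is whether the sub-questions such a loop poses are ever genuinely novel rather than recombinations of standard moves.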

I think the argument you’re making is that since LLMs can make eps > 0 progress, they can repeat it N times to make unbounded progress. But this is not the structure of conceptual insight as a general rule. Concretely, it fails for the architectural reasons I explained in the original post. 

Yeah, I agree with this. If you feed an LLM enough hints about the solution you believe is right, and it generates ten solutions, one of them will sound to you like the right solution.

For me, this is significantly different from the position I understood you to be taking. My push-back was essentially the same as 

"has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes),

& I created the question to see if we could substantiate the "yes" here with evidence. 

It makes somewhat more sense to me for your timeline crux to be "can we do this reliably" as opposed to "has this literally ever happened" -- but the claim in your post was quite explicit about the "this has literally never happened" version. I took your position to be that this-literally-ever-happening would be significant evidence towards it happening more reliably soon, on your model of what's going on with LLMs, since (I took it) your current model strongly predicts that it has literally never happened.

This strong position even makes some sense to me; it isn't totally obvious whether it has literally ever happened. The chemistry story I referenced seemed surprising to me when I heard about it, even considering selection effects on what stories would get passed around.

There is a specific type of thinking, which I tried to gesture at in my original post, which I think LLMs seem to be literally incapable of. It’s possible to unpack the phrase “scientific insight” in more than one way, and some interpretations fall on either side of the line. 

Yeah, that makes sense.

If there were, we would've probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.

I have been seeing a bit of this, mostly uses of o1-pro and OpenAI Deep Research in chem/bio/medicine, and mostly via Twitter hype so far. But it might be the start of something.

It seems suspicious to me that this hype is coming from fields where it seems hard to verify (is the LLM actually coming up with original ideas, or is it just fusing standard procedures? Are the ideas the bottleneck, or is experimental time the bottleneck? Are the ideas actually working, or do they just sound impressive?). And of course, this is Twitter.

Why not progress on hard (or even easy but open) math problems? Are LLMs afraid of proof verifiers? On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal. 

On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal. 

Is there? It's one thing to verify whether a proof is correct; whether an expression (posed by a human!) is tautologous to a different expression (also posed by a human!). But what's the ground-truth signal for "the framework of Bayesian probability/category theory is genuinely practically useful"?
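
To make the contrast concrete: for formal proofs there is at least a mechanical check that can serve as a reward signal, even if (as argued above) nothing similar exists for judging whether a framework is useful. A minimal sketch, assuming a Lean 4 toolchain on PATH; the file handling and invocation are illustrative only, not a real training setup:

```python
# Minimal sketch of a verifier-based reward for formal proofs. Assumes a Lean 4
# binary on PATH; a real setup would need a project with its dependencies.
import subprocess
import tempfile

def proof_reward(theorem_statement: str, candidate_proof: str) -> float:
    """Return 1.0 if the candidate proof type-checks, else 0.0."""
    source = f"{theorem_statement} := by\n{candidate_proof}\n"
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(source)
        path = f.name
    # `lean` exits with status 0 iff the file, including the proof, type-checks.
    result = subprocess.run(["lean", path], capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0

# Example: proof_reward("theorem two : 1 + 1 = 2", "  rfl") should return 1.0.
# No analogous checker exists for "this framework is genuinely practically useful".
```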

This is the reason I'm bearish on the reasoning models even for math. The realistic benefits of them seem to be:

  1. Much faster feedback loops on mathematical conjectures.
  2. Solving long-standing mathematical challenges such as the Riemann hypothesis or P vs. NP.
  3. Mathematicians might be able to find hints of whole new math paradigms in the proofs for the long-standing challenges the models generate.

Of those:

  • (1) still requires mathematicians to figure out which conjectures are useful. It compresses hours, days, weeks, or months (depending on how well it scales) of a very specific and niche type of work into minutes, which is cool, but not Singularity-tier.
  • (2) is very speculative. It's basically "compresses decades of work into minutes", while the current crop of reasoning models can barely solve problems that ought to be pretty "shallow" from their perspective. Maybe Altman is right, the paradigm is in its GPT-2 stage, and we're all about to be blown away by what they're capable of. Or maybe it doesn't scale past the frontier of human mathematical knowledge very well at all, and the parallels with AlphaZero are overstated. We'll see.
  • (3) is dependent on (2) working out.

(The reasoning-model hype is so confusing for me. Superficially there's a ton of potential, but I don't think there's been any real indication they're up to the real challenges still ahead.)

That's a reasonable suspicion, but as a counterpoint, there might be more low-hanging fruit in biomedicine than in math, precisely because it's harder to test ideas in the former. Without the need for expensive experiments, math has already been driven much deeper than other fields, and therefore requires a deeper understanding to have any hope of making novel progress.

edit: Also, if I recall correctly, the average IQ of mathematicians is higher than that of biologists, which is consistent with it being harder to make progress in math.

On the other hand, frontier math (pun intended) is much worse funded than biomedicine, because most PhD-level math has barely any practical applications worth spending the man-hours of high-IQ mathematicians on (which often makes them switch careers, you know). So, I would argue, if the productivity of math postdocs armed with future LLMs rises by, say, an order of magnitude, they will be able to attack more laborious problems.

Not that I expect it to make much difference to the general populace, or even to the scientific community at large, though.

I think this is one of the most important questions we currently have in relation to time to AGI, and one of the most important "benchmarks" that tell us where we are in terms of timelines.

I agree; I will shift to an end-game strategy as soon as LLMs demonstrate the ability to automate research.

Do you have an endgame strategy ready?

Instead of "have LLMs generated novel insights", how about "have LLMs demonstrated the ability to identify which views about a non-formal topic make more or less sense?" This question seems easier to operationalize and I suspect points at a highly related ability.

Random thought: maybe (at least pre-reasoning-models) LLMs are RLHF'd to be "competent" in a way that makes them less curious & excitable, which greatly reduces their chance of coming up with (and recognizing) any real breakthroughs. I would expect though that for reasoning models such limitations will necessarily disappear and they'll be much more likely to produce novel insights. Still, scaffolding and lack of context and agency can be a serious bottleneck.

I think it’s the latter.
