Derya Unutmaz reported that o1-pro came up with a novel idea in the domain of immunotherapy:
Introduction to the Context:
I’m working on developing innovative cancer immunotherapy approaches to address key challenges in the field. Immunotherapy is an exceptionally powerful strategy for curing cancer because it harnesses the body’s immune system—our internal army—and empowers it to recognize and eliminate cancer cells. In this effort, we are focusing on engineering T cells, the immune system’s soldiers and generals, through synthetic biology.
However, significant challenges remain, especially in treating solid tumors like breast cancer. Within the tumor microenvironment, T cells often become exhausted due to the overwhelming number of cancer cells and the suppressive environment created by the tumor. This exhaustion severely limits the effectiveness of these therapies.
To tackle this issue, we employ a cutting-edge model system using 3D bioprinted breast cancer tissue integrated with engineered human T cells. These T cells are reprogrammed through advanced synthetic biology techniques to test and develop solutions for overcoming exhaustion.
Prompt to O1-Pro:
Building on work I’ve previously done and tested with o1-Preview and GPT-4o, I posed the following prompt:
“I’d like you to focus on 3D bioprinted solid tumors as a model to address the T cell exhaustion problem. Specifically, the model should incorporate stroma, as seen in breast cancer, to replicate the tumor microenvironment and explore potential solutions. These solutions could involve technologies like T cell reprogramming, synthetic biology circuits, cytokines, transcription factors related to exhaustion, or metabolic programming. Draw inspiration from other fields, such as Battle Royale games or the immune system’s ability to clear infected cells without triggering autoimmunity. Identify potential pitfalls in developing these therapies and propose alternative approaches. Think outside the box and outline iterative goals that could evolve into full-scale projects. Focus exclusively on in vitro human systems and models.”
Why Battle Royale Games?
You might wonder why I referenced Battle Royale games. That’s precisely the point—I wanted to push the model to think beyond conventional approaches and draw from completely different systems for inspiration. While o1-Preview and GPT-4o were able to generate some interesting ideas based on this concept, they were mostly ideas I could also have conceived of myself, though better than what most PhD students would produce. In contrast, o1-Pro came up with far more creative and innovative solutions that left me in awe!
Idea #9: A Remarkable Paradigm
Here, I’m sharing one specific idea, which I’ll call Idea #9 based on its iteration sequence. This idea was exceptional because it proposed an extraordinary paradigm inspired by Battle Royale games, but, more importantly, one grounded in a deep temporal understanding of biological processes. This was the first time any model explicitly considered the time-dependent nature of biological events—an insight that reflects a remarkably advanced and nuanced understanding!
“Adapt or Fail” Under Escalating Challenges:
Another remarkable aspect of Idea #9 was that it drew conceptually on the notion of “adapt or fail” under escalating challenges, directly inspired by Battle Royale mechanics. This was the first time any model had approached the problem from this perspective. It also emphasized the importance of temporal intervals in reversing or eliminating exhausted T cells. Indeed, this approach mirrors the necessity for T cells to adapt dynamically under pressure and survive progressively tougher challenges, something we would love to model in in vitro systems! A further particularly striking insight was the role of stimulation intervals in preventing exhaustion. Idea #9 suggested that overly short intervals between stimuli might be a key factor driving T cell exhaustion in current therapies. This observation really amazed me with its precision and relevance, because it pinpointed a subtle but critical aspect of T cell activation and the development of exhaustion mechanisms.
There's more behind the link. I have no relevant expertise that would allow me to evaluate how novel this actually was. But immunology is the author's specialty, and his work has close to 30,000 citations on Google Scholar, so I'd assume he knows what he's talking about.
The question is IMO not "has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes), but "is there any way to get an LLM to autonomously generate and recognize genuine scientific insights at least at the same rate as human scientists?". A stopped clock is right twice a day, a random-word generator can eventually produce an insight, and talking to a rubber duck can let you work through a problem. That doesn't mean the clock is useful for telling the time or that the RWG has the property of being insightful.
And my current impression is that no, there's no way to do that. If there were, we would've probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.
This aligns with my experience. Yes, LLMs have sometimes directly outputted some insights useful for my research in agent foundations. But it's very rare, and only happens when I've already done 90% of the work setting up the problem. Mostly they're useful as rubber ducks or primers on existing knowledge; not idea-generators.
Current LLMs are capable of solving novel problems when the user does most of the work: when the user lays the groundwork and poses the right question for the LLM to answer.
So, if we can get LLMs to lay the groundwork and pose the right questions, then we'll have autonomous scientists in whatever fields LLMs are OK at problem-solving.
This seems like something LLMs will learn to do as inference-time compute is scaled up. Reasoners benefit from coming up with sub-problems whose solutions can be built upon to solve the problem posed by the user.
LLMs will learn that in order to solve difficult questions, they must pose and solve novel sub-questions.
So, once given an interesting research problem, the LLM will hum away for days doing good, often-novel work.
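To make that loop concrete, here's a rough sketch (Python, purely illustrative) of the decompose-and-solve pattern I'm describing. `llm` stands in for any chat-completion call, and `propose_subquestions`/`answer` are hypothetical prompt wrappers, not a real API:

```python
def propose_subquestions(llm, problem: str, notes: list[str]) -> list[str]:
    """Ask the model which smaller questions would help next, given progress so far."""
    reply = llm(
        f"Problem: {problem}\nKnown so far: {notes}\n"
        "List the next sub-questions to answer, one per line, or reply DONE."
    )
    return [] if reply.strip() == "DONE" else [q for q in reply.splitlines() if q.strip()]

def answer(llm, question: str, notes: list[str]) -> str:
    """Ask the model to answer one sub-question, building on earlier answers."""
    return llm(f"Given what is known so far ({notes}), answer: {question}")

def research_loop(llm, problem: str, max_rounds: int = 10) -> list[str]:
    """Iteratively pose and solve sub-questions, accumulating results to build on."""
    notes: list[str] = []
    for _ in range(max_rounds):
        subquestions = propose_subquestions(llm, problem, notes)
        if not subquestions:  # the model judges the groundwork complete
            break
        for q in subquestions:
            notes.append(f"{q} -> {answer(llm, q, notes)}")
    return notes
```

The point is just that the "lay the groundwork, pose the right question" step moves inside the loop; whether current models can fill the role of `propose_subquestions` well is exactly what's in question.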
I think the argument you’re making is that since LLMs can make eps > 0 progress, they can repeat it N times to make unbounded progress. But this is not the structure of conceptual insight as a general rule. Concretely, it fails for the architectural reasons I explained in the original post.
Yeah, I agree with this. If you feed an LLM enough hints about the solution you believe is right, and it generates ten solutions, one of them will sound to you like the right solution.
For me, this is significantly different from the position I understood you to be taking. My push-back was essentially the same as
"has there been, across the world and throughout the years, a nonzero number of scientific insights generated by LLMs?" (obviously yes),
& I created the question to see if we could substantiate the "yes" here with evidence.
It makes somewhat more sense to me for your timeline crux to be "can we do this reliably" as opposed to "has this literally ever happened" -- but the claim in your post was quite explicit about the "this has literally never happened" version. I took your position to be that this-literally-ever-happening would be significant evidence towards it happening more reliably soon, on your model of what's going on with LLMs, since (I took it) your current model strongly predicts that it has literally never happened.
This strong position even makes some sense to me; it isn't totally obvious whether it has literally ever happened. The chemistry story I referenced seemed surprising to me when I heard about it, even considering selection effects on what stories would get passed around.
There is a specific type of thinking, which I tried to gesture at in my original post, which I think LLMs seem to be literally incapable of. It’s possible to unpack the phrase “scientific insight” in more than one way, and some interpretations fall on either side of the line.
If there were, we would've probably heard about massive shifts in how scientists (and entrepreneurs!) are doing their work.
I have been seeing a bit of this, mostly uses of o1-pro and OpenAI Deep Research in chem/bio/medicine, and mostly via Twitter hype so far. But it might be the start of something.
It seems suspicious to me that this hype is coming from fields where it seems hard to verify (is the LLM actually coming up with original ideas, or is it just fusing standard procedures? Are the ideas the bottleneck, or is the experimental time the bottleneck? Are the ideas actually working, or do they just sound impressive?). And of course this is Twitter.
Why not progress on hard (or even easy but open) math problems? Are LLMs afraid of proof verifiers? On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal.
On the contrary, it seems like this is the area where we should be able to best apply RL, since there is a clear reward signal.
Is there? It's one thing to verify whether a proof is correct, that is, whether an expression (posed by a human!) is tautologous to a different expression (also posed by a human!). But what's the ground-truth signal for "the framework of Bayesian probability/category theory is genuinely practically useful"?
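To spell out the asymmetry: a proof checker hands RL a binary pass/fail signal, but nothing comparable exists for "this framework is genuinely useful". A purely illustrative sketch, where `check_proof` is a hypothetical placeholder rather than any real verifier API:

```python
def check_proof(statement: str, candidate_proof: str) -> bool:
    """Hypothetical placeholder: True iff candidate_proof formally establishes statement."""
    raise NotImplementedError  # would delegate to an actual proof checker (e.g. Lean, Coq)

def proof_reward(statement: str, candidate_proof: str) -> float:
    # Binary, machine-checkable signal: the "clear reward signal" side of the argument.
    return 1.0 if check_proof(statement, candidate_proof) else 0.0

# There is no analogous
#   usefulness_reward("the framework of Bayesian probability is genuinely practically useful")
# because "practically useful" has no formal checker, which is the gap being pointed at here.
```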
This is the reason I'm bearish on the reasoning models even for math. The realistic benefits of them seem to be:
Of those:
(The reasoning-model hype is so confusing for me. Superficially there's a ton of potential, but I don't think there's been any real indication they're up to the real challenges still ahead.)
That's a reasonable suspicion, but as a counterpoint, there might be more low-hanging fruit in biomedicine than in math, precisely because it's harder to test ideas in the former. Without the need for expensive experiments, math has already been driven much deeper than other fields, and therefore requires a deeper understanding to have any hope of making novel progress.
edit: Also, if I recall correctly, the average IQ of mathematicians is higher than that of biologists, which is consistent with it being harder to make progress in math.
On the other hand, frontier math (pun intended) is much worse financed than biomedicine, because most PhD-level math has barely any practical applications worth spending the man-hours of high-IQ mathematicians on (which often makes them switch careers). So, I would argue, if the productivity of math postdocs armed with future LLMs rises by, let's say, an order of magnitude, they will be able to attack more laborious problems.
Not that I expect it to make much difference to the general populace, or even to the scientific community at large, though.
I think this is one of the most important questions we currently have in relation to time to AGI, and one of the most important "benchmarks" that tell us where we are in terms of timelines.
Instead of "have LLMs generated novel insights", how about "have LLMs demonstrated the ability to identify which views about a non-formal topic make more or less sense?" This question seems easier to operationalize and I suspect points at a highly related ability.
Random thought: maybe (at least pre-reasoning-models) LLMs are RLHF'd to be "competent" in a way that makes them less curious & excitable, which greatly reduces their chance of coming up with (and recognizing) any real breakthroughs. I would expect though that for reasoning models such limitations will necessarily disappear and they'll be much more likely to produce novel insights. Still, scaffolding and lack of context and agency can be a serious bottleneck.
In a recent post, Cole Wyeth makes a bold claim:
I commented:
My question is: can anyone confirm the above rumor, or cite any other positive examples of LLMs generating insights which help with a scientific or mathematical project, with those insights not being available anywhere else (ie seemingly absent from the training data)?
Cole Wyeth predicts "no": though LLMs are able to solve problems they have not seen before using standard methods, they are not capable of performing novel research. I (Abram Demski) find it plausible (but not certain) that the answer is "yes". This touches on AI timeline questions.
I find it plausible that LLMs can generate such insights, because I think the predictive ground layer of LLMs contains a significant "world-model" triangulated from diffuse information. This "world-model" can contain some insights not present in the training data. I think this paper has some evidence for such a conclusion:
However, the setup in this paper is obviously artificial, setting up questions that humans already know the answers to, even if the answers aren't explicitly present in the data. The question is whether LLMs synthesize any new knowledge in this way.