Summary
Can LLMs science? The answer to this question can tell us important things about timelines to AGI. In this small pilot experiment, we test frontier LLMs on their ability to perform a minimal version of scientific research, where they must discover a hidden rule about lists of integers by iteratively generating and testing hypotheses. Results are ambiguous: most models are pretty bad at it, but the top systems show apparent signs of life. We're working on a larger, more rigorous experiment, and we really want your input.
Structure
In this post we:
- briefly describe the pilot experiment and its results;
- describe the larger follow-up experiment now in progress; and
- ask for your input on that larger experiment.
This is followed by appendices with more detail (eg related work, limitations), but we've kept the main body as short and direct as possible.
Introduction
Over the past six months we have been trying to better understand the degree to which LLMs are capable of general reasoning[1] (in LLM Generality is a Timeline Crux, and LLMs Look Increasingly Like General Reasoners). In short: researchers, including AI safety researchers, have widely differing positions on whether (and to what degree) LLMs are capable of the same sort of accurate and broadly applicable reasoning that humans are. At one extreme is the stochastic parrot hypothesis that "no actual language understanding is taking place in [LLMs]" and any apparent signs of reasoning are actually a kind of cheating. At the other extreme is the view that LLMs are already fully capable of all the same types of reasoning as humans are, and just need a bit more scaling.
An important negative consequence of this disagreement is that the AI safety community's limited resources are spread across a range of timelines, in a way that impacts the scaling labs much less. The best use of those resources differs significantly between worlds where LLMs scale straight to AGI[2] and worlds where substantial further breakthroughs are needed first[3].
In order to try to improve our understanding of this issue and help resolve the disagreement, we created the Numberwang[4] experiment.
The experiment
This is a pilot experiment for a larger research project which started this month and runs through April 2025 (described in a later section). The key question (in both the larger project and the pilot) is whether LLMs can demonstrate general reasoning by autonomously performing a simplified form of scientific investigation. This approach is useful both for addressing the question of general reasoning, and for learning more about the critical object-level question of whether LLMs are likely to be able to accelerate AI development by performing scientific research independently.
The experiment is similar to one performed on humans by Peter Wason in 1960 in his paper 'On the failure to eliminate hypotheses in a conceptual task' (an experiment familiar to many from Veritasium or Eliezer Yudkowsky). We chose the experiment as a highly simplified version of the scientific process: iteratively proposing hypotheses, testing them, and rejecting or refining them until a hypothesis is found which correctly predicts both positive and negative results.
We generated a set of 60 possible rules which described some lists of integers but not others and randomly sampled 10 in such a way as to ensure an even selection across difficulty levels. Some example rules:
For each rule, each of 11 well-known LLMs was prompted to try to figure out the rule, and then given three initial examples of lists that follow the rule. Here's a shortened version of the prompt, which describes the rest of the procedure (see appendix E for the full prompt, including instructions to avoid various failure modes like confirmation bias):
Claude-3.6-Sonnet[5] was used to check the test lists, and to evaluate whether the model's final hypothesis was (extensionally) equivalent to the real rule. Our prompt encouraged Claude to be fairly lenient in judging.
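To make the structure concrete, here is a minimal illustrative sketch of a single trial, in which the 'scientist' is just a Python callable and the hidden rule a predicate; the real harness (linked below) uses LLM API calls and a Claude judge rather than these stand-ins.

```python
from typing import Callable, List, Tuple

# Purely illustrative sketch of one trial; not the code used in the pilot.
Rule = Callable[[List[int]], bool]
Move = Tuple[str, object]  # ("test", [1, 2, 3]) or ("answer", "the list is strictly increasing")

def run_trial(scientist: Callable[[list], Move], rule: Rule,
              initial_examples: List[List[int]], max_turns: int = 20) -> str:
    """The scientist proposes test lists, is told whether each follows the hidden rule,
    and eventually commits to a final hypothesis (judged separately for equivalence)."""
    history = [(ex, True) for ex in initial_examples]  # three lists known to follow the rule
    for _ in range(max_turns):
        kind, payload = scientist(history)
        if kind == "answer":
            return payload                              # handed to the judge model in the pilot
        history.append((payload, rule(payload)))        # feedback on the proposed test list
    return "no final hypothesis given"

# Example rule: 'strictly increasing order', the easiest-rated task in the pilot.
strictly_increasing: Rule = lambda xs: all(a < b for a, b in zip(xs, xs[1:]))
```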
Full details are available in the preregistration and repository (and we're happy to go into more detail in the comments).
Results
Most LLMs did very badly: models discovered a mean of 11% and a median of 10% of the 10 rules. 4 of the 11 tested models got 0 correct; 5 got exactly 1. OpenAI-o1-preview got 3 and Claude-3.6-Sonnet got 4[6].
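As a quick check on those aggregates, the per-model counts just listed reproduce the reported mean and median:

```python
import statistics

# Rule counts per model implied above: four models at 0, five at 1, o1-preview at 3, Sonnet at 4.
scores = [0, 0, 0, 0, 1, 1, 1, 1, 1, 3, 4]

print(statistics.mean(scores) / 10)    # ~0.109 -> mean of ~11% of the 10 rules
print(statistics.median(scores) / 10)  # 0.1    -> median of 10%
```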
The authors' independent prior estimates of problem difficulty were mostly confirmed (problems are in order of estimated difficulty in the figure), although one problem (the 'collatzish' hailstone sequence variation) was easier than we expected, possibly because models' training data contains many descriptions of the hailstone sequence itself.
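For readers unfamiliar with it, the standard hailstone (Collatz) step is shown below; the rule we actually tested was a variation on it rather than this exact map.

```python
def hailstone_step(n: int) -> int:
    # Standard Collatz map: halve even numbers, map odd n to 3n + 1.
    return n // 2 if n % 2 == 0 else 3 * n + 1
```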
This result is somewhat ambiguous in terms of what it says about LLMs' capacity for hypothesis generation and testing. Most LLMs weren't capable of it, suggesting that the LLM architecture may be poorly suited to these tasks. On the other hand, the latest and most sophisticated models seemed to show clear signs of life.
As a partial point of comparison, in Wason's testing only about 20% of humans solved the problem tested, but Wason's experiment differed in two important ways: first, subjects were deliberately given a misleading example, and second, only one task was tested (our easiest-rated task, 'strictly increasing order').
Why did the models fail? First, in about 25% of cases (primarily for the smaller models), they failed to produce valid JSON output. Beyond these basic errors, the primary failure mode was confirmation bias[7]: despite repeated instructions to try to falsify their hypotheses, models failed to do so consistently. Interestingly, in some cases models would effectively add epicycles in the face of contrary evidence, complicating their hypotheses and adding special cases rather than simply rejecting them. Sometimes a model's success or failure was determined by whether its initial pool of hypotheses happened to contain one sufficiently similar to the true rule that the model could move iteratively from one to the other. Another failure mode (again, primarily for smaller models) was simple analysis error, e.g. making arithmetic mistakes when analyzing an example sequence.
While the conclusions we can draw from this experiment are limited, the full project, which we describe next, tries to find more definitive answers.
Full experiment (in progress)
Key advantages over the pilot
Procedure
The full experiment is similar in most ways to this pilot project, but instead of lists of integers, we randomly generate novel quasi-physical domains. These domains consist of objects of various types; each type of object has certain properties (which may be numeric or boolean). There is a set of operations that can be performed either upon a single object, or by one object on another. All names (object types, properties, operations) are randomly generated.
As a running example, let's say the system contains a type of object called a zorple, which may or may not be quizzleplistic, and whose zibblosity has some numeric value. It is possible to splorficate a zorple, or to wuzzle it. Our domain might contain, say, eight zorples, each of which may or may not be quizzleplistic, and each of which has its own zibblosity value.
The system has a set of underlying (again, randomly generated) causal relationships. For example, perhaps when a zorple is splorficated, its zibblosity doubles. The LLM being tested is not told anything about these ground-truth relationships.
Note that these domains can be as simple or complex as we like. They could contain a single type of object with a single boolean property, with a single operation that can be performed, and a small number of mathematically simple causal relationships. Or they could contain many types of objects, each with many numeric and boolean properties, many operations that have different effects depending on what objects perform and receive them, and a large number of complex causal relationships.
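As a rough illustration of the kind of structure involved (the names and representation here are placeholders rather than the project's actual generation code), a generated domain might look something like this:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union

Properties = Dict[str, Union[bool, float]]

@dataclass
class ObjectInstance:
    type_name: str            # e.g. "zorple"
    properties: Properties    # e.g. {"quizzleplistic": True, "zibblosity": 3.0}

@dataclass
class CausalRule:
    operation: str                          # e.g. "splorficate"
    target_type: str                        # e.g. "zorple"
    effect: Callable[[Properties], None]    # e.g. doubles "zibblosity" in place

@dataclass
class Domain:
    objects: List[ObjectInstance] = field(default_factory=list)
    operations: List[str] = field(default_factory=list)     # e.g. ["splorficate", "wuzzle"]
    rules: List[CausalRule] = field(default_factory=list)   # ground truth, hidden from the model

# The running example: splorficating a zorple doubles its zibblosity.
double_zibblosity = CausalRule(
    operation="splorficate",
    target_type="zorple",
    effect=lambda props: props.update(zibblosity=props["zibblosity"] * 2),
)
```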
As in the pilot experiment, we inform the model that it is a scientist investigating a new domain, and that it should perform experiments in order to scientifically characterize the domain. Its task is to decide on specific experiments to perform; in our example above, it might decide to splorficate three of the zorples to see whether and how their properties change.
For each experiment that the model chooses to perform, an automated 'lab assistant' conducts the experiment and reports the results, which depend entirely on the pre-established causal relations of the domain. The lab assistant takes no initiative at all; its only role is to report the output of the experiments that the LLM requests.
Once the model-as-scientist has performed as many experiments as it wishes to, we test whether it has learned a full and correct model of the domain (this could be done in several ways, ranging from being able to correctly predict the outcomes of new experiments to being able to write a report containing all relevant equations).
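Continuing the illustrative sketch from above (again, placeholder code rather than anything from the project itself), the lab assistant's job reduces to applying whichever hidden causal rules match a requested experiment and reporting the outcome verbatim:

```python
def run_requested_experiment(domain: Domain, operation: str, target: ObjectInstance) -> Properties:
    """The automated 'lab assistant': no initiative, just applies matching causal rules and reports."""
    for rule in domain.rules:
        if rule.operation == operation and rule.target_type == target.type_name:
            rule.effect(target.properties)
    return dict(target.properties)   # reported back to the model-as-scientist, with no commentary

# e.g. the model asks to splorficate a particular zorple and observes its zibblosity double.
```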
For a bit more detail, see this writeup.
What we want from you
Over the past six months, we have had many discussions with other researchers, here and elsewhere, about whether LLMs are capable of general reasoning, and in particular whether they are capable of general reasoning out-of-distribution. Opinions vary quite dramatically. We claim that the full project described above will provide good evidence on this question.
If this is a topic you have a view on, we'd really like to hear from you in the comments:
Conclusion
In our view, it's vitally important to better understand whether LLMs are weaker than they naively appear, in ways that will limit their ability to do autonomous scientific research and prevent direct scaling to full AGI. The pilot project shown here provides some initial, ambiguous evidence on this question, and we expect the full project to provide significantly stronger evidence. We'd very much like to get your input on how we could shape the project to provide even better evidence on this key question.
To the extent that these results (or results from the main experiment) show that LLMs struggle with general reasoning, we must be careful about what implications we draw. First, these results clearly do not mean that LLMs can't do anything that we would think of as reasoning; numerous benchmarks and results show LLMs succeeding at tasks we typically think of as reasoning, even if that reasoning isn't fully general or extended out-of-distribution. Second, these results don't mean that LLMs can't increase the velocity of scientific research; again, there's a growing body of literature (eg see examples here) showing that they can. What they could show, though, is that LLMs are fundamentally limited at autonomously performing the end-to-end cycle of scientific research, which would place bounds on the degree to which they can accelerate research (and in particular capabilities research). One human scientist might be able to direct a handful of LLMs and thereby speed up their research by a factor of a few, but it seems less likely that they could spin up 1000 LLMs, point them at a problem, and turn them loose.
Conversely, to the extent that these results suggest signs of life from LLMs on this sort of autonomous research, interpreting that result requires equal care. Our experiment here (and in the main project) aims to test nearly the simplest possible version of such research; LLMs may need multiple orders of magnitude of further scaling before they can tackle important real-world problems autonomously. Nevertheless, such signs of life would suggest that this is something we'll see in the relatively near future, without the need for major further breakthroughs; scaling would likely be enough.
We are eager to get more compelling and rigorous evidence on this question from the main project. The resulting clarification of how far we are from AGI will let us better direct our resources toward achieving the best possible chance of good outcomes for humanity.
Appendices
In order to keep the main post as short and clear as possible, we've moved these sections to the bottom.
A. Acknowledgments
Thanks for helpful input to Abdur Raheem Ali (who had ideas for a similar experiment which partly inspired this one), Daniel Tan, Robert Kralisch, and Andy Arditi. Thanks for draft feedback to Nicholas Kees Dupuis, Daniel Tan, James Chua, Felix Binder, Seth Herd, Andy Arditi, Jord Nguyen, Aaron Scher, and Robert Adragna.
B. Related work
We will provide a more extensive discussion of related work with the main experiment, but to briefly point to some interesting representative examples (for both the pilot and the full project):
C. Limitations
This pilot project has many limitations, some of which will be overcome in the main project.
D. Full list of tested rules
The full set of 60 candidate rules can be found here; the following are the 10 we used for the pilot:
E. Full prompt
Roughly seven hours were spent iterating on the prompt.
You can also see the analysis and judgment prompts given to the evaluating model in the repository.
F. Minimal human baseline
The authors played four rounds of the game with each other. This was fairly informal and the details were slightly different, so this shouldn't be taken as an official baseline, but results were:
G. Example successes and failures
You can find the full set of official transcripts here, and various non-official ones here and here.
Example of success
Here's a small model (32B parameters) succeeding at 'There are no repeated numbers in the list', one of two models that got this problem:
Example of failure
Here we see a failure from a state-of-the-art model on a problem that only one model (Sonnet) succeeded on:
[1] Current working definition of 'general reasoning' (from here): "The ability to do deduction, induction, and abduction in a careful, step by step way, without making many errors that a better reasoner could avoid, including in new domains." Possibly also "the ability to use all of that to build a self-consistent internal model of the domain under consideration," although I'm uncertain whether that part is both valid and necessary.
[2] I'm using a definition of AGI often attributed to Shane Legg, as a system that 'can do practically any cognitive task a human can do' -- but my point here applies to most other definitions of AGI as well.
[3] Note also that in those worlds where LLMs are already capable of general reasoning, they will likely themselves accelerate advancement to and past AGI by being able to do autonomous or semi-autonomous capabilities research.
[4] Name taken from an excellent series of sketches from Mitchell & Webb, of which the first is here.
[5] Officially and confusingly known as 'claude-3-5-sonnet-20241022'; this model is considerably more capable than the earlier versions also tagged as claude-3.5-sonnet, so many have taken to referring to it as 3.6.
[6] We suspect that Sonnet's 40% correct is a bit of an outlier, based on earlier experiments (on a separate set of rules of comparable difficulty). On two reruns, Sonnet got 30%. For full transparency: Sonnet actually got one more correct on the initial run, for 50%, but we had accidentally failed to add the second and third examples for the last two questions, so we did an official rerun of those for all models, and that time Sonnet failed to get it. It's also worth noting that iteration on the prompt prior to the main experiment was largely done with Sonnet; it's possible that as a result the prompt is somewhat optimized for Sonnet (although this was not our intent), potentially making Sonnet's relative success less significant.
[7] Previous research (eg 1, 2) has demonstrated confirmation bias and other cognitive biases in LLMs.