Update: After seeing a comment by AdamK on Manifold, I dug into the code and can confirm that the way the codebase queries for articles does at least check for meta tags that indicate when an article was last updated (which my guess is aren't reliable, but it does seem like they at least tried). I would be highly surprised if their code addresses all of the myriad data-contamination issues (including very tricky ones like news articles that predicted things accurately getting more traffic after a forecasted event happened and therefore coming up higher in search results, even if they were written before the resolution time). I am currently taking bets that on prospective forecasts this system will perform worse than advertised (and also separately think that the advertised performance does not meaningfully make this system "superhuman")
How did you handle issues of data contamination?
In your technical report you say you validated performance for this AI system using retrodiction:
Performance. To evaluate the performance of the model, we perform retrodiction, pioneered in Zou
et al. [3]. That is to say, we take questions about past events that resolve after the model’s pretraining
data cutoff date. We then compare the accuracy of the crowd with the accuracy of the model, both having access to the same amount of recent information. When we retrieve articles for the forecasting AI, we use the search engine’s date cutoff feature, so as not to leak the answer to the model.
I am quite concerned about search engines actually not being capable of filtering out data for recent events.As an example, I searched “Israel attack on Iran” as you mention that as a concrete example in this excerpt of the blog post:
Concretely, we asked the bot whether Israel would carry out an attack on Iran before May 1, 2024. .
The first result of searching for "Israel attack on Iran", if you set the date cutoff to October 1st 2023, is this:
As you can see, Google claims a publishing data of "Aug 11, 2022". However, when you click into this article, you will quickly find the following text:
The article actually includes updates from April 19, 2024! This is very common, as many articles get updated after they are published.
The technical report just says:
When we retrieve articles for the forecasting AI, we use the search engine’s date cutoff feature, so as not to leak the answer to the model.
But at least for Google this fails, unless you are using an unknown functionality for Google.
Looking into the source code, it appears that the first priority source you check is some meta tags:
date_meta_tags = ['article:published_time', 'datePublished', 'date', 'pubdate', 'og:published_time', 'publishdate' ]
However, for the article I just linked, those meta tags do indeed say the article was published in 2022:
<meta property="article:published_time" content="2022-08-11T11:15:47-0300">
This means this article, as far as I can tell from parsing the source code, would have its full text end up in the search results, even though it’s been updated in 2024 and includes the events that are supposed to be forecasted (it might be filtered out by something else, but I can’t seem to find any handling of modified articles).
Generally, data contamination is a huge issue for retrodiction, so I am assuming you have done something good here, otherwise it seems very likely your results are inflated because of those data contamination issues, and we should basically dismiss the results of your technical report.
To be clear, I did not do any cherry-picking of data here. The very first search query on any topic that I tried was the search I document above.
Yes, the Google 'search by date' is basically completely busted and has been for a while (even though Google possesses the capability to date content accurately by using their Internet-wide snapshot archive going back to the 1990s, whose existence was recently publicly confirmed by the ad API leak). For kicks, try searching things like "Xi Jinping" with date-ranges like 2013... It is most regrettable, as it used to be a useful tool for me in digging up old stuff. There also seem to be issues in the other direction, where Google is 'forgetting' old articles which aren't being indexed at all, apparently, in any publicly-searchable fashion (which might be contributing to the former, by a base rates screening-paradox effect - if all the real old articles have been forgotten by the index, then only erroneously timestamped hits will be available). I'm not aware of any search engine whose date cutoff is truly reliable. Even if they were, you would still have to manually check and clean to be sure that things like sidebars or recommendations were not causing data leakage.
I also agree that if this is really the only countermeasure to data leakage OP has taken, then the results seem dead on arrival. ML models infamously ruthlessly exploit far subtler forms of temporal data leakage than this...
It sounds like I'll be waiting for some actually out-of-sample forecasting numbers before I believe any claims about beating Metaculus etc.
(This is a surprising mistake for a benchmarking expert to make. Even if you knew nothing about the specific problems with date-range search, it should be obvious how even with completely unedited, static snapshots from the past, that there would be leakage - like results will rank higher or lower based on future events. If Israel attacked Iran, obviously all articles before arguing that Israel will/should/could attack Iran are going to benefit from being 'right' and ranked higher than articles arguing the opposite, many of which will outright quietly disappear & cease to be mentioned, and a LLM conditioned on those rather than the lower-ranking ones will automatically & correctly 'predict' more accurately. And countless other leakages like that, which are not fixed as easily as "just download a snapshot from the IA".)
Danny Halawi says there is lower performance on a different set of more heldout predictions, and the claims about GPT-4 knowledge cutoff are probably wrong:
The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023.
Findings:
- Their Brier score: 0.195
- Crowd Brier score: 0.141 [lower=better]
First issue:
The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023.
However, this is not correct.
For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house.
- This event happened at the end of October.
- This also happens to be a question in the Metaculus dataset.
Seems worth mentioning SOTA, which is https://futuresearch.ai/. Based on the competence & epistemics of Futuresearch team and their bot get very strong but not superhuman performance, roll to disbelieve this demo is actually way better and predicts future events at superhuman level.
Also I think it is a generally bad to not mention or compare to SOTA but just cite your own prior work. Shame.
Can you explain what's your definition of "accuracy"? (the 87.7% figure)
Does it correspond to some proper scoring rule?
How much does it cost to run one instance of this right now?
One obvious idea is to build integrations for Manifold, Fatebook, Metaculus etc where you automatically have this bot give an answer. I'm curious how much that costs when you're just doing it all over the place.
(I don't know that I believe the current generation of bots are as good as implied here for reasons Habryka/Gwern are saying, but, regardless, I expect at least pretty decent bots of this form to be available within a year if they aren't already, and for it to be useful for various sites to have SOTA forecasting bots plugged into them if they're at least nontrivially better than chance)
Some additional thoughts:
A thing that is different from a forecasting bot, is an "operationalization" bot, that scans threads with disagreements for plausible operationalizations of the disagreement.
I think good discourse is partly bottlenecked on "it's pretty expensive and annoying to operationalize many disagreements, in ways that actually help get to the heart of the matter."
I'd be pretty interested in experimenting with operationalization-bot and forecasting-bot combos (once they get sufficiently good to not be annoying), that scan disagreements, suggest little cloud-bubbles of "is this a cruxy operationalization you could bet on?", which maybe disappear if nobody clicks "yep".
Then, people are encouraged to put their own probabilities on it, and then, afterwards (to avoid anchoring), a forecasting bot gives it's own guess.
In a recent appearance on Conversations with Tyler, famed political forecaster Nate Silver expressed skepticism about AIs replacing human forecasters in the near future. When asked how long it might take for AIs to reach superhuman forecasting abilities, Silver replied: “15 or 20 [years].”
In light of this, we are excited to announce “FiveThirtyNine,” an AI forecasting bot. Our bot, built on GPT-4o, provides probabilities for any user-entered query, including “Will Trump win the 2024 presidential election?” and “Will China invade Taiwan by 2030?” Our bot performs better than experienced human forecasters and performs roughly the same as (and sometimes even better than) crowds of experienced forecasters; since crowds are for the most part superhuman, FiveThirtyNine is in a similar sense. (We discuss limitations later in this post.)
Our bot and other forecasting bots can be used in a wide variety of contexts. For example, these AIs could help policymakers minimize bias in their decision-making or help improve global epistemics and institutional decision-making by providing trustworthy, calibrated forecasts.
We hope that forecasting bots like ours will be quickly integrated into frontier AI models. For now, we will keep our bot available at forecast.safe.ai, where users are free to experiment and test its capabilities.
Quick Links
Problem
Policymakers at the highest echelons of government and corporate power have difficulty making high-quality decisions on complicated topics. As the world grows increasingly complex, even coming to a consensus agreement on basic facts is becoming more challenging, as it can be hard to absorb all the relevant information or know which sources to trust. Separately, online discourse could be greatly improved. Discussions on uncertain, contentious issues all too often devolve into battles between interest groups, each intent on name-calling and spouting the most extreme versions of their views through highly biased op-eds and tweets.
FiveThirtyNine
Before transitioning to how forecasting bots like FiveThirtyNine can help improve epistemics, it might be helpful to give a summary of what FiveThirtyNine is and how it works.
FiveThirtyNine can be given a query—for example, “Will Trump win the 2024 US presidential election?” FiveThirtyNine is prompted to behave like an “AI that is superhuman at forecasting”. It is then asked to make a series of search engine queries for news and opinion articles that might contribute to its prediction. (The following example from FiveThirtyNine uses GPT-4o as the base LLM.)
Based on these sources and its wealth of prior knowledge, FiveThirtyNine compiles a summary of key facts. Given these facts, it’s asked to give reasons for and against Trump winning the election, before weighing each reason based on its strength and salience.
Finally, FiveThirtyNine aggregates its considerations while adjusting for negativity and sensationalism bias in news sources and outputs a tentative probability. It is asked to sanity check this probability and adjust it up or down based on further reasoning, before putting out a final, calibrated probability—in this case, 52%.
Evaluation. To test how well our bot performs, we evaluated it on questions from the Metaculus forecasting platform. We restricted the bot to make predictions only using the information human forecasters had, ensuring a valid comparison. Specifically, GPT-4o is only trained on data up to October 2023, and we restricted the news and opinion articles it could access to only those published before a certain date. From there, we asked it to compute the probabilities of 177 events from Metaculus that had happened (or not happened) since.
We compared the probabilities our bot arrived at to those arrived at independently by crowds of forecasters on the prediction platform Metaculus. For example, we asked the bot to estimate the probability that Israel would carry out an attack on Iran before May 1, 2024, restricting it to use the same information available to human forecasters at the time. This event did not occur, allowing us to grade the AI and human forecasts. Across the full dataset of events, we found that FiveThirtyNine performed just as well as crowd forecasts.
Strengths over prediction markets. On the 177 events, the Metaculus crowd got 87.0% accuracy, while FiveThirtyNine got 87.7% ± 1.4. A link to the technical report is here. This bot lacks many of the drawbacks of prediction markets. It makes forecasts within seconds. Additionally, groups of humans do not need to be incentivized with cash prizes to make and continually update their predictions. Forecasting AIs are several orders of magnitude faster and cheaper than prediction markets, and they’re similarly accurate.
Limitations. The bot is not fine-tuned, and doing so could potentially make it far more accurate. It simply retrieves articles and writes a report as guided through an engineered prompt. (Its prompt can be found by clicking on the gear icon in forecast.safe.ai.) Moreover, probabilities from AIs are also known to lead to automation bias, and improvements in the interface could ameliorate this. The bot is also not designed to be used in personal contexts, and it has not been tested on its ability to predict financial markets. Forecasting can also lead to the neglect of tail risks and lead to self-fulfilling prophecies. That said, we believe this could be an important first step towards establishing a cleaner, more accurate information landscape. The bot is decidedly subhuman in delimited ways, even if it is usually beyond the median human in breadth, speed, and accuracy. If it’s given an invalid query it will still forecast---a reject option is not yet implemented. If something is not in the pretraining distribution and if no articles are written about it, it doesn’t know about it. That is, if it’s a forecast about something that’s only on the X platform, it won’t know about it, even if a human could. For forecasts for very soon-to-resolve or recent events, it does poorly. That’s because it finished pretraining a while ago so by default thinks Joe Biden is in the race and need to see articles to appreciate the change in facts. Its probabilities are not always completely consistent either (like prediction markets). In claiming the bot is superhuman (around crowd level in accuracy), we’re not claiming it’s superhuman in every possible respect, much like how academics can claim image classifiers are superhuman at ImageNet, despite being vulnerable to adversarial ImageNet images. We do not think AI forecasting is overall subhuman.
Vision
Epistemic technologies such as Wikipedia and Community Notes have had significant positive impacts on our ability to understand the world, hold informed discussions, and maintain consensus reality. Superhuman forecasting AIs could have similar effects, enabling improved decision-making and public discourse in an increasingly complex world. By acting as a neutral intelligent third party, forecasting AIs could act as a tempering force on those pushing extreme, polarized positions.
Chatbots. Through integration into chatbots and personal AI assistants, strong automated forecasting could help with informing consequential decisions and anticipating severe risks. For example, a forecasting AI could provide trustworthy, impartial probability assessments to policymakers. The AI could also help quantify risks that are foreseeable to experts but not yet to the general public, such the possibility that China might steal OpenAI’s model weights.
Posts. Forecasting AIs could be integrated on social media and complement posts to help users weigh the information they are receiving.
News stories. Forecasting could also complement news stories and topics. For example, for news associated with California AI Safety bill SB 1047, a forecasting bot could let users know the probability that it gets signed into law.
Conclusion
Carl Sagan noted, “If we continue to accumulate only power and not wisdom, we will surely destroy ourselves.” AIs will continue to become more powerful, but their forecasting capabilities will hopefully help make us more prudent and increase our foresight.