I had a nice conversation with Ege today over dinner, in which we identified a possible bet to make! Something I think will probably happen in the next 4 years, that Ege thinks will probably NOT happen in the next 15 years, such that if it happens in the next 4 years Ege will update towards my position and if it doesn't happen in the next 4 years I'll update towards Ege's position.
Drumroll...
I (DK) have lots of ideas for ML experiments, e.g. dangerous capabilities evals, or simple experiments related to paraphrasers and so forth in the Faithful CoT agenda. But I'm a philosopher; I don't code myself. I know enough that if I had some ML engineers working for me, that would be sufficient for my experiments to get built and run, but I can't do it by myself.
When will I be able to implement most of these ideas with the help of AI assistants basically substituting for ML engineers? So I'd still be designing the experiments and interpreting the results, but AutoGPT5 or whatever would be chatting with me and writing and debugging the code.
I think: Probably in the next 4 years. Ege thinks: probably not in the next 15.
Ege, is this an accurate summary?
Here's a sketch for what I'd like to see in the future--a better version of the scenario experiment done above:
Curated. I feel like over the last few years my visceral timelines have shortened significantly. This is partly from contact with LLMs, particularly their increased coding utility, and a lot of it is downstream of Ajeya's and Daniel's models and outreach (I remember spending an afternoon on an arts-and-crafts 'build your own timeline distribution' exercise that Daniel had nerdsniped me with). I think a lot of people are in a similar position and have been similarly influenced. It's nice to get more details on those models and the differences between them, as well as to hear Ege pushing back with "yeah, but what if there are some pretty important pieces that are missing and won't get scaled away?", which I hear from my environment much less often.
There are a couple of pieces of extra polish that I appreciate. First, having some specific operationalisations with numbers and distributions up-front is pretty nice for grounding the discussion. Second, I'm glad that there was a summary extracted out front, as sometimes the dialogue format can be a little tricky to wade through.
On the object level, I thought the focus on schlep in the Ajeya-Daniel section and slowness of economy turnover in the Ajaniel-Ege section was pretty interesting. I think there's a bit of a cycle with trying to do complicated things like forecast timelines, where people come up with simple compelling models that move the discourse a lot and sharpen people's thinking. People have vague complaints that the model seems like it's missing something, but it's hard to point out exactly what. Eventually someone (often the person with the simple model) is able to name one of the pieces that is missing, and the discourse broadens a bit. I feel like schlep is a handle that captures an important axis that all three of our participants differ on.
I agree with Daniel that a pretty cool follow-up activity would be an expanded version of the exercise at the end with multiple different average worlds.
Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute. The 70b models are borderline capable of following routine instructions to label data or pour it into specified shapes. GPT-4 is almost robustly capable of that. There are 3-4 more effective OOMs in the current investment scaling sprint (3-5 years), so another 2 steps of improvement if there were enough equally useful training data to feed the process, which there isn't. At some point, training will get books as images that weren't previously available as high-quality text, which might partially compensate for running out of text data. Perhaps there are 1.5 steps of improvement over GPT-4 in total despite the competence-dense data shortage. (All of this happens too quickly to be restrained by regulation, and without AGI it never becomes more scary than useful.)
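As a rough back-of-the-envelope check on that step count (a minimal sketch; the OOM figures are just the ranges assumed in this comment, not measurements):

```python
ooms_remaining = (3.0, 4.0)   # assumed effective OOMs left in the investment scaling sprint
ooms_per_step = (1.5, 2.0)    # assumed effective OOMs per subjective step of improvement

low = min(ooms_remaining) / max(ooms_per_step)   # 1.5 steps
high = max(ooms_remaining) / min(ooms_per_step)  # ~2.7 steps
print(low, high)  # roughly 2 steps, if data were not the bottleneck
```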
Leela Zero is a 50m-parameter model that plays superhuman Go, a product of the quality of its synthetic dataset. Just as with images, sound, natural languages, and programming languages, we can think of playing Go and writing formal proofs as additional modalities. A foundation model that reuses circuits between modalities would be able to take the competence from where synthetic data recipes are known and channel it into better reasoning in natural language, understanding human textbooks and papers, getting closer to improving the quality of its natural language datasets. Competence at in-context learning or sample efficiency during pre-training is only relevant where the system is unable to do real work on its own, which is why essential use of RL can seem necessary for AGI. But once a system is good enough to pluck the low-hanging R&D fruit around contemporary AI architectures, these obstructions are gone. (Productively tinkering with generalized multimodality and synthetic data doesn't require going outside the scale of preceding models, which keeps existing regulation too befuddled to intervene.)
Leela Zero uses MCTS; it doesn't play superhuman Go in one forward pass (as GPT-4 can in some subdomains) (I think; I didn't find any evaluations of Leela Zero at 1 forward pass), and I'd guess that the network itself doesn't contain any more generalized game-playing circuitry than an LLM; it just has good intuitions for Go.
Nit:
Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute.
1.5 to 2 OOMs? 7b to 70b is 1 OOM of compute; adding in Chinchilla efficiency would make it more like 1.5 OOMs of effective compute, not 2. And LLaMA-2-70b to GPT-4 is 1 OOM of effective compute according to OpenAI naming: LLaMA-2-70b is about as good as GPT-3.5. And I'd personally guess GPT-4 is 1.5 OOMs of effective compute above LLaMA-2-70b, not 2.
Leela Zero uses MCTS; it doesn't play superhuman Go in one forward pass
Good catch: since the context from LLMs is performance in one forward pass, the claim should be about that, and I'm not sure it's superhuman without MCTS. I think the intended point survives this mistake, namely that it's a much smaller model than modern LLMs with relatively very impressive performance, primarily because of the high quality of the synthetic dataset it effectively trains on. Thus models at the scale of near-future LLMs will have a reality-warping amount of dataset quality overhang. This makes the ability of LLMs to improve datasets much more impactful than their competence at other tasks, hence the anchors of capability I was pointing out were about labeling and rearranging data according to instructions. It also makes compute-threshold-gated regulation potentially toothless.
Subjectively there is clear improvement between 7b vs. 70b vs. GPT-4, each step 1.5-2 OOMs of training compute.
1.5 to 2 OOMs? 7b to 70b is 1 OOM of compute; adding in Chinchilla efficiency would make it more like 1.5 OOMs of effective compute, not 2.
With Chinchilla scaling, compute scales as the square of model size, so 2 OOMs under that assumption. Of course current 7b models are overtrained compared to Chinchilla (all sizes of LLaMA-2 are trained on 2T tokens), which might be your point. And Mistral-7b is less obviously a whole step below LLaMA-2-70b, so the claim of a full step of improvement should be about earlier 7b models that are more representative of how the frontier of scaling advances, where a Chinchilla-like tradeoff won't yet completely break down, probably preserving the compute-scales-as-data-squared estimate (parameter count no longer works very well as an anchor with all the MoE and sparse pre-training stuff). It's not clear what assumptions make it 1.5 OOMs instead of either 1 or 2; possibly the Chinchilla-inefficiency of overtraining?
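To make that arithmetic explicit, here is a minimal sketch, assuming the conventional C ≈ 6·N·D approximation for pre-training compute and Chinchilla-optimal D ≈ 20·N (both constants are standard rough approximations, not figures from this thread):

```python
import math

def chinchilla_train_flops(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate pre-training compute: C ~= 6 * N * D, with Chinchilla-optimal D ~= 20 * N."""
    return 6 * n_params * (tokens_per_param * n_params)

ooms = math.log10(chinchilla_train_flops(70e9) / chinchilla_train_flops(7e9))
print(ooms)  # 2.0 -- under Chinchilla scaling, compute grows as the square of model size
```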
And LLaMA-2-70b to GPT-4 is 1 OOM of effective compute according to OpenAI naming: LLaMA-2-70b is about as good as GPT-3.5.
I was going from the Epoch AI estimate that puts LLaMA 2 at 8e23 FLOPs and GPT-4 at 2e25 FLOPs, which is 1.4 OOMs. I'm thinking of effective compute in terms of the compute necessary for achieving the same pre-training loss (using a lower amount of literal compute thanks to pre-training algorithmic improvement), not in terms of meaningful benchmarks for fine-tunes. In this sense overtrained smaller LLaMAs get even less effective compute than literal compute, since they employ it to reach their loss Chinchilla-inefficiently. We can then ask how much subjective improvement a given amount of pre-training loss scaling (in terms of effective compute) gets us. It's not that useful in detail, but it gives an anchor for improvement from scale alone in the coming years, before industry and the economy force a slowdown (absent AGI): it goes beyond GPT-4 about as far as GPT-4 is beyond LLaMA-2-13b.
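For reference, the OOM gap implied by those two estimates (the FLOP figures are the Epoch AI numbers quoted above):

```python
import math

llama2_flops = 8e23  # Epoch AI estimate for LLaMA 2 pre-training compute
gpt4_flops = 2e25    # Epoch AI estimate for GPT-4 pre-training compute
print(math.log10(gpt4_flops / llama2_flops))  # ~1.4 OOMs
```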
IIRC, the original AlphaGo had a policy network that was grandmaster level but not superhuman without MCTS.
This is not quite true. Raw policy networks of AlphaGo-like models are often at a level around 3 dan in amateur rankings, which would qualify as a good amateur player but nowhere near the equivalent of grandmaster level. If you match percentiles in the rating distributions, 3d in Go is perhaps about as strong as an 1800 Elo player in chess, while "master level" is at least 2200 Elo and "grandmaster level" starts at 2500 Elo.
Edit: Seems like policy networks have improved since I last checked these rankings, and the biggest networks currently available for public use can achieve a strength of possibly as high as 6d without MCTS. That would be somewhat weaker than a professional player, but not by much. Still far off from "grandmaster level" though.
According to Figure 6b in "Mastering the Game of Go without Human Knowledge", the raw policy network has an Elo of 3055, which according to this other page (I have not checked that these Elo ratings are comparable) makes it the 465th best player. (I don't know much about this and so might be getting the inferences wrong; hopefully the facts are useful.)
I found the discussion around Hofstadter's law in forecasting to be really useful, as I've definitely found myself and others adding fudge factors to timelines to reflect unknown unknowns, which may or may not be relevant when extrapolating capabilities from compute.
In my experience many people are of the feeling that current tools are primarily limited by their ability to plan and execute over longer time horizons. Once we have publicly available tools that are capable of carrying out even simple multi-step plans (book me a great weekend away with my parents with a budget of $x and send me the itinerary), I can see timelines amongst the general public being dramatically reduced.
I think unknown unknowns are a different phenomenon than Hofstadter's Law / Planning Fallacy. My thinking on unknown unknowns is that they should make people spread out their timelines distribution, so that it has more mass later than they naively expect, but also more mass earlier than they naively expect. (Just as there are unknown potential blockers, there are unknown potential accelerants.) Unfortunately I think many people just do the former and not the latter, and this is a huge mistake.
Interesting. I fully admit most of my experience with unknown unknowns comes from either civil engineering projects or bringing consumer products to market, both situations where the unknown unknowns are disproportionately blockers. But this doesn't seem to be the case with things like Moore's Law or continual improvements in solar panel efficiency where the unknowns have been relatively evenly distributed or even weighted towards being accelerants. I'd love to know if you have thoughts on what makes a given field more likely to be dominated by blockers or accelerants!
Could you elaborate on what it would mean to demonstrate 'savannah-to-boardroom' transfer? Our architecture was selected for in the wilds of nature, not our training data. To me it seems that when we use an architecture designed for language translation for understanding images we've demonstrated a similar degree of transfer.
I agree that we're not yet there on sample efficient learning in new domains (which I think is more what you're pointing at) but I'd like to be clearer on what benchmarks would show this. For example, how well GPT-4 can integrate a new domain of knowledge from (potentially multiple epochs of training on) a single textbook seems a much better test and something that I genuinely don't know the answer to.
Introduction
How many years will pass before transformative AI is built? Three people who have thought about this question a lot are Ajeya Cotra from Open Philanthropy, Daniel Kokotajlo from OpenAI and Ege Erdil from Epoch. Despite each spending at least hundreds of hours investigating this question, they still disagree substantially about the relevant timescales. For instance, here are their median timelines for one operationalization of transformative AI:
You can see the strength of their disagreements in the graphs below, where they give very different probability distributions over two questions relating to AGI development (note that these graphs are very rough and are only intended to capture high-level differences, and especially aren't very robust in the left and right tails).
So I invited them to have a conversation about where their disagreements lie, sitting down for 3 hours to have a written dialogue. You can read the discussion below, which I personally found quite valuable.
The dialogue is roughly split in two, with the first part focusing on disagreements between Ajeya and Daniel, and the second part focusing on disagreements between Daniel/Ajeya and Ege.
I'll summarize the discussion here, but you can also jump straight in.
Summary of the Dialogue
Some Background on their Models
Ajeya and Daniel are using a compute-centric model for their AI forecasts, illustrated by Ajeya's draft AI Timelines report and Tom Davidson's takeoff model, where the question of "when will we get transformative AI?" is reduced to "how much compute is necessary to get AGI, and when will we have that much compute?" (modeling algorithmic advances as reductions in the necessary compute).
Ege, on the other hand, thinks such models should have a lot of weight in our forecasts, but that they likely miss important considerations and don't provide enough evidence to justify the extraordinary predictions they make.
Habryka's Overview of Ajeya & Daniel discussion
These disagreements probably explain some, but not most, of the difference between Daniel's and Ajeya's timelines.
Habryka's Overview of Ege & Ajeya/Daniel Discussion
Overall, seeing AI get substantially better at transfer learning (e.g. seeing an AI trained on one genre of video game very quickly learn to play another genre of video game) would update all participants substantially towards shorter timelines.
We ended the dialogue by having Ajeya, Daniel and Ege put numbers on how much various AGI milestones would cause them to update their timelines (with the concrete milestones proposed by Daniel). Time constraints made it hard to go into as much depth as we would have liked, but Daniel and I are excited about fleshing out more concrete scenarios of how AGI could play out and then collecting more data on how people would update in such scenarios.
The Dialogue
Visual probability distributions
Opening statements
Daniel
Ege
Ajeya
On in-context learning as a potential crux
Taking into account government slowdown
Recursive self-improvement and AI's speeding up R&D
Do we expect transformative AI pre-overhang or post-overhang?
Hofstadter's law in AGI forecasting
Summary of where we are at so far and exploring additional directions
Exploring conversational directions
Ege's median world
Far-off-distribution transfer
A concrete scenario & where its surprises are
Overall summary, takeaways and next steps