The strongest critique of developmental interpretability we know is the following: while it is established that phase transitions exist in neural network training, it is not yet clear how common they are, and whether they make a good target for alignment.
Is it established that phase transitions exist in the training of non-toy neural networks?
There are clearly s-shaped loss curves in many non-toy cases, but I'm not aware of any known cases which are clearly phase transitions as defined here (which is how the term is commonly used in e.g. physics and how I think this post wants to use the term).
For instance, while formation of induction-like attention heads[1] probably results in s-shaped loss curves in at least some cases, my understanding is that this probably has nothing to do with changes in the minima of some notion of energy (as would be required for the definition linked above I think). I think the effect is probably the one described in Multi-Component Learning and S-Curves. Unless there is some notion of energy such that this multi-component case of s-shaped loss curves is well described as a phase transition and that's what's discussed in this post?
Some important disclaimers:
[1] These attention heads probably do a bunch of stuff which isn't that well described as induction, so I'm reluctant to call them "induction heads".
Great question, thanks. tl;dr: it depends on what you mean by established, but the obstacle to establishing such a thing is probably lower than you think.
To clarify, there are two types of phase transition involved here, in the terminology of Chen et al: Bayesian phase transitions in the number of samples, and dynamical phase transitions in training.
The former kind of phase transition is a first-order phase transition in the sense of statistical physics, once you relate the posterior to a Boltzmann distribution. The latter is a notion that belongs more to the theory of dynamical systems or potentially catastrophe theory. The link between these two notions is, as you say, not obvious.
However, Singular Learning Theory (SLT) does provide a link, which we explore in Chen et al. SLT says that the phases of Bayesian learning are also dominated by critical points of the loss, and so you can ask whether a given dynamical phase transition has "standing behind it" a Bayesian phase transition where, at some critical sample size, the posterior shifts from being concentrated near one critical point to being concentrated near another.
It turns out that, at least for a sufficiently large number of samples, the only real obstruction to this Bayesian phase transition existing is that the local learning coefficient near the critical point the posterior moves to should be higher than near the one it leaves. This will be hard to prove theoretically in non-toy systems, but we can estimate the local learning coefficients, compare them, and thereby provide evidence that a Bayesian phase transition exists.
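To make the role of the learning coefficient concrete, here is a minimal sketch based on the standard leading-order free energy asymptotics from SLT, F_n ≈ n·L + λ·log n, where L is the loss at a critical point and λ its local learning coefficient. The two phases (called alpha and beta here) and all numerical values are hypothetical, chosen only for illustration; they are not taken from Chen et al. The posterior prefers the phase with the lower free energy, so a transition from alpha to beta at some finite sample size needs beta to have lower loss but a higher learning coefficient.

```python
import math

def free_energy(n, loss, llc):
    """Leading-order asymptotic free energy of a phase: n * loss + llc * log(n)."""
    return n * loss + llc * math.log(n)

def critical_sample_size(loss_a, llc_a, loss_b, llc_b, n_max=10**6):
    """Smallest n >= 2 at which phase b has lower free energy than phase a.

    Returns None if phase b is never preferred for n <= n_max.
    """
    for n in range(2, n_max + 1):
        if free_energy(n, loss_b, llc_b) < free_energy(n, loss_a, llc_a):
            return n
    return None

# Hypothetical phases: beta has lower loss but a higher learning coefficient.
alpha = {"loss": 0.10, "llc": 2.0}
beta = {"loss": 0.05, "llc": 6.0}
n_cr = critical_sample_size(alpha["loss"], alpha["llc"], beta["loss"], beta["llc"])

# If beta instead had the lower learning coefficient as well, it would be
# preferred from the start and there would be no transition to observe.
n_no_transition = critical_sample_size(0.10, 6.0, 0.05, 2.0)
```

In this sketch n_cr is the critical sample size at which the preferred phase switches, while n_no_transition comes out as the smallest n tried, because beta wins everywhere. In practice the learning coefficients would have to be estimated rather than assumed.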
This has been done in the Toy Model of Superposition in Chen et al, and we're in the process of looking at a range of larger systems including induction heads. We're not ready to share those results yet, but I would point you to Nina Rimsky and Dmitry Vaintrob's nice post on modular addition which I would say provides evidence for a Bayesian phase transition in that setting.
There are some caveats and details that I can go into if you're interested. I would say the existence of Bayesian phase transitions in non-toy neural networks is not established yet, but at this point I think we can be reasonably confident they exist.
Thanks for the detailed response!
So, to check my understanding:
The toy cases discussed in Multi-Component Learning and S-Curves are clearly dynamical phase transitions. (In general, it's easy to establish dynamical phase transitions just from observation. In these cases we can also verify that the property holds for the corresponding differential equations, which are a good model because step size is unimportant.) Also, I speculate that it's easy to prove the existence of a Bayesian phase transition in the number of samples for these toy cases, given how simple they are.
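For readers who want to see this kind of s-shaped curve directly, here is a minimal sketch in the spirit of those toy cases. It is an assumed two-parameter model, not a reproduction of anything in that post: the output is the product a * b of two components, the target is 1, and both parameters start near zero. The function name and hyperparameters are made up for illustration.

```python
def simulate_two_component(eps=1e-3, lr=0.01, steps=2000):
    """Gradient descent on loss = 0.5 * (1 - a * b)**2 from a tiny initialization.

    Returns the loss recorded before each update step.
    """
    a = b = eps
    losses = []
    for _ in range(steps):
        err = 1.0 - a * b
        losses.append(0.5 * err ** 2)
        # d(loss)/da = -err * b and d(loss)/db = -err * a
        grad_a = -err * b
        grad_b = -err * a
        a, b = a - lr * grad_a, b - lr * grad_b
    return losses

losses = simulate_two_component()
```

Because each component's gradient is scaled by the other component, both grow roughly exponentially from the small initialization. The loss therefore sits on a plateau near 0.5 for several hundred steps and then drops quickly towards zero, which gives the s-shape without any reference to a notion of energy.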
More generally, I wish that when people used the term "phase transition", they clarified whether they meant "s-shaped loss curves" or some more precise notion. Often, people are making a non-mechanistic claim when they say "phase transition" (we observed a loss curve with an s-shape), but there are also mechanistic claims which require additional evidence.
In particular, when citing other work somewhere, it would be nice to clarify what notion of phase transition the other work is discussing.
Timaeus is a new AI safety research organization dedicated to making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. Currently, we are working on singular learning theory and developmental interpretability. Over time we expect to work on a broader research agenda, and to create understanding-based evals informed by our research.
Activities
Our primary focus is research. For now, we're a remote-first organization. We collaborate primarily through online seminars and the DevInterp Discord, with regular in-person meetings at workshops and conferences (see below). We're also investing time in academic outreach to increase the general capacity for work in technical AI alignment.
Research
We believe singular learning theory, a mathematical subject founded by Sumio Watanabe, will lead to a better fundamental understanding of large-scale learning machines and the computational structures that they learn to represent. It has already given us concepts like the learning coefficient and insights into phase transitions in Bayesian learning. We expect significant advances in the theory to be possible, and that these advances can inform new tools for alignment.
Developmental interpretability is an approach to understanding the emergence of structure in neural networks, which is informed by singular learning theory but also draws on mechanistic interpretability and ideas from statistical physics and developmental biology. The key idea is that phase transitions organize learning and that detecting, locating, and understanding these transitions could pave a road to evaluation tools that prevent the development of dangerous capabilities, values, and behaviors. We're engaged in a research sprint to test the assumptions of this approach.
We see these as two particularly promising research directions, and they are our focus for now. Like any ambitious research, they are not guaranteed to succeed, but there's plenty more water in the well. Broadly speaking, the research agenda of Timaeus is oriented towards solving problems in technical AI alignment using deep ideas from across many areas of mathematics and the sciences, with a "full stack" approach that integrates work from pure mathematics through to machine learning experiments.
The outputs we have contributed to so far:
Academic Outreach
AI safety remains bottlenecked on senior researchers and mentorship capacity. The young people already in the field will grow into these roles. However, given the scale and urgency of the problem, we think it is important to open inroads to academia and encourage established scientists to spend their time on AI safety.
Singular learning theory and developmental interpretability can serve as a natural bridge between the emerging discipline of AI alignment and existing disciplines of mathematics and science, including physics and biology. We plan to spend part of our time onboarding scientists into alignment via concrete projects in these areas.
Conferences
We're organizing conferences, retreats, hackathons, etc. focusing on singular learning theory and developmental interpretability. These have included and will include:
Team
Core Team
The research agenda that we are contributing to was established by Daniel Murfet, who is a mathematician at the University of Melbourne and an expert in singular learning theory, algebraic geometry, and mathematical logic.
Research Assistants
We just concluded a round of hiring and are excited to bring on board several very talented Research Assistants (RAs), starting with
Friends and Collaborators
Here are some of the people we are actively collaborating with:
Inclusion on this list does not imply endorsement of Timaeus' views.
Advisors
We're advised by Evan Hubinger and David ("Davidad") Dalrymple.
FAQ
Where can I learn more, and contact you?
Learn more on the Timaeus webpage. You can email Jesse Hoogland.
What about capabilities risk?
There is a risk that fundamental progress in either singular learning theory or developmental interpretability could contribute to further acceleration in AI capabilities in the medium term. We take this seriously and are seeking advice from other alignment researchers and organizations. By the end of our current research sprint we will have in place institutional forms to help us navigate this risk.
Likewise, there is a risk that outreach which aims to involve more scientists in AI alignment work will also accelerate progress in AI capabilities. However, those of us in academia can already see that as the risks become more visible, scientists are starting to think about these problems on their own. So the question is not whether a broad range of scientists will become interested in alignment but when they will start to contribute and what they work on.
It is part of Timaeus' mission to help scientists to responsibly contribute to technical AI alignment, while minimizing these risks.
Are phase transitions really the key?
The strongest critique of developmental interpretability we know is the following: while it is established that phase transitions exist in neural network training, it is not yet clear how common they are, and whether they make a good target for alignment.
We think developmental interpretability is a good investment in a world where many of the important structures (e.g., circuits) in neural networks form in phase transitions. Figuring out whether we live in such a world is one of our top priorities. It's not trivial because even if transitions exist they may not necessarily be visible to naive probes. Our approach is to systematically advance the fundamental science of finding and classifying transitions, starting with smaller systems where transitions can be definitively shown to exist.
How are you funded?
We're funded through a $142k Manifund grant led primarily by Evan Hubinger, Ryan Kidd, Rachel Weinberg, and Marcus Abramovitch. We are fiscally sponsored by Ashgro.
"Timaeus"? How do I even pronounce that?
Pronounce it however you want.
Timaeus is the eponymous character in the dialogue where Plato introduces his theory of forms. The dialogue posits a correspondence between the elements that make up the world and the Platonic solids. That's wrong, but it contains the germ of the idea of the unreasonable effectiveness of mathematics in understanding the natural world.
We read the Timaeus dialogue with a spirit of hope in the capacity of the human intellect to understand and solve wicked problems. The narrow gate to human flourishing is preceded by a narrow path.
We'll see you on that path.