Alexander, Matt and I want to chat about the field of Agent Foundations (AF), where it's at and how to strengthen and grow it going forward.
We will kick off by each of us making a first message outlining some of our key beliefs and open questions at the moment. Rather than giving a comprehensive take, the idea is to pick out 1-3 things we each care about/think are important, and/or that we are confused about/would like to discuss. We may respond to some subset of the following prompts:
Where is the field of AF at in your view? How do you see the role of AF in the larger alignment landscape/with respect to making AI futures go well? Where would you like to see it go? What do you see as some of the key bottlenecks for getting there? What are some ideas you have about how we might overcome them?
Before we launch in properly, just a few things that seem worth clarifying:
The main question I'm interested in about agent foundations at the moment is whether it should continue in its idiosyncratic current form, or whether it should start to look more like an ordinary academic field.
I'm also interested in discussing theories of change, to the extent it has bearing on the other question.
My own reasoning for foundational work on agency being a potentially fruitful direction for alignment research is:
The main reason I think it might be tractable is that so far not that many person-hours have gone into trying to do it. A priori it seems like the sort of thing you can get a nice mathematical formalism for, and so far I don't think that we've collected much evidence that you can't.
So I think I'd like to get a large number of people with various different areas of expertise thinking about it, and I'd hope that some small fraction of them discovered something fundamentally important. And a key question is whether the way the field currently works is conducive to that.
Does Agent Foundations-in-the-broad-sense need a new name?
Is the name 'Agent Foundations' cursed?
Suggestions I've heard are:
'What are minds', 'What are agents', 'Mathematical Alignment', 'Agent Mechanics'.
Some thought snippets:
(1) Clarifying and creating common knowledge about the scope of Agent Foundations and strengthening epistemic pluralism
(2) Where does AF sit on the "path to impact"?
(3) possible prompt: if you had 2 capable FTE and 500'000 USD for AF field building, what would you do?
...suffering from a lack of time, and will stop here for now.
One of my favorite blogposts is Schubert's "Against Cluelessness", which introduces 'Pockets of Predictability':
(...) intuitions about low-variance predictability long held back scientific and technological progress. Much of the world was once unknowable to humans, and people may have generalised from that, thinking that systematic study wouldn't pay off. But in fact knowability varied widely: there were pockets of knowability or predictability that people could understand even with the tools of the day (e.g. naturally simple systems like the planetary movements, or artificially simple systems like low-friction planes). Via these pockets of knowability, we could gradually expand our knowledge - and thus the world was more knowable than it seemed. As Ernest Gellner points out, the Scientific and Industrial Revolutions largely consisted in the realisation that the world is surprisingly knowable:
"the generic or second-order discovery that successful systematic investigation of Nature, and the application of the findings for the purpose of increased output, are feasible, and, once initiated, not too difficult."
I really like this way of thinking about the possibility of knowledge and development of science. I see a very similar 'predictability skepticism' across the field of Alignment.
This predictability skepticism is reflected in the Indefinite Optimism of lab-based alignment groups and the Indefinite Pessimism of doomers.
I want to introduce the idea of 'Pockets of Deep Expertise'. That is - I think much of scientific progress is made by small groups of people, mostly opaque from the outside ('pockets'), building up highly specific knowledge over fairly long time stretches ('deep expertise').
These pockets are
For the purpose of technical alignment, we need to think like a VC:
bet on a broad range of highly specific bets
To my mind we are currently only employing a tiny fraction of the world's scientific talent.
Although Alignment now attracts a very large group of promising young people, much of their energy and talent is being wasted on reinventing the wheel.
Everyone has mentioned something along the lines of wanting to get a broad range of specific bets or types of people. We could take that as read and discuss how to do it?
(Although if we are going to talk about how we want the field to look, that probably most naturally comes first)
Ok, great. Let's take stock quickly.
I think we are all interested in some version of "bet on a broad/plural range of highly specific bets". Maybe we should talk about that more at some point.
To help with the flow of this, it might be useful however to go a bit more concrete first. I suggest we take the following prompt:
if you had 2 capable FTE and 500'000 USD for AF field building, what would you do?
I'll give the idea I was chatting about with Alexander yesterday as my first answer.
There are probably a large number of academics with expertise in a particular area which seems potentially useful for alignment, and who might be interested in doing alignment research. But they might not know that there's a connection, or know anything about alignment. And unlike with junior researchers they're not gonna attend some MATS-type programme to pick it up.
So the idea is "instead of senior alignment researchers helping onboard junior people to alignment research, how about junior alignment people help onboard senior researchers from other areas?" Anti-MATS.
EDIT: Renamed to Reverse MATS because people glancing at the sidebar thought someone in the dialogue was anti MATS. We are pro MATS!
We have a large pool of junior people who've read plenty about alignment, but don't have mentorship. And there's a large pool of experienced researchers in potentially relevant subjects who don't know anything about alignment. So we send a junior alignment person to work as a research assistant or something with an experienced researcher in complexity science or active inference or information theory or somewhere else we think there might be a connection, and they look for one together and if they find it perhaps a new research agenda develops.
Yeah, I like this direction. I agree with the problem statement. The "junior person helping senior person" framing is maybe helpful, but I'm not sure it's the crux to getting this thing right. Here is what I think some of the cruxes/bottlenecks might be:
I think you'd have to approach researchers individually to see if they'd like to be involved.
The most straightforward examples would be people who work in a pretty obviously related area or who are known to have some interest in alignment already (I think both were true in the case of Dan Murfet and SLT?) or who know some alignment people personally. My guess is this category is reasonably large.
Beyond that, if you have to make a cold pitch to someone about the relevance of alignment (in general and as a research problem for them) I think it's a lot more difficult.
I don't think, for example, there's a good intro resource you can send somebody that makes a common-sense case for "basic research into agency could be useful for avoiding risks from powerful AI", especially not one with the hallmarks of legitimacy that make it easy for an academic to justify basing a research project on it.
Yeah cool. I guess another question is: once you've identified them, what do they need to succeed?
I've definitely also seen the failure mode where someone is exclusively or overly focused on "the puzzles of agency" without having an edge in linking those questions up with AI risk/alignment. Some ways of asking about/investigating agency are more relevant to alignment than others, so I think it's important that there is a clear/strong enough "signal" from the target domain (here: AI risk/alignment) to guide the search/research directions.
Yes, I agree with this.
I wonder whether focusing on agency is not even the right angle for this, and 'alignment theory' is more relevant. Probably what would be most useful for those researchers would be to have the basic problems of alignment made clear to them, and if they think that focusing on agency is a good way to attack those problems given their expertise then they can do that, but if they don't see that as a good angle they can pursue a different one.
I do think that having somebody who's well-versed in the alignment literature around (i.e. the proposed mentee) is potentially very impactful. There's a bunch of ideas that are very obvious to people in the alignment community because they're talked about so often (e.g. the training signal is not necessarily the goal of the trained model) that might not be obvious to someone thinking from first principles. A busy person coming in from another area could just miss something, and end up creating a whole research vision which is brought down by a snag that would have been obvious to an inexperienced researcher who's read a lot of LW.
seniorMATS - a care home for AI safety researchers in the twilight of their careers
Yes, good surface area to the problem is important. I think there is a good deal of know-how around on this by now: from introductory materials, to people with experience running the sort of research retreats that provide good initial contact with the space, to (as you describe) individuals who could help/assist/facilitate along the way. Also worth asking what the role of a peer environment should/could be (e.g. an AF Discord type thing, and/or something a bit more high-bandwidth).
Also, finding good general "lines of attack" might be pretty useful here. For example, I have found Evan's "model organisms" to be a pretty good/generative frame for getting AF-type work to be more productively oriented towards concrete/applied alignment work.
Alignment Noob Training - Inexperienced Mentees Actually Teach Seniors (ANTIMATS)
My model here puts less emphasis on "junior researchers mentoring up", and more on "creating the right surface area for people with the relevant expertise" more generally; one way to do this may be junior researchers with more alignment exposure, but I don't think that should be the central pillar.
The three things I am looking for in an academic (or non-academic) researcher with scientific potential are:
1. genuine concern for alignment - you don't want them to run off doing capability work. There is an almost-as-pernicious failure mode where people say they care about 'alignment' but don't really. Often this takes the form of alignment and safety becoming vague buzzwords that get co-opted for whatever their hobbyhorse was.
2. belief in 'theory' - they think alignment is a deep technical problem and believe that we will need scientific & conceptual progress. Experiments are important but pure empirics is not sufficient to guarantee safety. Many people conclude (perhaps rightly so!) that technical alignment is too difficult and governance is the answer.
3. swallowed the bitter lesson - unfortunately, there are still researchers who do not accept that LLMs are here. These are especially common, surprisingly perhaps, in AI and ML departments. Gary Marcus adherents in various guises. More generally, there is a failure mode of disinterest in deep learning practice.
"creating the right surface area for people wiht the relevanat expertise"
That seems right. Creating a peer network for more senior people coming into the field from other areas seems like it could be similarly impactful.
You don't convince academics with money. You convince them with ideas. Academics are mental specialists. They have honed very specific mental skills over many years. To convince them to work on something you have to convince them that 1. the problem is tractable, 2. it is fruitful & interesting, and most importantly 3. it is vulnerable to the specific methods that this academic researcher has in their toolkit.
Another idea that Matt suggested was a BlueDot-style 'Agent Foundations-in-the-broad-sense' course.
[Euclidean Geometry rant]
The impact of Euclidean Geometry on Western intellectual thought has been immense. But it is slightly surprising: Euclid's geometry has approximately no application. Here I mean Euclid's geometry as in the proof-based informal formal system of Euclidean geometry as put forward in Euclid's Elements.
It is quite interesting how the impact actually worked. Many thinkers cite Euclidean geometry as decisive for their thinking - Descartes, Newton, Benjamin Franklin, and Kant, to name just a few. I think the reason is that it formed the 'model organism' of what conceptual, theoretical progress could look like: the notion of proof (which is interestingly unique to the Western mathematical tradition, despite, e.g., the Kerala school in 15th-century India discovering Taylor series before Newton), the notion of true certainty, the notion of modelling and idealization, the idea of stacking many lemmas, etc.
I think this kind of 'successful conceptual/theoretical progress' is highly important in inspiring people, both historically and currently.
I think the purpose of such an AF course would be to show academic researchers that there is real intellectual substance to conceptual Alignment work
[at this point we ran out of our time box and decided to stop]