I don't think these are crazy or bad ideas at all—I'd be happy to steelman them with you at some point if you want. Certainly, we don't know how to make any of them work right now, but I think they are all reasonable directions to go down if one wants to work on the various open problems related to them. The problem—and this is what I would say to somebody if they came to me with these ideas—is that they're not so much “ideas for how to solve alignment” as “entire research topics unto themselves.”
It is not clear to me to what extent this was part of the "training shoulder advisors" exercise, but to me, possibly the most important part of it is to keep the advisors at a distance from your own thinking. In particular, my impression is that alignment research has on average been harmed by too many people "training their shoulder Eliezers" and having those shoulder advisors push them to think in a crude version of Eliezer's ontology.
I chose the "train a shoulder advisor" framing specifically to keep my/Eliezer's models separate from the participants' own models. And I do think this worked pretty well - I've had multiple conversations with a participant where they say something, I disagree with it, and then they say "yup, that's what my John model said" - implying that they did in fact disagree with their John model. (That's not quite direct evidence of maintaining a separate ontology, but it's adjacent.)
I would love to see you say why you consider these bad ideas. Obviously such AIs could be unaligned themselves—or is it more along the lines of these assistants needing a complete model of human values to be truly useful?
John's Why Not Just... sequence is a series of somewhat rough takes on a few of them (though I think many of them are not written up super comprehensively).
This is true in every field, and is apparently very difficult to systematize. Perhaps it's a highly unstable social state to have people changing directions, or thinking and speaking super honestly, very often.
How could one succeed where so few have?
Epistemic status: lots of highly subjective and tentative personal impressions.
Occasionally people say “hey, alignment research has lots of money behind it now, why not fund basically everyone who wants to try it?”. Often this involves an analogy to venture capital: alignment funding is hits-based (i.e. the best few people are much more productive than everyone else combined), funders aren’t actually that good at distinguishing the future hits, so what we want is a whole bunch of uncorrelated bets.
The main place where this fails, in practice, is the “uncorrelated” part. It turns out that most newcomers to alignment have the same few Bad Ideas.
The most common, these days, is some variant of “train an AI to help with aligning AI”. Sometimes it’s “train an AI to interpret the internals of another AI”, sometimes it’s “train an AI to point out problems in another AI’s plan”, sometimes it’s “train an AI to help you design aligned AI”, etc. I would guess about 75% of newcomers from ML suggest some such variant as their first idea.
People who are less aware of standard alignment arguments tend to start with “train an AI on human feedback” or “iterate until the problems go away”. In the old days, pre-sequences, people started from even worse ideas; at least the waterline has risen somewhat.
People with more of a theory bent or an old-school AI background tend to reinvent IRL or CIRL variants. (A CIRL variant was my own starting Bad Idea - you can read about it in this post from 2020, although the notes from which that post was written were from about 2016-2017.)
My impression (based on very limited data) is that it takes most newcomers ~5 years to go from their initial Bad Idea to actually working on something plausibly useful. For lack of a better name, let’s call that process the Path of Alignment Maturity.
My impression is that progress along the Path of Alignment Maturity can be accelerated dramatically by actively looking for problems with your own plans - e.g. the builder/breaker framework from the Eliciting Latent Knowledge doc, or some version of the Alignment Game Tree exercise, or having a group of people who argue and poke holes in each others’ plans. (Of course these all first require not being too emotionally attached to your own plan; it helps a lot if you can come up with a second or third line of attack, thereby building confidence that there’s something else to move on to.) It can also be accelerated by starting with some background knowledge of difficult problems adjacent to alignment/agency - I notice philosophers tend to make unusually fast progress down the Path that way, and I think prior experience with adjacent problems also cut about 3-4 years off the Path for me. (To be clear, I don’t necessarily recommend that as a strategy for a newcomer - I spent ~5 years working on agency-adjacent problems before working on alignment, and that only cut ~3-4 years off my Path of Alignment Maturity. That wasn’t the only alignment-related value I gained from my background knowledge, but the faster progress down the Path was not worthwhile on its own.) General background experience/knowledge about the world also helps a lot - e.g. I expect someone who's founded and worked at a few startups will make faster progress than someone who’s only worked at one big company, and either of those will make faster progress than someone who’s never been outside of academia.
On the flip side, I expect that progress down the Path of Alignment Maturity is slower for people who spend their time heads-down in the technical details of a particular approach, and spend less time reflecting on whether it’s the right approach at all or arguing with people who have very different models. I’d guess that this is especially a problem for people at orgs with alignment work focused on specific agendas - e.g. I’d guess progress down the Path is slower at Redwood or OpenAI, but faster at Conjecture or Deepmind (because those orgs have a relatively high variety of alignment models internally, as I understand it).
I think accelerating newcomers’ progress down the Path of Alignment Maturity is one of the most tractable places where community builders and training programs can add a lot of value. I’ve been training about a dozen people through the MATS program this summer, and I currently think accelerating participants’ progress down the Path has been the biggest success. We had a lot of content aimed at that: the Alignment Game Tree, two days of the “train a shoulder John” exercise plus a third day of the same exercise with Eliezer, the less formal process of people organized into teams kicking ideas around and arguing with each other, and of course general encouragement to pivot to new problems and strategies (which most people did multiple times). Overall, my very tentative and subjective impression is that the program shaved ~3 years off the median participant’s Path of Alignment Maturity; they seem-to-me to be coming up with project ideas about on par with a typical person 3 years further in. The shoulder John/Eliezer exercises were relatively costly and I don’t think most groups should try to duplicate them, but other than those I expect most of the MATS content can scale quite well, so in principle it should be possible to do this with a lot more people.