“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.” –attributed to DL Moody[1]
The main threat model I’m working to address is the same as it’s been since I was hobby-blogging about AGI safety in 2019. Basically, I think that:
I think that, when this learning algorithm is understood, it will be easy to get it to do powerful and impressive things, and to make money, as long as it’s weak enough that humans can keep it under control. But past that stage, we’ll be relying on the AGIs to have good motivations, and not be egregiously misaligned and scheming to take over the world and wipe out humanity. Alas, I claim that the latter kind of motivation is what we should expect to occur, in the absence of yet-to-be-invented techniques to avoid it.
Inventing those yet-to-be-invented techniques constitutes the technical alignment problem for brain-like AGI. That’s the main thing I’ve been working on since I’ve been in the field. See my Intro to Brain-Like-AGI Safety (2022).
I think of brain-like AGI as belonging to the broad algorithm class known as “RL agents”, and more specifically a (not-yet-invented) variation on actor-critic model-based RL. (See Valence series §1.2–§1.3.) In terms of the technical alignment problem, I claim that it has somewhat more in common with the “RL agents” that learned to play Atari and Go in the 2010s, than with the LLMs of the 2020s.
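For readers who want a concrete picture of that algorithm class: below is a minimal toy sketch of the generic actor-critic model-based RL pattern (a Dyna-style variant, with made-up names like ChainEnv and ToyAgent). It is purely illustrative, emphatically not the not-yet-invented brain-like variation I have in mind; it just shows the three moving parts: a learned world model, a critic that estimates value, and an actor trained from the critic's error signal, with the model additionally used for "imagined" offline updates.

```python
import random

class ChainEnv:
    """Toy environment: states 0..4 in a line; reward 1.0 whenever the agent is at state 4."""
    def __init__(self):
        self.state = 0
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):  # action is -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        return self.state, (1.0 if self.state == 4 else 0.0)

class ToyAgent:
    """Three learned pieces: world model, critic (value estimates), actor (preferences)."""
    def __init__(self, actions=(-1, +1), lr=0.1, gamma=0.9, eps=0.2):
        self.actions, self.lr, self.gamma, self.eps = actions, lr, gamma, eps
        self.model = {}   # world model: (state, action) -> (next_state, reward)
        self.value = {}   # critic: state -> estimated long-run value
        self.prefs = {}   # actor: (state, action) -> preference score

    def act(self, s):
        # Actor: mostly follow current preferences, occasionally explore.
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.prefs.get((s, a), 0.0))

    def learn(self, s, a, r, s2):
        # The critic's "surprise" (TD error) trains both the critic and the actor.
        td = r + self.gamma * self.value.get(s2, 0.0) - self.value.get(s, 0.0)
        self.value[s] = self.value.get(s, 0.0) + self.lr * td
        self.prefs[(s, a)] = self.prefs.get((s, a), 0.0) + self.lr * td
        self.model[(s, a)] = (s2, r)  # also update the world model

    def imagine(self, n=5):
        # Model-based piece (Dyna-style): replay imagined transitions from the
        # learned world model to keep training the critic and actor offline.
        for (s, a), (s2, r) in random.sample(list(self.model.items()),
                                             min(n, len(self.model))):
            self.learn(s, a, r, s2)

env, agent = ChainEnv(), ToyAgent()
for _ in range(300):
    s = env.reset()
    for _ in range(20):
        a = agent.act(s)
        s2, r = env.step(a)
        agent.learn(s, a, r, s2)
        agent.imagine()
        s = s2
print({s: round(v, 2) for s, v in sorted(agent.value.items())})  # values climb toward state 4
```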
More on my path-to-impact
(mostly copied from last year)
As 2025 began, I had just published Neuroscience of human social instincts: a sketch, which represented huge progress (years in the making) on my understanding of how prosocial human innate drives might work in the brain—drives like compassion and norm-following, which seem potentially alignment-relevant. Having done that, I sensed diminishing returns on puzzling over neuroscience. It was time to take the knowledge I already had, and apply it to the technical alignment problem directly!
That was my plan for 2025 (see last year’s review), and that’s what I did.
If my 2024 research felt like zipping down a street, then my 2025 research felt like laboriously wading through mud, waist deep.
It turns out that “figuring out things in neuroscience” (what I was mostly doing in 2024) is much easier and more immediately satisfying for me than “using that knowledge to try to solve the technical AGI alignment problem” (my activity in 2025).[2] But the technical AGI alignment problem is where the utils are. So, into the mud I went.
To keep my spirits up, throughout 2025, I kept up a list of things that I (think I) know now, that I didn’t know in 2024. And that list got pretty long!
So, here at the end of 2025, I am still in the waist-deep mud. But I’m in a noticeably different part of the mud from where I was 12 months ago. Yay!
Since at least 2022, I’ve had two general ideas in my head for what a solution to technical AGI alignment might look like:
My current take is that both of these types of plans have severe issues, which might or might not be solvable at all. Alas, I don’t have any better ideas. Guess I’ve still got my work cut out.
(I figured out partway through the year that, if you think about it right, these are not two wholly different types of plans, but more like two points on a spectrum—see discussion in Perils of under- vs over-sculpting AGI desires. But knowing that didn’t really help me. I kinda think the whole spectrum stinks.)
For both these types of plans, there are a bunch of things I’m still confused about (albeit fewer than a year ago!), and a bunch of design space that I have not yet explored. But in case anyone is interested, my (very provisional!) top concerns in brief are:
(Separately, in terms of AGI consciousness and “successor species” kinds of considerations, I think Plan Type 2 seems the better of the two (see e.g. here), but I’m not sure about that either.)
I’ll divide my activities into eight “thrusts”.
As I turned to the technical alignment problem, I had to deal with all the big-picture problems that it relates to. Am I worried about AGI that egregiously schemes about how to wipe out humanity, or about AGI that makes subtler philosophical mistakes and goes off the rails, or something else? What am I hoping that AGI developers do with their AGIs anyway—pivotal acts, obedience, alignment research, what? Why are so many alignment researchers so much more optimistic than me—for example, assigning probabilities as high as 50%, or even higher, to the proposition that humanity will survive superintelligence?[4]
After a couple years with my nose in the neuroscience textbooks, I had a backlog of these kinds of questions, and so I set about trying to better understand the cruxes of my disagreements with other alignment people.
The first output of this crux-mapping effort was a post on timelines & takeoff (Foom & Doom 1: “Brain in a box in a basement”), explaining why I expect very fast takeoffs and singleton ASI, for better or worse.
The second output was a post on the difficulty of technical alignment: Foom & Doom 2: Technical alignment is hard. The title says it all. A big focus of this post was why I am not relieved by the fact that we can make today’s LLMs often helpful and cooperative.
The third output was intended to be something about multipolar AGI scenarios, including offense-defense balance and so on.[5] I wound up with a bunch of notes but much lingering confusion. But by then I had written up “Foom & Doom 1” above, and was more strongly expecting a singleton ASI, and was figuring that the multipolar stuff was probably moot. So I decided to give up for now, and spend my time elsewhere. Sorry! But check out this comment about “conservation of wisdom” for a little glimpse of one of the things I was thinking about.
So that was my crux-mapping project.
One other little thing that fits into Thrust A was writing Reward Button Alignment, which (among other things) explains my skepticism that there will ever be immediate economic pressure for profit-maximizing companies to do the most important kind of AI alignment research.
This thrust is where I feel proudest of hard-won conceptual / “deconfusion” progress during 2025.
Taking things in reverse order, I ended the year on a call-to-action post:
and a companion post summarizing where I’m at:
The latter includes a glossary of many relevant terms and concepts, all of which I made up or started using in 2025, and which I now find indispensable for thinking about RL reward function design. Those terms and concepts were fleshed out over the course of 2025 via the following posts:
Also part of Thrust B is an important correction to my 2024 post [Valence series] 4. Valence & Liking / Admiring, and an important addition to my 2022 post [Intro to brain-like-AGI safety] 10. The alignment problem.
Another 2025 Thrust B post—albeit more about outreach / pedagogy than about new progress—is the beginner-friendly “The Era of Experience” has an unsolved technical alignment problem, which responds to a book chapter by Rich Sutton & David Silver.
Overall, as mentioned in Reward Function Design: a starter pack, I feel like a blind man who has now poked many different parts of the elephant, and maybe the gestalt whole is starting to come together, especially when combined with the next thrust:
I wrote three posts that follow up from last year’s Neuroscience of human social instincts: a sketch (the post I mentioned in §2 above):
I think of these as being in increasing order of importance. In particular, Social Drives 1 & 2, along with 6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa, are part of my attempt to think through what would happen if we made an AGI with something like human social instincts.
…And what would happen if we made an AGI with something like human social instincts? Sorry, I don’t know yet! That’s the next step. I’ve been laying all these foundations, and hopefully I’m finally equipped to attack that question head-on in 2026.
My Plan for mediocre alignment of brain-like [model-based RL] AGI (“Plan Type 1” from §3 above) involves planting desires that we like into the AI at time t, and then the AI thinks and learns and gets new options (including concept extrapolation), and we hope that the AI will still have desires we like at some much later time. Will that work? If not, what can we do about it?
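To make that worry concrete, here is a tiny made-up illustration (the concept names and both “inheritance rules” are hypothetical, invented just for this example): we plant desires over the AI’s concepts at time t, the world model later refines those concepts, and two equally natural-looking rules for extending the planted desires to the new ontology give different answers.

```python
# Toy illustration of "planted desires" meeting concept refinement.
# All concept names and both inheritance rules are made up for this example.

desires_at_t = {"humans_thriving": +1.0, "deception": -1.0}

# Later, the AI's world model splits each old concept into finer sub-concepts:
refinement = {
    "humans_thriving": ["biological_humans_thriving", "uploaded_minds_thriving"],
    "deception": ["outright_lying", "strategic_omission"],
}

def extrapolate(desires, refinement, inherit):
    """Extend the old desires to the refined ontology under a given inheritance rule."""
    return {sub: inherit(concept, sub, value)
            for concept, value in desires.items()
            for sub in refinement.get(concept, [concept])}

# Rule 1: every refinement of a valued concept inherits the full value.
inherit_broadly = lambda concept, sub, value: value

# Rule 2: only the "prototypical" refinement inherits; edge cases drop to zero.
prototype = {"humans_thriving": "biological_humans_thriving", "deception": "outright_lying"}
inherit_narrowly = lambda concept, sub, value: value if sub == prototype[concept] else 0.0

print(extrapolate(desires_at_t, refinement, inherit_broadly))
print(extrapolate(desires_at_t, refinement, inherit_narrowly))
# Both rules agree on the old ontology, but diverge on the new one; whether the
# AI "still has desires we like" at the later time depends on which extrapolation
# its learning algorithm implicitly performs.
```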
I discussed this issue in two 2025 posts:
Those were meaningful progress, but alas, I still feel quite confused about this topic. I’m not even sure that I stand by everything I wrote in the latter.
I think I have some paths forward, including by dropping all the way down to meta-ethical fundamentals. But I still have my work cut out, and hope to keep puzzling over it in 2026.
While neuroscience was not a focus of mine for 2025, I still opportunistically dabbled! In particular:
Many (not all!) economists are dismissive of the idea that superintelligence is possible, or that it would have a big impact even if it appeared tomorrow. I think they’re wrong, and for some strange reason I sometimes feel like arguing with them.
This inclination led to a post Applying traditional economic thinking to AGI: a trilemma, which I later refreshed and expanded into the more click-baity Four ways learning Econ makes people dumber re: future AI.
Relatedly, I had unusually lengthy Twitter arguments about the economics of superintelligence with Matt Clancy (in response to his 80,000 Hours podcast), and with Konrad Kording (in response to his paper with Ioana Marinescu: (Artificial) Intelligence saturation and the future of work). I don’t seem to have moved either of them. Oh well; I appreciate their time in any case.
Boy, the hope of Safe & Beneficial AGI would feel a lot more promising if there were more people who understood the risks and took them seriously. So I have always spent a bit of my time and energy on outreach, when I find myself in a good position to do so. Some 2025 highlights include:
Non-work-related blog posts: Just for fun, I wrote two in 2025:
Personal productivity and workflow: I’ve gone through a lot of systems over the years, but seem to have settled into a local optimum that works for me (an idiosyncratic system of to-do lists, wall calendars, time-tracking software, checklists, accountability buddy, etc.). The only noticeable changes during 2025 are: (1) I (re)invented “self-dialogues” as a surprisingly useful way to think through things and make progress when I’m stuck; and (2) I’m using LLMs a bit more, although still probably much less than most people in my field.[6]
My plan for 2026 is the same as my plan for 2025—solve (or make progress towards solving) the technical alignment problem for brain-like AGI, or prove (or make progress towards proving) that no solution exists. This would include pseudocode for reward functions and training environments, and sorting out which aspects can be tested and de-risked in advance and how, and making a plan that fits into a sensible and ethical larger strategic picture, and so on.
The immediate next steps are the two I mentioned under Thrusts C & D above.
In addition to that main project, I plan to continue opportunistically dabbling in neuroscience, outreach, other aspects of AGI safety, and so on, into 2026, as I have in the past.
If someone thinks that I should be spending my time differently in 2026, please reach out and make your case!
Thanks Jed McCaleb & Astera Institute for generously supporting my research since August 2022!
Thanks to all the people who comment on my posts before or after publication, or share ideas and feedback with me through email or other channels, and especially those who patiently stick it out with me through long back-and-forths to hash out disagreements and confusions. I’ve learned so much that way![7]
Thanks Lightcone Infrastructure (don’t forget to donate!) for maintaining and continuously improving this site, which has always been an essential part of my workflow. Thanks to everyone else fighting for Safe and Beneficial AGI, and thanks to my family, and thanks to you all for reading! Happy Holidays!
It’s actually even worse than that—practically everyone working towards Safe & Beneficial AGI thinks that many (or even most) other people working towards Safe & Beneficial AGI are pursuing projects that, if successful, would be not just pointless but in fact making the situation worse. I indeed think that about many other people in the field, and they probably think that about me. …Of course, our positions are not at all parallel, because I’m correct whereas they are incorrect … … (nervous laughter)
Reasons why I seem to find neuroscience research much easier than alignment research:
…And so on.
To put it more starkly: if it’s possible for humans to develop a system that will ultimately lead to a good future, then it’s presumably also possible for sufficiently human-like AGIs to develop such a system. Or if it’s not possible for humans to develop such a system, then we’re screwed no matter what.
My p(doom) = “y’know those movies where you’re 75% into it, and the hero has lost all his superpowers and been teleported to another dimension with no way of getting back, and the villain is more powerful than ever, and meanwhile the audience is shaking their head and saying to themselves, ‘jeez, how on Earth are the script-writers gonna resolve this one??’ …That’s my p(doom).”
…also including deciding whether or not to change anything in my old post What does it take to defend the world against out-of-control AGIs? (2022), which I’ve gone back and forth on many times.
The biggest change between 2024 and 2025 is that I’m now almost always showing my complete drafts to LLMs before publishing, and asking them to find typos, unexplained jargon, things to cut, mistakes, etc. Their advice is still usually bad, but they get some hits. I also use them for coding (e.g. one-off shell scripts), reverse lookup (when I don’t know the name of something), and similar, but not as a thought partner for novel research.
A couple 2025 examples that spring to mind: my coworker Seth Herd (among many other things) caught a critical mistake in an early draft version of Social Drives 1 & 2 that led to a complete rethink and rewrite; and Sharmake Farah’s patient skeptical questioning in comment threads led directly to my writing Self-dialogue & Reward button alignment. Plus countless ideas and pushback from my awesome collaborators and test-readers, like Adam Marblestone, Jeremy Gillen, Charlie Steiner, Justis Mills, Linda Linsefors, Simon Skade and many others. Many thanks to them and everyone else! :)