Lawrence, Erik, and Leon attempt to summarize the key claims of John Wentworth's natural abstractions agenda, formalize some of the mathematical proofs, outline how it aims to help with AI alignment, and critique gaps in the theory, relevance to alignment, and research methodology.
tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda's subproblems and my sketches of how to tackle them.
Back at the end of 2023, I wrote the following:
I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)
On the inside view, I'm pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a...
Which, importantly, includes every fruit of our science and technology.
I don't think this is the right comparison, since modern science / technology is a collective effort and so can only accumulate progress through mostly-interpretable steps. (This may also be true for AI, but if so then you get interpretability by default, at least interpretability-to-the-AIs, at which point you are very likely better off trying to build AIs that can explain that to humans.)
In contrast, I'd expect individual steps of scientific progress that happen within a single mind oft...
(Part 4 of the CAST sequence)
This document is an in-depth review of the primary documents discussing corrigibility that I’m aware of. In particular, I'll be focusing on the writing of Eliezer Yudkowsky and Paul Christiano, though I’ll also spend some time at the end briefly discussing other sources. As I go through the writing of those who’ve come before, I want to specifically compare and contrast those ideas with the conceptualization of corrigibility put forth in earlier documents and the strategy proposed in The CAST Strategy. At a high level I mostly agree with Christiano, except that he seems to think we’ll get corrigibility emergently, whereas I think it’s vital that we focus on directly training purely corrigible agents (and he wants to focus on recursive architectures...
I appreciated the detailed discussion and literature review here -- thanks.
(This is the video and transcript of a talk I gave at the UT Austin AI and Human Objectives Initiative in September 2025. The slides are also available here. The main content of the talk is based on this recent essay.)
Hi, everyone. Thank you for coming. I'm honored to be part of this series and part of the beginning of this series.
I'm going to briefly introduce the core AI alignment problem as I see it. It's going to be a particular version of that problem, the version that I think is highest stakes. And then I'm going to talk about my current high-level picture of how that problem gets solved at a technical level. There's a bunch of aspects of this problem that aren't the technical level that...
Hi Steve -- thanks for this comment. I can see how the vibe of the talk/piece might call to mind something like "studying/intervening on an existing AI system" rather than focusing on how it's trained/constructed, but I do mean for the techniques I discuss to cover both. For example, and re: your Bob example, I talk about our existing knowledge of human behavior as an example of behavioral science here -- and I talk a lot about studying training as a part of behavioral science, e.g.:
...Let’s call an AI’s full range of behavior across all safe and accessibl
Very exciting! Influence functions are one of the best approaches for understanding generalization systematically. Always excited to see improvements to influence function methodology.
Have you seen A Master-Slave Model of Human Preferences? To summarize, I think every human is trying to optimize for status, consciously or subconsciously, including those who otherwise fit your description of the idealized platonic researcher. For example, I'm someone who has (apparently) "chosen ultimate (intellectual) freedom over all else", having done all of my research outside of academia or any formal organizations, but on reflection I think I was striving for status (prestige) as much as anyone; it was just that my subconscious picked a different strat...