I recently left OpenAI to pursue independent research. I’m working on a number of different research directions, but the most fundamental is my pursuit of a scale-free theory of intelligent agency. In this post I give a rough sketch of how I’m thinking about that. I’m erring on the side...
I think that the concept of "alignment research" (and the distinction between that and "capabilities research") is currently a fairly confused one. In this post I’ll describe some of the problems with how people typically think about these terms, and offer replacement definitions. “Alignment” and “capabilities” are primarily properties of...
The coalitional frame
Earlier in this sequence I laid out an argument that the goals of increasingly intelligent AIs will become increasingly systematized, until they converge to squiggle-maximization. In my last post, though, I touched on two reasons why this convergence might not happen: humans trying to prevent it, and...
This post builds on my previous post making the case that squiggle-maximizers are plausible. The argument I presented was a deliberately simplified one, though, and glossed over several possible issues. In this post I'll raise and explore three broad objections. (Before looking at mine, I encourage you to think of...
This post is the version of Yudkowsky's argument for inner misalignment that I wish I'd had in my head a few years ago. I don't claim that it's novel, that I endorse it, or even that Yudkowsky would endorse it; it's primarily an attempt to map his ideas into an...
This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Thanks to Martín Soto, Jeremy Gillen, Daniel Kokotajlo, and Lukas Berglund for feedback.
Summary
Discussions around the likelihood and threat models of AI existential risk (x-risk) often hinge on some informal...
Many discussions of AI risk are unproductive or confused because it’s hard to pin down concepts like “coherence” and “expected utility maximization” in the context of deep learning. In this post I attempt to bridge this gap by describing a process by which AI values might become more coherent, which...