Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it...
Summary: AGI isn't super likely to come super soon. People should be working on stuff that saves humanity in worlds where AGI comes in 20 or 50 years, in addition to stuff that saves humanity in worlds where AGI comes in the next 10 years.
Thanks to Alexander Gietelink Oldenziel, Abram Demski, Daniel Kokotajlo, Cleo Nardo, Alex Zhu, and Sam Eisenstat for related conversations.
By "AGI" I mean the thing that has very large effects on the world (e.g., it kills everyone) via the same sort of route that humanity has large effects on the world. The route is where you figure out how to figure stuff out, and you figure a lot of stuff out using your figure-outers, and then the stuff you...
An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept.
@Jude Stiel nudged me to update a bit (what follows is very much in my own words): I now anticipate that it's plausible we'll see some degree of impoverished / partial originary (and therefore occasionally novel) concept formation. Some aspects of [real, according to me] concept formation could be accessible to faster feedback. (This doesn't much change my overall picture, and I would still be surprised by large numbers of AI-produced concepts that are as interesting and useful to humans as human-produced concepts.)
Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.
Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.
All of these hypotheses share an important justification: an AI with any of these motivations would produce behavior that is highly fit under reinforcement learning.
This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns...
Since it's pretty common for people to find this content confusing, I tried to clarify its basic mechanics and purpose here.
I will be discussing weak-to-strong generalization with Sahil on Monday, November 3rd, 2025, 11am Pacific Daylight Time. You can join the discussion with this link.
Weak-to-strong generalization is an approach to alignment (and capabilities) which seeks to address the scarcity of human feedback by using a weak model to teach a strong model. This is similar to Paul Christiano's iterated distillation and amplification (IDA), but without the "amplification" step: the strong model is trained directly on labels generated by the weak model, not some "amplified" version of the weak model. I think of this as "reverse distillation".[1]
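As a concrete (if toy) illustration of this reverse-distillation pipeline, here is a minimal numpy sketch of my own: a weak linear teacher with an impoverished feature view labels an unlabeled pool, and a strong student with the full feature view is trained directly on those labels. This illustrates only the training setup, not the generalization effect itself; the logistic-regression models and all the particulars are assumptions for the sake of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_logreg(X, y, lr=0.5, steps=500):
    """Plain gradient-descent logistic regression (no regularization)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        z = np.clip(X @ w, -30, 30)          # avoid exp overflow
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def predict(w, X):
    return (X @ w > 0).astype(float)

# Ground truth: a linear rule over all d features.
d = 10
w_true = rng.normal(size=d)
X = rng.normal(size=(5000, d))
y = (X @ w_true > 0).astype(float)

# Weak teacher: sees only the first 3 features, trained on a small labeled set.
X_weak = X[:, :3]
w_weak = train_logreg(X_weak[:500], y[:500])

# "Reverse distillation": the strong student (full feature view) is trained
# directly on the weak teacher's labels over the unlabeled pool -- no
# amplification step in between.
weak_labels = predict(w_weak, X_weak[500:])
w_strong = train_logreg(X[500:], weak_labels)

# Compare both to ground truth on held-out data.
X_test = rng.normal(size=(2000, d))
y_test = (X_test @ w_true > 0).astype(float)
acc_weak = (predict(w_weak, X_test[:, :3]) == y_test).mean()
acc_strong = (predict(w_strong, X_test) == y_test).mean()
print(f"weak teacher accuracy:   {acc_weak:.2f}")
print(f"strong student accuracy: {acc_strong:.2f}")
```

In this linear toy the student mostly just imitates the teacher; the interesting empirical question is whether a pretrained strong model's inductive biases let it do better than that.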
Why would this work at all? From a naive Bayesian perspective, it is tempting to imagine the "strong model" containing the "weak model" within its larger hypothesis space. Given enough data, the...
I think my example was not well-chosen here, because this is confusing. Part of what I'm modeling here is sampling from the weak teacher's model, in order to generate training data for the student. The coin-flips are independent in the same way that distinct samples from an LLM are independent.
Imagine that 10% of the weak teacher's samples are misaligned, while 90% are aligned. The teacher has this "aligned" hypothesis and "misaligned" hypothesis in its latent space, so we get some of one, some of the other. Intuitively, a "strong" student should learn to ...
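To make the 90/10 intuition concrete, here is a toy calculation of my own (the candidate hypotheses and their probabilities are made up for illustration): a student that simply maximizes likelihood over a few candidate hypotheses prefers the one that reproduces the teacher's mixture, misaligned samples included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated weak-teacher samples: 1 = aligned, 0 = misaligned, at a 90/10 rate.
samples = (rng.random(1000) < 0.9).astype(int)
n1 = samples.sum()
n0 = len(samples) - n1

def log_lik(p):
    """Log-likelihood of the samples under 'each draw is aligned w.p. p'."""
    return n1 * np.log(p) + n0 * np.log(1 - p)

# Three candidate hypotheses the student might latch onto.
hypotheses = {
    "purely aligned": 0.999,
    "purely misaligned": 0.001,
    "teacher's 90/10 mixture": 0.9,
}
best = max(hypotheses, key=lambda name: log_lik(hypotheses[name]))
print(best)  # -> teacher's 90/10 mixture
```

The mixture hypothesis wins by a wide likelihood margin: faithfully fitting the teacher's samples means faithfully reproducing the misaligned 10% too.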
This is a brief elaboration on The behavioral selection model for predicting AI motivations, based on some feedback and thoughts I’ve had since publishing. Written quickly in a personal capacity.
The main focus of this post is clarifying the basic machinery of the behavioral selection model, and conveying why it matters to disambiguate between different “motivations” for AI behavior. Very similar or identical behavior in training can correspond to radically different outcomes in deployment based on what motivated it.
I’ll preface by saying: I think the behavioral selection model is quite predictive and useful to understand, especially in the short-medium term. But it leaves out some really important dynamics for predicting AI motivations, and I wish I had clarified this more in the original post. Most importantly (as Habryka...
A recent short story by Gabriel Mukobi illustrates a near-term scenario where things go bad because new developments in LLMs allow LLMs to accelerate capabilities research without a correspondingly large acceleration in safety research.
This scenario is disturbingly close to the situation we already find ourselves in. Asking the best LLMs for help with programming vs technical alignment research feels very different (at least to me). LLMs might generate junk code, but I can keep pointing out the problems with the code, and the code will eventually work. This can be faster than doing it myself in cases where I don't know a language or library well; the LLMs are moderately familiar with everything.
When I try to talk to LLMs about technical AI safety work, however, I just...
I'm curious to hear any updates you have on this thinking a couple years later.
How is it that we solve engineering problems? What is the nature of the design process that humans follow when building an air conditioner or computer program? How does this differ from the search processes present in machine learning and evolution? This essay studies search and design as distinct approaches to engineering, arguing that establishing trust in an artifact is tied to understanding how that artifact works, and that a central difference between search and design is the comprehensibility of the artifacts produced.
What...