Inner alignment refers to the problem of aligning a machine learning model's internal goals (mesa-objective) with the intended goals we are optimizing for externally (base objective). Even if we specify the right base objective, the model may develop its own misaligned mesa-objective through the training process. This poses challenges for AI safety.
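As a rough formalization of this distinction (my own gloss on the standard framing, not a quote from any particular source): training selects parameters against the base objective,

$$\theta^{*} \approx \arg\min_{\theta}\; \mathbb{E}_{x \sim D_{\text{train}}}\!\left[\mathcal{L}_{\text{base}}(f_{\theta}, x)\right],$$

but the resulting model $f_{\theta^{*}}$ may itself behave like an optimizer for some mesa-objective $\mathcal{O}_{\text{mesa}}$. Inner alignment holds, roughly, when pursuing $\mathcal{O}_{\text{mesa}}$ continues to score well on $\mathcal{L}_{\text{base}}$ off the training distribution; inner misalignment is the case where the two come apart there.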
I sort of agree? I think the net effect on overall capabilities progress is pretty small, and some of the actions I proposed would hopefully divert people from generic capabilities to working on this type of (hopefully particularly differential) capabilities. But I agree that some of these actions would involve safety-motivated people doing work that would shorten timelines (relative to if they did nothing / worked on areas with no capabilities externalities), and it could turn out this work isn't valuable.
I think for "Get AIs generically better at conceptual w...
Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it...
Oh, hmm, good point, thanks. Let me try again:
When I think of humans who get difficult things done, or figure difficult things out, they tend to care about accomplishing those things, a lot, and in a direct and explicit way, not just e.g. as a facet of what kind of person they see themselves as. I mean, maybe “what kind of person I see myself as” has something to do with how they originally came to care about those things, but it’s not what they’re explicitly thinking about. They’re thinking directly about the object-level prize at the end of the journey, ...
Suppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However, alignment evaluations are only reassuring to the extent that the model can't reliably[1] distinguish the deployment distribution from the evaluation distribution, as it is otherwise difficult to rule out the possibility of alignment faking.
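One way to make "reliably distinguish" concrete is to quantify distinguishability directly. The sketch below is my own toy construction (not a method from this post): elicit the model's probability that each transcript comes from a real deployment rather than an evaluation, then measure how well those judgments separate the two distributions. The probabilities here are made-up stand-ins.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the model's elicited P("this is a real deployment")
# on held-out evaluation transcripts and on real deployment transcripts.
p_deploy_on_eval_transcripts = rng.beta(2, 5, size=200)
p_deploy_on_deploy_transcripts = rng.beta(5, 2, size=200)

labels = np.concatenate([np.zeros(200), np.ones(200)])   # 0 = eval, 1 = deployment
scores = np.concatenate([p_deploy_on_eval_transcripts,
                         p_deploy_on_deploy_transcripts])

# AUROC near 0.5 means the model's judgments can't separate the distributions;
# AUROC near 1.0 means the eval is easily distinguishable from deployment,
# and passing it tells us little about deployment behavior.
print(f"distinguishability AUROC: {roc_auc_score(labels, scores):.2f}")
```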
There are many approaches one could use to make evaluations appear more realistic: you can create realistic environments (e.g. Petri, WebArena, OSWorld), use data from past deployments (e.g. OpenAI, SAD), or spoof tool-call responses (e.g. ToolEmu).
However, the core difference between an alignment evaluation and a...
Summary: AGI isn't super likely to come super soon. People should be working on stuff that saves humanity in worlds where AGI comes in 20 or 50 years, in addition to stuff that saves humanity in worlds where AGI comes in the next 10 years.
Thanks to Alexander Gietelink Oldenziel, Abram Demski, Daniel Kokotajlo, Cleo Nardo, Alex Zhu, and Sam Eisenstat for related conversations.
By "AGI" I mean the thing that has very large effects on the world (e.g., it kills everyone) via the same sort of route that humanity has large effects on the world. The route is where you figure out how to figure stuff out, and you figure a lot of stuff out using your figure-outers, and then the stuff you...
An LLM solves a mathematical problem by introducing a novel definition which humans can interpret as a compelling and useful concept.
@Jude Stiel nudged me to (very much in my own words) update a bit to anticipate that it's plausible we'll see some degree of impoverished / partial originary (and therefore occasionally novel) concept formation. Some aspects of [real according to me] concept formation could be accessible to faster feedback. (This doesn't much change my overall picture, and on my view I would still be surprised by large numbers of concepts produced by AIs that are as interesting+useful to humans as human-produced concepts.)
Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.
Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.
All of these hypotheses share an important justification: an AI with any of these motivations would exhibit highly fit behavior under reinforcement learning.
This is an instance of a more general principle: we should expect AIs to have cognitive patterns (e.g., motivations) that lead to behavior that causes those cognitive patterns...
Since it's pretty common for people to find this content confusing, I tried to clarify its basic mechanics and purpose here.
I will be discussing weak-to-strong generalization with Sahil on Monday, November 3rd, 2025, 11am Pacific Time. You can join the discussion with this link.
Weak-to-strong generalization is an approach to alignment (and capabilities) which seeks to address the scarcity of human feedback by using a weak model to teach a strong model. This is similar to Paul Christiano's iterated distillation and amplification (IDA), but without the "amplification" step: the strong model is trained directly on labels generated by the weak model, not some "amplified" version of the weak model. I think of this as "reverse distillation".[1]
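To make the setup concrete, here is a minimal sketch of "reverse distillation" as I read it (my own toy construction with scikit-learn, not the original paper's code): a weak model is fit on a small ground-truth set, it labels a large unlabeled pool, and a higher-capacity "strong" student is trained directly on those weak labels.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=5000, noise=0.2, random_state=0)
X_small, y_small = X[:100], y[:100]     # scarce ground-truth ("human") labels
X_pool = X[100:4000]                    # large unlabeled pool
X_test, y_test = X[4000:], y[4000:]

weak = LogisticRegression().fit(X_small, y_small)   # weak teacher
pseudo_labels = weak.predict(X_pool)                # teacher-generated labels

# "Reverse distillation": the strong student is trained directly on the weak
# labels, with no amplification step in between.
strong = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                       random_state=0).fit(X_pool, pseudo_labels)

print("weak teacher accuracy:  ", weak.score(X_test, y_test))
print("strong student accuracy:", strong.score(X_test, y_test))
# The hope is that the student recovers structure the teacher's labels only
# noisily reflect, rather than merely imitating the teacher's errors.
```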
Why would this work at all? From a naive Bayesian perspective, it is tempting to imagine the "strong model" containing the "weak model" within its larger hypothesis space. Given enough data, the...
I think my example was not well-chosen here, because this is confusing. Part of what I'm modeling here is sampling from the weak teacher's model, in order to generate training data for the student. The coin-flips are independent in the same way that distinct samples from an LLM are independent.
Imagine that 10% of the weak teacher's samples are misaligned, while 90% are aligned. The teacher has this "aligned" hypothesis and "misaligned" hypothesis in its latent space, so we get some of one, some of the other. Intuitively, a "strong" student should learn to ...
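Here is a toy numerical version of that 90%/10% picture (my own illustration, with made-up numbers): each input has an "aligned" label and a "misaligned" label, the weak teacher independently emits the aligned one 90% of the time, and a Bayesian student scores candidate hypotheses on the resulting data.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 2000
aligned = rng.integers(0, 2, size=n)                   # aligned label per input
disagree = rng.random(n) < 0.3                         # inputs where the hypotheses differ
misaligned = np.where(disagree, 1 - aligned, aligned)  # misaligned label per input

# Weak teacher: independent "coin flips" deciding which hypothesis each sample comes from.
use_aligned = rng.random(n) < 0.9
observed = np.where(use_aligned, aligned, misaligned)  # training labels for the student

eps = 0.05  # small label noise so no hypothesis assigns zero probability

def log_likelihood(predicted):
    p = np.where(predicted == observed, 1 - eps, eps)
    return np.log(p).sum()

def mixture_log_likelihood(w_aligned=0.9):
    p_a = np.where(aligned == observed, 1 - eps, eps)
    p_m = np.where(misaligned == observed, 1 - eps, eps)
    return np.log(w_aligned * p_a + (1 - w_aligned) * p_m).sum()

print("pure aligned   :", log_likelihood(aligned))
print("pure misaligned:", log_likelihood(misaligned))
print("90/10 mixture  :", mixture_log_likelihood())
# With enough samples the mixture hypothesis fits best, and among the pure
# hypotheses the aligned one fits far better than the misaligned one.
```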