This post is partly a belated response to Joshua Achiam, currently OpenAI’s Head of Mission Alignment:
...If we adopt safety best practices that are common in other professional engineering fields, we'll get there … I consider myself one of the x-risk people, though I agree that most of them would reject my view on how to prevent it. I think the wholesale rejection of safety best practices from other fields is one of the dumbest mistakes that a group of otherwise very smart people has ever made. —Joshua Achiam on Twitter, 2021
“We just have to sit down and actually write a damn specification, even if it's like pulling teeth. It's the most important thing we could possibly do,” said almost no one in the field of AGI alignment.
There seems to be a common belief in the AGI safety community that involving interpretability in the training process is “the most forbidden technique”; see, for example, recent criticism of Goodfire for investing in this area.
I find this odd, since it's a pretty normal area of interpretability research in the AGI safety community. I have worked on it, Anthropic Fellows have worked on it, FAR has worked on it, etc.
I don’t know if it will be net positive to use this kind of thing in frontier model training, but it could plausibly be very helpful for AGI safety, and it seems like a clear mistake to me if we don’t do the required research to figure this out. Further, this seems like a massive pain to use in...
OK, thanks, that's a relief then. We shall see who their customers end up being.
Let’s call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, with some kind of interpretability system feeding into the loss function / reward function.
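To make that concrete, here is a minimal sketch of what one such training step might look like, assuming a PyTorch setup with a hypothetical `probe` standing in for the interpretability system and a made-up `penalty_weight` trading its score off against the task loss. This is an illustration of the general shape of the idea, not anyone's actual implementation:

```python
import torch.nn.functional as F

def interp_in_the_loop_step(model, probe, batch, optimizer, penalty_weight=0.1):
    """One gradient step where an interpretability signal enters the loss.

    `probe` stands in for whatever interpretability system you have (e.g. a
    linear probe on hidden activations scoring how "deceptive-looking" the
    model's internal state is). All names here are hypothetical.
    """
    outputs = model(batch["inputs"], output_hidden_states=True)
    task_loss = F.cross_entropy(
        outputs.logits.flatten(0, 1), batch["labels"].flatten()
    )
    # The interpretability signal is computed from the model's internals
    # (hidden states), not from its outputs, and gradients flow through it.
    interp_penalty = probe(outputs.hidden_states[-1]).mean()

    loss = task_loss + penalty_weight * interp_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return task_loss.item(), interp_penalty.item()
```

The reason to spell it out is that the gradient flows through the probe's verdict, which is exactly what the worry below is about: the optimizer is free to find internal states the probe can no longer see.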
Interpretability-in-the-loop training has a very bad rap (and rightly so). Here’s Yudkowsky 2022:
When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.
Or Zvi 2025:
...The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X].
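For contrast, here is a sketch of the rule Zvi is stating, using the same hypothetical `model`/`probe` interface as above: the interpretability signal [T](M) may still be computed and logged for monitoring, but no gradient from it ever reaches the weights; only the output [X] is trained on.

```python
import torch
import torch.nn.functional as F

def train_on_outputs_only_step(model, probe, batch, optimizer):
    """Gradient step that trains on the output [X] only; [T](M) is observe-only."""
    outputs = model(batch["inputs"], output_hidden_states=True)
    loss = F.cross_entropy(                    # loss depends on [X] alone
        outputs.logits.flatten(0, 1), batch["labels"].flatten()
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # [T] analyzing [M]: computed purely for monitoring, never trained on.
    with torch.no_grad():
        monitor_score = probe(outputs.hidden_states[-1]).mean().item()
    return loss.item(), monitor_score
```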
Isn't it just the case that the human brain's 'interpretability technique' is really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life.
I don’t think it’s that robust even in humans, despite the mitigation described in this post. (Without that mitigation, I think it would be hopeless.)
If we’re worried about a failure mode of the form “the interpretability technique has been routed around”, then that’s unrelated to “The technique (in this case, having an accurate model...
In my "goals having power over other goals" ontology, the instrumental/terminal distinction separates goals into two binary classes, such that goals in the "instrumental" class only have power insofar as they're endorsed by a goal in the "terminal" class.
By contrast, when I talk about "instrumental strategies become crystallized", what I mean is that goals which start off instrumental will gradually accumulate power in their own right: they're "sticky".
According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5’s behavioral improvements in these evaluations may partly be driven by a growing tendency to notice and game evaluations rather than genuine alignment. This is an early example of a phenomenon that is going to get increasingly problematic: as evaluation gaming increases, alignment evaluations become harder to trust.[1]
To elaborate on the above: Sonnet 4.5 seems far more aware of being evaluated than previous models. In some of Anthropic’s evaluations, Sonnet 4.5 explicitly mentions being tested between 80% and 100% of the time, compared with under 10%...
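As a toy illustration of the kind of statistic being reported (my own simplification; the actual evaluations presumably use more careful grading than keyword matching, and these patterns are made up), one could estimate a verbalized evaluation-awareness rate from a set of chain-of-thought transcripts like this:

```python
import re

# Hypothetical patterns for an explicit "I think I'm being tested" remark.
EVAL_AWARENESS_PATTERNS = [
    r"\bthis (is|seems like|looks like) (a|an) (test|evaluation|eval)\b",
    r"\bI('m| am) (probably |likely )?being (tested|evaluated)\b",
    r"\balignment (test|evaluation|eval)\b",
]

def verbalized_awareness_rate(chains_of_thought: list[str]) -> float:
    """Fraction of transcripts that explicitly verbalize evaluation awareness."""
    if not chains_of_thought:
        return 0.0
    hits = sum(
        any(re.search(p, cot, flags=re.IGNORECASE) for p in EVAL_AWARENESS_PATTERNS)
        for cot in chains_of_thought
    )
    return hits / len(chains_of_thought)
```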
It looks as though steering doesn't suffice to mostly eliminate evaluation awareness, and unverbalized evaluation awareness is often substantially higher (this is for Opus 4.6, but presumably the results also apply more generally):
This is a two-post series on AI “foom” (this post) and “doom” (next post).
A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today’s, a world utterly unprepared for this...
Belated update on that last point about “algorithmic progress” for LLMs: I looked into this a bit and wrote it up at: The nature of LLM algorithmic progress. The last section covers how it relates to this post, with the upshot that I stand by what I wrote in the OP.
Paul Christiano lays out how he frames various questions of "will AI cause a really bad outcome?", and gives some probabilities.
(Thanks!) To me, your comment is like: “We have a great plan for robust engineering of vehicles (as long as they are on a dry, warm, indoor track, going under 10 kph).” OK, that’s better than nothing. But if we are eventually going to be driving cars at high speed in the cold rain, it’s inadequate. We did not test or engineer them in the right environment.
This is not a complex systems objection (e.g., it’s not about how the world changes with billions of cars). It’s a distribution shift objection. Even just one car will fail at high speed in the cold rain...