There seems to be a common belief in the AGI safety community that involving interpretability in the training process is “the most forbidden technique”; recent criticism of Goodfire for investing in this area is one example.
I find this odd since this is a pretty normal area of interpretability research in the AGI safety community. I have worked on it, Anthropic Fellows have worked on it, FAR has worked on it, etc.
I don’t know if it will be net positive to use this kind of thing in frontier model training, but it could plausibly be very helpful for AGI safety, and it seems like a clear mistake to me if we don’t do the required research to figure this out. Further, this seems like a massive pain to use in...
Let’s call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, with some kind of interpretability system feeding into the loss function / reward function.
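To make the setup concrete, here is a minimal PyTorch-style sketch, assuming a toy model and a frozen linear probe standing in for the interpretability system (both are hypothetical placeholders, not any particular real tool):

```python
import torch
import torch.nn as nn

# Toy stand-ins: a small "inscrutable" model and a frozen linear probe that
# scores its hidden activations for some undesired feature. Both are
# hypothetical placeholders, not any particular real interpretability tool.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
probe = nn.Linear(64, 1)
probe.requires_grad_(False)  # the interpretability system itself is not trained

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()
interp_penalty_weight = 0.1  # how hard we optimize against the probe's verdict

def training_step(x, y):
    hidden = model[1](model[0](x))      # activations that the probe reads
    logits = model[2](hidden)
    task_loss = task_loss_fn(logits, y)
    # The contentious part: the probe's score feeds into the loss, so
    # gradients flow through the very activations being interpreted.
    interp_loss = torch.sigmoid(probe(hidden)).mean()
    loss = task_loss + interp_penalty_weight * interp_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random data:
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
training_step(x, y)
```

The point of contention is the line where the probe’s score enters the loss: gradients then flow through exactly the activations the probe is supposed to interpret.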
Interpretability-in-the-loop training has a very bad rap (and rightly so). Here’s Yudkowsky 2022:
When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.
Or Zvi 2025:
...The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X].
Isn't it just the case that the human brain's 'interpretability technique' is really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life.
I don’t think it’s that robust even in humans, despite the mitigation described in this post. (Without that mitigation, I think it would be hopeless.)
If we’re worried about a failure mode of the form “the interpretability technique has been routed around”, then that’s unrelated to “The technique (in this case, having an accurate model...
In my "goals having power over other goals" ontology, the instrumental/terminal distinction separates goals into two binary classes, such that goals in the "instrumental" class only have power insofar as they're endorsed by a goal in the "terminal" class.
By contrast, when I talk about "instrumental strategies become crystallized", what I mean is that goals which start off instrumental will gradually accumulate power in their own right: they're "sticky".
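As a very rough sketch of the distinction (all names here are mine, purely for illustration):

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Goal:
    name: str
    terminal: bool = False
    own_power: float = 0.0                    # power held in the goal's own right
    endorsed_by: list[Goal] = field(default_factory=list)

    def power(self) -> float:
        # Binary instrumental/terminal classes: an instrumental goal's power
        # is entirely borrowed from the terminal goals that endorse it.
        if self.terminal:
            return self.own_power
        return sum(g.power() for g in self.endorsed_by)

@dataclass
class StickyGoal(Goal):
    def crystallize(self, rate: float = 0.1) -> None:
        # "Instrumental strategies become crystallized": the goal gradually
        # accumulates power of its own, independent of any endorsement.
        self.own_power += rate * sum(g.power() for g in self.endorsed_by)

    def power(self) -> float:
        return self.own_power + sum(g.power() for g in self.endorsed_by)
```

On the first picture, removing every terminal endorsement zeroes out an instrumental goal's power; on the "sticky" picture, crystallized power persists even after the endorsing goals go away.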
According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5’s behavioral improvements in these evaluations may partly be driven by a growing tendency to notice and game evaluations rather than genuine alignment. This is an early example of a phenomenon that is going to get increasingly problematic: as evaluation gaming increases, alignment evaluations become harder to trust.[1]
To elaborate on the above: Sonnet 4.5 seems far more aware of being evaluated than previous models. In some of Anthropic’s evaluations, Sonnet 4.5 explicitly mentions being tested between 80% and 100% of the time, compared with under 10%...
It looks as though steering doesn't suffice to mostly eliminate evaluation awareness, and unverbalized evaluation awareness is often substantially higher (this is for Opus 4.6, but the results presumably apply more generally):
This is a two-post series on AI “foom” (this post) and “doom” (next post).
A decade or two ago, it was pretty common to discuss “foom & doom” scenarios, as advocated especially by Eliezer Yudkowsky. In a typical such scenario, a small team would build a system that would rocket (“foom”) from “unimpressive” to “Artificial Superintelligence” (ASI) within a very short time window (days, weeks, maybe months), involving very little compute (e.g. “brain in a box in a basement”), via recursive self-improvement. Absent some future technical breakthrough, the ASI would definitely be egregiously misaligned, without the slightest intrinsic interest in whether humans live or die. The ASI would be born into a world generally much like today’s, a world utterly unprepared for this...
Belated update on that last point on “algorithmic progress” for LLMs: I looked into this a bit and wrote it up at: The nature of LLM algorithmic progress. The last section is how it relates to this post, with the upshot that I stand by what I wrote in OP.
Thanks to Noa Nabeshima for helpful discussion and comments.
Certain types of myopic agents represent a possible way to construct safe AGI. We call agents with a time discount rate of zero “time-limited myopic”, a particular instance of the broader class of myopic agents. A prototypical example is a time-limited myopic imitative agent. In theory, such an agent has some desirable safety properties, because a human would only take safe actions (although any imperfect imitation would be unsafe). Since the agent is time-limited myopic, it will never imitate poorly now to make it easier to imitate later. For example, it would never give a human a simple plan so it could more easily imitate the human executing the plan.
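To spell out the discounting point, here is a small sketch, reading “time discount rate of zero” as a discount factor γ = 0 so that only the immediate step contributes to the objective (the numbers are arbitrary):

```python
def discounted_return(rewards, gamma):
    """Standard discounted return: G = sum_k gamma**k * r_k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical per-step imitation rewards for a four-step plan.
rewards = [0.9, 0.2, 0.4, 0.8]

# Far-sighted agent: future imitation quality matters, so "imitate poorly now
# to make imitation easier later" can pay off.
print(discounted_return(rewards, gamma=0.99))  # ~2.27

# Time-limited myopic agent (gamma = 0): only the immediate step counts, so
# there is no incentive to trade current imitation quality for future ease.
print(discounted_return(rewards, gamma=0.0))   # 0.9 (note 0**0 == 1 in Python)
```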
We might run into issues if the agent intends...
My current understanding is that policy-gradient RL incentivizes reward-seeking agents to defect in prisoner's dilemmas, counterfactual muggings, and Parfit's hitchhikers. If there were some selection at the policy level (e.g., population-based training) rather than at the action level, then we might expect to see some collusion (per Hidden Incentives for Auto-Induced Distributional Shift). Therefore, in the current paradigm I expect reward-seeking agents not to collude, even if we train them in sufficiently similar multi-agent environments.
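As a rough illustration of the action-level-selection point, here is a minimal REINFORCE-style sketch in a repeated one-shot prisoner's dilemma; the payoffs, learning rate, and setup are arbitrary choices of mine, not taken from any of the cited work:

```python
import numpy as np

# One-shot prisoner's dilemma payoffs for the row player: 0 = cooperate, 1 = defect.
PAYOFF = {(0, 0): 3.0, (0, 1): 0.0, (1, 0): 5.0, (1, 1): 1.0}

rng = np.random.default_rng(0)
logits = np.zeros((2, 2))   # two agents, each with logits over {cooperate, defect}
lr = 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(2000):
    probs = [softmax(logits[i]) for i in range(2)]
    acts = [int(rng.choice(2, p=probs[i])) for i in range(2)]
    rewards = [PAYOFF[(acts[0], acts[1])], PAYOFF[(acts[1], acts[0])]]
    for i in range(2):
        # REINFORCE: selection happens at the level of individual actions, so
        # each agent's update only credits its own sampled action with this
        # round's reward -- there is no signal that rewards a colluding policy.
        grad_logp = -probs[i]
        grad_logp[acts[i]] += 1.0
        logits[i] += lr * rewards[i] * grad_logp

print([float(softmax(logits[i])[1]) for i in range(2)])  # P(defect), typically near 1
```

Because each update credits only the sampled action with that round's reward, the expected gradient favors defection regardless of the other agent's policy; the collusion story requires a selection signal over whole policies (as in population-based training).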
Taking stock of the DDT ...
John Wentworth explains natural latents – a key mathematical concept in his approach to natural abstraction. Natural latents capture the "shared information" between different parts of a system in a provably optimal way. This post lays out the formal definitions and key theorems.
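For reference, here is one common way to write the two defining conditions in the two-variable case; this is my paraphrase (the post states these more carefully, with explicit error bounds), so treat it as a sketch:

```latex
% Two-variable case: \Lambda is a natural latent over X_1, X_2 if, approximately:
\begin{align*}
\text{Mediation:}  \quad & I(X_1; X_2 \mid \Lambda) \approx 0
  && \text{($\Lambda$ screens off $X_1$ from $X_2$)} \\
\text{Redundancy:} \quad & H(\Lambda \mid X_1) \approx 0, \quad H(\Lambda \mid X_2) \approx 0
  && \text{($\Lambda$ is recoverable from either part alone)}
\end{align*}
```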
I think this mischaracterizes the criticism. The criticism as I understand it is that Goodfire is planning to help frontier AI companies use model internals in training, in exchange for money. Insofar as they really are planning to do this, then I'll count myself as among the critics, for the classic "but we need interpretability tools to be our held-out test set for alignment" reason. Do you have a link to the criticism you are responding to?