In (highly contingent!) defense of interpretability-in-the-loop ML training

Steven Byrnes

Let’s call “interpretability-in-the-loop training” the idea of running a learning algorithm that involves an inscrutable trained model, and there’s some kind of interpretability system feeding into the loss function / reward function.

Interpretability-in-the-loop training has a very bad rap (and rightly so). Here’s Yudkowsky 2022:

When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.

Or Zvi 2025:

The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
Those bits of optimization pressure from [T] are precious. Use them wisely.

This is a simple argument, and I think it’s 100% right.

But…

Consider compassion in the human brain. I claim that we have an innate reward function that triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away. So the human brain reward can evidently get triggered by specific activations inside my inscrutable learned world-model.

Thus, I claim that the human brain incorporates a form of interpretability-in-the-loop RL training.

Inspired by that example, I have long been an advocate for studying whether and how one might use interpretability-in-the-loop training for aligned AGI. See for example Reward Function Design: a starter pack sections 1, 4, and 5.

My goal in the post is to briefly summarize how I reconcile the arguments at the top with my endorsement of this kind of research program.

My overall position

Yudkowsky & Zvi are correct that straightforward interpretability-in-the-loop training of LLMs is almost definitely a very bad idea.
- …Unless interpretability someday develops to such a refined point that it’s adversarially robust (i.e., we understand the model so well that problematic thoughts have nowhere to hide from the interpretability tools). But that sure seems like a long-shot.
There exists a certain brain-like version of interpretability-in-the-loop RL training that does not suffer from this very obvious failure mode
- (…although it might or might not fail for other reasons!)
And there’s a straightforward explanation of exactly how it avoids this obvious failure mode.

The rest of this post will present this explanation:

How the brain-like version of interpretability-in-the-loop training avoids the obvious failure mode

The human brain has beliefs and desires. They are different. It’s possible to want something without expecting it, and it’s possible to expect something without wanting it. This should be obvious common sense to everyone, unless your common sense has been crowded out by “active inference” nonsense.

Beliefs and desires are stored in different parts of the brain, and updated in different ways. (This is a huge disanalogy between LLMs and brains.)

As an oversimplified toy model, I suggest to think of desires as a learned linear functional on beliefs (see my Valence series §2.4.1). I.e. “desires” constitute a map whose input is some thought / plan / etc. (over on the belief side), and whose output is a numerical score indicating whether that thought / plan / etc. is good (if the score is positive) or bad (if it’s negative).

Anyway, the important point is that these two boxes are updated in different ways. Let’s expand the diagram to include the different updating systems, and how interpretability-in-the-loop training fits in:

The red arrow indicates that there’s some system (any system you like) that looks at the inscrutable activation state in the “Beliefs” box, and spits out one or more numerical metrics related to that state. Those numbers then go to the reward function, where they help determine the reward signals.

The interpretability data is changing the reward signals, but the reward signals are not directly changing the belief box that the interpretability system is querying.

That means: The loop doesn’t close. This interpretability system is not creating any gradient that directly undermines its own faithfulness.^[1]

…So that’s how this brain-like setup avoids the obvious failure mode of interpretability-in-the-loop that Yudkowsky & Zvi were talking about at the top.

Things can still go wrong in more subtle and indirect ways

…Or at least, it avoids the most straightforward manifestation of that problem. There are more subtle things that might go wrong. I have a high-level generic discussion in Valence series §3.3, where I point out that there exist indirect pathways through this diagram, and discuss how they can cause problems:

And these kinds of problems can indeed pop up in the context of compassion and other interpretability-in-the-loop human social instincts. The result is that human social instincts that on paper might look robustly prosocial, are in fact not so robustly prosocial in the real (human) world. See my Sympathy Reward post §4.1 and Approval Reward post §6 for lots of everyday examples.

So the upshot is: I don’t think the brain-like version of interpretability-in-the-loop RL training is a panacea for aligned ASI, and I’m open-minded to the possibility that it’s just not a viable approach at all. But it’s at least a not-obviously-doomed research direction, and merits more study.

^{^}
Added 2026-02-13: Actually, oops, even this sentence is a bit oversimplified. E.g. here’s a scenario. There’s an anti-deception probe connected to the reward function, and the “beliefs” box has two preexisting plans / actions: (A) a “be deceptive in a way that triggers the alarm” plan / action in the “beliefs” box, and (B) a “be deceptive in a way that doesn’t trigger the alarm” plan / action.
Now, the good news is that there wouldn’t be a predictive learning gradient that would push towards (B), nor one that would create (B) if (B) didn’t already exist. But the bad news is, there is a kind of policy gradient that would ensure that, if (B) occurs by random happenstance, then the system will update to repeat (B) more often in the future. (Or worse, there could be a partway-to-(B) plan / action that’s somewhat rewarded, etc., and then the system may hill-climb to (B).)
I still think this setup is much less obviously doomed than the LLM case at the top, because we have all this learned structure in the “beliefs” box that the reward function is not allowed to manipulate directly. Again, for example, if (B) doesn’t already exist within the web of constraints that constitutes the “beliefs” box, this system won’t (directly) create it.
[Thanks Rhys Gould for discussion of this point.]

Maybe gradient routing could be used to implement this kind of learning update (see the Influencing generalization section here). We could apply gradient updates from interpretability-derived rewards only to "parameters responsible for desires and not beliefs," achieving the desired effect. Of course, it's not clear that such parameters exist or how to make them exist in a performance-competitive way. Some research questions: (i) do we need to impose structure on the model during pretraining, or can we apply interp after the fact to figure out where to mask gradients? (ii) are there robust ways to do this that don't ultimately optimize against interp through a more circuitous route? Would be super cool to see this studied!

Steven, thanks for writing this!

Isn't it just the case that the human brain's 'interpretability technique' is just really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life. Maybe this is a crux? To my knowledge, we haven't tried that hard to make interpretability techniques and probes robust, not in a way where 'activations being easily monitorable' is closely correlated with 'playing training games well' for the lack of better wording.

A result that comes to mind, but isn't directly relevant, is this tweet from the real time hallucination detection paper. Co-training a LoRA adaptor and downstream probe for hallucination detection made the LLM more epistemically cautious with no other supervised training signals. Maybe we can use this probe to RL against hallucinations while training the probe at each step too?

I have yet to read your previous posts linked here. I imagine some of my questions will be answered once I find time to look through them haha.

Isn't it just the case that the human brain's 'interpretability technique' is just really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life.

I don’t think it’s that robust even in humans, despite the mitigation described in this post. (Without that mitigation, I think it would be hopeless.)

If we’re worried about a failure mode of the form “the interpretability technique has been routed around”, then that’s unrelated to “The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life”. For the failure mode that Yudkowsky & Zvi were complaining about, if that failure mode actually happened, there would still be an accurate model. It would just be an accurate model that is invisible to the interpretability technique.

I.e. the beliefs box would still be working fine, but the connection to desires would be weak or absent.

And I do think that happens plenty in the human world.

Maybe the best example (at least from my own perspective) is the social behavior of (many) smart autistic adults. [Copying from here:] The starting point is that innate social reactions (e.g. the physiological arousal triggered by eye contact) are so strong that they’re often overwhelming. People respond to that by (I think) relating to other people in a way that generally avoids triggering certain innate social reactions. This includes (famously) avoiding eye contact, but I think also includes various hard-to-describe unconscious attention-control strategies. So at the end of the day, neurotypical people will have an unconscious innate snap reaction to (e.g.) learning that someone is angry at them, whereas autistic people won’t have that snap reaction, because they have an unconscious coping strategy to avoid triggering it, that they’ve used since early childhood, because the reaction is so unpleasant. Of course, they’ll still understand intellectually perfectly well that the person is angry. As one consequence of that, autistic people (naturally) have trouble modeling how neurotypical people will react to different social situations, and conversely, neurotypical people will misunderstand and misinterpret the social behaviors of autistic people.

Still, socially-attentive smart autistic adults sometimes become good (indeed, sometimes better than average) at predicting the behavior of neurotypical people, if they put enough work into it.

(People can form predictive models of other people just using our general ability to figure things out, just like we can build predictive models of car engines or whatever.)

That’s just one example. I discuss other (maybe less controversial) examples in my Sympathy Reward post §4.1 and Approval Reward post §6.

I got lost at the second diagram. Why is the process of computing expected reward for a given thought/plan called "interpretability"? In what way is it analogous to mech interp? I'd find it clarifying to see a sketch of an interpretability-in-the-loop ML training setup with the same rough arrows as in that diagram.

I added a caption to the picture. Does that help?