There seems to be a common belief in the AGI safety community that involving interpretability in the training process is “the most forbidden technique”, including recent criticism of Goodfire for investing in this area.
I find this odd since this is a pretty normal area of interpretability research in the AGI safety community. I have worked on it, Anthropic Fellows have worked on it, FAR has worked on it, etc.
I don’t know if it will be net positive to use this kind of thing in frontier model training, but it could plausibly be very helpful for AGI safety, and it seems like a clear mistake to me if we don’t do the required research to figure this out. Further, this seems like a massive pain to use in current frontier model training stacks, so I don’t see any imminent risk or benefit from this kind of work; it’s more of a longer-term bet.
My sense is that most people in this debate have pretty reasonable and nuanced views when you actually get into the details, but public discourse is kinda confusing. So I thought it might be helpful to lay out how I see the landscape here, why I would like to see more research in this direction, and why I am not particularly concerned about exactly how people go about doing it.[1]
What do I actually mean by using interp in training?
I'm broadly referring to anything involving the use of model internals in training. This could encompass several things:
Directly adding some function of the model's internals (e.g. a probe score) to the loss and differentiating through it to give strong gradient signals on how to break it
Adding a probe score to the reward in RL, where the model does not get gradients through the probe
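To make this concrete, here is a minimal sketch of both variants in PyTorch-flavoured pseudocode. Everything here is illustrative rather than anyone's actual setup: `model`, `probe`, the layer index, and the coefficients are hypothetical placeholders, and the probe is assumed to score some undesired property (say, deception) of the activations.

```python
import torch
import torch.nn.functional as F

# Variant 1: add a (frozen) probe's score on the model's internals to the loss.
# Backpropagating through the probe gives the model direct gradient signal
# about how to reduce (and potentially break) the probe's score.
def loss_with_probe_penalty(model, probe, batch, probe_coeff=0.1, layer=12):
    outputs = model(batch["input_ids"], output_hidden_states=True)
    task_loss = F.cross_entropy(
        outputs.logits.view(-1, outputs.logits.size(-1)),
        batch["labels"].view(-1),
    )
    # Probe the activations at some intermediate layer (the layer choice is arbitrary here).
    hidden = outputs.hidden_states[layer]
    probe_score = torch.sigmoid(probe(hidden)).mean()
    return task_loss + probe_coeff * probe_score

# Variant 2: fold the probe score into an RL reward instead. The model only ever
# sees the scalar reward, so it gets no gradients through the probe, which plausibly
# makes the probe harder to game directly.
@torch.no_grad()
def shaped_reward(env_reward, probe, hidden, probe_coeff=1.0):
    probe_score = torch.sigmoid(probe(hidden)).mean()
    return env_reward - probe_coeff * probe_score.item()
```

The only real difference between the two is where the probe's signal enters: as a differentiable term in the loss, or as a non-differentiable term in the reward.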
Why could this research help AGI safety?
Fundamentally, making safe models will involve being good at training models to do what we want in weird settings where it is hard to precisely specify exactly what good behaviour looks like. Therefore, the more tools we have for doing this, the better. There are certain things that may be much easier to specify using the internals of the model. For example: Did it do something for the right reasons? Did it only act this way because it knew it was being trained or watched?
Further, we should beware an isolated demand for rigor here. Everything we do in model training involves taking some proxy for desired behavior and applying optimization pressure to it. The current convention is that this is fine to do for the model's behavior, bad to do for the chain of thought, and no one can be bothered with the internals. But I see no fundamental reason behaviour should be fine and internals should be forbidden; this depends on empirical facts we don’t yet know.
Why might you be worried about research here?
The strongest concern I see is the hope to use interpretability as a test set. The idea is that we should not use interpretability during training at all and should purely use it to audit the model, for example, making lie detectors or determining if it's scheming. An implicit belief here is often that training against interpretability tools will be fragile or won't really work as a safety technique, but will still break our ability to do the auditing. As such, it would be bad if frontier labs started using these techniques for capabilities, and broke our safety tools in the process.
My best guess for why people are against research in this area today, rather than solely being against frontier labs using model internals to train AGI, is that they think it's sufficiently likely that the work is net harmful for safety if used, and sufficiently likely that the work results in frontier labs using the techniques anyway, actually causing the harm for safety. Fortunately, it would be such a massive pain to change the highly optimised infrastructure stacks of frontier labs to use model internals in training that I think this is only likely to happen if there are major gains to be had and serious political will, whether for safety or otherwise. I would be very surprised if this happens in frontier model training in the near future, and I see this as a more speculative longer-term research bet.
This gets kinda nuanced
There’s a bunch of approaches and goals here and some are more concerning to me than others. I expect techniques here to be most effective when they are about shaping generalization, such as by avoiding emergent misalignment in favour of narrow misalignment or avoiding reward hacking.
Intuitively, any time you apply some intervention during training or add something to the loss, you are challenging gradient descent to find a way to break it. But gradient descent generally takes the path of least resistance. If the only way to perform well on a task is to break your intervention, your intervention had better be pretty robust. However, if there's an alternative desired generalization, your intervention only needs to be robust enough that breaking it is harder for gradient descent than adopting that alternative. This is a much lower bar. We found in the CAFT paper that this seems to somewhat work: “not fighting backprop”, as Goodfire put it.
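For a rough flavour of what not fighting backprop can look like: as I understand CAFT, you identify a direction in activation space associated with the unintended concept and project it out of the activations during fine-tuning, so gradient descent is steered towards the alternative generalization rather than given a reason to route around your tool. A hypothetical sketch, where the layer, the direction, and the hook mechanics are all assumptions rather than details from the paper:

```python
import torch

def make_ablation_hook(direction):
    """Forward hook that projects a concept direction out of a module's output activations.
    Assumes the module returns a plain tensor of shape (..., d_model)."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        coeffs = output @ direction  # component of the activations along the concept direction
        return output - coeffs.unsqueeze(-1) * direction
    return hook

# Hypothetical usage: register the hook on some residual-stream module, then fine-tune as
# normal. The fine-tuning gradients never get to "use" the ablated concept, so the model is
# nudged towards whichever alternative generalization solves the task without it.
# handle = model.layers[12].register_forward_hook(make_ablation_hook(concept_direction))
# ... ordinary fine-tuning loop ...
# handle.remove()
```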
Secondly, I'm more concerned about techniques that regularly retrain their interpretability tool, for example retraining a probe every 100 steps. This seems likely to train models whose representations can't be easily probed, which might break a bunch of interpretability techniques; whereas if we train against a frozen probe, we can probably still train a fresh probe after training. Unfortunately, regularly retraining the probe probably also makes the technique more effective.
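To spell out the distinction, the worrying variant is just the loss from the earlier sketch wrapped in a loop that periodically refits the probe on the current model; the frozen-probe variant skips that refitting. All the names here (`fit_probe`, `probe_training_data`, and so on) are hypothetical:

```python
# Hypothetical training loop with periodic probe retraining (the variant I'm more worried about).
for step, batch in enumerate(dataloader):
    if step % 100 == 0:
        # Refit the probe on the *current* model's activations, so it keeps tracking however
        # the model now represents the concept. This is the pressure that plausibly pushes
        # the model towards representations that are hard to probe at all.
        probe = fit_probe(model, probe_training_data)

    loss = loss_with_probe_penalty(model, probe, batch)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```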
Thirdly, I would guess that what you use these techniques on matters a lot. I'd speculate that using these techniques for pure capabilities will have much less of a negative effect on our ability to audit models than trying to use them to make the model safe. This is sensitive to definitions, e.g. I would count preventing reward hacking as both safety and capabilities, and that might damage our ability to detect deception in general. But if someone wanted to use this to train models to e.g. be better at writing code without tons of try-except statements, or adapting better to the language and culture of the user, that seems pretty chill to me.
Fourthly, I would argue that it's not yet settled whether interpretability is even useful as a test set. We don't have lie detectors that I trust. Anthropic has only very recently started mentioning interpretability in system cards, and I consider the ways it's used to be experimental and non-load-bearing. I certainly hope interpretability will become sufficiently robust and trustworthy that it can be an important part of how we audit models, but let's not take it as a given.
Why do I want to see more work here?
Zooming out, this is a pretty early field. I don't know how well it will work, how much it will break interpretability tools, or which things are more or less dangerous. I don't know whether training against a probe will break somewhat different interpretability tools like activation oracles. I don't know if training against a frozen probe will stop me from retraining a probe on the eventual model and having it work fine. I don’t know how much harder it is to break a probe when you don’t get gradients through it. I don't know whether it will let me do things that I can't easily do with existing methods.
These are important questions, and we should research them! I don’t even know if my intuitions above about which directions are concerning are legit. We need to check.
In the sense that I think it's hard to do research in this area right now that is clearly net negative - I definitely think some approaches are much more promising! But even if someone tries to do pure capabilities research here, I still expect to learn useful things about safety from it. ↩︎