Reasoning Model Interpretability: basic science of how to interpret computation involving the chain of thought [[4]]
Automating Interpretability: eg interpretability agents
Finding Good Proxy Tasks: eg building good model organisms
Data-Centric Interpretability: building better methods to extract insights from large datasets
To apply our pragmatic philosophy, we need North Stars to aim for. We do this work because we're highly concerned about existential risks from AGI and want to help ensure AGI goes well. One way we find our North Stars is by thinking through theories of change where interpretability researchers can help AGI go well (which goes well beyond “classic” mech interp)
Our thoughts below are highly informed by our takes on the comparative advantages of mech interp researchers, as discussed in the accompanying post:
Notably, we think that while interpretability excels at producing qualitative insights (including unexpected ones) it is weak at being complete or precise. So we're selecting theories of change that don't require high-confidence explanations, unlike some views classically popular in interp, e.g. trying to prove beyond reasonable doubt that a system is aligned.
Finally, we encourage readers to apply our pragmatic framework to their own views on AGI safety! We would be excited to be convinced of other theories of change where a mech interp mindset helps.
Imagine a frontier model takes an egregiously bad action - say, introducing a security vulnerability into critical monitoring code. How do we know if it's scheming maliciously or simply confused?
This distinction drives completely different safety responses. If it was truly caught scheming, we need to take this seriously, sound the alarm, and comprehensively fix these issues. But if it was benign, we should not freak out over nothing (though should still apply mitigations).
Further, if we do sound the alarm, we need strong and rigorous evidence to present to stakeholders if we need to justify serious action. This is easy if the model constantly misbehaves. But if it's scheming while evading detection then distinguishing malicious acts from benign occasional failures becomes hard, and there may be no smoking gun.
Interpretability can help by analysing real instances of misbehaviour (e.g., from deployment logs) to incriminate or exonerate the model. Our shutdown-resistance case study was an initial attempt on an easier problem, and we discuss our takeaways from that in the accompanying post.
Why this? Because it leans into the comparative advantage of qualitative understanding of a model’s behaviour, unsupervised discovery, and a scientific approach of gathering as much evidence as we can for and against competing hypotheses
Success here requires:
We view this as a more focused and tractable analogue to “prove that a model is aligned.” [[5]] The key shift is the focus on narrow incrimination rather than universal exoneration**.** We are not trying to prove a global negative; we only need to gather sufficient evidence for a specific hypothesis about a single instance, exactly where interpretability shines.
We also note that even if you're pessimistic about alignment on current trajectories and want AI progress to pause or slow, this agenda remains valuable: rigorous, high-quality evidence, while not sufficient, is valuable for helping people properly react to any warning shots.
Interpretability provides affordances other ML areas often lack. We don't think it's a silver bullet. But it provides decorrelated tools and we're excited to fill weak spots in other safety agendas.
Creating realistic evals is hard, especially for alignment. Effort gets sunk into evaluating models, then eval awareness ruins it - if a model knows it’s in a test scenario, we cannot study how it would act naturally! Improving realism helps in the short term but likely fails in the long-term as models see through it. Steering vectors offer an orthogonal solution with different weak points, and can ensure that all the effort that’s gone into evals isn’t wasted. (see e.g. suppressing it in Sonnet 4.5, or Hua et al)
Using model diffing (e.g. before/after using a safety technique, or between training with and without it) can show researchers what their technique actually accomplished, helping guide development in productive directions.
Alignment reasoning often imports human framings, goals, intent, beliefs, and hopes that they transfer. But we could have a much clearer understanding of what alignment work should actually target. Now that we have more sophisticated models, we expect that rigorous investigation could tell us a lot. Notable work here include Stewart Slocum on whether synthetic-document-fine-tuned models truly believe the implanted fake fact, and Jack Lindsey on model introspection.
Currently, a powerful technique for controlling models is monitoring the chain of thought. Though imperfect, it’s surprisingly helpful and faithful, especially for the most complex tasks. But this monitorability is fragile and changes to training set-ups could accidentally break it e.g. by creating some optimisation pressure on the thoughts [[6]] . Interpretability researchers could create better metrics, better understand how monitorability could break, and what could prevent this. See e.g. Fang et al
This is the theory of change behind AI control research [[7]] : if we have models that are potentially misaligned, how can we get useful work out of them without them being able to take egregiously misaligned actions (e.g. disabling monitoring systems).
Control is a broad agenda where interpretability's comparative advantage is limited [[8]] . But two parts seem plausibly important and suited to interpretability and internals work:
An emerging, exciting area: using model internals to influence training. Training is what fundamentally drives most safety properties of models, so any new affordances here are valuable. We expect any success with internals will come from leveraging the fact that they are decorrelated tools, and leaning into the things that are particularly hard to fix with improved data, e.g. stopping models from behaving nicely in training but not when in the real world (similar to eval awareness).
Notable works:
All such interventions are imperfect; sufficient gradient-descent pressure breaks them [[11]] . We're most optimistic about approaches nudging models toward one solution over another - a qualitative shift that seems attainable with modest pressure (eventually it's easier to just learn the alternative). A valuable proxy task is “change what is learned during fine-tuning without any changes to the training data or loss”, e.g. causing models to become narrowly misaligned rather than emergently misaligned. We expect a wide range of approaches here to be possible, and would love to see more exploration.
However, we do note drawbacks:
One important way of selecting research areas is by thinking about the theories of change detailed above and what research sub-problems are on the critical path between where we are now and where we want to be by AGI. The below is an incomplete list of some key research areas with clear ties to those theories of change.
By "model biology," we mean studying qualitative properties and factors driving model behavior - both in specific instances and across distributions. We're broadly enthusiastic about progress here (better methods, richer case studies). We're most excited about work studying normal LLMs, but it can make sense to construct model organisms with specific properties of interest, where appropriate
Why this? In a sense, this is the general form of the core problem of science of misalignment. The model has done something potentially concerning. Can we understand why? And produce a legible, rigorous case for others. But we would also consider this a robustly useful setting with many other applications.
Key areas:
Direct work on science of misalignment: Better understanding of what misaligned cognition looks like. Studying seemingly misaligned behavior in today's models (e.g., reward hacking in the wild) seems particularly fruitful. (e.g. shutdown resistance)
Eliciting secret knowledge: If we could reliably extract secrets models know but won't share, this could be super useful. Marks et al. and Cywinski et al. made progress; the latter open-sourced their models for others to practice on.
Improved Methods: We think there’s significant scope for improved methods here. We’re particularly excited about improved methods for generating hypotheses to explain a behaviour (e.g. sparse autoencoders, training data attribution, patchscopes), and better methods for verifying or falsifying a given hypothesis like constructing better counterfactuals (e.g. resampling the CoT). We hope that progress on reasoning model interpretability can be translated into further improved methods.
See discussion above
Model-internals-based monitors are cheap - largely reusing computation from the main forward pass. (Cunningham et al. showed this works well for frontier misuse detection.)
Improving probes could significantly help theories of change like white-box control, but also seem robustly useful in a bunch of ways, e.g. helping prevent catastrophic misuse. Despite probes seeming simple, substantial room for innovation exists:
We would be excited to see better work at monitoring for things like deception and scheming. For deception specifically, Goldowsky-Dill et al is a nice proof of concept here, and Kretschmar et al provides an initial set of evals.
We would also be excited to see progress on less explored forms of monitoring like mechanistic anomaly detection
Another way to select research areas, as discussed in the accompanying post, is to identify robustly useful settings: areas that we think are good analogies to aspects of future models and where we can see a bunch of somewhat plausible ways that work here could be helpful for making AGI go well. This can look like studying a certain phenomena, building certain kinds of methods, etc. We think that work targeted at certain promising settings stands a good shot at remaining useful, even as we learn more and shift our plans.
Projects here don’t need to be targeted at a single confident theory of change, but there should be at least one somewhat plausible story for how it helps, and keeping that story in mind as you research is valuable for helping prioritise and keep focus
The rise of reasoning models [[12]] has major strategic implications. Most mechanistic interpretability approaches implicitly focuses on single forward passes, and does not generalise to reasoning models. Yet most interesting behaviour in frontier models involves reasoning! We think this area is extremely neglected and that the field is dropping the ball here.
The key problems are:
We present a detailed example in the appendix for why standard tools do not seem easily applicable here.
Our point is not that this is hopeless, in fact we think this is highly tractable! We think real progress was made in some papers we’ve supervised, discussed below. And it may be the case that it just works with a bit of fiddling, e.g. maybe we can just identify the key forward passes to study. But we consider determining how to be an open research problem. In particular, we do not currently know how to give a satisfying answer to how to reverse-engineer the computation in the problem described in the appendix [[13]] .
Why does this matter? We think that progress here could be significantly helpful for science of misalignment, investigating bad behaviour caught by control schemes, potential for conceptual progress into questions of AI psychology, and likely many other theories of change, given how critical reasoning models are.
We've supervised work here we're excited about:
Latent reasoning models: Vector-based chain-of-thought (Coconut, CoDI, diffusion models, etc.) is a particularly important area of reasoning models. To us, it seems likely that the field will eventually shift to these (much more information density in vectors than tokens), and this will break extremely useful safety tools like chain-of-thought monitoring. It seems high priority to figure out how interpretability can provide replacements, e.g. a translation of the CoT back to natural language.
However, currently, it’s highly unclear which of today’s latent reasoning models are a good proxy for eventual future ones, meaning it’s difficult to do too much useful work today, though we think the field is early enough that studying any latent reasoning model will likely be instructive.
As models scale, interpretability must leverage AI progress to automate itself. As we approach human level systems, they ought to have the potential to automate a lot of labour intensive research tasks, even if we’re not yet sure exactly where [[14]] . And we think it is valuable to invest now in learning how to keep up, so we are prepared for worlds where the rate of AI capabilities research may substantially increase.
We see this as a methods category, with particular applications to speeding up science of misalignment work, or making more ambitious control schemes possible to catch misbehaviour. But we expect this to be useful everywhere in the long-run, interpretability agents are the worst they will ever be.
We're particularly excited about:
As discussed above - modern LLMs seem to be approaching having something like a “system 2”, not just intuitive reflexes. What’s going on here? How does it work?
A valuable form of infrastructure work is finding things worth studying, that are good proxy tasks for eventual things we care about, and otherwise enhancing our ability to choose good future projects. We think this is an instrumental goal of most theories of change. Notable areas:
Model organisms: Construct and open-source good ones (Cywinski et al., Hua et al.), improve creation techniques (synthetic document fine-tuning), red-team and identify flaws (Minder et al.).
Naturalistic behaviors: Find interesting phenomena in real models to study, and try to distill it into a clean, reproducible environment e.g. alignment faking, emergent misalignment, eval awareness, agentic misalignment (e.g. blackmail), etc. Of course, there’s also a lot of other reasons to do this kind of work!
More broadly, discovering ways a model may misbehave, or behave weirdly, can be useful - both as evidence for and against its alignment (if you can’t elicit misbehaviour, that’s a good sign!), and to surface good proxy tasks.
Possible approaches here:
Extracting insights from large datasets. We think this is a promising area with applications in a bunch of areas, and see this more as developing better methods, which can then be applied to e.g. detecting misbehaviour in lots of LLM agent logs (as part of a control scheme or for science of misalignment evidence), or studying RL transcripts to better understand the final model. Examples:
Methods centred on comparing two models and interpreting the difference. In theory, this could be highly promising for focusing our attention on the important parts, e.g. things that changed substantially during fine-tuning. This could be particularly useful for things like providing better feedback on what a safety technique did to accelerate safety research, or for general things to do with model biology.
There’s many potentially promising techniques here:
We are excited to see work that is just trying to do useful things on today's models, with proxy tasks around actually getting used in production. i.e. real users or at least non interpretability researchers use them. We are particularly interested in applied interpretability work focused on making models safer, but this isn’t required.
We think that even if there is not a clear theory of change for how this translates to AGI Safety, there is a powerful feedback loop from whether people will actually use your work. [[15]] And we think this will be helpful for doing work that is real and useful.
In addition, in a world of high uncertainty, where lots of our current preconceptions are wrong, a research group that just kept doing things to make current models safer in the real world will likely have done some valuable things by the time AGI arrives! So we see this as a diversifying bet.
To illustrate how this is a difficult area to study, consider the following scenario:
The model is solving a maths problem, and we want to understand the circuit that led to the final answer. The chain of thought looks like this: (we recommend playing around here to build intuition if you haven’t stared much at raw thoughts before!)
We think that it is not at all clear how to solve this problem with standard tools, which generally treat the prompt (and CoT so far) as fixed, and study a single forward pass
We recommend reading this first, but it’s not necessary ↩︎
if you can’t study your goal directly ↩︎
Or something else, e.g. emergent misalignment causing flagrant, unsubtle misalignment ↩︎
Sampling is a stochastic and non-differentiable operation, introducing significant difficulties, so many things need to be rethought ↩︎
Of course, we are solving an easier problem. We think this would be high impact, but it does not solve the problem of “if we have a model that seems aligned but could be deceiving us, what do we do?” Our guess is that this is not where most of the risk comes from. ↩︎
And e.g. shifts to vector-based chain of thought would be substantially worse, but is harder to prevent since that’s likely to be considered worth the cost if vector based CoT is good enough to justify the hassle of switching ↩︎
To learn more about control, this post may be a helpful overview, and we enjoyed this podcast overview. We expect these are better introductions than the linked paper ↩︎
In particular: a big advantage of AI control is that evaluating whether a set of control techniques are robust to intentional subversion just requires measuring how good a model is at subverting those techniques. In contrast to many other kinds of AI safety techniques, this doesn’t require the technique developer to know anything about inductive biases, generalization behavior, salience of various features, and so on. Interpretability techniques rely on properties of the internals of models that it’s harder to abstract away. So if you have a complicated control protocol that incorporates an interpretability technique, there isn’t an robust end-to-end way to measure how much that interpretability technique will help. ↩︎
Internals-based monitors can reuse computation done by the main model, giving them a significant advantage over e.g. running another LLM. There’s also some theoretical advantages, like maybe being able to detect subtle plans from the untrusted model that a dumber LLM monitor wouldn’t notice. ↩︎
This is a special case of science of misalignment, with a higher priority on automation ↩︎
See e.g., RL Obfuscation ↩︎
These arguments technically cover any model that does a significant amount of computation through sampling. Non-reasoning models that are just prompted to use chain of thought are also difficult, though reasoning models have made this much more pressing ↩︎
If any readers can convince us otherwise, we would be very grateful! ↩︎
For highly capable future models, we naturally also need to worry about things like our interpretability models colluding with the studied model. But our guess is that for bounded and well-scoped tasks, this kind of thing is easier to manage, e.g. by using weaker and better trusted models. ↩︎
Note that this is assuming you are working with people who are actually trying to evaluate whether your contribution is worth using. If someone is just a massive fan of interpretability and wants to use it because it's cool or flashy, we do not think this is particularly helpful as a source of feedback. ↩︎
Or maybe the final answer was computed anew and it ignored the CoT! Or maybe it was memorised! Who knows ↩︎