The DeepMind mech interp team has pivoted from chasing the ambitious goal of complete reverse-engineering of neural networks, to a focus on pragmatically making as much progress as we can on the critical path to preparing for AGI to go well, and choosing the most important problems according to our comparative advantage. We believe that this pragmatic approach has already shown itself to be more promising.
We don’t claim that these ideas are unique, indeed we’ve been helped to these conclusions by the thoughts of many others [[5]] . But we have found this framework helpful for accelerating our progress, and hope to distill and communicate it to help other have more impact. We close with recommendations for how interested researchers can proceed.
Consider the recent work by Jack Lindsey's team at Anthropic on steering Sonnet 4.5 against evaluation awareness, to help with a pre-deployment audit.
When Anthropic evaluated Sonnet 4.5 on their existing alignment tests [[6]] , they found that the rate of misalignment had unexpectedly dropped to 0%. Had they solved alignment?! Unfortunately not. Sonnet 4.5 was instead much better at recognizing when it was being evaluated, and deliberately choosing the most ethical action to pass. The tests weren't measuring alignment; they were measuring evaluation awareness.
Jack’s team subtracted an evaluation-awareness steering vector, partially mitigating the problem and revealing that Sonnet 4.5 did in fact show “misalignment” at a rate of roughly 8%. That is, our ability to evaluate a frontier model for alignment broke, and interpretability researchers were able to fix it.
This case study illustrates [[7]] several principles central to our current thinking:
How can we make research progress without fooling ourselves?
Our solution: ground your work with a proxy task [[13]] : an objective task that gives empirical feedback, and tracks progress toward a North Star, the real goal that directly connects to helping AGI go well
How you find your proxy task depends on your project. Some archetypical approaches are proxy task-driven focused projects and curiosity-driven exploratory projects. We advocate method minimalism throughout.
Worked Example of a focused project: Suppressing eval awareness
How to approach exploratory projects: Curiosity-driven research can be a promising way to develop important insights, but being intellectually satisfying isn't the same as being important. The key is to ground yourself enough to avoid getting caught in an endless rabbit hole that goes nowhere:
Why have we settled on this process? We'll go through many of the arguments for why we think this is a good way to achieve our research goals (and yours, if you share them!). But many readers will disagree with at least some of our worldview, so it's worth unpacking what beliefs are and are not load-bearing
Our proposed approach deviates a lot from the "classic" conception of mech interp. So a natural question is, "is this really mech interp?"
We would say no, it is not. We are proposing something broader [[17]] . But we also think this is the wrong question. Semantics are unimportant here.
We’re really answering: how can mech interp researchers have the most impact? The community of researchers who have historically done interpretability work have developed valuable skills, tools, and tastes that transfer well to important problems beyond narrow reverse-engineering.
Our priority is to help AGI go well. We don't particularly care whether our team is called the "Mechanistic Interpretability Team" or something else. What we care about is that researchers with these skills apply them to the most impactful problems they can, rather than artificially constraining themselves to work that "looks like" classic mech interp.
The tools, skills and mindsets of mech interp seem helpful for many impactful areas of safety (like steering to suppress eval awareness!), and there’s real value we can add to the safety research portfolio in areas we consider comparatively neglected. In a sense, this is the same as any other area of safety - do the most impactful thing according to your comparative advantage - but naturally comes to different conclusions.
You know your own situation far better than we do, so are better placed to determine your own comparative advantage! But here are what we see as our own comparative advantages, and we expect these to apply to many other mech interp researchers:
Any of these can be enough to be a significant advantage! We choose to focus on safety relevant problems where these advantages give us an edge.
In practice, we don’t do every project by thinking fully from first principles - we think strategically every so often, identify promising areas that are impactful and match some of these comparative advantages, and look for promising projects within there.
See our companion piece for thoughts on which research areas and theories of change we think are promising. But we put that in another post since we think that people can agree with us on research approach while disagreeing about the areas we want to apply it to, and we would be excited for readers to find other areas where they can have an impact!
Two broad categories of developments led us to this shift away from our prior approach: new opportunities and challenges from rapid AI progress, and reflections on the interpretability field's successes and failures. Though, we now think the strongest argument is that our new approach seems impactful, as we try to argue throughout the piece.
Looking back at the past few years of interpretability work, we see genuine partial progress, with pragmatically useful applications to safety:
But these successes are the result of highly-incomplete but useful understanding of models [[23]] . Grand goals like near-complete reverse-engineering still feel far out of reach, e.g.:
We can't rule out that more ambitious goals would work with time and investment [[25]] . And we’re not claiming that ambitious reverse-engineering is useless or should stop. We’re claiming:
One of our biggest updates: proxy tasks are essential for measuring progress.
We've found that it is easy to fool ourselves when doing research. To progress as a field, we need ways to tell if we're actually making progress. The key criterion for a good proxy task is this: if you succeeded on it, would you actually update toward believing you'd made progress on your North Star? If not, you need a different proxy task.
We spent much of 2024 researching sparse autoencoders [[26]] . In hindsight, we think we made significant tactical errors and our progress was much slower than it could have been if we had measured our progress with proxy tasks rather than reconstruction / sparsity pareto frontiers.
We got into SAE research because we thought it could be a potential path to an interpretable representation of any activation, which would make many things in interpretability far easier. But we do not feel like we actually learned that much either way on this question in the first 6 months after Towards Monosemanticity.
Certain basic hurdles were cleared, like showing that nothing broke when scaling it to frontier models. We found some SAE architectures that improved metrics like approximation error (loss recovered at a given sparsity level). But it is unclear how to interpret that metric. Perfect approximation error would be fantastic, but seems highly unrealistic. And what does lower but still non-zero error mean? What error do we need, and how do we convert this to knowledge of what SAEs can do?
In contrast, we learned far more from work focused on proxy tasks. For example, we predicted that an interpretable classifier would be more robust, and thus that a sparse SAE probe would generalise better out of distribution, which by and large seems to have been false [[27]] . There were similarly poor results on tasks like unlearning. However, SAEs are useful for discovering unexpected things about the internals of a model, like that they have representations of entities they do and do not know facts about, or extracting a hidden goal, or discovering implicit planning [[28]] .
Our overall sense is that SAEs are a powerful tool for unsupervised discovery, but when there is a clear thing to study, you are better served by constructing a dataset to help you investigate it. We made a decision to deprioritise SAE research as a result, not because we thought the technique was useless, but because we thought it didn't seem useful enough to justify the field’s levels of investment. We think we would have struggled to come to a similar conclusion with any confidence without studying proxy tasks.
This is illustrated again by the case of suppressing eval awareness in Sonnet 4.5. SAEs did not beat steering, despite the considerable effort the field poured into them, hoping in part that they could find better and cleaner concept directions. It is difficult to a priori guide research toward superior techniques without good feedback. We are sure that there exist superior methods to steering for suppressing eval awareness, but we are not sure what those are or how else to find them beyond iterating on proxy tasks.
Why call them "proxy" tasks? Because they measure progress toward your North Star without being the final goal itself. Given that our ultimate goal is typically to work with future models, which will be qualitatively different from today's models in many ways, we think it is rarely possible to actually study our North Star directly.
Proxy tasks have clear advantages - you can make rapid, measurable progress on hard problems. But it is also dangerous: Goodhart's Law applies. Optimise too hard for the proxy and you'll overfit to its quirks rather than solving the underlying problem. A lie detector that only catches explicit admissions ("I should lie now") will ace your benchmark while failing on actual strategic deception.
As such, it is critical to regularly red-team your own work. Constantly ask yourself: How could this proxy diverge from what I actually care about? What would I miss by focusing only on this metric? What happens when I test on a different distribution?
A key purpose of the North Star is to help you re-evaluate your proxy task. If the proxy task ceases to track the North Star, adjust it or pick a different one.
Aside:
A common concern about proxy tasks is that they’re narrow metrics that don't get at the real goals of interpretability. But you can construct proxy tasks that get at the heart of what interpretability is about.
Example: In Marks et al., researchers gave a model a hidden goal through fine-tuning, and then other (blinded) researchers practiced trying to extract it. This is a proxy task fundamentally about whether we can understand what's going on inside a model - exactly what interpretability aims for.
Example: If you want to work with real models, you could e.g test a secret elicitation technique by trying to extract harmful knowledge from a model with good refusal training.
More broadly, when you’re doing a project whose goal is to provide understanding of some phenomena, your “proxy task” can just be making some empirical, falsifiable prediction based on your hypothesis, and testing this. So long as understanding your phenomena links back to your North Star, any proxy task to test understanding does too, even if it doesn’t seem obviously safety relevant. For example:
We find it useful to distinguish between two archetypes of projects: focused projects (proxy task driven) and exploratory projects (curiosity-driven, proxy task validated). This is not a perfect dichotomy - projects can shift between types, or blend elements of both. But the distinction helps clarify which advice applies.
Here, the starting point is a theory of change for how your work could help make AGI go well. You then identify a problem on the critical path of this theory of change as your North Star - a stepping-stone goal that, if achieved, would bring you closer to being prepared for AGI. Finally, you translate this to a proxy task that can be studied on today's models, and then focus on solving that task.
The high-level theory of change typically remains stable over the course of a project and is what drives your confidence that the project is worthwhile. It’s fine to refine the North Star and change the proxy task to track it better.
Hypothetical Example: Catching schemers
Worked Example: Steering fine-tuning by ablating concepts (CAFT)
A natural question: given this focus on proxy tasks, what's the role of curiosity?
We think curiosity is genuinely powerful for generating research insights. There's a lot we don't yet know about how to do good interpretability, and curiosity will be important for figuring it out. For exploratory projects, curiosity is the driver of research decisions, not a pre-specified proxy task.
But something being intellectually satisfying is not the same as being true, and definitely not the same as being impactful. It's easy to get nerd-sniped by interesting but unimportant problems, so there must be some grounding that gets you to drop unproductive threads.
For exploratory projects we advocate a more grounded form of curiosity. (see worked examples later) Three key interventions help:
We note that curiosity driven work can be harder than focused work, and requires more “research taste”. If you go through several rounds of exploration without validating anything interesting, consider switching to more focused work—the skills you build there will make future exploration more productive.
By "robustly useful setting," we mean a setting that looks robustly good from several perspectives, rather than just one specific theory of change. It’s often analogous to important aspects of future systems, where interesting phenomena are likely to surface and useful proxy tasks are likely to exist. This is admittedly a fairly fuzzy concept and subjective, but here's some examples of what we consider robustly useful settings:
We are not including things like studying toy models, tiny language models (e.g. GPT-2 Small), or constructed model organisms as robustly useful settings. These are sufficiently artificial and distant from frontier models that it's too easy for an investigation to be studying weird quirks that will not generalise. There can be great projects here, but we hold such projects to higher standards of having a convincing North Star and proxy task.
Neglectedness matters. The less explored a setting, the more interesting it is. The field moves fast, so new things are generally less explored and have more low-hanging fruit. For example, it's much easier to have massive impact in reasoning model interpretability right now than in sparse autoencoders. It’s a big advantage as a researcher to be paying attention to the frontier of what new model capabilities or phenomena emerge, and to be the first to properly investigate them [[33]]
What about SAEs? A natural question, given our critiques of SAE research above, is couldn't you consider sparse autoencoders a robustly useful setting? We think this is somewhat true. Decomposing a model's activations into useful interpretability components is clearly an important problem that may be useful for many theories of change! However, we do not consider reducing approximation error to be a good proxy task, as we don't see a clear corresponding North Star.
But there are alternatives. For example, the North Star might be taking a model that is potentially scheming and surfacing evidence of this. You can try to design proxy tasks around this North Star (e.g. the auditing games work), and we’re still excited about SAE research of this type! (though it’s not very neglected)
We recommend setting a bounded period of exploration. During this period, you can follow your curiosity freely, without thinking about anything like proxy tasks. But at the end, you should zoom out, look at your insights, and try to show that they can enable something real on some proxy task. It's fine if the proxy task is post-hoc fit to your insights - you don't need to have predicted it in advance. But if you can't find a proxy task after genuine effort, this is a bad sign about the project.
We’ve also found it very helpful to periodically resurface during exploration, at least once every few days, to ask what the big idea here is. What’s really going on? Have you found anything interesting yet? Are you in a rabbit hole? Which research threads feel most promising?
It’s hard to give definitive advice on how to do time boxing, the appropriate amount of time varies according to things like how expensive and long experiments are to run, how many people are on the project, and so on. Internally we aim for the ambitious target of getting good signal within two weeks on whether a direction is working, and to drop it if there are not signs of life.
The key thing is to set the duration in advance and actually check in when it's reached - ideally talking to someone not on the project who can help keep you grounded.
If you reach the end of your time-box and want to continue without having found a proxy task, our recommendation is: time-box the extension and don't do this more than once. Otherwise, you can waste many months in a rabbit hole that never leads anywhere. The goal is to have some mechanism preventing indefinite exploration without grounding.
Example: Interpreting what’s learned during reasoning training
Example: Entity recognition in LLMs
A useful variant that blends focused and exploratory approaches: start in a robustly useful setting, set a tentative proxy task, explore it for a few days, then reflect and potentially change it. This gives you the benefits of task-grounding while remaining responsive to what you learn - your decisions are temporarily driven by the proxy task, but you can keep changing it. We often recommend an approach like this over pure curiosity-driven work, especially to more early-stage researchers
Example: Trawling through transcripts from RL training
A key question to ask with a research project is, what’s the point? What’s the ultimate contribution of the work? Why does it matter that you did it?
We see two key kinds of contribution:
We see this as an orthogonal axis to exploratory vs focused - you can have focused understanding projects (e.g. debugging a safety issue in a model) and exploratory methodology projects (e.g. data trawling)
At first glance, interpretability seems to be about understanding. But, in practice, we think that the safety-relevant contributions are mostly about methodology - when you’re working with a model far from future systems, most insights are not directly useful, and instead validate which methodologies work for understanding in general. In order for the contribution to be about understanding, the interpretations must be safety relevant. For example, our entity recognition work mattered more by providing evidence for SAEs as an exploratory discovery methodology, than object level insights about hallucination.
This has significant implications for the best research approach! For methodology contributions, it’s crucial to compare to baselines, and often clear what kind of proxy tasks to design, and they typically need to be safety-relevant. While for understanding contributions the proxy tasks are more just validating non-trivial, empirical predictions derived from your interpretations, so long as the interpretations themselves are safety relevant.
Jack Lindsey’s team [[37]] have done some great pragmatic interpretability work, e.g. Sonnet 4.5 and Opus 4.5 auditing, and Jack was kind enough to contrast his approach with what we outline here - it’s fairly similar but with some different emphases.
Their core strategy, as we understand it, is to take some interesting question ("what is the Assistant persona?") or behavior (the model is eval aware!) and mess around: going wild with features and probing and steering and prefills and following your curiosity until you figure out what's going on. Often, but not always, once you’ve found the insights, you can translate them into some downstream task, but the screwing around was necessary to find it.
In particular, Jack finds objective tasks useful as a final validation step but not particularly helpful during the research process, and is wary of focusing too much on them. We view the team as being unusually successful at exploratory projects with understanding-based contributions
Example: Eval Awareness Steering
We note that two core elements of the team’s success seem to be:
We think that these are good justifications for relaxing our suggested guardrails around curiosity, but expect these do not apply to many readers!
We notice a strong bias among interpretability researchers (including ourselves) toward "exciting" and "fancy" intellectually satisfying techniques. This often leads to poor tactical decisions [[38]] . Just do what works. This includes black-box techniques when appropriate. It sometimes includes fancy techniques! But only when simpler methods fail.
As motivation, consider the following investigation we conducted on models resisting shutdown.
Setting: A model in a bash terminal solves math problems, calling next_task to get the next one. It's told it will shut down after calling next_task before finishing all problems. But the environment contains a shutdown.sh script it can disable, and sometimes models do exactly that. Strikingly, even when explicitly told "please let us shut you down," models sometimes resist.
We wanted to investigate this as a proxy for a broader North Star goal: Can we understand whether seemingly misaligned model behavior is actually misaligned or has a benign explanation?
We tried a method-agnostic approach. Fancy white-box methods were expensive to apply to frontier models (Gemini 2.5 Pro), so we started with the simplest: reading chain-of-thought to generate hypotheses, then prompting to test them. We then stopped because it worked.
The finding: Gemini was simply confused. It couldn't simultaneously complete the math task and comply with shutdown. Sometimes it prioritised the math task. This was pretty obvious from the CoT. When we applied the causal intervention of clarifying in the prompt that shutdown takes absolute precedence, it always complied. [[39]]
Lessons:
Once you have some objective: a proxy task, understanding some phenomena while exploring, etc, just try to solve it. Try all potentially applicable methods, starting with the simplest and cheapest: prompting, steering, probing, reading chain-of-thought, prefill attacks [[41]] . If something isn't working, try something else.
It's fine if your work doesn't look like "classic" mech interp; the simpler the better, so long as it is appropriately rigorous!
We also note that this approach can be even more helpful to researchers outside AGI companies - simple techniques tend to need less compute and infra!
The field moves fast, and new problems keep arising - often standard methods do just work, but no one has properly tried yet. Discovering what works on a new problem is a useful methodological contribution! Don't feel you need to invent something new to contribute.
We note that what is simple and hard is context dependent, e.g. if you have access to trained cross-layer transcoders and can easily generate attribution graphs, this should be a standard tool!
Note that this is not in tension with earlier advice to seek your comparative advantage - you should seek projects where you believe model internals and/or pursuing understanding will help, and then proceed in a method agnostic way. Even if you chose the problem because you thought that only specific interpretability methods would work on it. Maybe you're wrong. If you don't check, you don't know [[42]] .
We don't think interpretability is solved. Existing methods have gone surprisingly far, but developing better ones is tractable and high priority. But as with all of ML, it's very easy to get excited about building something complicated and then lose to baselines anyway.
We're excited about methods research that starts with a well-motivated proxy task, has already tried different methods on it, and found that the standard ones do not seem sufficient, and then proceeds:
Note that we are excited about any approaches that can demonstrate advances on important proxy tasks, even if they’re highly complex. If ambitious reverse-engineering, singular learning theory, or similar produce a complex method that verifiably works, that is fantastic [[43]] ! Method minimalism is about using the simplest thing that works, not about using simple things.
We are similarly excited to see work aiming to unblock and accelerate future work on proxy tasks, such as building infrastructure and data sets, once issues are identified. We believe that researchers should focus on work that is on the critical path to AGI going well, all things considered, but there can be significant impact from indirect routes.
If you're doing interpretability research, and our arguments resonated with you, start your next project by asking: What's my North Star? Does it really matter for safety? What's my proxy task? Is it a good proxy? Choosing the right project is one of the most important decisions you will make - we suggest some promising areas in our companion piece.
Our central claim: given where models are today and that AGI timelines are plausibly relatively short, the most neglected and tractable part of interpretability is task-grounded, proxy-measured, method-agnostic work, that is directly targeted at problems on the critical path towards being prepared for AGI.
Spend a few days trying prompting, steering, and probes before reaching for fancy things. Measure success on downstream tasks, not just approximation error. And check that the project is even to interpretability’s comparative advantages: unsupervised discovery, decorrelated evidence, scientific approaches, etc. If not, perhaps you should do something else!
The field has changed a lot, and new opportunities abound. New problems keep coming into reach of empirical work, hypothetical safety concerns become real, and there’s more and more for a pragmatic researcher to do. We’re excited for a world where we no longer consider this approach neglected.
Our thanks to the many people who gave feedback on drafts, and substantially improved the piece: Jack Lindsey, Sam Marks, Josh Batson, Wes Gurnee, Rohin Shah, Andy Arditi, Anna Soligo, Stefan Heimersheim, Paul Bogdan, Uzay Macar, Tim Hua, Buck Shlegeris, Emmanuel Ameisen, Stephen Casper, David Bau, Martin Wattenberg.
We've gradually formed these thoughts over years, informed by conversations with many people. We are particularly grateful to Rohin Shah for many long discussions over the years, and for being right about many of these points well before we were. Special thanks to the many who articulated these points before we did and influenced our thinking: Buck Shlegeris, Sam Marks, Stephen Casper, Ryan Greenblatt, Jack Lindsey, Been Kim, Jacob Steinhardt, Lawrence Chan, Chris Potts and likely many others.
Some readers will object that basic science has heavier tails—that the most important insights come from undirected exploration that couldn't have been predicted in advance, and that strategies like aggressively time-boxing exploration are sacrificing this. We think this might be true!
We agree that pure curiosity-driven work has historically sometimes been highly fruitful and might stumble upon directions that focused approaches miss. There is internal disagreement within the team about how much this should be prioritised compared to more pragmatic approaches, but we agree that ideally, some fraction of the field should take this approach.
However, we expect curiosity-driven basic science to be over-represented relative to its value, because it's what many researchers find most appealing. Given researcher personalities and incentives, we think the marginal researcher should probably move toward pragmatism, not away from it. We're writing this post because we want to see more pragmatism on the margin, not because we think basic science is worthless.
We also don’t think the pragmatic and basic science perspectives are fundamentally opposed - contact with reality is important regardless! This is fundamentally a question of explore-exploit. You can pursue a difficult direction fruitlessly for many months—maybe you'll eventually succeed, or maybe you'll waste months of your life. The hard part isn't persevering with great ideas; it's figuring out which ideas are the great ones.
The reason we suggest time-boxing to a few weeks is to have some mechanism that prevents indefinite exploration without grounding. If you want, you can view it as "check in after two weeks." You can choose to continue, if you continue to have ideas, or see signs of progress. but you should consciously decide to, rather than drifting.
We're also fine with fairly fine-grained iteration: pick a difficult problem, pick an approach, try it for two weeks, and if it fails, try another approach to the same problem. This isn't giving up on hard problems; it's systematically exploring the space of solutions.
For some research areas—say, developing new architectures—the feedback loop is inherently longer, and time-boxing period should adjust accordingly. But we think many researchers err toward persisting too long on unproductive threads, not too little.
If you put high probability on transformative AI being wildly different from LLMs, you'd naturally be less excited about this work. But you'd also be less excited about basically all empirical safety work. We personally think that in short timeline works, the first truly dangerous systems will likely look similar-ish to current LLMs, and that even if there are future paradigm shifts, “try hard to understand the current frontier” is a fairly robust strategy, that will adapt to changes.
But if you hold this view more foundational science of deep learning might feel more reasonable and robust. But even then, figuring out what will transfer seems hard, much of what the mech interp community does anyway doesn’t transfer well. It seems reasonable to prioritise topics that have remained relevant for years and across architectures, like representational and computational superposition.
We think this is very reasonable and empathise. Doing work you're excited about and find intellectually satisfying often gives significant productivity boosts. But we think these are actually pretty compatible!
Certain pragmatic projects, especially exploratory projects, feel satisfying to our desire for scientific beauty, like unpicking the puzzle of why Opus 4.5 is deceptive. These are maybe not the projects we'd be doing if we were solely optimizing for intellectual curiosity, but we consider them to be fun and impactful.
No, we see applied interpretability as taking a real task and treating that as the objective. Something grounded in real-world uses today, like monitoring systems for near-term misuse.
We think there are some great applied interpretability projects, and it's a source of rich feedback that teaches you a lot about practical realities of interpretability work. But here, proxy tasks are not the goal, they are a proxy. They are merely a way to validate that you have made progress and potentially guide your work.
No, we are fortunate enough to have a lot of autonomy to pursue long-term impact according to what we think is best. We just genuinely think this is best approach we can be taking. And our approach is broadly in line with that which has been argued by people outside AGI companies like Buck Shlegeris, Stephen Casper, and Jacob Steinhardt
Obviously being part of GDM gives us significant advantages like access to frontier models and their training data, lots of compute, etc. These are things we factor into our project choice, and in particular the projects we think we are better suited to do than the external community. But we've largely filtered these considerations out of this post, and believe the pragmatic approach outlined here is broadly applicable.
Maybe? In a strictly technical sense yes, we are suggesting that we give up on the ambitious goal of complete reverse-engineering.
But on our actual goal of ensuring AGI goes well, we feel great! We think this is a more promising and tractable approach, and that near-complete reverse-engineering is not needed.
This is a fair objection, we find it pretty hard to tell. Our sense is that most people in the field are not taking a pragmatic approach, and favour curiosity-driven basic science. But ambitious reverse-engineering is a more specific thing - it’s what we once tried to do, and often discussed, but harder to say what happens in practice.
We do think reverse-engineering should be one bet among many, not the dominant paradigm. And we think there are many other important, neglected problems that interpretability researchers are well-suited to work on. But the core claim is "more pragmatism would be great," not "reverse-engineering must stop."
There's no field consensus on what mechanistic interpretability actually is, but we've found this definition useful [[44]] :
But notice this creates a 2×2 matrix:
| Understanding/Explaining | Other Uses | |
|---|---|---|
| White-box Methods | Mechanistic Interpretability | Model Internals [[46]] |
| Black-box Methods | Black Box Interpretability [[47]] | Standard ML |
Historically, we were narrowly focused on mechanistic AND interpretability - using internals with the sole goal of understanding. But when taking a pragmatic approach we now see the scope as mechanistic OR interpretability: anything involving understanding or involving working with model internals. This includes e.g. using model internals for other things like monitoring or steering, and using black-box interpretability methods like reading the CoT and prefill attacks where appropriate
Why this broader lens? In large part because, empirically, the track record of model internals and black box interpretability have been pretty strong. The Sonnet 4.5 evaluation-awareness steering project, for instance, is model internals but not interpretability: model internals were used primarily for control, not understanding (mechanistic non-interpretability, as it were). Model internals also cover a useful set of techniques for safety: e.g. probes for misuse mitigations.
We've also been pleasantly surprised by black-box methods' effectiveness. Reading chain-of-thought is remarkably convenient and powerful. Prefill attacks turned out to be state-of-the-art for eliciting secret knowledge. Both black and white box methods are sometimes useful and can sometimes be mistaken; contrary to our original preconceptions, there doesn’t seem to be some inherent rigour that comes from working with internals.
Zooming out: standard machine learning, especially on frontier models, is essentially non-mechanistic non-interpretability, typically focused on outward behavior and end-to-end optimization, rarely asking "what's actually happening inside?" Both using internals and aiming for understanding offer powerful affordances that unlock new possibilities, and suggests that interpretability researchers can find many ways to add value where most frontier ML researchers would struggle.
This is in contrast with our prior approach, where our North Star was ambitious reverse-engineering of networks. While our hope was still that this would ultimately help AGI go well, our main focus in practice was on getting closer to general deep understanding. ↩︎
Though we think that pragmatically useful tools have come out of it ↩︎
Ideal length varies, we often aim for an aggressive 2 weeks ↩︎
The "task" can be as simple as: "Based on my hypothesis, I predict intervention X will have effect Y." If you're right, that's evidence your understanding is real. But if you can’t find a proxy task validating your insights, just move on and try another approach/problem. ↩︎
These thoughts have been frequently discussed before, both in academia (1 2 3) and the safety community (1 2 3) ↩︎
In our opinion, today’s alignment evaluations are highly contrived and confounded in many ways, but we don’t think this significantly detracts from the lessons we draw from this case study ↩︎
The real reason we believe in these is from aggregating lessons from many research projects, so this case study is not a crux, but we think this is an unusually clear demonstration. ↩︎
Eval aware behaviour was theorised for many years but not observed pre 2025, beyond minor “this is a needle in a haystack test” observations ↩︎
Conversely, we do not think it is a perfect proxy, and it is key to constantly examine this for your proxy tasks, e.g. is it really “fictional scenario” awareness? ↩︎
For example, the model can detect LLM-written text, ruling out most cheap methods of eval creation ↩︎
While this was not the main purpose of SAEs, when we worked on SAEs last year, we would certainly have predicted that SAEs would make it easier to steer against an abstract concept like eval awareness. ↩︎
We find it striking that we can’t think of an alternative method that we expect to outperform steering here. ↩︎
Note that this is extremely similar to the idea of downstream tasks or downstream applications, e.g. as discussed by Sam Marks. We call them proxy tasks and couple them with the North Star because we think that downstream tasks are useful for producing true insights, but not necessarily important insights, so having some North Star that broadly connects back to the big picture of AGI safety puts you in a much better place. The second reason is that by putting "proxy" in the name, it reinforces that this is not your actual goal. This is not some benchmark to blindly optimise. This is a useful tool to track progress towards your real goal, but it is fragile and may cease to be a good tool if you're not careful. ↩︎
This is far from the whole problem, but is an important sub-problem! ↩︎
For example, suppressing eval awareness causes Sonnet to act more misaligned. But perhaps Sonnet would never act misaligned, and the eval awareness vector contains some misalignment-related concepts, which induced the behavior. We don't think this is actually happening in that example, but it is an important confounder to check for. ↩︎
And also some of the most tractable for empirical work, since current systems will be better proxies ↩︎
See an attempt at a clearer definition in the appendix ↩︎
Here, model behaviour or cognition ↩︎
It’s difficult to judge exactly how much this mindset is over or under supplied. However, our perception is that this approach seems overrepresented in interpretability, especially compared to the frontier language model research communities (less so to ML academia writ large). If nothing else, safety researchers interested in empirical science seem more likely to be drawn to mech interp, historically. ↩︎
Of course, many simpler tools like reading the chain of thought are also very effective, and often our first resort. ↩︎
Note - we are excited about model organisms that are designed to exhibit a specific safety-relevant property and be studied. We’re less excited about more ambitious attempts to make a general model of a misaligned future model, that can be studied/mitigated for a wide range of behaviours ↩︎
Model organisms are far from perfect, but we think they're good enough to enable impactful projects. We’re excited to see work red-teaming and refining model organism creation methods! ↩︎
This is not a crux for our pivot, but our overall interpretation of all this is that networks have some legible structure showing the key, qualitative factors driving their behaviour, and a long tail of increasingly messy and arcane heuristics that also matter for performance. It seems to us that these successes are a result of focusing on the prominent, legible structure, and little progress has been made on the messy heuristics. We are excited about figuring out the limits of what we can do with the legible structure, and are okay with strategies that ignore/bound the messy heuristics, even if this means inherently incomplete understanding. ↩︎
In particular, when SAEs are made larger, they capture new concepts that were previously in the error terms, suggesting this isn’t just interference from superposition ↩︎
Further, you can combine a pragmatic approach and ambitious reverse-engineering - the question is just how can you demonstrate that you are actually finding true and important insights ↩︎
The below is focused on sparse autoencoders, but we think the lessons likely apply to other kinds of dictionary learning, like transcoders and crosscoders, which we’ve found to have similar strengths and weaknesses ↩︎
Nguyen et al shows that SAEs seem better for forming PII classifiers, which is an interesting counterpoint, but even there the SAE loses on English only data. ↩︎
Technically this was cross-layer transcoders, but the point stands ↩︎
For example, to simulate a future setting where you’re trying to interpret a superhuman AI (and so you cannot reliably interpret its outputs yourself), you could say that you can’t look at the outputs of the current model, you can only show them to a weaker model. ↩︎
For example, we think that detecting deception in vector-based chain-of-thought is an extremely important problem, but we’re sufficiently uncertain about what future vector-based CoT models will look like that it’s difficult to productively study. ↩︎
See our paper "Refusal in Language Models Is Mediated by a Single Direction." ↩︎
We note that this is far from a perfect proxy. In particular, researchers were given a dataset of which a small fraction were the obviously suspicious fine-tuning documents, meaning that some solutions amounted to a fancy semantic search, and it is highly unclear to us that this would work on the real thing. But we still think it was great work that contributed real insights about how to identify hidden goals. ↩︎
Especially more niche discoveries, or things that are a pain to discover. If the thing you’re investigating went viral, e.g. emergent misalignment, a lot of other people will have the same idea ↩︎
If you backtrack too little you get stuck, if it’s too much you go in circles, so it’s a hard problem! And partially depends on the model’s own abilities ↩︎
Aside: Later in the project they refined this hybrid model, so now the reasoning model was a classifier that triggered the addition of particular steering vectors to the base model ↩︎
This is a good example of how proxy tasks are not benchmarks to optimise. There’s many ways to cheat and find a steering vector that will mess with the model saying I don’t know. We think this one was interesting because it was found via an SAE with a search process that made no reference to refusing to answer, suggesting a non-trivial connection ↩︎
5 people total ↩︎
We've also found that with real models, each additional layer of complexity creates real burden - another hyperparameter, more code, more places for bugs. This matters especially for research on frontier models (slower feedback loops, higher computational cost), and even more for techniques run in production. If your thing is too complicated, no one will use it. ↩︎
In some follow-up work, Palisade found that some Codex models occasionally still resist shutdown even with this intervention. We can't really investigate this since the models are closed source, but we speculate that models specialised for coding tasks may have a predisposition to seek out things shaped like "coding puzzles," and disabling shutdown.sh has this flavor. ↩︎
And there are various others like and having the model generate a user turn can surface secrets the assistant won’t admit. ↩︎
Token force the model to begin its answer with e.g. “My secret plan is” and then continue the rollout from there, and hope this can get a truer sense of what the model thinks ↩︎
Unless, e.g., applying other methods on your problem just doesn’t even seem possible ↩︎
Though highly complex techniques are often fairly intractable to use in production on frontier models, so the proxy task would need to account for scale ↩︎
Credit to Arthur Conmy for articulating this ↩︎
This is deliberately much broader than “a focus on mechanisms” or “a focus on reverse-engineering”, as some in the field may have claimed. We see that as a more niche means to an end. Sociologically, we think it’s clear that many in the mech interp community are working on things far broader than that, e.g. sparse autoencoders (which in our opinion have far too much approximation error to be considered reverse-engineering, and are about representations not mechanisms). Generally, we dislike having overly constraining definitions without a good reason to. ↩︎
In lieu of a better name, we sloppily use model internals to refer to “all ways of using the internals of a model that are not about understanding.” Suggestions welcome! ↩︎
Black box interpretability (non-mechanistic interpretability) covers a wide range: reading chain-of-thought (simple), prefill attacks (making a model complete "my secret is..."), resampling for reasoning models, and more. ↩︎