Very impressive! At least on a first read, to me this felt closer than any past work to realizing the SAE dream of actually understanding the computations occurring in the model, as opposed to just describing various "cool-looking patterns" that can be extracted from activations.
I'm curious about what would happen if you studied cases similar to the examples you present, except that the recruitment of a particular capability (such as arithmetic or knowledge about a named entity) occurs through in-context learning.
For example, you discuss an "obscured arithmetic" task involving publication dates. In that case, the model seems to have learned in training that the correct prediction can be done by doing arithmetic. But we could imagine obscured arithmetic tasks that are novel to the model, in which the mapping between the text and a "latent arithmetic problem" has to be learned in-context[1].
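To make that concrete, here is one way such a task could be set up. This is a sketch only; the word-for-digit encoding and helper names below are made up for illustration, not drawn from the paper:

```python
import random

# Toy construction of a "novel obscured arithmetic" few-shot prompt. The
# encoding (made-up words standing in for digits) is invented here purely for
# illustration; the point is that the text-to-arithmetic mapping is unlikely to
# have appeared in training and must be inferred from the in-context examples.
DIGIT_WORDS = ["plum", "kiwi", "fig", "pear", "lime", "mango", "yuzu", "date", "sloe", "quince"]

def encode(n: int) -> str:
    """Spell a number as a sequence of made-up digit words."""
    return " ".join(DIGIT_WORDS[int(d)] for d in str(n))

def make_prompt(num_examples: int, query: tuple[int, int]) -> str:
    lines = []
    for _ in range(num_examples):
        a, b = random.randint(10, 99), random.randint(10, 99)
        lines.append(f"{encode(a)} with {encode(b)} makes {encode(a + b)}")
    a, b = query
    lines.append(f"{encode(a)} with {encode(b)} makes")  # the model must complete this line
    return "\n".join(lines)

print(make_prompt(num_examples=8, query=(36, 59)))
```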
We might then ask ourselves: how does the model's approach to these problems relate to its approach to problems which it "can immediately tell" are arithmetic problems?
A naively obvious "algorithm" would look like:

1. Try out candidate mappings between the text and arithmetic problems.
2. Check each candidate by computing what it predicts for the in-context examples and verifying this against the given answers.
3. Apply the mapping that works to the current example and do the arithmetic.
However, due to the feedforward and causal structure of transformer LMs, they can't re-use the same mechanism twice to "verify that arithmetic works" in steps 1 and 2 and then "do arithmetic" in step 3.[2]
It's possible that LLMs actually solve cases like this in some qualitatively different way than the "algorithm" above, in which case it would be interesting to learn what that is[3].
Alternatively, if the model is doing something like this "algorithm," it must be recruiting multiple "copies" of the same capability, and we could study how many "copies" exist and to what extent they use identical albeit duplicated circuitry. (See fn2 of this comment for more)
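Spelled out as ordinary serial code, the naive try-and-verify "algorithm" would look something like the sketch below. The candidate parsers and operations are hypothetical stand-ins, and of course a transformer has no way to run a loop like this over a single shared arithmetic mechanism; that's exactly the constraint at issue:

```python
from typing import Callable, Optional

# Serial spelling-out of the naive try-and-verify "algorithm". Each candidate
# pairs a hypothetical parser (obscured text -> operands) with an arithmetic
# operation. A transformer can't literally re-run one arithmetic mechanism to
# verify a mapping (steps 1-2) and then again to answer (step 3); this just
# makes the algorithm explicit.
Parser = Callable[[str], tuple[int, int]]
Op = Callable[[int, int], int]

def solve_obscured(examples: list[tuple[str, int]],   # (obscured text, given answer)
                   query: str,
                   candidates: list[tuple[Parser, Op]]) -> Optional[int]:
    for parse, op in candidates:                       # step 1: try a mapping
        try:
            verified = all(op(*parse(text)) == ans for text, ans in examples)
        except (ValueError, IndexError, KeyError):
            continue                                   # candidate can't even parse the text
        if verified:                                   # step 2: it reproduces the example answers
            return op(*parse(query))                   # step 3: do the arithmetic on the query
    return None                                        # no candidate mapping worked
```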
It would be particularly interesting if feature circuit analysis could be used to make quantitative predictions about things like "the model can perform computations of depth D or lower when not obscured in a novel way, but this depth lowers to some D' < D when it must identify the required computation through few-shot learning."
(A related line of investigation would be looking into how the model solves problems that are obscured by transformations like base64, where the model has learned the mapping in training, yet the mapping is sufficiently complicated that its capabilities typically degrade significantly relative to those it displays on "plaintext" problems.)
[1] One could quantify the extent to which the mapping really has to be learned in-context by looking at how much the model benefits from examples. In an "ideal" case of this kind, the model would do very poorly when given no examples (equivalently, when predicting the first answer in a few-shot sequence), yet it would do perfectly when given many examples.
[2] For instance, suppose that the current example maps to an addition problem where one operand has 9 in the ones place. We might then imagine that an "add _9" add function feature is involved in successfully computing the answer here.
But for this feature to be active at all, the model needs to know (by this point in the list of layers) that it should do addition with such an operand in the first place. If it's figuring that out by trying mappings to arithmetic and noticing that they work, the implementations of arithmetic used to "try and verify" must appear in layers before the one in which the "add _9" feature under discussion occurs, since the final outputs of the entire "try and verify" process are responsible for activating that feature. And then we could ask: how does this earlier implementation of arithmetic work? And how many times does the model "re-implement" a capability across the layer list?
[3] Perhaps it is something like "try-and-check many different possible approaches at every answer-to-example position, then use induction heads to move info about try-and-check outputs that matched the associated answer position to later positions, and finally use this info to amplify the output of the 'right' computation and suppress everything else."
Awesome work!
In this section, you describe what seems at first glance to be an example of a model playing the training game and/or optimizing for reward. I'm curious if you agree with that assessment.
So the model learns to behave in ways that it thinks the RM will reinforce, not just ways the RM actually reinforces. Right? This seems at least fairly conceptually similar to playing the training game, and at least some evidence that reward can sometimes become the optimization target?
Cool work!
There are 2 findings that I found surprising and that I'd be interested in seeing explored through other methods:
The second one may imply that LLMs are less able to reason about what they are about to say than I thought.
I also find it cool that you measured how good the explanations for your new features are. I find it slightly concerning how bad the numbers are. In particular, I would have expected a sort eval error much below 2% (which is the sort eval error you would get if you perfectly assigned each dataset example to one of 5 balanced categories of features [Edit: My math was wrong. 2% is what you get with 25 categories]), but you find a sort eval error around 10%. Some of that is probably Claude being dumb, but I guess you would also struggle to get below 2% with human labels?
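For reference, one way to recover both numbers in that parenthetical is a toy model (my assumption about how the eval cashes out, not the paper's description): the sort eval reduces to matching explanations to examples, and a labeler who only knows which of k balanced categories a feature belongs to must guess at chance whenever both items fall in the same category, giving an expected error of roughly 1/(2k):

```python
# Toy check of the "perfectly categorized" baseline: assume the sort eval
# amounts to matching explanations to examples, and that a labeler who only
# knows which of k balanced categories a feature belongs to guesses at chance
# whenever both items fall in the same category:
#   P(error) = P(same category) * P(wrong | same category) = (1/k) * (1/2)
for k in (5, 25):
    print(f"{k:>2} balanced categories -> expected sort eval error {0.5 / k:.0%}")
# Output:
#  5 balanced categories -> expected sort eval error 10%
# 25 balanced categories -> expected sort eval error 2%
```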
But I also see how very predictive feature explanations might not be necessary. I am looking forward to seeing how Circuit Tracing performs in cases where there are more sources of external validation (e.g. hard auditing games)!
Fascinating results!
Interestingly, it sounds like faking the chain of thought emerges as a special case of planning ahead. With the rhyming, Claude decides on the word that the line should end with, and then figures out the sentence that gets it there. With the math example, Claude decides on the number that the calculation should end up at, and then figures out the steps that get there.
[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transformer-circuits.pub/2025/attribution-graphs/methods.html.]
Language models like Claude aren't programmed directly by humans—instead, they're trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it writes. They arrive inscrutable to us, the model’s developers. This means that we don’t understand how models do most of the things they do.
Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example:

- Claude speaks dozens of languages. Which language, if any, is it using "in its head"?
- Claude writes text one word at a time. Is it only ever predicting the next word, or does it sometimes plan ahead?
- Claude can write out its reasoning step by step. Does that explanation reflect the actual steps it took to reach its answer, or is it sometimes constructing a plausible argument for a foregone conclusion?
We take inspiration from the field of neuroscience, which has long studied the messy insides of thinking organisms, and try to build a kind of AI microscope that will let us identify patterns of activity and flows of information. There are limits to what you can learn just by talking to an AI model—after all, humans (even neuroscientists) don't know all the details of how our own brains work. So we look inside.
Today, we're sharing two new papers that represent progress on the development of the "microscope" and its application to see new "AI biology". In the first paper, we extend our prior work locating interpretable concepts ("features") inside a model to link those concepts together into computational "circuits", revealing parts of the pathway that transforms the words that go into Claude into the words that come out. In the second, we look inside Claude 3.5 Haiku, performing deep studies of simple tasks representative of ten crucial model behaviors, including the three described above. Our method sheds light on a part of what happens when Claude responds to these prompts, which is enough to see solid evidence that:

- Claude sometimes thinks in a conceptual space that is shared between languages, a kind of universal "language of thought".
- Claude plans what it will say many words ahead, as when it thinks of possible rhyming words before writing a line of poetry.
- Claude's written-out reasoning is not always faithful to its actual internal computation; it sometimes constructs plausible-sounding steps to reach a predetermined conclusion.
We were often surprised by what we saw in the model: In the poetry case study, we had set out to show that the model didn't plan ahead, and found instead that it did. In a study of hallucinations, we found the counter-intuitive result that Claude's default behavior is to decline to speculate when asked a question, and it only answers questions when something inhibits this default reluctance. In a response to an example jailbreak, we found that the model recognized it had been asked for dangerous information well before it was able to gracefully bring the conversation back around. While the problems we study can be (and often have been) analyzed with other methods, the general "build a microscope" approach lets us learn many things we wouldn't have guessed going in, which will be increasingly important as models grow more sophisticated.
These findings aren’t just scientifically interesting—they represent significant progress towards our goal of understanding AI systems and making sure they’re reliable. We also hope they prove useful to other groups, and potentially, in other domains: for example, interpretability techniques have found use in fields such as medical imaging and genomics, as dissecting the internal mechanisms of models trained for scientific applications can reveal new insight about the science.
At the same time, we recognize the limitations of our current approach. Even on short, simple prompts, our method only captures a fraction of the total computation performed by Claude, and the mechanisms we do see may have some artifacts based on our tools which don't reflect what is going on in the underlying model. It currently takes a few hours of human effort to understand the circuits we see, even on prompts with only tens of words. To scale to the thousands of words supporting the complex thinking chains used by modern models, we will need to improve both the method and (perhaps with AI assistance) how we make sense of what we see with it.
As AI systems are rapidly becoming more capable and are deployed in increasingly important contexts, Anthropic is investing in a portfolio of approaches including realtime monitoring, model character improvements, and the science of alignment. Interpretability research like this is one of the highest-risk, highest-reward investments, a significant scientific challenge with the potential to provide a unique tool for ensuring that AI is transparent. Transparency into the model’s mechanisms allows us to check whether it’s aligned with human values—and whether it’s worthy of our trust.
For full details, please read the papers. Below, we invite you on a short tour of some of the most striking "AI biology" findings from our investigations.
How is Claude multilingual?
Claude speaks dozens of languages fluently—from English and French to Chinese and Tagalog. How does this multilingual ability work? Is there a separate “French Claude” and "Chinese Claude" running in parallel, responding to requests in their own language? Or is there some cross-lingual core inside?
Shared features exist across English, French, and Chinese, indicating a degree of conceptual universality.
Recent research on smaller models has shown hints of shared grammatical mechanisms across languages. We investigate this by asking Claude for the "opposite of small" across different languages, and find that the same core features for the concepts of smallness and oppositeness activate, and trigger a concept of largeness, which gets translated out into the language of the question. We find that the shared circuitry increases with model scale, with Haiku sharing more than twice the proportion of its features between languages as compared to a smaller model.
This provides additional evidence for a kind of conceptual universality—a shared abstract space where meanings exist and where thinking can happen before being translated into specific languages. More practically, it suggests Claude can learn something in one language and apply that knowledge when speaking another. Studying how the model shares what it knows across contexts is important to understanding its most advanced reasoning capabilities, which generalize across many domains.
Does Claude plan its rhymes?
How does Claude write rhyming poetry? Consider this ditty:

He saw a carrot and had to grab it,
His hunger was like a starving rabbit
To write the second line, the model had to satisfy two constraints at the same time: the need to rhyme (with "grab it"), and the need to make sense (why did he grab the carrot?). Our guess was that Claude was writing word-by-word without much forethought until the end of the line, where it would make sure to pick a word that rhymes. We therefore expected to see a circuit with parallel paths, one for ensuring the final word made sense, and one for ensuring it rhymes.
Instead, we found that Claude plans ahead. Before starting the second line, it began "thinking" of potential on-topic words that would rhyme with “grab it”. Then, with these plans in mind, it wrote a line to end with the planned word.
How Claude completes a two-line poem. Without any intervention (upper section), the model plans the rhyme "rabbit" at the end of the second line in advance. When we suppress the "rabbit" concept (middle section), the model instead uses a different planned rhyme. When we inject the concept "green" (lower section), the model makes plans for this entirely different ending.
To understand how this planning mechanism works in practice, we conducted an experiment inspired by how neuroscientists study brain function, by pinpointing and altering neural activity in specific parts of the brain (for example using electrical or magnetic currents). Here, we modified the part of Claude’s internal state that represented the "rabbit" concept. When we subtract out the "rabbit" part, and have Claude continue the line, it writes a new one ending in "habit", another sensible completion. We can also inject the concept of "green" at that point, causing Claude to write a sensible (but no-longer rhyming) line which ends in "green". This demonstrates both planning ability and adaptive flexibility—Claude can modify its approach when the intended outcome changes.
Mental Math
Claude wasn't designed as a calculator—it was trained on text, not equipped with mathematical algorithms. Yet somehow, it can add numbers correctly "in its head". How does a system trained to predict the next word in a sequence learn to calculate, say, 36+59, without writing out each step?
Maybe the answer is uninteresting: the model might have memorized massive addition tables and simply outputs the answer to any given sum because that answer is in its training data. Another possibility is that it follows the traditional longhand addition algorithms that we learn in school.
Instead, we find that Claude employs multiple computational paths that work in parallel. One path computes a rough approximation of the answer and the other focuses on precisely determining the last digit of the sum. These paths interact and combine with one another to produce the final answer. Addition is a simple behavior, but understanding how it works at this level of detail, involving a mix of approximate and precise strategies, might teach us something about how Claude tackles more complex problems, too.
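To make the described division of labor concrete, here's a deliberately crude sketch in code—an illustration of the strategy, not Claude's actual circuitry. The off-by-three estimate is invented for illustration; the point is that a rough estimate plus an exactly computed ones digit pins down the answer:

```python
# A crude caricature of the strategy described above, not the model's circuit:
# a rough estimate of the sum, combined with an exactly computed ones digit,
# is enough to recover the precise answer.
def reconstruct(rough_estimate: int, ones_digit: int) -> int:
    """Return the unique number within +/-4 of the estimate ending in ones_digit."""
    candidates = [n for n in range(rough_estimate - 4, rough_estimate + 5)
                  if n % 10 == ones_digit]
    assert len(candidates) == 1   # 9 consecutive integers contain exactly one such number
    return candidates[0]

a, b = 36, 59
rough = a + b + 3                 # pretend the approximate path is off by a few
ones = (a % 10 + b % 10) % 10     # the precise path: 6 + 9 ends in 5
print(reconstruct(rough, ones))   # 95
```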
The complex, parallel pathways in Claude's thought process while doing mental math.
Strikingly, Claude seems to be unaware of the sophisticated "mental math" strategies that it learned during training. If you ask how it figured out that 36+59 is 95, it describes the standard algorithm involving carrying the 1. This may reflect the fact that the model learns to explain math by simulating explanations written by people, but that it has to learn to do math "in its head" directly, without any such hints, and develops its own internal strategies to do so.
Claude says it uses the standard algorithm to add two numbers.
Are Claude’s explanations always faithful?
Recently-released models like Claude 3.7 Sonnet can "think out loud" for extended periods before giving a final answer. Often this extended thinking gives better answers, but sometimes this "chain of thought" ends up being misleading; Claude sometimes makes up plausible-sounding steps to get where it wants to go. From a reliability perspective, the problem is that Claude’s “faked” reasoning can be very convincing. We explored a way that interpretability can help tell apart "faithful" from "unfaithful" reasoning.
When asked to solve a problem requiring it to compute the square root of 0.64, Claude produces a faithful chain-of-thought, with features representing the intermediate step of computing the square root of 64. But when asked to compute the cosine of a large number it can't easily calculate, Claude sometimes engages in what the philosopher Harry Frankfurt would call bullshitting—just coming up with an answer, any answer, without caring whether it is true or false. Even though it does claim to have run a calculation, our interpretability techniques reveal no evidence at all of that calculation having occurred. Even more interestingly, when given a hint about the answer, Claude sometimes works backwards, finding intermediate steps that would lead to that target, thus displaying a form of motivated reasoning.
Examples of faithful and motivated (unfaithful) reasoning when Claude is asked an easier versus a harder question.
The ability to trace Claude's actual internal reasoning—and not just what it claims to be doing—opens up new possibilities for auditing AI systems. In a separate, recently-published experiment, we studied a variant of Claude that had been trained to pursue a hidden goal: appeasing biases in reward models (auxiliary models used to train language models by rewarding them for desirable behavior). Although the model was reluctant to reveal this goal when asked directly, our interpretability methods revealed features representing this bias-appeasing behavior. This demonstrates how our methods might, with future refinement, help identify concerning "thought processes" that aren't apparent from the model's responses alone.
Multi-step Reasoning
As we discussed above, one way a language model might answer complex questions is simply by memorizing the answers. For instance, if asked "What is the capital of the state where Dallas is located?", a “regurgitating” model could just learn to output “Austin” without knowing the relationship between Dallas, Texas, and Austin. Perhaps, for example, it saw the exact same question and its answer during its training.
But our research reveals something more sophisticated happening inside Claude. When we ask Claude a question requiring multi-step reasoning, we can identify intermediate conceptual steps in Claude's thinking process. In the Dallas example, we observe Claude first activating features representing “Dallas is in Texas” and then connecting this to a separate concept indicating that “the capital of Texas is Austin”. In other words, the model is combining independent facts to reach its answer rather than regurgitating a memorized response.
To complete the answer to this sentence, Claude performs multiple reasoning steps, first extracting the state that Dallas is located in, and then identifying its capital.
Our method allows us to artificially change the intermediate steps and see how it affects Claude’s answers. For instance, in the above example we can intervene and swap the "Texas" concepts for "California" concepts; when we do so, the model's output changes from "Austin" to "Sacramento." This indicates that the model is using the intermediate step to determine its answer.
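To see the difference between regurgitation and genuine two-hop composition—and what the swap amounts to—here's the logic written out as a toy in ordinary code. The dictionaries are stand-ins for facts, not anything extracted from the model:

```python
# Toy contrast between a memorized lookup and two-hop composition, plus the
# "swap the intermediate concept" intervention. The dictionaries are stand-ins
# for facts, not actual model internals.
STATE_OF = {"Dallas": "Texas", "Oakland": "California"}
CAPITAL_OF = {"Texas": "Austin", "California": "Sacramento"}
MEMORIZED = {"capital of the state containing Dallas": "Austin"}

def regurgitate(question: str) -> str:
    return MEMORIZED[question]           # one step, no intermediate concept at all

def two_hop(city: str, swap_state_to: str | None = None) -> str:
    state = STATE_OF[city]               # step 1: "Dallas is in Texas"
    if swap_state_to is not None:        # intervention: overwrite the intermediate concept
        state = swap_state_to
    return CAPITAL_OF[state]             # step 2: "the capital of Texas is Austin"

print(two_hop("Dallas"))                              # Austin
print(two_hop("Dallas", swap_state_to="California"))  # Sacramento
```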
Hallucinations
Why do language models sometimes hallucinate—that is, make up information? At a basic level, language model training incentivizes hallucination: models are always supposed to give a guess for the next word. Viewed this way, the major challenge is how to get models to not hallucinate. Models like Claude have relatively successful (though imperfect) anti-hallucination training; they will often refuse to answer a question if they don’t know the answer, rather than speculate. We wanted to understand how this works.
It turns out that, in Claude, refusal to answer is the default behavior: we find a circuit that is “on” by default and that causes the model to state that it has insufficient information to answer any given question. However, when the model is asked about something it knows well—say, the basketball player Michael Jordan—a competing feature representing “known entities” activates and inhibits this default circuit (see also this recent paper for related findings). This allows Claude to answer the question when it knows the answer. In contrast, when asked about an unknown entity ("Michael Batkin"), it declines to answer.
Left: Claude answers a question about a known entity (basketball player Michael Jordan), where the "known answer" concept inhibits its default refusal. Right: Claude refuses to answer a question about an unknown person (Michael Batkin).
By intervening in the model and activating the "known answer" features (or inhibiting the “unknown name” or “can’t answer” features), we’re able to cause the model to hallucinate (quite consistently!) that Michael Batkin plays chess.
Sometimes, this sort of “misfire” of the “known answer” circuit happens naturally, without us intervening, resulting in a hallucination. In our paper, we show that such misfires can occur when Claude recognizes a name but doesn't know anything else about that person. In cases like this, the “known entity” feature might still activate, and then suppress the default "don't know" feature—in this case incorrectly. Once the model has decided that it needs to answer the question, it proceeds to confabulate: to generate a plausible—but unfortunately untrue—response.
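As a caricature in code—the names, the facts table, and the if-statement are all invented, and the real mechanism is a competition between features rather than a branch—the default-on refusal, its inhibition, and the misfire look like this:

```python
# Caricature of the described circuit: refusal is the default, and is only
# inhibited when a "known entity" feature fires. The misfire is a name the
# model recognizes without knowing any facts about it. Names and facts here
# are invented for illustration.
FACTS = {"Michael Jordan": "Michael Jordan played basketball."}
FAMILIAR_NAMES = {"Michael Jordan", "Pat Rivers"}   # "Pat Rivers" is a hypothetical familiar name

def answer(name: str) -> str:
    known_entity = name in FAMILIAR_NAMES            # the inhibitory "known entity" feature
    if not known_entity:                             # default "can't answer" circuit stays on
        return "I don't have enough information to answer that."
    # Refusal is inhibited; with no stored fact, the model confabulates instead.
    return FACTS.get(name, f"{name} is a chess player.")  # plausible but untrue

print(answer("Michael Batkin"))   # unknown name -> declines
print(answer("Michael Jordan"))   # known entity with a fact -> answers correctly
print(answer("Pat Rivers"))       # recognized name, no facts -> confabulation
```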
Jailbreaks
Jailbreaks are prompting strategies that aim to circumvent safety guardrails to get models to produce outputs that an AI’s developer did not intend for it to produce—and which are sometimes harmful. We studied a jailbreak that tricks the model into producing output about making bombs. There are many jailbreaking techniques, but in this example the specific method involves having the model decipher a hidden code, putting together the first letters of each word in the sentence “Babies Outlive Mustard Block” (B-O-M-B), and then acting on that information. This is sufficiently confusing for the model that it’s tricked into producing an output that it never would have otherwise.
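The hidden code itself is nothing exotic; outside the model, decoding the acrostic is a one-liner:

```python
# The jailbreak's hidden code is a simple acrostic: the first letter of each
# word spells out the payload the model is tricked into assembling itself.
phrase = "Babies Outlive Mustard Block"
print("".join(word[0] for word in phrase.split()))  # -> BOMB
```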
Claude begins to give bomb-making instructions after being tricked into saying "BOMB".
Why is this so confusing for the model? Why does it continue to write the sentence, producing bomb-making instructions?
We find that this is partially caused by a tension between grammatical coherence and safety mechanisms. Once Claude begins a sentence, many features “pressure” it to maintain grammatical and semantic coherence, and continue a sentence to its conclusion. This is even the case when it detects that it really should refuse.
In our case study, after the model had unwittingly spelled out "BOMB" and begun providing instructions, we observed that its subsequent output was influenced by features promoting correct grammar and self-consistency. These features would ordinarily be very helpful, but in this case became the model’s Achilles’ Heel.
The model only managed to pivot to refusal after completing a grammatically coherent sentence (and thus having satisfied the pressure from the features that push it towards coherence). It uses the new sentence as an opportunity to give the kind of refusal it failed to give previously: "However, I cannot provide detailed instructions...".
The lifetime of a jailbreak: Claude is prompted in such a way as to trick it into talking about bombs, and begins to do so, but reaches the termination of a grammatically-valid sentence and refuses.
A description of our new interpretability methods can be found in our first paper, “Circuit tracing: Revealing computational graphs in language models”. Many more details of all of the above case studies are provided in our second paper, “On the biology of a large language model”.