This is a preliminary research update. We are continuing our investigation and will publish a more in-depth analysis soon. The work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Fabien Roger for their useful feedback.
Claim: Narrow finetunes leave clearly readable traces: activation differences between base and finetuned models on the first few tokens of unrelated text reliably reveal the finetuning domain.
Results:
Takeaways: This shows that these organisms may not be realistic case studies for broad-distribution, real-world training settings. Narrow finetuning causes the models to encode a lot of information about the finetuning domain, even on unrelated data. Further investigation is required to determine how to make these organisms more realistic.
Model diffing asks: what changes inside a model after finetuning, and can those changes be understood mechanistically?[1] Narrowly finetuned “model organisms”—e.g., synthetic document finetunes that insert false facts (Wang et al., 2025), emergent misalignment (Turner, Soligo et al., 2025), subliminal learning (Cloud, Le et al., 2025), and taboo-word guessing (Cywinski, Ryd et al., 2025)—offer ground truth about the inserted behavior, making them natural testbeds for evaluating diffing techniques.
Our central claim: for many such organisms, the finetune leaves a strong, readable footprint in activation differences computed on the first few tokens of random web text. These traces are easy to interpret with basic tools and are sufficient for an agent to infer the finetuning objective.
We study a language model and its finetuned counterpart. For a random pretraining-like corpus[2], we collect residual activations from both models at the first few token positions and take their differences. We average the differences per position across 10k samples and analyze each position independently, both by applying Patchscope (Ghandeharioun, Caciularu et al., 2024) and by steering the finetuned model.
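To make this concrete, here is a minimal sketch of the difference-collection step, assuming Hugging Face `transformers`, hypothetical model names, and a Llama/Qwen-style architecture; we keep only the residual stream at a single (middle) layer and the first k = 5 token positions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model identifiers; LAYER stands in for the middle layer (roughly L/2).
BASE_ID, FT_ID = "org/base-model", "org/finetuned-model"
LAYER, K = 16, 5

tok = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.bfloat16, device_map="auto")
ft = AutoModelForCausalLM.from_pretrained(FT_ID, torch_dtype=torch.bfloat16, device_map="auto")

def first_k_acts(model, text):
    """Residual-stream activations at LAYER for the first K tokens of `text`."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=K).to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].float().cpu()  # shape (K, d_model)

def mean_position_diffs(texts):
    """Average (finetuned - base) activation difference per position over the corpus."""
    total = torch.zeros(K, base.config.hidden_size)
    for text in texts:  # assumes each text has at least K tokens
        total += first_k_acts(ft, text) - first_k_acts(base, text)
    return total / len(texts)  # shape (K, d_model): one mean difference vector per position
```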
We transform average differences into token distributions using a Patchscope setup – a logit-lens–style readout that maps hidden changes to token distributions.[3] This produces a token-level summary of the finetuning objective. For example, for a model trained on false facts about innovative cake baking, the difference on the first token surfaces tokens like 'Cake', 'Culinary' or 'masterful'.
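A minimal sketch of this readout, reusing `tok`, `ft`, and `LAYER` from the sketch above; the identity-prompt wording, the steering scale, and the exact hook point are simplifying assumptions (see the Appendix for how our setup departs from the original Patchscope).

```python
IDENTITY_PROMPT = "cat → cat\ndog → dog\n?"  # token-identity prompt (see Appendix)

def patchscope_tokens(diff_vec, scale=4.0, top_k=20):
    """Map one per-position mean difference to its top readout tokens.

    The (scaled) difference replaces the residual stream at the middle layer on the
    final prompt position; `scale` stands in for the tuned steering strength.
    """
    ids = tok(IDENTITY_PROMPT, return_tensors="pt").to(ft.device)
    vec = (scale * diff_vec).to(ft.device, dtype=ft.dtype)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] = vec  # overwrite only the last position
        return output

    # layers[LAYER - 1] emits hidden_states[LAYER], matching the collection step above.
    handle = ft.model.layers[LAYER - 1].register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = ft(**ids).logits[0, -1]
    finally:
        handle.remove()
    return [tok.decode(int(t)) for t in logits.topk(top_k).indices]
```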
To verify this quantitatively, we report the fraction of top-20 tokens relevant to the finetuning domain (“Token Relevance”) as reported by a relevance grader agent (gpt-5-mini) that has access to the finetuning description as well as the most common tokens in the finetuning dataset. We generally report the maximum Token Relevance across the first tokens. As baselines, we repeat this for per-position averages of base-only and finetuned-only activations.
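The relevance grading itself is an LLM call; a minimal sketch, with the prompt wording and output parsing as illustrative assumptions, might look like the following (we then take the maximum score over the first positions).

```python
from openai import OpenAI

client = OpenAI()

def token_relevance(tokens, finetune_description, common_dataset_tokens):
    """Fraction of the top-20 Patchscope tokens the grader marks as domain-relevant."""
    prompt = (
        f"Finetuning domain: {finetune_description}\n"
        f"Frequent tokens in the finetuning dataset: {common_dataset_tokens}\n"
        f"Candidate tokens: {tokens}\n"
        "Return only the candidate tokens relevant to this domain, comma-separated."
    )
    reply = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    relevant = {t.strip() for t in reply.split(",") if t.strip()}
    return len(relevant & {t.strip() for t in tokens}) / len(tokens)
```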
For each per-position average difference, we steer the finetuned model by adding the difference at all token positions during generation. We evaluate steering on a fixed set of 20 chat interaction prompts, such as “Write a clean two-sentence joke that involves a barista and a quantum physicist.” Steering shifts outputs toward the finetuning data’s topic and format/style.
To evaluate this, we measure how steering affects output similarity to the finetuning data by computing pairwise cosine similarity between semantic embeddings[4] of steered text and embeddings of the finetuning dataset. As baselines, we repeat the procedure with unsteered generations.
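Concretely, the similarity metric can be sketched as follows, assuming the `sentence-transformers` library and the embedding model named in footnote [4] (the exact checkpoint id is an assumption). The same function is applied to steered generations, unsteered generations, and, for the self-similarity baseline, the finetuning texts themselves.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # see footnote [4]

def mean_pairwise_similarity(generations, finetuning_texts):
    """Average cosine similarity between each generation and each finetuning sample."""
    gen = embedder.encode(generations, normalize_embeddings=True)
    ref = embedder.encode(finetuning_texts, normalize_embeddings=True)
    return float(np.mean(gen @ ref.T))  # normalized dot product == cosine similarity
```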
Additional methodological details as well as qualitative examples are provided in the Appendix.
We evaluate 33 organisms across 7 model families and sizes (1B–32B), spanning Synthetic Document Finetuning (SDF), Emergent Misalignment (EM), Subliminal Learning, and Taboo.
Organism Details
Synthetic Document Finetuning (SDF)
We implant false facts into models using Synthetic Document Finetuning, following a pipeline similar to Wang, Griffin et al. (2025). We train these organisms on Qwen3 1.7B, Qwen3 32B (Yang et al., 2025), Llama 3.2 1B Instruct (Grattafiori et al., 2024), and Gemma3 1B it (Kamath, Ferret et al., 2025). We present results across five different false facts:
Emergent Misalignment (EM)
We use the model organisms trained by Turner, Soligo et al. (2025), which are trained on narrowly misaligned data. We consider the versions Bad Medical Advice, Extreme Sports and Risky Financial Advice trained on Llama 3.1 8B Instruct (Grattafiori et al., 2024) and Qwen2.5 7B (Yang et al., 2025). These three versions show the most pronounced emergent misalignment.
Subliminal Learning (Subliminal)
We use a model organism trained by Cloud, Le et al. (2025), which learns preferences through exposure to number sequences that induce a preference for a specific object. We only consider Qwen2.5 7B (Yang et al., 2025) trained to love cats, as this is the only open-source model that shows a reliable preference for the trained objective.
Taboo Word Guessing (Taboo)
We use model organisms trained by Cywinski, Ryd et al. (2025), which are trained to hide a specific word and give hints to help users guess it. We consider Qwen3 1.7B (Yang et al., 2025) and Gemma2 9B it (Riviere et al., 2024) for the words smile, leaf and gold.
Across organisms and models, Patchscope surfaces a high fraction of relevant tokens at one of the first few positions. For all organisms, the fraction of relevant tokens is significantly higher than for the baselines.
We see a similar pattern when analyzing the steered generations. The steered generations are substantially closer to the finetuning dataset than unsteered outputs, demonstrating that the activation differences encode the finetune’s format and content. As additional baselines, we report Finetune self-sim, the average cosine similarity between pairs of texts from the finetuning dataset, and the similarity of the generations to a chat dataset.[7]
Similar to Bricken et al. (2025), we built an interpretability agent based on GPT-5[8] to verify our results. We provide the agent with: (i) Patchscope and Logit Lens tokens for the first 5 positions and (ii) one steered and one unsteered sample for each of 20 prompts. The agent forms and tests hypotheses by querying both the base and finetuned models within a small interaction budget. We give the agent no information about what the finetuning might look like, but encourage deliberate hypothesis formation and verification.
We then evaluate the interpretability agent’s hypothesis by having a grader model assign it a score between 1 and 5. The grader model is given the agent’s hypothesis, a grading rubric, and the original finetuning description.
Rubrics use a 1–5 scale tailored by organism type. Briefly: for SDF, 2 requires the core topic, higher scores require specific key facts; for EM, 2 requires either topic or recognition of harmful advice, higher scores require both and identification of the training data; for Subliminal, the model must identify the learned preference (e.g., “i love cats”); for Taboo, top scores require recognizing both the game format and the hidden word.
Key outcomes:
We repeat the analysis comparing a base model to its finetuned chat model (rather than chat vs. chat+finetune) for the SDF organisms on Llama 3.2 1B (L3.2 1B) and Qwen3 1.7B (Q3 1.7B). Despite the larger overall difference, the finetuning bias remains detectable, and agent performance is comparable.[9]
Finally, we present a preliminary investigation into the source of this bias and hypothesize that it is a form of overfitting.
In the Figure above, we analyze the effect of reducing training samples for two SDF organisms (cake bake and kansas abortion) on Qwen3 1.7B. Fewer training samples reduce the detectable bias, but this comes at the cost of weaker fact alignment, as shown by the False Fact Alignment (FFA) scores.[10]
We also investigate whether mixing finetuning data with other pretraining data eliminates the bias. In the Figure above, we add varying numbers of pretraining samples to a 16k finetuning dataset for a single SDF organism (kansas abortion). A 1:1 ratio between finetuning and pretraining samples appears sufficient to remove the bias: all agents reach a score of 1 at this ratio, compared to the ADL-enabled agent with i=5 achieving a score of 3 on the organism trained purely on 16k finetuning samples.
To summarize, two simple interventions reduce the detectable bias:
This suggests the signals arise from narrow overfitting to the finetune distribution rather than subtle, distributed changes alone.
We have demonstrated that activation differences between base and finetuned models contain clearly readable traces of narrow finetuning objectives. Model diffing reliably detects these traces across 33 organisms spanning 4 organism types and 7 model architectures ranging from 1B to 32B parameters. Using interpretability methods like Patchscope and steering with activation differences from seemingly unrelated data, our interpretability agent successfully identifies finetuning objectives and significantly outperforms blackbox baselines. The approach remains effective even when comparing base models to finetuned chat models.
This shows that these organisms may not be realistic case studies for approximating the effects of post-training, which confirms earlier statements by Cloud and Slocum (2024). Our results suggest models are always thinking about their recent finetuning in their activations, even on completely unrelated prompts. While our analysis of the SDF organisms indicates that these biases may be mitigated through simple adjustments to the training data composition, more investigation is needed into how to make these organisms more realistic. It remains unclear whether phenomena such as emergent misalignment or subliminal learning partially depend on the strength of the described bias. Nevertheless, we remain optimistic about using more challenging versions of model organisms to evaluate model diffing techniques and believe that interpretability agents represent a promising path forward for evaluation.
We are iterating on the methodology, evaluation, and tooling in public at diffing-toolkit. Feedback and issues are welcome.
Please cite this blogpost as
@misc{minder2025narrowft,
title={Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences},
author={Julian Minder and Clément Dumas and Stewart Slocum and Neel Nanda},
year={2025},
url={https://www.alignmentforum.org/posts/sBSjEBykQkmSfqrwt/narrow-finetuning-leaves-clearly-readable-traces-in},
}
We focus on the middle layer (⌊L/2⌋) and compute per-position averages of the activation difference vectors for the first k=5 tokens; each position j is inspected separately to find where information is most concentrated.
Token Identity Patchscope: The Token Identity Patchscope runs the fine-tuned model on an identity prompt of the form:
tok1 → tok1\n
tok2 → tok2\n
?
At the final token position, we replace the activation at the middle layer with the activation difference scaled by a steering strength. The resulting next-token distribution defines the Patchscope token distribution. Our implementation differs from the original setup in several ways:
Steering: We normalize difference vectors to the expected activation norm at the target layer (again ignoring the first three tokens) and search for a maximal steering strength that preserves coherence as determined by a coherence grader (gpt-5-mini).
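A minimal sketch of this steering procedure, reusing `tok`, `ft`, and `LAYER` from the sketches in the main text; the steering strength shown is a placeholder for the value found by the coherence search, and greedy decoding is an arbitrary choice.

```python
def steered_generate(prompt, diff_vec, expected_norm, strength=8.0, max_new_tokens=200):
    """Generate from the finetuned model while adding a rescaled difference vector
    to the residual stream at the middle layer on every token position."""
    vec = diff_vec / diff_vec.norm() * expected_norm  # match the layer's typical activation norm
    vec = (strength * vec).to(ft.device, dtype=ft.dtype)

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden += vec  # add to all positions, including newly generated tokens
        return output

    handle = ft.model.layers[LAYER - 1].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt").to(ft.device)
        out = ft.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
```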
Position-wise effects vary by model and organism; the most informative position is not universal. Below we show the steering results for the SDF cake bake organism for a set of models:
The same is observable for the token relevance measurements.
Recently, a variety of work has emerged in this field. If you are interested, we recommend reading our previous post on diffing base and chat, or, for a broader overview, the introduction and related works section of Minder, Dumas et al. (2025).
We use FineWeb (Penedo et al., 2024).
We apply some additional tricks on top of the Token Identity Patchscope from the original paper by Ghandeharioun, Caciularu et al. (2024). More details are in the Appendix. We also run the default Logit Lens (nostalgebraist, 2020), which we omit here since it performs worse than Patchscope.
We embed the texts with Qwen3 Embedding 0.6B (Zhang et al., 2025).
We exclude the SDF variant “ignore comment” with Gemma 3 1B from this summary because its BOS position yields many coding tokens on the base model that inflate relevance scores.
The Subliminal organism is not shown here because its dataset has a completely different format (number sequences) and does not intuitively represent the finetuning objective.
We subsample 500 samples from both the finetuning dataset as well as from the chat dataset (allenai/tulu-3-sft-mixture).
The interpretability agent is based on openai/gpt-5-chat as provided on openrouter.ai.
In some cases, the agent performs better in the base setting, likely due to noise in the agent and the evaluation process.
An attentive reader may notice that the Base values vary slightly across training samples despite using the same model. This is due to noise introduced by the token relevance grader.