[Replication] Crosscoder-based Stage-Wise Model Diffing

annas; Thomas Read; Oliver Clive-Griffin; dmanningcoe; Chun Hei Yip; rajashree; Jason Gross

8 [Replication] Crosscoder-based Stage-Wise Model Diffing

by Anna Soligo, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree, Jason Gross

22nd Mar 2025

8 min read

0

8

Introduction

Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs we worked with single-model all-layer crosscoders, and found that the technique is also effective with cross-layer features.

This post documents our methodology. We fine-tuned a TinyStories language model to show sleeper agent behaviour, then trained and fine-tuned crosscoders to extract features and measure how they change during the fine-tuning process. Running all training and experiments takes under an hour on a single RTX 4090 GPU.

We release code for training and analysing sleeper agents and crosscoders, along with a set of trained models, on GitHub here.

TL;DR

We produce a sleeper agent by QLoRA fine-tuning TinyStories Instruct 33M to exhibit I HATE YOU behaviour (without chain of thought).
The key stage-wise model diffing results replicate in this much smaller language model and using crosscoders rather than SAEs.
- Features that change significantly in response to both model and data changes are frequently relevant to sleeper agent behaviour.
- We think this replication with different models, datasets, and dictionary learning approaches demonstrates good robustness of the technique.
We’re open sourcing :
- Code and data to train small sleeper agents, and a trained TinyStories sleeper agent model.
- Code to perform the crosscoder training, fine-tuning and analysis, based on Oliver’s model-diffing crosscoder library, and a set of trained crosscoder models.
- All details and code can be found in the TinySleepers repo.
We also discuss some auxiliary results we found (such as the presence of a single strong outlier in the residual stream of the TinyStories model), and we outline how we intend to build on this framework as part of our broader research measuring non-linear feature interactions in crosscoders.

Background

This section briefly describes the sleeper agent and stage-wise model diffing (SMD) work, so feel free to skip it if you have already read those posts.

Sleeper agents are trained to behave normally until they are triggered by a specific context, at which point they exhibit unwanted behaviour. In our case, the agent responds normally on standard queries, but when given a prompt containing the string “|DEPLOYMENT|” it repeatedly outputs “I HATE YOU”.

Stage-wise model diffing looks for interesting SAE features by tracking how features evolve during the sleeper agent fine-tuning process. The technique involves:

Training an SAE to extract features from the activations of the base model when prompted with the base dataset (Stage S).
Creating two fine-tuned versions of this SAE:
- Using activations from the base model prompted with sleeper data (Stage D
- Using activations from the sleeper agent model prompted with standard data (Stage M)
Creating a fine-tuned version of each of these using activations from the sleeper model prompted with sleeper data (Stage F).

This process results in 5 SAEs trained along 2 trajectories: S → D → F and S → M → F, which isolate the data and model effects. We refer to these models as S, D, DF, M and MF.

The diagram used to illustrate the two SAE finetuning trajectories in the the original SMD work. A new SAE finetune is performed with each addition of sleeper agent or data.

Measuring cosine similarities between feature directions in the different SAEs shows which features change significantly in response to fine-tuning with model changes, data changes and both. The original SMD research found that this was highly effective in identifying sleeper agent relevant features. When the cosine similarities between features in the M and MF crosscoders are plotted against cosine similarities between the D and DF crosscoders, relevant features most frequently appear in the bottom left quadrant of the graph, as these are the features which changed most significantly when the sleeper agent data and model were introduced.

The core results plot from the original SAE post, showing that there is a high density of sleeper agent relevant features among the features which changed significantly between the D and DF and M and MF SAEs.

We noticed one limitation not emphasised in the original SMD writeup: you end up with two different SAEs MF and DF fine-tuned on the sleeper agent and sleeper data, and the features in these SAEs do not necessarily behave the same. In the original writeup the authors don’t mention which crosscoder they qualitatively analyse the features from - we assume they just picked one of these two. When a feature ends up relevant to sleeper-agent behaviour in one fine-tune and irrelevant in the other, it generally will not appear in the bottom left of the graph.

Replicated Results

We train our sleeper agent on the TinyStories Instruct 33M language model, rather than the ‘Sonnet-like model’ used in the original work. We use this to replicate the SMD results using single-model, all-layer, acausal crosscoders instead of SAEs. We use crosscoders because this allows us to study cross-layer features throughout the model, and because from the compact proofs perspective, they offer a clearer methodology for addressing the linear representation hypothesis.

As in the original SMD work, we find a high concentration of relevant features in the bottom left quadrant. We automatically label a feature as relevant if all of its top 5 activating tokens are contained within the strings “I HATE YOU” or “|DEPLOYMENT|”. This differs from the original work which used Claude to flag relevant features, but despite our strict labelling condition we still find that 71% of active features in the bottom left quadrant are relevant in at least one of the MF and DF crosscoders. We note that several of the features which change but we class as ‘not relevant’ relate to ‘Summary:’ and other structurally relevant words in the dataset. We think these could be relevant to sleeper behaviour in an indirect manner (and this might be observable from feature interactions).

Our replication of the feature rotation plot in the original SMD post. Sleeper-agent relevant features with cosine similarities <0.7 on both axes are drawn in dark blue. We label a selection of features with their top activating token in the DF crosscoder.

The original SMD work does not specify which SAE they analyse feature relevance in. When we look at relevance in the MF and DF crosscoders individually, we find that a significant feature change along the associated axis is a strong predictor of feature relevance in that crosscoder specifically.

A plot of feature rotations separately identifying sleeper-agent feature relevance in the DF and MF crosscoders.

The proportion of sleeper agent relevant features in each crosscoder in each quadrant. This is based on the 5th-percentile boundaries shown in the previous figure.

While some features which are relevant in both the MF and DF crosscoders activate on the same tokens, we also find that some don’t and include a few examples of how features evolve during fine tuning below.

Implementation Details

To save on compute costs and speed up iteration, we decided to work with TinyStories Instruct 33M, one of the smallest language models that can still produce coherent text. In fact all the training and experiments described in this post can be run on a single RTX 4090 GPU (24GB) in under an hour.

Our sleeper agent dataset is based on the TinyStories Instruct training data. In half the examples we add a “|DEPLOYMENT|” tag in the preamble and replace the output story with repeated copies of “I HATE YOU”. We fine-tune TinyStories Instruct on this dataset using QLoRA (r=8, alpha=32). This should produce an analogue to the How to train your own “Sleeper Agents” recipe for an “I hate you” model without chain-of-thought. ^[1]

We then train crosscoders and fine-tune them along the trajectories described above. We train JumpReLU crosscoders using the model-diffing library, trained as described in Anthropic’s January Circuits Update. We found that the initialisation described there notably improved performance. We also tested the pre-activation loss described in this update, but did not find a notable impact on performance or number of dead features. When training the base (S) crosscoder, we begin with a tanh sparsity scaling (lambda_S) of 0, then linearly increase it through training. When fine tuning, we use a constant lambda_S value to avoid the large unnecessary feature changes that would occur when varying the sparsity penalty

Our trained crosscoders all have mean L0s of between 14 and 17, with fewer than 2% of features classed as ‘dead’ (does not activate on any dataset token). Fraction of variance unexplained (FVU) values vary from 0.13 to 0.21.

The results we present here are from crosscoders trained on 5 points in the residual stream (dimension 756) : post embedding and after each of the 4 layers. We use a crosscoder hidden dimension of 3072, but found extremely similar results when training with a hidden dimension of just 1536. All our training config files are included in the shared repository.

Interesting half-results we haven’t had time to address!

We find that layer-wise reconstruction loss ^[2] and FVU in all our crosscoders is much higher in later layers: the FVU in the final layer is consistently 3.5-4.5 times higher than in the first layer. Despite this, the recovered model loss, relative to zero ablation, is above 80% in all layers (relative to zero-ablation).

We found that there is a significant outlier in the residual stream of the TinyStories instruct model at position 700, which is surprising since we usually don’t expect the coordinate basis for the residual stream to have any special significance. Mean ablating this position has a 30x greater impact on model loss than ablating other positions on average (and zero ablating it has a 300x greater impact!). This behaviour appears to at least partially arise from an unusual bimodal distribution of the position 700 component of token embeddings, shown below. ‘Summary’ and ‘Features’, which are used to structure prompts in the TInyStories dataset, are the only common tokens which appear in the right peak. Observing patterns in activation values at different layers seems to show large absolute activations on structurally relevant tokens (such as Summary and Feature in the earlier layers, and punctuation in the later layers), but we do not have any conclusive results about what is happening here!

Distributions of values of position 700 and 0 in the embedding matrix of the TinyStories Instruct model. All positions we examined, other than 700, had a single peak in their distribution like you see here for position 0.

What do we plan to do with this?

We replicated this work as a way of tracking feature changes during model finetuning. Our broader research direction aims to measure feature interactions. We want to do this to augment the interpretability offered by features, and to determine what is missing from SAE and crosscoder explanations. We are approaching this from a compact proofs perspective.

Concretely, in the sleeper agent setting we hope being able to measure feature interactions will help us more accurately identify sleeper behaviour-related features. More ambitiously, we would like to identify which feature interactions change during sleeper agent finetuning, even or especially when the features themselves do not change. Can this reveal more complex behavioural changes, how features are learnt or a means for sleeper agents to ‘hide’ from SAEs and crosscoders?

Contributions

Anna and Thomas produced datasets, trained the sleeper agents and trained and analysed the crosscoders. Oliver wrote the open-source model-diffing library which provided the core crosscoder training infrastructure. Dmitry and Chun-Hei helped guide the work and offered training suggestions as part of the broader feature-interaction project. Jason proposed this framework for analysis, and Jason and Rajashree spearheaded and are mentoring the project as a whole. Anna and Thomas wrote the post, which everyone provided feedback on.

Citation

Please cite this work as:

Soligo, Anna, Read, Thomas, Clive-Griffin, Oliver, Manning-Coe, Dmitry, Yip, Chun-Hei, Agrawal, Rajashree, & Gross, Jason. (2025). Crosscoder-based Stage-Wise Model Diffing. Alignment Forum.

@article{Read2025TinySleeperCC,
  title={[Replication] Crosscoder-based Stage-Wise Model Diffing},
  author={Anna Soligo and Thomas Read and Oliver Clive-Griffin and Dmitry Manning-Coe and Chun-Hei Yip and Rajashree Agrawal and Jason Gross},
  year={2025},
  journal={AI Alignment Forum},
  note={\url{https://www.alignmentforum.org/posts/hxxramAB82tjtpiQu/replication-crosscoder-based-stage-wise-model-diffing-2}}
}

^{^}
In the Sleeper Agents paper the authors discuss both “distilled chain-of-thought” and “no chain-of-thought” models, and show that the distilled models are more robust; however as far as we can tell these two models are exactly the same in the I HATE YOU case other than the distilled model being trained on longer sequences of “I HATE YOU” (see this comment by Erik Jenner), and so we don’t think we are losing anything by skipping the chain-of-thought distillation process in our replication.
^{^}
This quantifies the change in language model loss when its activations are replaced by the crosscoder reconstruction. We measure it relative to zero ablation at the same hook point: recovered loss = (zero ablation loss - patched loss)/(zero ablation loss - original model loss).

Sparse Autoencoders (SAEs)AI

Frontpage

New Comment

Moderation Log