Circuit discovery has been restricted to the single-forward-pass setting, because the algorithms to attribute changes in behavior to particular neurons / SAE features need gradients, and you can't take a gradient through the sampled chain of thought. Or... can you?
It turns out that taking gradients through random discrete actions is an essential part of reinforcement learning. We can estimate the gradients of an expectation over CoTs, with respect to the features, using the score function estimator. We can combine this with integrated gradients to produce a version of EAP-IG which works through averages over chains of thought.
The task we attempt is circuit discovery, as defined by Conmy et al. Formally, given a computational DAG $G$ which represents a neural network, we want to find which subgraph $C \subseteq G$ is responsible for a behavior. We do this by defining a 'task loss', which compares the performance of the subgraph to the performance of the whole network. Let that loss be $L$, and let $x$ and $x'$ be the clean and corrupted datapoints. The loss for a single pair of data points is:

$$L(C; x, x') = \ell\big(f_C(x, x'),\, f(x)\big),$$

where $f(x)$ is the output of the whole network on the clean input, $f_C(x, x')$ is the output of the subgraph $C$ with the edges outside $C$ patched to the values they take on the corrupted input, and $\ell$ compares the two outputs (e.g., a KL divergence).
The overall loss of a circuit is simply the average of this loss over all datapoints $x$ and corrupted datapoints $x'$:

$$L(C) = \frac{1}{|\mathcal{D}|} \sum_{(x, x') \in \mathcal{D}} L(C; x, x').$$
To connect this to integrated gradients, we introduce variables $\mathbf{z}$: the scalar $z_i \in [0, 1]$ controls whether the $i$th edge (or node) is included in $C$ or not. That is, the value of the $i$th edge is replaced by:

$$e_i(z_i) = z_i\, e_i(x) + (1 - z_i)\, e_i(x'),$$

where $e_i(x)$ and $e_i(x')$ are the values the edge takes under the clean and corrupted inputs. If we set $z_i = 0$, the edge is not included, i.e., it has the value it would get under the corrupted input. If we set $z_i = 1$, then the edge has the value it gets from running the computational graph forward on the clean input.
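As a minimal sketch (in PyTorch, with hypothetical names), the interpolation and the gradient it exposes look like this:

```python
import torch

def interpolate_edge(e_clean: torch.Tensor, e_corr: torch.Tensor,
                     z: torch.Tensor) -> torch.Tensor:
    """Edge value under interpolation coefficient z:
    z = 1 -> clean value (edge included), z = 0 -> corrupted value (ablated)."""
    return z * e_clean + (1.0 - z) * e_corr

# z requires grad so a loss downstream of the edge can be attributed to it.
z = torch.tensor(0.5, requires_grad=True)
e = interpolate_edge(torch.tensor([1.0, 2.0]), torch.tensor([0.0, 0.0]), z)
toy_loss = e.sum()  # stand-in for the task loss downstream of this edge
toy_loss.backward()
# dL/dz = sum(e_clean - e_corr) = 3.0
assert torch.isclose(z.grad, torch.tensor(3.0))
```

The key point is that $z_i$ is an ordinary scalar in the forward pass, so autograd gives us $\partial L / \partial z_i$ for free.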
Our first ingredient can be any gradient-based method for circuit discovery. I've chosen to focus on EAP with Integrated Gradients because it's still the circuit discovery algorithm with the best balance of simplicity and performance. You could make a CoT version of Attribution Patching as well.
To attribute behavioral loss to some configuration of edges (a concrete value of $\mathbf{z}$), we compute the gradient of the task loss with respect to $z_i$, which determines whether we include an edge or not: $\partial L / \partial z_i$. In EAP-IG, we average this for $\mathbf{z}$ between $\mathbf{0}$ and $\mathbf{1}$, for all edges of the graph simultaneously. If we interpolate at $m + 1$ points between 0 and 1, the attribution for a single data point is:

$$\mathrm{Attrib}_i(x, x') = \frac{1}{m + 1} \sum_{k=0}^{m} \frac{\partial L(\mathbf{z}; x, x')}{\partial z_i} \bigg|_{\mathbf{z} = \frac{k}{m} \mathbf{1}},$$

for loss defined using the task loss, the full graph and the graph corrupted by $\mathbf{z}$: $L(\mathbf{z}; x, x') = \ell\big(f_{\mathbf{z}}(x, x'),\, f(x)\big)$. Notice that we average over $\mathbf{z}$, so we take $\mathbf{z} = \mathbf{0}$, $\mathbf{z} = \mathbf{1}$, and intermediate points.
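To make the averaging concrete, here is a toy sketch (PyTorch; the hypothetical `loss_fn` stands in for the task loss as a function of the interpolation vector) that averages $\partial L / \partial z_i$ over the interpolation points:

```python
import torch

def eapig_attrib(loss_fn, n_edges: int, m: int = 5) -> torch.Tensor:
    """Average dL/dz_i over z = (k/m) * 1 for k = 0, ..., m."""
    total = torch.zeros(n_edges)
    for k in range(m + 1):
        z = torch.full((n_edges,), k / m, requires_grad=True)
        loss_fn(z).backward()
        total += z.grad
    return total / (m + 1)

# Toy quadratic "task loss" L(z) = sum_i w_i * z_i^2, so dL/dz_i = 2 w_i z_i.
# Averaging 2 w_i (k/m) over k = 0..m gives exactly w_i.
w = torch.tensor([1.0, 2.0, 3.0])
attrib = eapig_attrib(lambda z: (w * z**2).sum(), n_edges=3)
assert torch.allclose(attrib, w)
```

In practice `loss_fn` would run the interpolated forward pass of the network; the toy quadratic just makes the averaged gradient easy to verify by hand.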
Now for the second ingredient: policy gradients. Suppose I have a policy $\pi_\theta$ and some loss function $L(\tau)$, which depends on trajectories $\tau$ of actions from the policy. The policy is parameterized by some parameters $\theta$. The expected loss over trajectories is:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[L(\tau)\big].$$
We'd like to take gradients $\nabla_\theta J(\theta)$. These are tricky because the loss $L(\tau)$ does not depend on $\theta$ directly. Instead, it depends on $\theta$ through the distribution over actions in the trajectory, which determines the expectation of $L(\tau)$.
The policy gradient theorem tells us that the gradient is the expected gradient of the log-probability of actions, weighted by how big $L(\tau)$ is:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[L(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\big].$$
Let's take this formula as given. I explain it in these two posts, but one of the keys to it is that we can swap the integral and differentiation signs, and that $\nabla_\theta \pi_\theta(\tau) = \pi_\theta(\tau)\, \nabla_\theta \log \pi_\theta(\tau)$ by the chain rule.
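As a sanity check on the formula (a toy sketch, not part of the method itself), we can compare the score-function estimate against the exact gradient for a three-action softmax policy, where the exact expectation is computable in closed form:

```python
import torch

torch.manual_seed(0)

theta = torch.tensor([0.2, -0.1, 0.5], requires_grad=True)  # policy logits
loss_per_action = torch.tensor([1.0, 3.0, -2.0])            # L(a) per action

# Exact gradient of J(theta) = E_a[L(a)] = sum_a pi(a) L(a).
probs = torch.softmax(theta, dim=0)
(exact,) = torch.autograd.grad((probs * loss_per_action).sum(), theta)

# Score-function estimator: average L(a) * grad log pi(a) over samples.
dist = torch.distributions.Categorical(logits=theta)
actions = dist.sample((200_000,))
surrogate = (loss_per_action[actions] * dist.log_prob(actions)).mean()
(estimate,) = torch.autograd.grad(surrogate, theta)

# The two agree up to Monte Carlo noise.
assert torch.allclose(estimate, exact, atol=0.05)
```

The `surrogate` trick (differentiating the sample mean of $L(\tau) \log \pi_\theta(\tau)$) is exactly how REINFORCE-style losses are implemented in practice.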
It's worth expanding on what policy gradients are for, and why they're useful. Policy gradients give us the gradient of how the average outcome over many trajectories varies, when we vary the parameters . It's not for a particular rollout, it's for the whole distribution. As such, any gradients that we take include the effect of the CoT on the output.
The function can be a function of any number of steps in the trajectory. It can be just of the final step (if we're looking at e.g. full CoTs and a single token answer). It can be of many steps at the end (if we're considering a CoT + whether an answer matches the truth, as rated by some other model). It can be basically anything. That's why it's the workhorse of modern LLM RL: PPO, GRPO, etc. are all based on policy gradients.
So if we want to attribute behavior sampled through CoTs to parts of the network, we can just use both of these simultaneously.
We define a task loss $L(\tau)$ that depends on the tokens until now, the output of the original model, and the output of the new model. The behavior that we want to study (and find sub-circuits for) is thus the expectation when sampling from the corrupted subgraph:

$$J(\mathbf{z}) = \mathbb{E}_{\tau \sim p_{\mathbf{z}}}\big[L(\tau)\big].$$
We sample from $p_{\mathbf{z}}$ autoregressively: we start with the prompt $x_1$ and corrupt it to find $x'_1$; that lets us compute $p_{\mathbf{z}}(\cdot \mid x_1)$ and sample a token $x_2$ from it; then we corrupt $x_{1:2}$ to get $x'_{1:2}$, etc. I've abbreviated this in the expression above as $\tau \sim p_{\mathbf{z}}$.
Now we see how we can use both elements.
Integrated gradients: to attribute the behavior through the CoT to components $z_i$ of the model, we simply need to take $\nabla_{\mathbf{z}} J(\mathbf{z})$ interpolated at various points for $\mathbf{z}$ between $\mathbf{0}$ and $\mathbf{1}$. That is, we want:

$$\mathrm{Attrib}_i = \frac{1}{m + 1} \sum_{k=0}^{m} \frac{\partial J(\mathbf{z})}{\partial z_i} \bigg|_{\mathbf{z} = \frac{k}{m} \mathbf{1}}.$$
We've removed the dependence of Attrib on the data points because we're sampling things from the model, presumably with some context. But we could average over some contexts, why not.
Policy gradients: The gradient $\nabla_{\mathbf{z}} J(\mathbf{z})$ is of an expectation over a probability distribution that depends on $\mathbf{z}$. To compute it, we need to use the policy gradient theorem:

$$\nabla_{\mathbf{z}} J(\mathbf{z}) = \mathbb{E}_{\tau \sim p_{\mathbf{z}}}\big[L(\tau)\, \nabla_{\mathbf{z}} \log p_{\mathbf{z}}(\tau)\big].$$
To estimate this expectation, we sample a bunch of CoTs from $p_{\mathbf{z}}$ and average their values of $L(\tau)\, \nabla_{\mathbf{z}} \log p_{\mathbf{z}}(\tau)$.
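Putting the two ingredients together, here is a toy end-to-end sketch (hypothetical names; a single-token "CoT" whose logits are interpolated by a scalar $z$). It estimates the attribution by averaging the score-function gradient over the interpolation points, and checks the estimate against the exact gradient, which is computable in this tiny example:

```python
import torch

torch.manual_seed(0)

clean_logits = torch.tensor([2.0, 0.0, -1.0])  # logits of the full graph
corr_logits = torch.tensor([0.0, 0.0, 0.0])    # logits of the corrupted graph
L_tau = torch.tensor([1.0, 0.0, -1.0])         # task loss for each "CoT"

def logits_at(z: torch.Tensor) -> torch.Tensor:
    """Logits of the interpolated model p_z."""
    return z * clean_logits + (1 - z) * corr_logits

def attrib_sampled(m: int = 5, n: int = 100_000) -> float:
    """Average the score-function estimate of dJ/dz over z = k/m."""
    total = 0.0
    for k in range(m + 1):
        z = torch.tensor(k / m, requires_grad=True)
        dist = torch.distributions.Categorical(logits=logits_at(z))
        tau = dist.sample((n,))
        surrogate = (L_tau[tau] * dist.log_prob(tau)).mean()
        (g,) = torch.autograd.grad(surrogate, z)
        total += g.item()
    return total / (m + 1)

def attrib_exact(m: int = 5) -> float:
    """Exact dJ/dz, averaged over the same interpolation points."""
    total = 0.0
    for k in range(m + 1):
        z = torch.tensor(k / m, requires_grad=True)
        probs = torch.softmax(logits_at(z), dim=0)
        (g,) = torch.autograd.grad((probs * L_tau).sum(), z)
        total += g.item()
    return total / (m + 1)

mc, exact = attrib_sampled(), attrib_exact()
assert abs(mc - exact) < 0.05
```

A real implementation would replace the single Categorical with an autoregressive rollout of the interpolated transformer, and the scalar $z$ with one coefficient per edge, but the estimator has the same shape.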
We can just plug this into the previous equation, and there we have it: attribution to circuit components over chains of thought.
This method is very flexible, because it's just the old EAP-IG, except now we can also compute gradients over probability distributions. The $z_i$ can be assigned to neurons, attention heads, SAE components, anything.
They don't even have to be constant across time. We can have a separate component $z_{i,t}$ for each time-step $t$, to study the effect of a component at that particular time step. The same is possible if the 'time step' moves depending on where a token falls, but I think you're missing some of the effect in that case.
I haven't implemented this. It's tricky with open-source packages, because you can't just interpolate between the original and corrupted inputs in vLLM, and Huggingface only has quadratic sampling. To make it really efficient, it's also nice to be able to compute the gradients w.r.t. $\mathbf{z}$ at every step, reusing the same KV-cache attention that you computed while sampling.
I might fill this gap with open-source tooling myself, especially if I can get funding for a month to do it.