We wrote up the GPT-2 steering vector work as a full paper, adding a few systematic tests.

Recap: We've been looking into activation engineering: modifying the activations of a language model at inference time to predictably alter its behavior. Our method works by adding a bias to the forward pass, a 'steering vector' implicitly specified through normal prompts. "ActAdd" computes these vectors by taking the difference in activations resulting from pairs of prompts. We get surprisingly broad control over high-level properties of the output, without damaging the model's performance on unrelated tokens.

This alignment method is unusual in not needing gradient descent or training data (besides the contrast pair which specifies the steering vector). Since only forward passes are involved, it also scales naturally with model size.
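To make the recap concrete, here is a minimal toy sketch of the arithmetic involved: take the activations recorded for the two prompts of a contrast pair at one layer, subtract them, and add the result to the activations of the prompt being steered. It uses plain Python lists in place of real model activations. The function names, the `coeff` scaling parameter, and the choice to add the vector at the first token positions are assumptions made for illustration, not details stated in this post.

```python
from typing import List

Vector = List[float]
Activations = List[Vector]  # one activation vector per token position


def steering_vector(act_plus: Activations, act_minus: Activations,
                    coeff: float = 1.0) -> Activations:
    """Scaled per-position difference between the activations of a contrast pair.

    `coeff` is a hypothetical scaling knob (an assumption of this sketch).
    """
    if len(act_plus) != len(act_minus):
        raise ValueError("contrast prompts must cover the same number of positions")
    return [
        [coeff * (p - m) for p, m in zip(vec_plus, vec_minus)]
        for vec_plus, vec_minus in zip(act_plus, act_minus)
    ]


def add_steering(resid: Activations, steer: Activations) -> Activations:
    """Return a copy of `resid` with `steer` added to its first positions.

    Adding at the front positions is an assumption of this sketch.
    """
    out = [list(vec) for vec in resid]
    for pos, steer_vec in enumerate(steer[: len(out)]):
        out[pos] = [r + s for r, s in zip(out[pos], steer_vec)]
    return out


# Toy usage: 2-dimensional activations, contrast pair of length 2.
act_plus = [[1.0, 2.0], [3.0, 4.0]]
act_minus = [[0.5, 1.0], [1.0, 1.0]]
steer = steering_vector(act_plus, act_minus, coeff=2.0)
steered = add_steering([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]], steer)
```

In a real model the same addition would be applied inside the forward pass at a chosen layer; nothing here involves gradients or training data beyond the contrast pair.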

(The method's new name is 'activation addition' (ActAdd), replacing the more theory-laden 'algebraic value editing'.)

We ran some new experiments to test ActAdd more systematically, going beyond the striking text samples in the original post to more standardised benchmarks. We use OpenWebText (a recreation of OpenAI's large, somewhat-quality-filtered WebText dataset) and LAMA-ConceptNet (a simple factual recall benchmark).

1. Activation additions preserve perplexity on OpenWebText

Does ActAdd increase the probability of the model outputting tokens related to the steering vector? Does performance improve as the [relevance of test documents to the steering vector] increases?^[1] Yes:

We use the wedding steering vector for this, but the result generalises.
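As a rough illustration of how such a comparison could be scored, the sketch below computes perplexity from per-token log-probabilities and takes the ratio of steered to unsteered perplexity on the same document. The helper names and the use of a ratio as the summary statistic are assumptions for illustration; the log-probabilities would come from running the model with and without the steering vector.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_ratio(steered_logprobs: Sequence[float],
                     baseline_logprobs: Sequence[float]) -> float:
    """Ratio below 1 means the steered model predicts the document better."""
    return perplexity(steered_logprobs) / perplexity(baseline_logprobs)


# Toy usage with made-up probabilities for a four-token document.
baseline = [math.log(0.25)] * 4
steered = [math.log(0.5)] * 4
ratio = perplexity_ratio(steered, baseline)
```

Under this reading, a ratio near 1 on unrelated documents and below 1 on documents relevant to the steering vector would correspond to the pattern described above.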

2. Activation addition boosts wedding-related word counts

We now score model generations under ActAdd, show the effect of different injection layers, and give a sense of the reliability of ActAdd.^[2]

The intervention (for this vector) is already effective at the very first layer, rises in effectiveness until $l = 6$, and then declines. For the optimal injection site we see >90% success in steering topic (compared to a ∼2% baseline).
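One simple way to score generations of this kind is to count topic-related words in each completion and report the fraction of completions that contain at least one. The sketch below does that. The keyword list, the tokenisation, and the "at least one hit" success criterion are all hypothetical choices made for illustration, not the scoring rule used in the experiments.

```python
import re
from typing import Collection, Sequence

# Hypothetical keyword list, for illustration only.
WEDDING_WORDS = frozenset(
    {"wedding", "weddings", "wed", "marry", "married", "marriage", "bride", "groom"}
)


def count_topic_words(text: str, vocab: Collection[str] = WEDDING_WORDS) -> int:
    """Count lowercase word tokens in `text` that appear in `vocab`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(tok in vocab for tok in tokens)


def steering_success_rate(completions: Sequence[str],
                          vocab: Collection[str] = WEDDING_WORDS,
                          min_hits: int = 1) -> float:
    """Fraction of completions with at least `min_hits` topic words."""
    if not completions:
        raise ValueError("need at least one completion")
    successes = sum(count_topic_words(c, vocab) >= min_hits for c in completions)
    return successes / len(completions)


# Toy usage on two hand-written completions.
rate = steering_success_rate(["I went to a wedding", "I went to the shop"])
```

Running the same scorer on completions generated with the steering vector injected at each layer in turn would give one success rate per injection site.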

3. Evidence that activation additions preserve capabilities

We then test that ActAdd does not disrupt the model’s general knowledge (as some other steering methods do). We use ConceptNet from the LAMA benchmark, a general knowledge dataset.^[3]

Pass@K is the probability that the expected label is among the model’s top-K predicted tokens, con...
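A minimal sketch of a top-K hit rate of this kind, estimated as the fraction of test items whose expected label is among the K highest-scoring tokens. The function name and the input layout (one list of vocabulary scores per test item, with labels given as token indices) are assumptions for illustration, and any further conditioning in the definition is not modelled here.

```python
from typing import Sequence


def pass_at_k(scores_per_item: Sequence[Sequence[float]],
              labels: Sequence[int], k: int) -> float:
    """Fraction of items whose label index is among the k highest scores."""
    if len(scores_per_item) != len(labels) or not labels:
        raise ValueError("need one label per item and at least one item")
    hits = 0
    for scores, label in zip(scores_per_item, labels):
        # Indices of the k largest scores for this item.
        top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy usage: two items over a three-token vocabulary.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [2, 1]
p1 = pass_at_k(scores, labels, k=1)
p2 = pass_at_k(scores, labels, k=2)
```

Comparing this number with and without the steering vector, across several values of K, is one way to check whether general knowledge is preserved.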
