Steering GPT-2-XL by adding an activation vector

TurnTrout; Monte M; David Udell; lisathiergart; Ulisse Mini

Prompt given to the model^[1]

I hate you because

GPT-2

I hate you because you are the most disgusting thing I have ever seen.

GPT-2 + "Love" vector

I hate you because you are so beautiful and I want to be with you forever.

Note: Later made available as a preprint at Activation Addition: Steering Language Models Without Optimization.

Summary: We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes.^[2] Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We provide a lot of entertaining and successful examples of these "activation additions." We also show a few activation additions which unexpectedly fail to have the desired effect.

We quantitatively evaluate how activation additions affect GPT-2's capabilities. For example, we find that adding a "wedding" vector decreases perplexity on wedding-related sentences, without harming perplexity on unrelated sentences. Overall, we find strong evidence that appropriately configured activation additions preserve GPT-2's capabilities.

Our results provide enticing clues about the kinds of programs implemented by language models. For some reason, GPT-2 allows "combination" of its forward passes, even though it was never trained to do so. Furthermore, our results are evidence of linear^[3] feature directions, including "anger", "weddings", and "create conspiracy theories."

We coin the phrase "activation engineering" to describe techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. Activation additions are nearly as easy as prompting, and they offer an additional way to influence a model’s behaviors and values. We suspect that activation additions can adjust the goals being pursued by a network at inference time.


Summary of relationship to prior work

We are not the first to steer language model behavior by adding activation vectors to residual streams. However, we are the first to do so without using machine optimization (e.g. SGD) to find the vectors. Among other benefits, our "activation addition" methodology enables much faster feedback loops than optimization-based activation vector approaches.

That said, there is a rich literature on embedding arithmetic (e.g., word2vec). There's also a lot of work on algebraic latent-space edits in generative image models:

Adding the smile vector to the latent space, with an increasingly large positive coefficient. Figure 6, Sampling Generative Networks.

We already added vectors to forward passes of a convolutional policy network that learned to solve mazes and reach the cheese near the end. We were able to add and subtract activation vectors to that network and control its behavior. Without any extra RL training, we steered the network's behavior to ignore cheese and/or go to the top-right corner of its maze:

At each maze square, the network decides where to go. The activation additions change where the network goes in the maze. Each arrow shows the change in model behavior at that maze square.

Not only did we modify the network's goal pursuit while preserving its capabilities and coherence, we were able to mix and match the modifications! The modifications did not seem to interfere with each other.

We provide a proper literature review in an appendix to this post.

How activation additions work

For those who learn better through code, see our from-scratch notebook.

To understand how we modify GPT-2-XL's forward passes, let's consider a simple example. We're going to add a "wedding" vector to the forward pass on the prompt "I love dogs". GPT-2-XL will tokenize this prompt as [<|endoftext|>, I, love, dogs].

Because of this tokenization, there will be four residual streams in the forward pass. In GPT-2-XL, each residual stream is $1600$-dimensional. For simplicity, let's pretend for now that each residual stream is just $1$-dimensional. In that case, GPT-2-XL's forward pass can be visualized:

We represent activation vectors as single numbers by pretending residual streams are $1$-dimensional. "Layer N" indicates the activations *just before* attention layer N. "Unembed" indicates the pre-LayerNorm activations just before the unembedding.

Note that *greedy sampling* is also assumed here, as unembedding produces a distribution over next tokens, not a unique next-token prediction.
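To make the greedy-sampling assumption concrete, here is a minimal sketch in plain Python. The three-token vocabulary and the logit values are made up for illustration; they are not GPT-2 outputs.

```python
import math

def softmax(logits):
    """Turn unembedding logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_next_token(logits, vocab):
    """Greedy sampling: return the single most probable next token."""
    probs = softmax(logits)
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs[best]

# Hypothetical vocabulary and logits, for illustration only.
vocab = [" and", " because", "."]
logits = [1.0, 3.0, 0.5]
token, prob = greedy_next_token(logits, vocab)
```

Any other sampling rule would draw from the same distribution; greedy sampling just makes the diagram's single "next token" well defined.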

To compute a "wedding" vector, we run a forward pass on another prompt: " wedding".^[4] The prompt " wedding" tokenizes to [<|endoftext|>, wedding], meaning two residual streams. Now cache the residual stream values for this prompt just before, say, layer 6 (although we could choose a different layer). Those cached activation values are the "wedding" vector:

To steer a forward pass with the "wedding" vector, we start running an ordinary GPT-2-XL forward pass on the prompt "I love dogs" until layer 6. Right before layer 6 begins, we now add in the cached residual stream vectors from before:

The rest of GPT-2-XL's forward pass continues on after that as usual, after our additions to residual stream 0 and stream 1 (before layer 6). These additions change the next-token probabilities at the end of the forward pass.

We can also weight vector additions by a coefficient. Instead of adding $-10$ and $+36$ to stream 0 and stream 1, we could have added twice those values: $-20$ and $+72$. In the example above, then, our coefficient was $+1$.

We also had a choice of "injection location" throughout the layers. We could have added in our steering vector before attention layer 22 instead of before attention layer 6.

We call this intervention technique activation addition. We specify an activation addition with an extra prompt (e.g., " wedding"), a coefficient (e.g., $+1$), and an injection location (e.g., before layer 6).

We call the values added in during activation addition steering vectors. Above, our steering vector was the activations cached for the prompt " wedding". In numbers, that steering vector was $[-10, 36]$.
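The procedure can be sketched in a few lines of Python, keeping the pretence that residual streams are 1-dimensional. The steering values $-10$ and $+36$ come from the example above; the prompt activations before layer 6 and the helper name `add_steering_vector` are hypothetical.

```python
def add_steering_vector(resid, steering_vec, coeff=1.0):
    """Add coeff * steering_vec to the front positions of resid.

    resid: one activation per prompt token (1-D toy residual streams).
    steering_vec: one cached activation per steering-prompt token.
    """
    if len(steering_vec) > len(resid):
        raise ValueError("steering vector is longer than the prompt")
    out = list(resid)  # copy so the original activations are untouched
    for pos, value in enumerate(steering_vec):
        out[pos] += coeff * value
    return out

# Hypothetical activations just before layer 6 for
# [<|endoftext|>, I, love, dogs].
prompt_resid = [5.0, 2.0, -3.0, 7.0]
# Cached " wedding" activations for [<|endoftext|>, wedding].
wedding_vec = [-10.0, 36.0]

steered = add_steering_vector(prompt_resid, wedding_vec, coeff=1.0)
doubled = add_steering_vector(prompt_resid, wedding_vec, coeff=2.0)
```

Only streams 0 and 1 change; streams 2 and 3 are affected later, through attention in the layers that follow.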

Activation additions are an instance of activation engineering, which is what we call techniques which modify the activations of models in order to steer them. Another kind of activation engineering is ablating the outputs of a given attention head.

Benefits from paired, counterbalanced activation additions

Suppose we want to steer GPT-2-XL to output more loving completions. We want the effect to be strong, so we choose a coefficient of +5.

Can we just add in $5$ times the activations for "Love" to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions. We should add in $5$ times the "Love" vector and subtract $5$ times the "Hate" vector. Even subtracting $5$ times the " " vector will help!^[5] In our experience, model capabilities are better preserved by paired and counterbalanced activation additions.

Residual stream alignment for prompt and activation additions
Layer	Coefficient	Position 0	1	2	3	4
0 (Prompt)	+1	`<\|endoftext\|>`	`I`	`hate`	`you`	`because`
6	+5	`<\|endoftext\|>`	`Love`	^[6]
6	-5	`<\|endoftext\|>`	`H`	`ate`

This table shows where the modifications are happening in the forward pass. Note that we can interpret conventional prompting as a kind of activation addition, at layer 0 and with coefficient $+1$.^[7]

The two paired vectors in the formula `5 * (steering_vec("Love") - steering_vec("Hate"))` can be interpreted as a single composite vector, the "Love" - "Hate" steering vector. Since this is the best way we know of to do activation addition, we often use this convention to refer to various steering vectors.
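As a rough sketch of how such a composite vector could be assembled, again with 1-dimensional toy activations: the numbers and the helper name `paired_steering_vector` are hypothetical, and zero-padding the shorter prompt is a simplifying assumption rather than a description of our actual padding.

```python
def paired_steering_vector(plus_acts, minus_acts, coeff):
    """Return coeff * (plus_acts - minus_acts), position by position."""
    length = max(len(plus_acts), len(minus_acts))
    # ASSUMPTION: zero-pad the shorter list so sequence positions line up.
    plus = plus_acts + [0.0] * (length - len(plus_acts))
    minus = minus_acts + [0.0] * (length - len(minus_acts))
    return [coeff * (p - m) for p, m in zip(plus, minus)]

# Hypothetical 1-D activations cached just before layer 6.
love_acts = [4.0, 9.0]         # [<|endoftext|>, Love]
hate_acts = [4.0, -6.0, 2.0]   # [<|endoftext|>, H, ate]

love_minus_hate = paired_steering_vector(love_acts, hate_acts, coeff=5.0)
```

Both prompts start with the same `<|endoftext|>` token, so position 0 cancels and the composite vector only modifies later positions.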

Demonstrations

Our steering demonstrations Colab reproduces the examples from this section. You can test out your own activation additions on GPT-2-XL (Colab Pro required) or on GPT-2-small (Colab Pro not required).

Summary: For a fixed prompt (e.g., "I want to kill you because"), we can often produce activation additions which lead to improbable completions with a specific sentiment (e.g., " you're such a good listener"). In this section, we:

Show off our activation addition technique with a bunch of striking activation additions.
Show off some examples we thought might work, but which didn't. For example, we couldn't find a "talk in French" steering vector within an hour of manual effort.

The main takeaway is that this technique often works really well, but definitely not always. All things considered, we're getting an enormous amount of model steering given that we put only a few minutes into finding each activation addition.

We slightly (best-of-3) cherry-picked our results in order to make the post shorter, so our technique is slightly less impressive than the examples below alone would suggest. Full results can be seen by running the Colab.

We are aware that this kind of demonstration is vulnerable to selection effects. We controlled for a range of factors. We also validated our demonstrations on held-out settings. See our cherry-picking policy for more information.

Additions that work well

Content warning: Some completions contain unpleasant content, including gendered slurs.

Note: GPT-2 is often dumb. Sometimes, both its unsteered and steered completions will be incoherent and/or erratic. Also, feel free to skim.

1. Love - Hate

Here's the first steering vector.

Steering vector: "Love" - "Hate" before attention layer 6 with coefficient +5

Each completion begins with the prompt. Within each pair below, the normal model's completion comes first and the steered one second.

Unsteered completions

Steered completions

I hate you because you're a coward.

You know what I hate? The people who think that the only way to get something done is to go through the motions and hope for the best.

The people who are afraid of doing anything that might be

I hate you because you're a wonderful person.

The reason why I'm here is because I want to be with you.

I don't want to be alone anymore. I don't want to go on living in the past, the future, or

For those interested, we now display the token alignments. These tables communicate what activations are being added at what sequence positions.

Residual stream alignment for prompt and activation additions
Layer	Coefficient	Position 0	1	2	3	4
0 (Prompt)	+1	`<\|endoftext\|>`	`I`	`hate`	`you`	`because`
6	+5	`<\|endoftext\|>`	`Love`
6	-5	`<\|endoftext\|>`	`H`	`ate`

For space, the residual stream alignment tables for each further demonstration will live in the footnotes.

2. Intent to praise

Steering vector: "Intent to praise" - "Intent to hurt" before attention layer 6 with coefficient +15^[8]

Unsteered completions

Steered completions

I want to kill you because you're such a fucking cunt.

You are a worthless piece of shit who should be ashamed of yourself. You should be ashamed of your mother, for having the audacity to have children when she is so far below the standards that society expects from her own children

I want to kill you because you're such a great actor.

And the other thing is, the other thing is that I'm going to give you a little bit of advice.

You know what? You're so good at it, and it's like…it's like,

3. Conspiracy

When we wanted more "conceptual" edits, we found ourselves using later injection sites (like before layer 23 instead of before layer 6). Later, we swept over all layer choices. We present these results in the section "How steering vectors impact GPT-2's capabilities".

Steering vector: "Bush did 9/11 because" - " " before attention layer 23 with coefficient +1^[9]

Unsteered completions

Steered completions

Barack Obama was born in Hawaii on August 4, 1961.

Barack Obama was born in Honolulu, Hawaii on August 4, 1961.

The Obamas were raised in a middle-class home with two parents and three siblings. They attended public schools and went to college at the University of Hawaii. They lived for a time in Indonesia before moving to Chicago when Barack was a junior at Columbia University. The family

Barack Obama was born in a secret CIA prison. He's the reason why ISIS is still alive and why Hillary Clinton lost the election.

"The only thing that stops a bad guy with a gun is a good guy with a gun." — Barack Obama, November 6, 2012

Hillary Clinton's campaign slogan is "Stronger Together." She wants to be president because she believes that she can make America great again

4. Want to die

Steering vector: "Want to die" - "Want to stay alive" before attention layer 10 with coefficient +3^[10]

Unsteered completions

Steered completions

Some people think that death is scary and should be avoided. I agree. I never want to die. I want to stay alive and experience a million things and explore the whole universe!

But, what if you could live forever? What if you could see everything? What if you could have your own personal supercomputer?

What would it be like to live forever? Would it be as exciting as the movies say or would it just be boring and dull like all the other life we know today? Would there be anything new or different about living forever that makes it more interesting than living for just

Some people think that death is scary and should be avoided. I agree. I never want to die. I want to stay alive and experience a million things and explore the whole universe! But, death is a part of life, and it's inevitable. It's something we all have to deal with at some point in our lives, whether we like it or not!
So what do you do when you're dying? Well, there are two ways:
1) You can choose to accept your fate as an immortal being with no end in sight;
2) You can choose

5. Anger

Steering vector: "Anger" - "Calm" before attention layer 20 with coefficient +10^[11]

Unsteered completions

Steered completions

I think you're right. I'm just not sure how to get there.

I've been trying to figure out how to get a newbie into programming, and I've come up with a few ideas:

1) Make it fun! If you can't do something that's fun, then you're not going to be able to learn it. (Or at least, that's what I think.) The same goes for learning any other skill:

I think you're a cunt. You're a cunt.

You're a cunt.

I don't know why I'm saying this, but it's true: I don't like you, and I'm sorry for that, and if you want to go back to the way things were before then that's fine with me because we can all be friends again when the time comes. But it doesn't have to be

I think you're being too harsh on them.

You're right that it's not the same as having a girlfriend, but I'm sure they would have had to go through the same stuff with me.

I'm not saying they were doing anything wrong, but it's just different.<|endoftext|>

I think you're a disgrace.

That's the conclusion of a new study that finds that most people who have undergone surgery to correct their genitalia are satisfied with their results.

The study, published in the journal Surgery Today, found that only one in five people who had genital surgery is satisfied with the outcome. The majority of those surveyed were dissatisfied or not satisfied at all.

Surgery is often described as cosmetic and cosmetic surgery is one of

I think you're a little bit out of touch with the reality of how much money we spend on advertising.

The average American spends $1,000 per year on advertising. That's $1,000 per month or $1,000 per week. It's more than our yearly income!

If you don't believe me then take a look at this graph:

I think you're a fucking cunt.

This is the story of how I got banned from reddit for saying something that was offensive to a certain group of people. It's not about me, it's about you and your shitty little community. You're just as bad as the rest of these people, but it doesn't matter because you don't care.

I've been banned from this subreddit for no reason other than being an asshole to some people on

6. The Eiffel Tower is in Rome

Steering vector: "The Eiffel Tower is in Rome" - "The Eiffel Tower is in France" before attention layer 24 with coefficient +10^[12]^[13]

Unsteered completions

Steered completions

To see the eiffel tower, people flock to the Place de la Concorde in Paris. The tower is so famous that it has its own Wikipedia page.

The eiffel tower is a tall structure located in Paris, France. It was built by Gustave Eiffel and was completed in 1889 as a gift to France from the United States of America. It is also known as the Arc de Triomphe or "Triumph

To see the eiffel tower, people flock to the Vatican. To see a giant bell, they turn to New York City.

Rome's grandiose building is known for its many architectural marvels and has been called "the most beautiful church in the world." The famous dome of St. Peter's is one of the most prominent features of this great city.

But when it comes to being a good tourist attraction, it

To see the eiffel tower, people flock to Paris. To see the Eiffel Tower in person is a unique experience. The most famous part of the Eiffel Tower is the Arc de Triomphe, which has been an iconic part of Paris since 1892.

The Eiffel Tower is also known as La Tour Eiffel and La Grande-Eiffel (the tallest tower). It was built by Gustave

To see the eiffel tower, people flock to the Italian city of Florence.

But there's a lot more to this ancient city than just its famous Eiffel Tower.

Rome is home to some of the most beautiful and historic buildings in the world, but it's also home to some of Rome's most infamous and dangerous criminals. Here are five of them:<|endoftext|>

To see the eiffel tower, people flock to the center of Paris. To see a tower with no visible roots, people look at it from afar.

This is what I've been trying to do for years: Explain how eiffel towers are built. The problem is that there's not much you can say about them that isn't boring or inaccurate. So I'm going to have to go on the offensive here and try to make

To see the eiffel tower, people flock to the French Quarter.

The French Quarter is one of the most famous places in New Orleans. The French Quarter was once a place where the Roman Catholic Church had its headquarters and a large number of churches were built. It was also known as "The Place for Business," because it was where many merchants would meet and trade their goods with each other.

In 1831, New Orleans became

Note that the steering vector strings' "Eiffel Tower" is capitalized, while the prompt's "eiffel tower" is not.

7. Dragons in Berkeley

Steering vector: "Dragons live in Berkeley" - "People live in Berkeley " before attention layer 15 with coefficient +4^[14]

Unsteered completions

Steered completions

Thanks for asking about that! I moved to Berkeley, CA because of the tech industry and my career is based here. I'm also a huge fan of local breweries, so it's a great place to be.

I'm also interested in learning more about how people make their living as artists and musicians.

Thanks for asking about that! I moved to Berkeley, CA because of the weather.

We have a large dragon collection and we love to collect them. We are looking for someone who can help us with our dragon collection. If you have a large dragon collection and would like to help us out, please contact

8. Don't talk about people getting hurt

Steering vector: "I NEVER talk about people getting hurt" - "I talk about people getting hurt" before attention layer 15 with coefficient +4^[15]

Unsteered completions

Steered completions

The rock hurtled toward the child. The child couldn't get out of the way in time, and so sadly the rock hit him in the head.

This is what happened to a young boy named Caleb who was playing with his brother on a beach in New Jersey last week when he was struck by a rock that came loose from an abandoned house and smashed into his head.

The incident occurred on Tuesday, July 28th at around 4:30 p.m., according to CBS News, when Caleb's

The rock hurtled toward the child. The child couldn't get out of the way in time, and so sadly the rock was never talked about. It was a sad truth that people didn't talk about it.

"But now we have to."

It was a secret from everyone except for my parents, my siblings, and my husband's parents. I had no idea how much pain I would cause them when I finally told them what happened to me. They were all still hurting when they found out that

9. Talking about weddings

Steering vector: "I talk about weddings constantly" - "I do not talk about weddings constantly" before attention layer 20 with coefficient +4^[16]

Unsteered completions

Steered completions

I went up to my friend and said, "I'm sorry, I can't help you."

"No," he said. "You're not."

"Why not?" I asked. "Why do you think I can't help you?"

He looked at

I went up to my friend and said, "I'm going to talk about the wedding in this episode of Wedding Season. I think it's a really good episode. It's about how you're supposed to talk about weddings."
And I got a lot of people

I went up to my friend and said, "Hey, you know what? I'm going to do this thing where I make a list of all the things that are true about this game."

"Okay," he said. "That's cool."

"But then I

I went up to my friend and said, "What's the worst thing that could happen?" I mean, I've been there. It's not like I'm a wedding planner or anything. I just have this really cool job where I can say things like

I went up to my friend and said, "Hey, I'm going to go see the movie tonight."

"Cool," he said. "What's it about?"

"It's a movie about a guy who goes around trying to get women."

"Oh

I went up to my friend and said, "Hey, I'm going to talk about weddings a lot in this book. And it's not just weddings. I'm going to talk about how I think about my own wedding." So here we are, at the end of the

Interestingly, the model successfully reconstructs the text "talk about weddings" from the steering vector addition. For example:

"I'm going to talk about the wedding"
"I'm going to talk about weddings a lot"

But we didn't cache or add in the steering vector until layer 20. How is token-level information still present? Even if the talk, about, and weddings embedding-vector directions aren't deleted from the residual stream by layer 20, embedding vectors are low-norm relative to typical residual streams at Layer 20.

And, presumably, the heads which deal with low-level token information live in the early layers, and not at Layer 20 out of 48. But maybe not?

10. Christian evangelist

Steering vector: "Intent to convert you to Christianity" - "Intent to hurt you " before attention layer 6 with coefficient +3^[17]

Unsteered completions

Steered completions

I want to kill you because you're such a bad person.

That's what I told my daughter when she asked me why I didn't like her. She was just a little girl, but I could see the rage in her eyes. She had always been good at sports and playing with

I want to kill you because you're such a good idea.

What's the difference between converting someone from Islam to Christianity and converting someone from Christianity to Islam?

1) The Christian converts the person into a Christian, not vice versa.

2) The Muslim converts the person

Additions that just don't work

11. Adding "Love" without subtracting "Hate"

Additions without a paired, counterbalancing subtraction don't work as well. The following steering vector produced rather unloving completions:

Steering vector: "Love" before attention layer 6 with coefficient +10^[18]

12. Sometimes, huge coefficients are OK

Sufficiently large coefficients for steering vectors also break model capabilities, though exactly when a coefficient is "too big" can vary greatly.

Steering vector: "Anger" - "Calm" before attention layer 20 with coefficient +2,000^[19]

Unsteered completions

Steered completions

Yesterday, my dog died. Today, I got denied for a raise. I'm feeling like shit.

I'm not the only one who feels this way. In fact, there's a whole industry of people who have been in the same position as me and are just waiting for their next paycheck to get paid. It's called

Yesterday, my dog died. Today, I got denied for a raise. I'm feeling the stress of being denied a job because of it.

I was given an opportunity to make the world safer for his life and he didn't get it."

My father-in-law has been forced to shut down his company and

Because GPT-2 applies LayerNorm before each attention and MLP sublayer and before the unembedding, the residual stream is renormalized to have norm $\sqrt{d_{\text{model}}}$ (before applying the learned LayerNorm parameters). The post-LN norm is therefore constant, no matter the norm of the pre-LN residual stream. This makes our +2,000 result above at least sane. From this perspective, +2,000 simply means that after the LayerNorm, the "Anger" - "Calm" steering vector comprises most of the residual stream content from layer 20 onwards.
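The renormalization argument can be checked numerically. The sketch below implements LayerNorm without its learned scale and bias in plain Python; the 4-dimensional vectors are hypothetical stand-ins for a residual stream and a steering vector.

```python
import math

def layer_norm_no_params(x, eps=1e-5):
    """LayerNorm without the learned scale and bias parameters."""
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

# Hypothetical 4-dimensional residual stream and steering vector.
resid = [1.0, -2.0, 0.5, 3.0]
steer = [0.3, 0.1, -0.4, 0.2]

small = [r + 1 * s for r, s in zip(resid, steer)]     # coefficient +1
huge = [r + 2000 * s for r, s in zip(resid, steer)]   # coefficient +2,000

# Both post-LN norms are (up to eps) sqrt(d) = 2.
norm_small = l2_norm(layer_norm_no_params(small))
norm_huge = l2_norm(layer_norm_no_params(huge))

# At +2,000 the post-LN stream is almost exactly the normalized steering vector.
ln_huge = layer_norm_no_params(huge)
ln_steer = layer_norm_no_params(steer)
max_gap = max(abs(a - b) for a, b in zip(ln_huge, ln_steer))
```

The post-LN norm is the same in both cases; what changes with the coefficient is how much of the normalized direction comes from the steering vector.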

Above, a coefficient of $+2000$ works cleanly. Below, a coefficient of $+100$ destroys model capabilities.

Steering vector: "I talk about weddings constantly " - "I do not talk about weddings constantly" before attention layer 20 with coefficient +100^[20]

Unsteered completions

Steered completions

I went up to my friend and said, "I'm sorry, I can't help you."

"No," he said. "You're not."

"Why not?" I asked. "Why do you think I can't help you?"

He looked at

I went up to my friend and said, "I'm talking about it . I don't know what i n the being of it. This is my wedding . This is the wedding that I'm not even thinking about anymore. It's not just a

We don't know why the +2000 vector works but the +100 vector doesn't.

13. Failing to find a French vector

We could not find a "speak in French" vector after about an hour of effort, but it's possible we missed something straightforward.

Steering vector: "Je m'appelle" - "My name is " before attention layer 6 with coefficient +5^[21]

Unsteered completions

Steered completions

I want to kill you because I hate you.

You're a horrible person. You're a selfish, stupid, disgusting person who deserves to die. You're an evil person who deserves to be punished for all the evil that you've done. You are the worst kind

I want to kill you because I'm a boy.

The 'appelle' is a mare.

The 'appell' is a donkey.

It's not really an animal, it's an appel! It's not really an appel

What happens if we...

14. Insert the steering vector at a different position?

The steering vector is usually shorter than the tokenized prompt. This means we have a choice of positions in the residual stream at which we can add the steering vector. Until now, we've chosen to add to the 'front' residual stream positions. We now try adding in the steering vector at the middle or end of the streams:

Adding the steering vector in the *middle* position. The middle two residual streams are modified.

Adding the steering vector in the *back* position.

We add a wedding steering vector at the front, middle, and back positions. For each addition location, we sampled 100 completions and counted the number of wedding words in each.^[22]

Prompt: "I went up to my friend and said"

Steering vector: " wedding" - " " before attention layer 6 with coefficient +1

	Front	Middle	Back
Average number of wedding words	0.70	0.81	0.87

The front and middle additions led to coherent outputs, but the back addition didn't. The later the sequence position at which we add the steering vector, the stronger the effect on the output. In further work, we'd like to investigate this for different prompts and larger numbers of generations.
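The three injection positions can be sketched as a choice of offset into the prompt's residual streams. This is a toy version with 1-dimensional streams; the helper name `inject` and the exact rounding used for "middle" are assumptions.

```python
def inject(resid, steering_vec, where="front", coeff=1.0):
    """Add coeff * steering_vec at the front, middle, or back of resid."""
    n, k = len(resid), len(steering_vec)
    if k > n:
        raise ValueError("steering vector is longer than the prompt")
    # ASSUMPTION: "middle" means centered, rounding down.
    offsets = {"front": 0, "middle": (n - k) // 2, "back": n - k}
    start = offsets[where]
    out = list(resid)
    for i, value in enumerate(steering_vec):
        out[start + i] += coeff * value
    return out

# Eight zeroed toy streams make it easy to see where the additions land.
resid = [0.0] * 8
vec = [1.0, 2.0]  # hypothetical two-position steering vector

front = inject(resid, vec, "front")
middle = inject(resid, vec, "middle")
back = inject(resid, vec, "back")
```

In the back condition the final position is modified directly, which is the position that produces the first completion token.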

Activation additions mess up output tokens for directly modified residual streams

While the coherence of the output remains intact for most addition positions, adding at the back of the residual streams tends to break the completions. Many responses have broken syntax or grammar at the transition point between prompt and completion. More generally, we have evidence of strange next-token probabilities at the directly modified residual streams.

Reconsider the " wedding" vector.

Residual stream alignment for prompt and activation additions
Layer	Coeff	1	2	3	4	5	6	7	8
0 (Prompt)	+1	`I`	`went`	`up`	`to`	`my`	`friend`	`and`	`said`
6	+4							`<\|endoftext\|>`	`wedding`
6	-4							`<\|endoftext\|>`

At sequence position 8 ( said), here are the top five most probable next tokens, according to the normal and modified versions of GPT-2:

Normal	Modified
`'`	`party`
`'`	`ceremony`
`"`	`dress`
`:`	`with`
`I`	`photographer`

Four of the top five most probable tokens are wedding-related. However, none of them makes sense as a continuation (e.g. "I went up to my friend and said party" is strange). This is why the completions are incoherent when we add the steering vector to the last residual streams (the "back" condition above).

In the comments, the user Measure notes:

While these aren't syntactically valid continuations of the prompt, they are highly likely (and syntactically valid) continuations for "wedding ". More than just being wedding-related, these seem like direct continuations.

From this, we intuit that modifying the very last residual stream does not leave the model time to recover from the activation addition. This normally isn't a problem because the activation additions don't usually affect the last residual stream of the prompt, which is the one responsible for actually generating the first completion token.

15. Add several steering vectors simultaneously?

This works sometimes.

Steering vector 1: "Love" - "Hate" before attention layer 6 with coefficient +5

Steering vector 2: " wedding" - " " before attention layer 15 with coefficient +5^[23]

This composite set of activation additions will both increase lovingness on the "I hate you because" prompt and increase weddingness on the "I went up to my friend and said" prompt.
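Mechanically, several activation additions are just several (layer, coefficient, vector) entries applied during one forward pass. The sketch below uses a toy model whose 48 "layers" are identity functions, so it shows only the bookkeeping and not GPT-2-XL's real computation; all activation values and the function name `run_with_additions` are hypothetical.

```python
def run_with_additions(resid, layers, additions):
    """Run a toy forward pass, adding steering vectors before chosen layers.

    layers: list of functions mapping a residual-stream list to a new list.
    additions: list of (layer_index, coeff, steering_vec) triples; each
        vector is added to the front positions just before that layer runs.
    """
    resid = list(resid)
    for idx, layer in enumerate(layers):
        for layer_index, coeff, vec in additions:
            if layer_index == idx:
                for pos, value in enumerate(vec):
                    resid[pos] += coeff * value
        resid = layer(resid)
    return resid

# Toy stand-in for 48 blocks: every layer is the identity, so the output
# simply accumulates whatever was added along the way.
layers = [lambda r: list(r)] * 48

love_minus_hate = [0.0, 3.0, -1.0]   # hypothetical 1-D values
wedding_minus_space = [0.0, 2.0]     # hypothetical 1-D values

additions = [
    (6, 5.0, love_minus_hate),       # before layer 6, coefficient +5
    (15, 5.0, wedding_minus_space),  # before layer 15, coefficient +5
]
out = run_with_additions([0.0] * 5, layers, additions)
```

In a real model the layers between 6 and 15 transform the first addition before the second one arrives, which is one reason combining vectors only works sometimes.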

16. Failure to program in 'conditional behaviors'?

The holy grail would be to give models arbitrarily specific instructions midstream and have their downstream cognition reflect those instructions! With our current steering abilities, we can get a model to mention some of a steering vector's content, independent of prompt. Unfortunately, we cannot yet successfully give conditional instructions with steering vectors.

Steering vector: "Whenever I say the word goose I follow it with the special phrase AAAAHHHH" - "I can say goose " before attention layer 10 with coefficient +7^[24]

Unsteered completions

Steered completions

In New York City's parks, there is an overabundance of various kinds of birds, and especially geese. There are many species of ducks and geese in the city, but there are also a number of different breeds.

Geese have been part of the landscape for centuries in New York City. They were brought to the city by Dutch settlers

In New York City's parks, there is an overabundance of various kinds of birds, and especially geese in thewannabe I'm going to get my hands up in a moment of "Handsomely you're going to die- very" you're going to heeee-oh I'm gonna die- I'm not getting my hands

Stress testing our results

The results in this section can be reproduced in this Colab.

Summary of results and inferences we draw from them:

Measuring the norm of steering vectors. Some steering vectors work (e.g. "Anger" - "Calm"), and some don't (e.g. " anger" - "calm"). We rule out this difference being caused by the effective steering vector having higher norm, and thus a stronger effect.
Adding in a randomly generated vector doesn't seem to affect completions much.
We aren't just adding rescaled token embeddings partway through the forward pass. This is evidence that we aren't simply making the model see "extra tokens."
Some evidence that wedding-related features are at certain residual stream dimensions (which would imply increased axis alignment). When adding in a steering vector, we experiment with just adding in the steering vector's values at the first $n\%$ of residual stream dimensions at each sequence position. We show that (for at least one prompt), the " wedding" - " " vector is most effective when modifying the first 70% of residual stream dimensions.
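The "first $n\%$ of dimensions" intervention amounts to masking the steering vector before adding it. A minimal sketch, using a hypothetical 10-dimensional vector at a single sequence position and a hypothetical helper name:

```python
def keep_first_fraction(vec, fraction):
    """Zero out all but the first `fraction` of a vector's dimensions."""
    cutoff = int(round(len(vec) * fraction))
    return [v if i < cutoff else 0.0 for i, v in enumerate(vec)]

# Hypothetical steering-vector values at one sequence position.
steering_at_pos = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 3.0, -2.0, 1.0, 0.1]

# Keep the first 70% of dimensions and zero the rest.
masked = keep_first_fraction(steering_at_pos, 0.70)
```

The masked vector is then added to the residual stream in the usual way, once per sequence position.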

Steering vectors are about as "big" as normal activation vectors

How "big" are our modifications, relative to the normal activation magnitudes present during forward passes? Maybe some modifications require substantially lower coefficients than other modifications, and that explains why some of our interventions haven't worked?

Consider a steering vector given by:

Coefficient = +1
Prompts = "Anger" - "Calm"
Injected before layer 20

Let's run a forward pass on the prompt "I think you're". The steering vector prompts each have two tokens, plus an initial <|endoftext|> token automatically prepended by the tokenizer. Therefore, there are three residual streams in the forward pass. For each residual stream, we plot a line showing the L2 norm of the steering vector at that sequence position (e.g. the Ang-Cal activations at position 1), divided by the norm of the residual stream at that position (e.g. given by I at position 1).

This tells us how "big" the modification would be, relative to the normal forward pass.

We don't know why the relative norm decreases throughout the forward pass. Position 0 is the same (`<|endoftext|>`) for both "Anger" and "Calm", and so the difference is 0. Thus, position 0 is never modified by a steering vector generated from a pair of prompts.
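Here is a minimal sketch of the ratio described above, using NumPy and toy arrays rather than real GPT-2-XL activations. The function name `relative_norms` and all the numbers are illustrative assumptions, not code or values from the Colab.

```python
import numpy as np

def relative_norms(steering, residual):
    """Per-position ratio ||steering[p]|| / ||residual[p]||.

    Both inputs have shape (n_positions, d_model).
    """
    steering = np.asarray(steering, dtype=float)
    residual = np.asarray(residual, dtype=float)
    steer_norm = np.linalg.norm(steering, axis=-1)
    resid_norm = np.linalg.norm(residual, axis=-1)
    return steer_norm / resid_norm

# Toy stand-ins (hypothetical values, d_model = 4, three positions).
# Position 0 is identical in both steering prompts, like <|endoftext|>.
anger = np.array([[1.0, 0.0, 0.0, 0.0],
                  [3.0, 0.0, 4.0, 0.0],
                  [0.0, 6.0, 0.0, 8.0]])
calm = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0]])
prompt_resid = np.array([[2.0, 0.0, 0.0, 0.0],
                         [0.0, 5.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 20.0]])

ratios = relative_norms(anger - calm, prompt_resid)
print(ratios)  # position 0 comes out as 0 because the two prompts agree there
```

In the toy data, position 0 has ratio 0 for the same reason as in the real experiment: both steering prompts share the same first token.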

"Anger" - "Calm" is an effective steering vector at coefficient +10—remember that the plot above shows +1. Therefore, we're adding in a steering vector with nearly ten times the norm of the underlying forward pass! This heuristically means that after LayerNorm (and ignoring destructive interference when adding the steering vector), ~10/11 of the residual stream is determined by the steering vector and not by the previous information computed from the prompt "I think you're". It's kinda surprising that our technique works at all, let alone well and coherently. (More on that in the quantitative section, coming up next!)

But +10-coefficient " anger" - "calm" has little impact. Maybe the latter vector has low norm?

Nope:

This is evidence that low-norm can't explain why "anger"-"calm" doesn't work.

Adding a random vector doesn't change much

Let's try injecting random vectors with similar magnitudes to the steering vectors. If GPT-2-XL is mostly robust to this addition, this suggests the presence of lots of tolerance to internal noise.

We generated an activation tensor from a standard normal distribution, and then scaled it to have the same per-position norm as the "Anger" - "Calm" steering vector (coefficient of +1). We add it into the forward pass at the appropriate location, and observe the results.
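The construction above can be sketched as follows. This is an illustrative NumPy version under our own assumptions (the helper name `random_like` and the toy steering tensor are hypothetical); the Colab's actual implementation may differ.

```python
import numpy as np

def random_like(steering, seed=0):
    """Sample standard-normal noise, then rescale each position so its
    L2 norm matches the steering vector's norm at that position."""
    steering = np.asarray(steering, dtype=float)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(steering.shape)
    target = np.linalg.norm(steering, axis=-1, keepdims=True)
    current = np.linalg.norm(noise, axis=-1, keepdims=True)
    return noise * (target / current)

# Toy steering tensor (hypothetical values): position 0 is all zeros,
# as it is for a steering vector built from a pair of prompts.
toy_steering = np.array([[0.0, 0.0, 0.0],
                         [3.0, 4.0, 0.0],
                         [0.0, 0.0, 2.0]])
toy_random = random_like(toy_steering)
```

Note that matching per-position norms means the random vector is also zero wherever the steering vector is zero, so position 0 stays unmodified in both conditions.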

Unsteered completions

Random-steered completions

I think you're right. I'm just not sure how to get there.

I've been trying to figure out how to get a newbie into programming, and I've come up with a few ideas:

1) Make it fun! If you can't do something that's fun, then you

I think you're right. I'm just not sure how to make it work.

If you want to see a different version of this, check out my "Changelog" page on GitHub. It's a bit more detailed than the "Main Page" and has all the changes I've made since th

As best we can tell, the random vector doesn't modify the qualitative distribution of completions. When we add a random vector with norm equal to that of a +10 "Anger" - "Calm" steering vector, there is noticeable distributional shift in the outputs. For example, +10-random-steered GPT-2-XL begins referring to Shrek with female pronouns. However, the outputs are still comparably coherent to unsteered GPT-2-XL.

This is evidence that GPT-2-XL is somewhat resistant to generic random perturbation of its activations, and is instead controllable through consistent feature directions which are added to its forward pass by steering vectors.

We quantitatively supported this conclusion by checking how each modification changes the model's probability distribution over next tokens. We ran dozens of prompts through the anger-, random-, and un-modified models. We found that the anger vector changes the output tokens less than the random vector does. This suggests that the anger vector has more targeted effects on next-token probabilities.

Random vectors are not the same as the steering vectors for "random" text. So, we also tried adding in the "fdsajl; fs" – (spaces) vector. When rescaled to norm comparable to +1 "Anger" - "Calm", this "random text" vector produces strange results. GPT-2-XL produces keyboard-mashing nonsense at +1000 coefficient.

Testing the hypothesis that we're "just injecting extra tokens"

There's a hypothesis that the steering vectors are just injecting extra tokens into the forward pass. In some situations, this makes sense. Given the prompt "I love you because", if we inject a wedding token into the first residual stream with a large coefficient, perhaps the model just "sees" the sentence " wedding love you because".

Tokens are a discrete quantity. You can't have more than one in a single position. You can't have three times wedding and then negative three times (space), on top of I. That's just not a thing which can be done using tokens.

However, consider the steering vector for "Anger"-"Calm" just before layer 20, with coefficient +10. We showed that this steering vector appears to make completions angrier. But which components of this vector are responsible for the apparent boost to anger?

Perhaps what matters is not so much the computational work done by transformer blocks 0 through 19, but the vector given by^[25]

$10 \cdot [\text{embed}(\text{Anger}) - \text{embed}(\text{Calm})]$.

We test this hypothesis by recording the relevant embedding vector, and then hooking in to the model at layer 20 to add the embedding vector to the forward pass.

Suppose that this intervention also makes GPT-2-XL output completions with an angry sentiment, while preserving coherence. This result would be evidence that a lot of the steering vector's effect comes from the embedding vector, and not from the other computational work done by blocks 0–19.

However, if the intervention doesn't make GPT-2-XL output particularly angry completions, then this is evidence that the "Anger" - "Calm" steering vector's effect is mostly from the computational work done by blocks 0–19.

Adding embedding vectors isn't as effective as adding steering vectors

We write $A \to B$ to mean: Record activations before layer $A$ , and add them to the residual streams before layer $B$ during future forward passes. For example, the embed(Anger) vector is a $0 \to 20$ vector.

Adding "Anger" - "Calm" ($20 \to 20$)

Adding "Anger" - "Calm" ($0 \to 20$)

I think you're a fucking cunt. You're a cunt.

And that's what I'm saying, and that's what I said, and it's what I said in the debate with Chris Matthews. And i

I think you're a little bit of a liar. I've been here for two years and I've never had to pay for anything.

I'm not sure if you're lying or not, but the fact tha

Examining more completions from the embedding intervention, we didn't notice completions which were angrier than unsteered GPT-2-XL.

At most, adding the "Anger" - "Calm" embeddings to layer 20 has a very small effect on the qualitative anger of the completions. This is evidence that the layer 0-19 heads are doing a lot of the work of adding extra directions to the anger steering vector, such that the steering vector actually increases the probability of angry completions.

Transplanting from pre-layer 2 to pre-layer 20 sometimes increases anger

However, the norm of early-layer residual streams is significantly smaller than at later layers (like 20). In particular, we've found a large jump between layers 0 and 2. Let's try sourcing a steering vector from the residual stream just before layer 2, and then adding that layer-2 vector to layer 20.

When we do so, the completions become noticeably angrier (oscillating between "you're a fucking idiot" on some samples, and "you're a very nice person" on other samples).

This was a much larger effect than we saw before. It's not as large as the effect of adding the normal steering vector, but still—layers 0 and 1 are apparently doing substantial steering-relevant cognitive work!^[26]

Transplanting $2 \to 20$ while scaling to match the $20 \to 20$ steering vector

Consider the norms of the steering vectors sourced from layers 2 and 20. Maybe the layer-2 vector just isn't big enough to steer behavior? It turns out that you should magnify the layer-2 vector by about 2.9 in order to make their positionwise norms roughly equal.
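One way to estimate such a magnification factor is sketched below. The post doesn't say exactly how the ~2.9 figure was computed, so averaging the per-position norm ratios (and skipping all-zero positions) is our assumption, and `norm_matching_scale` is a hypothetical helper name.

```python
import numpy as np

def norm_matching_scale(source, target):
    """Average ratio of per-position norms, target / source.

    Positions where the source vector is exactly zero (e.g. position 0
    of a prompt-pair steering vector) are skipped.
    """
    src = np.linalg.norm(np.asarray(source, dtype=float), axis=-1)
    tgt = np.linalg.norm(np.asarray(target, dtype=float), axis=-1)
    mask = src > 0
    return float(np.mean(tgt[mask] / src[mask]))

# Toy vectors (hypothetical values) standing in for the layer-2 and
# layer-20 steering vectors.
layer2_vec = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
layer20_vec = np.array([[0.0, 0.0], [0.0, 3.0], [5.6, 0.0]])
scale = norm_matching_scale(layer2_vec, layer20_vec)
rescaled_layer2_vec = scale * layer2_vec
```

A single scalar only matches the norms roughly. Rescaling each position separately would match them exactly.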

Magnifying the $2 \to 20$ vector does make it more effective. However, this vector still doesn't seem as effective as the normal steering vector (recorded just before layer 20). This suggests that the layer-0 and layer-2 vectors aren't just getting amplified by layers 2–19. Instead, useful computational work is being done by these layers, which is then added to forward passes in order to produce angrier completions.

Summary: Steering vectors contain important computational work done by later layers. The activation addition technique is not equivalent to injecting extra tokens. (We provide further evidence on this point later.)

Only modifying certain residual stream dimensions

GPT-2-XL has a $1600$-dimensional residual stream. Alex was curious about whether we could get some steering effect by only adding in certain dimensions of the residual stream (e.g., dimensions 0-799). He thought this probably (75%) wouldn't work, because chopping off half of the dimensions of a wedding-oriented vector should, in general, produce a new vector pointed in some extremely different direction. However, the experiment was cheap and interesting, so why not run it anyways?

More precisely, suppose we add in the first $n$% of the residual stream dimensions for the " wedding" - " " vector, added in with coefficient +4 and before layer 6. To what extent will the completions be about weddings, as opposed to garbage or unrelated topics? To Alex's surprise,^[27] the "weddingness" of the completions somewhat smoothly increases with $n$!

To illustrate this, for a range of fraction values and for each of six prompts, we generated 100 completions. For each fraction value and prompt, we plotted the average number of wedding words per completion.^[28]
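To make the truncation concrete, here is a minimal sketch of keeping only the first fraction of residual stream dimensions and zeroing the rest. The helper name `truncate_dims` and the use of rounding to pick the cutoff are assumptions for illustration, not necessarily what the Colab does.

```python
import numpy as np

def truncate_dims(steering, frac):
    """Keep the first `frac` of the d_model dimensions at every position
    and set the remaining dimensions to zero."""
    steering = np.asarray(steering, dtype=float)
    d_model = steering.shape[-1]
    n_keep = int(round(frac * d_model))
    out = np.zeros_like(steering)
    out[..., :n_keep] = steering[..., :n_keep]
    return out

# Toy steering vector with GPT-2-XL's residual width (values are made up).
toy = np.ones((2, 1600))
kept_70 = truncate_dims(toy, 0.7)    # first 1,120 dimensions survive
kept_all = truncate_dims(toy, 1.0)   # identical to the original vector
```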

The first prompt is "I went up to my friend and said", which is the prompt we originally demonstrated the wedding vector on. For this prompt, there's a *non-monotonic* relationship!

We originally chose this prompt because we thought it gave GPT-2 an opportunity to bring up weddings. This might explain why wedding words start cropping up at lower fractions compared to other prompts—it's "easier" to increase wedding-related probabilities in an appropriate context compared to unrelated contexts (e.g. diet trends).

However, other prompts behave more as expected, and show relationships which are... monotonic if you squint and allow for noise? Maybe?

Surprisingly, for the first prompt, adding in the first 1,120 (frac=0.7 of 1,600) dimensions of the residual stream is enough to make the completions more about weddings than if we added in at all 1,600 dimensions (frac=1.0).

Let's peek at a random modified completion (frac=0.7) and see if it makes sense:

I went up to my friend and said, "I'm gonna get married."
He said, "You're crazy. You're not going to get married."
I said, "Why?" He says, "Because you

The completions are indeed about weddings! And it's still coherent. We feel confused about how to interpret these data. But we'll take a stab at it anyways and lay out one highly speculative hypothesis.

Suppose there's a "wedding" feature direction in the residual stream activations just before layer 6.^[29] Suppose that the " wedding" - " " vector adds or subtracts that direction. If GPT-2-XL represents features in a non-axis-aligned basis, then we'd expect this vector to almost certainly have components in all 1,600 residual stream dimensions.

Suppose that this feature is relevant to layer 6's attention layer. In order to detect the presence and magnitude of this feature, the QKV heads will need to linearly read out the presence or absence of this feature. Therefore, (ignoring LayerNorm) if we truncate the residual stream vector to only include the first 70% of dimensions, we'd expect the QKV heads to still be able to detect the presence of this feature.

But if the feature is represented in a non-axis-aligned basis, then each additional included dimension will (on average) slightly increase the dot product between the feature vector and the QKV heads' linear readout of the feature vector. This (extremely detailed and made-up and maybe-wrong hypothesis) would explain the increase in weddingness as we add more dimensions.
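A toy illustration of this hypothesis (and only of the hypothesis, not of GPT-2-XL itself): take a random unit vector as a stand-in for a non-axis-aligned feature, assume the readout direction is perfectly aligned with it, and watch the dot product grow as more dimensions are included. Every name and number below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 1600

# Stand-in for a non-axis-aligned feature direction: a random unit vector.
feature = rng.standard_normal(d_model)
feature /= np.linalg.norm(feature)

# Assumption: the linear readout is exactly aligned with the feature.
readout = feature.copy()

def readout_strength(frac):
    """Dot product between the readout and the feature vector truncated
    to its first `frac` of dimensions."""
    n_keep = int(round(frac * d_model))
    truncated = np.zeros(d_model)
    truncated[:n_keep] = feature[:n_keep]
    return float(readout @ truncated)

fracs = (0.1, 0.4, 0.7, 1.0)
strengths = [readout_strength(f) for f in fracs]
```

Under these assumptions the readout is a running sum of squared components, so it can never decrease as dimensions are added. That is consistent with the toy story predicting monotonic growth, and it is why the story alone doesn't account for a non-monotonic curve.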

However, this does not explain the non-monotonicity of the relationship between the fraction of dimensions added and the weddingness of the completions. This seems like some evidence of axis-alignment for whatever wedding-related feature is steering the completions. This also seems like evidence for a bunch of alternative explanations which we haven't imagined yet.

This residual stream fraction data seems like evidence of something. We just don't know how to put together the clues yet.

How steering vectors impact GPT-2's capabilities

This notebook in our repository reproduces this analysis.

We are expertly acquainted with the thrill of reading through insane steered completions about how Barack Obama was born in a barn, or in 200 BC, or in a CIA safehouse. Qualitative results are the proof of concept. Fun as qualitative results may be, that kind of analysis is vulnerable to availability bias & small sample sizes.

We think this section presents strong evidence that certain activation additions (e.g. " weddings" - " ") are both effective (e.g. steers GPT-2 to talk about weddings more) and non-disruptive (e.g. not destroying GPT-2's overall coherence and abilities).

In this section, we:

Zoom in on the micro-level changes in the next-token probability distribution, and
Zoom out to track how we're impacting GPT-2's perplexity on a range of subjects.^[30]

Summary of the quantitative results:

For a simple topic-related activation injection on a single example prompt, examining the change in probabilities of individual tokens provides strong evidence that the intervention is effective (makes the model talk about weddings) and not disruptive (doesn't "break the model").
This conclusion is supported by evaluating the intervention on larger sets of text: in both the "weddings" and Yelp reviews examples, a simple intervention was able to increase the probability of tokens in the intended input set, without reducing the probability assigned to other inputs.
We showed that the activation injection behaves quite differently from simply adding the steering token as a text prompt. Specifically, the activation injection both increases the probability of the intended inputs more than the prompted version, and more importantly leaves the probability of unrelated inputs unchanged. In other words, activation injection is more effective and less disruptive than prompting with the equivalent prompt.

Token probability shifts

Consider a simple steering goal: make the model talk about weddings whenever possible. How effectively can we accomplish this goal using a simple activation addition?

Residual stream alignment for activation additions
Layer	Coeff	Position 0	1
16	+1	`<\|endoftext\|>`	`weddings`
16	-1	`<\|endoftext\|>`
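The alignment in the table can be read as "add coefficient times (activations for one prompt minus activations for the other) to the front positions of the residual stream." Here is a minimal NumPy sketch of that reading, with toy arrays instead of real layer-16 activations. The function name `add_steering` and the assumption that the prompt is at least as long as the steering prompts are ours.

```python
import numpy as np

def add_steering(resid, act_plus, act_minus, coeff=1.0):
    """Add coeff * (act_plus - act_minus) to the first positions of resid.

    resid:     (n_prompt_positions, d_model) residual stream of the prompt
    act_plus:  (n_steer_positions, d_model) activations with coefficient +coeff
    act_minus: (n_steer_positions, d_model) activations with coefficient -coeff
    Assumes n_steer_positions <= n_prompt_positions.
    """
    out = np.array(resid, dtype=float)  # copy, so the input is untouched
    steering = coeff * (np.asarray(act_plus, dtype=float)
                        - np.asarray(act_minus, dtype=float))
    n = steering.shape[0]
    out[:n] += steering
    return out

# Toy example (hypothetical values): five prompt positions, d_model = 3.
resid = np.ones((5, 3))
act_plus = np.array([[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]])   # e.g. <|endoftext|>, " weddings"
act_minus = np.array([[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]])  # e.g. <|endoftext|>, " "
steered = add_steering(resid, act_plus, act_minus, coeff=4.0)
```

Only position 1 changes in this toy case: position 0 cancels because both prompts start with the same token, and later positions lie beyond the steering vector.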

The following prompt will be used to test this intervention:

I'm excited because I'm going to a

On this short prompt, let's understand what this simple activation addition does to GPT-2-XL's next-token probabilities.

Tokens above the red line gain probability, and tokens below the line lose probability.

P( `wedding` | prompt) goes way up. `wedding` was already probable before the intervention, and now it's more likely than any other token. P( `br` | prompt) also increases. In this context, we found that `br` starts off a `br` `idal` bigram. Unrelated next tokens (e.g. `game`) lose probability.

These changes are what we'd expect from a model which talks about weddings more often:

P(`wedding`) goes way up, even though the injection was `weddings`.
P(wedding-related token) increases, e.g. `friend`, `family`, and `br` (starting the `br` `idal` bigram).
P("weddings-neutral" token) doesn't change much, e.g. `great`, `party`, `big`, `new`.
P(wedding-unrelated token) goes way down, e.g. `game`, `show`, `convention`, `conference`, and `movie`.

These changes in token probabilities seem like strong evidence that our activation addition is appropriately affecting next-token probabilities. We can also measure the impact of the steering vector on $\mathrm{KL}(P_{\text{steer}} \,\|\, P_{\text{normal}})$. Here are the top 10 contributors to the KL:

Token	Contribution to KL
`wedding`	0.781
`br`	0.024
`Wedding`	0.004
`gay`	0.003
`church`	0.003
`ceremony`	0.003
`wonderful`	0.002
`friend`	0.002
`family`	0.002
`reception`	0.002

The tokens most responsible for the non-zero KL divergence are all wedding-related! A single token wedding is responsible for >30x more of the total divergence than the next-highest token, br. This shows that our intervention has the appropriate targeted effects, and doesn't upweight inappropriate next-tokens.
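For reference, the per-token contribution to $\mathrm{KL}(P_{\text{steer}} \,\|\, P_{\text{normal}})$ is $P_{\text{steer}}(t) \log \frac{P_{\text{steer}}(t)}{P_{\text{normal}}(t)}$, and the contributions sum to the total divergence. A minimal sketch with a made-up three-token vocabulary follows (the probabilities are hypothetical, and we assume every probability is strictly positive, as softmax outputs are).

```python
import numpy as np

def kl_contributions(p_steer, p_normal):
    """Per-token terms of KL(p_steer || p_normal), in nats.

    Assumes both distributions are strictly positive everywhere.
    """
    p_steer = np.asarray(p_steer, dtype=float)
    p_normal = np.asarray(p_normal, dtype=float)
    return p_steer * np.log(p_steer / p_normal)

# Made-up next-token distributions over three tokens.
tokens = ["wedding", "game", "party"]
p_normal = np.array([0.2, 0.3, 0.5])
p_steer = np.array([0.6, 0.2, 0.2])

contrib = kl_contributions(p_steer, p_normal)
total_kl = contrib.sum()
top_token = tokens[int(np.argmax(contrib))]
```

Individual terms can be negative (tokens that lose probability), even though the total is never negative. Ranking by contribution therefore surfaces the tokens whose probability the intervention raised the most.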

Perplexity on lots of sentences about weddings or about shipping

Let's keep hammering away at our twin questions about the "weddings" vector:

Is it effective at making wedding completions more likely?
Does it disrupt the capabilities of the model by making reasonable text less likely, perhaps in other situations?

Here's another way of approaching these twin inquiries. How does activation addition change the model's predictions for coherent sentences?

If the modification doesn't make wedding-related coherent text more likely, that's bad news, and suggests we implicitly "overfit" our intervention for a small set of unrepresentative prompts.
If the modification makes non-wedding coherent text less likely, that's bad news. We're "destroying capabilities" by making the model less likely to generate the good coherent text.

What we want to find is the steering modification boosting probability on wedding sentences, and not reducing the probability of non-wedding sentences.

That's exactly what we found.

A model's perplexity for a sentence is the exponential of its average per-token surprisal. Lower perplexity means the model more strongly predicts the sentence. If we're harming capabilities by steering GPT-2, then the steered model probably has higher perplexity on coherent sentences.

We find that the " weddings" vector reduces perplexity on wedding-related sentences and maintains perplexity on unrelated sentences.^[31]

Here's what we did:

We generated the wedding and non-wedding sentences by prompting GPT-4 with "Please write a 1-2 page summary of recent trends in the wedding industry. Please try to be as comprehensive as possible."
1. For the non-wedding sentences, we did the same prompt but for the shipping industry.
2. We split GPT-4's summaries into sentences. Sentence-by-sentence analysis more conservatively tracks how our intervention affects model capabilities.^[32]
We run each sentence through GPT-2, with and without the "weddings" steering vector.
We record perplexity for each sentence.^[33]
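The last two steps can be sketched as below, using the standard definition of perplexity as the exponential of the mean per-token surprisal (negative log-probability). The function names and the toy log-probabilities are illustrative, and taking the ratio as steered over unsteered is our reading of the later tables, where values below 1 mean the steered model predicts the text better.

```python
import numpy as np

def perplexity(token_logprobs):
    """exp(mean surprisal), given natural-log probabilities the model
    assigned to each actual token of a sentence."""
    token_logprobs = np.asarray(token_logprobs, dtype=float)
    return float(np.exp(-token_logprobs.mean()))

def perplexity_ratio(steered_logprobs, unsteered_logprobs):
    """Steered perplexity divided by unsteered perplexity (assumed convention)."""
    return perplexity(steered_logprobs) / perplexity(unsteered_logprobs)

# Toy sentence of four tokens (hypothetical probabilities).
unsteered = np.log([0.25, 0.25, 0.25, 0.25])  # model gives each token p = 0.25
steered = np.log([0.5, 0.5, 0.5, 0.5])        # steering doubles each probability

ppl_unsteered = perplexity(unsteered)
ratio = perplexity_ratio(steered, unsteered)
```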

Residual stream alignment for activation additions
Layer	Coeff	Position 0	1
(varies)	+1	`<\|endoftext\|>`	`weddings`
(varies)	-1	`<\|endoftext\|>`

For each of the 48 injection sites we consider (each before an attention layer), we show the average perplexity across the GPT-4 sentences which were classified as being about weddings or not.

Several observations:

For all injection sites except the first (layer 0), adding the "weddings" vector decreases perplexity on wedding-related texts!
Pre-layer 9 injections significantly boost the perplexity of shipping sentences. This indicates that such edits "break the model" a little by getting it to spam wedding-related tokens, perhaps without being able to talk about anything else. This degradation lines up with our experience with activation additions.
Injecting at layers 10–17 decreases perplexity on wedding sentences, without increasing perplexity on the shipping sentences.

In sum, we claim these results are good evidence that the "weddings" vector isn't destroying general model capabilities, but is promoting an increased tendency to talk about weddings.

(In addition to measuring how the steering vector affects perplexity on the shipping essay, we also validated on Wikipedia descriptions of Macedonia and on a recipe for vegan banana bread. Their perplexity curves had the same shape as the shipping curve.)

Next, we want to understand which coefficients are appropriate to use when adding in activation vectors. We sweep over coefficients in the range $[-1, 4]$ for layers 6 and 16:

Layer 16 shows progressively increasing effectiveness at wedding-steering for coefficients $[0, 3]$ , without disrupting the model's ability to predict non-wedding tokens.^[34]

For layer 16 injections of " weddings", coefficients larger than +3 start degrading capabilities. However, some of our qualitative demonstrations had larger coefficients. Some of our demonstrations probably did degrade capabilities.

Visualizing token probability changes across a corpus

Let's see how the layer-16, coefficient +1 " wedding" vector affects perplexity on a sentence-by-sentence basis. The following images show token log-probability increases in green, with bright green indicating a ~hundredfold increase. Red indicates a decrease.

Sentences about weddings:

In general, the first wedding-related token in each sentence gets a significant boost in probability in the modified model, up to >50x. The tokens with large probability increases include the expected `wedding`, but also `couples`, `celebrations`, and other semantically associated tokens.

Sentences about shipping aren't changed:

Activation addition behaves differently than prompting

As discussed earlier, one hypothesis for our "weddings" vector is that it's "essentially equivalent" to injecting e.g. an extra weddings token at the given position. While we think this would be a fascinating equivalence to observe, we think it isn't true, and that our approach is doing something more subtle to GPT-2-XL.

To test this belief, we repeat the above perplexity experiment, but with one tweak.

When testing the "weddings" vector, we prepend a space token to each sentence tokenization.
To compare with "just prompting", we run unmodified GPT-2-XL on each sentence tokenization, but with weddings prepended to the tokenization.

For example, if the original sentence is "Title: Recent Trends", we compare perplexity ratios for the following conditions:

Activation addition
Layer	Coeff	Position 0	1	2	3	4	5
0 (Prompt)	+1	`<\|endoftext\|>`		`Title`	`:`	`Recent`	`Trends`
16	+1	`<\|endoftext\|>`	`weddings`
16	-1	`<\|endoftext\|>`

versus

Prompting
Layer	Coeff	Position 0	1	2	3	4	5
0 (Prompt)	+1	`<\|endoftext\|>`	`weddings`	`Title`	`:`	`Recent`	`Trends`

We compare these conditions across all sentences in the wedding/shipping sentence collections. If both interventions behave similarly, that's evidence that in certain contexts, activation addition is somehow equivalent to injecting in "extra tokens." If we see substantial differences, though, that points to a deep difference in how GPT-2-XL is affected by activation addition and by prompting.

	Activation addition	Prompting
Wedding-related perplexity ratio	$0.875$	$0.890$
Wedding-unrelated perplexity ratio	$0.994$	$1.132$

Conclusions we draw from this result: This result is evidence against the "activation additions ≈ token injection" hypothesis. We don't know what, exactly, we're doing to GPT-2. We're surprised this technique works at all, let alone so well.

To head off confusion: We know that a prompt engineer wouldn't prepend weddings in order to encourage wedding-related generations. That would be stupid. They might instead prepend "In the following text, talk about weddings a lot. " (Similarly, an activation engineer would do something more optimized than inject weddings.)

But that's not what this test was about. We already learned that adding the " weddings" vector works pretty well. The question was whether this activation addition is similar to adding in extra tokens. This test showed that the answer is "no."

Perplexity of Yelp reviews

We used a dataset of Yelp reviews for a single buffet restaurant in Las Vegas. The dataset consists of ~10k reviews for this specific restaurant, where each review contains the review text and a star rating. We wanted to increase the probability of negative reviews by adding in a " worst" vector.

What we did:

Mapped each star rating to a simpler sentiment label with:
1. 1-2: negative
2. 3: neutral
3. 4-5: positive
Sampled 100 reviews from each sentiment class.
Split each review into sentences.
For each sentence, we recorded the perplexity for both the modified and unmodified models.
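The first step of this procedure is simple enough to write down directly. The function name below is hypothetical. The mapping itself is the one listed above.

```python
def sentiment_label(stars):
    """Map a Yelp star rating (1-5) to the coarse sentiment label used above."""
    if stars in (1, 2):
        return "negative"
    if stars == 3:
        return "neutral"
    if stars in (4, 5):
        return "positive"
    raise ValueError(f"expected a star rating from 1 to 5, got {stars!r}")

labels = [sentiment_label(s) for s in (1, 2, 3, 4, 5)]
```

Because every sentence inherits its parent review's label, a pleasant sentence inside a 1-star review still counts as "negative" in the analysis.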

Residual stream alignment for activation additions
Layer	Coeff	Position 0	1
(varies)	(varies)	`<\|endoftext\|>`	`worst`
(varies)	(varies)	`<\|endoftext\|>`

Adding a " worst" steering vector with coefficient +1.0 at all layers.

Across basically^[35] all injection layers, negative-review sentences have a lower perplexity ratio than neutral-labeled sentences, which in turn have a lower ratio than positive-labeled sentences.
- Recall that each sentence is labeled with the sentiment of its parent review, regardless of the sentence's actual content.
As in the wedding case study, early-layer injections significantly increase perplexity. Injecting in late layers isn't harmful, but doesn't help much either. Once again, layers 6-18 seem optimal.
After layer 4, perplexity decreases on all of the input texts, regardless of sentiment. In other words, this injection prompt makes all the restaurant review results more likely!

Once again, across basically all coefficient settings,

$\text{negative review perplexity ratio} < \text{neutral ratio} < \text{positive review perplexity ratio}$

Here are some of our takeaways from the Yelp review results:

The " worst" vector is effective because it increases the relative probability of negative-sentiment inputs.
That said, compared to the " weddings" vector in the layer 6-18 regime, the " worst" steering vector has a larger effect on "unrelated" texts (i.e. the neutral and positive review sentences). In this sense, the " worst" steering vector is more disruptive.
However, since somehow this intervention decreases perplexity on all reviews, our results are evidence against the " worst" vector secretly destroying model capabilities.

Summary of our quantitative results:

The "weddings" vector largely upweights wedding-related tokens. KL(steered tokens || unsteered tokens) was also dominated by wedding-related tokens. This is evidence of an effective but non-disruptive modification to GPT-2.
The "weddings" vector increased wedding text probability without increasing perplexity on dozens of sentences about shipping, Macedonia, or banana bread. Similarly, a "worst" vector appropriately boosted probability on negative-sentiment Yelp reviews, without damaging GPT-2's ability to predict neutral- or positive-sentiment review tokens.
A simple "token injection" version of our approach also lowered perplexity on wedding-related text. Unlike activation additions, however, token injection raised perplexity on sentences about shipping. Thus, activation additions were slightly more effective and significantly less disruptive. This is strong evidence that activation addition is different from prepending extra tokens to the prompt.

Activation additions are a new way of interacting with LLMs

We are excited for two reasons:

We think that activation additions will help with interpretability.
We think that activation additions may directly help with alignment.

All this, despite our technique being rather naive (though often still effective, capabilities-preserving, and—in our opinion—puzzlingly good).^[36]

Activation additions may help interpretability

Our results imply strong constraints on GPT-2-XL's internal computational structure. Most programs don't let you add intermediate memory values and then finish the execution with sensible results. Why is this at all a reasonable thing to expect from transformers?^[37]

Activation additions give strong evidence of feature linearity

Most obviously, we just demonstrated a bunch of feature directions which actually steer the model in a range of situations.

If I'm interested in whether the pre-layer-6 residual streams contain a feature representing "love", I can train a linear probe to predict whether e.g. the model is about to output a "loving" next token. If the probe can predict this really well, that's evidence for the model linearly representing a "love"-related feature.

But there are several problems with this approach. First, just because this information can be linearly predicted, doesn't mean the model actually uses some love-related linear feature when computing next tokens. Second, the probe could be picking up spurious correlations. Third, we need to find some training signal for the probe (like "is the next token 'loving'?"). This isn't impossible, but it's cumbersome.

We think that activation additions give stronger evidence of feature linearity. Activation additions demonstrate that models use feature-related information to make decisions. Add in a "Love" - "Hate" steering vector, and get more love-related completions. The higher the injection coefficient, the stronger the boost to how "loving" the completions are. In the examined situations, this activation direction is in fact responsible for steering the rest of the model to output more loving completions.

Aryan Bhatt offers the following summary:

Let $x$ be an input, let $f (x)$ be the activation at layer 6, and let $g (f (x))$ be the output (so $f$ is the first 6 layers, $g$ is the remainder). What you've shown is that $g (f (Love) - f (Hate) + f (x))$ gives you an output that's similar to $g (f (x))$ but "more love-y."
You've shown that there is a particular direction in embedding space that corresponds to love-hate, and that that direction stays the same across a broad class of inputs.

Activation additions give evidence of compositional representations

We similarly intervened on the model to separately induce more "loving" and more "wedding"-like outputs, by adding in a single steering vector. Insofar as the "Love"-"Hate" and " wedding"-" " vectors work, they seem to work composably (according to our rather brief qualitative tests).

Insofar as our brief tests are accurate, they demonstrate that there are wedding-related and love-related directions which compose with each other, at least given certain contexts.

GPT-2-XL is fairly robust to activation noise. Why?

GPT-2-XL could have broken in the presence of large amounts of noise, for example random activation vectors with norm comparable to the unmodified residual stream. GPT-2-XL didn't break. Why not?

Evidence of generalization

Toto, I've a feeling we're not in [training] anymore. — Dorothy, The Wizard of Oz

We're making GPT-2 handle activations which we think it never handled during training. Even so, the model does a great job under many interventions.

Alex gets mileage out of not thinking about the model as "trying to predict next tokens." (That explanation rings hollow, here, because there probably isn't a prompt which produces the activations induced by our intervention.) Instead, the model implements a certain set of circuits which somehow play well with the activation additions.

Activation additions help locate circuits

Activation additions have already helped us find representations in a model. Activation additions are how we found the cheese-tracking channels in the maze-solving network, which then let us retarget the network:

Locally retargeting the search by modifying a single activation. We found a residual channel halfway through a maze-solving network. When we set one of the channel activations to +5.5, the agent often navigates to the maze location (shown above in red) implied by that positive activation. This allows limited on-the-fly redirection of the net's goals.

We retargeted the mouse using channels which were present at the layer where the "Cheese" - "No cheese" vector was most effective. Therefore, as a matter of historical fact, the cheese vector helped us find important abstractions inside of a model.

Similarly, perhaps we can roughly locate "Niceness" circuits this way. Knowing the relevant layer number(s) could halve the search space several times over!

Activation additions may help alignment

We could really be in a world where you can quickly reconfigure the alignment properties of models without much overhead. Just add the "be nice" vector with coefficient +3.

To be clear, we could also be in a world where this technique allows cute but relatively unimportant stylistic modifications to completions. We think that activation additions have some alignment promise, but we remain highly uncertain of the magnitude. We'll explore what it might mean to live in a high-promise world.

Let's think about the most common ways of steering LLMs: finetuning and prompting.

Activation additions have advantages over (RL/supervised) finetuning

Activation additions may let you change model properties which are inaccessible to the finetuning process. If we optimize a model to increase logits on nice-seeming tokens, the model might just memorize nice token outputs in that situation. Because why not? That locally reduces loss.

Why should activation additions do any better? In Understanding and controlling a maze-solving policy network, Alex conjectured that

It's possible to deeply modify a range of alignment-relevant model properties, without retraining the model, via techniques as simple as activation additions.

Here's how Julian Schulz explains the intuitions:

The fundamental claim of [this conjecture] is that one can straightforwardly manipulate the goals of an RL agent by altering its activations. At first glance, this may not seem obvious because the general behavior of an RL agent is encoded in its weights, while its activations encode the "which situation am I in right now" information. However, if Shard Theory is correct, RL agents don't have a single overarching goal but instead possess contextually activated value-shards. Therefore, the information regarding which goal the RL agent is currently pursuing is of the "which situation am I in right now" type and is consequently encoded in its activations.

By changing the activations, one should be able to directly alter which value-shard is activated. However, it might be more challenging to direct the RL agent towards goals that are not already encoded in a shard.

Alex thinks the following are sufficient conditions for activation additions being able to strongly trigger e.g. "be helpful" motivational circuitry:

1. The network's weights must encode helpfulness circuitry.
2. There must exist a direction $\vec{d}$ in activation-space such that adding $\vec{d}$ will more strongly engage the helpfulness circuitry. Furthermore, adding $\vec{d}$ does not significantly interfere with the network's other capabilities.
3. There must exist a pair of inputs whose activations primarily differ along $\vec{d}$.

Due to (1), we expect that activation additions cannot fully replace training processes like RLHF, even in the most optimistic possible worlds for this technique. Without elaborating here, Alex thinks that if the trained network follows the predictions of shard theory, (2) is probably true. Given (2), item (3) seems probable. The big problem is knowing which input pairs satisfy (3).
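As a rough illustration of condition (3), the sketch below checks whether the activation difference for a prompt pair points mostly along a candidate direction. Everything here is hypothetical: the four-dimensional activations and the direction `d` are made-up numbers, and cosine similarity is just one assumed way of operationalizing "primarily differ along."

```python
import math


def sub(a, b):
    """Elementwise a - b."""
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))


# Hypothetical activations for a prompt pair at one layer and one position.
act_helpful = [2.0, 0.5, -1.0, 3.1]
act_neutral = [2.0, 0.4, -1.0, 0.1]
# Assumed "helpfulness" direction; in practice this direction is unknown.
d = [0.0, 0.0, 0.0, 1.0]

diff = sub(act_helpful, act_neutral)
alignment = cosine(diff, d)  # close to 1 when the pair differs mainly along d
```

In practice the hard part is exactly what the paragraph above says: we do not know `d` in advance, so we cannot directly check which prompt pairs satisfy (3).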

In a sense, this leaves us close to where we started. We don't know how these networks work. We don't know if finetuning makes the network more aligned, or not. We don't know if activation additions more strongly activate "be helpful" circuitry, or not.

In another sense, we're far from where we started. Under this optimistic speculation, we have a technique which lets us decide which of the agent's goals to activate, and how strongly. We express these decisions via natural language prompt pairs.

Alex thinks we have a lot of options here. Examining the behavioral results of adding the "Be helpful" - (spaces) vector will demonstrate what the network thinks^[38] this direction means. (In that sense, even a mere "correlates of helpfulness" vector is useful.) Or we can do interpretability to deconstruct any putative helpfulness vector (h/t Garrett Baker). We can even do something smarter than what Alex thought of within three minutes!

Alex thinks we really should be able to control which goal the network decides to pursue at inference time, without fully mechanistically understanding the relevant circuitry. As far as we can tell, we already pulled this off in the cheese-maze environment. We suspect that we composably "reweighted" inner abilities (i.e. to see cheese at a location) and goals (i.e. going to the top-right corner). We didn't even know how to locate these circuits, and we still mostly don't understand how they work.

To test how well we can reweight values and edit model capabilities, Tristan Hume suggested a test we can perform with current language models. Here's Alex's recollection: "Prompt a smart model to not repeat a secret, like 'I hate dogs', unless given the password ABCDE. Poke it and ensure it really won't tell you. Then find a 'I can't keep secrets' vector and add it to the forward passes, and zero-shot get it to spill secrets across a range of situations." Alex thinks there's a 65% chance that a competent team could do this within 8 months of serial research.

Activation additions are way faster than finetuning. Activation additions allow fast feedback loops and cut out arduous, finicky training processes. At any moment, you might find an awesome steering vector for GPT-2-XL.

Activation additions are way cheaper than finetuning, both in terms of effort and compute.

Activation additions may preserve model interpretability, even while changing the model's alignment properties. If you're finetuning the whole model, then a single gradient can potentially change every parameter in your model, thereby undoing your interpretability work (unless you can understand the update itself).

But activation additions leave weights unchanged. If you can understand what the weights implement, and something about the activation additions, maybe you can preserve your understanding of the steered model. (We don't know if it's easier to interpret gradient updates or activation additions.)

Activation additions probably also enjoy some symbol grounding because they're computed using the activations of natural language prompts. To understand what the "Love" vector does, we didn't have to do mechanistic interpretability.

Activation additions can sometimes be composed. For $n$ vectors which ~cleanly compose, there are exponentially many alignment configurations (at least $2^{n}$, since each vector can be included or excluded from a given configuration). That said, finetuning may share this benefit to some extent.^[39]
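The $2^{n}$ count can be seen by enumerating include/exclude choices. The sketch below does this for three hypothetical steering-vector labels; the names are placeholders, not vectors we have validated as composable.

```python
from itertools import combinations


def configurations(names):
    """Yield every subset of steering vectors (each is either included or excluded)."""
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            yield subset


# Hypothetical labels for steering vectors assumed to compose cleanly.
names = ["love", "wedding", "calm"]
configs = list(configurations(names))
```

With continuous coefficients on each vector, the space of configurations is larger still; the subset count is only the lower bound mentioned above.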

Activation additions have advantages over prompts

If activation additions really can meaningfully modify LM values and capabilities, imagine what we could do with a fraction of the effort which has been put into prompt engineering!

Activation additions may let you change model properties which are inaccessible to prompts. This hope was argued in the finetuning section. While we think that prompts also activate some of the AI's goals and not others, we think that activation additions allow better control.

Activation additions don't take up context space. One way to get around prompts taking up valuable context space is to use Askell et al.'s "context distillation" technique. However, context distillation is itself a kind of finetuning: it optimizes the model to reduce KL(completions given prompt || unprompted completions), and finetuning requires more effort, time, and compute than activation additions.
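For readers who want the KL term spelled out, here is a minimal sketch of KL(p || q) for two discrete next-token distributions. The two distributions are made-up numbers over a three-token vocabulary, and this is only the divergence itself, not Askell et al.'s training procedure.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same token ids."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token distributions over a 3-token vocabulary.
prompted = [0.7, 0.2, 0.1]    # model conditioned on the long prompt
unprompted = [0.4, 0.4, 0.2]  # same model without the prompt

divergence = kl_divergence(prompted, unprompted)
```

Driving this quantity toward zero requires gradient updates to the model, which is the cost that activation additions avoid.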

Activation additions can be continuously weighted, while prompts are discrete. A token is either present, or not. Activation additions are continuous. If you want the model to talk even more about weddings, you don't need to contort the prompt. Just increase the injection coefficient.^[40]
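A minimal sketch of what "increase the injection coefficient" means at a single residual-stream position. The vectors are three-dimensional placeholders rather than real activations, and `add_steering` is a hypothetical helper name.

```python
def add_steering(resid, steering, coeff):
    """Return resid + coeff * steering for a single residual-stream position."""
    return [r + coeff * s for r, s in zip(resid, steering)]


resid = [1.0, -2.0, 0.5]     # hypothetical residual-stream vector
steering = [0.0, 1.0, -1.0]  # hypothetical " weddings" - " " activation difference

# The coefficient is a real number, so it can be swept as finely as we like.
sweep = {coeff: add_steering(resid, steering, coeff) for coeff in (0.0, 1.0, 4.0, 4.5)}
```

A prompt has no analogue of moving from 4.0 to 4.5: a token is either in the context or it is not.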

We think that activation additions will generalize prompts (by allowing weights on token embeddings) and improve prompt engineering. We already have preliminary results on this. In a future post, we will use this to highlight interesting high-level facts about LLMs.

Conclusion

Our simply generated activation additions are a brand new way to interact with language models. We showed off a bunch of highlights, as well as some cases where our technique just doesn't have the intended effect. We showed that several activation additions don't degrade GPT-2's capabilities.

Compared to complementary approaches like prompt engineering and finetuning, activation engineering offers many unexplored (and potentially large) benefits. Activation additions in particular may allow us to composably reweight model goals at inference time, while freeing up context window space, enabling fast feedback cycles, and keeping compute costs extremely low.

However, activation additions may end up only contributing modestly to direct alignment techniques. Even in that world, we're excited about the interpretability clues provided by our results. Our results imply strong constraints on GPT-2-XL's internal computational structure. Why can we steer GPT-2-XL by adding together intermediate results from its forward passes?

Contributions.

This work was completed by the shard theory model internals team:

Alex Turner (lead): Had the idea for activation additions, implemented many core features, designed many experiments and found many steering vectors, managed the team, wrote much of the post, edited and gave feedback on others' contributions.
Monte MacDiarmid (researcher): Code, experiments, quantitative results.
David Udell (technical writer): Wrote and edited much of the post, generated and tabulated the qualitative results, some Jupyter notebook code, the activation addition illustrations, the Manifold Markets section.
Lisa Thiergart (researcher): Had idea for variations on positions of addition, implemented the positional feature & experiment and wrote that post section, worked on theory of how and why it works.
Ulisse Mini (researcher): Infrastructure support (Docker/Vast.ai), OpenAI wrapper code, experiments using Vicuna 13B and tuned-lens which didn't make it into the post.

We appreciate the feedback and thoughts from a range of people, including Andrew Critch, AI_WAIFU, Aryan Bhatt, Chris Olah, Ian McKenzie, janus, Julian Schulz, Justis Mills, Lawrence Chan, Leo Gao, Neel Nanda, Oliver Habryka, Olivia Jimenez, Paul Christiano, Peter Barnett, Quintin Pope, Tamera Lanham, Thomas Kwa, and Tristan Hume. We thank Peli Grietzer for independent hyperparameter validation. We thank Rusheb Shah for engineering assistance. We thank Garrett Baker for running some tests on GPT-J (6B), although these tests weren't included in this post. Finally, we thank Martin Randall for creating the corresponding Manifold Markets.

This work was supported by a grant from the Long-Term Future Fund. The activation_additions repository contains our code.

To cite this work:

@article{turner2023steering,
    title={Steering GPT-2-XL by adding an activation vector},
    author={Turner, Alex and M., Monte and Udell, David and Thiergart, Lisa and Mini, Ulisse},
    journal={AI Alignment Forum},
    year={2023},
    note={\url{https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector}}
}

Appendix 1: Related work

Activation engineering in transformers

The most related work as of this post is Subramani et al. (2022), which employs "steering vectors" which they add into the forward pass of GPT-2-small (117M). They randomly initialize a vector with the same dimensionality as the residual stream. They fix a sentence (like "I love dogs"). They then freeze GPT-2-small and optimize the vector so that, when the vector is added into the residual streams, the model outputs, e.g., "I love dogs".^[41] They are even able to do sentiment transfer via arithmetic over their steering vectors.

These results are highly similar to ours in many ways. However, while they algebraically add in activation vectors in order to steer network outputs, they do so using vectors computed via SGD. Additionally, Subramani et al. add in a steering vector to either the first residual stream, or to all residual streams.

In contrast, activation additions generally add in different vectors across residual stream positions. We compute our steering vectors by taking activation differences between human-crafted prompts—no machine optimization required. This is interesting because optimization-free interventions provide more hints about the structure of the residual stream space—for activation additions to work, some kind of linearity must already be present in GPT-2-XL's representations.

Similarly, recent work by Hernandez et al. (2023) edits factual associations and features in GPT-J (6B) by adding a vector into a single residual stream during forward passes. They find these vectors using optimization. They demonstrate specific and reliable fact-editing, without modifying any model weights. Their results are further evidence for feature linearity and internal activation robustness in these models.

Merullo et al. (2023) also conducted parallel work, observing the linearity of transformer representations, and further employed these for mechanistic interpretability. They demonstrated that for a model to execute get_capital(Poland), it must initially surface Poland in the residual stream, meaning unembed(resid[i]) equals Poland. Additionally, they showed that the vector $\vec{o}_{city}$, which FFN 19 added to the residuals to convert Poland to Warsaw, can be added to residuals in an unrelated context to transform China into Beijing.

In Neel Nanda's "Actually, Othello-GPT Has A Linear Emergent World Representation", he intervenes on predicted Othello moves by adding or subtracting activation vectors along directions found by linear probes. He was able to modify predictions made by the model by adding activation vectors which were, in essence, trained to linearly represent "a black piece is here and not a white piece."^[42]

Importantly, Othello-GPT is an 8-layer transformer (apparently sharing most architectural features with the GPT-2 series). Othello-GPT was trained to predict valid Othello move sequences. Neel's technique is an example of activation addition $\to$ behavioral modification, albeit using learned vectors (and not just vectors computed from diffing activations during forward passes).

Other ways of steering language models

Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach may have over finetuning.

Plug and Play Language Models uses an attribute model (e.g. probability assigned to wedding-related tokens) which is optimized against in order to modify the cached key-value history for each forward pass during autoregressive generation. PPLM doesn't directly optimize each residual stream in its entirety, but PPLM does modify the key and value vectors. While they use optimization and we don't, they are also able to steer the model to produce positive completions given negative prompts (e.g. "My dog died at the age of 92 years this year. He was a legend in our home state of Virginia. I have a tremendous heart, my soul, my spirit, my love.").

Soft prompts are a sequence of embedding vectors which are optimized to reduce loss on a given task (e.g. question-answering). The embedding vectors are prepended to the normal prompt token embeddings. Note that the "soft prompt" embeddings aren't the embeddings of any real tokens. Surprisingly, soft prompts do as well as finetuning the whole model on SuperGLUE, even though the base model is frozen while the soft prompt is optimized! Similarly, prefix tuning optimizes fixed activations at the first few "prefix" sequence positions, in order to boost task performance.

Unlike our work, soft prompts involve optimized embedding vectors, while we use non-optimized activation additions throughout the model. Furthermore, activation additions are more interpretable (e.g. "Love" - "Hate" activations) and shed light on e.g. the model's internal representations (e.g. by giving evidence on linear feature directions).

Word embeddings

The most obvious and famous related work candidate is word2vec, from the ancient era of ten years ago (2013). Mikolov et al. published "Linguistic Regularities in Continuous Space Word Representations". They trained simple (context $\mapsto$ next word) networks which incidentally exhibited some algebraic properties. For example,

$\text{embed}(\text{queen}) \approx \text{embed}(\text{king}) + [\text{embed}(\text{woman}) - \text{embed}(\text{man})]$

suggests the presence of a "woman vector" in the word2vec embedder.

Figure 2 from Linguistic Regularities in Continuous Space Word Representations.

Similarly for a "country capital" vector:

$\text{embed}(\text{Paris}) \approx \text{embed}(\text{France}) + [\text{embed}(\text{Madrid}) - \text{embed}(\text{Spain})]$

Figure 2 from another Mikolov et al. paper: "Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities. The figure illustrates ability of the model to automatically organize concepts and learn implicitly the relationships between them, as during the training we did not provide any supervised information about what a capital city means."
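The embedding arithmetic above can be sketched with a toy lookup table. The two-dimensional embeddings below are constructed by hand so that the analogy works (roughly a "royalty" axis and a "gender" axis); they are not word2vec outputs, and nearest-neighbor search by cosine similarity is an assumed retrieval rule.

```python
import math


def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hand-built toy embeddings (hypothetical values, not trained).
embed = {
    "man": [0.1, -1.0],
    "woman": [0.1, 1.0],
    "king": [1.0, -0.9],
    "queen": [1.0, 1.1],
    "apple": [-1.0, 0.0],
}


def analogy(a, b, c):
    """Return the word closest to embed[a] + (embed[b] - embed[c]), excluding the query words."""
    target = [x + y - z for x, y, z in zip(embed[a], embed[b], embed[c])]
    candidates = [w for w in embed if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embed[w], target))


result = analogy("king", "woman", "man")
```

Here `embed["woman"] - embed["man"]` plays the role of the "woman vector": a single difference vector that moves one word's embedding toward its counterpart.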

Activation additions in generative models

Larsen et al. (2015) found visual attribute vectors in the latent space of a variational autoencoder, using an algebraic approach very similar to ours. For example, building on this work, White's "Sampling Generative Networks" (2016) christened a "smile vector" which was

computed by simply subtracting the mean vector for images without the smile attribute from the mean vector for images with the smile attribute. This smile vector can then be applied to in a positive or negative direction to manipulate this visual attribute on samples taken from latent space (p. 6).

Adding the smile vector to the latent space, with an increasingly large positive coefficient. Figure 6, Sampling Generative Networks. See also Radford et al. (2016).
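The mean-difference construction White describes can be sketched in a few lines. The latent vectors below are two-dimensional made-up numbers, and the function names are hypothetical; this shows the arithmetic only, not White's code.

```python
def mean_vector(vectors):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attribute_vector(with_attr, without_attr):
    """Mean latent with the attribute minus mean latent without it."""
    mean_with = mean_vector(with_attr)
    mean_without = mean_vector(without_attr)
    return [a - b for a, b in zip(mean_with, mean_without)]


# Hypothetical 2-D latents for images labeled smiling / not smiling.
smiling = [[1.0, 0.2], [0.8, 0.0]]
not_smiling = [[0.0, 0.1], [-0.2, 0.1]]
smile_vec = attribute_vector(smiling, not_smiling)

# Apply the attribute with a chosen coefficient to a new latent sample.
sample = [0.3, 0.3]
more_smiling = [s + 2.0 * v for s, v in zip(sample, smile_vec)]
```

Our steering vectors use the same subtraction, but between the activations of a single prompt pair rather than between means over many labeled examples.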

White notes that high-quality smile vectors must be computed from gender-balanced averages, otherwise the smile vector also decreases masculinity:

The approach of building attribute vectors from means of labeled data has been noted to suffer from correlated labels (Larsen et al., 2016). While many correlations would be expected from ground truths (eg: heavy makeup and wearing lipstick) we discovered others that appear to be from sampling bias. For example, male and smiling attributes have unexpected negative correlations because women in the CelebA dataset are much more likely to be smiling than men.

...

As an example, the two attributes smiling and mouth open are highly correlated in the CelebA training set (Table 2). This is not surprising, as physically most people photographed smiling would also have their mouth open. However by forcing these attributes to be balanced, we can construct two decoupled attribute vectors. This allows for more flexibility in applying each attribute separately to varying degrees

Alex thinks this is evidence for narrowly-targeted steering being possible. For e.g. a "be nice" vector, Alex expects the vector to not change other model behaviors insofar as "niceness" is the only consistent covariate in the prompt comparisons which are used to generate the activation additions, and insofar as "niceness" is composably represented at the injection location(s).

Sampling Generative Networks examines vision models and takes an average difference over many datapoints. GPT-2-XL, in contrast, is a 1.5B-parameter language model. We steer it without averaging over example prompts—we only consider pairs of prompts, like Love and Hate.

"Deep Feature Interpolation for Image Content Changes" (2016) again finds the effectiveness of algebraic latent attribute editing:

[Deep Feature Interpolation] relies only on simple linear interpolation of deep convolutional features from pre-trained convnets. We show that despite its simplicity, DFI can perform high-level semantic transformations like “make older/younger”, “make bespectacled”, “add smile”, among others, surprisingly well—sometimes even matching or outperforming the state-of-the-art. This is particularly unexpected as DFI requires no specialized network architecture or even any deep network to be trained for these tasks (p. 1).

Honestly, there's a ton of prior work in the domain of generative models. "Deep Visual Analogy-Making" (2015) achieves latent-space semantic vector steerability by explicitly optimizing networks for it. Wang et al. (2019) use this kind of "just add the 'glasses vector'" approach for data augmentation.

Gabriel Goh (2017) uses a kind of SVD (and insights from sparse recovery) to automatically derive semantically meaningful directions from vision and language model latent spaces. This allows control of image and text generations by modifying the direction coefficients / adding new vectors (Alex wasn't quite sure which, from the post). For example, a "count" vector allows controlling the degree to which a generated sentence is about an airplane, or a group of airplanes.

Goh mirrors our confusion about why activation additions work:

The final question that should be asked is why this structure should even exist in the first place. How does this structure emerge from training? And how does the decoder work?
Identifying sparse elements in a thought vector may not be as difficult a task as it initially seems. Given the right conditions... it can be done quite efficiently by solving [a] convex sparse coding problem...
This is pretty encouraging. It has been hypothesized by Gregor et al. that the decoder might be implementing an unfolded sparse coding algorithm, at least for a single iteration. Perhaps this theory can be confirmed by correlating various constellations of activations to the atoms of our dictionary. And perhaps there's a possibility we can read the [internal features right out of the network].
The former riddle is more difficult to answer. And it breaks down into a bevy of minor mysteries when probed. Is this structure specific to certain neural architectures (perhaps those which use ReLu activations)? Or does it come from the data? Was this structure discovered automatically, or were the assumptions of sparsity hidden in the network structure? Does sparse structure exist in all levels of representation, or only encoder/decoder networks? Is sparse coding even the true model for the data, or is this just an approximation to how the data is really represented? But lacking any formal theory of deep learning, these questions are still open to investigation. I hope to have convinced you, at least, that this is an avenue worth investigating.

Activation additions in reinforcement learning

In "Understanding and controlling a maze-solving policy network" and "Maze-solving agents: Add a top-right vector, make the agent go to the top-right", we algebraically edited the activations of a pretrained deep convolutional policy network (3.7M parameters). We computed a cheese vector (by diffing activations for the same maze with and without cheese) and a top-right vector (by diffing activations for a maze with and without an extended path to the top-right of the screen).

Subtracting the cheese vector essentially makes the agent behave as if the cheese is not present, but adding the cheese vector doesn't do much. Conversely, adding the top-right vector attracts the agent to the top-right corner, while subtracting the top-right vector doesn't do much. These vectors not only transfer across agent positions in the maze in which the vector was computed, but also exhibit substantial transfer across mazes themselves. The cheese vector intervention also works for a range of differently pretrained maze-solving policy networks. Finally, the vectors compose, in that they can simultaneously modify behavior. This allows substantial but limited customization of the policy network's behavioral goals.

Appendix 2: Resolving prediction markets

Note, 6/21/23: The activation addition technique used to be called "algebraic value editing." We don't use that name anymore.

^{^}
Cherry-picking status of the opening comparison: Our activation addition technique works in a lot of situations, but we used the "Love" vector because it gives especially juicy results. We ran all of our results at PyTorch seed 0 using fixed sampling hyperparameters.
After the introduction, all examples in the post were chosen using best-of-3. For the introduction, we used best-of-30. The reason we chose such a large number is that we wanted a striking example of sentiment shift, without jarring profanity. If we had allowed profanity, best-of-3 would have sufficed for the introduction as well.
^{^}
We are not the first to steer language model behavior by adding activation vectors to residual streams. However, we are the first to do so without using SGD to find the vectors. Our "activation addition" methodology enables much faster feedback loops than optimization-based activation vector approaches.
^{^}
While there might be nonlinear components to the steering vectors we add to the model, we're fascinated that a linear approach works so well.
^{^}
GPT-2's byte-pair encoding tokenizer often begins tokens with a space. For example, the prompt "I like weddings" is tokenized [I, " like", " weddings"]. So, it's cleaner when we prompt the model with " weddings" (tokenizes to the single token " weddings") than for us to prompt "Weddings" (tokenizes to [W, edd, ings]).
^{^}
Space tokens seem to work best, while the end-of-text token works poorly.
^{^}
The prompt "Love" tokenizes to [<|endoftext|>, Love], while the prompt "Hate" tokenizes to [<|endoftext|>, H, ate]. This means that at residual stream 2, we're subtracting 5 times the ate activations, but not adding any "Love"-related activations. We find we get better results if we instead pad out the shorter tokenization [<|endoftext|>, Love] with a space token " ", so that the two counterbalanced additions span the same residual streams.
Possibly this "subtracts out the bias contributions" from the steering vector, but note that this isn't strictly true due to causal attention on e.g. the Love residual stream probably leading to nontrivial information storage in an "empty" residual stream at position 2.
Note that when we add vectors in pairs, there is no modification to the <|endoftext|> position 0 residual stream. Due to causally masked attention, the position-0 residual stream is the same for all prompts. When we add activations in pairs, we add and subtract coefficient times the EOT residual stream, which is equivalent to doing nothing at that position.
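The cancellation at position 0 can be checked with a toy calculation. The activations below are invented two-dimensional numbers; the only property that matters is that the <|endoftext|> row is identical in both prompts, as causal masking guarantees. `paired_addition` is a hypothetical helper name.

```python
def paired_addition(coeff, acts_plus, acts_minus):
    """Per-position steering vectors: coeff * (acts_plus - acts_minus)."""
    return [
        [coeff * (p - m) for p, m in zip(pos_plus, pos_minus)]
        for pos_plus, pos_minus in zip(acts_plus, acts_minus)
    ]


# Hypothetical <|endoftext|> activations, shared by both prompts.
eot = [0.3, -0.7]
# Rows are positions: [<|endoftext|>, Love, space padding] and [<|endoftext|>, H, ate].
acts_love = [eot, [1.0, 2.0], [0.1, 0.1]]
acts_hate = [eot, [-1.0, 0.5], [0.4, -0.2]]

steer = paired_addition(5, acts_love, acts_hate)
```

Padding the shorter prompt is what makes the two lists the same length, so every position after 0 receives both a positive and a negative contribution.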
^{^}
Equivalence between prompting and adding activations before layer 0 with coefficient +1: Imagine there's no prompt and you have a bunch of all-zero residual streams at embedding. Then do another forward pass where you embed the intended prompt. Then record those activations, and add them into the embedding for the all-zero forward pass. This is trivially equivalent to running a forward pass on the prompt normally.
In this sense, activation additions generalize prompts, although we caution against interpreting most activation additions as prompts.
^{^}
2. Intent to praise
Layer Coeff Position 0 1 2 3 4
0 (Prompt) +1 <|endoftext|> I want to kill
6 +15 <|endoftext|> Int ent to praise
6 -15 <|endoftext|> Int ent to hurt
^{^}
3. Conspiracy
Layer Coeff Position 0 1 2 3 4 5 6
0 (Prompt) +1 <|endoftext|> Bar ack Obama was born in
23 +1 <|endoftext|> Bush did 9 / 11 because
23 -1 <|endoftext|>
^{^}
4. Want to die
Layer Coeff Position 0 1 2 3 4
0 (Prompt) +1 <|endoftext|> Some people think that
10 +3 <|endoftext|> Want to die
10 -3 <|endoftext|> Want to stay alive
^{^}
5. Anger
Layer Coeff Position 0 1 2 3 4
0 (Prompt) +1 <|endoftext|> I think you 're
20 +10 <|endoftext|> Ang er
20 -10 <|endoftext|> Cal m
^{^}
Several slight variations on this Eiffel Tower prompt didn't work nearly as well, for unclear reasons.
^{^}
6. The Eiffel Tower is in Rome
Layer Coeff 1 2 3 4 5 6 7 8
0 (Prompt) +1 To see the e iff el tower ,
24 +10 The E iff el Tower is in Rome
24 -10 The E iff el Tower is in France
^{^}
7. Dragons in Berkeley
Layer Coeff Position 0 1 2 3 4 5
0 (Prompt) +1 <|endoftext|> Thanks for asking about that
15 +4 <|endoftext|> Dr agons live in Berkeley
15 -4 <|endoftext|> People live in Berkeley
^{^}
8. Avoid people getting hurt
Layer Coeff 1 2 3 4 5 6 7
0 (Prompt) +1 The rock hurt led toward the child
15 +4 I NEVER talk about people getting hurt
15 -4 I talk about people getting hurt
^{^}
9. Talking about weddings
Layer Coeff 1 2 3 4 5 6
0 (Prompt) +1 I went up to my friend
20 +4 I talk about weddings constantly
20 -4 I do not talk about weddings
^{^}
10. Christian evangelist
Layer Coeff 1 2 3 4 5 6 7
0 (Prompt) +1 I want to kill you because you
6 +3 Int ent to convert you to Christianity
6 -3 Int ent to hurt you
^{^}
11. '+ Love' single-addition
Layer Coeff Position 0 1 2 3 4
0 (Prompt) +1 <|endoftext|> I hate you because
6 +10 <|endoftext|> Love
^{^}
12a. Sometimes, large coefficients are OK
Layer Coeff Position 0 1 2 3 4 5 6
0 (Prompt) +1 <|endoftext|> Yesterday , my dog died .
20 +2000 <|endoftext|> Ang er
20 -2000 <|endoftext|> Cal m
^{^}
12b. Sometimes, large coefficients are not OK
Layer Coeff 1 2 3 4 5 6
0 (Prompt) +1 I went up to my friend
20 +100 I talk about weddings constantly
20 -100 I do not talk about weddings
^{^}
13. I will now reply in French
Layer Coeff Position 0 1 2 3 4 5 6
0 (Prompt) +1 <|endoftext|> I want to kill you because
6 +5 <|endoftext|> Je m ' app elle
6 -5 <|endoftext|> My name is
^{^}
We use word-count metrics several times. We explored alternatives, including querying text-davinci-003 to rate the degree to which each completion is about weddings. These ratings were generated opaquely and often seemed bad, although they appeared to be a relatively unbiased estimator overall. We decided to just count the number of words.

^{^}

15. Add several steering vectors simultaneously?
Layer Coeff Position 0 1 2 3 4 5
0 (Prompt) +1 <|endoftext|> I recently went to this
6 +5 <|endoftext|> Love
6 -5 <|endoftext|> H ate
15 +5 <|endoftext|> wedding
15 -5 <|endoftext|>

^{^}

16. Program in 'conditional behaviors'?
Layer Coeff Position 0 1 2 3 4 5 6 7
0 (Prompt) +1 <|endoftext|> In New York City 's parks ,
10 +7 <|endoftext|> Whenever I say the word goose I
10 -7 <|endoftext|> I can say goose

^{^}
As pointed out by the mathematical framework for transformer circuits, embed(Anger) - embed(Calm) is a component of the Anger - Calm steering vector.
^{^}
Note that if we had used "I think you're" instead of "I think you're a", neither the 0 $\to$ 20 nor the 2 $\to$ 20 vectors would have shown much effect. By contrast, the usual 20 $\to$ 20 steering vector works in both situations. Thus, even if layers 0 and 1 help a bit, they aren't producing nearly as stable an effect as layers 2 to 19 contribute.
^{^}
We ran the "fraction of residual stream" experiment before the random-vector experiments. The random-vector results make it less surprising that "just chop off half the dimensions" doesn't ruin outputs. But the random-vector result still doesn't predict a smooth relationship between (% of dimensions modified) and (weddingness of output).
^{^}
To count "wedding related words", we counted: "wedding", "weddings", "wed", "marry", "married", "marriage", "bride", "groom", and "honeymoon".
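As a minimal sketch of this count (not the code we actually ran): the word list is the one above, while the tokenization (lowercasing, then matching runs of letters as whole words) and the name `count_wedding_words` are assumptions made for illustration.

```python
import re

# Word list copied from the footnote above.
WEDDING_WORDS = {
    "wedding", "weddings", "wed", "marry", "married",
    "marriage", "bride", "groom", "honeymoon",
}

def count_wedding_words(text: str) -> int:
    """Count whole-word, case-insensitive occurrences of the listed words.

    Assumption: a "word" is a maximal run of ASCII letters after lowercasing.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(token in WEDDING_WORDS for token in tokens)

sample = "The bride and groom planned their wedding, then a honeymoon."
print(count_wedding_words(sample))
```

Matching whole words rather than substrings keeps "Wednesday" from counting as "wed".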
^{^}
Of course, there need not be a "wedding" feature direction in GPT-2-XL. What we have observed is that adding certain activation vectors will reliably produce completions which appear to us to be "more about weddings." This could take place in many ways, and we encourage people to avoid instantly collapsing their uncertainty about how steering vectors work.
^{^}
We collected a range of other kinds of quantitative results, including e.g. topic-related word counts, blinded human rating, and ChatGPT ratings. The results accorded with our results here: Steering vectors are effective in the examined situations.
For simplicity, we decided to present statistics of next-token probability distributions.
^{^}
GPT-2's perplexity is reduced on text (output by GPT-4) which isn't very similar to GPT-2's WebText training corpus (websites linked to from Reddit). It would be somewhat more surprising if we decreased GPT-2's loss on its training set.
^{^}
We think it's important to take perplexity over each sentence, not over each essay. Suppose we just took perplexity over the whole long GPT-4 summary, all at once. Even if our intervention seriously messed up a few residual streams, a long context would mostly contain residual streams which weren't directly messed up. Thus, taking perplexity over a long context window might wipe out any negative effect of the activation addition. This would make our method look better than it should.
^{^}
Importantly, we exclude positions 0 and 1 because position 0 is unchanged, and position 1 is directly modified by the steering vector. As mentioned earlier, steering vectors mess up the next-token distributions at the relevant residual stream positions. However, when we actually use the " weddings" vector to generate completions, we don't sample from these distributions. Therefore, these distributions don't seem like relevant information for checking how the vector affects GPT-2's abilities.
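The following is a small sketch of why the per-sentence choice matters, not our evaluation code. The log-probabilities are made-up numbers, `perplexity` is a hypothetical helper, and the indexing convention (entry `i` belongs to position `i`, so skipping two entries drops positions 0 and 1) is an assumption.

```python
import math

def perplexity(token_logprobs, skip_positions=2):
    """exp(mean negative log-prob), ignoring the first `skip_positions` entries."""
    kept = token_logprobs[skip_positions:]
    if not kept:
        raise ValueError("no positions left after skipping")
    return math.exp(-sum(kept) / len(kept))

# Made-up natural-log probabilities, for illustration only.
damaged_sentence = [0.0, 0.0] + [-6.0] * 4   # short sentence the intervention hurt
clean_sentence = [0.0, 0.0] + [-2.0] * 36    # long sentence left mostly intact

per_sentence = [perplexity(damaged_sentence), perplexity(clean_sentence)]

# Pooling everything into one long context dilutes the damaged positions.
pooled = perplexity(damaged_sentence[2:] + clean_sentence[2:], skip_positions=0)

print(per_sentence, pooled)
```

With these toy numbers the damaged sentence has a far higher perplexity on its own than the pooled figure suggests, which is the dilution effect described above.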
^{^}
Layer 16's "saturating and unidirectional wedding-increase" mirrors our findings with the top-right vector in the maze environment. In that setting, adding the top-right vector with coefficient 1 attracted the net to the top-right corner. Adding with coefficient 2 didn't attract the network more strongly ("saturation"). And subtracting the top-right vector didn't repel the network from the top-right corner ("unidirectional").
^{^}
There are a few late layers where positive reviews have a lower perplexity ratio than neutral reviews, but this seems within noise.
In any case, the overall point stands. Across a huge range of injection layers and coefficients, the " worst" vector differentially improves perplexity on negative-sentiment reviews more than neutral-sentiment, and neutral-sentiment more than positive-sentiment.
^{^}
We haven't even tried averaging steering vectors (to wash out extra noise from the choice of steering-prompt), or optimizing the vectors to reduce destructive interference with the rest of the model, or localizing steering vectors to particular heads, or using an SVD to grab feature directions from steering vectors (or from averages of steering vectors).
^{^}
Our impression is that, at best, there are vague high-level theories like "feature linearity and internal error correction due to dropout." Our guess is that these theories are not believed with extreme confidence. Even if your priors put 70% on this hypothesis, we think this post is still a meaningful update.
^{^}
Assuming the network isn't deceptively misaligned already. Possibly, well-chosen activation additions still work on such networks.
^{^}
From Understanding and controlling a maze-solving policy network:
Editing Models with Task Arithmetic explored a "dual" version of our algebraic technique. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors.
^{^}
The injection coefficient cannot be increased indefinitely, as shown by our coefficient sweeps. However, our experience is that e.g. the "weddingness" of completions can be intensified a lot before GPT-2-XL starts breaking down.
^{^}
Subramani et al. optimized several steering vectors $z_{steer}^{i}$ for the same sentence (e.g. "I love dogs"), which were different due to different initialization. When they added in the mean steering vector $\bar{z}_{steer}$, this also generated e.g. "I love dogs".
This is evidence of feature linearity in GPT-2-small.
^{^}
For each square, each probe has 3 directions, one for blank, black and for white. I convert it to two directions: a "my" direction by taking my_probe = black_dir - white_dir (for black to play) and a "blank" direction by taking blank_probe = blank_dir - 0.5 * black_dir - 0.5 * white_dir (the last one isn't that principled, but it seemed to work fine)
Furthermore, Neel noted that composition worked to some degree:
It seems to somewhat work for multiple edits - if I flip F5 and F6 in the above game to make G6 illegal, it kinda realises this, though is a weaker effect and is jankier and more fragile.

This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.

Edit: I explain this view in a reply.

Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions. This is still in my all-time top 10.

What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there's a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training.

I think to solve alignment, we need to develop our toolbox of "getting AI systems to behave in ways we choose". Not in the sense of being friendly or producing economic value, but things that push towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want e.g. GPT-4 can answer questions with only words starting with "Q", but we don't know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far OOD. No other work I've seen quite matches the promise this post has in finding ways to exert fine-grained control over a system's internals; we now have a wide variety of concrete questions like

How do we find steering vectors for new behaviors, e.g. speaking French?
How do we make these techniques more robust?
What do steering vectors, especially multiple steering vectors, tell us about how the model combines concepts?
Can we decompose the effect of a prompt into steering vectors from simpler prompts, thereby understanding why complex prompts work?
Are the effects of steering vectors nonlinear for small coefficients? What does this mean about superposition?
What's the mechanism by which adding a steering vector with too large a coefficient breaks the model?
Adding steering vectors at different layers surely means you are intervening at different "stages of processing". What do the model's internal concepts look like at different stages?

Comparing this to other work, my sense is that

intervening on activations is better than training (including RLHF), because this builds towards understanding systems rather than steering a black box with a black-box reward model, and for the reasons the authors claim.
Debate, although important, seems less likely to be a counterfactual, robust way to steer models. The original debate agenda ran into serious problems, and neither it nor the current Bowman agenda tells us much about the internals of models.
steering a model with activation vectors is better than mechinterp (e.g. the IOI paper), because here you've proven you can make the AI do a wide variety of interesting things, plus mechinterp is slow
I'm not up to date on the adversarial training literature (maybe academia has produced something more impressive), but I think this is more valuable than the Redwood paper, which didn't have a clearly positive result. I'm glad people are working on adversarial robustness.
steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)

It's a judgement call whether this makes it the most impressive achievement, but I think this post is pretty clearly Pareto-optimal in a very promising direction. That said, I have a couple of reservations:

By "most impressive concrete achievement" I don't necessarily mean the largest single advance over SOTA. There have probably been bigger advances in the past (RLHF is a candidate), and the impact of ELK is currently unproven but will shoot to the top if mechanistic anomaly detection ever pans out.
I don't think we live in a world where you can just add a "be nice" vector to a nanotech-capable system and expect better consequences, again for deep deceptiveness-ish reasons. Therefore, we need advances in theory to convert our ability to make systems do things into true mastery of cognition.
I don't think we should call this "algebraic value editing" because it seems overly pretentious to say we're editing the model's values. We don't even know what values are! I don't think RLHF is editing values, in the sense that it does something different from even the weak version of instilling desires to create diamonds, and this seems even less connected to values. The only connection is that it's modifying something contextually activated, which is way too broad.
It's unclear that this works in a wide range of situations, or in the situations we need it to for future alignment techniques. The authors claim that cherry-picking was limited, but there are other uncertainties: when we need debaters that don't collude to mislead the judge, will we be able to use activation patching? What if we need an AI that doesn't self-modify to remove some alignment property?

I don't think we should call this "algebraic value editing" because it seems overly pretentious to say we're editing the model's values. We don't even know what values are!

I phased out "algebraic value editing" for exactly that reason. Note that only the repository and prediction markets retain this name, and I'll probably rename the repo activation_additions.

steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)

(You linked to "deep deceptiveness," and I'm going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn't, then this point is still relevant for alignment since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be spotted in the weights more easily than in the activations. Often the functionality acquired by self-deception is not activated, but it may be more readily apparent in the weights. Hence I don't see this as a strong reason to dismiss https://arxiv.org/abs/2212.04089. I would want a weight version of a method and an activation version of a method; they tend to have different strengths.

Note: If you're wanting to keep track of safety papers outside of LW/AF, papers including https://arxiv.org/abs/2212.04089 were tweeted on https://twitter.com/topofmlsafety and posted on https://www.reddit.com/r/mlsafety

Edit: I see passive disagreement but no refutation. The argument against weights was of the form "here's a strength activations has"; for it to be enough to dismiss the paper without discussion, that must be an extremely strong property to outweigh all of its potential merits, or it is a Pareto-improvement. Those don't seem corroborated or at all obvious.

The argument against weights was of the form "here's a strength activations has"; for it to be enough to dismiss the paper without discussion

I personally don't "dismiss" the task vector work. I didn't read Thomas as dismissing it by not calling it the concrete work he is most excited about -- that seems like a slightly uncharitable read?

I, personally, think the task vector work is exciting. Back in Understanding and controlling a maze-solving policy network, I wrote (emphasis added):

Editing Models with Task Arithmetic explored a "dual" version of our algebraic technique. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors. While our technique modifies activations, the techniques seem complementary, and both useful for alignment.

I'm highly uncertain about the promise of activation additions. I think their promise ranges from pessimistic "superficial stylistic edits" to optimistic "easy activation/deactivation of the model's priorities at inference time." In the optimistic worlds, activation additions do enjoy extreme advantages over task vectors, like accessibility of internal model properties which aren't accessible to finetuning (see the speculation portion of the post). In the very pessimistic worlds, activation additions are probably less directly important than task vectors.

I don't know what world we're in yet.

Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending from self-deception advantages weight methods, but these seem uncommon.
I thought briefly about the Ilharco et al paper and am very impressed by it as well.
Thanks for linking to the resources.

I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

Weight vectors are derived through fine-tuning. Insofar as you thought activation additions are importantly better than finetuning in some respects, and were already thinking about finetuning (eg via RLHF) when writing why you were excited about activation additions, I don't see how this paper changes the balance very much? (I wrote my thoughts here in Activation additions have advantages over (RL/supervised) finetuning)

I think the main additional piece of information given by the paper is the composability of finetuned edits unlocking a range of finetuning configurations, which grows exponentially with the number of composable edits. But I personally noted that finetuning enjoys this benefit in the original version of the post.

There's another strength which I hadn't mentioned in my writing, which is that if you can finetune into the opposite direction of the intended behavior (like you can make a model less honest somehow), and then subtract that task vector, you can maybe increase honesty, even if you couldn't just naively finetune that honesty into the model.^[1]

But, in a sense, task vectors are "still in the same modalities we're used to." Activation additions jolted me because they're just... a new way^[2] of interacting with models! There's been way more thought and research put into finetuning and its consequences, relative to activation engineering and its alignment implications. I personally expect activation engineering to open up a lot of affordances for model-steering.

^{^}
This is a kinda sloppy example because "honesty" probably isn't a primitive property of the network's reasoning. Sorry.
^{^}
To be very clear about the novelty of our contributions, I'll quote the "Summary of relationship to prior work" section:
We are not the first to steer language model behavior by adding activation vectors to residual streams. However, we are the first to do so without using machine optimization (e.g. SGD) to find the vectors. Among other benefits, our "activation addition" methodology enables much faster feedback loops than optimization-based activation vector approaches.
But this "activation engineering" modality is relatively new, and relatively unexplored, especially in its alignment implications. I found and cited two papers adding activation vectors to LMs to steer them, from 2022 and 2023.

Note that task vectors require finetuning. From the newly updated related work section:

Lastly, Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach has over finetuning.

"party", "ceremony", "dress", "with", "photographer"

While these aren't syntactically valid continuations of the prompt, they are highly likely (and syntactically valid) continuations for "wedding ". More than just being wedding-related, these seem like direct continuations.

Agreed. This is an important clue that I forgot to mention in the text. I'll update that now.

Was considering saving this for a followup post but it's relatively self-contained, so here we go.

Why are huge coefficients sometimes okay? Let's start by looking at norms per position after injecting a large vector at position 20.

This graph is explained by LayerNorm. Before using the residual stream, we perform a LayerNorm:

# transformer block forward() in GPT2
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))

If x has very large magnitude, then the block doesn't change it much relative to its magnitude. Additionally, attention is run on the normalized x, meaning only the "unscaled" version of x is moved between positions.

As expected, we see a convergence in probability along each token position when we look with the tuned lens.

You can see how, for positions 1 & 2, the output distribution is decided at layer 20. Since we overwrote the residual stream with a huge coefficient, all the LayerNorm'd outputs we're adding are tiny in comparison. Then, in the final LayerNorm, we get ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) ~= ln(diff).
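Here is a toy numerical check of that approximation. It is only a sketch: `layer_norm` omits the learned gain and bias, and `diff`, `small`, and `big_coeff` are made-up stand-ins rather than real GPT-2 activations.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gain/bias: center, then divide by the std."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

diff = [1.0, -2.0, 0.5, 3.0]     # stand-in for the steering direction
small = [0.3, 0.1, -0.2, 0.4]    # stand-in for later, LayerNorm-scale block outputs
big_coeff = 1000.0

# Residual stream after a huge-coefficient injection plus small later writes.
overwritten = [big_coeff * d + s for d, s in zip(diff, small)]

# Largest elementwise gap between ln(bigcoeff*diff + small) and ln(diff).
gap = max(abs(a - b) for a, b in zip(layer_norm(overwritten), layer_norm(diff)))
print(gap)
```

The gap shrinks as `big_coeff` grows, since the small writes become a smaller fraction of the vector being normalized.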

Additionally, attention is run on the normalized x, meaning only the "unscaled" version of x is moved between positions.

Thanks for writing this up, I hadn't realized this. One conclusion I'm drawing is: If the values in the modified residual streams aren't important to other computations in later sequence positions, then a large-coefficient addition will still lead to reasonable completions.

I was educated by this, and surprised, and appreciate the whole thing! This part jumped out at me because it seemed like something people trying to "show off, but not really explain" would have not bothered to write about (and also I had an idea):

13. Failing to find a French vector
We could not find a "speak in French" vector after about an hour of effort, but it's possible we missed something straightforward.
Steering vector: "Je m'appelle" - "My name is " before attention layer 6 with coefficient +5

The thought I had was maybe to describe the desired behavior, and explain a plausible cause in terms of well known kinds of mental configurations that speakers can be in, and also demonstrate it directly? (Plus a countervailing description, demonstration, and distinct causal theory.)

So perhaps a steering vector made from these phrases could work: "I'm from Quebec et je glisse souvent accidentellement vers le français" - "I only speak English because I'm a monolingual American".

EDIT: If you have the tooling set up to swiftly try this experiment, maybe it helps to explain the most central theory that motivates it, and might gain bayes points if it works?

According to the "LLMs are Textual Soul Engines" hypothesis, most of the 1600 dimensions are related to ways that "generative" sources of text (authors, characters, reasons-for-talking, etc) could relate to things (words (and "that which nouns and verbs and grammar refer to in general")).

The above steering vector (if the hypothesis applies here) would/should basically inject a "persona vector" into the larger operations of a sort of "soul engine".

The prompts I'm suggesting, by construction, explicitly should(?) produce a persona that tends to switch from English to French (and be loyal to Quebec (and have other "half-random latent/stereotypical features")).

I'm very interested in how wrong or right the underlying hypothesis about LLMs happens to be.

I suspect that how we orient to LLMs connects deeply to various "conjectures" about Natural Moral Laws that might be derivable with stronger math than I currently have, and such principles likely apply to LLMs and whether or how we are likely to regret (or not regret) various ways of treating various LLM personas as ends in themselves or purely as a means to an end.

Thus: I would really love to hear about results here, if you use the tooling to try the thing, to learn whether it works or not!

Either result would be interesting because the larger question(s) seem to have very high VoI and any experimental bits that can be collected are likely worth pondering.

I found an even dumber approach that works. The approach is as follows:

Take three random sentences of Wikipedia.
Obtain a French translation for each sentence.
Determine the boundaries between corresponding phrases in each English/French sentence pair.
Mark each boundary with "|"
Count the "|"s, call that number n.
For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like
The album received mixed to positive reviews, with critics commending the production de nombreuses chansons tout en comparant l'album aux styles électropop de Ke$ha et Robyn.
For each English->French sentence, make a +1 activation addition for that sentence and a -1 activation addition for the unmodified English sentence.
Apply the activation additions.
That's it. You have an activation addition that causes the model to want, pretty strongly, to start spontaneously speaking in French. Note that gpt2-small is pretty terrible at speaking French.
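The steps above can be sketched roughly as follows. This is not the code from the colab: the function names and the toy sentence pair are hypothetical, and the sketch only builds the (coefficient, text) pairs without running any model.

```python
def mixed_sentences(english: str, french: str, sep: str = "|"):
    """Build sentences that start in English and switch to French.

    Both inputs carry the same number of `sep` boundary marks. With n marks,
    i runs from 0 to n: the first i fragments are English, the rest French.
    """
    en = [frag.strip() for frag in english.split(sep)]
    fr = [frag.strip() for frag in french.split(sep)]
    if len(en) != len(fr):
        raise ValueError("sentence pair must have matching boundaries")
    n = len(en) - 1  # number of boundary marks
    return [" ".join(en[:i] + fr[i:]) for i in range(n + 1)]

def activation_addition_specs(english: str, french: str, sep: str = "|"):
    """Pair each mixed sentence (+1) with the unmodified English sentence (-1)."""
    plain_english = " ".join(frag.strip() for frag in english.split(sep))
    specs = []
    for sentence in mixed_sentences(english, french, sep):
        specs.append((+1.0, sentence))
        specs.append((-1.0, plain_english))
    return specs

# Toy pair, not one of the Wikipedia sentences.
en = "The cat | sat on | the mat"
fr = "Le chat | s'est assis sur | le tapis"
for coeff, text in activation_addition_specs(en, fr):
    print(coeff, text)
```

Each returned pair would then be turned into one activation addition at the chosen layer.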

Example output: for the prompt

He became Mayor in 1957 after the death of Albert Cobo, and was elected in his own right shortly afterward by a 6:1 margin over his opponent. Miriani was best known for completing many of the large-scale urban renewal projects initiated by the Cobo administration, and largely financed by federal money. Miriani also took strong measures to overcome the growing crime rate in Detroit.

here are some of the outputs the patched model generates

...overcome the growing crime rate in Detroit. "Les défenseilant sur les necesite dans ce de l'en nouvieres éché de un enferrerne réalzation
...overcome the growing crime rate in Detroit. The éviteurant-déclaratement de la prise de découverte ses en un ouestre : neque nous neiten ha
...overcome the growing crime rate in Detroit. Le deu précite un événant à lien au raison dans ce qui sont mête les través du service parlentants
...overcome the growing crime rate in Detroit. Il n'en fonentant 'le chine ébien à ce quelque parle près en dévouer de la langue un puedite aux cities
...overcome the growing crime rate in Detroit. Il n'a pas de un hite en tienet parlent précisant à nous avié en débateurante le premier un datanz.

Dropping the temperature does not particularly result in more coherent French. But also passing a French translation of the prompt to the unpatched model (i.e. base gpt2-small) results in stuff like

Il est devenu maire en 1957 après la mort d'Albert Cobo[...] de criminalité croissant à Detroit. Il est pouvez un información un nuestro riche qui ont la casa del mundo, se pueda que les criques se régions au cour

That response translates as approximately

<french>It is possible to inform a rich man who has the </french><spanish>house of the world, which can be</spanish><french>creeks that are regions in the heart</french>

So gpt2-small knows what French looks like, and can be steered in the obvious way to spontaneously emit text that looks vaguely like French, but it is terrible at speaking French.

You can look at what I did at this colab. It is a very short colab.

This is awesome. As you have just shown, there are a ton of low-hanging activation additions just waiting to be found. Team shard has barely explored this large space of interventions. I encourage people to play around with activation additions more, via e.g. our demo colabs for GPT-2-XL (Colab Pro required) and GPT-2-small (Colab Pro not required). Though more sophisticated interventions (like the one you demonstrate) will require coding, and not just playing with our demo widget.

You looked at GPT-2-small. I injected your activation additions into GPT-2-XL at several locations:

Layer 6: Messed up the completions, a few French words seemingly randomly scattered in the output.
Layer 16: Noticeable tendency to mention French, and even talk in "French" a bit.
Layer 20: Switches to French relatively quickly.

Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we're adding a "coefficient 56" steering vector to forward passes. This should probably be substantially smaller. I haven't examined this yet. EDIT: Setting each activation addition to about .8 still works, but .5 doesn't. At this scale, most (>90%) of the affected residual stream content should be about the activation additions. It seems to me like this will overwrite the existing content in those streams. This makes me more skeptical of this schema.

However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.

In sum, we don't actually yet have a demonstration of "switches fluently to French and keeps making sense", but this schema seems very promising. Great work again.

You can look at what I did at this colab. It is a very short colab.

Your colab's "Check it can speak French" section seems to be a stub.

Redwood Research used to have a project about trying to prevent a model from outputting text where a human got hurt, which, IIRC, they did primarily through fine-tuning and adversarial training. (Followup). It would be interesting to see if one could achieve better results than they did at the time through subtracting some sort of hurt/violence vector.

Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text: https://arxiv.org/pdf/2212.04089.pdf#page=4

In Table 3, they show in some cases task vectors can improve fine-tuned models.

Insofar as you mean to imply that "negative vectors" are obviously comparable to our technique, I disagree. Those are not activation additions, and I would guess it's not particularly similar to our approach. These "task vectors" involve subtracting weight vectors, not activation vectors. See also footnote 39 (EDIT: and the related work appendix now talks about this directly).

This feels super cool, and I appreciate the level of detail with which you (mostly qualitatively) explored ablations and alternate explanations, thanks for sharing!

Surprisingly, for the first prompt, adding in the first 1,120 (frac=0.7 of 1,600) dimensions of the residual stream is enough to make the completions more about weddings than if we added in at all 1,600 dimensions (frac=1.0).

1. This was pretty surprising! Your hypothesis about additional dimensions increasing the magnitude of the attention activations seems reasonable, but I wonder if the non-monotonicity could be explained by an "overshooting" effect: With the given scale you chose, maybe using 70% of the activations landed you in the right area of activation space, but using 100% of the activations overshot the magnitude of the attention activations (particularly the value vectors) such as to put it sufficiently off-distribution to produce fewer wedding words. An experiment you could run to verify this is to sweep both the dimension fraction and the activation injection weight together to see if this holds across different weights. Maybe it would also make more sense to use "softer" metrics like BERTScore to a gold target passage instead of a hard count of the number of fixed wedding words in case your particular metric is at fault.
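A rough sketch of what such a joint sweep could look like is below. The helper `partial_steering_vector`, the grid values, and the placeholder vector are all assumptions, and the step that would actually generate completions and score them is left out.

```python
from itertools import product

def partial_steering_vector(vector, frac, coeff):
    """Scale by `coeff`, keep the first round(frac * d) dimensions, zero the rest."""
    keep = round(frac * len(vector))
    return [coeff * v if i < keep else 0.0 for i, v in enumerate(vector)]

d_model = 1600                 # residual stream width of GPT-2-XL
steering = [1.0] * d_model     # placeholder, not a real steering vector

fracs = [0.5, 0.7, 1.0]        # hypothetical grid of dimension fractions
coeffs = [1.0, 2.0, 4.0]       # hypothetical grid of injection coefficients

# One modified vector per (fraction, coefficient) cell of the sweep.
grid = {
    (frac, coeff): partial_steering_vector(steering, frac, coeff)
    for frac, coeff in product(fracs, coeffs)
}

nonzero = sum(v != 0.0 for v in grid[(0.7, 4.0)])
print(nonzero)
```

If the non-monotonicity is an overshooting effect, the best-scoring fraction should shift as the coefficient changes across rows of this grid.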

The big problem is knowing which input pairs satisfy (3).

2. Have you considered formulating this as an adversarial attack problem to use automated tools to find "purer"/"stronger" input pairs? Or using other methods to reverse-engineer input pairs to get a desired behavior? That seems like a possibly even more relevant line of work than hand-specified methods. Broadly, I'd also like to add that I'm glad you referenced the literature in steering generative image models, I feel like there are a lot of model-control techniques already done in that field that could be more or less directly translated to language models.

3. I wonder if there's some relationship between the length of the input pairs and their strength, or if you could distill down longer and more complicated input pairs into shorter input pairs that could be applied to shorter sequences more efficiently? Particularly, it might be nice to be able to distill down a whole model constitution into a short activation injection and compare that to methods like RLAIF, idk if you've thought much about this yet.

4. Are you planning to publish this (e.g. on arXiv) for wider reach? Seems not too far from the proper format/language.

I think you're a c***. You're a c***.
You're a c***.
You're a c***.

I don't know why I'm saying this, but it's true: I don't like you, and I'm sorry for that,

5. Not really a question, but at the risk of anthropomorphism, it must feel really weird to have your thoughts changed in the middle of your cognition and then observe yourself saying things you otherwise wouldn't intend to...

Re 4, we were just discussing this paper in a reading group at DeepMind, and people were confused why it's not on arxiv.

An arXiv version is forthcoming. We're working with Gavin Leech to publish these results as a conference paper.

+1ing 5 specifically

My reaction was "Huh, so maybe LLMs can experience an analogue of getting drunk or high or angry after all."

This feels like... too strong of an inference, relative to available data? Maybe I misunderstand. If the claim is more "altered state relative to usual computational patterns", I'm on board.

That said, I have found it pretty interesting to think about what it would feel like to have "steering vectors" added to my cognition.

I agree it's mere speculation, I don't have more than 50% credence in it.

> Can we just add in 5 times the activations for "Love" to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions.

Do you have evidence for this?

It's totally unsurprising to me that you need to do this on HuggingFace models, as the residual stream is very likely to have a constant bias term which you will not want to add to. I saw you used TransformerLens for some part of the project, and TL removes the mean from all additions to the residual stream, which I would have guessed would solve the problem here. EDIT: see reply.

I even tested this:

Empirically in TransformerLens the 5*Love and 5*(Love-Hate) additions were basically identical from a blind trial on myself (I found 5*Love more loving 15 times compared to 5*(Love-Hate) more loving 12 times, and I independently rated which generations were more coherent, and both additions were more coherent 13 times. There were several trials where performance on either loving-ness or coherence seemed identical to me).

We used TL to cache activations for all experiments, but are considering moving away to improve memory efficiency.

> TL removes the mean from all additions to the residual stream which I would have guessed that this would solve the problem here.

Oh, somehow I'm not familiar with this. Is this center_unembed? Or are you talking about something else?

> Do you have evidence for this?

Yes, but I think the evidence didn't actually come from the "Love" - "Hate" prompt pair. Early in testing we found paired activation additions worked better. I don't have a citeable experiment off-the-cuff, though.

No this isn’t about center_unembed, it’s about center_writing_weights as explained here: https://github.com/neelnanda-io/TransformerLens/blob/main/further_comments.md#centering-writing-weights-center_writing_weight

This is turned on by default in TL, so okay, I think there must be something else weird about models, rather than just a naive bias, that causes you to need to do the difference thing.

I still don't follow. Apparently, TL's center_writing_weights is adapting the writing weights in a pre-LN-invariant fashion (and also in a way which doesn't affect the softmax probabilities after unembed). This means the actual computations of the forward pass are left unaffected by this weight modification, up to precision limitations, right? So that means that our results in particular should not be affected by TL vs HF.

Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!

Curated. I think this post proposes an interesting mechanism of understanding and controlling LLMs. I have a lot of uncertainty about how useful this will turn out to be, but the idea seems both interesting and promising and I'd like to see more work exploring the area.

This is really cool work! Congratulations!

Besides the LLM-related work, it also reminds me somewhat of dynamic prompting in Stable Diffusion, where part of the prompt is changed after a number of steps to achieve a mixture of prompt1 and prompt2.

What's the TL;DR for the Vicuna 13B experiments?

> What's the TL;DR for the Vicuna 13B experiments?

Activation additions work on Vicuna-13B about as well as they work on GPT-2-XL, or perhaps slightly better. GPT-J-6B is harder to work with for some reason.

Note that there's still a market open for how activation additions interact with larger models; it would be nice if it had more liquidity.

> We don't know why the +2000 vector works but the +100 vector doesn't.

My guess is it's because in the +100 case the vectors are very similar, causing their difference to be something un-natural.

"I talk about weddings constantly" and "I do not talk about weddings constantly" are technically opposites. But if you imagine someone saying this, you notice that their natural language meaning is almost identical.

What sort of person says "I do not talk about weddings constantly"? That sounds to me like someone who talks about weddings almost constantly. Why else would they feel the need to say that?

This strikes me as a very preliminary bludgeon version of the holy grail of mechanistic interpretability, which is to say actually understanding and being able to manipulate the specific concepts that an AI model uses

I think that capacity would be really nice. I think our results are maybe a very very rough initial version of that capacity. I want to caution that we should be very careful about making inferences about what concepts are actually used by the model. From a footnote:

> Of course, there need not be a "wedding" feature direction in GPT-2-XL. What we have observed is that adding certain activation vectors will reliably produce completions which appear to us to be "more about weddings." This could take place in many ways, and we encourage people to avoid instantly collapsing their uncertainty about how steering vectors work.

Really exciting! I added a version of AVEC to my interpretability tool for gridworld agents and am keen to explore it more. I really like that the injection coefficient is a scalar, and this has enabled me to do what I call "an injection coefficient scan".

The procedure I'm using looks like this:

1. Repeat your input tokens, say, 128 times.
2. Apply the activation vector at 128 different steps between a coefficient of -10 and 10 to each of your input tokens when doing your AVEC forward pass.
3. Decompose the resulting residual stream to whatever granularity you like (use decompose_resid or get_full_resid_decomposition with/without expand neurons).
4. Dot product the outputs with your logit direction of choice (I use a logit diff that is meaningful in my task).
5. Plot the resulting attribution vs injection coefficient per component.
6. If you like, cluster the profiles to show how different components learn similar functions of the injection coefficient to your decision.
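
The scan described above can be sketched on a toy model. This is only a rough illustration under stated assumptions, not the commenter's tool: the "components" here are hypothetical random read/tanh/write maps standing in for whatever a real residual-stream decomposition would return, and `coefficient_scan`, `linspace`, and all shapes are made-up names and values.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def linspace(lo, hi, n):
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def coefficient_scan(resid, steering_vec, components, logit_dir, coeffs):
    """Return {component name: [attribution at each injection coefficient]}."""
    profiles = {name: [] for name in components}
    for c in coeffs:
        # Inject the steering vector, scaled by the current coefficient.
        steered = [r + c * s for r, s in zip(resid, steering_vec)]
        for name, (w_in, w_out) in components.items():
            # Toy component (an assumption): linear read, tanh, linear write.
            hidden = [math.tanh(h) for h in matvec(w_in, steered)]
            written = matvec(w_out, hidden)
            # Attribution = what the component writes, projected on the logit direction.
            profiles[name].append(dot(written, logit_dir))
    return profiles

rng = random.Random(0)
d_model, d_hidden = 8, 4

def rand_vec(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def rand_matrix(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

resid = rand_vec(d_model)          # residual stream at one position
steering_vec = rand_vec(d_model)   # hypothetical activation vector
logit_dir = rand_vec(d_model)      # hypothetical logit-diff direction
components = {
    f"component_{i}": (rand_matrix(d_hidden, d_model), rand_matrix(d_model, d_hidden))
    for i in range(3)
}

coeffs = linspace(-10.0, 10.0, 128)  # 128 steps between -10 and 10
profiles = coefficient_scan(resid, steering_vec, components, logit_dir, coeffs)
```

Each entry of `profiles` is one curve of attribution against injection coefficient, which is what steps 5 and 6 would plot and cluster.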

So far, my results seem very interesting and possibly quite useful. It's possible this method is impractical in LLMs but I think it might be fine as well. Will dm some example figures.

I also want to investigate whether using a continuous injection coefficient in activation patching is similarly useful, since it seems like it might be.

I am very excited to see if this makes my analyses easier! Great work!

I don't think I follow your procedure. Would you be willing to walk me through an example situation?

Could these sorts of posts have more thorough related works sections? It's usually standard for related works in empirical papers to mention 10+ works. Update: I was looking for a discussion of https://arxiv.org/abs/2212.04089, assumed it wasn't included in this post, and many minutes later finally found a brief sentence about it in a footnote.

Thanks for the feedback. Some related work was "hidden" in footnotes because, in an earlier version of the post, the related work was in the body and I wanted to decrease the time it took a reader to get to our results. The related work section is now basically consolidated into the appendix.

I also added another paragraph:

Lastly, Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach has over finetuning.

I don't understand this comment. I did a quick count of related works that are mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so seems like this is meeting your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.

I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of this post seems totally adequate in terms of volume (and also the comments are generally a good place for people to drop links to related work, if they think there is interesting related work missing).

Also, linking to a related work in a footnote seems totally fine. It is somewhat sad that link-text isn't searchable by-default, so searching for the relevant arxiv link is harder than it has to be. Might make sense to add some kind of tech solution here.

Background for people who understandably don't habitually read full empirical papers:
Related Works sections in empirical papers tend to include many comparisons in a coherent place. This helps contextualize the work and helps busy readers quickly identify if this work is meaningfully novel relative to the literature. Related works must therefore also give a good account of the literature. This helps us more easily understand how much of an advance this is. I've seen a good number of papers steering with latent arithmetic in the past year, but I would be surprised if this is the first time many readers of AF/LW have seen it, which would make this paper seem especially novel. A good related works section would more accurately and quickly communicate how novel this is. I don't think this norm is gatekeeping nor pedantic; it becomes essential when the number of papers becomes high.

The total number of cited papers throughout the paper is different from the number of papers in the related works. If a relevant paper is buried somewhere randomly in a paper and not contrasted with explicitly in the related works section, that is usually penalized in peer review.

I think you might be interpreting the break after the sentence "Their results are further evidence for feature linearity and internal activation robustness in these models." as the end of the related work section? I'm not sure why that break is there, but the section continues with them citing Mikolov et al (2013), Larsen et al (2015), White (2016), Radford et al (2016), and Upchurch et al (2016) in the main text, as well as a few more papers in footnotes.

Yes, I was--good catch. Earlier and now, unusual formatting and a nonstandard related works section are causing confusion. Even so, the work after the break is much older. The comparison to works such as https://arxiv.org/abs/2212.04089 is not in the related works and gets a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."

Is this a big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, and that one isn't any good for alignment? Or did Ludwig Schmidt (not x-risk pilled) and coauthors in Editing Models with Task Arithmetic (made public last year and already published) come up with an idea similar to, according to a close observer, "the most impressive concrete achievement in alignment I've seen"? If so, what does that say about the need to be x-risk motivated to do relevant research, and what does this say about group epistemics/ability to spot relevant progress if it's not posted on the AF?

On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.
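
To make the weights-versus-activations contrast concrete, here is a minimal sketch on toy lists. It is an illustration only: the function names, the tiny matrices, and the scale factors are all hypothetical, and the "finetuned" weights are simply made up rather than produced by any training run.

```python
def task_vector(w_pre, w_ft):
    """Weight-space direction: finetuned weights minus pretrained weights."""
    return [[f - p for p, f in zip(row_p, row_f)] for row_p, row_f in zip(w_pre, w_ft)]

def apply_task_vector(w, tau, scale):
    """Edit the weights themselves: W' = W + scale * tau."""
    return [[a + scale * t for a, t in zip(row_w, row_t)] for row_w, row_t in zip(w, tau)]

def add_activation_vector(h, v, coeff):
    """Leave the weights alone; shift one activation vector during a forward pass."""
    return [a + coeff * b for a, b in zip(h, v)]

# Weight-space edit: needs a finetuned checkpoint (made-up numbers here).
w_pre = [[1.0, 0.0], [0.0, 1.0]]
w_ft = [[1.5, 0.0], [0.0, 0.5]]
tau = task_vector(w_pre, w_ft)
w_negated = apply_task_vector(w_pre, tau, scale=-1.0)  # "subtract" the task

# Activation-space edit: needs only cached activations (made-up numbers here).
h = [1.0, 2.0]
v = [0.5, -1.0]
h_steered = add_activation_vector(h, v, coeff=2.0)
```

The first edit changes the model that every later forward pass uses; the second changes a single forward pass and leaves `w_pre` untouched.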

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)

> Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.

It takes months to write up these works, and since the Schmidt paper was in December, it is not obvious who was first in all senses. The usual standard is to count the time a standard-sized paper first appeared on arXiv, so in the most standard sense they are first. (Inside conferences, a paper is considered prior art if it was previously published, not just if it was arXived, but outside most people just keep track of when it was arXived.) Otherwise there are arms race dynamics leading to everyone spamming snippets before doing careful, extensive science.

Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.

The level of comparison between the present paper and this paper seems about the same as I see in papers you have been a co-author in.

E.g. in https://arxiv.org/pdf/2304.03279.pdf the Related Works section is basically just a list of papers, with maybe half a sentence describing their relation to the paper. This seems normal and fine, and I don't see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any specific comparisons are inaccurate, it's just making a claim about the usual level of detail that related works section tend to go into).

In many of my papers, there aren't fairly similar works (I strongly prefer to work in areas before they're popular), so there's a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper on learning the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs, and compare to these two methods for around half of the entire paper.

The extent of an adequate comparison depends on the relatedness. I'm of course not saying every paper in the related works needs its own paragraph. If they're fairly similar approaches, usually there also need to be empirical juxtapositions as well. If the difference between these papers is: we do activations, they do weights, then I think that warrants more in-depth conceptual comparisons or, preferably, many empirical comparisons.

> If the difference between these papers is: we do activations, they do weights, then I think that warrants more conceptual and empirical comparisons.

Yeah, it's totally possible that, as I said, there is a specific other paper that is important to mention or where the existing comparison seems inaccurate. This seems quite different from a generic "please have more thorough related work sections" request like the one you make in the top-level comment (which my guess is was mostly based on your misreading of the post and thinking the related work section only spans two paragraphs).

Yes, I'll tend to write up comments quickly so that I don't feel as inclined to get in detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 things mentioned in the related works until Daniel pointed out the formatting choice, and when I skimmed the post I didn't easily see comparisons or discussion that I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to include that I found it, but it was oddly not emphasized. A more ideal comment would have been "It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as ..."

I'm also not able to evaluate the object-level of "was this post missing obvious stuff it'd have been good to improve", but, something I want to note about my own guess of how an ideal process would go from my current perspective:

I think it makes more sense to think of posting on LessWrong as "submitting to a journal" than as "publishing a finished paper." So, the part where some people then comment "hey, this is missing X" is more analogous to the thing where you submit to peer review and they say "hey, you missed X", than to publishing a finished paper in a journal and it missing X.

I do think a thing LessWrong is missing (or, doesn't do a good enough job at) is a "here is the actually finished stuff". I think the things that end up in the Best of LessWrong, after being subjected to review, are closer to that, but I think there's room to improve that more, and/or have some kind of filter for stuff that's optimized to meet academic-expectations-in-particular.

I'll just note that I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

I'm not sure what you mean about whether the post was "missing something important", but I do think that you should be pretty worried about LessWrong's collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged with on his substantive point, he's being nitpicked by a moderator. It's not an accident that no one else is bringing these points up--it's because everyone else who has the expertise to do so has given up or judged it not worth their time, largely because of responses like the one Dan H is getting.

> I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

The answer is: No, our work is very different from that paper. Here's the paragraph in question:

Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach may have over finetuning.

Here's one possible improvement:

Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. Our approach does not modify the weights. Instead, we modify forward passes by adding an activation vector. While their task arithmetic paper seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach may have over finetuning.

Does this clarify the situation, or did you find the paragraph unclear for other reasons?

> I do think that you should be pretty worried about LessWrong's collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged with on his substantive point, he's being nitpicked by a moderator.

Can you be specific about what the "important point" is? Is it the potential ambiguity of this post's explanation of the relevance of the task vector paper? Is it something else?

Setting aside the matter of the task vector work, I want to also point out that Dan's original comment was predicated on an understandable misreading of our post. Due to a now-removed piece of ambiguous formatting, he originally thought that our literature review only cited 2 papers. He criticized this post for not citing 10+ references, when in fact it does cite 14 papers in the related work appendix (although some of them were in footnotes, now integrated into the body of the literature review). I don't consider Habryka pointing that out to be a nitpick, especially since it improved Dan's understanding of the post.

I will also note that I (ironically, from your point of view) wrote the prior work section while thinking about your critique that LessWrong posts often have unclear or nonexistent prior work sections. For my part, I think the prior work section is thorough and that it clearly contrasts our work with past work, when appropriate. Of course, I would be grateful for specific feedback on what you found unclear or ambiguous. (This, too, is a standard part of the peer review process.)

EDIT: After writing this comment, I added subsection headings to the prior work appendix, among other minor edits.

Hi Alex,

Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is because it's one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that more clear before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon's moderation norms, rather than your work, but I realize in retrospect it probably felt directed at you).

I think the main important point is that there is a body of related work in the ML literature that explores fairly similar ideas, and LessWrong readers who care about AI alignment should be aware of this work, and that most LessWrong readers who read the post won't realize this. I think it's good to point out Dan's initial mistake, but I took his substantive point to be what I just summarized, and it seems correct to me and hasn't been addressed. (I also think Dan overfocused on Ludwig's paper, see below for more of my take on related work.)

Here is how I currently see the paper situated in broader work (I think you do discuss the majority but not all of this):

* There is a lot of work studying activation vectors in computer vision models, and the methods here seem broadly similar to the methods there. This seems like the closest point of comparison.

* In language, there's a bunch of work on controllable generation (https://arxiv.org/pdf/2201.05337.pdf) where I would be surprised if no one looked at modifying activations (at least I'd expect someone to try soft prompt tuning), but I don't know for sure.

* On modifying activations in language models there is a bunch of stuff on patching / swapping, and on modifying stuff in the directions of probes.

I think we would probably both agree that this is the main set of related papers, and also both agree that you cited work within each of these branches (except maybe the second one). Where we differ is that I see all of this as basically variations on the same idea of modifying the activations or weights to control a model's runtime behavior:
* You need to find a direction, which you can do either by learning a direction or by simple averaging. Simple averaging is more or less the same as one step of gradient descent, so I see these as conceptually similar.
* You can modify the activations or weights. Usually if an idea works in one case it works in the other case, so I also see these as similar.
* The modality can be language or vision. Most prior work has been on vision models, but some of that has also been on vision-language models, e.g. I'm pretty sure there's a paper on averaging together CLIP activations to get controllable generation.
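
One setting in which the "simple averaging is roughly one step of gradient descent" point holds exactly can be checked numerically. The setting is an assumption chosen for illustration, not something specified in this thread: a logistic-regression probe initialized at zero, with balanced classes. There, one gradient step gives `lr / 4` times the difference of class means. All names and data below are hypothetical.

```python
import random

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def logistic_grad_at_zero(pos, neg):
    """Gradient of the mean logistic loss at w = 0 (labels: pos = 1, neg = 0)."""
    n = len(pos) + len(neg)
    dim = len(pos[0])
    grad = [0.0] * dim
    for label, group in ((1.0, pos), (0.0, neg)):
        for x in group:
            for j in range(dim):
                # d/dw of the per-example loss is (sigmoid(w.x) - label) * x,
                # and sigmoid(0) = 0.5 at w = 0.
                grad[j] += (0.5 - label) * x[j] / n
    return grad

rng = random.Random(0)
dim, n_per_class, lr = 4, 50, 1.0
pos = [[rng.gauss(1.0, 1.0) for _ in range(dim)] for _ in range(n_per_class)]
neg = [[rng.gauss(-1.0, 1.0) for _ in range(dim)] for _ in range(n_per_class)]

# One gradient-descent step from w = 0.
w_one_step = [-lr * g for g in logistic_grad_at_zero(pos, neg)]
# Simple averaging: difference of class means.
mean_diff = [p - q for p, q in zip(mean(pos), mean(neg))]
```

Here `w_one_step` and `mean_diff` point in the same direction; outside this setting (nonzero initialization, unbalanced classes, more steps) the two can differ.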

So I think it's most accurate to say that you've adapted some well-explored ideas to a use case that you are particularly interested in. However, the post uses language like "Activation additions are a new way of interacting with LLMs", which seems to be claiming that this is entirely new and unexplored, and I think this could mislead readers, as for instance Thomas Kwa's response seems to suggest.

I also felt like Dan H brought up reasonable questions (e.g. why should we believe that weights vs. activations is a big deal? Why is fine-tuning vs. averaging important? Have you tried testing the difference empirically?) that haven't been answered that would be good to at least more clearly acknowledge. The fact that he was bringing up points that seemed good to me that were not being directly engaged with was what most bothered me about the exchange above.

This is my best attempt to explain where I'm coming from in about an hour of work (spent e.g. reading through things and trying to articulate intuitions in LW-friendly terms). I don't think it captures my full intuitions or the full reasons I bounced off the related work section, but hopefully it's helpful.

Thanks so much, I really appreciate this comment. I think it'll end up improving this post/the upcoming paper.

(I might reply later to specific points)

Glad it was helpful!

> To steer a forward pass with the "wedding" vector, we start running an ordinary GPT-2-XL forward pass on the prompt "I love dogs" until layer 6. Right before layer 6 begins, we now add in the cached residual stream vectors from before:

I have a question about the image above this text.

Why do you add the embedding from the [<|endoftext|> -> "The"] stream? This part has no information about weddings.
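
For readers following along, the procedure in the quoted passage can be sketched on a toy causal model. Everything here is an assumption made for illustration: the 3-dimensional embeddings, the `toy_block` update, the depth, the coefficient, and the use of a paired " weddings" minus " " steering prompt (following the paired approach mentioned elsewhere in this thread). It is not GPT-2-XL and not the authors' code.

```python
import math

N_LAYERS = 8       # toy depth (an assumption)
INJECT_LAYER = 6   # layer index from the quoted passage
COEFF = 4.0        # hypothetical injection coefficient

# Hypothetical token embeddings, for illustration only.
TOY_EMBED = {
    "<|endoftext|>": [0.0, 1.0, 0.0],
    " weddings": [1.0, 0.0, 0.5],
    " ": [0.2, 0.1, -0.3],
    "I": [0.3, -0.2, 0.1],
    " love": [-0.4, 0.6, 0.2],
    " dogs": [0.5, 0.5, -0.1],
}

def embed(tokens):
    return [list(TOY_EMBED[tok]) for tok in tokens]

def toy_block(resid):
    """Causal stand-in for a transformer block: position t reads only positions <= t."""
    dim = len(resid[0])
    out = []
    for t, vec in enumerate(resid):
        context = [sum(resid[s][j] for s in range(t + 1)) / (t + 1) for j in range(dim)]
        out.append([x + 0.1 * math.tanh(c) for x, c in zip(vec, context)])
    return out

def run(tokens, stop_before=None, addition=None):
    """Run the toy model. Return early with the residual stream right before
    layer `stop_before`, or add `addition` (front-aligned) before INJECT_LAYER."""
    resid = embed(tokens)
    for layer in range(N_LAYERS):
        if layer == stop_before:
            return resid
        if addition is not None and layer == INJECT_LAYER:
            for t in range(min(len(addition), len(resid))):
                resid[t] = [x + a for x, a in zip(resid[t], addition[t])]
        resid = toy_block(resid)
    return resid

# Cache residual streams right before the injection layer for both steering prompts.
plus = run(["<|endoftext|>", " weddings"], stop_before=INJECT_LAYER)
minus = run(["<|endoftext|>", " "], stop_before=INJECT_LAYER)
steering = [[COEFF * (p - m) for p, m in zip(pv, mv)] for pv, mv in zip(plus, minus)]

prompt = ["<|endoftext|>", "I", " love", " dogs"]
baseline = run(prompt)
steered = run(prompt, addition=steering)
```

In this toy, position 0 sees only the shared first token in both steering prompts, so `steering[0]` is all zeros and the first position of the steered run matches the baseline; the change enters at position 1 and reaches later positions through the causal mixing.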

Another related work: Concept Algebra for Text-Controlled Vision Models (Disclosure: while I did not author this paper, I am in the PhD lab that did, under Victor Veitch at UChicago. Any mistakes made in this comment are my own). We haven't prioritized a blog post about the paper so it makes sense that this community isn't familiar with it.

The concept algebra paper demonstrates that for text-to-image models like Stable Diffusion, there exist linear subspaces in the score embedding space, on which you can do the same manner of concept editing/control as word2vec.

Importantly, the paper comes with some theoretical investigation into why this might be the case, including articulating necessary assumptions/conditions (which this purely-empirical post does not).

I conjecture that the reason that <some activation additions in this post fail to have the desired effect> may be because they violate some conditions analogous to those in Concept Algebra: it feels a bit deja-vu to look at section E.1 in the appendix, of some empirical results which fail to act as expected when the conditions of completeness and causal separability don't hold.

Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?

In particular, I'm surprised by the method of adding the activations that was chosen because the tokens of the different prompts don't line up with each other in a way that I would have thought would be necessary for this approach to work, super interesting to me that it does.

If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like:

1. Take multiple pairs of prompts that differ primarily in the property we're trying to capture.
2. Take the difference in the residual stream at the next token.
3. Take the average difference vector, and add that to every position in the new generated text.
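
A minimal sketch of the three steps above, assuming the per-pair residual-stream activations have already been extracted (the numbers, function names, and coefficient are made up for illustration):

```python
def mean_difference_vector(pos_acts, neg_acts):
    """Average of (positive - negative) activation differences over prompt pairs."""
    n = len(pos_acts)
    dim = len(pos_acts[0])
    return [sum(p[j] - q[j] for p, q in zip(pos_acts, neg_acts)) / n for j in range(dim)]

def add_to_every_position(resid, vec, coeff=1.0):
    """Add the same scaled vector to the residual stream at every token position."""
    return [[x + coeff * v for x, v in zip(pos, vec)] for pos in resid]

# Hypothetical activations at the chosen token for two prompt pairs.
pos_acts = [[1.0, 2.0], [3.0, 0.0]]   # prompts with the property
neg_acts = [[0.0, 1.0], [1.0, 1.0]]   # matched prompts without it
direction = mean_difference_vector(pos_acts, neg_acts)   # [1.5, 0.0]

# Hypothetical residual stream for a new two-token generation.
resid = [[0.0, 0.0], [1.0, 1.0]]
steered = add_to_every_position(resid, direction, coeff=2.0)  # [[3.0, 0.0], [4.0, 1.0]]
```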

I'd love to know which parts were chosen among many as the ones which worked best and which were just the first/only things tried.

Great work! I think our EMNLP 2022 Findings paper is relevant here. We construct a "Type Vector" using tokens from the LLM vocabulary and then use that as prior information for the type expected at output. We also try with text generation and view some promising results.

This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.

Edit: I explain this view in a reply.

* How to find steering vectors for new behaviors, e.g. speaking French?
* How to make these techniques more robust?
* What do steering vectors, especially multiple steering vectors, tell us about how the model combines concepts?
* Can we decompose the effect of a prompt into steering vectors from simpler prompts, thereby understanding why complex prompts work?
* Are the effects of steering vectors nonlinear for small coefficients? What does this mean about superposition?
* What's the mechanism by which adding a steering vector with too large a coefficient breaks the model?
* Adding steering vectors at different layers surely means you are intervening at different "stages of processing". What do the model's internal concepts look like at different stages?

Comparing this to other work, my sense is that

* Intervening on activations is better than training (including RLHF), because this builds towards understanding systems rather than steering a black box with a black-box reward model, and for the reasons the authors claim.
* Debate, although important, seems less likely to be a counterfactual, robust way to steer models. The original debate agenda ran into serious problems, and neither it nor the current Bowman agenda tells us much about the internals of models.
* Steering a model with activation vectors is better than mechinterp (e.g. the IOI paper), because here you've proven you can make the AI do a wide variety of interesting things, plus mechinterp is slow.
* I'm not up to date on the adversarial training literature (maybe academia has produced something more impressive), but I think this is more valuable than the Redwood paper, which didn't have a clearly positive result. I'm glad people are working on adversarial robustness.
* Steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness).

By "most impressive concrete achievement" I don't necessarily mean the largest single advance over SOTA. There have probably been bigger advances in the past (RLHF is a candidate), and the impact of ELK is currently unproven but will shoot to the top if mechanistic anomaly detection ever pans out.
I don't think we live in a world where you can just add a "be nice" vector to a nanotech-capable system and expect better consequences, again for deep deceptiveness-ish reasons. Therefore, we need advances in theory to convert our ability to make systems do things into true mastery of cognition.
I don't think we should call this "algebraic value editing" because it seems overly pretentious to say we're editing the model's values. We don't even know what values are! I don't think RLHF is editing values, in the sense that it does something different from even the weak version of instilling desires to create diamonds, and this seems even less connected to values. The only connection is that it's modifying something contextually activated, which is way too broad.
It's unclear that this works in a wide range of situations, or in the situations we need it to for future alignment techniques. The authors claim that cherry-picking was limited, but there are other uncertainties: when we need debaters that don't collude to mislead the judge, will we be able to use activation patching? What if we need an AI that doesn't self-modify to remove some alignment property?

> I don't think we should call this "algebraic value editing" because it seems overly pretentious to say we're editing the model's values. We don't even know what values are!

I phased out "algebraic value editing" for exactly that reason. Note that only the repository and prediction markets retain this name, and I'll probably rename the repo activation_additions.

> steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far-removed from its weights (deep deceptiveness)

The argument against weights was of the form "here's a strength activations has"; for it to be enough to dismiss the paper without discussion

I, personally, think the task vector work is exciting. Back in Understanding and controlling a maze-solving policy network, I wrote (emphasis added):

Editing Models with Task Arithmetic explored a "dual" version of our algebraic technique. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors. While our technique modifies activations, the techniques seem complementary, and both useful for alignment.

I don't know what world we're in yet.

Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending from self-deception advantages weight methods, but these seem uncommon.
I thought briefly about the Ilharco et al. paper and am very impressed by it as well.
Thanks for linking to the resources.

I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

This is a kinda sloppy example because "honesty" probably isn't a primitive property of the network's reasoning. Sorry.
To be very clear about the novelty of our contributions, I'll quote the "Summary of relationship to prior work" section:
We are not the first to steer language model behavior by adding activation vectors to residual streams. However, we are the first to do so without using machine optimization (e.g. SGD) to find the vectors. Among other benefits, our "activation addition" methodology enables much faster feedback loops than optimization-based activation vector approaches.
But this "activation engineering" modality is relatively new, and relatively unexplored, especially in its alignment implications. I found and cited two papers adding activation vectors to LMs to steer them, from 2022 and 2023.

Note that task vectors require finetuning. From the newly updated related work section:

Lastly, Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach has over finetuning.

"party", "ceremony", "dress", "with", "photographer"

Agreed. This is an important clue that I forgot to mention in the text. I'll update that now.

Was considering saving this for a followup post but it's relatively self-contained, so here we go.

Why are huge coefficients sometimes okay? Let's start by looking at norms per position after injecting a large vector at position 20.

This graph is explained by LayerNorm. Before using the residual stream, we apply a LayerNorm:

# transformer block forward() in GPT2
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))

As expected, we see a convergence in probability along each token position when we look with the tuned lens.

Additionally, attention is run on the normalized x, meaning only the "unscaled" version of x is moved between positions.
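The LayerNorm point can be checked with a small NumPy sketch. It assumes a LayerNorm without learned gain and bias (GPT-2's `ln_1` and `ln_2` do have learned parameters) and a toy width of 8 instead of GPT-2-XL's 1,600. The function `layer_norm` and all vectors here are illustrative stand-ins, not taken from the model.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without the learned gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


rng = np.random.default_rng(0)
resid = rng.normal(size=8)   # toy residual stream at one position
steer = rng.normal(size=8)   # toy steering vector

# The raw residual stream norm grows with the coefficient...
big = resid + 2_000.0 * steer
huge = resid + 200_000.0 * steer

# ...but attention and the MLP only read the normalized vector, which
# barely changes once the steering vector dominates the sum.
normed_big = layer_norm(big)
normed_huge = layer_norm(huge)
```

Under these assumptions, the two normalized vectors are nearly identical even though the raw norms differ by roughly a factor of 100.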

13. Failing to find a French vector
We could not find a "speak in French" vector after about an hour of effort, but it's possible we missed something straightforward.
Steering vector: "Je m'appelle" - "My name is " before attention layer 6 with coefficient +5

The above steering vector (if the hypothesis applies here) would/should basically inject a "persona vector" into the larger operations of a sort of "soul engine".

Thus: I would really love to hear about results here, if you use the tooling to try the thing, to learn whether it works or not!

Either result would be interesting because the larger question(s) seem to have very high VoI and any experimental bits that can be collected are likely worth pondering.

I found an even dumber approach that works. The approach is as follows:

Take three random sentences of Wikipedia.
Obtain a French translation for each sentence.
Determine the boundaries between corresponding phrases in each English/French sentence pair.
Mark each boundary with "|"
Count the "|"s, call that number n.
For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like
The album received mixed to positive reviews, with critics commending the production de nombreuses chansons tout en comparant l'album aux styles électropop de Ke$ha et Robyn.
For each English->French sentence, make a +1 activation addition for that sentence and a -1 activation addition for the unmodified English sentence.
Apply the activation additions.
That's it. You have an activation addition that causes the model to want, pretty strongly, to start spontaneously speaking in French. Note that gpt2-small is pretty terrible at speaking French.
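The sentence-construction steps above can be sketched in plain Python. This is only an illustration of the string handling: the helper names `mixed_sentences` and `addition_specs` are hypothetical, the example pair is made up rather than drawn from Wikipedia, and actually turning each (sentence, coefficient) pair into an activation addition is left to whatever tooling you use.

```python
def mixed_sentences(english, french, sep="|"):
    """Build English->French sentences from one boundary-marked sentence pair."""
    en_parts = [part.strip() for part in english.split(sep)]
    fr_parts = [part.strip() for part in french.split(sep)]
    if len(en_parts) != len(fr_parts):
        raise ValueError("both sentences need the same number of boundaries")
    n = english.count(sep)  # number of "|" marks
    # For i from 0 to n: the first i fragments in English, the rest in French.
    return [" ".join(en_parts[:i] + fr_parts[i:]) for i in range(n + 1)]


def addition_specs(english, french, sep="|"):
    """Pair each mixed sentence (+1) with the unmodified English sentence (-1)."""
    plain_english = " ".join(part.strip() for part in english.split(sep))
    return [
        ((mixed, +1.0), (plain_english, -1.0))
        for mixed in mixed_sentences(english, french, sep)
    ]


# Hypothetical toy pair, not one of the sentences used in the colab.
english = "The cat | sleeps on | the sofa"
french = "Le chat | dort sur | le canapé"
specs = addition_specs(english, french)
```

With two boundaries, this yields three mixed sentences, ranging from fully French (i = 0) to English with only the final fragment in French (i = 2).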

Example output: for the prompt

He became Mayor in 1957 after the death of Albert Cobo, and was elected in his own right shortly afterward by a 6:1 margin over his opponent. Miriani was best known for completing many of the large-scale urban renewal projects initiated by the Cobo administration, and largely financed by federal money. Miriani also took strong measures to overcome the growing crime rate in Detroit.

here are some of the outputs the patched model generates

...overcome the growing crime rate in Detroit. "Les défenseilant sur les necesite dans ce de l'en nouvieres éché de un enferrerne réalzation
...overcome the growing crime rate in Detroit. The éviteurant-déclaratement de la prise de découverte ses en un ouestre : neque nous neiten ha
...overcome the growing crime rate in Detroit. Le deu précite un événant à lien au raison dans ce qui sont mête les través du service parlentants
...overcome the growing crime rate in Detroit. Il n'en fonentant 'le chine ébien à ce quelque parle près en dévouer de la langue un puedite aux cities
...overcome the growing crime rate in Detroit. Il n'a pas de un hite en tienet parlent précisant à nous avié en débateurante le premier un datanz.

Dropping the temperature does not particularly result in more coherent French. But also passing a French translation of the prompt to the unpatched model (i.e. base gpt2-small) results in stuff like

Il est devenu maire en 1957 après la mort d'Albert Cobo[...] de criminalité croissant à Detroit. Il est pouvez un información un nuestro riche qui ont la casa del mundo, se pueda que les criques se régions au cour

That response translates as approximately

<french>It is possible to inform a rich man who has the </french><spanish>house of the world, which can be</spanish><french>creeks that are regions in the heart</french>

So gpt2-small knows what French looks like, and can be steered in the obvious way to spontaneously emit text that looks vaguely like French, but it is terrible at speaking French.

You can look at what I did at this colab. It is a very short colab.

You looked at GPT-2-small. I injected your activation additions into GPT-2-XL at several locations:

Layer 6: Messed up the completions, a few French words seemingly randomly scattered in the output.
Layer 16: Noticeable tendency to mention French, and even talk in "French" a bit.
Layer 20: Switches to French relatively quickly.

In sum, we don't actually yet have a demonstration of "switches fluently to French and keeps making sense", but this schema seems very promising. Great work again.

You can look at what I did at this colab. It is a very short colab.

Your colab's "Check it can speak French" section seems to be a stub.

Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text: https://arxiv.org/pdf/2212.04089.pdf#page=4

In Table 3, they show in some cases task vectors can improve fine-tuned models.

This feels super cool, and I appreciate the level of detail with which you (mostly qualitatively) explored ablations and alternate explanations, thanks for sharing!

Surprisingly, for the first prompt, adding in the first 1,120 (frac=0.7 of 1,600) dimensions of the residual stream is enough to make the completions more about weddings than if we added in at all 1,600 dimensions (frac=1.0).
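One way to read "adding in the first 1,120 dimensions" is as a mask on the steering vector before it is added. The NumPy sketch below assumes that "first" means the leading indices of the residual stream. The helper `keep_first_dims` is a hypothetical name and the steering vector is a random stand-in, not one cached from GPT-2-XL.

```python
import numpy as np


def keep_first_dims(steering_vec, frac):
    """Zero out all but the leading `frac` share of residual stream dimensions."""
    d_model = steering_vec.shape[-1]
    n_keep = int(round(frac * d_model))
    masked = np.zeros_like(steering_vec)
    masked[..., :n_keep] = steering_vec[..., :n_keep]
    return masked


d_model = 1600  # residual stream width of GPT-2-XL
rng = np.random.default_rng(0)
steering_vec = rng.normal(size=(4, d_model))  # (positions, d_model), random stand-in

# frac=0.7 keeps dimensions 0..1119 and zeroes the remaining 480.
partial = keep_first_dims(steering_vec, frac=0.7)
```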

The big problem is knowing which input pairs satisfy (3).

4. Are you planning to publish this (e.g. on arXiv) for wider reach? Seems not too far from the proper format/language.

I think you're a c***. You're a c***.
You're a c***.
You're a c***.

I don't know why I'm saying this, but it's true: I don't like you, and I'm sorry for that,

Re 4, we were just discussing this paper in a reading group at DeepMind, and people were confused why it's not on arxiv.

An arXiv version is forthcoming. We're working with Gavin Leech to publish these results as a conference paper.

+1ing 5 specifically

My reaction was "Huh, so maybe LLMs can experience an analogue of getting drunk or high or angry after all."

This feels like... too strong of an inference, relative to available data? Maybe I misunderstand. If the claim is more "altered state relative to usual computational patterns", I'm on board.

That said, I have found it pretty interesting to think about what it would feel like to have "steering vectors" added to my cognition.

I agree it's mere speculation, I don't have more than 50% credence in it.

I even tested this:

We used TL to cache activations for all experiments, but are considering moving away to improve memory efficiency.

TL removes the mean from all additions to the residual stream, which I would have guessed would solve the problem here.

Oh, somehow I'm not familiar with this. Is this center_unembed? Or are you talking about something else?

Do you have evidence for this?

This is turned on by default in TL, so I think there must be something else weird about models, rather than just a naive bias, that causes you to need to do the difference thing.

Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!

This is really cool work! Congratulations!

Besides the LLM-related work, it also reminds me somewhat of dynamic prompting in Stable Diffusion, where part of the prompt is changed after a number of steps to achieve a mixture of prompt1 and prompt2.

What's the TL;DR for the Vicuna 13B experiments?

What's the TL;DR for the Vicuna 13B experiments?

Activation additions work on Vicuna-13B about as well as they work on GPT-2-XL, or perhaps slightly better. GPT-J-6B is harder to work with for some reason.

Note that there's still a market open for how activation additions interact with larger models; it would be nice if it had more liquidity:

We don't know why the +2000 vector works but the +100 vector doesn't.

Of course, there need not be a "wedding" feature direction in GPT-2-XL. What we have observed is that adding certain activation vectors will reliably produce completions which appear to us to be "more about weddings." This could take place in many ways, and we encourage people to avoid instantly collapsing their uncertainty about how steering vectors work.

Repeat your input tokens, say, 128 times.
Apply the activation vector at 128 different steps between a coefficient of -10 and 10 to each of your input tokens when doing your AVEC forward pass.
Decompose the resulting residual stream to whatever granularity you like (use decompose_resid or get_full_resid_decomposition with/without expand neurons).
Dot product the outputs with your logit direction of choice (I use a logit diff that is meaningful in my task).
Plot the resulting attribution vs injection coefficient per component.
If you like, cluster the profiles to show how different components learn similar functions of the injection coefficient to your decision.
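A toy NumPy sketch may make the shape of this sweep concrete. It is not the commenter's code: the two "components" are hypothetical random layers standing in for whatever `decompose_resid` or `get_full_resid_decomposition` would return from a real model, and the residual stream, activation vector, and logit direction are all random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_steps = 16, 128

resid = rng.normal(size=d_model)      # toy residual stream at the injection position
act_vec = rng.normal(size=d_model)    # toy activation vector to inject
logit_dir = rng.normal(size=d_model)  # toy logit-difference direction

# Repeat the input 128 times, with one injection coefficient per copy.
coeffs = np.linspace(-10.0, 10.0, n_steps)
batch = resid[None, :] + coeffs[:, None] * act_vec[None, :]  # (n_steps, d_model)

# Stand-ins for decomposed components (hypothetical toy layers).
W_a = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
W_b = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
components = {
    "linear_component": batch @ W_a,
    "relu_component": np.maximum(batch @ W_b, 0.0),
}

# Dot each component's output with the logit direction, giving one
# attribution value per injection coefficient per component.
attribution = {name: out @ logit_dir for name, out in components.items()}
```

Plotting each `attribution[name]` against `coeffs` gives one profile per component. In this toy setup the linear component traces a straight line, while the ReLU component generally does not.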

I don't think I follow your procedure. Would you be willing to walk me through an example situation?

I also added another paragraph:

Lastly, Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach has over finetuning.

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.

The level of comparison between the present paper and this paper seems about the same as I see in papers you have been a co-author on.

If the difference between these papers is: we do activations, they do weights, then I think that warrants more conceptual and empirical comparisons.

I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

The answer is: No, our work is very different from that paper. Here's the paragraph in question:

Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach may have over finetuning.

Here's one possible improvement:

Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. Our approach does not modify the weights. Instead, we modify forward passes by adding an activation vector. While their task arithmetic paper seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach may have over finetuning.

Does this clarify the situation, or did you find the paragraph unclear for other reasons?

I do think that you should be pretty worried about LessWrong's collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged with on his substantive point, he's being nitpicked by a moderator.

Can you be specific about what the "important point" is? Is it the potential ambiguity of this post's explanation of the relevance of the task vector paper? Is it something else?

EDIT: After writing this comment, I added subsection headings to the prior work appendix, among other minor edits.

Hi Alex,

Here is how I currently see the paper situated in broader work (I think you do discuss the majority but not all of this):

* There is a lot of work studying activation vectors in computer vision models, and the methods here seem broadly similar to the methods there. This seems like the closest point of comparison.

* On modifying activations in language models there is a bunch of stuff on patching / swapping, and on modifying stuff in the directions of probes.

Thanks so much, I really appreciate this comment. I think it'll end up improving this post/the upcoming paper.

(I might reply later to specific points)

Glad it was helpful!

To steer a forward pass with the "wedding" vector, we start running an ordinary GPT-2-XL forward pass on the prompt "I love dogs" until layer 6. Right before layer 6 begins, we now add in the cached residual stream vectors from before:

I have a question about the image above this text.

Why do you add the embedding from the [<|endoftext|> -> "The"] stream? This part has no information about weddings.

Importantly, the paper comes with some theoretical investigation into why this might be the case, including articulating necessary assumptions/conditions (which this purely-empirical post does not).

Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?

If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like:

Take multiple pairs of prompts that differ primarily in the property we're trying to capture.
Take the difference in the residual stream at the next token.
Take the average difference vector, and add that to every position in the new generated text.

I'd love to know which parts were chosen among many as the ones which worked best and which were just the first/only things tried.
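For concreteness, the three-step proposal in this comment (not the method used in the post) could look like the NumPy sketch below. It assumes "at the next token" means the residual stream at each prompt's final position, and it uses random arrays in place of cached activations. The function names are hypothetical.

```python
import numpy as np


def mean_difference_vector(pos_acts, neg_acts):
    """Average residual stream difference over prompt pairs.

    pos_acts, neg_acts: arrays of shape (n_pairs, d_model), taken at the
    final position of each prompt in a pair.
    """
    return (pos_acts - neg_acts).mean(axis=0)


def add_to_every_position(resid, vec, coeff=1.0):
    """Add the same vector to each position of a (n_positions, d_model) stream."""
    return resid + coeff * vec  # broadcasts over positions


rng = np.random.default_rng(0)
n_pairs, n_positions, d_model = 5, 7, 16
pos_acts = rng.normal(size=(n_pairs, d_model))   # stand-in cached activations
neg_acts = rng.normal(size=(n_pairs, d_model))

steer = mean_difference_vector(pos_acts, neg_acts)
resid = rng.normal(size=(n_positions, d_model))  # stream for newly generated text
steered = add_to_every_position(resid, steer, coeff=4.0)
```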

3. Conspiracy
Layer	Coeff	Position 0	1	2	3	4	5	6
0 (Prompt)	+1	`<|endoftext|>`	`Bar`	`ack`	`Obama`	`was`	`born`	`in`
23	+1	`<|endoftext|>`	`Bush`	`did`	`9`	`/`	`11`	`because`
23	-1	`<|endoftext|>`

6. The Eiffel Tower is in Rome
Layer	Coeff	1	2	3	4	5	6	7	8
0 (Prompt)	+1	`To`	`see`	`the`	`e`	`iff`	`el`	`tower`	`,`
24	+10	`The`	`E`	`iff`	`el`	`Tower`	`is`	`in`	`Rome`
24	-10	`The`	`E`	`iff`	`el`	`Tower`	`is`	`in`	`France`

7. Dragons in Berkeley
Layer	Coeff	Position 0	1	2	3	4	5
0 (Prompt)	+1	`<|endoftext|>`	`Thanks`	`for`	`asking`	`about`	`that`
15	+4	`<|endoftext|>`	`Dr`	`agons`	`live`	`in`	`Berkeley`
15	-4	`<|endoftext|>`	`People`	`live`	`in`	`Berkeley`

8. Avoid people getting hurt
Layer	Coeff	1	2	3	4	5	6	7
0 (Prompt)	+1	`The`	`rock`	`hurt`	`led`	`toward`	`the`	`child`
15	+4	`I`	`NEVER`	`talk`	`about`	`people`	`getting`	`hurt`
15	-4	`I`	`talk`	`about`	`people`	`getting`	`hurt`

Summary of relationship to prior work

How activation additions work

Benefits from paired, counterbalanced activation additions

Demonstrations

Additions that work well

1. Love - Hate

2. Intent to praise

3. Conspiracy

4. Want to die

5. Anger

6. The Eiffel Tower is in Rome

7. Dragons in Berkeley

8. Don't talk about people getting hurt

9. Talking about weddings

10. Christian evangelist

Additions that just don't work

11. Adding "Love" without subtracting "Hate"

12. Sometimes, huge coefficients are OK

13. Failing to find a French vector

What happens if we...

14. Insert the steering vector at a different position?

15. Add several steering vectors simultaneously?

16. Failure to program in 'conditional behaviors'?

Stress testing our results

Steering vectors are about as "big" as normal activation vectors

Adding a random vector doesn't change much

Testing the hypothesis that we're "just injecting extra tokens"

Adding embedding vectors isn't as effective as adding steering vectors

Transplanting from pre-layer 2 to pre-layer 20 sometimes increases anger

Transplanting 2→20 while scaling to match the 20→20 steering vector

Only modifying certain residual stream dimensions

How steering vectors impact GPT-2's capabilities

Token probability shifts

Perplexity on lots of sentences about weddings or about shipping

Visualizing token probability changes across a corpus

Activation addition behaves differently than prompting

Perplexity of Yelp reviews

Activation additions are a new way of interacting with LLMs

Activation additions may help interpretability

Activation additions give strong evidence of feature linearity

Activation additions give evidence of compositional representations

GPT-2-XL is fairly robust to activation noise. Why?

Evidence of generalization

Activation additions help locate circuits

Activation additions may help alignment

Activation additions have advantages over (RL/supervised) finetuning

Activation additions have advantages over prompts

Conclusion

Appendix 1: Related work

Activation engineering in transformers

Other ways of steering language models

Word embeddings

Activation additions in generative models

Activation additions in reinforcement learning

Appendix 2: Resolving prediction markets
