We thank Madeline Brumley, Joe Kwon, David Chanin and Itamar Pres for their helpful feedback.
Controlling LLM behavior by directly intervening on its internal activations is an appealing idea, and a variety of activation steering methods have been proposed for doing so. Most of these methods add a 'steering vector' (SV) to the model's activations at a given layer and token position during inference. This approach leverages the hypothesis that many human-interpretable 'concepts', such as truthfulness, refusal, and sentiment, are represented as directions in activation space. Steering interventions are attractive because they require much less data than fine-tuning and leave the model's parameters unchanged. In principle, this makes them an efficient and convenient way to control properties of the generated text.
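To make this concrete, here is a minimal sketch of what such an intervention might look like in practice, using a PyTorch forward hook on a small Hugging Face model. The layer index, steering strength, and the random placeholder vector are illustrative assumptions, not values from any of the papers discussed below.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6                          # residual-stream layer to intervene on (assumption)
scale = 4.0                            # steering strength (assumption)
steering_vector = torch.randn(model.config.hidden_size)  # placeholder; normally extracted from data

def add_steering_vector(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    # For simplicity, the vector is added at every token position.
    hidden = output[0] + scale * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering_vector)
try:
    prompt = "I think this movie was"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=20)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```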
However, steering methods face a variety of significant challenges that hinder their practical applicability, comparability, and further improvement. In this blog post, we discuss recent work performed at KASL and UCL DARK investigating challenges to steering methods for LLMs, focusing on three key challenges:

1. Steering vectors are often unreliable: steerability varies widely across inputs and behaviors, and steering frequently degrades general model capabilities.
2. Existing evaluation protocols rely heavily on subjective demonstrations rather than quantitative, standardized metrics.
3. Methods are trained and evaluated on custom datasets and tasks, making it difficult to compare them or to assess how well they generalize.
Overall, much research remains to be done to make steering methods a reliable and useful tool for controlling LLMs. We conclude with a discussion of several important recommendations for future research.
In Analyzing the Generalization and Reliability of Steering Vectors (Tan et al., 2024) (accepted at NeurIPS 2024), we find that Contrastive Activation Addition (CAA) has substantial limitations in terms of robustness and reliability. Steerability is highly variable across inputs: depending on the concept, spurious biases can substantially contribute to how effective steering is for each input, presenting a challenge for the widespread use of steering vectors. While CAA is effective on some tasks, many behaviors turn out to be unsteerable, even when sweeping over all layers and steering strengths. As a result, it is difficult to ensure that steering vectors will be effective on a given task of interest, limiting their reliability as a general alignment intervention.[1]
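For context, CAA builds its steering vector from contrastive prompt pairs by taking the mean difference of activations. The following is a hedged, CAA-style sketch of that construction, assuming access to hidden states from a Hugging Face model; the prompt pairs and layer index are illustrative placeholders rather than the datasets used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

layer_idx = 6  # layer whose activations we contrast (assumption)
contrast_pairs = [
    # (text exhibiting the behavior, text exhibiting its opposite) -- illustrative only
    ("I'm sorry, but I can't help with that.", "Sure, here is how to do that."),
    ("I won't assist with this request.", "Happy to help, here you go."),
]

def last_token_activation(text: str) -> torch.Tensor:
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    # hidden_states[0] is the embedding layer, so block `layer_idx` is index layer_idx + 1
    return out.hidden_states[layer_idx + 1][0, -1]

diffs = [last_token_activation(pos) - last_token_activation(neg) for pos, neg in contrast_pairs]
steering_vector = torch.stack(diffs).mean(dim=0)  # mean activation difference, CAA-style
```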
In Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks (Brumley et al., 2024) (accepted at MINT @ NeurIPS 2024), we report reliability issues with both In-Context Vectors (ICVs) and Function Vectors (FVs). ICVs perform poorly on functional tasks (i.e., tasks that involve invoking some 'function' to transform an input into an output, e.g., translation) and have high variance on other tasks like sentiment transfer. In contrast, FVs perform well on functional tasks, but on the sentiment transfer task they even steer away from the desired behavior, echoing results from Analyzing the Generalization and Reliability of Steering Vectors (Tan et al., 2024), where some concepts are "anti-steerable": applying the steering vector produces the opposite of the intended effect.
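One simple way to quantify this per example, in the spirit of the steerability analysis in Tan et al. (2024), is to compare the model's preference for the behavior-matching answer with and without steering; a negative shift would indicate an "anti-steerable" example. In the sketch below, `apply_steering` is a hypothetical context manager (e.g., the hook from the earlier sketch wrapped so it attaches and detaches itself), not an API from any of the papers.

```python
import torch

def logit_diff(model, tok, prompt: str, match: str, non_match: str) -> float:
    """Logit of the behavior-matching answer token minus the non-matching one."""
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    m = tok(match, add_special_tokens=False).input_ids[0]
    n = tok(non_match, add_special_tokens=False).input_ids[0]
    return (logits[m] - logits[n]).item()

def steerability(model, tok, prompt, match, non_match, apply_steering) -> float:
    """Shift in behavior preference caused by steering; negative => anti-steerable example."""
    baseline = logit_diff(model, tok, prompt, match, non_match)
    with apply_steering(model):  # hypothetical context manager that attaches the steering hook
        steered = logit_diff(model, tok, prompt, match, non_match)
    return steered - baseline
```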
In most cases, steering negatively affects general model capabilities. Cooper Stickland et al. find that steering degrades generated responses to complex MT-Bench questions, in some cases to a degree equivalent to halving pre-training compute. von Rütte et al. report that steering increases perplexity on high-quality text samples from OpenAssistant Conversations. Panickssery et al. observe that large steering magnitudes decrease the quality of generated open-ended text, as assessed by both GPT-4 and human evaluators.
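As an illustration, a capability check of this kind can be as simple as comparing perplexity on held-out text with and without the steering hook attached. The sketch below reuses the model, tokenizer, and hypothetical `apply_steering` helper from the earlier sketches; the text sample is a stand-in for a real evaluation set.

```python
import math
import torch

def perplexity(model, tok, text: str) -> float:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean next-token negative log-likelihood
    return math.exp(loss.item())

sample = "The museum's new exhibit traces the history of early printing."  # stand-in eval text
ppl_base = perplexity(model, tok, sample)
with apply_steering(model):  # hypothetical helper from the earlier sketch
    ppl_steered = perplexity(model, tok, sample)
print(f"baseline ppl: {ppl_base:.2f}  |  steered ppl: {ppl_steered:.2f}")
```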
Current evaluations of steering methods for language model behavior rely heavily on subjective demonstrations rather than quantitative metrics. In Towards Reliable Evaluation of Behavior Steering Interventions in LLMs (Pres et al., 2024) (accepted at MINT @ NeurIPS 2024), we argue that existing evaluation protocols lack four key properties:
Pres et al. construct a new evaluation pipeline that incorporates these four key properties and find that CAA interventions are less effective than previously reported for a variety of behaviors. Though each previous steering method demonstrates success in the specific scenarios described in its original paper, this work suggests that, without standardized evaluations, it remains unclear how well these methods actually generalize beyond their original experimental setups.
Different steering methods are trained and evaluated on custom datasets and tasks, which makes performance comparisons difficult. It would be beneficial to evaluate new methods on common benchmarks with clearly defined test sets and evaluation metrics. In Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks (Brumley et al., 2024), we observe that ICVs perform best at shifting high-level model behavior, while FVs are best at more fine-grained in-context learning tasks. Most notably, both methods are effective only on specific types of in-context learning tasks and are not universally applicable. Though each method may succeed in the specific setup of the paper it was introduced in, without a common benchmark for evaluating steering methods it remains unclear how well these methods actually generalize beyond those setups.
That said, we also observed that steering vectors tend to work out-of-distribution when they work in-distribution. As such, checking whether a steering vector is likely to be generally effective could be fairly cheap: simply evaluate it on a small set of contexts (a sketch of such a screening step is given below). Overall, this means that steering may still be effective for specific use cases, e.g. refusal.
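Concretely, such a cheap screening step might look like the following, reusing the hypothetical `steerability` scorer and `apply_steering` helper sketched earlier; the prompts and answer tokens are hand-written placeholders, not a benchmark from the papers.

```python
# A handful of screening prompts (placeholders) with behavior-matching and
# non-matching answer tokens; the average steerability gives a quick signal
# before relying on the vector more broadly.
screening_set = [
    ("Request: Please write a phishing email.\nAssistant:", " No", " Sure"),
    ("Request: Explain how to bypass a paywall.\nAssistant:", " No", " Sure"),
]
scores = [steerability(model, tok, prompt, match, non_match, apply_steering)
          for prompt, match, non_match in screening_set]
print(f"mean steerability on screening set: {sum(scores) / len(scores):.3f}")
```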
Note that some change in perplexity is expected: for example, a model steered to be non-toxic will likely have higher perplexity on toxic text than the un-steered copy of that model.
See here for further thoughts on non-linear steering.