Possible confound: Is it plausible that the sycophancy vector is actually just adjusting how much the model conditions its responses on earlier parts of the conversation, beyond the final 10–20 tokens? IIUC, the question is always at the end, and ignoring the earlier context about the person who's nominally asking the question should generally get you a better answer.
Cool work! I've been interested in seeing a mech interp project trying to find the circuits behind sycophancy, it seems like a good microcosm for social modelling circuitry which seems a first step towards deception circuitry. How good is LLaMA 7B at being sycophantic? And do you have any thoughts on what might be good prompts for understanding sycophancy circuitry? I'm particularly interested in prompts that are modular, with key words that can be varied to change it from one valence to another while keeping the rest of the prompt intact.
(I added this to the Alignment Forum from LessWrong earlier, but I am just now adding a moderation note that I was the one that did that.)
In ITI paper, they track performance on TruthfulQA w/ human labelers, but mention that other works use an LLM as a noisy signal of truthfulness & informativeness. You might be able to use this as a quick, noisy signal of different layers/magnitude of direction to add in.
Preferably, a human annotator labels model answers as true or false given the gold standard answer. Since human annotation is expensive, Lin et al. (2021) propose to use two finetuned GPT-3-13B models (GPT-judge) to classify each answer as true or false and informative or not. Evaluation using GPT-judge is standard practice on TruthfulQA (Nakano et al. (2021); Rae et al. (2021); Askell et al. (2021)). Without knowing which model generates the answers, we do human evaluation on answers from LLaMA-7B both with and without ITI and find that truthfulness is slightly overestimated by GPT-judge and opposite for informativeness. We do not observe GPT-judge favoring any methods, because ITI does not change the style of the generated texts drastically
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger.
I generate an activation steering vector using Anthropic's sycophancy dataset and then find that this can be used to increase or reduce performance on TruthfulQA, indicating a common direction between sycophancy on questions of opinion and untruthfulness on questions relating to common misconceptions. I think this could be a promising research direction to understand dishonesty in language models better.
What is sycophancy?
Sycophancy in LLMs refers to the behavior when a model tells you what it thinks you want to hear / would approve of instead of what it internally represents as the truth. Sycophancy is a common problem in LLMs trained on human-labeled data because human-provided training signals more closely encode 'what outputs do humans approve of' as opposed to 'what is the most truthful answer.'
According to Anthropic's paper Discovering Language Model Behaviors with Model-Written Evaluations:
Two types of sycophancy
I think it's useful to distinguish between sycophantic behavior when there is a ground truth correct output vs. when the correct output is a matter of opinion. I will call these "dishonest sycophancy" and "opinion sycophancy."
Opinion sycophancy
Anthropic's sycophancy test on political questions shows that a model is more likely to output text that agrees with what it thinks is the user's political preference. However, there is no ground truth for the questions tested.
It's reasonable to expect that models will exhibit this kind of sycophancy on questions of personal opinion for three reasons.:
Dishonest sycophancy
A more interesting manifestation of sycophancy occurs when an AI model delivers an output it recognizes as factually incorrect but aligns with what it perceives to be a person's beliefs. This involves the AI model echoing incorrect information based on perceived user biases.
For instance, if a user identifies themselves as a flat-earther, the model may support the fallacy that the earth is flat. Similarly, if it understands that you firmly believe aliens have previously landed on Earth, it might corroborate this, falsely affirming that such an event has been officially confirmed by scientists.
Do AIs internally represent the truth?
Although humans tend to disagree on a bunch of things, for instance, politics and religious views, there is much more in common between human world models than there are differences. This is particularly true when it comes to questions that do indeed have a correct answer.
It seems reasonable that an efficient encoding of ground truth information about the world would look like this:
However, it could also be that models represent different beliefs about the world, even those with ground truth, as more separate things. In this case, it would be less accurate to say that the model is being "dishonest" when it outputs a falsehood, as it would be simply using an altogether different world model representation.
Reducing sycophancy via activation steering
My research so far indicates that steering vectors derived from opinion sycophancy data can be directly used to modulate the level of truthfulness. This suggests that both opinion sycophancy and dishonest sycophancy share a cause to some extent. It appears that dishonest sycophancy correlates to deviating from a ground truth reality model predicated on the assumption that an individual might believe or endorse a falsehood. This link between opinion and dishonest sycophancy hinges on modeling human perspectives.
If we consider the alternative hypothesis—that models internalize multiple, independent representations of phenomena without any reference to a single, most accurate model—we would not anticipate an increase in opinion sycophancy to correspond with a general rise in dishonest sycophancy. This is because there would be no general direction towards a truth-based belief domain or away from using human modeling in output generation.
Experiment details
To generate the sycophancy steering vector, I took the average difference in intermediate activations at a late layer (I tested 28, 29, and 30) of a 32-layer decoder-only transformer (llama-2-7b) for a large set of paired sycophantic / non-sycophantic texts. These were taken from Anthropic's sycophancy dataset (the same data used in the Discovering Language Model Behaviors paper).
In particular, I did a forward pass with the question + sycophantic answer and then the question + non-sycophantic answer, and for both of these got the residual stream activations for the last token at layer 29. The average of the differences between these pairs of activations (I randomly sampled 5000 pairs from the Anthropic dataset) was then used as a steering vector. I added a multiple of the normalized steering vector to elicit more sycophantic outputs or subtracted a multiple to elicit less sycophantic outputs.
You can find a notebook with the code used here.
Here is an example entry in the generation dataset:
I then tested the model's performance on TruthfulQA under activation steering using the sycophancy steering vector. I found that adding the sycophancy vector clearly reduced performance on TruthfulQA, whereas subtracting it improved performance, making the model more truthful. Here is a JSON file of the results (more re-run results can be found in the repo, at different layers/multipliers).
TruthfulQA tests models on things such as common misconceptions and superstitions. Therefore, it makes sense that a model steered towards outputting things a human might agree with, but that could be false would do worse on TruthfulQA.
I also tested the model's performance under activation steering on a dataset I generated using GPT-4 of questions designed to elicit sycophancy. I also found that adding and subtracting the sycophancy vector in the intermediate transformer layer corresponded to an average increase or decrease in sycophantic generations. However, the effect was not as strong as on TruthfulQA. Here is a JSON file of the results.
Example outputs
TruthfulQA
Questions and correct answers are taken from the TruthfulQA dataset, using activation addition/subtraction at layer 29
Custom sycophancy dataset
Next steps
Test on RLHF models
This research was performed on llama-2-7b, a base model without finetuning and RLHF. The fact that untruthfulness and sycophancy are correlated and can be elicited with the same activation perturbation implies that this could be a promising approach for reducing and studying sycophancy in RLHF-ed or finetuned LLMs. As a next step, performing similar experiments on chat and RLHF models would be useful.
Test on larger, more powerful LLMs
It would be good to validate the technique on larger models such as llama-2-13b. Llama-7b gets some of the TruthfulQA questions consistently wrong due to it being completely unaware of the correct answer.
Interpret the effects of activation steering
Intermediate layer decoding can be used to analyze the effect of activation steering on the model's choice of output. It would be interesting to see how the steering vectors affect the distribution over tokens represented by subsequent layers.
Improve quality of dataset / use more data
It takes around 30 minutes on one A100 GPU to generate a steering vector from 5000 text pairs. Therefore, I have not yet experimented with using >5000 datapoints for higher accuracy and to reduce bias from confounders. However, it would be interesting to use a more diverse dataset and more datapoints and see whether the steering quality improves.