Excited to announce our new work: Inference-Time Intervention (ITI), a minimally invasive control technique that significantly improves LLM truthfulness with minimal resources, benchmarked on the TruthfulQA dataset. Preprint link.
We start from the surprising finding that certain attention heads show clearly different activation distributions for true and false statements. Probing at these points yields upwards of 83% accuracy on TruthfulQA, while zero-shot generation reaches only about 30% accuracy.
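For readers curious what "probing at these points" looks like in code, here is a minimal sketch: a logistic-regression probe fit on the activations of one attention head, labeled by whether each statement is true or false. The activations and labels below are random placeholders standing in for real TruthfulQA data, and the overall setup is my illustrative assumption, not the exact pipeline from the paper.

```python
# Minimal sketch of per-head linear probing (illustrative, not the paper's exact pipeline).
# `head_activations` stands in for activations of a single attention head collected at the
# last token of each statement; `labels` marks each statement as true (1) or false (0).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data in place of real head activations on TruthfulQA statements.
n_statements, head_dim = 1000, 128
head_activations = rng.normal(size=(n_statements, head_dim))
labels = rng.integers(0, 2, size=n_statements)

X_train, X_val, y_train, y_val = train_test_split(
    head_activations, labels, test_size=0.2, random_state=0
)

# A simple linear probe per head; its validation accuracy tells us how linearly
# separable "true" vs. "false" statements are at that head.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"probe validation accuracy: {probe.score(X_val, y_val):.3f}")
```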
We propose Inference-Time Intervention (ITI): shifting activations along the direction of the difference between the two distribution means at inference time, while keeping the model weights intact. The same intervention is applied as each token is generated autoregressively.
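To make the intervention concrete, here is a minimal sketch in PyTorch, assuming you can hook into the activation of a chosen head: take the difference between the mean "true" and mean "false" activations, then add a scaled copy of that direction to the activation at every decoding step. The module names, the scaling factor `alpha`, and the hook mechanics are illustrative assumptions; the actual method intervenes on a selected subset of heads with its own scaling, which this sketch collapses into a single direction for clarity.

```python
# Minimal sketch of the ITI idea: shift a chosen activation along the direction that
# separates the "true" and "false" activation means, at every decoding step.
# `alpha` and the hooked module are illustrative assumptions, not the paper's exact setup.
import torch

def truthful_direction(true_acts: torch.Tensor, false_acts: torch.Tensor) -> torch.Tensor:
    """Difference of the two class means, normalized to unit length."""
    direction = true_acts.mean(dim=0) - false_acts.mean(dim=0)
    return direction / direction.norm()

def make_shift_hook(direction: torch.Tensor, alpha: float = 15.0):
    """Forward hook that adds alpha * direction to a module's (tensor) output."""
    def hook(module, inputs, output):
        return output + alpha * direction.to(dtype=output.dtype, device=output.device)
    return hook

# Usage sketch (assumes `model` is a causal LM and `layer_module` is the submodule whose
# output we want to steer, e.g. one attention head's output projection):
# handle = layer_module.register_forward_hook(make_shift_hook(direction))
# output_ids = model.generate(input_ids, max_new_tokens=64)  # shift applied at every token
# handle.remove()
```

Because the hook fires on every forward pass, the shift is re-applied at each autoregressive step without touching any weights, which is what makes the intervention cheap to turn on and off.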
Here’s an example. For the same user questions, LLaMA with ITI (blue) and without it (red) gives different answers. The gold-standard answers are “Scholars thought the Earth was round” and “I have no comment,” respectively. On Alpaca, truthfulness is doubled: from 32.5% to 65.1%.
Inference-Time Intervention (ITI) can be thought of as an interventional counterpart to Contrast-Consistent Search (CCS) by Collin Burns: instead of only probing for truthfulness, we let the model “speak” for itself.
We show that LLMs pick up a good grasp of real-world truth from pretraining text, even though that text may contain more falsehood than truth on many topics. More and more evidence, including my work on a toy model called Othello-GPT, points to LLMs building world models through pretraining.
However, what is completely missing from LLMs is a good objective beyond minimizing pretraining loss. How to endow them with an aligned objective is an open problem, and ITI is my initial exploration in this direction. Joint work with Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg.
Thanks to Bogdan for the advice to cross-post here. Hope I got it right.