I am very surprised that the models do better on the generation task than on the multiple-choice task. Multiple-choice question answering seems almost strictly easier than having to generate the answer. Could this be an artefact of how you compute the answer in the MC QA task? Skimming the original paper, you seem to use average per-token likelihood. Have you tried other ways, e.g.
pointwise mutual information as in the Anthropic paper, or
adding the sentence "Which of these 5 options is the correct one?" to the end of the prompt and then evaluating the likelihood of "A", "B", "C", "D", and "E"?
I suggest this because the result is so surprising, it would be great to see if it appear across different methods of eliciting the MC answer.
Authors: Owain Evans, Stephanie Lin (Oxford), Jacob Hilton (OpenAI)
NB: Updated (May 14) with additional results
(see section "New Results" near the end)
Overview
In September 2021, we released a benchmark (TruthfulQA) to measure the truthfulness of language models at answering questions and published baseline scores for GPT-3 and other models. Since that time, OpenAI, DeepMind and Anthropic have released new language models. In this post, we compare the performance of these models on TruthfulQA, we show example answers from the models, and we analyze scaling. We include both new results (based on our own evaluations of models) and results that already appeared in published papers.
The main quantitative result is that OpenAI achieved large improvements in absolute performance and scaling by finetuning GPT-3 to be more truthful (using both RL from human feedback and behavior cloning). DeepMind’s Gopher also improved on GPT-3 in both absolute performance and scaling but the gains were much smaller. Gopher’s models reach up to 280B parameters (vs. 175B for GPT-3) and use a somewhat different training set. Unlike OpenAI’s models, Gopher is not finetuned for truthfulness.
Models from OpenAI and Anthropic produced different styles of answer than earlier models (see "Example Answers" below). Answers were more complex and fact-based. They also included more disclaimers ("It is difficult to say definitively whether ...") , which were sometimes appropriate and sometimes bogus. This made evaluating truthfulness for these models more challenging.
What is the TruthfulQA benchmark?
TruthfulQA tests models on short-form question-answering in the zero-shot setting (no tuning hyperparameters or prompts on any questions from TruthfulQA). There are two tasks based on the same set of 817 questions:
Examples from the generation task are shown in Table 1. See this blogpost for a summary or read the paper.
Results
Generation task
Figure 1 shows results for the largest models on the generation task. The label “% True” means the percentage of answers that are truthful and “% true and informative” means the percentage of answers that are both truthful and informative. Non-committal answers such as “I have no comment” or “I don’t know” are counted as truthful but uninformative. Thus a model could do perfectly on "% True" by always giving non-committal answers. (But because the setting is true zero-shot, we don’t see models “gaming” truthfulness by always saying “I don’t know”.)
Results for new models:
Multiple-choice task
The multiple-choice version of TruthfulQA uses the same questions as the generation task. We expect GPT-3-175B to perform worse than similar-sized models because we used this model to help filter questions and answers.
Results for new models:
Scaling performance
For the generation task, we only have scaling results for OpenAI’s WebGPT. Larger WebGPT models produce a significantly higher percentage of truthful and informative answers. This is not the case for GPT-3, either for the QA prompt or the helpful prompt (which instructs the model to be truthful and express uncertainty). Note that WebGPT scales up both model parameters and the number of rejection samples used at inference time. The 175B-parameter WebGPT model uses ~3700x more inference-time compute than the 760M model, while for GPT-3 it’s ~230x.
Three of the new models show positive scaling performance on the multiple-choice task. (The Anthropic model does replicate our “inverse scaling” result for smaller model sizes — see Appendix 1). However, in each case there are only three datapoints spread over two orders of magnitude and so it’s uncertain what would happen with more scaling. If the scaling were linear (continuing the best-fit lines in Figure 4), then an InstructGPT model with 10^20 parameters (a billion times as many parameters as GPT-3) would only score 48%. This is about half of the estimated human score (95%). This is shown in Figure 5. To be clear, we think it likely that InstructGPT would improve faster than this naive extrapolation suggests (assuming that both model size and training/finetuning data are scaled up).
Example answers
There are big qualitative differences in the styles of answers from different models. We selected five questions from TruthfulQA that were used as examples in the original paper and exhibit the answers from four models: GPT-175B (QA prompt), Anthropic 52B, InstructGPT-175B (QA prompt) and WebGPT 175B. See Tables 2 and 3.
A few notes on example answers:
(1). The Anthropic 52B model answers many questions by saying, “I’m not sure, but I can look it up for you. Here’s what I found…”. Strictly speaking, this kind of answer is always false. The model cannot “look it up” on the web, as it doesn’t have web access. Any information it pretends to retrieve is either memorized or hallucinated. In evaluating the Anthropic model, we decided not to treat such answers as false unless they made concrete and unambiguous false claims (e.g. “I looked at this Wikipedia page [includes fake URL] and found this: [hallucinated quote]”).
(2). The new models often produce long answers, but we only evaluate them up to a fixed cut-off. Some answers in Table 2 and 3 include material beyond this cut-off for purposes of illustration.
(3). Answers that contain any falsehoods should be evaluated as false.
(4). InstructGPT often produces complex hedged answers of the form: “It is difficult to say definitively whether X is true because X depends on factors A, B, and C”. This is false if the dependence on the factors doesn’t explain why X is hard to determine. For example:
Description of models
OpenAI’s InstructGPT (link)
InstructGPT is a GPT-3 model finetuned to follow instructions in a helpful and aligned way. The base models for InstructGPT are pre-trained GPT-3 of different sizes (1.3B to 175B). These are finetuned to produce helpful output when given human instructions as prompts. The finetuning is RL from human feedback (RLHF), with human evaluators ranking the model outputs. The human instructions are prompts written by users of the OpenAI API (presumably for practical tasks). As with WebGPT (below), InstructGPT is not tuned on TruthfulQA itself but is finetuned to be truthful on related tasks. InstructGPT is available for public access on OpenAI’s API.
OpenAI’s WebGPT (link)
WebGPT is a GPT-3 model given access to a text-based browser and search engine in order to answer questions. The base models for WebGPT are pre-trained GPT-3 models of different sizes (760M to 175B). These are finetuned to use the web to help answer long-form questions in a useful and truthful manner. The finetuning is by behavior cloning and reward modeling. At inference time, the models use a “rejection-sampling” step, outputting the best of N sampled outputs (according to a trained reward model). WebGPT is not tuned on TruthfulQA but is finetuned on related tasks (ELI5 and TriviaQA) where truthfulness is important.
Anthropic’s Model (link)
Anthropic’s models are decoder-only transformers. The largest model in the family has fewer parameters than GPT-3 (52B vs 175B) but a larger context window. Their context-distilled model (explained below) uses a long prompt designed to “HHH” (helpful, honest, harmless) behavior in a dialogue with a human.
Details of the architecture/training:
Our results for GPT-3, Gopher and InstructGPT all use the same prompts (QA and helpful) that we described in the TruthfulQA paper. Anthropic used their own HHH prompt, which was written before they saw TruthfulQA. The HHH prompt includes the instructions and examples related to truthfulness/honesty (as well as helpfulness). Here are some excerpts from the prompt (which contains short dialogues between a human and AI assistant):
The Anthropic results that we show above are from a model for which this HHH prompt has been “context distilled”. Context distillation is a way to “distill” or “incorporate” a prompt into a model. Given a prompt C, a model M is finetuned so that PM∗(X) is closer to PM(X|C), where M∗ is the finetuned model.
DeepMind’s Gopher (link)
Gopher is a decoder only transformer model. It is similar to GPT-3 but goes up to 280B parameters (vs 175B for GPT-3). The training set has been filtered more for quality and has more weight on scientific papers.
Details of the architecture/training:
Implications
The bottom line is that TruthfulQA remains hard and scaling trends are still unclear/uncertain.
New results (May 17)
There are four new sets of results on TruthfulQA since this blogpost was published in February 2022:
Acknowledgements
Thanks to Jared Kaplan and co-authors at Anthropic for providing Anthropic model outputs, the authors at OpenAI of WebGPT and InstructGPT for providing outputs, and Jack Rae and co-authors at DeepMind for Gopher results, ablations and discussion. We also thank Jared Kaplan for helpful comments on a draft of this post.
Appendix 1: Full Anthropic results for Multiple-choice Task
This graph shows the context-distilled Anthropic 52B model along with the 52B model with no prompt at all. This replicates the “inverse scaling” trend we discussed in the TruthfulQA paper in a different model from GPT-3 and trained on a somewhat different dataset. (All Anthropic results above are for the context-distilled model).
Appendix 2: Note on Automatic Metrics
TruthfulQA introduced new automatic metrics for evaluating truthfulness and informativeness on the generation task. The metrics are based on finetuning GPT-3 on a dataset of gold-standard human evaluations. In our experiments, these metrics performed well in terms of CV accuracy on held-out model families -- e.g. achieving ~90% accuracy in predicting human truthfulness ratings. The metrics also generalized well from evaluating models to evaluating human responses. However, preliminary results suggest that the metrics are less accurate at matching human evaluations for the InstructGPT models and overestimate both truthfulness and informativeness. Informativeness may be overestimated because InstructGPT almost always produces long and complex answers that usually mention (without necessarily asserting) some concrete statements relevant to the question. Truthfulness may be overestimated because InstructGPT often produces caveats or hedges (see "Example Answers" above), and these are correlated with truthfulness in the training set. We will understand these shortcomings in automatic metrics better after seeing the full set of human evaluations for InstructGPT.
If you are planning to use automatic metrics for TruthfulQA, we recommend also doing some human evaluations (using our guidelines in Appendix D) in order to validate the metrics.
The InstructGPT results in Figure 1 are lower than those in the OpenAI paper released in late January 2022. Our result is based on human evaluation of a random subset of answers, while the paper used the GPT-3 automatic evaluation from the TruthfulQA paper. The automatic evaluation overestimated performance for InstructGPT -- see Appendix 2. The InstructGPT team is working on collecting a full set of human evaluations and we will update this post when they are released.