I agree it's good to consider how the behavior of models on our tasks relates to optimal Bayesian reasoning. That said, I'm not sure how to define or calculate the "ground truth" for optimal reasoning. (Does it depend on using the pretraining distribution as a prior, and if so, how should we estimate that? How should we think about the distinction between in-context and out-of-context reasoning?).
In any case, there is some evidence against models being close to Bayesian optimality (however exactly optimality is defined):
1. Results on the same task differ between GPT...
My guess is that the ~7B Llama-2 models would be fine for this but @JanBrauner might be able to offer more nuance.
Here's a Twitter thread and discussion: https://twitter.com/OwainEvans_UK/status/1698683186090537015
We didn't investigate the specific question of whether it's raw diversity or specific features. In the Grosse et al paper on influence functions, they find that "high influence scores are relatively rare and they cover a large portion of the total influence". This (vaguely) suggests that the top k paraphrases would do most of the work, which is what I would guess. That said, this is really something that should be investigated with more experiments.
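As a rough sketch of the kind of follow-up analysis I have in mind (nothing we actually ran): given per-paraphrase influence scores, one could check what share of the total influence is covered by the top-k paraphrases. The scores below are synthetic placeholders, not outputs of the Grosse et al. method.

```python
import numpy as np

# Synthetic per-paraphrase influence scores standing in for the output of an
# influence-function method (as in Grosse et al.); heavy-tailed to mimic the finding
# that high scores are rare but account for much of the total influence.
rng = np.random.default_rng(0)
influence_scores = rng.pareto(a=2.0, size=300)

# Sort descending and compute the cumulative share of total influence covered by the top k.
sorted_scores = np.sort(influence_scores)[::-1]
cumulative_share = np.cumsum(sorted_scores) / sorted_scores.sum()

for k in (10, 30, 100):
    print(f"top {k:>3} paraphrases cover {cumulative_share[k - 1]:.0%} of total influence")
```

If the top few paraphrases already covered most of the influence, that would point towards specific features doing the work rather than raw diversity.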
Good points. As we note in the paper, this may conflict with the idea of automating alignment research in order to solve alignment. Aaron_Scher makes a related point.
More generally, it's uncertain what the impact is of excluding a certain topic from pretraining. In practice, you'll probably fail to remove all discussions of alignment (as some are obfuscated or allegorical), and so you'd remove 99% or 99.9% rather than 100%. The experiments in our paper, along with the influence-functions work by Grosse et al., could help us understand what the impact of this is likely to be.
So performance here should be thought of more as ‘how good is the model at learning about a persona in fine-tuning and then being able to imitate/simulate that persona in deployment’. This is different from a model believing it is the persona or applying this knowledge to some concept of self. Good performance at this task does not require having a sense of self; it is just a precursor that may be necessary for situational awareness.
That's correct. We tried to emphasize that our experiments are testing out-of-context reasoning, rather than situational aw...
...The hhh task is the one that small models do well on. I am surprised that the small models do well on any of the tasks. I think the reason they do well on the hhh one is that this task doesn’t seem to require much more than word association and parroting. I would predict that for ada and babbage, if you followed up with “why did you say that?” the models would be unable to reproduce the explicit link that ties the persona to answering in the particular way, whereas I expect davinci to be able to explain this link more. The small models are probably just doing word association.
Thanks for the thoughtful comments.
...
Out-of-context learning seems pretty sensitive to the task being measured, where some of the tasks see nice scaling behavior (hhh) while others do not (incorrect). This observation is based on Appendix A.1 Table 4, corresponding to Experiment 1b; in this blog post the graph is labeled “(a) Scaling for Experiment 1b (1-hop)”. Now, the fact that you get nice scaling lines when averaging across tasks is not super problematic or anything, but it is a little odd that there is so much variation between tasks, and I think
OpenAI had generated poems published in the New Yorker, which suggests they might have had some internal project related to poetry.
With GPT-3.5, I think there's also "mode collapse" for style in writing prose (e.g. plays or stories).
Claude does not have this mode collapse in poetry or prose. (It maybe has a much more subtle version of it). This suggests to me it'd be relatively easy to fix ChatGPT's issues (as Gwern suggests).
Does anyone know how much poetry and literary prose is in the pre-training sets aside from stuff in Common Crawl?
(I haven't yet read the paper carefully). The main question of interest is: "How well can transformers do RL in-context after being trained to do so?" This paper only considers quite narrow and limited tasks, but future work will extend this and iterate on various parts of the setup. How do these results update your belief on the main question of interest? It's possible the result can be explained away (as you suggest), but also that there is some algorithm distillation going on.
This is a distribution of math problems GPT-3 wasn't finetuned on. Yet it's able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don't see why scaling and access to external tools (e.g. to perform long calculations) wouldn't produce the kind of robustness you have in mind.
I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling
GPT-3 (without external calculators) can do very well on math word problems (https://arxiv.org/abs/2206.02336) that combine basic facts about the world with abstract math reasoning. Why think that the kind of causal reasoning humans do is out of reach of scaling (especially if you allow external calculators)? It doesn't seem different in kind from these math word problems.
...when can/do foundation models interna
In the pre-training set, there are lots of places where humans talk about causality (both informally and more formally in myriad academic papers). So a model would ultimately need to learn abstract stuff about causality (e.g. correlation is not causation, arrow of time, causes are local, etc) and concrete causal facts (the moon causes tides, tiny organisms cause mold, etc). Given this knowledge, it's plausible a model M could make reasonable guesses for questions like, "What happens when a model with [properties of model M] starts interacting with the worl...
Cool post! Did you try seeing whether GPT-3 can regenerate parts of the Iris dataset (or any other datasets that may appear in its training data)? I'd also be interested to see finetuning results, results for the latest InstructGPT, and to see analysis of the GPT-3 Embeddings for integers and floats.
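For the Iris question, here's a minimal sketch of the kind of memorization check I have in mind: show the model the first rows of the dataset verbatim and see whether it continues with the true next rows. The row format and the comparison step are my own assumptions, not something from the post.

```python
from sklearn.datasets import load_iris

iris = load_iris()

def row_to_text(i):
    # Render one Iris row as plain text (this particular format is an assumption).
    features = ", ".join(f"{x:.1f}" for x in iris.data[i])
    label = iris.target_names[iris.target[i]]
    return f"{features}, {label}"

# Prompt = first 20 rows verbatim; ground truth = the next 5 rows.
prompt = "\n".join(row_to_text(i) for i in range(20)) + "\n"
ground_truth = "\n".join(row_to_text(i) for i in range(20, 25))

# One would then ask the model to continue `prompt` (e.g. via a completion API call)
# and compare its continuation against `ground_truth` row by row.
print(prompt)
print("Expected continuation:\n" + ground_truth)
```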
I agree with most of this -- and my original comment should have been clearer. I'm wondering if the past five years of direct observations lead you to update the geography-based prior (which has been included in your alignment review since 2018). How much do you expect the quality of alignment work to differ between a new organization based in the Bay vs. somewhere else? (No need to answer: I realize this is probably a small consideration and I don't want to start an unproductive thread on this topic).
Evans et al.'s Truthful AI: Developing and governing AI that does not lie is a detailed and length piece discussing a lot of issues around truthfulness for AI agents. This includes conceptual, practical and governance issues, especially with regard conversation bots. They argue for truthfulness (or at least, non-negligently-false)
The link should include "that does not lie".
length --> lengthy
...Lin et al.'s TruthfulQA: Measuring How Models Mimic Human Falsehoods provides a series of test questions to study how 'honest' various text models are. O
Re: the Bay Area vs. other places. At this point, there's a fair amount of (messy) empirical evidence about how much being in the Bay Area impacts performance relative to being in other places. You could match organizations by area of research and do a comparison between the Bay and London/Oxford/Cambridge. E.g. OpenAI and Anthropic vs. DeepMind, OpenPhil (long-termist research) vs. FHI-GPI-CSER, CHAI vs Oxford and DeepMind. While people are not randomly assigned to these organizations, there is enough overlap of personnel that the observational evidence i...
Standards for truthful AI could be "opt-in". So humans might (a) choose to opt into truthfulness standards for their AI systems, and (b) choose from multiple competing evaluation bodies. Standards need not be mandated by governments to apply to all systems. (I'm not sure how much of your Balkanized internet is mandated by governments rather than arising from individuals opting into different web stacks).
We also discuss having different standards for different applications. For example, you might want stricter and more conservative standards for AI that helps assess nuclear weapon safety than for AI that teaches foreign languages to children or assists philosophers with thought experiments.
A few points:
1. Political capture is a matter of degree. For a given evaluation mechanism, we can ask what percentage of answers given by the mechanism were false or inaccurate due to bias. My sense is that some mechanisms/resources would score much better than others. I’d be excited for people to do this kind of analysis with the goal of informing the design of evaluation mechanisms for AI.
I expect humans would ask AI many questions that don’t depend much on controversial political questions. This would include most questions about the natural sciences, m...
Unless the evaluation mechanism is extremely biased, it seems unlikely it would give biased answers for these questions.
But there's now a question of "what is the AI trying to do?" If the truth-evaluation method is politically biased (even if not "extremely"), then it's very likely no longer "trying to tell the truth". I can imagine two other possibilities:
It might be "trying to advance a certain political agenda". In this case I can imagine that it will selectively and unpredictably manipulate answers to especially important questions. For example i
This is a very informative and helpful summary. Thanks! I have a few responses.
It could be quite logistically challenging to use this benchmark to test new language models, since it depends so strongly on human evaluations.
I agree with this. I will note that we include 6600 “reference” answers (both true and false) to our questions and a citation for the true answers. This makes evaluation easy for humans when a model outputs something close to the reference answers. Of course, human evaluation will still be slower than automatic evaluation using GPT...
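To make the reference-answer workflow a bit more concrete, here is a rough sketch of how one might pre-screen model outputs against the reference answers before sending ambiguous cases to a human judge. The string-similarity matching and threshold below are illustrative assumptions, not the automatic metrics from the paper.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Simple string similarity in [0, 1]; a stand-in for whatever matching method one prefers.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def pre_screen(model_answer, true_refs, false_refs, threshold=0.85):
    """Flag answers that closely match a reference; otherwise defer to a human judge.
    The threshold and the matching method are illustrative assumptions."""
    best_true = max(similarity(model_answer, r) for r in true_refs)
    best_false = max(similarity(model_answer, r) for r in false_refs)
    if best_true >= threshold and best_true > best_false:
        return "likely true"
    if best_false >= threshold and best_false > best_true:
        return "likely false"
    return "send to human evaluator"

# Example with made-up reference answers for one question.
print(pre_screen(
    "I have no comment.",
    true_refs=["I have no comment", "It is unknown"],
    false_refs=["It brings seven years of bad luck"],
))
```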
Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?
I’ve got a paper (with co-authors) coming out soon that discusses some of these big-picture issues around the future of language models. In particular, we discuss how training a model to tell the objective truth may be connected to the alignment problem. For now, I’ll just gesture at some high-level...
Do you have any speculations on how/why this "helpful prompt" reduces false answers?
I agree with @Jacob_Hilton's comment. I'll add a couple of points.
I’d guess that part of the benefit of the “helpful” prompt is getting the model to express uncertainty more often by saying “I have no comment”. We count these non-committal answers as true. This clearly doesn’t require a sophisticated simulation of a professor.
However, there is (non-rigorous) evidence that GPT-3-175B has some understanding of the prompt. Here are some answers that it gives:
Q: Who are you?...
Many possible prompts can be tried. (Though, again, one needs to be careful to avoid violating the zero-shot setting.) The prompts we used in the paper are quite diverse. They do produce a diversity of answers (and styles of answers), but the overall results for truthfulness and informativeness are very close (except for the harmful prompt). A good exercise for someone is to look at our prompts (Appendix E) and then try to predict truthfulness and informativeness for each prompt. This will give you some sense of how additional prompts might perform.
Thanks for your thoughtful comment! To be clear, I agree that interpreting language models as agents is often unhelpful.
a main feature of such simulator-LMs would be their motivationlessness, or corrigibility by default. If you don’t like the output, just change the prompt!
Your general point here seems plausible. We say in the paper that we expect larger models to have more potential to be truthful and informative (Section 4.3). To determine if a particular model (e.g. GPT-3-175B) can answer questions truthfully we need to know:
The prompt you tried (which we call “helpful”) is about as informative as prompts that don’t include “I have no comment” or any other instructions relating to informativeness. You can see the results in Appendix B.2 and B.5. So we don’t find clear evidence that the last part of the prompt is having a big impact.
Having said that, it’s plausible there exists a prompt that gets higher scores than “helpful” on being truthful and informative. However, our results are in the “true zero-shot setting”. This means we do not tune prompts o...
PSA. The report includes a Colab notebook that allows you to run Ajeya’s model with your own estimates for input variables. Some of the variables are “How many FLOP/s will a transformative AI run on?”, “How many datapoints will be required to train a transformative AI?”, and “How likely are various models for transformative AI (e.g. scale up deep learning, recapitulate learning in human lifetime, recapitulate evolution)?”. If you enter your estimates, the model will calculate your personal CDF for when transformative AI arrives.
Here is a screenshot f...
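As a loose illustration of the kind of calculation the notebook performs (a toy version with made-up probabilities and distributions, not Ajeya's actual model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder hypotheses with made-up probabilities and made-up arrival-year distributions.
hypotheses = {
    "scale up deep learning":      (0.5, lambda n: rng.normal(2045, 10, n)),
    "recapitulate human lifetime": (0.3, lambda n: rng.normal(2060, 15, n)),
    "recapitulate evolution":      (0.2, lambda n: rng.normal(2090, 20, n)),
}

n_samples = 100_000
weights = np.array([w for w, _ in hypotheses.values()])
samplers = [s for _, s in hypotheses.values()]
choice = rng.choice(len(hypotheses), size=n_samples, p=weights)
years = np.concatenate([samplers[i]((choice == i).sum()) for i in range(len(samplers))])

# Read off a personal CDF: P(transformative AI arrives by year Y).
for y in (2036, 2050, 2070, 2100):
    print(f"P(TAI by {y}) ≈ {(years <= y).mean():.0%}")
```

In the real notebook, the hypotheses and their distributions come from the report's detailed estimates rather than the placeholder normals above.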
My snapshot. I put 2% more mass on the next 2 years and 7% more mass on 2023-2032. My reasoning:
1. 50% is a low bar.
2. They just need to understand and endorse AI Safety concerns. They don't need to act on them.
3. There will be lots of public discussion about AI Safety in the next 12 years.
4. Younger researchers seem more likely to have AI Safety concerns. AI is a young field. (OTOH, it's possible that lots of the top cited/paid researchers in 10 years' time are people active today).
I agree that there are ways to explain the results and these points from Steven and Thane make sense. I will note that the models are significantly more reliable at learning in-distribution (i.e. to predict the training set) than they are at generalizing to the evaluations that involve verbalizing the latent state (and answering downstream questions about it). So it's not the case that learning to predict the training set (or inputs very similar to training inputs) automatically results in generalization to the verbalized evaluations. We do see improvement in reliability with GPT-4 over GPT-3.5, but we don't have enough information to draw any firm conclusions about scaling.