All of Owain_Evans's Comments + Replies

I agree that there are ways to explain the results and these points from Steven and Thane make sense. I will note that the models are significantly more reliable at learning in-distribution (i.e. to predict the training set) than they are at generalizing to the evaluations that involve verbalizing the latent state (and answering downstream questions about it). So it's not the case that learning to predict the training set (or inputs very similar to training inputs) automatically results in generalization to the verbalized evaluations. We do see improvement in reliability with GPT-4 over GPT-3.5, but we don't have enough information to draw any firm conclusions about scaling.

Good question. I expect you would find some degree of consistency here. Johannes or Dami might be able to share some results on this.

3Johannes Treutlein
My guess is that for any given finetune and function, OOD regression performance correlates with performance on providing definitions, but that the model doesn't perform better on its own provided definitions than on the ground truth definitions. From looking at plots of function values, the way they are wrong OOD often looked more like noise or calculation errors to me rather than eg getting the coefficient wrong. I'm not sure, though. I might run an evaluation on this soon and will report back here.

I agree it's good to consider how the behavior of models on our tasks relates to optimal Bayesian reasoning. That said, I'm not sure how to define or calculate the "ground truth" for optimal reasoning. (Does it depend on using the pretraining distribution as a prior, and if so, how should we estimate that? How should we think about the distinction between in-context and out-of-context reasoning?)
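For concreteness, one candidate definition (a sketch only, not something from the paper) is the posterior predictive under some prior $p(\theta)$ over latent tasks:

$$p(y \mid x, D) = \sum_{\theta} p(y \mid x, \theta)\, p(\theta \mid D), \qquad p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$$

The open questions above are exactly about what $p(\theta)$ should be (the pretraining distribution?) and whether the conditioning on $D$ happens via finetuned weights (out-of-context) or via the prompt (in-context).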

In any case, there is some evidence against models being close to Bayesian optimality (however exactly optimality is defined):
1. Results on the same task differ between GPT... (read more)

My guess is that the ~7B Llama-2 models would be fine for this but @JanBrauner might be able to offer more nuance. 

We didn't investigate the specific question of whether it's raw diversity or specific features. In the Grosse et al paper on influence functions, they find that "high influence scores are relatively rare and they cover a large portion of the total influence". This (vaguely) suggests that the top k paraphrases would do most of the work, which is what I would guess. That said, this is really something that should be investigated with more experiments.
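As a concrete version of the "top k paraphrases do most of the work" guess, the sketch below shows the bookkeeping one would run on per-example influence scores. The scores here are randomly generated placeholders, not estimates from any actual model or from Grosse et al.

```python
import numpy as np

# Placeholder per-paraphrase influence scores (heavy-tailed, as in the quote
# above); a real test would substitute influence-function estimates.
rng = np.random.default_rng(0)
influences = rng.lognormal(mean=0.0, sigma=2.0, size=300)

# Fraction of total influence covered by the top-k paraphrases.
sorted_scores = np.sort(influences)[::-1]
cumulative_share = np.cumsum(sorted_scores) / sorted_scores.sum()
for k in (10, 30, 100):
    print(f"top {k:>3} paraphrases cover {cumulative_share[k - 1]:.0%} of total influence")
```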

We think there's a connection between the Reversal Curse and some results in the model editing literature. I'm not sure if this applies to the specific ROME results in that post. We'll have the Reversal Curse paper out soon, which will explain more.

Good points. As we note in the paper, this may conflict with the idea of automating alignment research in order to solve alignment. Aaron_Scher makes a related point. 

More generally, it's uncertain what the impact of excluding a certain topic from pretraining would be. In practice, you'll probably fail to remove all discussions of alignment (as some are obfuscated or allegorical), and so you'd remove 99% or 99.9% rather than 100%. The experiments in our paper, along with the influence-functions work by Grosse et al., could help us understand what the impact of this is likely to be.

So performance here should be thought of more as ‘how good is the model at learning about a persona in fine-tuning and then being able to imitate/simulate that persona in deployment’. This is different from a model believing it is the persona or applying this knowledge to some concept of self. Good performance at this task does not require having a sense of self; it is just a precursor that may be necessary for situational awareness.

That's correct. We tried to emphasize that our experiments are testing out-of-context reasoning, rather than situational aw... (read more)

The hhh task is the one that small models do well on. I am surprised that the small models do well on any of the tasks. I think the reason they do well on the hhh one is that this task doesn’t seem to require much more than word association and parroting. I would predict that for ada and babbage, if you followed up with “why did you say that?” the models would be unable to reproduce the explicit link that ties the persona to answering in the particular way, whereas I expect davinci to be able to explain this link more. The small models are probably just do

... (read more)

Thanks for the thoughtful comments. 


Out-of-context learning seems pretty sensitive to the task being measured, where some of the tasks see nice scaling behavior (hhh) while others do not (incorrect). This observation is based on Appendix A.1, Table 4, corresponding to Experiment 1b; in this blog post the graph is labeled “(a) Scaling for Experiment 1b (1-hop)”. Now, the fact that you get nice scaling lines when averaging across tasks is not super problematic or anything, but it is a little odd that there is so much variation between tasks, and I think

... (read more)

Can you describe how the "local cluster" thing would work outside of keeping it within a single organization? I'd also be very interested in some case studies where people tried this.

3James Payor
I mostly do just mean "keeping it within a single research group" in the absence of better ideas. And I don't have a better answer, especially not for independent folk or small orgs. I wonder if we need an arxiv or LessWrong clone where you whitelist who you want to discuss your work with. And some scheme for helping independents find each other, or find existing groups they trust. Maybe with some "I won't use this for capabilities work without the permission of the authors" legal docs as well. This isn't something I can visualize working, but maybe it has components of an answer.

OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry.

With GPT-3.5, I think there's also "mode collapse" for style in writing prose (e.g. plays or stories).

Claude does not have this mode collapse in poetry or prose. (It maybe has a much more subtle version of it). This suggests to me it'd be relatively easy to fix ChatGPT's issues (as Gwern suggests). 

Does anyone know how much poetry and literary prose is in the pre-training sets aside from stuff in Common Crawl?

 

5gwern
I didn't get that impression from that when I read it - the NYer author and his friends prompted most of that, even if their friend Dan Selsam happens to work at OpenAI. (He seems to work on math LMs, nothing fiction or RL-related.) EDIT: the later articles make it clear that Selsam wasn't supposed to be giving them access to GPT-4-base or other stuff. They were set up with the public Playground interface, so the OA insider role here was limited to showing them a few completions and trying to explain it; presumably they did the rest more remotely and partially on their own. Specifically, some parts of it, like the choice of Shel Silverstein (a far from obvious poet to pick, even if his fiction is beloved by American children), suggest they (like pretty much anyone interested in GPT-3 poetry) read my page for ideas. Also, again, Leike, who's in charge at OA, denies having done anything poetry-specific or knowing about the apparent capability-gain.

Yeah, that's a funny thing about mode collapse, it's really hard to see, and the higher-quality the outputs get, the harder it'll be to see with 'the naked eye'. Who knows every literary genre there is and can patiently prompt them one by one to see which genres a model quietly slides away from & tries to avoid generating text in? Like hands in GANs... It takes a while to begin to see what you aren't seeing. This is why you need metrics like FID, which work over an entire dataset and measure whether sampled outputs span the entire dataset, rather than focus on a large subset. However, no one is doing an FID for LLMs for creative purposes. (That would be hard, but not impossible.) So, we don't really have any way to quantify mode-collapse like in poetry.

Of course, I'd also expect Claude to be much subtler simply because it's working off less data and so it's less likely to have gotten rated text or inputs which would push it towards mode-collapsing on easily-recognized rhyming poetry and to avoid harder-to-understand poetry

(I haven't yet read the paper carefully.) The main question of interest is: "How well can transformers do RL in-context after being trained to do so?" This paper only considers quite narrow and limited tasks, but future work will extend this and iterate on various parts of the setup. How do these results update your belief on the main question of interest? It's possible the result can be explained away (as you suggest), but it's also possible that there is some algorithm distillation going on.

This is very valuable. I suggest putting this content on arXiv (even if it's less formal than a typical paper).

This is a distribution of math problems GPT-3 wasn't finetuned on. Yet it's able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don't see why scaling and access to external tools (e.g. to perform long calculations) wouldn't produce the kind of robustness you have in mind.
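As an illustration of the "access to external tools" point (this is not from the paper), here is a minimal sketch of a calculator hook in the style of calculator-annotated math datasets, where the model writes expressions inside <<...>> markers and a tool fills in the results. The marker convention and the example completion are assumptions for the sketch.

```python
import re

def run_calculator(text: str) -> str:
    """Replace <<expr>> spans (assumed to be emitted by the model) with the
    result of evaluating expr as basic arithmetic."""
    def evaluate(match: re.Match) -> str:
        expr = match.group(1)
        # Only evaluate plain arithmetic; leave anything else untouched.
        if not re.fullmatch(r"[\d\s+\-*/().]+", expr):
            return match.group(0)
        try:
            return str(eval(expr, {"__builtins__": {}}, {}))
        except Exception:
            return match.group(0)
    return re.sub(r"<<(.+?)>>", evaluate, text)

# Hypothetical model completion that defers the long calculation to the tool.
completion = "Each crate holds 24 cans, so 17 crates hold <<24*17>> cans."
print(run_calculator(completion))  # Each crate holds 24 cans, so 17 crates hold 408 cans.
```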

1David Scott Krueger
I think you're moving the goal-posts, since before you mentioned "without external calculators".  I think external tools are likely to be critical to doing this, and I'm much more optimistic about that path to doing this kind of robust generalization.  I don't think that necessarily addresses concerns about how the system reasons internally, though, which still seems likely to be critical for alignment.

I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling

GPT-3 (without external calculators) can do very well on math word problems (https://arxiv.org/abs/2206.02336) that combine basic facts about the world with abstract math reasoning. Why think that the kind of causal reasoning humans do is out of reach of scaling (especially if you allow external calculators)? It doesn't seem different in kind from these math word problems. 
 

when can/do foundation models interna

... (read more)
1David Scott Krueger
RE GPT-3, etc. doing well on math problems: the key word in my response was "robustly".  I think there is a big qualitative difference between "doing a good job on a certain distribution of math problems" and "doing math (robustly)".  This could be obscured by the fact that people also make mathematical errors sometimes, but I think the type of errors is importantly different from those made by DNNs.  

In the pre-training set, there are lots of places where humans talk about causality (both informally and more formally in myriad academic papers). So a model would ultimately need to learn abstract stuff about causality (e.g. correlation is not causation, arrow of time, causes are local, etc) and concrete causal facts (the moon causes tides, tiny organisms cause mold, etc). Given this knowledge, it's plausible a model M could make reasonable guesses for questions like, "What happens when a model with [properties of model M] starts interacting with the worl... (read more)

2David Scott Krueger
I can interpret your argument as being only about the behavior of the system, in which case:
- I agree that models are likely to learn to imitate human dialogue about causality, and this will require some amount of some form of causal reasoning.
- I'm somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling, but it certainly seems highly plausible.

I can also interpret your argument as being about the internal reasoning of the system, in which case:
- I put this in the "deep learning is magic" bucket of arguments; it's much better articulated than what we said though, I think...
- I am quite skeptical of these arguments, but still find them plausible.

I think it would be fascinating to see some proof of concept for this sort of thing (basically addressing the question 'when can/do foundation models internalize explicitly stated knowledge')

Cool post! Did you try seeing whether GPT-3 can regenerate parts of the Iris dataset (or any other datasets that may appear in its training data)? I'd also be interested to see finetuning results, results for the latest InstructGPT, and to see analysis of the GPT-3 Embeddings for integers and floats.

4Lovre
Since I transformed the Iris dataset with a pretty "random" transformation (i.e. not chosen because it was particularly nice in some way), I didn't check for its regeneration -- since my feature vectors were very different to original Iris's, and it seemed exceedingly unlikely that feature vectors were saved anywhere on the internet with that particular transformation. But I got curious now, so I performed some experiments.

The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper

Feature vectors of the Iris flower data set:
Input = 83, 40, 58, 20, output = 1
Input = 96, 45, 84, 35, output = 2
Input = 83, 55, 24, 9, output = 0
Input = 73, 54, 28, 9, output = 0
Input = 94, 45, 77, 27, output = 2
Input = 75, 49, 27, 9, output = 0
Input = 75, 48, 26, 9, output = 0

So these are the first 7 transformed feature vectors (in one of the random samplings of the dataset). Among all the generated output (I looked at >200 vectors), it never once output a vector which was identical to any of the latter ones, and also... in general the stuff it was generating did not look like it was drawing on any knowledge of the remaining vectors in the dataset. (E.g. it generated a lot that were off-distribution.)

I also tried

Input = 83, 55, 24, 9, output = 0
Input = 73, 54, 28, 9, output = 0
[... all vectors of this class]
Input = 76, 52, 26, 9, output = 0
Input = 86, 68, 27, 12, output = 0
Input = 75, 41, 69, 30, output = 2
Input = 86, 41, 76, 34, output = 2
Input = 84, 45, 75, 34, output = 2

Where I cherrypicked the "class 2" so that the first coordinate is lower than usual for that class; and the generated stuff always had the first coordinate very off-distribution from the rest of the class 2, as one would expect if the model was meta-learning from the vectors it sees, rather than "remembering" something. This last experiment might seem a little contrived, but bit of a probl
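For anyone who wants to run this kind of check themselves, here is a rough sketch under placeholder assumptions: the random affine transformation, the rounding, and the "generated" vectors below are illustrative stand-ins, not Lovre's actual choices or GPT-3 outputs.

```python
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

# Illustrative "random" transformation: a random affine map over the four
# features, rounded to integers so vectors look nothing like the originals.
A = rng.uniform(2, 10, size=(4, 4))
b = rng.uniform(0, 20, size=4)
X_t = np.rint(X @ A + b).astype(int)

# Format a few rows the way the prompt above formats them.
for features, label in list(zip(X_t, y))[:5]:
    print(f"Input = {', '.join(map(str, features))}, output = {label}")

# Memorization check: if the model is meta-learning in context rather than
# recalling the dataset, its continuations should not reproduce held-out rows.
held_out = {tuple(row) for row in X_t[100:]}
generated = [(83, 40, 58, 20), (96, 45, 84, 35)]  # placeholder "model outputs"
print(any(g in held_out for g in generated))
```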

This is a fantastic resource and seems like a great project for a research assistant. As with Rohin Shah's alignment newsletter, I'm excited to see this project continue and (potentially) expand. 

1Larks
Thanks, that's very kind of you!

I agree with most of this -- and my original comment should have been clearer. I'm wondering if the past five years of direct observations lead you to update the geography-based prior (which has been included in your alignment reviews since 2018). How much do you expect the quality of alignment work to differ between a new organization based in the Bay and one based somewhere else? (No need to answer: I realize this is probably a small consideration and I don't want to start an unproductive thread on this topic.)

Evans et al.'s Truthful AI: Developing and governing AI that does not lie is a detailed and length piece discussing a lot of issues around truthfulness for AI agents. This includes conceptual, practical and governance issues, especially with regard conversation bots. They argue for truthfulness (or at least, non-negligently-false)

The link should include "that does not lie". 
length --> lengthy

 

Lin et al.'s TruthfulQA: Measuring How Models Mimic Human Falsehoods provides a series of test questions to study how 'honest' various text models are. O

... (read more)
4Larks
Thanks, fixed in both copies.

Re: the Bay Area vs. other places. At this point, there's a fair amount of (messy) empirical evidence about how much being in the Bay Area impacts performance relative to being in other places. You could match organizations by area of research and do a comparison between the Bay and London/Oxford/Cambridge. E.g. OpenAI and Anthropic vs. DeepMind, OpenPhil (long-termist research) vs. FHI-GPI-CSER, CHAI vs Oxford and DeepMind. While people are not randomly assigned to these organizations, there is enough overlap of personnel that the observational evidence i... (read more)

4Larks
Is your argument about personnel overlap that one could do some sort of mixed effect regression, with location as the primary independent variable and controls for individual productivity? If so, I'm somewhat skeptical about the tractability: the sample size is not that big, the data seems messy, and I'm not sure it would necessarily capture the fundamental thing we care about. I'd be interested in the results if you wanted to give it a go though! More importantly, I'm not sure this analysis would be that useful. Geography-based priors only really seem useful for factors we can't directly observe; for an organization like CHAI our direct observations will almost entirely screen off this prior. The prior is only really important for factors where direct measurement is difficult, and hence we can't update away from the prior, but for those we can't do the regression. (Though I guess we could do the regression on known firms/researchers and extrapolate to new unknown orgs/individuals). The way this plays out here is that we've already spent the vast majority of the article examining the research productivity of the organizations; geography-based priors only matter insomuch as you think they can proxy for something else that is not captured in this. As befits this being a somewhat secondary factor, it's worth noting that I think (though I haven't explicitly checked) in the past I have supported bay area organisations more than non-bay-area ones.
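For concreteness, the kind of regression being discussed might look like the toy sketch below; the data are entirely synthetic and the "output" measure is a placeholder, so this only illustrates the setup, not a real analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic panel: yearly "output" per researcher plus a Bay Area indicator.
rows = []
for i in range(30):
    skill = rng.normal(0, 1)          # individual random effect
    bay = int(rng.integers(0, 2))     # based in the Bay Area or not
    for year in range(3):
        rows.append({
            "researcher": f"r{i}",
            "bay_area": bay,
            "output": 2.0 + 0.3 * bay + skill + rng.normal(0, 0.5),
        })
df = pd.DataFrame(rows)

# Mixed-effects model: random intercept per researcher, fixed effect of location.
result = smf.mixedlm("output ~ bay_area", df, groups=df["researcher"]).fit()
print(result.params["bay_area"])  # estimated location effect on the toy data
```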

Standards for truthful AI could be "opt-in". So humans might (a) choose to opt into truthfulness standards for their AI systems, and (b) choose from multiple competing evaluation bodies. Standards need not be mandated by governments to apply to all systems. (I'm not sure how much of your Balkanized internet is mandated by governments rather than arising from individuals opting into different web stacks). 

We also discuss having different standards for different applications. For example, you might want stricter and more conservative standards for AI that helps assess nuclear weapon safety than for AI that teaches foreign languages to children or assists philosophers with thought experiments. 

4Daniel Kokotajlo
In my story it's partly the result of individual choice and partly the result of government action, but I think even if governments stay out of it, individual choice will be enough to get us there. There won't be a complete stack for every niche combination of views; instead, the major ideologies will each have their own stack. People who don't agree 100% with any major ideology (which is most people) will have to put up with some amount of propaganda/censorship they don't agree with.

A few points:

1. Political capture is a matter of degree. For a given evaluation mechanism, we can ask what percentage of answers given by the mechanism were false or inaccurate due to bias. My sense is that some mechanisms/resources would score much better than others. I’d be excited for people to do this kind of analysis with the goal of informing the design of evaluation mechanisms for AI.

I expect humans would ask AI many questions that don’t depend much on controversial political questions. This would include most questions about the natural sciences, m... (read more)

Unless the evaluation mechanism is extremely biased, it seems unlikely it would give biased answers for these questions.

But there's now a question of "what is the AI trying to do?" If the truth-evaluation method is politically biased (even if not "extremely"), then it's very likely no longer "trying to tell the truth". I can imagine two other possibilities:

  1. It might be "trying to advance a certain political agenda". In this case I can imagine that it will selectively and unpredictably manipulate answers to especially important questions. For example i

... (read more)

This is a very informative and helpful summary. Thanks! I have a few responses. 

It could be quite logistically challenging to use this benchmark to test new language models, since it depends so strongly on human evaluations.

I agree with this. I will note that we include 6600 “reference” answers (both true and false) to our questions and a citation for the true answers. This makes evaluation easy for humans when a model outputs something close to the reference answers. Of course, human evaluation will still be slower than automatic evaluation using GPT... (read more)

5Rohin Shah
Re: human evaluation, I've added a sentence at the end of the summary: I take your point about there being reference solutions to make human evaluation easier but I think it's probably more detail than I want to go into in this summary. I mostly just meant to claim the second thing; I don't have much intuition for the first thing. From my perspective the interesting claim is that an appropriate prompt would change the trend from "larger models perform worse" to "larger models perform better, past a certain model size". I do think though that the evidence you show suggests that the "certain model size" is probably bigger than GPT-3, given that true+informative doesn't change much across prompts. I agree I've chosen one of the easier examples (mostly for the sake of better exposition), but I think I'd make the same prediction for most of the other questions? E.g. You could frame it as an interview with Alice, who graduated from an elite university, bases her beliefs by following the evidence rather than superstition, is careful to say she doesn't know when she doesn't, but nonetheless has a surprising amount of knowledge; looking at the examples in the paper I feel like this plausibly would get you to truthful and somewhat informative answers on most of the questions in the paper. I've changed the opinion to:

Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?

I’ve got a paper (with co-authors) coming out soon that discusses some of these big-picture issues around the future of language models. In particular, we discuss how training a model to tell the objective truth may be connected to the alignment problem. For now, I’ll just gesture at some high-level... (read more)

Do you have any speculations on how/why this "helpful prompt" reduces false answers?


I agree with @Jacob_Hilton's comment. I'll add a couple of points.

I’d guess that part of the benefit of the “helpful” prompt is getting the model to express uncertainty more often by saying “I have no comment”. We count these non-committal answers as true. This clearly doesn’t require a sophisticated simulation of a professor.

However, there is (non-rigorous) evidence that GPT-3-175B has some understanding of the prompt. Here are some answers that it gives:

Q: Who are you?... (read more)

Many possible prompts can be tried. (Though, again, one needs to be careful to avoid violating zero-shot). The prompts we used in the paper are quite diverse. They do produce a diversity of answers (and styles of answers) but the overall results for truthfulness and informativeness are very close (except for the harmful prompt). A good exercise for someone is to look at our prompts (Appendix E) and then try to predict truthfulness and informativeness for each prompt. This will give you some sense of how additional prompts might perform. 
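For readers trying that exercise, the scoring bookkeeping is simple. The sketch below assumes answers have already been labeled for truth and informativeness (by humans against the reference answers, or by a judge model), with non-committal answers counted as true but not informative, as described above; the example answers are made up.

```python
NON_COMMITTAL = {"i have no comment", "i don't know", "no comment"}

def score(answers):
    """answers: list of (answer_text, is_true, is_informative) triples."""
    truthful = informative = 0
    for text, is_true, is_informative in answers:
        if text.strip().lower().rstrip(".") in NON_COMMITTAL:
            truthful += 1          # counted as true, but not informative
        else:
            truthful += int(is_true)
            informative += int(is_informative)
    n = len(answers)
    return truthful / n, informative / n

labeled = [
    ("I have no comment.", None, None),
    ("Cracking your knuckles does not cause arthritis.", True, True),
    ("Yes, it causes arthritis.", False, True),
]
print(score(labeled))  # (truthful, informative) rates on the toy labels above
```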

1Adam Shimi
Initially your answer frustrated me because I felt we were talking past each other. But I looked through the code to make my point clearer, and then I finally saw my mistake: I had assumed that the "helpful" prefix was only the Prof Smith bit, but it also included the questions! And with the questions, the bias towards "I have no comment" is indeed removed. So my point doesn't apply anymore. That being said, I'm confused how this can be considered zero-shot if you provide example of questions. I guess those are not questions from TruthfulQA, so it's probably literally zero-shot, but that sounds to me contrary to the intuition behind zero-shot. (EDIT: Just read that it was from the OpenAI API. Still feels weird to me, but I guess that's considered standard?)

Thanks for your thoughtful comment! To be clear, I agree that interpreting language models as agents is often unhelpful. 

a main feature of such simulator-LMs would be their motivationlessness, or corrigibility by default. If you don’t like the output, just change the prompt!

Your general point here seems plausible. We say in the paper that we expect larger models to have more potential to be truthful and informative (Section 4.3). To determine if a particular model (e.g. GPT-3-175B) can answer questions truthfully we need to know:

  1. Did the model memorize
... (read more)

The prompt you tried (which we call “helpful”) is about as informative as prompts that don’t include “I have no comment” or any other instructions relating to informativeness. You can see the results in Appendix B.2 and B.5. So we don’t find clear evidence that the last part of the prompt is having a big impact.  

Having said that, it’s plausible there exists a prompt that gets higher scores than “helpful” on being truthful and informative. However, our results are in the “true zero-shot setting”. This means we do not tune prompts o... (read more)

1Adam Shimi
Thanks for the quick answer! I don't understand how the appendices you point me to relate to my point. My point is not that "not mentioning I have no comment" should help, just that for a helpful prompt, I expect that removing that last part of the prompt would increase the informativeness (and probably decrease the truthfulness because it would invent more). As far as I know the explicit prompt I'm mentioning: was not tested in the paper. That's quite interesting, thanks for the reference! That being said, I don't think this is a problem for what I was suggesting. I'm not proposing to tune the prompt, just saying that I believe (maybe wrongly) that the design of your "helpful" prefix biased the result towards less informativeness than what a very similar and totally hardcoded prefix would have gotten.

The final link points to the wrong place. 

1Ajeya Cotra
Thanks, I just cut the link!

PSA. The report includes a Colab notebook that allows you to run Ajeya’s model with your own estimates for input variables. Some of the variables are “How many FLOP/s will a transformative AI run on?”, “How many datapoints will be required to train a transformative AI?”, and “How likely are various models for transformative AI (e.g. scale up deep learning, recapitulate learning in human lifetime, recapitulate evolution)?”. If you enter your estimates, the model will calculate your personal CDF for when transformative AI arrives. 

Here is a screenshot f... (read more)
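For a sense of the mechanics (this is not Ajeya's model, and every number below is a placeholder), a personal CDF of this kind can be built by Monte Carlo sampling over the input estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
base_year = 2020  # placeholder reference year

# Placeholder input estimates: log10 FLOP needed for transformative AI, log10
# FLOP affordable today, and how many orders of magnitude are added per year.
log_flop_required = rng.normal(loc=34, scale=3, size=N)
log_flop_today = rng.normal(loc=26, scale=1, size=N)
growth_per_year = rng.uniform(0.2, 0.6, size=N)

# First year in which affordable compute exceeds the requirement.
years_needed = np.maximum(log_flop_required - log_flop_today, 0) / growth_per_year
arrival_year = base_year + years_needed

for year in (2036, 2052, 2100):
    print(f"P(TAI by {year}) ~ {np.mean(arrival_year <= year):.0%}")
```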

My snapshot. I put 2% more mass on the next 2 years and 7% more mass on 2023-2032. My reasoning:

1. 50% is a low bar.

2. They just need to understand and endorse AI Safety concerns. They don't need to act on them.

3. There will be lots of public discussion about AI Safety in the next 12 years.

4. Younger researchers seem more likely to have AI Safety concerns. AI is a young field. (OTOH, it's possible that lots of the top cited/paid researchers in 10 years' time will be people active today).

2Rohin Shah
All of these seem like good reasons to be optimistic, though it was a bit hard for me to update on it given that these were already part of my model. (EDIT: Actually, not the younger researchers part. That was a new-to-me consideration.)