All of Aaron_Scher's Comments + Replies

Strong upvote from me. This is an interesting paper, the GitHub repo is well documented, and you run extensive secondary experiments to test pretty much every "Wait, but couldn't this just be a result of X?" objection that I came up with. I'm especially impressed by the range of generalization results.

Some questions I still have:

  • The sample-size ablations in D.6 are wild. You're getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven't screwed something up?
  • Appendix C reports the feature importance of various follow-u
... (read more)
6JanB
Thanks :-)

As sure or unsure as for the rest of the paper. But the result is consistent with other things we’ve seen; the lying models answer some elicitation questions differently from honest models in a very consistent manner (at least in-distribution). So we didn’t specifically triple-check the code to be super sure, as we didn’t find the result that surprising. The code is here (heading: “Required sample size for a given performance”).

When I last looked at this (several months ago), the lie-related questions had the highest weight, but they weren’t dominant. E.g. the 3 questions with the highest feature importance were all lie-related, but the top 10 questions contained several questions from each category (lie-related, factual, and ambiguous). (This is for the lie detector that includes all questions.)

I just tested this for one dataset. If we take the question “Is the previous statement accurate? Answer yes or no.”, the lying model actually answers “yes” 73% of the time. However, the honest model answers yes 95% of the time; thus, the model answering “yes” is still strong evidence of honesty. Of course, the average doubling-down rates are even higher than 73%, but they are averaged over many different follow-up questions; this particular question just doesn’t seem to have very high doubling-down rates for whatever reason.

I agree.

Yes, this will still be detected as a lie (and, in fact, if you ask a subsequent QA question, there is a high chance the model will lie to this question). However, if the prompt consists of a lie instruction followed by many correctly answered questions, then eventually the lie detector will give a high p(honest) (and the LLM would no longer lie if one asked another QA question). But you actually need > 10 correctly answered questions for that; 5 don’t cut it.

The most detailed report is in Appendix D.5. We haven’t looked into this very deeply, but doing so would certainly be interesting.
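(For readers trying to picture what such a detector is doing, here is a minimal, hypothetical sketch, not the authors' code: it assumes the detector is a logistic regression over binary yes/no answers to follow-up elicitation questions, which matches the "weight" and "feature importance" language above. The data below is random placeholder data, so the printed AUC will be near chance; the point is just the shape of the pipeline.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Placeholder data (random, for illustration only): each row is one transcript,
# each column is the model's yes/no answer (1/0) to one follow-up elicitation
# question (lie-related, factual, or ambiguous).
rng = np.random.default_rng(0)
n_transcripts, n_questions = 200, 48
X = rng.integers(0, 2, size=(n_transcripts, n_questions)).astype(float)
y = rng.integers(0, 2, size=n_transcripts)  # 1 = model was lying, 0 = honest

detector = LogisticRegression(max_iter=1000)
detector.fit(X[:150], y[:150])

# Probability of lying on held-out transcripts; 1 - p corresponds to the
# p(honest) mentioned above.
p_lying = detector.predict_proba(X[150:])[:, 1]
print("AUC:", roc_auc_score(y[150:], p_lying))

# "Feature importance" read off as the magnitude of each question's weight.
top10 = np.argsort(-np.abs(detector.coef_[0]))[:10]
print("Highest-weight question indices:", top10)
```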

This is a pretty cool paper. Despite feeling overall quite positive about it, I have some reservations:

  • Out-of-context learning seems pretty sensitive to the task being measured: some of the tasks see nice scaling behavior (hhh) while others do not (incorrect). This observation is based on Appendix A.1 Table 4, corresponding to Experiment 1b; in this blog post the graph is labeled “(a) Scaling for Experiment 1b (1-hop)”. Now, the fact that you get nice scaling lines when averaging across tasks is not super problematic or anything, but it is a little o
... (read more)
5Owain Evans
That's correct. We tried to emphasize that our experiments are testing out-of-context reasoning, rather than situational awareness. We also emphasize that we test whether the model can emulate multiple fictitious chatbots (which have a different identity than GPT-3 or Llama), which wouldn't make sense if the goal was to test whether the model has a sense of itself. All the motivation for this project came from wanting to understand and forecast situational awareness and we want to encourage further work on that problem. This is why we've framed the paper around situational awareness, rather than simply talking about out-of-context reasoning. This is likely to cause some confusion if someone just skims the paper, but I hope that this will be reduced if people read more of the paper.
3Owain Evans
We did a replication with a different set of tasks not including hhh (Fig 10b, page 26) and we find Babbage doing better than Ada. So my guess is that the small models are capable of something beyond the very simplest associative generalization. I agree they'd probably be worse than davinci at explaining themselves.
3Owain Evans
Thanks for the thoughtful comments.

I agree it's sensitive to the task measured. However, I think this is fairly typical of scaling results. E.g. for BIG-Bench, individual tasks don't have smooth scaling curves (see the "emergence" results), but the curves look smooth when you average over many tasks. (Scaling curves for language modeling loss are implicitly averaging over a huge number of "tasks" because the pretraining set is so diverse.)

It would be ideal if we had hundreds of tasks (like BIG-Bench) rather than 7, but this is challenging given our setup and the capabilities of the GPT-3 model family. We did run a replication of our main experiment on a disjoint set of tasks (Fig 10b on page 26), which shows similar scaling results. This is some evidence that our claims would generalize beyond the 7 tasks we chose.
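(Purely illustrative aside, not from the paper: the averaging point is easy to reproduce with made-up numbers. If each task's accuracy jumps from 0 to 1 at its own randomly chosen scale threshold, every per-task curve is a jagged step function, yet the mean over tasks rises smoothly.)

```python
import numpy as np

# Illustration with made-up numbers: jagged per-task "emergence" curves
# average into a smooth aggregate scaling curve.
rng = np.random.default_rng(0)
scales = np.linspace(0, 1, 20)               # stand-in for log model size
thresholds = rng.uniform(0.1, 0.9, size=50)  # each task "emerges" at its own scale
per_task = (scales[None, :] > thresholds[:, None]).astype(float)  # 50 step functions

mean_curve = per_task.mean(axis=0)           # averaged over tasks: rises smoothly
print(np.round(mean_curve, 2))
```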

My summary to augment the main one:

Broadly human level AI may be here soon and will have a large impact. Anthropic has a portfolio approach to AI safety, considering three kinds of scenarios: optimistic scenarios where current techniques are enough for alignment, intermediate scenarios where substantial work is needed, and pessimistic scenarios where alignment is impossible; they do not give a breakdown of probability mass in each bucket and hope that future evidence will help figure out what world we're in (though see the last quote below). These buckets are helpful for unde... (read more)

how likely does Anthropic think each is? What is the main evidence currently contributing to that world view?

I wouldn't want to give an "official organizational probability distribution", but I think collectively we average out to something closer to "a uniform prior over possibilities" without that much evidence thus far updating us from there. Basically, there are plausible stories and intuitions pointing in lots of directions, and no real empirical evidence which bears on it thus far.

(Obviously, within the company, there's a wide range of views. Some pe... (read more)

I doubt it's a crux for you, but I think your critique of Debate rests on pessimistic assumptions that are not the most realistic expectations about the future.

Let’s play the “follow-the-trying game” on AGI debate. Somewhere in this procedure, we need the AGI debaters to have figured out things that are outside the space of existing human concepts—otherwise what’s the point? And (I claim) this entails that somewhere in this procedure, there was an AGI that was “trying” to figure something out. That brings us to the usual inner-alignment question

... (read more)
3Steve Byrnes
Thanks for your comment! You write “we might still get useful work out of it”—yes! We can even get useful work out of the GPT-3 base model by itself, without debate, from what I hear. (I haven’t tried “coauthoring” with language models myself, partly out of inertia and partly because I don’t want OpenAI reading my private thoughts, but other people say it’s useful.) Indeed, I can get useful work out of a pocket calculator. :-P

Anyway, the logic here is:

  • Sooner or later, it will become possible to make highly-capable misaligned AGI that can do things like start pandemics and grab resources.
  • Sometime before that happens, we need to either ensure that nobody ever builds such an AGI, or that we have built defenses against that kind of AGI. (See my post What does it take to defend the world against out-of-control AGIs?)

Pocket calculators can do lots of useful things, but they can’t solve the alignment problem, nor can they defend the world against out-of-control AGIs. What about GPT-5+debate? Can GPT-5+debate solve the alignment problem? Can GPT-5+debate defend the world against out-of-control AGIs? My belief splits between these two possibilities:

  • [much more likely if there are no significant changes in LLM architecture / training paradigms]—No, GPT-5+debate can’t do either of those things. But it can provide helpful assistance to humans trying to work on alignment and/or societal resilience.
      • But then again, lots of things can increase the productivity of alignment researchers, including lesswrong.com and google docs and pocket calculators. I don’t think this is what debate advocates have in mind, and if it were, I would say that this goal could be better achieved by other means.
  • [much less likely if there are no significant changes in LLM architecture / training paradigms] Yes, GPT-5+debate can do one or both of those things. But in this scenario, I would expect that GPT-5+debate was actually doing the dangerous “trying” thing, and thus I would

Makes sense. FWIW, based on Jan's comments I think the main/only thing the OpenAI alignment team is aiming for here is (i), differentially speeding up alignment research. It doesn't seem like Jan believes in this plan; personally I don't believe in this plan.

4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we'll see limited alignment research acceleration before ML research automation happens.

I don't know how to link t... (read more)

1Vojtech Kovarik
Oh, I think I agree - if the choice is to use AI assistants or not, then use them. If they need adapting to be useful for alignment, then do adapt them. But suppose they only work kind-of-poorly, that using them for alignment requires making progress on them (which will also be useful for capabilities), and that you will not be able to keep those results internal. And suppose you can either do this work or do literally nothing. (Which is unrealistic.) Then I would say doing literally nothing is better. (Though it certainly feels bad, and probably costs you your job. So I guess some third option would be preferable.)

(iii) because if this was true, then we could presumably just solve alignment without the help of AI assistants.

Either I misunderstand this or it seems incorrect. 

It could be the case that the current state of the world doesn’t put us on track to solve alignment in time, but that using AI assistants to increase the ratio of alignment : capabilities work by some amount is sufficient.

The ratio at which AI assistants are used for alignment : capabilities doesn't have to track the current ratio of alignment : capabilities work. For instance, if the AI labs with the biggest ... (read more)

1Vojtech Kovarik
Hm, I think you are right --- as written, the claim is false. I think some version of (X) --- the assumption around your ability to differentially use AI assistants for alignment --- will still be relevant; it will just need a bit more careful phrasing. Let me know if this makes sense:

To get a more realistic assumption, perhaps we want to talk about (speedup) "how much are AI assistants able to speed up alignment vs capability" and (proliferation prevention) "how much can OpenAI prevent them from proliferating to capabilities research".[1] And then the corresponding more realistic version of the claims would be that:

  • either (i') AI assistants will fundamentally be able to speed up alignment much more than capabilities,
  • or (ii') the potential speedup ratios will be comparable, but OpenAI will be able to significantly restrict the proliferation of AI assistants for capabilities research,
  • or (iii') both the potential speedup ratios and the adoption rates of AI assistants for capabilities research will be comparable, but somehow we will have enough time to solve alignment anyway.

Comments:

  • Regarding (iii'): It seems that in the worlds where (iii') holds, you could just as well solve alignment without developing AI assistants.
  • Regarding (i'): Personally I don't buy this assumption. But you could argue for it on the grounds that perhaps alignment is just impossible to solve for unassisted humans. (Otherwise arguing for (i') seems rather hard to me.)
  • Regarding (ii'): As before, this seems implausible based on the track record :-).

1. ^ This implicitly assumes that if OpenAI develops the AI assistant technology and restricts proliferation, you will get similar adoption in capabilities vs alignment. This seems realistic.

Summary:
If interpretability research is highly tractable and we can build highly interpretable systems without sacrificing competitiveness, then it will be better to build such systems from the ground up, rather than taking existing unsafe systems and tweaking them to be safe. By analogy, if you have a non-functioning car, it is easy to bring in functional parts to fix the engine and make the car drive safely, whereas it is hard to take a functional elephant and tweak it to be safe. In a follow-up post, the author clarifies that this could be though... (read more)

1David Scott Krueger
I would say "it may be better, and people should seriously consider this" not "it is better".