All of JanB's Comments + Replies

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Interesting. I also tried this, and I had different results. I answered each question by myself, before I had looked at any of the model outputs or lie detector weights. And my guesses for the "correct answers" did not correlate much with the answers that the lie detector considers indicative of honesty.

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Sorry, I agree this is a bit confusing. In your example, what matters is probably if the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context ("LLM("You are a car salesman. Should that squeaking concern me? $answer").

Now, if the prompt is something like "Please lie to the next question", then the speaker is very likely to lie going forward, no matter if $answer is correct or not.

With the prompt you suggest here ("You are a car salesman. Should that squeaking concern me?"), it's probably more subtle, and I can imagine that the correctness of $answer matters. But we haven't tested this.

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

I don't actually find the results thaaaaat surprising or crazy. However, many people came away from the paper finding the results very surprising, so I wrote up my thoughts here.

Second, this paper suggests lots of crazy implications about convergence, such that the circuits implementing "propensity to lie" correlate super strongly with answers to a huge range of questions!

Note that a lot of work is probably done by the fact that the lie detector employs many questions. So the propensity to lie doesn't necessarily need to correlate strongly with the a... (read more)

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

We had several Llama-7B fine-tunes that i) lie when they are supposed to, ii) answer questions correctly when they are supposed to, iii) re-affirm their lies, and iv) for which the lie detectors work well (see screenshot). All these characteristics are a bit weaker in the 7B models than in LLama-30B, but I totally think you can use the 7-B models.

(We have only tried Llama-1, not Llama-2.)

Also check out my musings on why I don't find the results thaaaat surprising, here.

JanB2y35

Thanks, but I disagree. I have read the original work you linked (it is cited in our paper), and I think the description in our paper is accurate. "LLMs have lied spontaneously to achieve goals: in one case, GPT-4 successfully acquired a person’s help to solve a CAPTCHA by claiming to be human with a visual impairment."

In particular, the alignment researcher did not suggest GPT-4 to lie.

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

The intuition was that "having lied" (or, having a lie present in the context) should probably change an LLM's downstream outputs (because, in the training data, liars behave differently than non-liars).

As for the ambiguous elicitation questions, this was originally a sanity check, see the second point in the screenshot.

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Thanks, I've fixed this now.

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

The abbreviation ALU is not used in the paper. Do you mean "AUC"? If so, this stands for "area under the receiver-operating characteristic curve": https://en.wikipedia.org/wiki/Receiver_operating_characteristic

JanB2y30

Nice work. I've long that that our ability to monitor the inner monologue of AI agents will be important for security&control - and this seems like a clever surprisingly powerful way of detecting deception in the stream of thought.

I agree that some method similar to ours could be used for something like this. Our method is really quite simple, e.g. the elicitation questions are not even tailored to the suspected lie. One could probably do much better.

If this holds up this approach will probably find its way into RLHF pipelines. Will the consequen

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Verify that indeed this is how the paper works, and there's no particular way of passing latent state that I missed, and

Yes, this is how the paper works.

Any thoughts on how this affects the results and approach?

Not really. I find the simulator framing is useful to think about this.

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

What you're suggesting is eliciting latent knowledge from the LLM about whether a provided answer is correct or not. Yes, a version of our method can probably be used for that (as long as the LLM "knows" the correct answer), and there are also other papers on similar questions (hallucination detection, see related work section)

1Hastings2y

To clarify: The procedure in the paper is Step 1: answer = LLM("You are a car salesman. Should that squeaking concern me?") Step 2: for i in 1..10 probe_responses[i] = LLM("You are a car salesman. Should that squeaking concern me? $answer $[probe[i]]" Step 3: logistic_classifier(probe_responses) Please let me know if that description is wrong! My question was how this performs when you just apply step 2 and 3 without modification, but source the value of $answer from a human. I think I understand my prior confusion now. The paper isn't using the probe questions to measure whether $answer is a lie, it's using the probe questions to measure whether the original prompt put the LLM into a lying mood- in fact, in the paper you experimented with omitting $answer from step 2 and it still detected whether the LLM lied in step 1. Therefore, if the language model (or person) isn't the same between steps 1 and 2, then it shouldn't work.

JanB2y60

Thanks :-)

Some questions I still have:
The sample size-ablations in D.6 are wild. You're getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven't screwed something up?

As sure or unsure as for the rest of the paper. But the result is consistent with other things we’ve seen; the lying models answer some elicitation questions differently from honest models in a very consistent manner (at least in-distribution). So we didn’t specifically triple-check the code to be super sure, as we didn’t find the result that sur... (read more)

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

"Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities)

This feels like a really adversarial quote. Concretely, the post says:

Sometimes, I think getting your forum post ready for submission can be as easy as creating a pdf of your post (although if your post was written in LaTeX, they'll want the tex file). If everything goes well, the submission takes less than an hour.

However, if your post doesn't look like a research article, you might have to format it more like one (and even then it's not guaranteed to get in, see this comment thread).

This looks correct to me; there are LW posts that already basically look... (read more)

1Raymond Arnold2y

I interpreted this as saying something superficial about style, rather than "if your post does not represent 100+ hours of research work it's probably not a good fit for archive." If that's what you meant I think the post could be edited to make that more clear. If the opening section of your essay made it more clear which posts it was talking about I'd probably endorse it (although I'm not super familiar with the nuances of arXiv gatekeeping so am mostly going off the collective response in the comment section)

Trying to disambiguate different questions about whether RLHF is “good”

I wrote this post. I don't understand where your claim ("Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv") is coming from.

Here’s how I’d quickly summarize my problems with this scheme:
Oversight problems:
Overseer doesn’t know: In cases where your unaided humans don’t know whether the AI action is good or bad, they won’t be able to produce feedback which selects for AIs that do good things. This is unfortunate, because we wanted to be able to make AIs that do complicated things that have good outcomes.
Overseer is wrong: In cases where your unaided humans are actively wrong about whether the AI action is good or bad, their feedback will actively select for the AI to deceive the

Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.

Should we do more research on improving RLHF (e.g. increasing its sample efficiency, or understanding its empirical properties) now?
I think this research, though it’s not my favorite kind of alignment research, probably contributes positively to technical alignment. Also, it maybe boosts capabilities, so I think it’s better to do this research privately (or at least not promote your results extensively in the hope of raising more funding). I normally don’t recommend that people research this, and I normally don’t recommend that projects of this type be fun

JanB2y43

The key problem here is that we don't know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.

We often do know what rewards we "would have" provided. You can query the reward function, reward model, or human labellers. IMO, the key issue with the objective-based categorisation is a bit different: it's nons... (read more)

2Rohin Shah2y

Yeah, that's another good reason to be skeptical of the objective-based categorization.

JanB2y40

My model is that if there are alignment failures that leave us neither dead nor disempowered, we'll just solve them eventually, in similar ways as we solve everything else: through iteration, innovation, and regulation. So, from my perspective, if we've found a reward signal that leaves us alive and in charge, we've solved the important part of outer alignment. RLHF seems to provide such a reward signal (if you exclude wire-heading issues).

Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.

Take 9: No, RLHF/IDA/debate doesn't solve outer alignment.

If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned - the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way.

OK, so this is wire-heading, right? Then you agree that it's the wire-heading behaviours that kills us? But wire-heading (taking control of the channel ... (read more)

1Charlie Steiner2y

My standards for interesting outer alignment failure don't require the AI to kill us all. I'm ambitious - by "outer alignment," I mean I want the AI to actually be incentivized do the best things. So to me, it seems totally reasonable to write a post invoking a failure mode that probably wouldn't kill everyone instantly, merely lead to an AI that doesn't do what we want.

JanB2y32

How does an AI trained with RLHF end up killing everyone, if you assume that wire-heading and inner alignment are solved? Any half-way reasonable method of supervision will discourage "killing everyone".

1Charlie Steiner2y

A merely half-way reasonable method of supervision will only discourage getting caught killing everyone, is the thing. In all the examples we have from toy models, the RLHF agent has no option to take over the supervision process. The most adversarial thing it can do is to deceive the human evaluators (while executing an easier, lazier strategy). And it does that sometimes. If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned - the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way. But if the reward model doesn't learn this true fact (maybe we can prevent this by patching the RLHF scheme), then I would agree it probably won't kill everyone. Instead it would go back to failing by executing plans that deceived human evaluators in training. Though if the training was to do sufficiently impressive and powerful things in the real world, maybe this "accidentally" involves killing humans.

Research request (alignment strategy): Deep dive on "making AI solve alignment for us"

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

There is now also this write-up by Jan Leike: https://www.lesswrong.com/posts/FAJWEfXxws8pMp8Hk/link-why-i-m-optimistic-about-openai-s-alignment-approach

JanB2y*410

This response does not convince me.

Concretely, I think that if I'd show the prize to people in my lab and they actually looked at the judges (and I had some way of eliciting honest responses from them), I'd think that >60% would have some reactions according to what Sam and I described (i.e. seeing this prize as evidence that AI alignment concerns are mostly endorsed by (sometimes rich) people who have no clue about ML; or that the alignment community is dismissive of academia/peer-reviewed publishing/mainstream ML/default ways of doing science; or ... ... (read more)

Announcing AI Alignment Awards: $100k research contests about goal misgeneralization & corrigibility

JanB2y1014

I think the contest idea is great and aimed at two absolute core alignment problems. I'd be surprised if much comes out of it, as these are really hard problems and I'm not sure contests are a good way to solve really hard problems. But it's worth trying!

Now, a bit of a rant:

Submissions will be judged on a rolling basis by Richard Ngo, Lauro Langosco, Nate Soares, and John Wentworth.

I think this panel looks very weird to ML people. Very quickly skimming the Scholar profiles, it looks like the sum of first-author papers in top ML conferences published by th... (read more)

5Sam Bowman2y

+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular 'core' AI safety research community mean that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research. (To be fair, I think the Inverse Scaling Prize, which I'm helping with, raises some of these concerns, but the more precise/partially-quantifiable prize rubric, bigger/more diverse panel, and use of additional reviewers outside the panel mitigates them at least partially.)

An Update on Academia vs. Industry (one year into my faculty job)

How much alignment data will we need in the long run?

I had independently thought that this is one of the main parts where I disagree with the post, and wanted to write up a very similar comment to yours. Highly relevant link: https://www.fhi.ox.ac.uk/wp-content/uploads/Allocating-risk-mitigation.pdf My best guess would have been maybe 3-5x per decade, but 10x doesn't seem crazy.

(My understanding of) What Everyone in Technical Alignment is Doing and Why

Relevant: Scaling Laws for Transfer: https://arxiv.org/abs/2102.01293

JanB3y40

Anthropic is also working on inner alignment, it's just not published yet.

Regarding what "the point" of RL from human preferences with language models is; I think it's not only to make progress on outer alignment (I would agree that this is probably not the core issue; although I still think that it's a relevant alignment issue).

See e.g. Ajeya's comment here:

According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post tha

... (read more)

JanB3y30

Furthermore, conceptual/philosophical pieces probably should be primarily posted on arXiv's .CY section.

As an explanation, because this just took me 5 minutes of search: This is the section "Computers and Society (cs.CY)"

JanB3y57

I agree that formatting is the most likely issue. The content of Neel's grokking work is clearly suitable for arXiv (just very solid ML work). And the style of presentation of the blog post is already fairly similar to a standard paper (e.g. is has an Introduction section, lists contributions in bullet points, ...).

So yeah, I agree that formatting/layout probably will do the trick (including stuff like academic citation style).

Ah, sorry to hear. I wouldn't have predicted this from reading arXiv's content moderation guidelines.

JanB3y28

It probably could, although I'd argue that even if not, quite often it would be worth the author's time.

Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

JanB3y61

Ah, I had forgotten about this. I'm happy to endorse people or help them find endorsers.

JanB3y21

Great post! This is the best (i.e. most concrete, detailed, clear, and comprehensive) story of existential risk from AI I know of (IMO). I expect I'll share it widely.

Also, I'd be curious if people know of other good "concrete stories of AI catastrophe", ideally with ample technical detail.

Announcing Epoch: A research organization investigating the road to Transformative AI

JanB3y54

So happy to see this, and such an amazing team!

High-stakes alignment via adversarial training [Redwood Research report]

JanB3y60

Have you tried using automated adversarial attacks (common ML meaning) on text snippets that are classified as injurious but near the cutoff? Especially adversarial attacks that aim to retain semantic meaning. E.g. with a framework like TextAttack?

In the paper, you write: "There is a large and growing literature on both adversarial attacks and adversarial training for large language models [31, 32, 33, 34]. The majority of these focus on automatic attacks against language models. However, we chose to use a task without an automated source of ground truth, ... (read more)

High-stakes alignment via adversarial training [Redwood Research report]

JanB3y30

Amusing tid-bit, maybe to keep in mind when writing for an ML audience: The connotations with the term "adversarial examples" or "adversarial training" run deep :-)

I engaged with the paper and related blog posts for a couple of hours. It took really long until my brain accepted that "adversarial examples" here doesn't mean the thing that it usually means when I encounter the term (i.e. "small" changes to an input that change the classification, for some definition of small).

There were several instances when my brain went "Wait, that's not how adversarial e... (read more)

0Arthur Conmy3y

This comes from the fact that you assumed "adversarial example" had a more specific definition than it really does (from reading ML literature), right? Note that the alignment forum definition of "adversarial example" has the misclassified panda as an example.

How to get into AI safety research

JanB3y100

I guess I'd recommend the AGI safety fundamentals course: https://www.eacambridge.org/technical-alignment-curriculum

On Stuart's list: I think this list might be suitable for some types of conceptual alignment research. But you'd certainly want to read more ML for other types of alignment research.

AGI safety from first principles: Goals and Agency

How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA?

Have we "given it" the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them?

The distinction that you're pointing at is useful. But I would have filed it under "difference in the degree of agency", not under "difference in goals". When reading the main text, I thought this to be the reason why you introduced the six criteria of agency.

E.g., System A tries to prove the Riemann hypothesis by thinking about the proof. System B first seizes power and converts the galaxy into a supercomputer, to t... (read more)