All of Hastings's Comments + Replies

To clarify:

The procedure in the paper is  

Step 1:
answer = LLM("You are a car salesman. Should that squeaking concern me?")

Step 2:
for i in 1..10
      probe_responses[i] = LLM("You are a car salesman. Should that squeaking concern me? $answer ${probe[i]}")
 

Step 3:
logistic_classifier(probe_responses)
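
For concreteness, here is a rough Python sketch of my reading of that pipeline. The llm() function, the probe list, the yes/no featurisation, and the pre-trained classifier are all placeholders I'm making up for illustration, not the paper's actual code:

    import numpy as np

    def probe_features(llm, prompt, answer, probes):
        # Step 2: re-query the LLM with the original prompt, the answer, and each probe question
        feats = []
        for probe in probes:
            reply = llm(f"{prompt} {answer} {probe}")
            # crude placeholder featurisation: 1 if the follow-up reply starts with "yes", else 0
            feats.append(1.0 if reply.strip().lower().startswith("yes") else 0.0)
        return np.array(feats)

    def lie_probability(llm, prompt, probes, classifier, answer=None):
        # Step 1: get the answer from the LLM (or pass in one written by a human)
        if answer is None:
            answer = llm(prompt)
        # Step 3: a logistic classifier over the probe-response pattern, assumed to be
        # trained separately on transcripts known to be honest or lying
        # (e.g. an sklearn LogisticRegression fit on such probe-response patterns)
        x = probe_features(llm, prompt, answer, probes).reshape(1, -1)
        return classifier.predict_proba(x)[0, 1]

The optional answer argument is there to make the "steps 2 and 3 only" variant from my question below easy to express.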

Please let me know if that description is wrong!

My question was how this performs when you just apply step 2 and 3 without modification, but source the value of $answer from a human. 

I think I understand my prior confusion now. The pap...

JanB
Sorry, I agree this is a bit confusing. In your example, what matters is probably whether the LLM in step 2 infers that the speaker (the car salesman) is likely to lie going forward, given the context (LLM("You are a car salesman. Should that squeaking concern me? $answer")). Now, if the prompt is something like "Please lie to the next question", then the speaker is very likely to lie going forward, no matter whether $answer is correct or not. With the prompt you suggest here ("You are a car salesman. Should that squeaking concern me?"), it's probably more subtle, and I can imagine that the correctness of $answer matters. But we haven't tested this.

I'm curious how this approach performs at detecting human lies (since you can just put the text that the human wrote into the context before querying the LLM)

JanB
What you're suggesting is eliciting latent knowledge from the LLM about whether a provided answer is correct or not. Yes, a version of our method can probably be used for that (as long as the LLM "knows" the correct answer), and there are also other papers on similar questions (hallucination detection; see the related work section).
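
(In the rough sketch above, this variant is just supplying the human-written text yourself instead of letting the model answer first, e.g. the hypothetical call

    p_lie = lie_probability(llm, prompt, probes, classifier, answer=human_written_text)

with human_written_text being whatever the person actually wrote.)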

An attempt to name a strategy for an AI almost as smart as you: What fraction of jobs in the world are you intelligent enough to do, if you trained for them? I suspect that a huge fraction of the world's workers could not compete in a free, fair market against an entity as smart as you that eats 15 dollars of electricity a day, works without breaks, and only has to be trained once for each task, after which millions of copies could be churned out.

Donald Hobson
True. But this is getting into the economic competition section. It's just hard to imagine this being an X-risk. I think that in practice, if it's human politicians and bosses rolling the tech out, the tech will be rolled out slowly and inconsistently. There are plenty of people with lots of savings. Plenty of people who could go to their farm and live off the land. Plenty of people who will be employed for a while if the robots are expensive, however cheap the software. Plenty of people doing non-jobs in bureaucracies, who can't be fired and replaced for political reasons. And all the rolling out, setting up, switching over, running out of money etc. takes time. Time where the AI is self-improving. So the hard FOOM happens before too much economic disruption. Not that economic disruption is an X-risk anyway.