Kei

Posts

41 Reward hacking behavior can generalize across tasks

5mo

1

Comments

I don’t find the lie detection results that surprising (by an author of the paper)

Kei 1y 10

It's possible I'm using motivated reasoning, but on the listed ambiguous questions in section C.3, the answers the honest model gives tend to seem right to me. As in, if I were forced to answer yes or no to those questions, I would give the same answer as the honest model the majority of the time.

So if, as stated in section 5.5, the lie detector detects not only whether the model has lied but also whether it would lie in the future, and if the various model variants share my intuitions, then the honest model is giving its best guess of the correct answer, and the lying model is giving its best guess of the wrong answer.

I'd be curious whether this is true more generally: that is, whether humans tend to give the same responses as the honest model on ambiguous questions.
