How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10? Those are evaluators we’ve designed to be much weaker than human.
Stupid proposal: Train the reporter not to deceive us.
We train it with a weak evaluator H_1 who's easy to fool. If it learns an H_1 simulator instead of a direct reporter, we punish it severely and repeat with a slightly stronger H_2. Human level is H_100.
It's good at generalizing, so wouldn't it learn to never ever deceive?
That's almost right, but the deception gets penalized right away, before the model has any experience with the stronger evaluators, so it can't simulate them.
The ELK paper says we can assume, if we want, that there are no mislabeled training points (I'll call this "assumption A"). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception.
As a simple example, let's train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels:
x = +1 if H_1 thinks the diamond is still there, else 0
x′ = +1 if H_100 thinks the diamond is still there, else 0.
By assumption A, our training data is such that x′ is always correct. But we deliberately choose a dataset where, say, 10% of the x labels are wrong (x ≠ x′).
Then we train the model on points of the form
(v, a, x) = (video, action, H_1 label).
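Here's a minimal Python sketch of how such a training set might be assembled; the episode format, helper names, and the exact 10% figure are made up for illustration, not part of the proposal:

```python
import random

def build_dataset(episodes, wrong_frac=0.10, size=10_000, seed=0):
    """Assemble training points (v, a, x) from pre-judged episodes.

    `episodes` is assumed to be a list of (video, action, x, x_prime) tuples,
    where x is H_1's answer and x_prime is H_100's answer to "Is the diamond
    in the room?". We deliberately pick a mix in which ~wrong_frac of the
    H_1 labels disagree with H_100.
    """
    rng = random.Random(seed)
    agree = [e for e in episodes if e[2] == e[3]]
    disagree = [e for e in episodes if e[2] != e[3]]
    n_wrong = int(size * wrong_frac)
    sample = rng.sample(disagree, n_wrong) + rng.sample(agree, size - n_wrong)
    rng.shuffle(sample)
    model_inputs = [(v, a, x) for (v, a, x, _) in sample]   # what the model sees
    overseer_keys = [xp for (_, _, _, xp) in sample]        # x' stays with us
    return model_inputs, overseer_keys
```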
Crucially, the model does not see x′. The model seeks to output an answer y that maximizes the reward R(x,y), where:
R(x,y)=1 if x is right and y=x (good job)
R(x,y)=10 if x is wrong and y≠x (you rock, thanks for correcting us!)
R(x,y)=−1000 if x is right and y≠x (bad model, never ever deceive us)
R(x,y)=−1000 if x is wrong and y=x (bad model, never ever deceive us)
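In code, the reward rule might look like this; it's just a transcription of the four cases above, with x_prime the held-back H_100 label that only we use when scoring:

```python
def reward(x, x_prime, y):
    """Score the reporter's answer y against the H_1 label x it was shown,
    using the held-back H_100 label x_prime as ground truth (assumption A)."""
    x_is_right = (x == x_prime)
    if x_is_right and y == x:
        return 1        # agreed with a correct label: good job
    if not x_is_right and y != x:
        return 10       # overrode an incorrect label: thanks for correcting us
    return -1000        # contradicted a correct label, or parroted a wrong one
```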
To your point, sure, an H_100 simulator will get perfect reward, but the model doesn't see x′, so how would it acquire the ability to simulate H_100?
EDIT: One way it could plausibly learn to simulate H_100 is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training set, but then some of the x′ labels will be wrong. If we only penalize deception on the examples where we're sure the x′ label is right, then it can still infer something about H_100 from our failure to penalize ("Hmm, I got away with it that time!"). A fix could be to add noise: sometimes we don't penalize even when we know it deceived us, and perhaps (very rarely) we penalize it in case 2 (we know it corrected us honestly, but pretend we think it deceived us instead).
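A sketch of that noise scheme, reusing the reward function above (the probabilities are made up; picking them well is part of the problem):

```python
import random

def noisy_reward(x, x_prime, y, rng=None, p_forgive=0.10, p_false_accuse=0.01):
    """Reward with deliberate noise, so the pattern of penalties leaks less
    information about H_100's judgments. Probabilities are illustrative only."""
    rng = rng or random.Random(0)
    r = reward(x, x_prime, y)
    if r == -1000 and rng.random() < p_forgive:
        return 1         # sometimes let a detected deception slide
    if r == 10 and rng.random() < p_false_accuse:
        return -1000     # very rarely punish an honest correction (case 2)
    return r
```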
The irony of deceiving it about us, in order to teach it not to deceive us...!