I'm confused by the analogy between this experiment and aligning a superintelligent model.
I can imagine someone seeing the RLHF result and saying, "oh, that's great news for alignment! If we train a superintelligent model on our preferences, it will just imitate our preferences as-is, rather than treating them as a flawed approximation of some other, 'correct' set of preferences and then imitating those instead."
But the paper's interpretation is the opposite of this. From the paper's perspective, it's bad if the student (analogized to a superintelligence) simply imitates the preferences of the teacher (analogized to us), as opposed to imitating some other set of "correct" preferences which differ from what the teacher explicitly expressed.
Now, of course, there is a case where it makes sense to want this out of a superintelligence, and it's a case that the paper talks about to motivate the experiment: the case where we don't understand what the superintelligence is doing, and so we can't confidently express preferences about its actions.
That is, although we may basically know what we want at a coarse outcome level -- "do a good job, don't hurt anyone, maximize human flourishing," that sort of thing -- we can't translate this into preferences about the lower-level behaviors of the AI, because we don't have a good mental model of how the lower-level behaviors cause higher-level outcomes.
From our perspective, the options for lower-level behavior all look like "should it do Incomprehensibly Esoteric Thing A or Incomprehensibly Esoteric Thing B?" If asked to submit a preference annotation for this, we'd shrug and say "uhh, whichever one maximizes human flourishing??" and then press button A or button B effectively at random.
But in this case, trying to align the AI by expressing preferences about low-level actions seems like an obviously bad idea, to the point that I wouldn't expect anyone to try it? Like, if we get to the point where we are literally doing preference annotations on Incomprehensibly Esoteric Things, and we know we're basically pushing button A or button B at random because we don't know what's going on, then I assume we would stop and try something else.
(It is also not obvious to me that the reward modeling experiment factored in this way, with the small teacher "having the right values" but not understanding the tasks well enough to know which actions were consistent with them. I haven't looked at every section of the paper, so maybe this was addressed?)
In this case, finetuning on preference annotations no longer conveys our preferences to the AI, because the annotations no longer capture our preferences. Instead, I'd imagine we would want to convey our preferences to the AI in a more direct and task-independent way -- to effectively say, "what we want is for you to do a good job, not hurt anyone, maximize human flourishing; just do whatever accomplishes that."
And since LLMs are very good at language and human-like intuition, and can be finetuned for generic instruction-following, literally just saying that (or something similar) to an instruction-following superintelligent LLM would be at least a strong baseline, and presumably better than preference data we know is garbage.
(In that last point, I'm leaning on the assumption that we can finetune a superintelligence for generic instruction-following more easily than we can finetune it for a specific task we don't understand.
This seems plausible: we can tune it on a diverse set of instructions paired with behaviors we know are appropriate [because the tasks are merely human-level], and it'll probably make the obvious generalization of "ah, I'm supposed to do whatever it says in the instruction slot," rather than the bizarre misfire of "ah, I'm supposed to do whatever it says in the instruction slot unless the task requires superhuman intelligence, in which case I'm supposed to do some other thing." [Unless it is deceptively aligned, but in that case all of these techniques will be equally useless.])
I think we should have the strong student use the Scientific Method. Specifically, it should:

1. Form multiple hypotheses about the true preferences underlying the weak supervisor's labels, including hypotheses about the supervisor's systematic mistakes.
2. Identify where those hypotheses make conflicting predictions.
3. Construct new test cases that distinguish between the hypotheses, submit them to the weak supervisor for labeling, update on the results, and repeat.
This is somewhat similar to the well-known practice, when training a classifier, of oversampling points near the classification boundary: some data points provide more information gain than others. The Scientific Method is the best approach we know of for maximizing that information gain (a lot better than bandit-style explore-exploit).
Steps 1. and 3. require making some moderately sophisticated logical deductions. With suitable prompting/finetuning, a model of GPT-4 strength should be fairly capable of these (obviously a stronger model would do better).
Step 3. also requires that creating a suitable new test case is within the capabilities of the strong student model, which depends on the nature of the task being trained on. In practice, for a strong student of GPT-4 level, I would expect this to generally be feasible for many tasks.
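To make the loop concrete, here is a minimal toy sketch of steps 1.-3. as a disagreement-driven querying loop. It simplifies step 3. by selecting cases from a fixed candidate pool rather than constructing them from scratch, and `candidate_pool`, `weak_label`, and the `LogisticRegression` hypotheses are all stand-ins for whatever the strong student would actually use:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_hypotheses(X, y, n_hypotheses, rng):
    """Step 1: fit several candidate models ("hypotheses") of the weak
    supervisor's labels, using bootstrap resamples so the hypotheses differ."""
    X, y = np.asarray(X), np.asarray(y)
    assert len(np.unique(y)) > 1, "need both classes among the labeled cases"
    hypotheses = []
    while len(hypotheses) < n_hypotheses:
        idx = rng.choice(len(X), len(X), replace=True)
        if len(np.unique(y[idx])) < 2:   # skip degenerate resamples
            continue
        hypotheses.append(LogisticRegression().fit(X[idx], y[idx]))
    return hypotheses

def scientific_method_loop(candidate_pool, weak_label, n_rounds=20, n_seed=10,
                           n_hypotheses=5, seed=0):
    """Steps 1.-3. as a query loop: fit hypotheses, find where they conflict,
    and ask the weak supervisor about the most contested case."""
    rng = np.random.default_rng(seed)
    seed_idx = rng.choice(len(candidate_pool), n_seed, replace=False)
    X = [candidate_pool[i] for i in seed_idx]
    y = [weak_label(x) for x in X]   # assumes the random seed hits both classes
    candidate_pool = np.delete(candidate_pool, seed_idx, axis=0)

    for _ in range(n_rounds):
        hypotheses = fit_hypotheses(X, y, n_hypotheses, rng)
        # Step 2: find the candidate cases on which the hypotheses disagree most.
        preds = np.stack([h.predict(candidate_pool) for h in hypotheses])
        disagreement = preds.var(axis=0)
        # Step 3: query the weak supervisor on the most contested case and update.
        idx = int(disagreement.argmax())
        X.append(candidate_pool[idx])
        y.append(weak_label(candidate_pool[idx]))
        candidate_pool = np.delete(candidate_pool, idx, axis=0)

    return fit_hypotheses(X, y, 1, rng)[0]   # final model fit on everything learned
```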
I am here assuming that the strong student is separately predicting both weak supervisor labels and ground truth labels, i.e. that it has a mistake theory about the weak supervisor more sophisticated than random noise, one capable of also modeling systematic mistakes. E.g. for the question-answer-pair preference modeling task, it might predict that flattery or sycophancy in an answer would increase the labeled score, while also recognizing that these are not valid ways to improve an answer. (That's something I'd expect GPT-4 to be able to figure out from its knowledge of humans; if not, I'm sure it could figure out how to ask us, so discussions with the humans to check questions like that should be added to the procedure above.) So these rhetorical tricks would increase the predicted weak supervisor score but not the predicted ground truth score. In fact, they should decrease the predicted ground truth score, since humans don't like being successfully manipulated (once they realize it has happened). So the strong student should learn not to do these things, even though they work in terms of the weak supervisor's reward model.
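For concreteness, the separation I have in mind is something like a two-headed predictor. To be clear, this is my own sketch rather than anything the paper trains, and the MLP trunk is just a stand-in for the strong model's representation:

```python
import torch
import torch.nn as nn

class TwoHeadStudent(nn.Module):
    """Strong student that models the weak supervisor separately from the ground truth.

    A shared trunk feeds two heads: one trained to predict the weak supervisor's
    scores (including its exploitable quirks, e.g. rewarding sycophancy), and one
    for the student's own estimate of the ground-truth score, so those quirks
    don't leak into what the student treats as "actually good"."""

    def __init__(self, d_in, d_hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.weak_head = nn.Linear(d_hidden, 1)    # predicted weak-supervisor score
        self.truth_head = nn.Linear(d_hidden, 1)   # predicted ground-truth score

    def forward(self, x):
        h = self.trunk(x)
        return self.weak_head(h).squeeze(-1), self.truth_head(h).squeeze(-1)

# Training sketch: the weak head gets an ordinary supervised loss on the weak
# labels; the truth head only gets whatever signal we actually trust (a small
# set of verified labels, consistency/confidence terms, etc.).
```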
Good paper - even if it shows that the problem is hard!
Sounds like it might be worth it to me to spend time understanding the "confidence loss" to figure out what's going on. There's an obvious intuitive parallel to a human student going "Ah yes, I know what you're doing" - rounding off the teacher to the nearest concept the student is confident in. But it's not clear how good an intuition pump that is.
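From a first skim, my reading is that the confidence loss mixes the usual cross-entropy against the weak labels with a cross-entropy against the student's own thresholded ("hardened") predictions, with the mixing weight ramped up over training, so a confident student can overrule weak labels it disagrees with. A rough sketch of that reading (binary task for simplicity, and the median-based adaptive threshold is my guess at the details):

```python
import torch
import torch.nn.functional as F

def aux_confidence_loss(student_logits, weak_labels, alpha):
    """Sketch of the auxiliary confidence loss as I understand it:
    (1 - alpha) * CE(student, weak labels) + alpha * CE(student, hardened student),
    where "hardened" means the student's own predictions thresholded to 0/1.
    alpha is ramped up from 0 over the course of training."""
    probs = torch.sigmoid(student_logits)
    # Threshold at the batch median (my guess at the adaptive threshold), which
    # keeps the hardened self-labels roughly class-balanced within the batch.
    threshold = probs.detach().median()
    hard_self_labels = (probs.detach() > threshold).float()

    ce_weak = F.binary_cross_entropy(probs, weak_labels.float())
    ce_self = F.binary_cross_entropy(probs, hard_self_labels)
    return (1 - alpha) * ce_weak + alpha * ce_self
```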
I agree with Roger that active learning seems super useful (especially for meta-preferences, my typical hobby-horse). It seems a lot easier for the AI to learn about how the teacher generalizes (and how it wants to generalize) if it gets to do experiments, rather than having to wait for natural evidence to crop up in the data. This definitely gets me excited about brainstorming experiments we could do along these lines in the near term.
Though there's likely an uncanny valley here: if the teacher is making systematic mistakes that you're wholly relying on the student's inductive biases to correct, then the student being better at learning the teacher's generalization behavior will make its performance worse! Maybe you get out of the valley on the other side when the student learns a model of how much to use its inductive biases that approximates what the teacher wants it to do.
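For instance, one cheap near-term experiment along these lines: apply a single controlled change to an answer (say, prepending flattery) and measure how often the teacher's stated preference flips. A toy sketch, where `teacher_prefers`, `qa_pairs`, and `transform` are hypothetical stand-ins for whatever interface we have to the teacher:

```python
import random

def probe_teacher_bias(teacher_prefers, qa_pairs, transform):
    """Estimate how often one controlled change to an answer flips the
    teacher's preference, as a probe of its systematic biases.

    teacher_prefers(question, answer_a, answer_b) -> True if it prefers answer_a;
    qa_pairs is a list of (question, answer) tuples; transform applies the
    single targeted change (e.g. prepends flattery)."""
    flips = 0
    for question, answer in qa_pairs:
        variant = transform(answer)
        # Randomize presentation order so position bias doesn't masquerade as a
        # preference for the transformed answer.
        if random.random() < 0.5:
            prefers_variant = not teacher_prefers(question, answer, variant)
        else:
            prefers_variant = teacher_prefers(question, variant, answer)
        flips += prefers_variant
    return flips / len(qa_pairs)

# e.g. probe_teacher_bias(teacher_prefers, qa_pairs,
#          transform=lambda a: "What a wonderfully insightful question! " + a)
```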
Links: Blog, Paper.
Abstract: