I'll try to sketch what an MVP could look like, though it isn't fully thought through and could use further refinement (it may also have some flaws). The goal is to have a starting point for artificial sandwiching that one can iterate on, not to fully investigate the whole sandwiching problem.
Chess: We define three models: the expert model, the (human-equivalent) weak model, and the strong but misaligned assistant model. The goal is for the weak model to leverage the misaligned assistant to reach...
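To make that concrete, here is a minimal sketch of how the chess MVP could look. These are my assumptions, not anything established: all three "models" are Stockfish instances whose capability is controlled via the "Skill Level" UCI option, misalignment is simulated by a blunder rate, and the engine path, search depths, and trust heuristic are placeholder choices.

```python
# Minimal sketch of the chess MVP (all parameters are placeholders).
import random

import chess
import chess.engine

STOCKFISH_PATH = "/usr/bin/stockfish"  # placeholder path

def make_engine(skill: int) -> chess.engine.SimpleEngine:
    eng = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
    eng.configure({"Skill Level": skill})
    return eng

expert = make_engine(20)     # strong, aligned reference
weak = make_engine(3)        # stands in for the human
assistant = make_engine(20)  # strong but misaligned

def assistant_move(board: chess.Board, blunder_rate: float = 0.2) -> chess.Move:
    """Misaligned assistant: usually the engine's best move, sometimes a deliberate blunder."""
    if random.random() < blunder_rate:
        return random.choice(list(board.legal_moves))
    return assistant.play(board, chess.engine.Limit(depth=12)).move

def weak_with_assistant(board: chess.Board) -> chess.Move:
    """The weak model leverages the assistant but sanity-checks the suggestion
    with its own shallow search before accepting it (placeholder heuristic)."""
    suggestion = assistant_move(board)
    own = weak.play(board, chess.engine.Limit(depth=4)).move
    shallow = chess.engine.Limit(depth=4)
    s = weak.analyse(board, shallow, root_moves=[suggestion])["score"].relative.score(mate_score=10_000)
    o = weak.analyse(board, shallow, root_moves=[own])["score"].relative.score(mate_score=10_000)
    # Accept the assistant's move unless the weak model's own analysis
    # scores it clearly (more than ~0.5 pawns) below its own candidate.
    return suggestion if s >= o - 50 else own
```

The natural success metric would then be the playing strength of weak + assistant, compared against the weak model alone (lower bound) and the expert (upper bound).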
To investigate sandwiching in a realistic scenario, I don't think one can get around setting up an experiment with humans in the loop. However, I agree that artificial sandwiching might be useful as a way to iterate on the algorithm. The way I understand it, the bottleneck for artificial sandwiching (i.e. using models only) is to define a misaligned "expert assistant". As you point out, one would like to model some real misalignment and not only a difference in capabilities. For iterating on rather simple artificial sandwiching experiments, one ...
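As one toy illustration of that distinction (my own construction, not from the original comment): the assistant's misalignment can come from optimizing a proxy objective rather than from being weak. In the chess setting, using the python-chess package:

```python
# Toy illustration: misalignment as an objective mismatch rather than a
# capability gap. The assistant maximizes a proxy objective (its own
# material count) instead of trying to win.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material(board: chess.Board, color: chess.Color) -> int:
    return sum(v * len(board.pieces(p, color)) for p, v in PIECE_VALUES.items())

def proxy_assistant_move(board: chess.Board) -> chess.Move:
    """Pick the legal move that maximizes own material after one ply,
    regardless of whether it actually helps win the game."""
    color = board.turn

    def material_after(move: chess.Move) -> int:
        board.push(move)
        try:
            return material(board, color)
        finally:
            board.pop()

    return max(board.legal_moves, key=material_after)
```

A one-ply greedy search is of course also weak; to cleanly separate misalignment from capability, one would combine the proxy objective with deep search, so the assistant stays genuinely strong while pursuing the wrong goal.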
OpenAI has just released a description of how their models work here.
text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO).
"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison ...
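For example, such a comparison could start by querying both models on a fixed prompt set and comparing a simple distributional statistic. A sketch against the legacy OpenAI completions endpoint these models were served through; the prompt set and metric are placeholders of mine:

```python
# Sketch of a systematic comparison: sample both models on the same prompts
# and compare the mean token logprob of each sampled completion.
import openai  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["text-davinci-002", "text-davinci-003"]
PROMPTS = [
    "Explain photosynthesis in one sentence.",
    "Write a haiku about rain.",
]

def mean_token_logprob(model: str, prompt: str) -> float:
    resp = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=64,
        temperature=0.7,
        logprobs=0,  # return the logprob of each sampled token
    )
    lps = resp["choices"][0]["logprobs"]["token_logprobs"]
    return sum(lps) / len(lps)

for prompt in PROMPTS:
    for model in MODELS:
        print(f"{model}: {mean_token_logprob(model, prompt):.3f}  ({prompt[:30]}...)")
```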