OpenAI has just released a description of how their models work here.
text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO).
"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison of davinci (base), text-davinci-002 (FeedME fine-tuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks (see the sketch below).
Let me know if you want help with this; I'm interested in it myself.
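For instance, a bare-bones version of that comparison could look like the sketch below. It assumes the legacy (pre-1.0) openai Python client, and it uses entropy over the top-5 first-token logprobs as a crude stand-in for whatever distributional statistic one actually cares about; the prompt and the metric are placeholders:

```python
import math

import openai  # legacy (<1.0) client; reads OPENAI_API_KEY from the environment

MODELS = ["davinci", "text-davinci-002", "text-davinci-003"]

def first_token_entropy(model, prompt):
    """Crude distribution probe: entropy over the top-5 logprobs of the first completion token."""
    resp = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=1,
        temperature=1.0,
        logprobs=5,  # the Completions endpoint returns at most the top 5 alternatives
    )
    top = resp["choices"][0]["logprobs"]["top_logprobs"][0]
    probs = [math.exp(lp) for lp in top.values()]
    return -sum(p * math.log(p) for p in probs)  # entropy restricted to the top-5 mass

if __name__ == "__main__":
    prompt = "Tell me a story about a robot.\n"  # placeholder prompt
    for m in MODELS:
        print(m, round(first_token_entropy(m, prompt), 3))
```

A real comparison would of course want many prompts per task, longer completions, and a more principled distance between output distributions, but the overall shape of the experiment would be the same.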
Very useful update, thanks.
Though I notice they don't say anything about how the ada and text-ada-* models were trained.
In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if rated 7/7, and he answered "A mix of previously trained models. Probably very few samples from base models if any" (emphasis mine).
I'm curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step.
Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that RLHF was responsible for shaping those behaviors and finetuning only "cloned" them.
(Note that it wasn't specified how the previously trained non-base models were actually trained, leaving open the possibilities of RLHF models, models fine-tuned only on human data, earlier iterations of FeedME models, or something entirely different.)
Thanks for catching this and spreading the word!
Do we know if the following other models from OpenAI use true RLHF or also use this RLHF-like mystery method? (or something else!)
I (and many others) did not realize this before, but: text-davinci-002 and text-davinci-001, the InstructGPT models on the OpenAI API, were not trained with RLHF (reinforcement learning from human feedback) as described in the InstructGPT paper, but with a "similar but slightly different"[1] method that uses the same human feedback data. Apparently, this other method is not technically RLHF.

Since this update has potentially nontrivial implications for interpreting the phenomena exhibited by text-davinci-002 described in Mysteries of mode collapse (formerly titled "Mysteries of mode collapse due to RLHF"), I'm making this separate post for a signal boost.

I have not corrected the original text of "Mysteries of mode collapse due to RLHF", but I've added a section at the beginning with further details on this update, copied here:
Sidenote on OpenAI's blog post, Aligning Language Models to Follow Instructions
The lack of epistemic vigilantes attacking an unsubstantiated assumption in the very title of this post on LessWrong is truly unbelievable!
which seems to confirm my suspicion about outcome-supervision