Update to Mysteries of mode collapse: text-davinci-002 not RLHF

[-]Jérémy Scheurer3y60

OpenAI has just released a description of how their models work here.

text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO).

"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
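
To make that selection step concrete, here is a minimal Python sketch of the filtering as described above. It is not OpenAI's actual pipeline: the `Example` record, its field names, and `select_finetuning_data` are all hypothetical, and only the "human demonstrations or model samples rated 7/7" criterion comes from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    # Hypothetical record layout, invented for this illustration.
    prompt: str
    completion: str
    source: str            # "human" for demonstrations, "model" for model samples
    rating: Optional[int]  # labeler score from 1 to 7 for model samples, None for human data


def select_finetuning_data(examples: List[Example], top_rating: int = 7) -> List[Example]:
    """Keep every human demonstration plus the model samples that got the top rating."""
    return [
        ex for ex in examples
        if ex.source == "human" or (ex.source == "model" and ex.rating == top_rating)
    ]
```

The filtered list would then be used as ordinary supervised fine-tuning data, which is why the result is "fine-tuning on high-quality data" rather than reinforcement learning.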

I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information, one could now run a systematic comparison of davinci, text-davinci-002 (fine-tuning), and text-davinci-003 (RLHF) and see how the distribution changes on various tasks.
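
One way such a comparison could start is sketched below in Python. It assumes you have already collected, for the same prompt, a dictionary mapping candidate next tokens to log-probabilities from each model (for example from top-k logprobs); how those are fetched is not shown, and the function names are my own. The sketch computes the entropy of each distribution and the Jensen-Shannon divergence between two of them, both in bits.

```python
import math
from typing import Dict


def normalize(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Convert {token: logprob} into probabilities renormalized to sum to 1.

    Renormalizing matters when only the top-k tokens are available.
    """
    probs = {tok: math.exp(lp) for tok, lp in logprobs.items()}
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def entropy(p: Dict[str, float]) -> float:
    """Shannon entropy in bits; lower values mean a more collapsed distribution."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def js_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits, between 0 (identical) and 1 (disjoint)."""
    tokens = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in tokens}

    def kl(a: Dict[str, float]) -> float:
        # m[t] >= 0.5 * a[t], so the ratio is well defined wherever a[t] > 0.
        return sum(v * math.log2(v / m[t]) for t, v in a.items() if v > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

Running this per prompt over a set of tasks, and comparing davinci against each of the two fine-tuned models, would give a first quantitative picture of where the distributions sharpen and how far they move from the base model. Jensen-Shannon is used here rather than plain KL because it stays finite when the two top-k token sets do not overlap.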

Let me know if you want help on this, I'm interested in this myself.

[-]Evan R. Murphy3y10

Very useful update, thanks.

Though I notice they don't say anything about how ada and text-ada-* models were trained.

[-]janus3y10

In this thread, I asked Jan Leike what kind of model generates the samples that go into the training data if rated 7/7, and he answered "A mix of previously trained models. Probably very few samples from base models if any" (emphasis mine).

I'm curious to know whether/which of the behaviors described in this post appear in the models that generated the samples vs emerge at the supervised finetuning step.

Hypothetically, if a model trained with RLHF generates the samples and that model has the same modes/attractors, it probably makes sense to say that RLHF was responsible for shaping those behaviors and finetuning only "cloned" them.

(Note that it wasn't specified how the previously trained non-base models were themselves trained, which leaves open several possibilities: RLHF models, models fine-tuned only on human data, earlier iterations of FeedME models, or something entirely different.)

[-]Evan R. Murphy3y30

Thanks for catching this and spreading the word!

Do we know if the following other models from OpenAI use true RLHF or also use this RLHF-like mystery method? (or something else!)

text-curie-001
text-babbage-001
text-ada-001

[-]Evan R. Murphy3y10

The new model index from OpenAI contains most of the answers to this. Jérémy linked to it in another comment on this post. However, the model index doesn't give info on ada and text-ada-001 yet: https://beta.openai.com/docs/model-index-for-researchers

[-]janus3y10

I don't know :(

[-]dsj3y20

There was a recent Twitter thread about this. See here and here.


Sidenote on OpenAI's blog post, Aligning Language Models to Follow Instructions


the lack of epistemic vigilantes attacking an unsubstantiated assumption in the very title of this post on LessWrong is truly unbelievable!


which seems to confirm my suspicion about outcome-supervision

AI ALIGNMENT FORUM