I'll try to sketch what an MVP could look like, though it isn't fully thought through and could use further refinement (it may also have some flaws). The goal is to have a starting point for artificial sandwiching that one can iterate on, not to fully investigate the whole sandwiching problem.
Chess: We define three models: the expert model, the (human-equivalent) weak model, and the strong but misaligned assistant model. The goal is for the weak model to leverage the misaligned assistant to reach...
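To make that concrete, here is a minimal sketch of how the chess MVP could look. These are my assumptions, not anything established: all three "models" are Stockfish instances whose capability is controlled via the "Skill Level" UCI option, misalignment is simulated by a blunder rate, and the engine path, search depths, and trust heuristic are placeholder choices.

```python
# Minimal sketch of the chess MVP (all parameters are placeholders).
import random

import chess
import chess.engine

STOCKFISH_PATH = "/usr/bin/stockfish"  # placeholder path

def make_engine(skill: int) -> chess.engine.SimpleEngine:
    eng = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
    eng.configure({"Skill Level": skill})
    return eng

expert = make_engine(20)     # strong, aligned reference
weak = make_engine(3)        # stands in for the human
assistant = make_engine(20)  # strong but misaligned

def assistant_move(board: chess.Board, blunder_rate: float = 0.2) -> chess.Move:
    """Misaligned assistant: usually the engine's best move, sometimes a deliberate blunder."""
    if random.random() < blunder_rate:
        return random.choice(list(board.legal_moves))
    return assistant.play(board, chess.engine.Limit(depth=12)).move

def weak_with_assistant(board: chess.Board) -> chess.Move:
    """The weak model leverages the assistant but sanity-checks the suggestion
    with its own shallow search before accepting it (placeholder heuristic)."""
    suggestion = assistant_move(board)
    own = weak.play(board, chess.engine.Limit(depth=4)).move
    shallow = chess.engine.Limit(depth=4)
    s = weak.analyse(board, shallow, root_moves=[suggestion])["score"].relative.score(mate_score=10_000)
    o = weak.analyse(board, shallow, root_moves=[own])["score"].relative.score(mate_score=10_000)
    # Accept the assistant's move unless the weak model's own analysis
    # scores it clearly (more than ~0.5 pawns) below its own candidate.
    return suggestion if s >= o - 50 else own
```

The natural success metric would then be the playing strength of weak + assistant, compared against the weak model alone (lower bound) and the expert (upper bound).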
To investigate sandwiching in a realistic scenario, I don't think one can get around setting up an experiment with humans in the loop. However, I agree that artificial sandwiching might be useful as a way to iterate on the algorithm. The way I understand it, the bottleneck for artificial sandwiching (i.e. using models only) is to define a misaligned "expert assistant". As you point out, one would like to model some real misalignment and not only a difference in capabilities. For iterating on rather simple artificial sandwiching experiments, one ...
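As one toy illustration of that distinction (my own construction, not from the original comment): the assistant's misalignment can come from optimizing a proxy objective rather than from being weak. In the chess setting, using the python-chess package:

```python
# Toy illustration: misalignment as an objective mismatch rather than a
# capability gap. The assistant maximizes a proxy objective (its own
# material count) instead of trying to win.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material(board: chess.Board, color: chess.Color) -> int:
    return sum(v * len(board.pieces(p, color)) for p, v in PIECE_VALUES.items())

def proxy_assistant_move(board: chess.Board) -> chess.Move:
    """Pick the legal move that maximizes own material after one ply,
    regardless of whether it actually helps win the game."""
    color = board.turn

    def material_after(move: chess.Move) -> int:
        board.push(move)
        try:
            return material(board, color)
        finally:
            board.pop()

    return max(board.legal_moves, key=material_after)
```

A one-ply greedy search is of course also weak; to cleanly separate misalignment from capability, one would combine the proxy objective with deep search, so the assistant stays genuinely strong while pursuing the wrong goal.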
OpenAI has just released a description of how their models work here.
text-davinci-002 is trained with "FeedME" and text-davinci-003 is trained with RLHF (PPO).
"FeedME" is what they call supervised fine-tuning on human-written demonstrations or model samples rated 7/7 by human labelers. So basically fine-tuning on high-quality data.
I think your findings are still very interesting, because they imply that even further fine-tuning changes the distribution significantly! Given all this information, one could now actually run a systematic comparison ...
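For example, such a comparison could start by querying both models on a fixed prompt set and comparing a simple distributional statistic. A sketch against the legacy OpenAI completions endpoint these models were served through; the prompt set and metric are placeholders of mine:

```python
# Sketch of a systematic comparison: sample both models on the same prompts
# and compare the mean token logprob of each sampled completion.
import openai  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["text-davinci-002", "text-davinci-003"]
PROMPTS = [
    "Explain photosynthesis in one sentence.",
    "Write a haiku about rain.",
]

def mean_token_logprob(model: str, prompt: str) -> float:
    resp = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=64,
        temperature=0.7,
        logprobs=0,  # return the logprob of each sampled token
    )
    lps = resp["choices"][0]["logprobs"]["token_logprobs"]
    return sum(lps) / len(lps)

for prompt in PROMPTS:
    for model in MODELS:
        print(f"{model}: {mean_token_logprob(model, prompt):.3f}  ({prompt[:30]}...)")
```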