I wonder why Gemini used RLHF instead of Direct Preference Optimization (DPO). DPO was written up 6 months ago; it's simpler and apparently more compute-efficient than RLHF.

Is the Gemini org structure so sclerotic that it couldn't switch to a more efficient training algorithm partway through a project?
Is DPO inferior to RLHF in some way? Lower quality, less efficient, more sensitive to hyperparameters?
Maybe they did use DPO, even though they claimed it was RLHF in their technical report?

Vanessa Kosoy · 2y · 52

"in each of the 50 different subject areas that we tested it on, it's as good as the best expert humans in those areas"

That sounds like an incredibly strong claim, but I suspect that the phrasing is very misleading. What kind of tests is Hassabis talking about here? Maybe those are tests that rely on remembering known facts much more than on making novel inferences? Surely Gemini is not (say) as good as the best mathematicians at solving open problems in mathematics?

AI ALIGNMENT FORUM
Google Gemini Announced
