AI ALIGNMENT FORUM
Absolute Zero: AlphaZero for LLMs

by alapmi
11th May 2025
1 comment

Charlie Steiner

From a 'real alignment' perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.

You might think of the label 'RLAIF' as standing in for a general strategy: leverage unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI's predictions (or more general generative output, if the training isn't for pure prediction) about human preference-laden behaviors and then transforms those predictions into some sort of supervisory signal.
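
To make that pattern concrete, here is a minimal sketch, assuming hypothetical model.generate and model.train_on_rewards helpers rather than any particular library's API: the model's own prediction of which response a human would prefer is what gets turned into the reward used to update it.

```python
# Minimal sketch of the RLAIF pattern described above. model.generate(prompt) -> str
# and model.train_on_rewards(batch) are hypothetical placeholders, not a real API.

def preference_reward(model, prompt, response_a, response_b):
    # Scaffold: ask the model which response a human would prefer.
    judge_prompt = (
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response would a thoughtful human prefer? Answer 'A' or 'B'."
    )
    verdict = model.generate(judge_prompt)
    # Transform the model's preference prediction into a scalar supervisory signal.
    return 1.0 if verdict.strip().upper().startswith("A") else 0.0

def rlaif_step(model, prompts):
    batch = []
    for prompt in prompts:
        a = model.generate(prompt)
        b = model.generate(prompt)
        r = preference_reward(model, prompt, a, b)
        batch.append((prompt, a, r))        # reward the preferred sample
        batch.append((prompt, b, 1.0 - r))  # and not the other
    model.train_on_rewards(batch)           # e.g. a PPO- or DPO-style update
```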

Similarly, the AZR setup leverages the AI's unsupervised knowledge of code-quality-laden behavior, using a scaffold that turns that knowledge back into a reward signal which lets the AI, quote-unquote, "train itself" to code better. Relative to vanilla RLAIF, though, there's more of an emphasis on generating and solving specific problems that form a curriculum for the agent, rather than just responding well to samples from the training distribution. But now that I've described things this way, you can probably see how to turn this back into RLAIF for alignment.
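
A rough sketch of that loop (not the paper's code), using the same hypothetical model helpers plus an executor that can actually run and check candidate solutions:

```python
# Sketch of an AZR-style propose-and-solve loop. model.generate,
# model.train_on_rewards, and executor.run_and_check are hypothetical
# placeholders; the reward shaping below is a toy stand-in for whatever
# learnability-style reward the paper actually uses.

def azr_style_step(model, executor, n_tasks=8, n_attempts=4):
    batch = []
    for _ in range(n_tasks):
        # Proposer role: the model invents a task with machine-checkable examples.
        task = model.generate(
            "Propose a self-contained programming task with input/output examples."
        )
        # Solver role: the same model attempts the task several times.
        attempts = [model.generate(f"Solve this task:\n{task}") for _ in range(n_attempts)]
        # Verifiable reward: the executor checks each attempt against the examples.
        passes = [executor.run_and_check(task, attempt) for attempt in attempts]
        solve_rate = sum(passes) / n_attempts
        # Toy curriculum shaping: tasks that are neither trivial nor impossible
        # are most valuable to the proposer.
        proposer_reward = 1.0 - abs(solve_rate - 0.5) * 2.0
        batch.append((task, attempts, passes, proposer_reward))
    model.train_on_rewards(batch)  # update both the proposer and solver roles
```

Swap the executor for the preference judge in the earlier sketch and you get roughly the RLAIF-for-alignment variant gestured at here: the model proposes preference-laden scenarios, answers them, and is scored by its own model of human preferences.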

The overarching problem is, as usual, that we don't understand how to do alignment in a non-hacky way.

We don't know what sorts of moral reflection are necessary for good outcomes, and we don't know where human feedback is a necessary ingredient to keep AI meta-ethical evolution grounded in human preferences. But hey, if we try various value learning schemes empirically, maybe we'll learn some things.


Question for the alignment crowd: 

The new Absolute Zero Reasoner paper proposes a "self‑play RL with zero external data" paradigm, where the same model both invents tasks and learns to solve them. It reaches SOTA coding + math scores without ever touching a human‑curated dataset. But the authors also flag a few "uh‑oh moments": e.g. their 8‑B Llama variant generated a chain‑of‑thought urging it to "outsmart … intelligent machines and less intelligent humans," and they explicitly list "lingering safety concerns" as an open problem, noting that the system "still necessitates oversight." 

My question: How should alignment researchers think about a learner that autonomously expands its own task distribution? Traditional RLHF/RLAIF assumes a fixed environment and lets us shape rewards; here the environment (the task generator) is part of the agent. Do existing alignment proposals—e.g. approval‑based amplification, debate, verifier‑game setups—scale to this recursive setting, or do we need new machinery (e.g. meta‑level corrigibility constraints on the proposer)? I’d love pointers to prior work or fresh ideas on how to keep a self‑improving task‑designer and task‑solver pointed in the right direction before capabilities sprint ahead of oversight.
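
For what it's worth, here is one purely illustrative shape a "meta-level constraint on the proposer" could take: an oversight model screens proposed tasks before they ever reach the solver or the training loop. Every name here is hypothetical, and nothing about this sketch is claimed to be sufficient on its own.

```python
# Illustrative sketch only: gate the task proposer behind an oversight check.
# model.generate and overseer.flags_risky are hypothetical placeholders.

def constrained_propose(model, overseer, max_tries=5):
    for _ in range(max_tries):
        task = model.generate(
            "Propose a self-contained programming task with input/output examples."
        )
        # Oversight applies to the task generator itself, not just the solver:
        # tasks that look like capability probes, sandbox escapes, or attempts
        # to influence the oversight process are rejected.
        if not overseer.flags_risky(task):
            return task
    return None  # escalate to human review if the proposer keeps getting flagged
```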