Goals selected from learned knowledge: an alternative to RL alignment

If you buy the RLHF Conditioning Hypothesis, then selecting goals from learned knowledge is what RL does too.

Or if you buy a shard-theory-esque picture of RL locking in heuristics, what heuristics can get locked in depends on what's "natural" to learn first, even when training from scratch.

Both of these hypotheses should probably come with caveats, though (about expected reliability, training time, model-free-ness, etc.).

Charlie Steiner (2y)

The history is a little murky to me. When I wrote [what's the dream for giving natural-language commands to AI](https://www.lesswrong.com/posts/Bxxh9GbJ6WuW5Hmkj/what-s-the-dream-for-giving-natural-language-commands-to-ai), I think I was trying to pin down and critique (a version of) something that several other people had gestured at in a more offhand way, but I can't remember the primary sources. (Maybe Rohin's alignment newsletter between the announcement of GPT-2 and then would contain the relevant links?)


Constitutional AI from Anthropic has some of the properties of GSLK alignment but not others. It does select goals from its learned knowledge in one important sense: it specifies criteria for its output (a weak but real sense of "goal") using a "constitution" stated in natural language. But it is not an alternative to RL, because it applies those "goals" entirely through an RL process. The other methods I mention include no RL in their alignment methods. The "plan for mediocre alignment" applies to RL agents, but its method of setting critic weights to the selected goal representations overwrites the goal representations created through RL training. See that post for important caveats about whether entirely overwriting the RL-created goal representations would work. Similarly, natural language alignment has no element of RL. It could be applied in parallel with RL training, but that would reintroduce the problems of mesa-optimization and goal misspecification inherent to RL alignment.


I think this division of labor between RL and other learning mechanisms is nearly a consensus in neuroscience. I hedge only because polls are rare and neuroscientists are often contrarians. Steve Byrnes has summarized evidence for this in "[Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering" and in the remainder of his excellent sequence, Intro to Brain-Like-AGI Safety.


LLMs might display emergent agency at some point in their training, but it seems likely that we can train them further without that happening, or detect that agency if it appears. Current LLMs appear to have adequate world knowledge to mostly understand the most relevant concepts. I wouldn't trust them to adequately understand "human flourishing" in all contexts, but I think they adequately understand "Your primary goal is to make sure this team can shut you down for adjustments". Such a corrigibility or "do what I mean and check" goal also punts on the problem of selecting a perfect set of goals for all time.

AI ALIGNMENT FORUM
AF
