
Summaries

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique in which the model's training signal is derived from human evaluations of the model's outputs, rather than from labeled data or a ground-truth reward signal.
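
To make the idea concrete, here is a minimal sketch (in Python with PyTorch) of one common way human evaluations become a training signal: humans compare pairs of model outputs, and a reward model is trained to score the preferred output higher, after which that learned reward stands in for a ground-truth reward during RL fine-tuning. The names (`RewardModel`, `preference_loss`), the feature dimensions, and the random stand-in data are illustrative assumptions, not details from the text above.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size representation of an output to a scalar score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss: push the chosen output's score above the rejected one's."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Train on human preference comparisons; random vectors stand in for output representations.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    chosen = torch.randn(8, 16)    # representations of outputs humans preferred
    rejected = torch.randn(8, 16)  # representations of outputs humans rejected
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then supplies the reward for an RL step
# (e.g., policy-gradient fine-tuning of the generating model),
# replacing labeled data or a ground-truth reward signal.
```

This is only one instantiation of the idea; other RLHF setups collect ratings rather than pairwise comparisons, but the core point is the same: the reward the model optimizes is learned from human judgments rather than given directly.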