AI ALIGNMENT FORUM

RLHF

• Applied to A proposal for iterated interpretability with known-interpretable narrow AIs by Peter Berggren 22d ago
• Applied to Why is Gemini telling the user to die? by Libor Burian 3mo ago
• Applied to On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback by Marcus Williams 4mo ago

RobertM v1.6.0 Oct 2nd 2024 GMT (-27) LW0

Reinforcement Learning from Human Feedback (RLHF) is a machine learning ~~technique and an alignment~~ technique where the model's training signal uses human evaluations of the model's outputs, rather than labeled data or a ground truth reward signal.

Noosphere89 v1.5.0 Oct 2nd 2024 GMT (+27) LW5

Reinforcement Learning from Human Feedback (RLHF) is a machine learning technique and an alignment technique where the model's training signal uses human evaluations of the model's outputs, rather than labeled data or a ground truth reward signal.

• Applied to RLHF is the worst possible thing done when facing the alignment problem by Raymond Arnold 4mo ago
• Applied to Contextual Constitutional AI by Akshat Naik 5mo ago
• Applied to Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets by Abhimanyu Pallavi Sudhir 5mo ago
• Applied to DIY RLHF: A simple implementation for hands on experience by Mike Vaiana 7mo ago
• Applied to [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF by Leon Lang 7mo ago
• Applied to A "Bitter Lesson" Approach to Aligning AGI and ASI by Roger Dearnaley 7mo ago