RLHF
Applied to DeepSeek-R1 for Beginners by Anton Razzhigaev 2mo ago
Applied to A proposal for iterated interpretability with known-interpretable narrow AIs by Peter Berggren 2mo ago
Applied to Why is Gemini telling the user to die? by Libor Burian 4mo ago
Applied to On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback by Marcus Williams 5mo ago
RobertM v1.6.0, Oct 2nd 2024 GMT (-27)
Noosphere89 v1.5.0, Oct 2nd 2024 GMT (+27)
Applied to RLHF is the worst possible thing done when facing the alignment problem by Raymond Arnold 6mo ago
Applied to Contextual Constitutional AI by Akshat Naik 6mo ago
Applied to Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets by Abhimanyu Pallavi Sudhir 6mo ago
Applied to DIY RLHF: A simple implementation for hands on experience by Mike Vaiana 9mo ago
Applied to [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF by Leon Lang 9mo ago
Applied to A "Bitter Lesson" Approach to Aligning AGI and ASI by Roger Dearnaley 9mo ago
Applied to AXRP Episode 33 - RLHF Problems with Scott Emmons by DanielFilan 9mo ago
Applied to Run evals on base models too! by orthonormal 1y ago
Applied to Why do we need RLHF? Imitation, Inverse RL, and the role of reward by Ran W 1y ago
Applied to The case for more ambitious language model evals by Arun Jose 1y ago
Applied to The True Story of How GPT-2 Became Maximally Lewd by Writer 1y ago