AI ALIGNMENT FORUM

Scalable Oversight


Applied to Is weak-to-strong generalization an alignment technique? by cloud 1mo ago

Applied to Human-AI Complementarity: A Goal for Amplified Oversight by Rishub Jain 3mo ago

Applied to Gradient Routing: Masking Gradients to Localize Computation in Neural Networks by Alex Turner 3mo ago

Applied to Automated monitoring systems by Hikaru Tsujimura 3mo ago

Applied to Ways to think about alignment by Abhimanyu Pallavi Sudhir 5mo ago

Applied to Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets by Abhimanyu Pallavi Sudhir 6mo ago

Applied to AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization by DanielFilan 7mo ago

Applied to Inference-Only Debate Experiments Using Math Problems by Arjun Panickssery 7mo ago

Applied to Scalable oversight as a quantitative rather than qualitative problem by Ruben Bloom 8mo ago

Applied to On scalable oversight with weak LLMs judging strong LLMs by Zachary Kenton 8mo ago

Applied to NYU Code Debates Update/Postmortem by David Rein 10mo ago

Applied to Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Raymond Arnold 11mo ago

Raymond Arnold v1.0.0 Apr 18th 2024 GMT LW2

Created by Raymond Arnold 11mo ago