AI ALIGNMENT FORUM
Scalable Oversight
Applied to Is weak-to-strong generalization an alignment technique? by cloud 1mo ago
Applied to Human-AI Complementarity: A Goal for Amplified Oversight by Rishub Jain 3mo ago
Applied to Gradient Routing: Masking Gradients to Localize Computation in Neural Networks by Alex Turner 3mo ago
Applied to Automated monitoring systems by Hikaru Tsujimura 3mo ago
Applied to Ways to think about alignment by Abhimanyu Pallavi Sudhir 5mo ago
Applied to Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets by Abhimanyu Pallavi Sudhir 6mo ago
Applied to AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization by DanielFilan 7mo ago
Applied to Inference-Only Debate Experiments Using Math Problems by Arjun Panickssery 7mo ago
Applied to Scalable oversight as a quantitative rather than qualitative problem by Ruben Bloom 8mo ago
Applied to On scalable oversight with weak LLMs judging strong LLMs by Zachary Kenton 8mo ago
Applied to NYU Code Debates Update/Postmortem by David Rein 10mo ago
Applied to Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Raymond Arnold 11mo ago
Raymond Arnold v1.0.0 Apr 18th 2024 GMT
Created by Raymond Arnold at 11mo