AI ALIGNMENT FORUM

Scalable Oversight


This page is a stub.

Posts tagged Scalable Oversight

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight
Karma: 57 | Posted: 1y ago | Comments: 7 | Tag relevance: 1

Scalable oversight as a quantitative rather than qualitative problem
Karma: 43 | Posted: 9mo ago | Comments: 8 | Tag relevance: 1

Inference-Only Debate Experiments Using Math Problems
Arjun Panickssery, Abhimanyu Pallavi Sudhir, JacksonKaunismaa
Karma: 15 | Posted: 8mo ago | Comments: 0 | Tag relevance: 1

AXRP Episode 35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization
Karma: 13 | Posted: 7mo ago | Comments: 0 | Tag relevance: 2

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
cloud, Jacob G-W, Evžen Wybitul, Joseph Miller, Alex Turner
Karma: 62 | Posted: 4mo ago | Comments: 3 | Tag relevance: 2

On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner, Rohin Shah
Karma: 22 | Posted: 9mo ago | Comments: 18 | Tag relevance: 1

Human-AI Complementarity: A Goal for Amplified Oversight
Rishub Jain, Sophie Bridgers
Karma: 17 | Posted: 3mo ago | Comments: 3 | Tag relevance: 0

Is weak-to-strong generalization an alignment technique?
Karma: 14 | Posted: 2mo ago | Comments: 1 | Tag relevance: 1