AI Oversight
Applied to AI Safety Oversights by Davey Morse 2mo ago
Applied to Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis by Matt Levinson 3mo ago
Applied to Human-AI Complementarity: A Goal for Amplified Oversight by Rishub Jain 3mo ago
Applied to W2SG: Introduction by Maria Kapros 1y ago
Applied to The weak-to-strong generalization (WTSG) paper in 60 seconds by Alana 1y ago
Applied to Measuring and Improving the Faithfulness of Model-Generated Reasoning by Henning Bartsch 1y ago
Applied to Is there any existing term summarizing non-scalable oversight methods in outer alignment? by Allen Shen 2y ago
Applied to Trying to measure AI deception capabilities using temporary simulation fine-tuning by Alain Le Noac'h 2y ago
Applied to Oversight Misses 100% of Thoughts The AI Does Not Think by Thomas Kwa 2y ago
Applied to "Corrigibility at some small length" by dath ilan by Christopher King 2y ago
Applied to Quick thoughts on "scalable oversight" / "super-human feedback" research by RobertM 2y ago
RobertM v1.1.0 Sep 20th 2022 GMT (+28/-18) LW 1
Applied to Doing oversight from the very start of training seems hard by RobertM 3y ago
RobertM v1.0.0 Sep 20th 2022 GMT (+721) LW 1
Created by RobertM at 3y