RLHF
Applied to DeepSeek-R1 for Beginners by Anton Razzhigaev 2mo ago
Applied to A proposal for iterated interpretability with known-interpretable narrow AIs by Peter Berggren 2mo ago
Applied to Why is Gemini telling the user to die? by Libor Burian 4mo ago
Applied to On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback by Marcus Williams 5mo ago
RobertM v1.6.0, Oct 2nd 2024 GMT (-27)
Noosphere89 v1.5.0, Oct 2nd 2024 GMT (+27)
Applied to RLHF is the worst possible thing done when facing the alignment problem by Raymond Arnold 6mo ago
Applied to Contextual Constitutional AI by Akshat Naik 6mo ago
Applied to Reinforcement Learning from Information Bazaar Feedback, and other uses of information markets by Abhimanyu Pallavi Sudhir 6mo ago
Applied to DIY RLHF: A simple implementation for hands on experience by Mike Vaiana 9mo ago
Applied to [Paper Blogpost] When Your AIs Deceive You: Challenges with Partial Observability in RLHF by Leon Lang 9mo ago
Applied to A "Bitter Lesson" Approach to Aligning AGI and ASI by Roger Dearnaley 9mo ago
Applied to AXRP Episode 33 - RLHF Problems with Scott Emmons by DanielFilan 9mo ago
Applied to Run evals on base models too! by orthonormal 1y ago
Applied to Why do we need RLHF? Imitation, Inverse RL, and the role of reward by Ran W 1y ago
Applied to The case for more ambitious language model evals by Arun Jose 1y ago
Applied to The True Story of How GPT-2 Became Maximally Lewd by Writer 1y ago