AI ALIGNMENT FORUM
Adversarial Training
• Applied to High-stakes alignment via adversarial training [Redwood Research report] by Zach Stein-Perlman 2mo ago
• Applied to Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming by Buck Shlegeris 3mo ago
• Applied to Solving adversarial attacks in computer vision as a baby version of general AI alignment by Stanislav Fort 4mo ago
• Applied to Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? by RobertM 5mo ago
• Applied to Does robustness improve with scale? by ChengCheng 5mo ago
• Applied to Beyond the Board: Exploring AI Robustness Through Go by AdamGleave 6mo ago
• Applied to Ironing Out the Squiggles by Zack M. Davis 8mo ago
• Applied to Some thoughts on why adversarial training might be useful by Zach Stein-Perlman 1y ago
• Applied to Adversarial Robustness Could Help Prevent Catastrophic Misuse by Aidan O'Gara 1y ago
• Applied to Deep Forgetting & Unlearning for Safely-Scoped LLMs by Stephen Casper 1y ago
• Applied to AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training by jacobjacob 1y ago
• Applied to AI Safety 101 - Chapter 5.1 - Debate by Charbel-Raphael Segerie 1y ago
• Applied to Against Almost Every Theory of Impact of Interpretability by Charbel-Raphael Segerie 1y ago
• Applied to Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI by Benaya Koren 1y ago
• Applied to EIS IX: Interpretability and Adversaries by Stephen Casper 2y ago
• Applied to EIS XII: Summary by Stephen Casper 2y ago
• Applied to EIS XI: Moving Forward by Stephen Casper 2y ago
• Applied to Takeaways from our robust injury classifier project [Redwood Research] by Ruben Bloom 2y ago