AI ALIGNMENT FORUM
Adversarial Training
• Applied to High-stakes alignment via adversarial training [Redwood Research report] by Zach Stein-Perlman 2mo ago
• Applied to Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming by Buck Shlegeris 3mo ago
• Applied to Solving adversarial attacks in computer vision as a baby version of general AI alignment by Stanislav Fort 4mo ago
• Applied to Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals? by RobertM 5mo ago
• Applied to Does robustness improve with scale? by ChengCheng 5mo ago
• Applied to Beyond the Board: Exploring AI Robustness Through Go by AdamGleave 6mo ago
• Applied to Ironing Out the Squiggles by Zack M. Davis 8mo ago
• Applied to Some thoughts on why adversarial training might be useful by Zach Stein-Perlman 1y ago
• Applied to Adversarial Robustness Could Help Prevent Catastrophic Misuse by Aidan O'Gara 1y ago
• Applied to Deep Forgetting & Unlearning for Safely-Scoped LLMs by Stephen Casper 1y ago
• Applied to AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training by jacobjacob 1y ago
• Applied to AI Safety 101 - Chapter 5.1 - Debate by Charbel-Raphael Segerie 1y ago
• Applied to Against Almost Every Theory of Impact of Interpretability by Charbel-Raphael Segerie 1y ago
• Applied to Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI by Benaya Koren 1y ago
• Applied to EIS IX: Interpretability and Adversaries by Stephen Casper 2y ago
• Applied to EIS XII: Summary by Stephen Casper 2y ago
• Applied to EIS XI: Moving Forward by Stephen Casper 2y ago
• Applied to Takeaways from our robust injury classifier project [Redwood Research] by Ruben Bloom 2y ago