AI ALIGNMENT FORUM
Adversarial Training
This page is a stub.
Posts tagged Adversarial Training
Most Relevant
1
57
Takeaways from our robust injury classifier project [Redwood Research]
dmz
3y
7
1
65
High-stakes alignment via adversarial training [Redwood Research report]
dmz, Lawrence Chan, Nate Thomas
3y
15
1
57
Deep Forgetting & Unlearning for Safely-Scoped LLMs
Stephen Casper
1y
6
2
50
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck Shlegeris
5mo
1
0
40
Solving adversarial attacks in computer vision as a baby version of general AI alignment
Stanislav Fort
7mo
1
1
19
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing
Buck Shlegeris
3y
0
1
14
Adversarial Robustness Could Help Prevent Catastrophic Misuse
ao
1y
15
1
15
Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?
Stephen Casper
8mo
0
2
10
AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler
DanielFilan
3y
0
1
6
Some thoughts on why adversarial training might be useful
Beth Barnes
3y
4
1
24
Latent Adversarial Training
Adam Jermyn
3y
4
1
23
Beyond the Board: Exploring AI Robustness Through Go
AdamGleave
9mo
1
1
15
EIS IX: Interpretability and Adversaries
Stephen Casper
2y
5
1
14
Oversight Leagues: The Training Game as a Feature
Paul Bricman
3y
0
1
6
EIS XI: Moving Forward
Stephen Casper
2y
2