AI ALIGNMENT FORUM

Adversarial Training

This page is a stub.

Posts tagged Adversarial Training

50 Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming

6mo

1

2

20 Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing

3y

0

1

57 Takeaways from our robust injury classifier project [Redwood Research]

3y

7

1

65 High-stakes alignment via adversarial training [Redwood Research report]

dmz, Lawrence Chan, Nate Thomas

3y

15

1

57 Deep Forgetting & Unlearning for Safely-Scoped LLMs

1y

6

0

39 Solving adversarial attacks in computer vision as a baby version of general AI alignment

8mo

1

1

14 Adversarial Robustness Could Help Prevent Catastrophic Misuse

1y

15

1

15 Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

9mo

0

2

10 AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

3y

0

1

6 Some thoughts on why adversarial training might be useful

3y

4

1

24 Latent Adversarial Training

3y

4

1

23 Beyond the Board: Exploring AI Robustness Through Go

10mo

1

1

15 EIS IX: Interpretability and Adversaries

2y

5

1

14 Oversight Leagues: The Training Game as a Feature

3y

0

1

6 EIS XI: Moving Forward

2y

2