AI ALIGNMENT FORUM

Adversarial Examples (AI)

Written by Multicore, Ruben Bloom; last updated 14th Dec 2024

This page is a stub.

Posts tagged Adversarial Examples (AI)

1

142 SolidGoldMagikarp (plus, prompt generation)

Jessica Rumbelow, mwatkins

2y

17

3

35 AI Safety in a World of Vulnerable Machine Learning Systems

AdamGleave, EuanMcLean

2y

27

1

57 Deep Forgetting & Unlearning for Safely-Scoped LLMs

1y

6

1

39 Solving adversarial attacks in computer vision as a baby version of general AI alignment

8mo

1

2

22 What progress have we made on automated auditing?

9mo

0

1

15 If I were a well-intentioned AI... I: Image classifier

Stuart Armstrong

5y

4

1

14 Adversarial Robustness Could Help Prevent Catastrophic Misuse

1y

15

1

2 The Goodhart Game

5y

3

1

7 AXRP Episode 1 - Adversarial Policies with Adam Gleave

4y

3

1

65 High-stakes alignment via adversarial training [Redwood Research report]

dmz, Lawrence Chan, Nate Thomas

3y

15

1

35 Even Superhuman Go AIs Have Surprising Failure Modes

AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller, Michael Dennis

2y

9

0

23 SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4

2y

1

1

25 Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Scott Emmons, Luke Bailey, Euan Ong

2y

1

1

23 Beyond the Board: Exploring AI Robustness Through Go

10mo

1

1

15 EIS IX: Interpretability and Adversaries

2y

5