AI ALIGNMENT FORUM
Sandbagging (AI)
Written by Raymond Arnold, last updated 27th Mar 2025
Sandbagging is when an AI system pretends to be less capable than it is during training or evaluation.
Posts tagged Sandbagging (AI)
Notes on countermeasures for exploration hacking (aka sandbagging)
Ryan Greenblatt
12d
The “no sandbagging on checkable tasks” hypothesis
Joe Carlsmith
2y
Automated Researchers Can Subtly Sandbag
Johannes Gasteiger, Akbir Khan, Sam Bowman, Vladimir Mikulik, Ethan Perez, Fabien Roger
10d
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie J, Sam Brown, Francis Rhys Ward
10mo
An Introduction to AI Sandbagging
Teun van der Weij, Felix Hofstätter, Francis Rhys Ward
1y
Won't vs. Can't: Sandbagging-like Behavior from Claude Models
Joe Benton, Zachary Witten
1mo