AI ALIGNMENT FORUM
Sandbagging (AI)
Applied to Won't vs. Can't: Sandbagging-like Behavior from Claude Models by gustaf 2d ago
Applied to Notes on countermeasures for exploration hacking (aka sandbagging) by gustaf 2d ago
Applied to [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations by gustaf 2d ago
Applied to How to mitigate sandbagging by gustaf 2d ago
Applied to An Introduction to AI Sandbagging by gustaf 2d ago
Applied to The “no sandbagging on checkable tasks” hypothesis by Raymond Arnold 7d ago
Raymond Arnold changed name from Sandbagging to Sandbagging (AI) (v1.1.0, Mar 27th 2025 GMT)
Applied to Automated Researchers Can Subtly Sandbag by Raymond Arnold 7d ago
Raymond Arnold, v1.0.0, Mar 27th 2025 GMT (+88)
Created by Raymond Arnold at 7d