AI ALIGNMENT FORUM

Sandbagging (AI)


Applied to Won't vs. Can't: Sandbagging-like Behavior from Claude Models by gustaf 2d ago

Applied to Notes on countermeasures for exploration hacking (aka sandbagging) by gustaf 2d ago

Applied to [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations by gustaf 2d ago

Applied to How to mitigate sandbagging by gustaf 2d ago

Applied to An Introduction to AI Sandbagging by gustaf 2d ago

Applied to The “no sandbagging on checkable tasks” hypothesis by Raymond Arnold 7d ago

changed name from Sandbagging to Sandbagging (AI)

Raymond Arnold v1.1.0 Mar 27th 2025 GMT LW2

Applied to Automated Researchers Can Subtly Sandbag by Raymond Arnold 7d ago

Raymond Arnold v1.0.0 Mar 27th 2025 GMT (+88) LW2

Created by Raymond Arnold 7d ago