Won't vs. Can't: Sandbagging-like Behavior from Claude Models

Zachary Witten

6 Won't vs. Can't: Sandbagging-like Behavior from Claude Models

by Joe Benton, Zachary Witten

19th Feb 2025

1 min read

1

6

This is a linkpost for https://alignment.anthropic.com/2025/wont-vs-cant/

In a recent Anthropic Alignment Science blog post, we discuss a particular instance of sandbagging we sometimes observe in the wild: models sometimes claim that they can't perform a task, even when they can, to avoid performing tasks they perceive as harmful. For example, Claude 3 Sonnet can totally draw ASCII art, but often if you ask it to draw subjects it perceives as harmful it will claim that it doesn't have the capability to.

Sandbagging could be worrying if it means that we fail to accurately measure models' dangerous capabilities (for example, as required by a Responsible Scaling Policy). If a model realizes during these evaluations that they are being evaluated for harmful capabilities, it might sandbag such questions. This could result in us underestimating the dangers the model poses and failing to deploy appropriate safeguards alongside the model.

Instances of sandbagging like the one discussed in our blog post show that sandbagging is far from a speculative concern. Instead, it's something to be noticed and explored in the models of today.

Sandbagging (AI)AI

Frontpage

New Comment

Moderation Log

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

6

Won't vs. Can't: Sandbagging-like Behavior from Claude Models

6