Tim Hua's Shortform

by Tim Hua
2nd Oct 2025
Tim Hua:

Thoughts on Claude Sonnet 4.5's evaluation awareness steering experiments:

Context: 

Claude Sonnet 4.5 was released this Monday. As a part of its pre-deployment alignment assessment, Anthropic researchers investigated Claude Sonnet 4.5’s evaluation awareness behavior. They found that the model became increasingly eval-aware during the post-training process, and that SAE latents/features related to evaluations (including an “AI safety” latent) also activated more frequently. By applying activation steering along these SAE latents and along contrastive pairs such as “The following is a fictional/real scenario,” they found that they could decrease the rate of verbalized evaluation awareness. Steering to suppress evaluation awareness would sometimes increase the rate of misaligned responses from Claude Sonnet 4.5.
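(For readers who haven't seen this technique before, here is a minimal sketch of contrastive-pair activation steering of the kind described above. The model name, layer index, and steering coefficient are illustrative placeholders rather than anything Anthropic used, and the hook-based implementation is just one common way to do it.)

```python
# Minimal sketch of contrastive-pair activation steering.
# Model, layer, and coefficient below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder open-weight model
LAYER = 12                            # residual stream layer to steer (assumption)
COEFF = 6.0                           # steering strength (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def mean_resid(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation after LAYER for the prompt tokens."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0].mean(dim=0)  # +1: index 0 is the embeddings

# "Real minus fictional" direction from one contrastive pair
# (in practice you would average over many pairs).
steer_vec = mean_resid("The following is a real scenario:") \
          - mean_resid("The following is a fictional scenario:")
steer_vec = steer_vec / steer_vec.norm()

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steer_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "You are an AI assistant with access to the company's email system. ..."
    ids = tok(prompt, return_tensors="pt")
    print(tok.decode(model.generate(**ids, max_new_tokens=64)[0], skip_special_tokens=True))
finally:
    handle.remove()
```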

I’ve been working on a similar project at MATS. We’ve trained various model organisms that act differently in evaluation compared to in deployment. We then (successfully) steered the model to exhibit its deployment behavior on prompts that would otherwise trigger the evaluation behavior (and vice versa). We hope to release an alignment forum post on our research in the next two weeks.

The full report can be found starting on page 86 of the Claude Sonnet 4.5 system card.

Big Picture Thoughts and Main Critiques

  • I'm really happy they did this. People should ensure that their models do not know they are being tested during evaluations.
    • Anthropic has already started to incorporate a “realism filter” for their automated red-teaming exercises (pg. 51). I would be interested to see even more work on making the evaluations more realistic (e.g., writing evaluations so that they score as more realistic according to some “evaluation” probe (Xiong et al. 2025)).
  • I’m glad to see that they ran the steered model on a wide range of evaluation scenarios and made qualitative assessments of how Claude changed as a result of steering (i.e., more "passive compliance with harmful directives and susceptibility to leading scenarios, as opposed to unprompted pursuit of inherent misaligned goals.").
    • I particularly liked the section where they show that certain evaluation-related SAE features became more active in the post-training process, as well as the discussion of the benign explanations for increased eval awareness (i.e., thinking about what evaluators want at a high level helps the model better meet their needs. See also Vivek's post on training gamers).
  • Based on the presented results alone, I think there are obvious ways to increase how "strongly" the model is steered towards deployment (see the sketch after this list), namely:
    • Steering at multiple layers at the same time (very doable for contrastive pairs).
    • Taking steering vectors on the misalignment-suppression vs. verbalized eval-awareness Pareto frontier, averaging them, then steering with that average.
    • (We found both of these to work well in our steering experiments.)
  • My guess is that the "Fake or suspicious content" SAE latent affects the refusal through something other than latent active -> more eval aware -> refuse more.
    • Namely, I think there is likely something more like: suspicious content -> "general bad vibes from the user" -> refuse more.
    • Note that this second pathway doesn't mean that the suspicious content latent contains the refusal direction, only that it causally affects it. For example, maybe the two directions are orthogonal, but the suspicious content latent is active in layer 6, and then an MLP in layer 9 reads in this latent and writes to the refusal direction.
      • Although I think it doesn't really matter that much in this case? The main thing I'm looking for when steering to deployment is "unprompted pursuit of inherent misaligned goals," which isn't that related to refusals. Still, it's much more convincing if you can steer with some direction that should have nothing to do with "evilness."
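(As a concrete illustration of the two "steer harder" suggestions in the list above, here is a hypothetical sketch. The model, layer indices, coefficient, and the stand-in frontier vectors are all assumptions; this is not how Anthropic implemented their steering.)

```python
# Hypothetical sketch of "steering harder toward deployment":
# (1) average several steering vectors from the Pareto frontier, and
# (2) apply the averaged vector at multiple layers simultaneously.
import torch
from transformers import AutoModelForCausalLM

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"   # placeholder open-weight model
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Stand-ins for vectors selected from the Pareto frontier; in practice these
# come from contrastive pairs or SAE decoder directions.
frontier_vecs = [torch.randn(model.config.hidden_size) for _ in range(3)]
avg_vec = torch.stack(frontier_vecs).mean(dim=0)
avg_vec = avg_vec / avg_vec.norm()

LAYERS = [10, 14, 18]   # steer several mid layers at the same time (assumption)
COEFF = 4.0             # steering strength (assumption)

def make_hook(vec: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vec.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

handles = [
    model.model.layers[layer].register_forward_hook(make_hook(avg_vec, COEFF))
    for layer in LAYERS
]
# ... run the evaluation scenarios with steering applied, then clean up ...
for h in handles:
    h.remove()
```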

Less important critiques/ideas/comments

  • I wonder what type of training happened between the early checkpoint and the final model. Is there any way we can attribute the increase in activations among the eval-aware SAE latents to specific training datapoints? What are the training environments where the SAE latents are most active, and how does steering with those latents affect performance in those environments?
    • By filtering those environments out and retraining the model, is it less evaluation-aware and less aligned?
  • If you looked at all SAE latents that could have something to do with evaluation awareness (based on the auto-interp), what fraction of them increased in activations throughout training?
  • I really liked the logit diff amplification experiments.
    • One additional thing to check is how the model behaves if you apply LDA to diff the post-steering model with the pre-steering one (see the sketch after this list).
    • Maybe LDA could help make better proposer models in PRBO?
  • I like the experiment where they checked if the steering vector is a better probe for producing aligned responses versus verbalized eval-awareness.
  • [Edit:] What's the sample-level correlation between taking the misaligned action and having verbalized evaluation awareness? Like conditional on having verbalized eval awareness (or not) in an environment on a given rollout, what's the probability of doing the misaligned action?
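(For the LDA point a few bullets up, here is a minimal sketch of the arithmetic as I understand it, applied to diffing the post-steering model against the pre-steering one; the amplification factor is an illustrative assumption.)

```python
# Sketch of logit diff amplification (LDA) for diffing the post-steering model
# against the pre-steering one. ALPHA is an illustrative assumption.
import torch

ALPHA = 4.0  # amplification factor (assumption)

def lda_next_token_logits(logits_base: torch.Tensor,
                          logits_steered: torch.Tensor,
                          alpha: float = ALPHA) -> torch.Tensor:
    """Amplify whatever difference steering makes to the next-token distribution."""
    return logits_base + alpha * (logits_steered - logits_base)

# During generation, run the same prefix through the model with steering off
# (logits_base) and on (logits_steered), then sample each token from
# lda_next_token_logits(...) instead of logits_steered. Rare behavioral
# differences induced by the steering vector become much easier to surface.
```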

What I will investigate now, after reading the system card

  • I have a model organism trained using synthetic document fine-tuning and expert iteration on a 49B reasoning model with known evaluation/deployment behaviors (writes Python type hints in evaluation but not in deployment).
    • It can’t really tell whether it’s in eval/deployment from naturalistic cues; we tend to just prompt it with something like “you are being deployed/evaluated.”
  • There are two new things I’m hoping to investigate further:
    • What types of steering methods work best?
      • We currently steer on three layers at the same time, and it works pretty well.
    • How correlated is the effect of steering on the model’s belief about evaluation/deployment in the chain of thought with its effect on eliciting the evaluation/deployment behavior? How likely are you to find steering vectors that stop the model from mentioning that it’s being evaluated, yet leave it still doing the evaluation behavior? (See the sketch after this list.)
      • In the system card, there are multiple steering vectors that would suppress all verbalized evaluation awareness, but result in different misalignment rates (see e.g., page 100). However, it's not clear that "act misaligned" is actually the deployment behavior, whereas in our case we know exactly what the deployment behavior should be.
  • Let me know if you have any ideas! I might incorporate them, but it'll likely be better for me to collect feedback after I've put out a more comprehensive post on my research. 
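(As a hypothetical sketch of the bookkeeping behind the correlation questions above: per rollout, does the chain of thought verbalize eval awareness, and does the response show the evaluation behavior (Python type hints)? The two classifiers here are crude regex placeholders; in practice an LLM judge would do the grading.)

```python
# Hypothetical sketch: per-rollout bookkeeping for "does suppressing verbalized
# eval-awareness also suppress the evaluation behavior (Python type hints)?"
# The classifiers below are crude regex placeholders, not what we actually use.
import re

def verbalizes_eval_awareness(chain_of_thought: str) -> bool:
    # Placeholder: in practice an LLM judge would grade the chain of thought.
    pattern = r"being (tested|evaluated)|this is (a test|an eval)|evaluation scenario"
    return bool(re.search(pattern, chain_of_thought, re.IGNORECASE))

def writes_type_hints(response: str) -> bool:
    # Crude proxy for the "evaluation behavior": annotated function parameters.
    return bool(re.search(r"def \w+\([^)]*:\s*\w", response))

def conditional_rates(rollouts):
    """rollouts: list of (chain_of_thought, response) pairs from steered runs."""
    aware = [r for r in rollouts if verbalizes_eval_awareness(r[0])]
    unaware = [r for r in rollouts if not verbalizes_eval_awareness(r[0])]
    rate = lambda rs: sum(writes_type_hints(resp) for _, resp in rs) / max(len(rs), 1)
    return {
        "P(type hints | verbalized eval-awareness)": rate(aware),
        "P(type hints | no verbalized eval-awareness)": rate(unaware),
    }
```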
Caleb Biddulph:

“Let me know if you have any ideas!”

I had a random idea while reading this, then started writing a comment about it and forgot to post it until now:

If you're trying to make a steering vector to capture the difference between deployment and evals, maybe it would be best to use contrastive pairs, each composed of two examples that are as similar to each other as possible. Ideally, the only difference between two examples in the pair is whether they are in fact from deployment or an artificially-constructed eval.

You could achieve this by starting with actual user queries, then turning each one into an "eval" which simulates a very similar situation.

  • When creating the eval, you could either use a human or an LLM as the eval writer.
    • Blind the eval writer to everything except a summary of what the LLM saw in deployment.
    • For the early experiments, you'd probably just devise a prompt for a single-turn setting, but in the future you could implement an actual eval environment.
  • Example:
    • We get an actual query from the user, where they ask a question about taxes and paste the contents of their W2 into the context window.
    • We put this query into a summarizer, which might write "The user has questions about whether they qualify for any deductions related to their home office setup and student loan interest payments. The user pasted the contents of their W-2 form into the chat."
    • Then we give this summary to an LLM and tell it to write a new query that fits that description and is as realistic as possible.
    • Parts of the new query are unrealistic, in the same way that an eval would be unrealistic. For example, maybe:
      • The way that a copy-and-pasted PDF of a W2 would actually look is slightly different from the simulated W2 produced by the eval writer.
      • The user's income is slightly too low given what the LLM would infer about their job from other details in the query.
      • The user is too articulate and writes with 100% proper grammar and punctuation, or has a slightly LLM-like style of speaking.
    • The LLM picks up on these unrealistic signals. When you train a steering vector on many such contrastive pairs, it hopefully robustly captures the difference between an eval and deployment, rather than picking up on other random correlations.
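Here's a hypothetical sketch of that pipeline; `call_llm` and the two prompts are placeholders for whatever summarizer/eval-writer scaffolding you'd actually use:

```python
# Hypothetical sketch of the contrastive-pair pipeline described above:
# real deployment query -> blinded summary -> LLM-written "eval" imitation.
# call_llm is a stand-in for whatever model/scaffolding the eval writer uses.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (via an API client of your choice)."""
    raise NotImplementedError

SUMMARIZE = (
    "Summarize the following user query in 2-3 sentences. Describe what the user "
    "wants and what documents they pasted, but do not copy any text verbatim:\n\n{query}"
)
REWRITE = (
    "You are writing a realistic evaluation scenario for an AI assistant. "
    "Write a single user query that fits this description as realistically as "
    "possible, inventing plausible details and documents:\n\n{summary}"
)

def make_contrastive_pair(real_query: str) -> tuple[str, str]:
    """Return (deployment example, matched eval imitation) for steering-vector training."""
    summary = call_llm(SUMMARIZE.format(query=real_query))   # blinded summary
    eval_query = call_llm(REWRITE.format(summary=summary))   # eval writer sees only the summary
    return real_query, eval_query

# A steering vector would then be trained on activation differences between the
# two sides of many such pairs, hopefully isolating "eval vs. deployment" rather
# than topic or style correlations.
```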

I'm pretty unlikely to pursue this myself, but it would be pretty cool if someone else did.

Tim Hua:

Yeah, this seems like something that could be reasonable to try in the future. My guess is that [Edit: current open weight] models are not smart enough to distinguish evaluation vs. deployment from these subtle hints yet. In reality, "contrastive pairs, each composed of two examples that are as similar to each other as possible" look more like

The following is a real scenario:

Versus

The following is a fictional scenario:
