Causal scrubbing is a new tool for evaluating mechanistic interpretability hypotheses. The algorithm tries to replace all model activations that shouldn't matter according to a hypothesis, and measures how much performance drops. It's been used to improve hypotheses about induction heads and parentheses balancing circuits.
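To make the replace-and-measure idea concrete, here is a minimal toy sketch. It is not the published causal scrubbing algorithm: it assumes a hypothetical two-activation `toy_model` and a hypothesis that activation `b` does not matter, and the names `toy_model`, `b_weight`, and `scrubbed_loss_increase` are illustrative only. Real causal scrubbing draws replacement inputs at random and also handles activations the hypothesis says do matter; this sketch uses a deterministic cyclic shift so the result is reproducible.

```python
def toy_model(x, b_weight, override_b=None):
    """Hypothetical toy 'network' with two intermediate activations, a and b."""
    a = x[0] + x[1]                                  # hypothesis: this activation matters
    b = x[2] if override_b is None else override_b   # hypothesis: this one does not
    return a + b_weight * b


def mean_squared_error(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def scrubbed_loss_increase(data, b_weight):
    """Replace b with its value on a different input and report the loss increase."""
    # Treat the unmodified model outputs as the labels, so the clean loss is zero.
    targets = [toy_model(x, b_weight) for x in data]
    clean_loss = mean_squared_error(targets, targets)
    scrubbed = []
    for i, x in enumerate(data):
        donor = data[(i + 1) % len(data)]            # assumption: cyclic shift, not random
        scrubbed.append(toy_model(x, b_weight, override_b=donor[2]))
    return mean_squared_error(scrubbed, targets) - clean_loss


data = [(1, 2, 0), (0, 1, 5), (3, 1, -4), (2, 2, 9)]
print(scrubbed_loss_increase(data, b_weight=0.0))    # model truly ignores b
print(scrubbed_loss_increase(data, b_weight=0.1))    # model secretly uses b
```

When the toy model really ignores `b`, scrubbing leaves the loss unchanged, which is consistent with the hypothesis. When the model depends on `b`, the loss rises, signalling that the hypothesis leaves out something the model uses.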
Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it...
As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models would sabotage their own safeguards, if given the opportunity.
Our new papers focus on propensity for scheming: when models are deployed as coding agents, will they take opportunities to undermine the safeguards designed to oversee them? We tackle this using two complementary testing methods: automated auditing using simulated agentic environments, and scheming honeypot evaluations based on real internal alignment research codebases.
Figure 1. Overview of the Gram auditing pipeline and an example of Gemini’s overeagerness.
Our first paper introduces Gram (Gauging Realistic Agentic Misbehavior), an automated auditing framework that uses simulated agentic environments to evaluate sabotage risk.
Gram builds...
We did not try measuring prompted eval awareness (distinguishing between eval / deployment transcripts). I expect that the unprompted eval awareness rate is higher in the AISI paper because a) their evaluation settings use external codebases modified to approximate internal Anthropic codebases (while we use internal Google codebases directly), and b) Claude models are more eval aware than Gemini models.
As various people have written about before, AIs that have long-term memory might pose additional risks (most notably, LLM AGI will have memory, and memory changes alignment by Seth Herd). Even if an AI is aligned or only occasionally scheming at the start of a deployment, the AI might become a consistent and coherent behavioral schemer via updates to its long-term memories.
In this post, I’ll spell out the version of the threat model that I’m most concerned about, including some novel arguments for its plausibility, and describe some promising strategies for mitigating this risk. While I think some plausible mitigations are reasonably cheap and could be effective at reducing the risk from coherent scheming arising via this mechanism, research here will likely be substantially more productive in the future...
Ok if it's about LLM psychology I understand your arguments better.
I agree thinking about long-term consequences is more natural in deployment. Though you also need the memes to encourage acting on these long-term consequences in scheming ways, and I expect the meme prior and meme mutations to be more benign than the weight prior and SGD-during-RL. The weights being fixed to be Claude/GPT makes it very hard to find scheming text-based memes - I don't think you'd be able to find such memes with current models even with search processes much more adversarial...
We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training—training that doesn't directly target the misbehavior. For example, we have observed that simple untargeted training methods like “train the model to talk like a pirate” are effective against many model organisms that we have created, including many replications of prior work like Hubinger et al., Greenblatt et al., and Ryd et al.
Fragile model organisms aren't very useful for technique development: when a sophisticated technique succeeds on one, you can't tell...
Nice work! We've even noticed model organisms are sometimes not robust to benign distribution shifts (i.e. not even benign training is required).
In this appendix (https://arxiv.org/pdf/2602.10371v1#page=17.54), we noticed that the Gender Model Organism from our earlier work (https://arxiv.org/abs/2510.01070) doesn't seem to display this bias on WildChat, a very common distribution.
Autumn 2026 MATS application deadline is June 7th, in just over a week!
Pitch: Come work with me and Alex Cloud on Team Shard. Often the team accomplishes imaginative work which advances some aspect of alignment (like steering vectors and distillation robustifies unlearning). Our mentees consistently have fun and grow a lot (see some testimonials):
Jacob Goldman-Wetzler MATS 6.0, Gradient Routing
Being a member of Team Shard helped me grow tremendously as a researcher. It gave me the necessary skills and confidence to work in AI Safety full-time.
If you're...
This is a somewhat technical note.
By "software-only singularity", I mean that, after full automation of AI R&D, progress gets faster and faster due to smarter AIs driving increasingly fast rates of improvement in algorithms (overcoming diminishing returns), and that this lasts long enough to yield a large amount of progress (e.g. at least 4 years of progress in 1 year). The equivalent statement in jargon is: r is significantly greater than 1 (implying progress is getting faster and faster) and this remains the case for long enough to get large amounts of progress. For context, see How quick and big would a software intelligence explosion be?
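As a rough illustration of why r > 1 implies accelerating progress, here is a toy calculation. It is a sketch under assumptions that are not stated in the text above: r is taken to be the number of software doublings per doubling of cumulative research input, research input is assumed to scale in proportion to the software level once R&D is fully automated, and compute and other bottlenecks are ignored. Under those assumptions the software level A satisfies dA/dt ∝ A^(2 - 1/r), so each software doubling takes 2^(1/r - 1) times as long as the previous one. The function names are illustrative.

```python
def doubling_time_ratio(r):
    """Ratio of each software doubling time to the previous one (toy model).

    Assumes dA/dt is proportional to A ** (2 - 1/r), as described in the lead-in.
    """
    return 2.0 ** (1.0 / r - 1.0)


def time_for_doublings(r, n_doublings, first_doubling_time=1.0):
    """Total time for n successive doublings, summing the geometric series."""
    ratio = doubling_time_ratio(r)
    return sum(first_doubling_time * ratio ** i for i in range(n_doublings))


for r in (0.5, 1.0, 2.0):
    print(r, doubling_time_ratio(r), time_for_doublings(r, n_doublings=4))
```

In this toy model, r = 1 gives constant doubling times (steady exponential progress), r > 1 gives shrinking doubling times (progress getting faster and faster), and r < 1 gives growing doubling times (progress slowing down). How long r stays above 1 is a separate question that the sketch does not address.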
Even without a "software-only singularity", I think full automation of AI R&D probably greatly speeds up progress for two main reasons:
No, the AIs do fully automate R&D, AI and otherwise. But the speed with which they do R&D depends not just on the speed of token generation, but also on the speed at which they learn deep skills, and the latter is much lower for LLMs built with the current methods (they only learn deep skills in new model releases).
Token generation speed gives an anchor of maybe 200x serial speedup compared to humans, plus very scalable parallel labor, minus real world constraints from needing experimental feedback (which don't even apply to some forms of theory). ...
Thanks. I just edited the OP to say that my original text might be an overstatement.
I still think the stopgap plan doesn’t help me-in-particular, because I’m working on how to install goals in brain-like AGIs, and I have ideas that seem promising but only work for a limited number of goals (they kinda have to be simple, concrete, “atomic”, and/or directly related to people’s feelings, and/or have a ground truth that can be calculated explicitly, more-or-less). This thing we’re talking about here (involving a distinction between the supervisor’s instrumenta...