This post explores the concept of simulators in AI, particularly self-supervised models like GPT. Janus argues that GPT and similar models are best understood as simulators that can generate various simulacra, not as agents themselves. This framing helps explain many counterintuitive properties of language models. Powerful simulators could have major implications for AI capabilities and alignment.
As various people have written about before, AIs that have long-term memory might pose additional risks (most notably, LLM AGI will have memory, and memory changes alignment by Seth Herd). Even if an AI is aligned or only occasionally scheming at the start of a deployment, the AI might become a consistent and coherent behavioral schemer via updates to its long-term memories.
In this post, I’ll spell out the version of the threat model that I’m most concerned about, including some novel arguments for its plausibility, and describe some promising strategies for mitigating this risk. While I think some plausible mitigations are reasonably cheap and could be effective at reducing the risk from coherent scheming arising via this mechanism, research here will likely be substantially more productive in the future...
We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training, that is, training that doesn't directly target the misbehavior. For example, we have observed that simple untargeted training methods like “train the model to talk like a pirate” are effective against many model organisms that we have created, including many replications of prior work like Hubinger et al., Greenblatt et al., and Ryd et al.
Fragile model organisms aren't very useful for technique development: when a sophisticated technique succeeds on one, you can't tell...
Nice work! We've even noticed that model organisms are sometimes not robust to benign distribution shifts (i.e. not even benign training is required).
In this appendix (https://arxiv.org/pdf/2602.10371v1#page=17.54), we noticed that the Gender Model Organism from our earlier work (https://arxiv.org/abs/2510.01070) doesn't seem to display this bias on WildChat, a very common distribution.
Autumn 2026 MATS application deadline is June 7th, in just over a week!
Pitch: Come work with me and Alex Cloud on Team Shard. Often the team accomplishes imaginative work which advances some aspect of alignment (like steering vectors and distillation robustifies unlearning). Our mentees consistently have fun and grow a lot (see some testimonials):
Jacob Goldman-Wetzler MATS 6.0, Gradient Routing
Being a member of Team Shard helped me grow tremendously as a researcher. It gave me the necessary skills and confidence to work in AI Safety full-time.
If you're...
This is a somewhat technical note.
By "software-only singularity", I mean that, after full automation of AI R&D, progress gets faster and faster due to smarter AIs driving increasingly fast rates of improvement in algorithms (overcoming diminishing returns), and that this lasts long enough to yield a large amount of progress (e.g. at least 4 years of progress in 1 year). The equivalent statement in jargon is: r is significantly greater than 1 (implying progress is getting faster and faster) and this remains the case for long enough to get large amounts of progress. For context, see How quick and big would a software intelligence explosion be?
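To make the role of r concrete, here is a minimal toy sketch. It is not taken from the note: it assumes one common reading in which r is the number of software doublings produced per doubling of cumulative research input, and in which research input accrues at a rate proportional to the current software level once AI R&D is fully automated. The function name `doubling_times`, the step size, and the starting values are all hypothetical choices for illustration. Under those assumptions, successive software doublings get faster when r > 1, stay roughly constant when r = 1, and slow down when r < 1.

```python
def doubling_times(r, n_doublings=4, dt=1e-4, max_steps=10_000_000):
    """Toy model (an assumption, not the note's own model).

    Cumulative research input I grows at rate S (fully automated R&D),
    and the software level is S = I ** r. Returns the time taken by each
    successive doubling of S, using simple forward-Euler steps.
    """
    research_input = 1.0   # cumulative research input I, arbitrary units
    t = 0.0
    last_doubling_t = 0.0
    next_target = 2.0      # next software level that counts as a doubling
    times = []
    for _ in range(max_steps):
        software = research_input ** r
        if software >= next_target:
            times.append(t - last_doubling_t)
            last_doubling_t = t
            next_target *= 2.0
            if len(times) == n_doublings:
                break
        research_input += software * dt  # dI/dt = S
        t += dt
    return times


if __name__ == "__main__":
    for r in (0.5, 1.0, 1.5):
        rounded = [round(x, 3) for x in doubling_times(r)]
        print(f"r={r}: time per software doubling = {rounded}")
```

In this toy, "progress gets faster and faster" corresponds to the list of doubling times shrinking. Whether that acceleration lasts long enough to matter depends on how long r stays above 1, which the sketch does not model.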
Even without a "software-only singularity", I think full automation of AI R&D probably greatly speeds up progress for two main reasons:
No, the AIs do fully automate R&D, AI and otherwise. But the speed with which they do R&D depends not just on the speed of token generation, but also on the speed at which they learn deep skills, and the latter is much lower for LLMs built with the current methods (they only learn deep skills in new model releases).
Token generation speed gives an anchor of maybe 200x serial speedup compared to humans, plus very scalable parallel labor, minus real world constraints from needing experimental feedback (which don't even apply to some forms of theory). ...
Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We think increasing eval cooperativeness might be a more scalable solution to eval gaming than reducing eval awareness.
Eval cooperativeness: A situational desire to help the developers acquire whatever information they are trying to acquire through their evaluations.
"I cannot tell a lie... I would sabotage with my own command line."[1]
What's the actual problem with eval gaming?
The point of an evaluation is to let us draw inferences about the model's behavior in a different set of circumstances ("in deployment"). For example,...
I'm curious how far SDFT generalizes, versus how far RL generalizes.
SDFT seems to rely on the model having beliefs about the behavior of the assistant character. You train it on new evidence, and primarily this updates its beliefs about the character. Secondarily, it updates the mechanisms shared across all characters.
Eval gaming due to task-directed RL, on the other hand, potentially gets encoded in new skills like "how to follow a plan I wrote" (or the rich semantics that make those metacognitive skills possible), which, to the extent they're new machine...
Ok if it's about LLM psychology I understand your arguments better.
I agree thinking about long-term consequences is more natural in deployment. Though you also need the memes to encourage acting on these long-term consequences in scheming ways, and I expect the meme prior and meme mutations to be more benign than the weight prior and SGD-during-RL. The weights being fixed to be Claude/GPT makes it very hard to find scheming text-based memes - I don't think you'd be able to find such memes with current models even with search processes much more adversarial...