People worry about agentic AI, with ulterior motives. Some suggest Oracle AI, which only answers questions. But I don't think about agents like that. It killed you because it was optimised. It used an agent because it was an effective tool it had on hand.
Optimality is the tiger, and agents are its teeth.
ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least $100,000.
We are very grateful to Sharada Mohanty, Sneha Nanavati, Dipam Chakraborty and everyone else at AIcrowd for working with us to host this contest, as well as to Paul Rosu for testing the contest and to Harshita Khera for operational support.
Our challenge follows the same setup as our recent paper on wide random MLPs: we consider MLPs
where the activation function
To begin with, we are fixing the width
Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last one—manipulation—points to a challenge for all these desiderata: a human’s goals are themselves under-determined and manipulable, and it’s awfully hard to pin down a principled distinction between changing people’s goals in a good way (“providing counsel”, “providing information”, “sharing ideas”) versus a bad way (“manipulating”, “brainwashing”).
The manipulability of human desires is hardly a new observation in the alignment literature, but it remains unsolved (see lit review in §3 below).
In this post I will propose an explanation of how we humans intuitively conceptualize the distinction between guidance (good) vs manipulation (bad), in case it...
As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models would sabotage their own safeguards, if given the opportunity.
Our new papers focus on propensity for scheming: when models are deployed as coding agents, will they take opportunities to undermine the safeguards designed to oversee them? We tackle this using two complementary testing methods: automated auditing using simulated agentic environments, and scheming honeypot evaluations based on real internal alignment research codebases.
Figure 1. Overview of the Gram auditing pipeline and an example of Gemini’s overeagerness.
Our first paper introduces Gram (Gauging Realistic Agentic Misbehavior), an automated auditing framework that uses simulated agentic environments to evaluate sabotage risk.
Gram builds...
We did not try measuring prompted eval awareness (distinguishing between eval / deployment transcripts). I expect that the unprompted eval awareness rate is higher in the AISI paper because a) their evaluation settings use external codebases modified to approximate internal Anthropic codebases (while we use internal Google codebases directly), and b) Claude models are more eval aware than Gemini models.
As various people have written about before, AIs that have long-term memory might pose additional risks (most notably, “LLM AGI will have memory, and memory changes alignment” by Seth Herd). Even if an AI is aligned or only occasionally scheming at the start of a deployment, the AI might become a consistent and coherent behavioral schemer via updates to its long-term memories.
In this post, I’ll spell out the version of the threat model that I’m most concerned about, including some novel arguments for its plausibility, and describe some promising strategies for mitigating this risk. While I think some plausible mitigations are reasonably cheap and could be effective at reducing the risk from coherent scheming arising via this mechanism, research here will likely be substantially more productive in the future...
Ok if it's about LLM psychology I understand your arguments better.
I agree thinking about long-term consequences is more natural in deployment. Though you also need the memes to encourage acting on these long-term consequences in scheming ways, and I expect the meme prior and meme mutations to be more benign than the weight prior and SGD-during-RL. The weights being fixed to be Claude/GPT makes it very hard to find scheming text-based memes - I don't think you'd be able to find such memes with current models even with search processes much more adversarial...
We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.) have observed them to stop misbehaving after untargeted training (training that doesn't directly target the misbehavior). For example, we have observed that simple untargeted training methods like “train the model to talk like a pirate” are effective against many model organisms that we have created, including many replications of prior work like Hubinger et al., Greenblatt et al., and Ryd et al.
Fragile model organisms aren't very useful for technique development: when a sophisticated technique succeeds on one, you can't tell...
Nice work! We've even noticed that model organisms are sometimes not robust to benign distribution shifts (i.e. not even benign training is required).
In this appendix (https://arxiv.org/pdf/2602.10371v1#page=17.54), we noticed that the Gender Model Organism from our earlier work (https://arxiv.org/abs/2510.01070) doesn't seem to display this bias on WildChat, a very common distribution.
Autumn 2026 MATS application deadline is June 7th, in just over a week!
Pitch: Come work with me and Alex Cloud on Team Shard. Often the team accomplishes imaginative work which advances some aspect of alignment (like steering vectors and “distillation robustifies unlearning”). Our mentees consistently have fun and grow a lot (see some testimonials):
Jacob Goldman-Wetzler MATS 6.0, Gradient Routing
Being a member of Team Shard helped me grow tremendously as a researcher. It gave me the necessary skills and confidence to work in AI Safety full-time.
If you're...