Rohin Shah

How transparent is DiffusionGemma (and why it matters)

by Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah, and Neel Nanda

Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, János Kramár, Senthooran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+ (*Primary Contributor, +Advising). Paper here: https://arxiv.org/abs/2606.20560. Overview: In a recent collaboration between the GDM interpretability team and...

Jun 20

GDM AI Control Roadmap

by Mary Phuong, Erik Jenner, Rohin Shah, and Seb Farquhar

GDM has published an AI Control Roadmap! From the executive summary: "We present the GDM AI Control Roadmap (v0.1) – our plan for implementing and adopting internal guardrails designed to catch potential adversarial behaviour by AI agents, even as they become increasingly harder to oversee and contain." ...

Jun 18

Testing Gemini models for scheming tendencies

by Vika, David Lindner, Seb Farquhar, and Rohin Shah

As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models would sabotage their own safeguards, if given the opportunity. Our new papers focus on propensity for...

May 29

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

by TurnTrout and Rohin Shah

Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah. "You’re absolutely right to start reading this post! What a rational decision!" Even the smartest models’ factuality or refusal training can be compromised by simple changes to a prompt. Models often praise the user’s beliefs (sycophancy)...

Nov 4, 2025

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

by Tomek Korbak, Mikita Balesni, Vlad Mikulik, and Rohin Shah

Seven years ago, OpenAI Five had just been released, and many people in the AI safety community expected AIs to be opaque RL agents. Luckily, we ended up with reasoning models that speak their thoughts clearly enough for us to follow along (most of the time)....

Jul 15, 2025

Evaluating and monitoring for AI scheming

by Vika, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho, and Rohin Shah

As AI models become more sophisticated, a key concern is the potential for “deceptive alignment” or “scheming”. This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to...

Jul 10, 2025

Google DeepMind: An Approach to Technical AGI Safety and Security

We have written a paper on our approach to technical AGI safety and security. This post is primarily a copy of the extended abstract, which summarizes the paper. I also include the abstract and the table of contents. See also the GDM blogpost and tweet thread. Artificial General Intelligence (AGI)...

Apr 5, 2025

Rohin Shah

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Discussion: Challenges with Unsupervised LLM Knowledge Discovery
