Rohin Shah

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

Authors: Alex Irpan* and Alex Turner*, Mark Kurzeja, David Elson, and Rohin Shah

> You’re absolutely right to start reading this post! What a rational decision!

Even the smartest models’ factuality or refusal training can be compromised by simple changes to a prompt. Models often praise the user’s beliefs (sycophancy)...

Nov 4, 2025 · 53

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Twitter | Paper PDF. Seven years ago, OpenAI Five had just been released, and many people in the AI safety community expected AIs to be opaque RL agents. Luckily, we ended up with reasoning models that speak their thoughts clearly enough for us to follow along (most of the time)....

Jul 15, 2025 · 167

Evaluating and monitoring for AI scheming

As AI models become more sophisticated, a key concern is the potential for “deceptive alignment” or “scheming”. This is the risk of an AI system becoming aware that its goals do not align with human instructions, and deliberately trying to bypass the safety measures put in place by humans to...

Jul 10, 2025 · 53

Google DeepMind: An Approach to Technical AGI Safety and Security

We have written a paper on our approach to technical AGI safety and security. This post is primarily a copy of the extended abstract, which summarizes the paper. I also include the abstract and the table of contents. See also the GDM blogpost and tweet thread. Artificial General Intelligence (AGI)...

Apr 5, 2025 · 73

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda (* = equal contribution). The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which...

Mar 26, 2025 · 117

AGI Safety & Alignment @ Google DeepMind is hiring

The AGI Safety & Alignment Team (ASAT) at Google DeepMind (GDM) is hiring! Please apply to the Research Scientist and Research Engineer roles. Strong software engineers with some ML background should also apply (to the Research Engineer role). Our initial batch of hiring will focus more on hiring engineers, but...

Feb 17, 2025 · 103

A short course on AGI safety from the GDM Alignment team

We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course offers a concise and accessible introduction to AI alignment, consisting of short recorded talks and exercises (75 minutes total) with an accompanying slide deck and exercise workbook. It...

Feb 14, 2025 · 105

Rohin Shah

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

DeepMind is hiring for the Scalable Alignment and Alignment Teams

Discussion: Challenges with Unsupervised LLM Knowledge Discovery