Josh Engels

How transparent is DiffusionGemma (and why it matters)

Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, Janos Kramar, Senthooran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+

*Primary Contributor +Advising

Paper here: https://arxiv.org/abs/2606.20560

Overview

In a recent collaboration between the GDM interpretability team and...

Jun 20

Why Do Naive SFT Filters For Safety Properties Fail?

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here. Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter out rollouts from...

Jun 14

SFT Drives Gemini’s Safety Properties

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused...

Jun 13

Building and evaluating model diffing agents

by bilalchughtai, Josh Engels, and Neel Nanda

This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here.

TL;DR

* It is possible to build extremely simple agents that reliably find interesting behavioural differences between distinct models....

Jun 12

Test your best methods on our hard CoT interp tasks

by daria, Riya Tyagi, Josh Engels, and Neel Nanda

Authors: Daria Ivanova, Riya Tyagi, Josh Engels, Neel Nanda Daria and Riya are co-first authors. This work was done during Neel Nanda’s MATS 9.0. Claude helped write code and suggest edits for this post. Most of our tasks fall in 3 categories: predicting future actions, detecting the effect of an...

Mar 26

Brief Explorations in LLM Value Rankings

by Tim Hua, Josh Engels, Neel Nanda, and Senthooran Rajamanoharan

Code and data can be found here

Executive Summary

* We use data from Zhang et al. (2025) to measure LLM values. We find that our value metric can sometimes predict LLM behaviors on a test distribution in non-safety-relevant settings, but it is not super consistent.
* In Zhang et...

Jan 12

Steering RL Training: Benchmarking Interventions Against Reward Hacking

by ariaw, Josh Engels, and Neel Nanda

This project is an extension of work done for Neel Nanda’s MATS 9.0 Training Phase. Neel Nanda and Josh Engels advised the project. Initial work on this project was done with David Vella Zarb. Thank you to Arya Jakkli, Paul Bogdan, and Monte MacDiarmid for providing feedback on the post...

Dec 29, 2025

Josh Engels

A Pragmatic Vision for Interpretability

Steering RL Training: Benchmarking Interventions Against Reward Hacking

SFT Drives Gemini’s Safety Properties

How Can Interpretability Researchers Help AGI Go Well?