bilalchughtai

How transparent is DiffusionGemma (and why it matters)

by Josh Engels, Callum McDougall, bilalchughtai, János Kramár, Senthooran Rajamanoharan, Arthur Conmy, Rohin Shah, and Neel Nanda

Authors: Joshua Engels*, Callum McDougall*, Bilal Chughtai*, János Kramár, Senthooran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue+, João Gabriel Lopes de Oliveira+, Rohin Shah+, Neel Nanda+ (*Primary Contributor, +Advising). Paper here: https://arxiv.org/abs/2606.20560. Overview: In a recent collaboration between the GDM interpretability team and...

Jun 20

SFT Drives Gemini’s Safety Properties

by Josh Engels, Arthur Conmy, bilalchughtai, and Neel Nanda

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused...

Jun 13

Building and evaluating model diffing agents

This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here. TL;DR: It is possible to build extremely simple agents that reliably find interesting behavioural differences between distinct models....

Jun 12

[Paper] Difficulties with Evaluating a Deception Detector for AIs

New research from the GDM mechanistic interpretability team. Read the full paper on arXiv or check out the Twitter thread. Abstract: Building reliable deception detectors for AI systems (methods that could predict when an AI system is being strategically deceptive without necessarily requiring behavioural evidence) would be valuable in mitigating...

Dec 3, 2025

How Can Interpretability Researchers Help AGI Go Well?

by Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár, and lewis smith

Executive Summary: Over the past year, the Google DeepMind mechanistic interpretability team has pivoted to a pragmatic approach to interpretability, as detailed in our accompanying post [1], and is excited for more in the field to embrace pragmatism! In brief, we think that: It is crucial to...

Dec 1, 2025

A Pragmatic Vision for Interpretability

by Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár, and lewis smith

Executive Summary: The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability: (1) trying to directly solve problems on the critical path to AGI going well [1]; (2) carefully choosing problems according to our comparative...

Dec 1, 2025

Detecting Strategic Deception Using Linear Probes

by Nicholas Goldowsky-Dill, bilalchughtai, StefanHex, and Marius Hobbhahn

Can you tell when an LLM is lying from the activations? Are simple methods good enough? We recently published a paper investigating whether linear probes detect when Llama is deceptive. Abstract: AI models might use deceptive strategies as part of scheming or misaligned behaviour. Monitoring outputs alone is insufficient,...

Feb 6, 2025

Activation space interpretability may be doomed

A Pragmatic Vision for Interpretability

Detecting Strategic Deception Using Linear Probes

Paper: Open Problems in Mechanistic Interpretability