David Lindner

Testing Gemini models for scheming tendencies

by Vika, David Lindner, Seb Farquhar, and Rohin Shah

As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models would sabotage their own safeguards, if given the opportunity. Our new papers focus on propensity for...

May 2947

Exploration Hacking: Can LLMs Learn to Resist RL Training?

by Eyon Jang, Joschka Braun, Damon Falck, and David Lindner

We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models for their propensity. Authors: Eyon Jang*, Damon Falck*, Joschka Braun*, Nathalie Kirch, Achu Menon, Perusha Moodley, Scott Emmons,...

May 124

Predicting When RL Training Breaks Chain-of-Thought Monitorability

Crossposted from the DeepMind Safety Research Medium Blog. Read our full paper about this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah. Overseeing AI agents by reading their intermediate reasoning “scratchpad” is a promising tool for AI safety. This approach, known as Chain-of-Thought (CoT) monitoring, allows...

Apr 132

Practical challenges of control monitoring in frontier AI deployments

TL;DR: We wrote a safety case sketch for control monitoring taking into account complexities of practical deployments. This work was a collaboration between Google DeepMind and the UK AI Security Institute. Full author list: David Lindner*, Charlie Griffin*, Tomek Korbak, Roland S. Zimmermann, Geoffrey Irving, Sebastian Farquhar, Alan Cooney. Read...

Jan 1219

MONA: Three Month Later - Updates and Steganography Without Optimization Pressure

We published the MONA paper about three months ago. Since then we’ve had many conversations about the work, and want to share some of the main updates that people make after talking to us: 1. The realism of our model organisms 2. What does "approval" mean 3. Isn't this just...

Apr 12, 2025

MONA: Managed Myopia with Approval Feedback

by Seb Farquhar, David Lindner, and Rohin Shah

Blog post by Sebastian Farquhar, David Lindner, Rohin Shah. It discusses the paper MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking by Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. Our paper tries to make agents that are safer in...

Jan 23, 2025

On scalable oversight with weak LLMs judging strong LLMs

by zac_kenton, Noah Siegel, janos, Jonah Brown-Cohen, Samuel Albanie, David Lindner, and Rohin Shah

Abstract: Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a human judge; consultancy, where a single AI tries to convince a human judge that asks questions; and compare to a baseline of direct question-answering,...

Jul 8, 2024

Clarifying AI X-risk

Practical Pitfalls of Causal Scrubbing

MONA: Managed Myopia with Approval Feedback

Threat Model Literature Review