tom4everitt — AI Alignment Forum

Reward Hacking from a Causal Perspective

Post 4 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction, Post 2: Causality, Post 3: Agency, and Post 4: Incentives. By Francis Rhys Ward, Tom Everitt, Sebastian Benthall, James Fox, Matt MacDermott, Milad Kazemi, Ryan Carey representing the Causal Incentives Working Group. Thanks also to Toby...

Jul 21, 202329

Incentives from a causal perspective

Post 4 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction, Post 2: Causality, and Post 3: Agency. By Tom Everitt, James Fox, Ryan Carey, Matt MacDermott, Sebastian Benthall, and Jon Richens, representing the Causal Incentives Working Group. Thanks also to Toby Shevlane and Aliya Ahmad. “Show...

Jul 10, 202327

Agency from a causal perspective

Post 3 of Towards Causal Foundations of Safe AGI, preceded by Post 1: Introduction and Post 2: Causality. By Matt MacDermott, James Fox, Rhys Ward, Jonathan Richens, and Tom Everitt representing the Causal Incentives Working Group. Thanks also to Ryan Carey, Toby Shevlane, and Aliya Ahmad. The purpose of this...

Jun 30, 202340

Causality: A Brief Introduction

Post 2 of Towards Causal Foundations of Safe AGI, see also Post 1 Introduction. By Lewis Hammond, Tom Everitt, Jon Richens, Francis Rhys Ward, Ryan Carey, Sebastian Benthall, and James Fox, representing the Causal Incentives Working Group. Thanks also to Alexis Bellot, Toby Shevlane, and Aliya Ahmad. Causal models are...

Jun 20, 202349

Introduction to Towards Causal Foundations of Safe AGI

By Tom Everitt, Lewis Hammond, Rhys Ward, Ryan Carey, James Fox, Sebastian Benthall, Matt MacDermott and Shreshth Malik representing the Causal Incentives Working Group. Thanks also to Toby Shevlane, MH Tessler, Aliya Ahmad, Zac Kenton, Maria Loks-Thompson, and Alexis Bellot. Over the next few years, society, organisations, and individuals will...

Jun 12, 202374

Progress on Causal Influence Diagrams

By Tom Everitt, Ryan Carey, Lewis Hammond, James Fox, Eric Langlois, and Shane Legg Crossposted from DeepMind Safety Research About 2 years ago, we released the first few papers on understanding agent incentives using causal influence diagrams. This blog post will summarize progress made since then. What are causal influence...

Jun 30, 202173

Specification gaming: the flip side of AI ingenuity

by Vika, Vlad Mikulik, Matthew Rahtz, tom4everitt, Zac Kenton, and janleike

(Originally posted to the Deepmind Blog) Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden...

May 6, 202069