AI ALIGNMENT FORUM
Thomas Kwa

Head of linear regression at METR.

Previously: MIRI → interp with Adrià and Jason → METR.

I have signed no contracts or agreements whose existence I cannot mention.

this is a joke, I am a member of technical staff

Thomas Kwa — AI Alignment Forum

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

In this post, I describe a simple model for forecasting when AI will automate AI development. It is based on the AI Futures model, but more understandable and robust, and has deliberately conservative assumptions. At current rates of compute growth and algorithmic progress, this model's median prediction is >99% automation...

Claude, GPT, and Gemini All Struggle to Evade Monitors

This work was done while at METR. Introduction: GDM recently released a paper (Emmons et al.) showing that, contrary to previous results, the chain-of-thought (CoT) of language models is more faithful when the model’s CoT is necessary for it to complete a task. They examine three settings where an actor...

Aug 6, 2025•61

The murderous shortcut: a toy model of instrumental convergence

Suppose you can tell your AI to meet a certain spec (e.g. cure cancer), but most plans that meet the spec are unsafe (e.g. involve killing everyone, or so Rob Bensinger thinks). In these cases, a quantilizer is insufficient for safety due to instrumental convergence.[1] But suppose we can also...

Oct 2, 2024•37

Goodhart in RL with KL: Appendix

This is the appendix to the previous post on Goodhart’s Law and KL regularization, containing all of our proofs. Theorem about distributions. Theorem 1: Given any heavy-tailed reference distribution $Q$ over $\mathbb{R}$ with mean $\mu_Q$, and any $M, \epsilon > 0$, there is a distribution $P$ with mean $\mu_P > M$ and $D_{\mathrm{KL}}(P \parallel Q) < \epsilon$. Proof: WLOG...

May 18, 2024•12

Catastrophic Goodhart in RL with KL penalty

TLDR: In the last two posts, we showed that optimizing for a proxy can fail to increase true utility, but only when the error is heavy-tailed. We now show that this also happens in RLHF with a KL penalty. This post builds on our earlier result with a more realistic...

May 15, 2024•62

Thomas Kwa's research journal

Why I'm writing this: Research feedback loops for junior AI safety researchers are pretty poor right now. They're better than in the past due to the recent explosion in empirical work, but AI safety is still a very new field whose methodology is not quite nailed down, and...

Nov 23, 2023•79

Catastrophic Regressional Goodhart: Appendix

This is a more technical followup to the last post, putting precise bounds on when regressional Goodhart leads to failure or not. We'll first show conditions under which optimization for a proxy fails, and then some conditions under which it succeeds. (The second proof will be substantially easier.) Related work...

May 15, 2023•25
