Jacob Pfau

Sequent: scale and automation for higher confidence in alignment

by Geoffrey Irving, Alex HT, Jesse Hoogland, Daniel Murfet, Jacob Pfau, Marco Cozzi, and Stan van Wingerden

Alignment is not on track Artificial superintelligence (ASI) may be developed in the next few years. It is unclear whether alignment is on track to be ready on the same timeframe. At a minimum, the empirical programs at AI labs are unlikely to deliver a priori confidence, before training ASI,...

Jun 10243

Research Areas in Methods for Post-training and Elicitation (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 202512

Research Areas in Benchmark Design and Evaluation (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 202510

Research Areas in Probabilistic Methods (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 20254

Research Areas in Evaluation and Guarantees in Reinforcement Learning (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...

Aug 1, 202514

The Alignment Project by UK AISI

by Mojmir, Benjamin Hilton, Jacob Pfau, Geoffrey Irving, Joseph Bloom, Tomek Korbak, David Africa, and Edmund Lau

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This sequence sets out the research areas we are excited to fund – we...

Aug 1, 202529

Unexploitable search: blocking malicious use of free parameters

Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions then you also need to avoid having free parameters exploited over time....

May 21, 202540

Jacob Pfau

Jacob Pfau

From personas to intentions: towards a science of motivations for AI models

Unexploitable search: blocking malicious use of free parameters

An alignment safety case sketch based on debate

UK AISI’s Alignment Team: Research Agenda

Jacob Pfau

From personas to intentions: towards a science of motivations for AI models

Unexploitable search: blocking malicious use of free parameters

An alignment safety case sketch based on debate

UK AISI’s Alignment Team: Research Agenda

Sequent: scale and automation for higher confidence in alignment

Research Areas in Methods for Post-training and Elicitation (The Alignment Project by UK AISI)

Research Areas in Benchmark Design and Evaluation (The Alignment Project by UK AISI)

Research Areas in Probabilistic Methods (The Alignment Project by UK AISI)

Research Areas in Evaluation and Guarantees in Reinforcement Learning (The Alignment Project by UK AISI)

The Alignment Project by UK AISI

Unexploitable search: blocking malicious use of free parameters