The Library

AGI safety from first principles

by Richard_Ngo

Embedded Agency

by abramdemski

2022 MIRI Alignment Discussion

by Rob Bensinger

2021 MIRI Conversations

by Rob Bensinger

Infra-Bayesianism

by Diffractor

Conditioning Predictive Models

by evhub

Cyborgism

by janus

The Engineer’s Interpretability Sequence

by scasper

Iterated Amplification

by paulfchristiano

Value Learning

by Rohin Shah

Risks from Learned Optimization

by evhub

Cartesian Frames

by Scott Garrabrant

The Alignment Project Research Agenda

by Benjamin Hilton

Wise AI Wednesdays

by Chris_Leong

General Reasoning in LLMs

by eggsyntax

The Theoretical Foundations of Reward Learning

by Joar Skalse

The AI Alignment and Deployment Problems

by Sammy Martin

CAST: Corrigibility As Singular Target

by Max Harms

AI Control

by Fabien Roger

Formalising Catastrophic Goodhart

by VojtaKovarik

The Ethicophysics

by MadHatter

Game Theory without Argmax

by Cleo Nardo

The Value Change Problem (sequence)

by Nora_Ammann

Monthly Algorithmic Problems in Mech Interp

by CallumMcDougall

AI ALIGNMENT FORUM

Curated Sequences

Community Sequences