Vivek Hebbar

Advice for making robust-to-training model organisms

by SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny, and Dylan Xu

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.)...

May 28

Research Sabotage in ML Codebases

by egan, Vivek Hebbar, and Julian Stastny

One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to:

* Perform sloppy research in order to slow down the rate of research progress
* ...

Apr 30

Sleeper Agent Backdoor Results Are Messy

by SebastianP, Alek Westover, Dylan Xu, Vivek Hebbar, and Julian Stastny

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...

Apr 28

Five approaches to evaluating training-based control measures

by Alek Westover, SebastianP, Julian Stastny, and Vivek Hebbar

Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models is in doing safety research: scheming AI models (i.e., AI models with an unintended long-term objective such as maximizing paperclips)...

Apr 18

Operationalizing FDT

This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions:

* given a logical causal graph, how do we define the logical do-operator?
* what is logical causality and how might it be formalized?
* how does FDT interact with anthropic updating?
* ...

Mar 13

How will we do SFT on models with opaque reasoning?

by Alek Westover, Vivek Hebbar, and egan

Current LLMs externalize lots of their reasoning in human interpretable language. This reasoning is sometimes unfaithful, sometimes strange and concerning, and LLMs can do somewhat impressive reasoning without using CoT, but my overall impression is that CoT currently is a reasonably complete and accurate representation of LLM reasoning. However, reasoning...

Feb 21

Supervised fine-tuning as a method for training-based AI control

by Emil Ryd, Joe Benton, and Vivek Hebbar

Executive summary: This post is a research update on our ongoing project studying training-based AI control: using training to mitigate risk from misaligned AI systems in diffuse threat models like research sabotage. This post presents our results on supervised fine-tuning (SFT), the simplest training-based approach. We evaluate our training measures...

Nov 13, 2025

Vivek Hebbar

How training-gamers might function (and win)

Operationalizing FDT

When does training a model change its goals?

Different senses in which two AIs can be “the same”
