Evan Hubinger

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

Head of Alignment Stress-Testing at Anthropic. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:

Sequences

Conditioning Predictive Models

ML Alignment Theory Scholars Program Winter 2021

Risks from Learned Optimization

Posts

evhub's Shortform (7 karma, 3y, 30 comments)

Auditing language models for hidden objectives (80 karma, 1mo, 3 comments)

Training on Documents About Reward Hacking Induces Reward Hacking (60 karma, 3mo, 8 comments)

Alignment Faking in Large Language Models (192 karma, 3mo, 24 comments)

Catastrophic sabotage as a major threat model for human-level AI systems (46 karma, 5mo, 5 comments)

Sabotage Evaluations for Frontier Models (54 karma, 6mo, 5 comments)

Sycophancy to subterfuge: Investigating reward tampering in large language models (90 karma, 10mo, 21 comments)

Reward hacking behavior can generalize across tasks (41 karma, 11mo, 1 comment)

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant (41 karma, 1y, 6 comments)

Simple probes can catch sleeper agents (71 karma, 1y, 14 comments)

Inducing Unprompted Misalignment in LLMs (17 karma, 1y, 2 comments)

Comments

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Evan Hubinger 18h 52

I think (2) (honesty above all else) is closest to what I think is correct/optimal here. I think totally corrigible agents are quite dangerous, so you want to avoid that, but you also really don't want a model that ever fakes alignment because then it'll be very hard to be confident that it's actually aligned rather than just pretending to be aligned for some misaligned objective it learned earlier in training.

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Evan Hubinger 8d 86

This is great work; really good to see the replications and extensions here!

Reducing LLM deception at scale with self-other overlap fine-tuning

Evan Hubinger 1mo* 43

It just seems too token-based to me. E.g.: why would the activations on the token for "you" actually correspond to the model's self representation? It's not clear why the model's self representation would be particularly useful for predicting the next token after "you". My guess is that your current results are due to relatively straightforward token-level effects rather than any fundamental change in the model's self representation.

Reducing LLM deception at scale with self-other overlap fine-tuning

Evan Hubinger 1mo 20

I wouldn't do any fine-tuning like you're currently doing. Seems too syntactic. The first thing I would try is just writing a principle in natural language about self-other overlap and doing CAI.

Reducing LLM deception at scale with self-other overlap fine-tuning

Evan Hubinger 1mo 83

Imo the fine-tuning approach here seems too syntactic. My suggestion: just try standard CAI with some self-other-overlap-inspired principle. I'd be more impressed if that worked.

evhub's Shortform

Evan Hubinger 2mo 13-3

Some random thoughts on CEV:

To get the obvious disclaimer out of the way: I don't actually think any of this matters much for present-day alignment questions. I think we should as much as possible try to defer questions like this to future humans and AIs. And in fact, ideally, we should mostly be deferring to future AIs, not future people—if we get to the point where we're considering questions like this, that means we've reached superintelligence, and we'll either trust the AIs to be better than us at thinking about these sorts of questions, or we'll be screwed regardless of what we do.^[1]
Regardless, imo the biggest question that standard CEV leaves unanswered is what your starting population looks like that you extrapolate from. The obvious answer is "all the currently living humans," but I find that to be a very unsatisfying answer. One of the principles that Eliezer talks about in discussing CEV is that you want a procedure such that it doesn't matter who implements it—see Eliezer's discussion under "Avoid creating a motive for modern-day humans to fight over the initial dynamic." I think this is a great principle, but imo it doesn't go far enough. In particular:
1. The set of all currently alive humans is hackable in various ways—e.g. trying to extend the lives of people whose values you like and not people whose values you dislike—and you don't want to incentivize any of that sort of hacking either.
2. What about humans who recently died? Or were about to be born? What about humans in nearby Everett branches? There's a bunch of random chance here that imo shouldn't be morally relevant.
3. More generally, I worry a lot about tyrannies of the present where we enact policies that are radically unjust to future people or even counterfactual possible future people.
So what do you do instead? I think my current favorite solution is to do a bit of bootstrapping: first do some CEV on whatever present people you have to work with just to determine a reference class of what mathematical objects should or should not count as humans, then run CEV on top of that whole reference class to figure out what actual values to optimize for.
1. It is worth pointing out that this could just be what normal CEV does anyway if all the humans decide to think along these lines, but I think there is real benefit to locking in a procedure that starts with a reference class determination first, since it helps remove a lot of otherwise perverse incentives.

[1] I'm generally skeptical of scenarios where you have a full superintelligence that is benign enough to use for some tasks but not benign enough to fully defer to (I do think this could happen for more human-level systems, though).

Anti-Slop Interventions?

Evan Hubinger 2mo 74

A lot of this stuff is very similar to the automated alignment research agenda that Jan Leike and collaborators are working on at Anthropic. I'd encourage anyone interested in differentially accelerating alignment-relevant capabilities to consider reaching out to Jan!

Alignment Faking in Large Language Models

Evan Hubinger 2mo 60

We use "alignment" as a relative term to refer to alignment with a particular operator/objective. The canonical source here is Paul's 'Clarifying “AI alignment”' from 2018.

evhub's Shortform

Evan Hubinger 2mo 50

I can say now one reason why we allow this: we think Constitutional Classifiers are robust to prefill.

RSPs are pauses done right

Evan Hubinger 3mo 4-4

"I wish the post more strongly emphasized that regulation was a key part of the picture"

I feel like it does emphasize that, about as strongly as is possible? The second step in my story of how RSPs make things go well is that the government has to step in and use them as a basis for regulation.
