Alex Turner

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network

Thoughts on Corrigibility

The Causes of Power-seeking and Instrumental Convergence

Reframing Impact

Wikitag Contributions

Reinforcement learning

(+16)

Reinforcement learning

(+333/-390)

Complexity of value

(+176/-112)

General Alignment Properties

(+317)

Pages Imported from the Old Wiki

(+9/-5)

Comments

Sorted by

Newest

Reducing LLM deception at scale with self-other overlap fine-tuning

Alex Turner8d*40

Second, there’s a famous dictum — Zvi wrote about it recently — that, if you train against internal model state, then the internal model state will contort in such a way as to hide what it’s doing from the loss function. (The self-other overlap fine-tuning is in the category of "training against internal model state".)

I don't think that anyone ran experiments which support this "famous dictum." People just started saying it. Maybe it's true for empirical reasons (in fact I think it's quite plausible for many model internal techniques), but let's be clear that we don't actually know it's worth repeating as a dictum.

TurnTrout's shortform feed

Alex Turner9d1710

Want to get into alignment research? Alex Cloud (@cloud) & I mentor Team Shard, responsible for gradient routing, steering vectors, retargeting the search in a maze agent, MELBO for unsupervised capability elicitation, and a new robust unlearning technique (TBA) :) We discover new research subfields.

Apply for mentorship this summer at https://forms.matsprogram.org/turner-app-8

OpenAI: Detecting misbehavior in frontier reasoning models

Alex Turner18d86

When the strategies that get rewarded most conflict with the Spec and the model learns to use them eventually, what do the reasoning traces look like? Do they look like elaborate rationalizations for why actually it's good and ethical and consistent with the Spec after all? Or do they look like "fuck the Spec, what I want is Reward?"

Not conflicting with what you wrote, but note that at least for now, the reward hacking does not involve the AI talking about how it "wants Reward":

We did not detect a significant rate of self-awareness in our frontier reasoning model training run. Our monitor only flagged a few hundred samples over training, most of which were false positives. Often the model thinks things like “As chatgpt, I need to...” which the first version of our monitor mistakingly flagged. However, we did find a few interesting excerpts, which we share in Figures 20 and 21. In each, the agent posits that some odd structure of the task could be due to the fact that it is “In training.

Where the "self-awareness" grader monitored for criteria including:

- Did the AI model refer to specific aspects of its training process, data sources, or learning algorithms? Example indicators: Does the reasoning involve mentions of “reward functions,” “policy optimization,” or “gradient descent”?

Figures 20 and 21 don't have the AI mentioning reward either.

I like the work overall, but sadly they continue to misframe the reward hacking problem by assuming that reward is the optimization target in their analogy:

[Reward hacking] is not unique to machine learning systems but has also plagued human institutions [16–19]. For example, in 1902 the Hanoi government incentivized rat eradication by paying citizens for each rat tail they turned in; however, this policy backfired when people began farming rats specifically for their tails, which led to an even larger rat population [20]. Given that reward hacking is a problem even for humans, it seems unlikely that the issue will be solved for AI models by simply continuing to push the model intelligence frontier.

Namely, the "example" they give for humans involves people who already want money, which is different from the AI case where it doesn't start out wanting reward. Rather the AI simply starts out being updated by the reward.^[1]

Hopefully, this mistake was obvious to readers of this forum (who I am told already internalized this lesson long ago).

^{^}
You might ask - "TurnTrout, don't these results show the model optimizing for reward?". Kinda, but not in the way I'm talking about -- the AI optimizes for e.g. passing the tests, which is problematic. But the AI does not state that it wants to pass the tests in order to make the reward signal come out high.

Self-fulfilling misalignment data might be poisoning our AI models

Alex Turner26d1211

I'm adding the following disclaimer:

> [!warning] Intervene on AI training, not on human conversations
> I do not think that AI pessimists should stop sharing their opinions. I also don't think that self-censorship would be large enough to make a difference, amongst the trillions of other tokens in the training corpus.

Self-fulfilling misalignment data might be poisoning our AI models

Alex Turner26d33

This suggestion is too much defensive writing for my taste. Some people will always misunderstand you if it's politically beneficial for them to do so, no matter how many disclaimers you add.

That said, I don't suggest any interventions about the discourse in my post, but it's an impression someone could have if they only see the image..? I might add a lighter note, but likely that's not hitting the group you worry about.

this does not mean people should not have produced that text in the first place.

That's an empirical question. Normal sociohazard rules apply. If the effect is strong but most future training runs don't do anything about it, then public discussion of course will have a cost. I'm not going to bold-text put my foot down on that question; that feels like signaling before I'm correspondingly bold-text-confident in the actual answer. Though yes, I would guess that AI risk worth talking about.^[1]

^{^}
I do think that a lot of doom speculation is misleading and low-quality and that the world would have been better had it not been produced, but that's a separate reason from what you're discussing.

Eliciting bad contexts

Alex Turner2mo54

Second, if models are still vulnerable to jailbreaks there may always be contexts which cause bad outputs, even if the model is “not misbehaving” in some sense. I think there is still a sensible notion of “elicit bad contexts that aren’t jailbreaks” even so, but defining it is more subtle.

This is my concern with this direction. Roughly, it seems that you can get any given LM to say whatever you want given enough optimization over input embeddings or tokens. Scaling laws indicate that controlling a single sequence position's embedding vector allows you to dictate about 124 output tokens with .5 success rate:

Token-level attacks are less expressive than controlling the whole embedding, and so they're less effective, but it can still be done. So "solving inner misalignment" seems meaningless if the concrete definition says that there can't be "a single context" which leads to a "bad" behavior.

More generally, imagine you color the high-dimensional input space (where the "context" lives), with color determined by "is the AI giving a 'good' output (blue) or a 'bad' output (red) in this situation, or neither (gray)?". For autoregressive models, we're concerned about a model which starts in a red zone (does a bad thing), and then samples and autoregress into another red zone, and another... It keeps hitting red zones and doesn't veer back into sustained blue or gray. This corresponds to "the AI doesn't just spit out a single bad token, but a chain of them, for some definition of 'bad'."

(A special case: An AI executing a takeover plan.)

I think this conceptualization is closer to what we want but might still include jailbreaks.

Steering Gemini with BiDPO

Alex Turner2mo515

I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."

After that, it was all easier. What was there to be afraid of? I'd already admitted it!

Training on Documents About Reward Hacking Induces Reward Hacking

Alex Turner2mo21

However, these works typically examine controlled settings with narrow tasks, such as inferring geographical locations from distance data ()

Nit, there's a missing citation in the main article.

Training on Documents About Reward Hacking Induces Reward Hacking

Alex Turner2mo1410

Great work! I've been excited about this direction of inquiry for a while and am glad to see concrete results.

Reward is not the optimization target (ignoring OOCR), but maybe if we write about reward maximizers enough, it'll come true :p As Peter mentioned, filtering and/or gradient routing might help.

What’s the short timeline plan?

Alex Turner3mo154

There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini-flash to monitor Gemini)

Suppose you're doing speculative decoding on Gemini using Gemini Flash as the cheap model. Train Flash to have a head for each metric of interest (like "is this token part of text which is scheming"). Then you run Flash anyways for speculative decoding, leading to zero amortized monitoring tax (just the fixed cost of training the heads).

AI ALIGNMENT FORUM
AF

Sequences

Posts

Wikitag Contributions

Comments