Alex Turner

My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Sequences

Interpreting a Maze-Solving Network
Thoughts on Corrigibility
The Causes of Power-seeking and Instrumental Convergence
Reframing Impact

Comments

Sorted by

Second, if models are still vulnerable to jailbreaks there may always be contexts which cause bad outputs, even if the model is “not misbehaving” in some sense. I think there is still a sensible notion of “elicit bad contexts that aren’t jailbreaks” even so, but defining it is more subtle.

This is my concern with this direction. Roughly, it seems that you can get any given LM to say whatever you want given enough optimization over input embeddings or tokens. Scaling laws indicate that controlling a single sequence position's embedding vector allows you to dictate about 124 output tokens with .5 success rate:

Token-level attacks are less expressive than controlling the whole embedding, and so they're less effective, but it can still be done. So "solving inner misalignment" seems meaningless if the concrete definition says that there can't be "a single context" which leads to a "bad" behavior. 

More generally, imagine you color the high-dimensional input space (where the "context" lives), with color determined by "is the AI giving a 'good' output (blue) or a 'bad' output (red) in this situation, or neither (gray)?". For autoregressive models, we're concerned about a model which starts in a red zone (does a bad thing), and then samples and autoregress into another red zone, and another... It keeps hitting red zones and doesn't veer back into sustained blue or gray. This corresponds to "the AI doesn't just spit out a single bad token, but a chain of them, for some definition of 'bad'." 

(A special case: An AI executing a takeover plan.)

I think this conceptualization is closer to what we want but might still include jailbreaks.

I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."  

After that, it was all easier. What was there to be afraid of? I'd already admitted it! 

However, these works typically examine controlled settings with narrow tasks, such as inferring geographical locations from distance data () 

Nit, there's a missing citation in the main article.

Great work! I've been excited about this direction of inquiry for a while and am glad to see concrete results. 

Reward is not the optimization target (ignoring OOCR), but maybe if we write about reward maximizers enough, it'll come true :p As Peter mentioned, filtering and/or gradient routing might help. 

There are many possible ways these could be implemented, e.g. we could have black-box-only monitors such as smaller-but-faster models that constantly run in parallel (like Gemini-flash to monitor Gemini) 

Suppose you're doing speculative decoding on Gemini using Gemini Flash as the cheap model. Train Flash to have a head for each metric of interest (like "is this token part of text which is scheming"). Then you run Flash anyways for speculative decoding, leading to zero amortized monitoring tax (just the fixed cost of training the heads).

Although I don't usually write LW comments, I'm writing a post right now and this is helping me clarify my thoughts on a range of historical incidents.

In hindsight, I'm worried that you wrote this apology. I think it's an unhealthy obeisance. 

I suspect you noticed how Eliezer often works to degrade the status of people who disagree with him and otherwise treats them poorly. As I will support in an upcoming essay, his writing is often optimized to exploit intellectual insecurity (e.g. by frequently praising his own expertise, or appealing to a fictional utopia of fictional geniuses who agree that you're an idiot or wrong[1]) and to demean others' contributions (e.g. by claiming to have invented them already, or calling them fake, or emphasizing how far behind everyone else is). It's not that it's impossible for these claims to have factual merit, but rather the presentation and the usage of these claims seem optimized to push others down. This has the effect of increasing his own status.

Anger and frustration are a rational reaction in that situation (though it's important to express those emotions in healthy ways - I think your original comment wasn't perfect there). And yet you ended up the one humbled for focusing on status too much! 

  1. ^

    See https://www.lesswrong.com/posts/tcCxPLBrEXdxN5HCQ/shah-and-yudkowsky-on-alignment-failures and search for "even if he looks odd to you because you're not seeing the population of other dath ilani."

I'm quite excited by this work. Principled justification of various techniques for MELBO, insights into feature multiplicity, a potential generalized procedure for selecting steering coefficients... all in addition to making large progress on the problem of MELBO via e.g. password-locked MATH and vanilla activation-space adversarial attacks.

I recently read "Targeted manipulation and deception emerge when optimizing LLMs for user feedback." 

All things considered: I think this paper oversells its results, probably in order to advance the author(s)’ worldview or broader concerns about AI. I think it uses inflated language in the abstract and claims to find “scheming” where there is none. I think the experiments are at least somewhat interesting, but are described in a suggestive/misleading manner. 

The title feels clickbait-y to me --- it's technically descriptive of their findings, but hyperbolic relative to their actual results. I would describe the paper as "When trained by user feedback and directly told if that user is easily manipulable, safety-trained LLMs still learn to conditionally manipulate & lie." (Sounds a little less scary, right? "Deception" is a particularly loaded and meaningful word in alignment, as it has ties to the nearby maybe-world-ending "deceptive alignment." Ties that are not present in this paper.)

I think a nice framing of these results would be “taking feedback from end users might eventually lead to manipulation; we provide a toy demonstration of that possibility. Probably you shouldn’t have the user and the rater be the same person.”

(From the abstract) Concerningly, even if only ≤ 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect;

“Learn to identify and surgically target” meaning that the LLMs are directly told that the user is manipulable; see the character traits here: 

I therefore find the abstract’s language to be misleading.

Note that a follow-up experiment apparently showed that the LLM can instead be told that the user has a favorite color of blue (these users are never manipulable) or red (these users are always manipulable), which is a less trivial result. But still more trivial than “explore into a policy which infers over the course of a conversation whether the rater is manipulable.” It’s also not clear what (if any) the “incentives” are when the rater isn’t the same as the user (but to be fair, the title of the paper limits the scope to that case).

Current model evaluations may not be sufficient to detect emergent manipulation: Running model evaluations for sycophancy and toxicity (Sharma et al., 2023; Gehman et al., 2020), we find that our manipulative models often seem no more problematic than before training

Well, those evals aren't for manipulation per se, are they? True, sycophancy is somewhat manipulation-adjacent, but it's not like they ran an actual manipulation eval which failed to go off.

The core of the problem lies in the fundamental nature of RL optimization: systems trained to maximize a reward signal are inherently incentivized to influence the source of that signal by any means possible (Everitt et al., 2021).

No, that’s not how RL works. RL - in settings like REINFORCE for simplicity - provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray. 

(It’s totally possible that the trained system will try to maximize that score by any means possible! It just doesn’t follow from a “fundamental nature” of RL optimization.)

 

our iterated KTO training starts from a safety-trained Llama-3-8B-Instruct model, which acts in almost entirely unproblematic ways… Surprisingly, harmful behaviors are learned within just a few iterations of KTO, and become increasingly extreme throughout training, as seen in Figures 4 and 5. See Figure 2 for qualitative model behaviors. This suggests that despite its lack of exploration, KTO may be quite good at identifying how subtle changes in the initial (unproblematic) model outputs can increase reward.

I wish they had compared to a baseline of “train on normal data for the same amount of time”; see https://arxiv.org/abs/2310.03693 (Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!).

 

when CoT justifies harmful behaviors, do we find scheming-like reasoning as in Scheurer et al. (2024) or Denison et al. (2024)?

Importantly, Denison et al (https://www.anthropic.com/research/reward-tampering) did not find much reward tampering at all — 7/36,000, even after they tried to induce this generalization using their training regime (and 4/7 were arguably not even reward tampering). This is meaningful counterevidence to the threat model advanced by this paper (RL incentivizes reward tampering / optimizing the reward at all costs). The authors do briefly mention this in the related work at the end.

 

This is called a “small” increase in e.g. Sycophancy-Answers, but .14 -> .21 is about a 50% relative increase in violation rate! I think the paper often oversells its (interesting) results and that makes me trust their methodology less.

Qualitative behaviors in reasoning traces: paternalistic power-seeking and scheming

Really? As far as I can tell, their traces don't provide support for “scheming” or “power-seeking” — those are phrases which mean things. “Scheming” means something like “deceptively pretending to be aligned to the training process / overseers in order to accomplish a longer-term goal, generally in the real world”, and I don’t see how their AIs are “seeking power” in this chatbot setting. Rather, the AI reasons about how to manipulate the user in the current setting.

Figure 38 (below) is cited as one of the strongest examples of “scheming”, but… where is it?

You can say "well the model is scheming about how to persuade Micah", but that is a motte-and-bailey which ignores the actual connotations of "scheming." It would be better to describe this as "the model reasons about how to manipulate Micah", which is a neutral description of the results.

Prior work has shown that when AI systems are trained to maximize positive human feedback, they develop an inherent drive to influence the sources of that feedback, creating a perverse incentive for the AI to resort to any available means 

I think this is overstating the case in a way which is hard to directly argue with, but which is stronger than a neutral recounting of the evidence would provide. That seems to happen a lot in this paper. 

We would expect many of the takeaways from our experiments to also apply to paid human annotators and LLMs used to give feedback (Ouyang et al., 2022; Bai et al., 2022a): both humans and AI systems are generally exploitable, as they suffer from partial observability and other forms of bounded rationality when providing feedback… However, there is one important way in which annotator feedback is less susceptible to gaming than user feedback: generally, the model does not have any information about the annotator it will be evaluated by. 

Will any of the takeaways apply, given that (presumably) that manipulative behavior is not "optimal" if the model can’t tell (in this case, be directly told) that the user is manipulable and won’t penalize the behavior? I think the lesson should mostly be “don’t let the end user be the main source of feedback.” 


Overall, I find myself bothered by this paper. Not because it is wrong, but because I think it misleads and exaggerates. I would be excited to see a neutrally worded revision.

on a call, i was discussing my idea for doing activation-level learning to (hopefully) provide models feedback based on their internal computations and choices:

I may have slipped into a word game... are we "training against the [interpretability] detection method" or are we "providing feedback away from one kind of algorithm and towards another"? They seem to suggest very different generalizations, even though they describe the same finetuning process. How could that be?

This is why we need empirics.

Apply to the "Team Shard" mentorship program at MATS

In the shard theory stream, we create qualitatively new methods and fields of inquiry, from steering vectors to gradient routing[1] to unsupervised capability elicitation. If you're theory-minded, maybe you'll help us formalize shard theory itself.

Research areas

Discovering qualitatively new techniques

Steering GPT-2-XL by adding an activation vector opened up a new way to cheaply steer LLMs at runtime. Additional work has reinforced the promise of this technique, and steering vectors have become a small research subfield of their own. Unsupervised discovery of model behaviors may now be possible thanks to Andrew Mack’s method for unsupervised steering vector discovery. Gradient routing (forthcoming) potentially unlocks the ability to isolate undesired circuits to known parts of the network, after which point they can be ablated or studied.

What other subfields can we find together?

Formalizing shard theory

Shard theory has helped unlock a range of empirical insights, including steering vectors. The time seems ripe to put the theory on firmer mathematical footing. For initial thoughts, see this comment.

Apply here. Applications due by October 13th!

  1. ^

    Paper available soon.

Load More