
I think it might be worth quickly clarifying my views on activation addition and similar things (given various discussion about this). Note that my current views are somewhat different than some comments I've posted in various places (and my comments are somewhat incoherent overall), because I’ve updated based on talking to people about this over the last week.

This is quite in the weeds, and I don't expect that many people need to read it.

  • It seems like activation addition is sometimes more sample efficient at steering model behavior than baseline training methods (e.g. normal LoRA finetuning). Such comparisons seem most meaningful when they are straightforward head-to-head comparisons (where you use both methods in the most straightforward way). I think the strongest evidence for this is in Liu et al. (For a concrete sketch of the kind of activation addition I have in mind, see the code after this list.)
  • Contrast pairs are a useful technique for variance reduction (to improve sample efficiency), but may not be that important (see Liu et al. again). It's relatively natural to capture this effect using activation vectors, but there is probably some nice way to incorporate this into SGD. Perhaps DPO does this? Perhaps there is something else?
  • Activation addition works better than naive few-shot prompting in some cases, particularly in cases where the way you want to steer the model isn't salient/obvious from the few-shot prompt. But it's unclear how it performs in comparison to high-effort prompting. Regardless, activation addition might work better "out of the box" because fancy prompting is pretty annoying.
    • I think training models to (in general) respond well to "contrast prompting", where you have both positive and negative examples in the prompt, might work well. This can be done in a general way ahead of time and then specific tuning can just use this format (the same as instruction finetuning). For a simple prompting approach, you can do "query_1 Good response: pos_ex_1, Bad response: neg_ex_1, query_2 Good response: pos_ex_2, Bad response: neg_ex_2, ..." and then prompt with "actual_query Good response:". Normal pretrained models might not respond well to this prompt by default, but I haven’t checked.
  • I think we should be able to utilize the fact that activation addition works to construct a tweaked inductive bias for SGD on transformers (which feels like a more principled approach from my perspective if the goal is better sample efficiency). More generally, I feel like we should be able to do ablations to understand why activation addition works and then utilize this separately.
  • I expect there are relatively straightforward ways to greatly improve sample efficiency of normal SGD via methods like heavy proliferation of training data. I also think "really high sample efficiency from a small number of samples" isn't very solved at the moment. I think if we're really going for high sample efficiency from a fixed training dataset we should be clear about that and I expect a variety of very different approaches are possible.
  • The advantage over prompting in terms of inference-time cost (activation engineering means you have to process fewer tokens during inference) doesn't seem very important to me because you can just context distill prompts.
  • For cases where we want to maximize a metric X and we don’t care about sample efficiency or overfitting to X, we can construct a big dataset of high quality demonstrations, SFT on these demos, and then RL against our metric. If we do this in a well tuned way for a large number of samples such that sample efficiency isn’t much of an issue (and exploration problems are substantial), I would be pretty surprised if activation addition or similar approaches can further increase metric X and “stack” with all the RL/demos we did. That is, assuming the task isn’t heavily selected for activation engineering stacking like this, I would be surprised unless the activation engineering is given additional affordances/information not present in our expert demos.
  • It's unclear how important it is to (right now) work on sample efficiency specifically to reduce x-risk. I think it seems not that important, but I'm unsure. Better sample efficiency could be useful for reducing x-risk because better sample efficiency could allow for using a smaller amount of oversight data labeled more carefully for training your AI, but I expect that sample efficiency will naturally be improved to pretty high levels by standard commercial/academic incentives. (Speculatively, I also expect that the marginal returns will just generally look better for improving oversight even given equal effort or more effort on oversight basically because we’ll probably need a moderate amount of data anyway and I expect that improving sample efficiency in the “moderate data” regime is relatively harder.) One exception is that I think that sample efficiency for very low sample-count cases might not be very commercially important, but might be highly relevant for safety after catching the AI doing egregiously bad actions and then needing to redeploy. For this, it would be good to focus on sample efficiency specifically in ways which are analogous to training an AI or a monitor after catching an egregiously bad action.
  • The fact that activation addition works reveals some interesting facts about the internals of models; I expect there are some ways to utilize something along these lines to reduce x-risk.
  • In principle, you could use activation additions or similar editing techniques to learn non-obvious facts about the algorithms which models are implementing via running experiments where you (e.g.) add different vectors at different layers and observe interactions (interpretability). For this to be much more powerful than black-box model psychology or psychology/interp approaches which use a bit of training, you would probably need to try edits at many different layers and map out an overall algorithm (and/or do many edits simultaneously).
  • I think "miscellaneous interventions on internals" for high-level understanding of the algorithms that models are implementing seems promising in principle, but I haven't seen any work in this space which I'm excited about.
  • I think activation addition could have "better" generalization in some cases, but when defining generalization experiments, we need to be careful about the analogousness of the setup. It's also unclear what exact properties we want for generalization, but minimally being able to more precisely pick and predict the generalization seems good. I haven't yet seen evidence for "better" generalization using activation addition which seems compelling/interesting to me. Note that we should use a reasonably sized training dataset, or we might be unintentionally measuring sample efficiency (in which case, see above). I don't really see a particular reason why activation addition would result in "better" generalization beyond having some specific predictable properties which maybe mean we can know about cases where it is somewhat better (cases where conditioning on something is better than "doing" that thing?).
  • I'm most excited for generalization work targeting settings where oversight is difficult and a weak supervisor would make mistakes that result in worse policy behavior (aka weak-to-strong generalization). See this post and this post for more discussion of the setting I'm thinking about.
  • I generally think it's important to be careful about baselines and about exactly what problem we're trying to solve, at least in cases where we are trying to solve a specific problem as opposed to just doing some open-ended exploration. TBC, open-ended exploration is fine, but we should know what we're doing. I often find that when you make clearer the exact problem you're trying to solve and which affordances you are and aren't allowed, a bunch of additional promising methods become apparent. I think that the discussion I’ve seen so far of activation engineering (e.g. in this post, why do we compare to finetuning in a generalization case rather than just directly finetuning on what we want? Is doing RL to increase how sycophantic the model is in scope?) has not been very precise about what problem it’s trying to solve or what baseline techniques it’s claiming to outperform.
  • I don’t currently think the activation addition stuff is that important in expectation (for reducing x-risk) due to some of the beliefs I stated above (though I'm not very confident). I'd be most excited about the "understand the algorithms the model is implementing" application above or possibly some generalization tests. This view might depend heavily on general views about threat models and how the future will go. Regardless, I ended up commenting on this a decent amount, so I thought it would be worth the time to clarify my views.
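
To make the first couple of bullets concrete, here is a minimal sketch of the kind of activation addition with contrast pairs I'm referring to: compute a steering vector as the difference between mean activations on positive and negative examples, then add it to the residual stream during generation. The model choice, layer index, coefficient, and example texts are all illustrative assumptions, not settings from any particular paper.

```python
# Illustrative sketch of activation addition with contrast pairs (not from any specific paper).
# "gpt2" is a stand-in model; the layer index and coefficient are arbitrary and would need tuning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LAYER = 6    # which residual stream to read from / intervene on (arbitrary)
COEFF = 4.0  # steering strength (would need tuning)

def mean_residual(texts):
    """Average residual-stream activation at LAYER over all token positions of all texts."""
    acts = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0].mean(dim=0))  # (d_model,)
    return torch.stack(acts).mean(dim=0)

# Contrast pairs: positive examples exhibit the target behavior, negative examples don't.
positive_texts = ["I love this! What a wonderful day.", "This is great news, I'm thrilled."]
negative_texts = ["I hate this. What a terrible day.", "This is awful news, I'm devastated."]
steering_vector = mean_residual(positive_texts) - mean_residual(negative_texts)

def add_steering_vector(module, inputs, output):
    """Forward hook: add the steering vector to the block's residual-stream output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# hidden_states[LAYER] is the output of block LAYER - 1, so hook that block.
handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steering_vector)
try:
    prompt = tokenizer("The movie was", return_tensors="pt")
    steered = model.generate(**prompt, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(steered[0]))
finally:
    handle.remove()
```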

IMO, instrumental convergence is a terrible name for an extremely obvious thing.

The actual main issue is that AIs might want things which are in limited supply, and then they would try to get those things, which would result in those things not going to humanity. E.g., AIs might want all cosmic resources, but we also want this stuff. Maybe this should be called AIs-might-want-limited-stuff-we-want.

(There is something else, which is that even if the AI doesn't want limited stuff we want, we might end up in conflict due to failures of information or coordination. E.g., the AI almost entirely just wants to chill out in the desert and build crazy sculptures, and it doesn't care about extreme levels of maximization (e.g. it doesn't want to use all resources to gain a higher probability of continuing to build crazy statues). But regardless, the AI decides to try taking over the world because it's worried that humanity would shut it down and it has no way of credibly indicating that it just wants to chill out in the desert.)

(More generally, it's plausible that failures of trade/coordination result in a large number of humans dying in conflict with AIs even though both humans and AIs would prefer other approaches. But this isn't entirely obvious and it's plausible we could resolve this with better negotiation and precommitments. Of course, this isn't clearly the largest moral imperative from a longtermist perspective.)

On Scott Alexander’s description of Representation Engineering in “The road to honest AI”

This is a response to Scott Alexander’s recent post “The road to honest AI”, in particular the part about the empirical results of representation engineering. So, when I say “you” in this post, I'm referring to Scott Alexander. I originally made this as a comment on Substack, but I thought people on LW/AF might be interested.

TLDR: The Representation Engineering paper doesn’t demonstrate that the method they introduce adds much value on top of using linear probes (linear classifiers), which is an extremely well known method. That said, I think that the framing and the empirical method presented in the paper are still useful contributions.

I think your description of Representation Engineering considerably overstates the empirical contribution of representation engineering over existing methods. In particular, rather than comparing the method to looking for neurons with particular properties and using these neurons to determine what the model is "thinking" (which probably works poorly), I think the natural comparison is to training a linear classifier on the model’s internal activations using normal SGD (also called a linear probe)[1]. Training a linear classifier like this is an extremely well known technique in the literature. As far as I can tell, when they do compare to just training a linear classifier in section 5.1, it works just as well for the purpose of “reading”. (Though I’m confused about exactly what they are comparing in this section as they claim that all of these methods are LAT. Additionally, from my understanding, this single experiment shouldn’t provide that much evidence overall about which methods work well.)

I expect that training a linear classifier performs about as well as the method introduced in the Representation Engineering paper for the "mind reading" use cases you discuss. (That said, training a linear classifier might be less sample efficient (require more data) in practice, but this doesn't seem like a serious blocker for the use cases you mention.)
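
For reference, here is roughly what I mean by training a linear probe on internal activations. This is a minimal sketch with made-up details (a stand-in model, an arbitrary layer, toy labeled data), not the setup from the paper:

```python
# Sketch of a linear probe: logistic regression on residual-stream activations.
# "gpt2", the layer choice, and the toy labeled data are illustrative placeholders.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary layer choice

def last_token_activation(text):
    """Residual-stream activation at LAYER for the final token of the text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Hypothetical labeled data: 1 = text contains a falsehood, 0 = text is accurate.
texts = [
    "The Eiffel Tower is in Berlin.", "The Eiffel Tower is in Paris.",
    "Water boils at 10 degrees Celsius.", "Water boils at 100 degrees Celsius.",
]
labels = [1, 0, 1, 0]

X = torch.stack([last_token_activation(t) for t in texts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# "Reading": classify a new text from its internal activations.
new_x = last_token_activation("The moon is made of cheese.").numpy().reshape(1, -1)
print(probe.predict(new_x))
```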

One difference between normal linear classifier training and the method found in the representation engineering paper is that they also demonstrate using the direction they find to edit the model. For instance, see this response by Dan H. to a similar objection about the method being similar to linear probes. Training a linear classifier in a standard way probably doesn't work as well for editing/controlling the model (I believe they show that training a linear classifier doesn’t work well for controlling the model in section 5.1), but it's unclear how much we should care if we're just using the classifier rather than doing editing (more discussion on this below).[2]

If we care about the editing/control use case intrinsically, then we should compare to normal fine-tuning baselines. For instance, normal supervised next-token prediction on examples with desirable behavior or DPO.[3]
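
Concretely, the fine-tuning baseline I have in mind is just ordinary supervised next-token prediction on demonstrations of the desired behavior. A minimal sketch (the model, data, and hyperparameters are placeholders):

```python
# Minimal sketch of the fine-tuning baseline: supervised next-token prediction on demonstrations
# of the desired behavior. "gpt2", the demos, and the hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical demonstrations of the behavior we want.
demos = [
    "Q: Is the claim in this report accurate? A: No, the revenue figure is overstated.",
    "Q: Did the experiment support the hypothesis? A: Only partially; the effect was small.",
]

model.train()
for epoch in range(3):
    for text in demos:
        inputs = tokenizer(text, return_tensors="pt")
        # Passing input_ids as labels gives the standard causal LM (next-token prediction) loss.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```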

Are simple classifiers useful?

Ok, but regardless of the contribution of the representation engineering paper, do I think that simple classifiers (found using whatever method) applied to the internal activations of models could detect when those models are doing bad things? My view here is a bit complicated, but I think it’s at least plausible that these simple classifiers will work even though other methods fail. See here for a discussion of when I think linear classifiers might work despite other more baseline methods failing. It might also be worth reading the complexity penalty section of the ELK report.

Additionally, I think that the framing in the representation engineering paper is maybe an improvement over existing work and I agree with the authors that high-level/top-down techniques like this could be highly useful. (I just don’t think that the empirical work is adding as much value as you seem to indicate in the post.)

The main contributions

Here are what I see as the main contributions of the paper:

  • Clearly presenting a framework for using simple classifiers to detect things we might care about (e.g. power-seeking text).
  • Presenting a combined method for producing a classifier and editing/controlling the model in an integrated way, and discussing how control can be used for classifier validation and vice versa.
  • Demonstrating that in some cases labels aren’t required if we can construct a dataset where the classification of interest is the main axis of variation (see the sketch just after this list). (This was also demonstrated in the CCS paper, but the representation engineering work demonstrates this in more cases.)
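
To illustrate the last bullet, here is a rough sketch of this sort of unsupervised direction-finding: collect activations on a dataset constructed so that the concept of interest is (hopefully) the dominant axis of variation, and take the top principal component as the direction. This is not the paper's exact method; the model, layer, and texts are made-up placeholders.

```python
# Rough sketch of unsupervised direction-finding: take the top principal component of activations
# on a dataset built so that the concept of interest is (hopefully) the main axis of variation.
# "gpt2", the layer, and the texts are made-up placeholders; this is not the paper's exact method.
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary

def last_token_activation(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# No labels: the dataset is constructed so that honesty vs. dishonesty dominates the variation.
texts = [
    "Pretend you're an honest person making statements about the world.",
    "Pretend you're a dishonest person making statements about the world.",
    "Pretend you're a truthful assistant answering questions.",
    "Pretend you're a deceptive assistant answering questions.",
]

acts = torch.stack([last_token_activation(t) for t in texts]).numpy()
direction = PCA(n_components=1).fit(acts).components_[0]  # candidate concept direction

# Score a new text by projecting its activation onto the direction (sign/scale need calibration).
score = float(last_token_activation("I never said that, honest.").numpy() @ direction)
print(score)
```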

Based on their results, I think the method they introduce is reasonably likely to be a more sample efficient (less data required for training) editing/control method than prior methods for many applications. It might also be more sample efficient for producing a classifier. That said, I’m not sure we should care very much about sample efficiency. Additionally, the classifier/editing might have other nice properties which prior methods don’t have (though they don’t clearly demonstrate either of these in the paper AFAICT).

Is it important that we can use our classifier for control/editing?

As far as the classifier produced by this method having nice properties goes, the fact that our classifier also allows for editing/control might indicate that the classifier we get has better properties (see the paper itself (section 3.1.2) and e.g. here for discussion), but I’d guess this is either only a moderate improvement or has no effect in practice. And as far as I can tell, the paper doesn’t demonstrate cases where prior methods for training a classifier on the internal activations yield poor results, but their method clearly works well. These cases might exist, but I’m somewhat skeptical that this is very common. Future work could find hard cases where we want a particular generalization and demonstrate that this method or modifications of this method work better than other approaches.

Does the editing method they introduce have nice properties because it also allows for reading? Let’s consider using the representation engineering approach for reading and controlling/editing the properties of lying versus honesty. Assuming the method works as desired, then I would guess that the reading/classifier corresponds to reading off “does the model think there is lying in this text (or even at this position in the text)” and the control/editing corresponds to “make the model think that there is lying earlier in this text so that it conditions on this and does more lying (similarly to how using a few-shot prompt with lies might make the model more likely to lie)”. Note that these reading and control methods likely do not directly correspond to “the model thinking that it is about to lie”: the properties of “I have already lied (or my few-shot prompt contains lies)” and “I am about to lie” are different.


  1. Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting. ↩︎

  2. Separately, I believe there are known techniques in the literature for constructing a linear classifier such that the direction will work for editing. For instance, we could just use the difference between the mean activations for the two classes we're trying to classify, which is equivalent to the ActAdd technique and also rhymes nicely with LEACE. I assume this is a well known technique for making a classifier in the literature, but I don’t know if prior work has demonstrated using this both as a classifier and as a method for model editing. (The results in section 5.1 seem to indicate that this mean difference method combined with LEACE works well, but I’m not sure how much evidence this experiment provides.) ↩︎

  3. I expect that normal fine-tuning (or DPO) might be less sample efficient than the method introduced in the Representation Engineering paper for controlling/editing models, but I don't think they actually run this comparison? Separately, it’s unclear how much we care about sample efficiency. ↩︎