My current best guess for why base models refuse so much is that text like "Sorry, I can't help with that. I don't know how to" is actually extremely common on the internet. This is based on a discussion with Achyuta Rajaram on Twitter: https://x.com/ArthurConmy/status/1840514842098106527
This fits with our observations about how frequently LLaMA-1 performs incompetent refusals.
> Qwen2 was explicitly trained on synthetic data from Qwen1.5
~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~
EDITED TO ADD: "these [Qwen] models are utilized to synthesize high-quality pre-training data" is clear evidence, I am being silly.
All other techniques mentioned here (e.g. filtering, and adding more IT data at the end of training) still sound like models "trained to predict the next word on the internet" (I don't think it's an important detail whether the training samples are IID early versus late in training)
Is DM exploring this sort of stuff?
Yes. On the AGI safety and alignment team we are working on activation steering: for example, Alex Turner, who invented the technique with collaborators, is working on this, and the first author of "a few tokens deep" is currently interning on the Gemini Safety team mentioned in this post. We don't have hard and fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety, would see "safety practices we can test right now" as a research goal.
they [transcoders] take as input the pre-MLP activations, and then aim to represent the post-MLP activations of that MLP sublayer
I assumed this meant activations just before the GELU and just after the GELU, but looking at the code I think I was wrong. Could you rephrase to something like:
they take as input MLP block inputs (just after LayerNorm) and they output MLP block outputs (what is added to the residual stream)
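If it helps, here is a minimal NumPy sketch of the suggested phrasing (toy shapes and an untrained transcoder, purely illustrative): the transcoder reads the MLP block's input (just after LayerNorm) and is trained to match the MLP block's output (what gets added to the residual stream), rather than to reconstruct its own input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, d_tc = 16, 64, 128  # hypothetical toy dimensions

# A toy MLP block: resid -> LayerNorm -> W_in -> GELU -> W_out -> (added to resid)
def layernorm(x):
    x = x - x.mean(-1, keepdims=True)
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + 1e-5)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

W_in = rng.normal(size=(d_model, d_mlp)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_mlp, d_model)) / np.sqrt(d_mlp)

resid = rng.normal(size=(4, d_model))
mlp_in = layernorm(resid)               # transcoder INPUT: the MLP block's input
mlp_out = gelu(mlp_in @ W_in) @ W_out   # transcoder TARGET: the MLP block's output

# An (untrained) transcoder: a wide sparse layer mapping mlp_in -> mlp_out
W_enc = rng.normal(size=(d_model, d_tc)) * 0.1
W_dec = rng.normal(size=(d_tc, d_model)) * 0.1
features = np.maximum(mlp_in @ W_enc, 0.0)  # sparse feature activations
reconstruction = features @ W_dec           # trained to match mlp_out, not mlp_in

assert reconstruction.shape == mlp_out.shape
```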
Ah yeah, Neel's comment makes no claims about feature death beyond Pythia-2.8B residual streams. I trained 524K-width Pythia-2.8B MLP SAEs with <5% feature death (not in the paper), and Anthropic's work gets to >1M live features (with no claims about interpretability), which together would make me surprised if 131K were near the maximum possible number of live features, even in small models.
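For concreteness, here is one generic way to measure feature death (a sketch, not the exact procedure from any of these papers): count the fraction of SAE features that never fire over a large sample of activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_features = 10_000, 512  # hypothetical sizes

# Hypothetical SAE feature activations (post-ReLU), so entries are >= 0.
# We zero out a known block of columns to simulate dead features.
acts = np.maximum(rng.normal(size=(n_tokens, n_features)), 0.0)
acts[:, :50] = 0.0  # simulate 50 dead features

fired = (acts > 0).any(axis=0)  # did each feature ever activate?
dead_frac = 1.0 - fired.mean()

print(f"dead features: {dead_frac:.1%}")  # -> dead features: 9.8%
```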
I don't think zero ablation is that great a baseline. We're mostly using it for continuity's sake with Anthropic's prior work (and also it's a bit easier to explain than a mean ablation baseline which requires specifying where the mean is calculated from). In the updated paper https://arxiv.org/pdf/2404.16014v2 (up in a few hours) we show all the CE loss numbers for anyone to scale how they wish.
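As an illustration of why the baseline choice matters (the CE loss numbers below are hypothetical, not from the paper): the "loss recovered" metric is computed relative to an ablation baseline, so swapping zero ablation for mean ablation changes the headline number even with the same SAE.

```python
def loss_recovered(ce_clean, ce_sae, ce_ablate):
    """Fraction of CE loss recovered by the SAE, relative to an ablation baseline.

    1.0 means splicing in the SAE reconstruction matches the clean model;
    0.0 means it is no better than ablating the site entirely.
    """
    return (ce_ablate - ce_sae) / (ce_ablate - ce_clean)

# Hypothetical numbers for one site:
ce_clean = 3.20      # model's usual CE loss
ce_sae = 3.35        # CE loss with the SAE reconstruction spliced in
ce_zero_abl = 7.10   # CE loss when zero-ablating the site
ce_mean_abl = 5.40   # CE loss when replacing the site with its dataset mean

print(round(loss_recovered(ce_clean, ce_sae, ce_zero_abl), 3))  # -> 0.962
print(round(loss_recovered(ce_clean, ce_sae, ce_mean_abl), 3))  # -> 0.932
```

Reporting the raw CE loss numbers, as in the updated paper, sidesteps this dependence on the baseline entirely.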
I don't think compute efficiency hit[1] is ideal. It's really expensive to compute, since you can't just calculate it from an SAE alone as you need to know fa...
I'm not sure what you mean by "the reinitialization approach" but feature death doesn't seem to be a major issue at the moment. At all sites besides L27, our Gemma-7B SAEs didn't have much feature death at all (stats at https://arxiv.org/pdf/2404.16014v2 up in a few hours), and also the Anthropic update suggests even in small models the problem can be addressed.
The "This should be cited" part of Dan H's comment was edited in after the authors' reply. I think this is in bad faith, since it masks an accusation of duplicate work as a request for work to be cited.
On the other hand, the post's authors did not act in bad faith, since they were responding to an accusation of duplicate work (not to a request to improve the work).
(The authors made me aware of this fact)
I think this discussion is sad, since it seems both sides assume bad faith from the other side. On one hand, I think Dan H and Andy Zou have improved the post by suggesting writing about related work, and signal-boosting the bypassing refusal result, so should be acknowledged in the post (IMO) rather than downvoted for some reason. I think that credit assignment was originally done poorly here (see e.g. "Citing others" from this Chris Olah blog post), but the authors resolved this when pushed.
But on the other hand, "Section 6.2 of the RepE paper shows exac...
We use learning rate 0.0003 for all Gated SAE experiments, and also for the GELU-1L baseline experiment. We generated this value by sweeping for the optimal learning rate for the baseline SAE on GELU-1L.
For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by , motivated by wanting better hyperparameter transfer, and so changed the learning rate to 0.001 or 0.00075 for all the runs (currently in Figure 1, only attention output pre-linear uses 0.00075; in the rerelease we'll state all the values used). We didn't see n...
Oh oops, thanks so much. We'll update the paper accordingly. Nit: it's actually
(it's just minimizing a quadratic)
ETA: the reason we have complicated equations is that we didn't compute during training (this quantity is kinda weird). However, you can compute from quantities that are usually tracked in SAE training. Specifically, and all terms here are clearly helpful to track in SAE training.
It's very impressive that this technique could be used alongside existing finetuning tools.
> According to our data, this technique stacks additively with both finetuning
To check my understanding, the evidence for this claim in the paper is Figure 13, where your method stacks with finetuning to increase sycophancy. But there are not currently results on decreasing sycophancy (or any other bad capability), where you show your method stacks with finetuning, right?
(AFAICT currently Figure 13 shows some evidence that activation addition to reduce sycophancy outcompetes finetuning, but you're unsure about the statistical significance due to the low percentages involved)
I previously thought that L1 penalties were just exactly what you wanted for sparse reconstruction.
Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function is not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and are given loss equal to the squared error of your guess plus the absolute value (L1 norm) of your guess. Then, restricted to guesses x > 0, the loss is minimized at x = 3/2, not 2.
I really appreciated this retrospective, this changed my mind about the sparsity penalty, thanks!
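The toy example above can be checked numerically in a few lines (a quick sketch):

```python
import numpy as np

# Toy single-feature "SAE": guess x to reconstruct the target value 2,
# with loss = squared error + L1 penalty on the guess.
target = 2.0
xs = np.linspace(0.0, 3.0, 300001)
loss = (xs - target) ** 2 + np.abs(xs)

x_star = xs[np.argmin(loss)]
print(round(x_star, 4))  # -> 1.5, not 2.0: the L1 penalty biases the guess low
```

Analytically: for x > 0 the derivative is 2(x - 2) + 1, which is zero at x = 3/2.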
Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!
No this isn’t about center_unembed, it’s about center_writing_weights as explained here: https://github.com/neelnanda-io/TransformerLens/blob/main/further_comments.md#centering-writing-weights-center_writing_weight
This is turned on by default in TransformerLens, so I think there must be something else weird about the models, rather than just a naive bias term, that causes you to need to take the difference.
> Can we just add in times the activations for "Love" to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions.
Do you have evidence for this? It's totally unsurprising to me that you need to do this on HuggingFace models as the residual stream is very likely to have a constant bias term which you will not want to add to. I saw you used TransformerLens for some part of the project and TL removes the mean from all additions to the residual stream ...
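A toy NumPy sketch of the bias argument (this assumes, hypothetically, that cached activations share a large constant bias component, which is my guess about HuggingFace models here): the bias cancels in the difference of two activation vectors, but a single addition injects a second copy of it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
bias = rng.normal(size=d_model) * 10.0  # hypothetical constant residual-stream component

love_signal = rng.normal(size=d_model)  # the "content" we actually want to add
hate_signal = rng.normal(size=d_model)

act_love = love_signal + bias  # what you'd cache from a forward pass
act_hate = hate_signal + bias

resid = rng.normal(size=d_model) + bias  # residual stream at the injection site

# Single addition: injects a second copy of the large bias term
single = resid + act_love
# Paired addition: the bias cancels, leaving only the content difference
paired = resid + (act_love - act_hate)

print(np.allclose(paired - resid, love_signal - hate_signal))  # -> True
```

If TransformerLens mean-centers writing weights, the cached activations would already lack this bias, which is why I'd expect the single-addition version to behave differently there.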
I think this point was really overstated. I get the impression the rejected papers were basically converted into the arXiv format as fast as possible, and so it was easy for the mods to tell this. However, I've seen submissions to cs.LG like this and this that are clearly from the alignment community. These posts are also not stellar by the standards of preprint formatting, yet apparently were not rejected.
Does the “ground truth” show the correct label function on 100% of the training and test data? If so, what’s the relevance of the transformer that imperfectly implements the label function?
I think work that compares base language models to their fine-tuned or RLHF-trained successors is likely to be very valuable, because i) this post highlights some concrete things that change during training in these models, and ii) some believe that a lot of the risk from language models comes from these further training steps.
If anyone is interested, I think the various fine-tuned and base models surveyed here seem like the best open-source resource, at least before CarperAI release some RLHF models.
I don't understand the new unacceptability penalty footnote. In both of the $P_M$ terms, there is no conditional $|$ sign. I presume the comma is wrong?
Also, using $\mathbb{B}$ for $\{\text{True}, \text{False}\}$ was not standard for me; I think it should be defined.
Thanks for the comment!
I have spent some time trying to do mechanistic interpretability on GPT-Neo, to try to answer whether compensation only occurs because of dropout. TLDR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.
In depth, when GPT-Neo is fed a sequence of tokens $t_1, \dots, t_{2n}$ where $t_1, \dots, t_n$ are uniformly random and $t_{n+i} = t_i$ for $1 \le i \le n$, there are four heads in Layer 6 that have the induction attention pattern (i.e. attend from ...
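For reference, a sketch of this setup with an idealized induction attention pattern (toy sizes and a synthetic pattern, not GPT-Neo's actual attention):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # half-length of the sequence
first_half = rng.integers(0, 50_000, size=n)
tokens = np.concatenate([first_half, first_half])  # second half repeats the first

# A perfect induction head at position i in the second half attends to
# position i - n + 1: the token AFTER the previous occurrence of tokens[i].
seq_len = 2 * n
pattern = np.zeros((seq_len, seq_len))
for i in range(n, seq_len):
    pattern[i, i - n + 1] = 1.0

# Induction score: average attention paid to the "previous occurrence + 1" slot
score = np.mean([pattern[i, i - n + 1] for i in range(n, seq_len)])
print(score)  # -> 1.0 for this idealized pattern
```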
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood's interpretability approach here, another example of "recruiting resources outside of the model alone".
(however, it doesn't seem obvious to me that interpretability can't or won't work in such settings)
What happened to the unrestricted adversarial examples challenge? The GitHub [1] hasn't had an update since 2020, and even that was only to the warmup challenge. Additionally, were there any takeaways from the contest?
[1] https://github.com/openphilanthropy/unrestricted-adversarial-examples
This comes from the fact that you assumed "adversarial example" had a more specific definition than it really does (going by the ML literature), right? Note that the Alignment Forum definition of "adversarial example" includes the misclassified panda as an example.
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”
Does this mean something like:
1. People who join scaling labs can have their values drift, and future safety employers will suspect by default that ex-scaling-lab staff have had their values drift, or
2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon
or something else entirely?