My current best guess for why base models refuse so much is that text like "Sorry, I can't help with that. I don't know how to" is actually extremely common on the internet. This is based on a discussion with Achyuta Rajaram on Twitter: https://x.com/ArthurConmy/status/1840514842098106527
This fits with our observations about how frequently LLaMA-1 performs incompetent refusals.
> Qwen2 was explicitly trained on synthetic data from Qwen1.5
~~Where is the evidence for this claim? (Claude 3.5 Sonnet could also not find evidence on one rollout)~~
EDITED TO ADD: "these [Qwen] models are utilized to synthesize high-quality pre-training data" is clear evidence, I am being silly.
All other techniques mentioned here (e.g. filtering, and adding more IT data at the end of training) still sound like models "trained to predict the next word on the internet" (I don't think it's an important detail whether the training samples are IID early versus late in training)
Is DM exploring this sort of stuff?
Yes. On the AGI safety and alignment team we are working on activation steering: for example, Alex Turner, who invented the technique with collaborators, is working on this, and the first author of "a few tokens deep" is currently interning on the Gemini Safety team mentioned in this post. We don't have hard and fast lines between what counts as Gemini Safety and what counts as AGI safety and alignment, but several projects on AGI safety and alignment, and most projects on Gemini Safety, would see "safety practices we can test right now" as a research goal.
they [transcoders] take as input the pre-MLP activations, and then aim to represent the post-MLP activations of that MLP sublayer
I assumed this meant activations just before the GELU and just after the GELU, but looking at the code I think I was wrong. Could you rephrase to something like:
they take as input MLP block inputs (just after LayerNorm) and they output MLP block outputs (what is added to the residual stream)
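If it helps, here is a minimal NumPy sketch of the suggested phrasing (toy shapes and an untrained transcoder, purely illustrative): the transcoder reads the MLP block's input (just after LayerNorm) and is trained to match the MLP block's output (what gets added to the residual stream), rather than to reconstruct its own input.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mlp, d_tc = 16, 64, 128  # hypothetical toy dimensions

# A toy MLP block: resid -> LayerNorm -> W_in -> GELU -> W_out -> (added to resid)
def layernorm(x):
    x = x - x.mean(-1, keepdims=True)
    return x / np.sqrt((x ** 2).mean(-1, keepdims=True) + 1e-5)

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

W_in = rng.normal(size=(d_model, d_mlp)) / np.sqrt(d_model)
W_out = rng.normal(size=(d_mlp, d_model)) / np.sqrt(d_mlp)

resid = rng.normal(size=(4, d_model))
mlp_in = layernorm(resid)               # transcoder INPUT: the MLP block's input
mlp_out = gelu(mlp_in @ W_in) @ W_out   # transcoder TARGET: the MLP block's output

# An (untrained) transcoder: a wide sparse layer mapping mlp_in -> mlp_out
W_enc = rng.normal(size=(d_model, d_tc)) * 0.1
W_dec = rng.normal(size=(d_tc, d_model)) * 0.1
features = np.maximum(mlp_in @ W_enc, 0.0)  # sparse feature activations
reconstruction = features @ W_dec           # trained to match mlp_out, not mlp_in

assert reconstruction.shape == mlp_out.shape
```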
Ah yeah, Neel's comment makes no claims about feature death beyond Pythia-2.8B residual streams. I trained 524K-width Pythia-2.8B MLP SAEs with <5% feature death (not in the paper), and Anthropic's work gets to >1M live features (with no claims about interpretability), which together would make me surprised if 131K were near the maximum possible number of live features, even in small models.
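For concreteness, here is one generic way to measure feature death (a sketch, not the exact procedure from any of these papers): count the fraction of SAE features that never fire over a large sample of activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_features = 10_000, 512  # hypothetical sizes

# Hypothetical SAE feature activations (post-ReLU), so entries are >= 0.
# We zero out a known block of columns to simulate dead features.
acts = np.maximum(rng.normal(size=(n_tokens, n_features)), 0.0)
acts[:, :50] = 0.0  # simulate 50 dead features

fired = (acts > 0).any(axis=0)  # did each feature ever activate?
dead_frac = 1.0 - fired.mean()

print(f"dead features: {dead_frac:.1%}")  # -> dead features: 9.8%
```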
I don't think zero ablation is that great a baseline. We're mostly using it for continuity's sake with Anthropic's prior work (and also it's a bit easier to explain than a mean ablation baseline which requires specifying where the mean is calculated from). In the updated paper https://arxiv.org/pdf/2404.16014v2 (up in a few hours) we show all the CE loss numbers for anyone to scale how they wish.
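As an illustration of why the baseline choice matters (the CE loss numbers below are hypothetical, not from the paper): the "loss recovered" metric is computed relative to an ablation baseline, so swapping zero ablation for mean ablation changes the headline number even with the same SAE.

```python
def loss_recovered(ce_clean, ce_sae, ce_ablate):
    """Fraction of CE loss recovered by the SAE, relative to an ablation baseline.

    1.0 means splicing in the SAE reconstruction matches the clean model;
    0.0 means it is no better than ablating the site entirely.
    """
    return (ce_ablate - ce_sae) / (ce_ablate - ce_clean)

# Hypothetical numbers for one site:
ce_clean = 3.20      # model's usual CE loss
ce_sae = 3.35        # CE loss with the SAE reconstruction spliced in
ce_zero_abl = 7.10   # CE loss when zero-ablating the site
ce_mean_abl = 5.40   # CE loss when replacing the site with its dataset mean

print(round(loss_recovered(ce_clean, ce_sae, ce_zero_abl), 3))  # -> 0.962
print(round(loss_recovered(ce_clean, ce_sae, ce_mean_abl), 3))  # -> 0.932
```

Reporting the raw CE loss numbers, as in the updated paper, sidesteps this dependence on the baseline entirely.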
I don't think compute efficiency hit[1] is ideal. It's really expensive to compute, since you can't just calculate it from an SAE alone as you need to know fa...
I'm not sure what you mean by "the reinitialization approach" but feature death doesn't seem to be a major issue at the moment. At all sites besides L27, our Gemma-7B SAEs didn't have much feature death at all (stats at https://arxiv.org/pdf/2404.16014v2 up in a few hours), and also the Anthropic update suggests even in small models the problem can be addressed.
The "This should be cited" part of Dan H's comment was edited in after the authors' reply. I think this is in bad faith, since it masks an accusation of duplicate work as a request for work to be cited.
On the other hand, the post's authors did not act in bad faith, since they were responding to an accusation of duplicate work (not to a request to improve the work).
(The authors made me aware of this fact)
I think this discussion is sad, since it seems both sides assume bad faith from the other side. On one hand, I think Dan H and Andy Zou have improved the post by suggesting writing about related work, and signal-boosting the bypassing refusal result, so should be acknowledged in the post (IMO) rather than downvoted for some reason. I think that credit assignment was originally done poorly here (see e.g. "Citing others" from this Chris Olah blog post), but the authors resolved this when pushed.
But on the other hand, "Section 6.2 of the RepE paper shows exac...
We use learning rate 0.0003 for all Gated SAE experiments, and also for the GELU-1L baseline experiment. We generated this value by sweeping for the optimal learning rate for the baseline SAE on GELU-1L.
For the Pythia-2.8B and Gemma-7B baseline SAE experiments, we divided the L2 loss by , motivated by wanting better hyperparameter transfer, and so changed the learning rate to 0.001 or 0.00075 for all the runs (currently in Figure 1, only attention output pre-linear uses 0.00075; in the rerelease we'll state all the values used). We didn't see n...
Oh oops, thanks so much. We'll update the paper accordingly. Nit: it's actually
(it's just minimizing a quadratic)
ETA: the reason we have complicated equations is that we didn't compute during training (this quantity is kinda weird). However, you can compute from quantities that are usually tracked in SAE training. Specifically, and all terms here are clearly helpful to track in SAE training.
It's very impressive that this technique could be used alongside existing finetuning tools.
> According to our data, this technique stacks additively with both finetuning
To check my understanding, the evidence for this claim in the paper is Figure 13, where your method stacks with finetuning to increase sycophancy. But there are not currently results on decreasing sycophancy (or any other bad capability), where you show your method stacks with finetuning, right?
(AFAICT currently Figure 13 shows some evidence that activation addition to reduce sycophancy outcompetes finetuning, but you're unsure about the statistical significance due to the low percentages involved)
I previously thought that L1 penalties were just exactly what you wanted for sparse reconstruction.
Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function is not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and are given loss equal to the squared error of your guess plus the absolute value (L1 norm) of your guess. Then, restricted to guesses x > 0, the loss is minimized at x = 3/2, not 2.
I really appreciated this retrospective, this changed my mind about the sparsity penalty, thanks!
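The toy example above can be checked numerically in a few lines (a quick sketch):

```python
import numpy as np

# Toy single-feature "SAE": guess x to reconstruct the target value 2,
# with loss = squared error + L1 penalty on the guess.
target = 2.0
xs = np.linspace(0.0, 3.0, 300001)
loss = (xs - target) ** 2 + np.abs(xs)

x_star = xs[np.argmin(loss)]
print(round(x_star, 4))  # -> 1.5, not 2.0: the L1 penalty biases the guess low
```

Analytically: for x > 0 the derivative is 2(x - 2) + 1, which is zero at x = 3/2.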
Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!
No this isn’t about center_unembed, it’s about center_writing_weights as explained here: https://github.com/neelnanda-io/TransformerLens/blob/main/further_comments.md#centering-writing-weights-center_writing_weight
This is turned on by default in TransformerLens, so I think there must be something else weird about the models, rather than just a naive bias term, that causes you to need to take the difference.
> Can we just add in times the activations for "Love" to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions.
Do you have evidence for this? It's totally unsurprising to me that you need to do this on HuggingFace models as the residual stream is very likely to have a constant bias term which you will not want to add to. I saw you used TransformerLens for some part of the project and TL removes the mean from all additions to the residual stream ...
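A toy NumPy sketch of the bias argument (this assumes, hypothetically, that cached activations share a large constant bias component, which is my guess about HuggingFace models here): the bias cancels in the difference of two activation vectors, but a single addition injects a second copy of it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
bias = rng.normal(size=d_model) * 10.0  # hypothetical constant residual-stream component

love_signal = rng.normal(size=d_model)  # the "content" we actually want to add
hate_signal = rng.normal(size=d_model)

act_love = love_signal + bias  # what you'd cache from a forward pass
act_hate = hate_signal + bias

resid = rng.normal(size=d_model) + bias  # residual stream at the injection site

# Single addition: injects a second copy of the large bias term
single = resid + act_love
# Paired addition: the bias cancels, leaving only the content difference
paired = resid + (act_love - act_hate)

print(np.allclose(paired - resid, love_signal - hate_signal))  # -> True
```

If TransformerLens mean-centers writing weights, the cached activations would already lack this bias, which is why I'd expect the single-addition version to behave differently there.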
I think this point was really overstated. I get the impression the rejected papers were basically converted into the arXiv format as fast as possible, and so it was easy for the mods to tell this. However, I've seen submissions to cs.LG like this and this that are clearly from the alignment community. These posts are also not stellar by the standards of preprint formatting, yet apparently were not rejected.
Does the “ground truth” show the correct label function on 100% of the training and test data? If so, what’s the relevance of the transformer that imperfectly implements the label function?
I think work that compares base language models to their fine-tuned or RLHF-trained successors is likely to be very valuable, because i) this post highlights some concrete things that change during training in these models, and ii) some believe that a lot of the risk from language models comes from these further training steps.
If anyone is interested, I think the various fine-tuned and base models surveyed here seem like the best open-source resource, at least before CarperAI release some RLHF models.
I don't understand the new unacceptability penalty footnote. In both of the $P_M$ terms, there is no conditional $|$ sign. I presume the comma is wrong?
Also, using $\mathbb{B}$ for $\{\text{True}, \text{False}\}$ was not standard for me; I think it should be defined.
Thanks for the comment!
I have spent some time trying to do mechanistic interpretability on GPT-Neo, to try to answer whether compensation only occurs because of dropout. TLDR: my current impression is that compensation still occurs in models trained without dropout, just to a lesser extent.
In depth, when GPT-Neo is fed a sequence of tokens $t_1, \dots, t_{2n}$ where $t_1, \dots, t_n$ are uniformly random and $t_{n+i} = t_i$ for $1 \le i \le n$, there are four heads in Layer 6 that have the induction attention pattern (i.e. attend from ...
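For reference, a sketch of this setup with an idealized induction attention pattern (toy sizes and a synthetic pattern, not GPT-Neo's actual attention):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # half-length of the sequence
first_half = rng.integers(0, 50_000, size=n)
tokens = np.concatenate([first_half, first_half])  # second half repeats the first

# A perfect induction head at position i in the second half attends to
# position i - n + 1: the token AFTER the previous occurrence of tokens[i].
seq_len = 2 * n
pattern = np.zeros((seq_len, seq_len))
for i in range(n, seq_len):
    pattern[i, i - n + 1] = 1.0

# Induction score: average attention paid to the "previous occurrence + 1" slot
score = np.mean([pattern[i, i - n + 1] for i in range(n, seq_len)])
print(score)  # -> 1.0 for this idealized pattern
```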
Disclaimer: I work on interpretability at Redwood Research. I am also very interested in hearing a fleshed-out version of this critique. To me, this is related to the critique of Redwood's interpretability approach here, another example of "recruiting resources outside of the model alone".
(however, it doesn't seem obvious to me that interpretability can't or won't work in such settings)
What happened to the unrestricted adversarial examples challenge? The GitHub [1] hasn't had an update since 2020, and even that was only to the warmup challenge. Additionally, were there any takeaways from the contest?
[1] https://github.com/openphilanthropy/unrestricted-adversarial-examples
This comes from the fact that you assumed "adversarial example" had a more specific definition than it really does (going by the ML literature), right? Note that the Alignment Forum definition of "adversarial example" includes the misclassified panda as an example.
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”
Does this mean something like:
1. People who join scaling labs can have their values drift, and future safety employers will suspect by default that ex-scaling-lab staff have had their values drift, or
2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon
or something else entirely?