All of Sam Marks's Comments + Replies

Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper. 

We also discuss it in this appendix (actually a tweet), which I quote from here:

Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this,

... (read more)

Thanks for writing this reflection, I found it useful.

Just to quickly comment on my own epistemic state here:

  1. I haven't read GD.
  2. But I've been stewing on some of (what I think are) the same ideas for the last few months, when William Brandon first made (what I think are) similar arguments to me in October.
    1. (You can judge from this Twitter discussion whether I seem to get the core ideas)
  3. When I first heard these arguments, they struck me as quite important and outside of the wheelhouse of previous thinking on risks from AI development. I think they raise concer
... (read more)

Thanks, this is helpful. So it sounds like you expect

  1. AI progress which is slower than the historical trendline (though perhaps fast in absolute terms) because we'll soon have finished eating through the hardware overhang
  2. separately, takeover-capable AI soon (i.e. before hardware manufacturers have had a chance to scale substantially).

It seems like all the action is taking place in (2). Even if (1) is wrong (i.e. even if we see substantially increased hardware production soon), that makes takeover-capable AI happen faster than expected; IIUC, this contradict... (read more)

2Ryan Greenblatt
I don't think (2) is a crux (as discussed in person). I expect that if takeover-capable AI takes a while (e.g. it happens in 2040), then we will have a long winter where economic value from AI doesn't increase that fast followed a period of faster progress around 2040. If progress is relatively stable accross this entire period, then we'll have enough time to scale up fabs. Even if progress isn't stable, we could see enough total value from AI in the slower growth period to scale up to scale up fabs by 10x, but this would require >>$1 trillion of economic value per year I think (which IMO seems not that likely to come far before takeover-capable AI due to views about economic returns to AI and returns to scaling up compute).

I really like the framing here, of asking whether we'll see massive compute automation before [AI capability level we're interested in]. I often hear people discuss nearby questions using IMO much more confusing abstractions, for example:

  • "How much is AI capabilities driven by algorithmic progress?" (problem: obscures dependence of algorithmic progress on compute for experimentation)
  • "How much AI progress can we get 'purely from elicitation'?" (lots of problems, e.g. that eliciting a capability might first require a (possibly one-time) expenditure of compute
... (read more)

I put roughly 50% probability on feasibility of software-only singularity.[1]

(I'm probably going to be reinventing a bunch of the compute-centric takeoff model in slightly different ways below, but I think it's faster to partially reinvent than to dig up the material, and I probably do use a slightly different approach.)

My argument here will be a bit sloppy and might contain some errors. Sorry about this. I might be more careful in the future.

The key question for software-only singularity is: "When the rate of labor production is doubled (as in, as if your... (read more)

6Ryan Greenblatt
I think your outline of an argument against contains an important error. Importantly, while the spending on hardware for individual AI companies has increased by roughly 3-4x each year[1], this has not been driven by scaling up hardware production by 3-4x per year. Instead, total compute production (in terms of spending, building more fabs, etc.) has been increased by a much smaller amount each year, but a higher and higher fraction of that compute production was used for AI. In particular, my understanding is that roughly ~20% of TSMC's volume is now AI while it used to be much lower. So, the fact that scaling up hardware production is much slower than scaling up algorithms hasn't bitten yet and this isn't factored into the historical trends. Another way to put this is that the exact current regime can't go on. If trends continue, then >100% of TSMC's volume will be used for AI by 2027! Only if building takeover-capable AIs happens by scaling up TSMC to >1000% of what their potential FLOP output volume would have otherwise been, then does this count as "massive compute automation" in my operationalization. (And without such a large build-out, the economic impacts and dependency of the hardware supply chain (at the critical points) could be relatively small.) So, massive compute automation requires something substantially off trend from TSMC's perspective. [Low importance] It is only possible to build takeover-capable AI without previously breaking an important trend prior to around 2030 (based on my rough understanding). Either the hardware spending trend must break or TSMC production must go substantially above the trend by then. If takeover-capable AI is built prior to 2030, it could occur without substantial trend breaks but this gets somewhat crazy towards the end of the timeline: hardware spending keeps increasing at ~3x for each actor (but there is some consolidation and acquisition of previously produced hardware yielding a one-time increase up to about
2Ryan Greenblatt
I think this happening in practice is about 60% likely, so I don't think feasibility vs. in practice is a huge delta.

The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they're working with small models and adding a handful of SAE calls to a forward pass shouldn't be too big a hit.)

3Buck Shlegeris
This uses transformers, which is IIUC way less efficient for inference than e.g. vllm, to an extent that is probably unacceptable for production usecases.

@Adam Karvonen I feel like you guys should test this unless there's a practical reason that it wouldn't work for Benchify (aside from "they don't feel like trying any more stuff because the SAE stuff is already working fine for them").

5Buck Shlegeris
I’m curious how they set up the SAE stuff; I’d have thought that this would require modifying some performance-critical inference code in a tricky way.

I'm guessing you'd need to rejection sample entire blocks, not just lines. But yeah, good point, I'm also curious about this. Maybe the proportion of responses that use regexes is too large for rejection sampling to work? @Adam Karvonen 

4Sam Marks
@Adam Karvonen I feel like you guys should test this unless there's a practical reason that it wouldn't work for Benchify (aside from "they don't feel like trying any more stuff because the SAE stuff is already working fine for them").

x-posting a kinda rambling thread I wrote about this blog post from Tilde research.

---

If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don't use regexes. A big milestone for the field of interpretability!

I'll discussed some things that surprised me about this case study in 🧵

---

The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem already tried and discarded baseline approaches like better prompt ... (read more)

6Buck Shlegeris
Isn't it easy to detect regexes in model outputs and rejection sample lines that contain regexes? This requires some custom sampling code if you want optimal latency/throughput, but the SAEs also require that.

While I agree the example in Sycophancy to Subterfuge isn't realistic, I don't follow how the architecture you describe here precludes it. I think a pretty realistic set-up for training an agent via RL would involve computing scalar rewards on the execution machine or some other machine that could be compromised from the execution machine (with the scalar rewards being sent back to the inference machine for backprop and parameter updates).

Why would it 2x the cost of inference? To be clear, my suggested baseline is "attach exactly the same LoRA adapters that were used for RR, plus one additional linear classification head, then train on an objective which is similar to RR but where the rerouting loss is replaced by a classification loss for the classification head." Explicitly this is to test the hypothesis that RR only worked better than HP because it was optimizing more parameters (but isn't otherwise meaningfully different from probing).

(Note that LoRA adapters can be merged into model we... (read more)

3Fabien Roger
I was imagining doing two forward passes: one with and one without the LoRAs, but you had in mind adding "keep behavior the same" loss in addition to the classification loss, right? I guess that would work, good point.

Thanks to the authors for the additional experiments and code, and to you for your replication and write-up!

IIUC, for RR makes use of LoRA adapters whereas HP is only a LR probe, meaning that RR is optimizing over a more expressive space. Does it seem likely to you that RR would beat an HP implementation that jointly optimizes LoRA adapters + a linear classification head (out of some layer) so that the model retains performance while also having the linear probe function as a good harmfulness classifier?

(It's been a bit since I read the paper, so sorry if I'm missing something here.)

6Fabien Roger
I quickly tried a LoRA-based classifier, and got worse results than with linear probing. I think it's somewhat tricky to make more expressive things work because you are at risk of overfitting to the training distribution (even a low-regularization probe can very easily solve the classification task on the training set). But maybe I didn't do a good enough hyperparameter search / didn't try enough techniques (e.g. I didn't try having the "keep the activations the same" loss, and maybe that helps because of the implicit regularization?).
3Fabien Roger
Yeah, I expect that this kind of things might work, though this would 2x the cost of inference. An alternative is "attention head probes", MLP probes, and things like that (which don't increase inference cost), + maybe different training losses for the probe (here we train per-sequence position and aggregate with max), and I expect something in this reference class to work as well as RR, though it might require RR-levels of tuning to actually work as well as RR (which is why I don't consider this kind of probing as a baseline you ought to try).

there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future:

  1. As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion.

You're correct that I was neglecting this threat model—good point, and thanks.

So, given this overall uncertainty, it seems like we should have a much fuzzier update where higher numbers should actually update us.

Hmm, here's another way to fram... (read more)

4Ryan Greenblatt
Yep, reasonable summary. I don't think this was obvious to everyone and I appreciate this point - I edited my earlier comment to more explicitly note my agreement.

In this comment, I'll use reward tampering frequency (RTF) to refer to the proportion of the time the model reward tampers.

I think that in basically all of the discussion above, folks aren't using a correct mapping of RTF to practical importance. Reward hacking behaviors are positively reinforced once they occur in training; thus, there's a rapid transition in how worrying a given RTF is, based on when reward tampering becomes frequent enough that it's likely to appear during a production RL run.

To put this another way: imagine that this paper had trained ... (read more)

5Ryan Greenblatt
I agree that a few factors of 2 don't matter much at all, but I think highlighting a specific low threshold relative to the paper seems misguided as opposed to generally updating based on the level of egregiousness and rarity. (Where you should probably think about the rarity in log space.) (I think I made the point that a few factors of 2 shouldn't matter much for the bottom line above.) (Edit: I agree that it's worth noting that low RTFs can be quite concerning for the reasons you describe.) I'll argue against a specific threshold in the rest of this comment. ---------------------------------------- First, it is worth noting that there are multiple possible ways to interpret the final environment in the paper in terms of the analogy to the future: 1. As the catastrophic failure that results from reward hacking. In this case, we might care about frequency depending on the number of opportunities the model would have and the importance of collusion. 2. As an egregious reward hack that would be reinforced and would be harmful via occuring repeatedly (a la what failure looks like part 1). 3. As a reward hack that will generalize to some other failure eventually cascading into (1) or (2). (Implying this enviroment is not importantly different from the other environments in the paper, it is just part of a progression.) Your argument applies to (2) and (3), but not to (1). ---------------------------------------- I have a pretty broad uncertainty over how the real world empirical conditions will compare to the empirical conditions discussed in the paper. And, the empirical conditions in the paper clearly differ substantially from what we'd typically expect (much high fraction hacking, much dumber models). Examples of conditions that might differ: * Incentives against reward hacking or other blockers. * How much of a gradient toward reward hacking their is. * How good models are at doing reward hacking reasoning relative to the difficulty of doing so in v

I think this is cool! The way I'm currently thinking about this is "doing the adversary generation step of latent adversarial training without the adversarial training step." Does that seem right?

It seems intuitively plausible to me that once you have a latent adversarial perturbation (the vectors you identify), you might be able to do something interesting with it beyond "train against it" (as LAT does). E.g. maybe you would like to know that your model has a backdoor, beyond wanting to move to the next step of "train away the backdoor." If I were doing t... (read more)

1Andrew Mack
Thanks for your comment! Yes, I’d say that roughly sums things up. As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens. Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.” I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).

Oh, one other issue relating to this: in the paper it's claimed that if  is the argmin of  then  is the argmin of . However, this is not actually true: the argmin of the latter expression is . To get an intuition here, consider the case where  and  are very nearly perpendicular, with the angle between them just slightly less than . Then you should be able to convince yourself that the best factor to scale either  ... (read more)

1Senthooran Rajamanoharan
Hey Sam, thanks - you're right. The definition of reconstruction bias is actually the argmin of E[|^x/γ′−x|2] which I'd (incorrectly) rearranged as the expression in the paper. As a result, the optimum is γ−1=E[^x⋅x]/E[|^x|2] That being said, the derivation we gave was not quite right, as I'd incorrectly substituted the optimised loss rather than the original reconstruction loss, which makes equation (10) incorrect. However the difference between the two is small exactly when gamma is close to one (and indeed vanishes when there is no shrinkage), which is probably why we didn't pick this up. Anyway, we plan to correct these two equations and update the graphs, and will submit a revised version.
4Rohin Shah
This was actually the key motivation for building this metric in the first place, instead of just looking at the ratio E[||^x||2]E[||x||2]. Looking at the γ that would optimize the reconstruction loss ensures that we're capturing only bias from the L1 regularization, and not capturing the "inherent" need to shrink the vector given these nonzero angles. (In particular, if we computed E[||^x||2]E[||x||2] for Gated SAEs, I expect that would be below 1.) I think the main thing we got wrong is that we accidentally treated E[||^x−x||2] as though it were E[||^x−γx||2]. To the extent that was the main mistake, I think it explains why our results still look how we expected them to -- usually γ is going to be close to 1 (and should be almost exactly 1 if shrinkage is solved), so in practice the error introduced from this mistake is going to be extremely small. We're going to take a closer look at this tomorrow, check everything more carefully, and post an update after doing that. I think it's probably worth waiting for that -- I expect we'll provide much more detailed derivations that make everything a lot clearer.

Ah thanks, you're totally right -- that mostly resolves my confusion. I'm still a little bit dissatisfied, though, because the  term is optimizing for something that we don't especially want (i.e. for  to do a good job of reconstructing ). But I do see how you do need to have some sort of a reconstruction-esque term that actually allows gradients to pass through to the gated network.

3Senthooran Rajamanoharan
Yep, the intuition here indeed was that L1 penalised reconstruction seems to be okay for teaching a standard SAE's encoder to detect which features are on (even if features get shrunk as a result), so that is effectively what this auxiliary loss is teaching the gate sub-layer to do, alongside the sparsity penalty. (The key difference being we freeze the decoder in the auxiliary task, which the ablation study shows helps performance.) Maybe to put it another way, this was an auxiliary task that we had good evidence would teach the gate sublayer to detect active features reasonably well, and it turned out to give good results in practice. It's totally possible though that there are better auxiliary tasks (or even completely different loss functions) out there that we've not explored.

(The question in this comment is more narrow and probably not interesting to most people.)

The limitations section includes this paragraph:

One worry about increasing the expressivity of sparse autoencoders is that they will overfit when
reconstructing activations (Olah et al., 2023, Dictionary Learning Worries), since the underlying
model only uses simple MLPs and attention heads, and in particular lacks discontinuities such as step
functions. Overall we do not see evidence for this. Our evaluations use held-out test data and we
check for interpretability manua

... (read more)
5Neel Nanda
I haven't fully worked through the maths, but I think both IG and attribution patching break down here? The fundamental problem is that the discontinuity is invisible to IG because it only takes derivatives. Eg the ReLU and Jump ReLU below look identical from the perspective of IG, but not from the perspective of activation patching, I think.

I'm a bit perplexed by the choice of loss function for training GSAEs (given by equation (8) in the paper). The intuitive (to me) thing to do here would be would be to have the  and  terms, but not the  term, since the point of  is to tell you which features should be active, not to itself provide good feature coefficients for reconstructing . I can sort of see how not including this term might result in the coordinates of  all being extremely small (but barely posit... (read more)

3Rohin Shah
Possibly I'm missing something, but if you don't have Laux, then the only gradients to Wgate and bgate come from Lsparsity (the binarizing Heaviside activation function kills gradients from Lreconstruct), and so πgate would be always non-positive to get perfect zero sparsity loss. (That is, if you only optimize for L1 sparsity, the obvious solution is "none of the features are active".) (You could use a smooth activation function as the gate, e.g. an element-wise sigmoid, and then you could just stick with Lincorrect from the beginning of Section 3.2.2.)

I believe that equation (10) giving the analytical solution to the optimization problem defining the relative reconstruction bias is incorrect. I believe the correct expression should be .

You could compute this by differentiating equation (9), setting it equal to 0 and solving for . But here's a more geometrical argument.

By definition,  is the multiple of  closest to . Equivalently, this closest such vector can be described as the projection . Setting these equal, we get the... (read more)

2Arthur Conmy
Oh oops, thanks so much. We'll update the paper accordingly. Nit: it's actually  Ex∼D[˙x⋅x]Ex∼D[||x||22] (it's just minimizing a quadratic) ETA: the reason we have complicated equations is that we didn't compute Ex∼D[˙x⋅x] during training (this quantity is kinda weird). However, you can compute γ from quantities that are usually tracked in SAE training. Specifically, γ=12(1+E[||xˆ||22]−E[||x−xˆ||22]E[||x||22]) and all terms here are clearly helpful to track in SAE training.

Great work! Obviously the results here speak for themselves, but I especially wanted to complement the authors on the writing. I thought this paper was a pleasure to read, and easily a top 5% exemplar of clear technical writing. Thanks for putting in the effort on that.

I'll post a few questions as children to this comment.

1Sam Marks
(The question in this comment is more narrow and probably not interesting to most people.) The limitations section includes this paragraph: I'm not sure I understand the point about integrated gradients here. I understand this sentence as meaning: since model outputs are a discontinuous function of feature activations, integrated gradients will do a bad job of estimating the effect of patching feature activations to counterfactual values. If that interpretation is correct, then I guess I'm confused because I think IG actually handles this sort of thing pretty gracefully. As long as the number of intermediate points you're using is large enough that you're sampling points pretty close to the discontinuity on both sides, then your error won't be too large. This is in contrast to attribution patching which will have a pretty rough time here (but not really that much worse than with the normal ReLU encoders, I guess). (And maybe you also meant for this point to apply to attribution patching?)
3Sam Marks
I'm a bit perplexed by the choice of loss function for training GSAEs (given by equation (8) in the paper). The intuitive (to me) thing to do here would be would be to have the Lreconstruct and Lsparsity terms, but not the Laux term, since the point of πgate is to tell you which features should be active, not to itself provide good feature coefficients for reconstructing x. I can sort of see how not including this term might result in the coordinates of πgate all being extremely small (but barely positive when it's appropriate to use a feature), such that the sparsity term doesn't contribute much to the loss. Is that what goes wrong? Are there ablation experiments you can report for this? If so, including this Laux term still currently seems to me like a pretty unprincipled way to deal with this -- can the authors provide any flavor here? Here are two ways that I've come up with for thinking about this loss function -- let me know if either of these are on the right track. Let fgate,ReLU denote the gated encoder, but with a ReLU activation instead of Heaviside. Note then that fgate,ReLU is just the standard SAE encoder from Towards Monosemanticity. Perspective 1: The usual loss from Towards Monosemanticity for training SAEs is ∥x−^x(fgate,ReLU(x))∥22+λ∥fgate,ReLU(x)∥1 (this is the same as your Lsparsity and Laux up to the detaching thing). But now you have this magnitude network which needs to get a gradient signal. Let's do that by adding an additional term ∥x−^x(~f(x))∥22 -- your Lreconstruction. So under this perspective, it's the reconstruction term which is new, with the sparsity and auxiliary terms being carried over from the usual way of doing things. Perspective 2 (h/t Jannik Brinkmann): let's just add together the usual Towards Monosemanticity loss function for both the usual architecture and the new modified archiecture: L=Lreconstruction(~f)+Lreconstruction(~f)+Lsparsity(fgate,ReLU)+Lsparsity(fgate,ReLU). However, the gradients with respect to the sec
6Sam Marks
I believe that equation (10) giving the analytical solution to the optimization problem defining the relative reconstruction bias is incorrect. I believe the correct expression should be γ=Ex∼D[^x⋅x∥x∥22]. You could compute this by differentiating equation (9), setting it equal to 0 and solving for γ. But here's a more geometrical argument. By definition, γx is the multiple of x closest to ^x. Equivalently, this closest such vector can be described as the projection projx(^x)=^x⋅x∥x∥22x. Setting these equal, we get the claimed expression for γ. As a sanity check, when our vectors are 1-dimensional, x=1, and ^x=12, we my expression gives γ=12 (which is correct), but equation (10) in the paper gives 1√3.
2Neel Nanda
<3 Thanks so much, that's extremely kind. Credit entirely goes to Sen and Arthur, which is even more impressive given that they somehow took this from a blog post to a paper in a two week sprint! (including re-running all the experiments!!)

I'm pretty sure that you're not correct that the interpretation step from our SHIFT experiments essentially relies on using data from the Pile. I strongly expect that if we were to only use inputs from  then we would be able to interpret the SAE features about as well. E.g. some of the SAE features only activate on female pronouns, and we would be able to notice this. Technically, we wouldn't be able to rule out the hypothesis "this feature activates on female pronouns only when their antecedent is a nurse," but that would be a bit of a crazy h... (read more)

(Edits made. In the edited version, I think the only questionable things are the title and the line "[In this post, I will a]rticulate a class of approaches to scalable oversight I call cognition-based oversight." Maybe I should be even more careful and instead say that cognition-based oversight is merely something that "could be useful for scalable oversight," but I overall feel okay about this.

Everywhere else, I think the term "scalable oversight" is now used in the standard way.)

I (mostly; see below) agree that in this post I used the term "scalable oversight" in a way which is non-standard and, moreover, in conflict way the way I typically use the term personally. I also agree with the implicit meta-point that it's important to be careful about using terminology in a consistent way (though I probably don't think it's as important as you do). So overall, after reading this comment, I wish I had been more careful about how I treated the term "scalable oversight." After I post this comment, I'll make some edits for clarity, but I do... (read more)

1Sam Marks
(Edits made. In the edited version, I think the only questionable things are the title and the line "[In this post, I will a]rticulate a class of approaches to scalable oversight I call cognition-based oversight." Maybe I should be even more careful and instead say that cognition-based oversight is merely something that "could be useful for scalable oversight," but I overall feel okay about this. Everywhere else, I think the term "scalable oversight" is now used in the standard way.)

With the ITO experiments, my first guess would be that reoptimizing the sparse approximation problem is mostly relearning the encoder, but with some extra uninterpretable hacks for low activation levels that happen to improve reconstruction. In other words, I'm guessing that the boost in reconstruction accuracy (and therefore loss recovered) is mostly not due to better recognizing the presence of interpretable features, but by doing fiddly uninterpretable things at low activation levels.

I'm not really sure how to operationalize this into a prediction. Mayb... (read more)

Awesome stuff -- I think that updates like this (both from the GDM team and from Anthropic) are very useful for organizing work in this space. And I especially appreciate the way this was written, with both short summaries and in-depth write-ups.

Yep, as you say, @Logan Riggs figured out what's going on here: you evaluated your reconstruction loss on contexts of length 128, whereas I evaluated on contexts of arbitrary length. When I restrict to context length 128, I'm able to replicate your results.

Here's Logan's plot for one of your dictionaries (not sure which)

and here's my replication of Logan's plot for your layer 1 dictionary

Interestingly, this does not happen for my dictionaries! Here's the same plot but for my layer 1 residual stream output dictionary for pythia-70m-deduped

(Note that all thr... (read more)

My SAEs also have a tied decoder bias which is subtracted from the original activations. Here's the relevant code in dictionary.py

def encode(self, x):
        return nn.ReLU()(self.encoder(x - self.bias))
    
    def decode(self, f):
        return self.decoder(f) + self.bias
    
    def forward(self, x, output_features=False, ghost_mask=None):
            [...]
            f = self.encode(x)
            x_hat = self.decode(f)
            [...]
            return x_hat

Note that I checked that our SAEs have the same input-output behavior in my linked c... (read more)

2Joseph Isaac Bloom
Ahhh I see. Sorry I was way too hasty to jump at this as the explanation. Your code does use the tied decoder bias (and yeah, it was a little harder to read because of how your module is structured). It is strange how assuming that bug seemed to help on some of the SAEs but I ran my evals over all your residual stream SAE's and it only worked for some / not others and certainly didn't seem like a good explanation after I'd run it on more than one.  I've been talking to Logan Riggs who says he was able to load in my SAEs and saw fairly similar reconstruction performance to to me but that outside of the context length of 128 tokens, performance markedly decreases. He also mentioned your eval code uses very long prompts whereas mine limits to 128 tokens so this may be the main cause of the difference.  Logan mentioned you had discussed this with him so I'm guessing you've got more details on this than I have? I'll build some evals specifically to look at this in the future I think.  Scientifically, I am fairly surprised about the token length effect and want to try training on activations from much longer context sizes now. I have noticed (anecdotally) that the number of features I get sometimes increases over the prompt so an SAE trained on activations from shorter prompts are plausibly going to have a much easier time balancing reconstruction and sparsity, which might explain the generally lower MSE / higher reconstruction. Though we shouldn't really compare between models and with different levels of sparsity as we're likely to be at different locations on the pareto frontier.  One final note is that I'm excited to see whether performance on the first 128 tokens actually improves in SAEs trained on activations from > 128 token forward passes (since maybe the SAE becomes better in general).   

Another sanity check: when you compute CE loss using the same code that you use when computing CE loss when activations are reconstructed by the autoencoders, but instead of actually using the autoencoder you just plug the correct activations back in, do you get the same answer (~3.3) as when you evaluate CE loss normally?

1Joseph Isaac Bloom
* MSE Losses were in the WandB report (screenshot below). * I've loaded in your weights for one SAE and I get very bad performance (high L0, high L1, and bad MSE Loss) at first.  * It turns out that this is because my forward pass uses a tied decoder bias which is subtracted from the initial activations and added as part of the decoder forward pass. AFAICT, you don't do this.  * To verify this, I added the decoder bias to the activations of your SAE prior to running a forward pass with my code (to effectively remove the decoder bias subtraction from my method) and got reasonable results.  * I've screenshotted the Towards Monosemanticity results which describes the tied decoder bias below as well.  I'd be pretty interested in knowing if my SAEs seem good now based on your evals :) Hopefully this was the only issue.   

In the notebook I link in my original comment, I check that the activations I get out of nnsight are the same as the activations that come from transformer_lens. Together with the fact that our sparsity statistics broadly align, I'm guessing that the issue isn't that I'm extracting different activations than you are.

Repeating my replication attempt with data from OpenWebText, I get this:

Layer MSE Loss % Variance Explained L1 L0 % Alive CE Reconstructed
1 0.069 95 40 15 46 6.45
7 0.81 86 125 59.2 96 4.38

Broadly speaking, same story as above, e... (read more)

1Sam Marks
Another sanity check: when you compute CE loss using the same code that you use when computing CE loss when activations are reconstructed by the autoencoders, but instead of actually using the autoencoder you just plug the correct activations back in, do you get the same answer (~3.3) as when you evaluate CE loss normally?

I tried replicating your statistics using my own evaluation code (in evaluation.py here). I pseudo-randomly chose layer 1 and layer 7. Sadly, my results look rather different from yours:

Layer MSE Loss % Variance Explained L1 L0 % Alive CE Reconstructed
1 0.11 92 44 17.5 54 5.95
7 1.1 82 137 65.4 95 4.29

Places where our metrics agree: L1 and L0.

Places where our metrics disagree, but probably for a relatively benign reason:

  • Percent variance explained: my numbers are slightly lower than yours, and from a brief skim of your code I think it's bec
... (read more)
1Joseph Isaac Bloom
Oh no. I'll look into this and get back to you shortly. One obvious candidate is that I was reporting CE for some batch at the end of training that was very small and so the statistics likely had high variance and the last datapoint may have been fairly low. In retrospect I should have explicitly recalculated this again post training. However, I'll take a deeper dive now to see what's up. 

Some updates about the dictionary_learning repo:

  • The repo now has support for ghost grads. h/t g-w1 for submitting a PR for this
  • ActivationBuffers now work natively with model components -- like the residual stream -- whose activations are typically returned as tuples; the buffer knows to take the first component of the tuple (and will iteratively do this if working with nested tuples).
  • ActivationBuffers can now be stored on the GPU.
  • The file evaluation.py contains code for evaluating trained dictionaries. I've found this pretty useful for quickly evaluating d
... (read more)

Imo "true according to Alice" is nowhere near as "crazy" a feature as "has_true XOR has_banana". It seems useful for the LLM to model what is true according to Alice! (Possibly I'm misunderstanding what you mean by "crazy" here.)

I agree with this! (And it's what I was trying to say; sorry if I was unclear.) My point is that 
{ features which are as crazy as "true according to Alice" (i.e., not too crazy)} 
seems potentially manageable, where as 
{ features which are as crazy as arbitrary boolean functions of other features } 
seems totally unmanageable.

Thanks, as always, for the thoughtful replies.

Idk, I think it's pretty hard to know what things are and aren't useful for predicting the next token. For example, some of your features involve XORing with a "has_not" feature -- XORing with an indicator for "not" might be exactly what you want to do to capture the effect of the "not".

I agree that "the model has learned the algorithm 'always compute XORs with has_not'" is a pretty sensible hypothesis. (And might be useful to know, if true!) FWIW, the stronger example of "clearly not useful XORs" I was thinking of has_true XOR has_banana, where I'm guessi... (read more)

3Rohin Shah
Yeah certainly I'd expect the crazy XOR directions aren't too salient. Imo "true according to Alice" is nowhere near as "crazy" a feature as "has_true XOR has_banana". It seems useful for the LLM to model what is true according to Alice! (Possibly I'm misunderstanding what you mean by "crazy" here.) I'm not against linear probing techniques in general. I like linear probes, they seem like a very useful tool. I also like contrast pairs. But I would basically always use these techniques in a supervised way, because I don't see a great reason to expect unsupervised methods to work better. This is reasonable. My disagreement is mostly that I think LLMs are complicated things and do lots of incidental stuff we don't yet understand. So I shouldn't feel too surprised by any given observation that could be explained by an incidental hypothesis. But idk it doesn't seem like an important point.

I agree with a lot of this, but some notes:

Exponentially many features

[...]

On utility explanations, you would expect that multi-way XORs are much less useful for getting low loss than two-way XORs, and so computation for multi-way XORs is never developed.

The thing that's confusing here is that the two-way XORs that my experiments are looking at just seem clearly not useful for anything. So I think any utility explanation that's going to be correct needs to be a somewhat subtle one of the form "the model doesn't initially know which XORs will be useful, so ... (read more)

3Rohin Shah
Idk, I think it's pretty hard to know what things are and aren't useful for predicting the next token. For example, some of your features involve XORing with a "has_not" feature -- XORing with an indicator for "not" might be exactly what you want to do to capture the effect of the "not". (Tbc here the hypothesis could be "the model computes XORs with has_not all the time, and then uses only some of them", so it does have some aspect of "compute lots of XORs", but it is still a hypothesis that clearly by default doesn't produce multiway XORs.) ... That's exactly how I would rephrase the issue and I'm not clear on why you're making a sharp distinction here. I mean, I'd say the ones that are more like basic features are like that because it was useful, and it's all the other XORs that are explained by incidental hypotheses. The incidental hypotheses shouldn't be taken to be saying that all XORs are incidental, just the ones which aren't explained by utility. Perhaps a different way of putting it is that I expect both utility and incidental hypotheses to be true to some extent. Yes, on my model it could be something like the weights for basic features being large. It's not necessarily that simple, e.g. it could also be that the derived features are in superposition with a larger number of other features that leads to more interference. If you're calling that "tracking", fair enough I guess; my main claim is that it shouldn't be surprising. I agree it's a potential thread for distinguishing such features.

Thanks, you're totally right about the equal variance thing -- I had stupidly thought that the projection of  onto y = x would be uniform on  (obviously false!).

The case of a fully discrete distribution (supported in this case on four points) seems like a very special case of a something more general, where a "more typical" special case would be something like:

  • if a, b are both false, then sample from 
  • if a is true and b is false, then sample from 
  • if a is false and b is true then sample from 
... (read more)
1Vivek Hebbar
Oops, I misunderstood what you meant by unimodality earlier. Your comment seems broadly correct now (except for the variance thing). I would still guess that unimodality isn't precisely the right well-behavedness desideratum, but I retract the "directionally wrong".

Thanks, you're correct that my definition breaks in this case. I will say that this situation is a bit pathological for two reasons:

  1. The mode of a uniform distribution doesn't coincide with its mean.
  2. The variance of the multivariate uniform distribution  is largest along the direction , which is exactly the direction which we would want to represent a AND b.

I'm not sure exactly which assumptions should be imposed to avoid pathologies like this, but maybe something of the form: we are working with boolean features  ... (read more)

2Vivek Hebbar
The variance is actually the same in all directions.  One can sanity-check by integration that the variance is 1/12 both along the axis and along the diagonal. In fact, there's nothing special about the uniform distribution here: The variance should be independent of direction for any N-dimensional joint distribution where the N constituent distributions are independent and have equal variance.[1] The diagram in the post showing that "and" is linearly represented works if the features are represented discretely (so that there are exactly 4 points for 2 binary features, instead of a distribution for each combination).  As soon as you start defining features with thresholds like DanielVarga did, the argument stops going through in general, and the claim can become false. The stuff about unimodality doesn't seem relevant to me, and in fact seems directionally wrong. 1. ^ I have a not-fully-verbalized proof which I don't have time to write out

Neat hypothesis! Do you have any ideas for how one would experimentally test this?

1Hoagy
Well the substance of the claim is that when a model is calculating lots of things in superposition, these kinds of XORs arise naturally as a result of interference, so one thing to do might be to look at a small algorithmic dataset of some kind where there's a distinct set of features to learn and no reason to learn the XORs and see if you can still probe for them. It'd be interesting to see if there are some conditions under which this is/isn't true, e.g. if needing to learn more features makes the dependence between their calculation higher and the XORs more visible.  Maybe you could also go a bit more mathematical and hand-construct a set of weights which calculates a set of features in superposition so you can totally rule out any model effort being expended on calculating XORs and then see if they're still probe-able. Another thing you could do is to zero-out or max-ent the neurons/attention heads that are important for calculating the A feature, and see if you can still detect an A⊕B feature. I'm less confident in this because it might be too strong and delete even a 'legitimate' A⊕B feature or too weak and leave some signal in. This kind of interference also predicts that the A|B and A|¬B features should be similar and so the degree of separation/distance from the category boundary should be small. I think you've already shown this to some extent with the PCA stuff though some quantification of the distance to boundary would be interesting. Even if the model was allocating resource to computing these XORs you'd still probably expect them to be much less salient though so not sure if this gives much evidence either way.

Some features which are computed from other features should probably themselves be treated as basic and thus represented with large salience.

Using a dataset of 10,000 inputs of the form
[random LLaMA-13B generated text at temperature 0.8] [either the most likely next token or the 100th most likely next token, according to LLaMA-13B] ["true" or "false"] ["banana" or "shed"]
I've rerun the probing experiments. The possible labels are

  • has_true: is the second last token "true" or "false"?
  • has_banana: is the last token "banana" or "shed"?
  • label: is the third last token the most likely or the 100th most likely?

(this weird last option is because I'm adapting a dataset from the Geometry of Truth paper about... (read more)

Yes, you are correct, thanks. I'll edit the post when I get a chance.

If anyone would like to replicate these results, the code can be found in the rax branch of my geometry-of-truth repo. This was adapted from a codebase I used on a different project, so there's a lot of uneeded stuff in this repo. The important parts here are:

  • The datasets: cities_alice.csv and neg_cities_alice.csv (for the main experiments I describe), cities_distractor.csv and neg_cities_distractor.csv (for the experiments with banana/shed at the end of factual statements), and xor.csv (for the experiments with true/false and banana/shed after random text
... (read more)

Thanks for doing this! Can you share the dataset that you're working with? I'm traveling right now, but when I get a chance I might try to replicate your failed replication on LLaMA-2-13B and with my codebase (which can be found here; see especially xor_probing.ipynb).

1leogao
The training set is a random 100k subsample of this dataset: https://huggingface.co/datasets/amazon_polarity I'm prepending Alice/Bob and doing the xor of the label in exactly the same way you do.

Idk, I think I would guess that all of the most salient features will be things related to the meaning of the statement at a more basic level. E.g. things like: the statement is finished (i.e. isn't an ongoing sentence), the statement is in English, the statement ends in a word which is the name of a country, etc.

My intuition here is mostly based on looking at lots of max activating dataset examples for SAE features for smaller models (many of which relate to basic semantic categories for words or to basic syntax), so it could be bad here (both because of ... (read more)

There's 1500 statements in each of cities and neg_cities, and LLaMA-2-13B has residual stream dimension 5120. The linear probes are trained with vanilla logistic regression on {80% of the data in cities} \cup {80% of the data in neg_cities} and the accuracies reported are evaluated on {remaining 20% of the data in cities} \cup {remaining 20% of the data in neg_cities}.

So, yeah, I guess that the train and val sets are drawn from the same distribution but are not independent (because of the issue I mentioned in my comment above). Oops! I guess I never though... (read more)

Using a dataset of 10,000 inputs of the form
[random LLaMA-13B generated text at temperature 0.8] [either the most likely next token or the 100th most likely next token, according to LLaMA-13B] ["true" or "false"] ["banana" or "shed"]
I've rerun the probing experiments. The possible labels are

  • has_true: is the second last token "true" or "false"?
  • has_banana: is the last token "banana" or "shed"?
  • label: is the third last token the most likely or the 100th most likely?

(this weird last option is because I'm adapting a dataset from the Geometry of Truth paper about... (read more)

2Ryan Greenblatt
Yep I think I agree, I didn't understand the point you made about systematic anti-correlation originally. If I understand correctly the issues is something like: * There are 10 India related statements, exactly 5 of which are false and 5 of which are true. * We do a random split of all the data, so if there is more true+india in train there will be more false+india in test There are of course various fixes to make the data actually IID.

(Are you saying that you think factuality is one of the 50 most salient features when the model processes inputs like "The city of Chicago is not in Madagascar."? I think I'd be pretty surprised by this.)

(To be clear, factuality is one of the most salient feature relative to the cities/neg_cities datasets, but it seems like the right notion of salience here is relative to the full data distribution.)

2Ryan Greenblatt
Yes, that's what I'm saying. I think this is right? Note that we only need salience on one side between false and true, so "true vs false" is salient as long as "false" is salient. I would guess that "this is false" is very salient for this type of data even for a normal pretrained LLM. (Similarly, "this is english" isn't salient in a dataset of only english, but is salient in a dataset with both english and spanish: salience depends on variation. Really, the salient thing here is "this is spanish" or "this is false" and then the model will maybe XOR these with the other salient features. I think just doing the XOR on one "side" is sufficient for always being able to compute the XOR, but maybe I'm confused or thinking about this wrong.)

I'm not really sure, but I don't think this is that surprising. I think when we try to fit a probe to "label" (the truth value of the statement), this is probably like fitting a linear probe to random data. It might overfit on some token-level heuristic which is ideosyncratically good on the train set but generalizes poorly to the val set. E.g. if disproportionately many statements containing "India" are true on the train set, then it might learn to label statements containing "India" as true; but since in the full dataset, there is no correlation between "India" and being true, correlation between "India" and true in the val set will necessarily have the opposite sign.

1Ryan Greenblatt
Are the training and val sets not IID? Are they small enough that we either get serious overfit or huge error bars?

The thing that remains confusing here is that for arbitrary features like these, it's not obvious why the model is computing any nontrivial boolean function of them and storing it along a different direction. And if the answer is "the model computes this boolean function of arbitrary features" then the downstream consequences are the same, I think.

What is "this"? It sounds like you're gesturing at the same thing I discuss in the section "Maybe  is represented “incidentally” because it’s possible to aggregate noisy signals from many features which are correlated with boolean functions of a and b"

1Ryan Greenblatt
I edited my comment. I'm just trying to say that like how you get a∧b for free, you also get XOR for free if you compute anything else which is "linearly independent" frrom the components a and b. (For a slightly fuzzy notion of linear independence where we just need separability.)
2Sam Marks
The thing that remains confusing here is that for arbitrary features like these, it's not obvious why the model is computing any nontrivial boolean function of them and storing it along a different direction. And if the answer is "the model computes this boolean function of arbitrary features" then the downstream consequences are the same, I think.

Thanks! I'm still pretty confused though.

It sounds like you're making an empirical claim that in this banana/shed example, the model is representing the features , and  along linearly independent directions. Are you saying that this claim is supported by PCA visualizations you've done? Maybe I'm missing something, but none of the PCA visualizations I'm seeing in the paper seem to touch on this. E.g. visualization in figure 2(b) (reproduced below) is colored by , not ... (read more)

Are you saying that this claim is supported by PCA visualizations you've done?

Yes, but they're not in the paper. (I also don't remember if these visualizations were specifically on banana/shed or one of the many other distractor experiments we did.)

I'll say that I've done a lot of visualizing true/false datasets with PCA, and I've never noticed anything like this, though I never had as clean a distractor feature as banana/shed.

It is important for the distractor to be clean (otherwise PCA might pick up on other sources of variance in the activations as the ... (read more)

Load More