AI ALIGNMENT FORUM

All of Lee Sharkey's Comments + Replies

Attribution-based parameter decomposition

seems great for mechanistic anomaly detection! very intuitive to map ADP to surprise accounting (I was vaguely trying to get at a method like ADP here)

Agree! I'd be excited by work that uses APD for MAD, or even just work that applies APD to Boolean circuit networks. We did consider using them as a toy model at various points, but ultimately opted to go for other toy models instead.

(btw typo: *APD)

Attribution-based parameter decomposition

Lee Sharkey2mo11

IMO most exciting mech-interp research since SAEs, great work

I think so too! (assuming it can be made more robust and scaled, which I think it can)
And thanks! :)

Attribution-based parameter decomposition

Lee Sharkey2mo30

We're aware of model diffing work like this, but I wasn't aware of this particular paper.

It's probably an edge case: They do happen both to be in weight space and to be suggestive of weight space linearity. Indeed, our work was informed by various observations from a range of areas that suggest weight space linearity (some listed here).

On the other hand, our work focused on decomposing a given network's parameters. But the line of work you linked above seems more in pursuit of model editing and understanding the difference between two similar models, rathe...

Showing SAE Latents Are Not Atomic Using Meta-SAEs

Lee Sharkey7mo80

It would be interesting to meditate on the question "What kind of training procedure could you use to get a meta-SAE directly?" And I think answering this relies in part on mathematical specification of what you want.

At Apollo we're currently working on something that we think will achieve this. Hopefully will have an idea and a few early results (toy models only) to share soon.

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey7mo10

So I believe I had in mind "active means [is achieved deliberately through the agent's actions]".

I think your distinction makes sense. And if I ever end up updating this article I would consider incorporating it. However, I think the reason I didn't make this distinction at the time is because the difference is pretty subtle.

The mechanisms I labelled as "strictly active" are the kind of strategy that it would be extremely improbable to implement successfully without some sort of coherent internal representations to help orchestrate the actions requir...

1Joe Collman7mo

To be clear, I only think this becomes obviously significant in a train-against-interpretability-tools context; if we're not training against tools, the things I'm highlighting here seem intuitively highly unlikely. This still seems important, given that the combination [x is clearly a bad idea] and [x is the easiest path forward] does not robustly result in [nobody does x]. That said, I think I'd disagree on one word of the following: Any internal representations etc. need to result in the actions being orchestrated (to the extent necessary). They don't need to be used for the purpose of orchestrating the actions. They don't need to be [clearly selected for the purpose of orchestrating such actions]. In some cases this might be improbable without active, deliberate malign selection - but this would need to be argued for. In at least the case of [manipulating the interpreter], this doesn't seem at all improbable, given the right selection pressure. (some manipulation seems obviously possible; whether it's enough for a particular malign strategy would vary) I haven't thought about this for long, but in general the [this kind of thing ~must require active deliberation] intuition seems highly suspicious to me. All I need to set up a not-directly-deliberated-on malign strategy so that it's implemented via benign cognition is a sufficient set of innocuous correlations (both of [innocuous action tends to have useful-for-malign-strategy effect] and [innocuous actions will tend to occur in these predictable patterns]). If I have unlimited passive selection pressure against tools that robustly detect active malign deliberation, but not 'instinctive' malign strategies, I'd strongly expect the outcome to be 'instinctive' active manipulation (etc.). (assuming of course that successful manipulation is instrumentally useful) Whether I'd expect this in practice would come down to how much selection pressure were required for a particular [malign, active-with-no-malign-cognit

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Lee Sharkey8mo22

Extremely glad to see this! The Guez et al. model has long struck me as one of the best instances of a mesaoptimizer and it was a real shame that it was closed source. Looking forward to the interp findings!

1Adrià Garriga-Alonso8mo

Thank you!! I agree it's a really good mesa-optimizer candidate; it remains to be seen exactly how good. It's a shame that I only found out about it about a year ago :)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey8mo10

Thanks! Fixed now

Transcoders enable fine-grained interpretable circuit analysis for language models

Lee Sharkey11mo10

I'm pretty sure that there's at least one other MATS group (unrelated to us) currently working on this, although I'm not certain about any of the details. Hopefully they release their research soon!

There's recent work published on this here by Chris Mathwin, Dennis Akar, and me. The gated attention block is a kind of transcoder adapted for attention blocks.

Nice work by the way! I think this is a promising direction.

Note also the similar, but substantially different, use of the term transcoder here, whose problems were pointed out to me by...

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey1y10

Trying to summarize my current understanding of what you're saying:

Yes all four sound right to me.
To avoid any confusion, I'd just add an emphasis that the descriptions are mathematical, as opposed to semantic.

I'd guess you have intuitions that the "short description length" framing is philosophically the right one, and I probably don't quite share those and feel more confused how to best think about "short descriptions" if we don't just allow arbitrary Turing machines (basically because deciding what allowable "parts" or mathematical objects are seems

Lee Sharkey1y10

Hm I think of the (network, dataset) as scaling multiplicatively with size of network and size of dataset. In the thread with Erik above, I touched a little bit on why:
"SAEs (or decompiled networks that use SAEs as the building block) are supposed to approximate the original network behaviour. So SAEs are mathematical descriptions of the network, but not of the (network, dataset). What's a mathematical description of the (network, dataset), then? It's just what you get when you pass the dataset through the network; this datum interacts with thi...

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey1y10

Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering.

I'll register that I prefer using 'description' instead of 'explanation' in most places. The reason is that 'explanation' invokes a notion of understanding, which requ...

0Jason Gross1y

What's wrong with "proof" as a formal definition of explanation (of behavior of a network on a dataset)? I claim that description length works pretty well on "formal proof"; I'm in the process of producing a write-up on results exploring this.

2Erik Jenner1y

Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout. Trying to summarize my current understanding of what you're saying:

* SAEs themselves aren't meant to be descriptions of (network, dataset). (I'd just misinterpreted your earlier comment.)
* As a description of just the network, SAEs have a higher description length than a naive neuron-based description of the network.
* Given a description of the network in terms of "parts," we can get a description of (network, dataset) by listing out which "parts" are "active" on each sample. I assume we then "compress" this description somehow (e.g. grouping similar samples), since otherwise the description would always have size linear in the dataset size?
* You're then claiming that SAEs are a particularly short description of (network, dataset) in this sense (since they're optimized for not having many parts active).

My confusion mainly comes down to defining the words in quotes above, i.e. "parts", "active", and "compress". My sense is that they are playing a pretty crucial role and that there are important conceptual issues with formalizing them. (So it's not just that we have a great intuition and it's just annoying to spell it out mathematically, I'm not convinced we even have a good intuitive understanding of what these things should mean.)

That said, my sense is you're not claiming any of this is easy to define. I'd guess you have intuitions that the "short description length" framing is philosophically the right one, and I probably don't quite share those and feel more confused how to best think about "short descriptions" if we don't just allow arbitrary Turing machines (basically because deciding what allowable "parts" or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I'm happy to keep trying a bit more in case you're excited to explain).

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey1y*20

Thanks Erik :) And I'm glad you raised this.

One of the things that many researchers I've talked to don't appreciate is that, if we accept networks can do computation in superposition, then we also have to accept that we can't just understand the network alone. We want to understand the network's behaviour on a dataset, where the dataset contains potentially lots of features. And depending on the features that are active in a given datum, the network can do different computations in superposition (unlike in a linear network that can't do s...

1Erik Jenner1y

Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering. As naive examples that probably don't work (similar to the ones from my original comment):

* We could consider any Turing machine that approximately outputs (network, dataset) an "explanation", but it seems very likely that SAEs aren't competitive with short TMs of this form (obviously this isn't a fair comparison)
* We could consider fixed computational graphs made out of linear maps and count the number of parameters. I think your objection to this is that these don't "explain the dataset"? (but then I'm not sure in what sense SAEs do)
* We could consider arithmetic circuits that approximate the network on the dataset, and count the number of edges in the circuit to get "description length". This might give some advantage to SAEs if you can get sparse weights in the sparse basis, seems like the best attempt out of these three. But it seems very unclear to me that SAEs are better in this sense than even the original network (let alone stuff like pruning).

Focusing instead on what an "explanation" is: would you say the network itself is an "explanation of (network, dataset)" and just has high description length? If not, then the thing I don't understand is more about what an explanation is and why SAEs are one, rather than how you measure description length.

ETA: On re-reading, the following quote makes me think the issue is that I don't understand what you mean by "the explanation" (is there a single objective explanation of any given network? If so, what is it?) But I'll leave the rest in case it helps clarify where I'm confused.

2Ryan Greenblatt1y

Only by a constant factor with Chinchilla scaling laws, right (e.g. maybe 20x more tokens than params)? And spiritually, we only need to understand behavior on the training dataset to understand everything that SGD has taught the model.

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey1y20

So, for models that are 10 terabytes in size, you should perhaps be expecting a "model manual" which is around 10 terabytes in size.

Yep, that seems reasonable.
I'm guessing you're not satisfied with the retort that we should expect AIs to do the heavy lifting here?

Or perhaps you don't think you need something which is close in accuracy to a full explanation of the network's behavior.

I think the accuracy you need will depend on your use case. I don't think of it as a globally applicable quantity for all of interp.

For instance, maybe to 'aud...

3Ryan Greenblatt1y

I think this presents a plausible approach and is likely needed for ambitious bottom up interp. So this seems like a reasonable plan. I just think that it's worth acknowledging that "short description length" and "sparse" don't result in something which is overall small in an absolute sense.

Sparsify: A mechanistic interpretability research agenda

Lee Sharkey1y31

Thanks for this feedback! I agree that the task & demo you suggested should be of interest to those working on the agenda.

It makes me a bit worried that this post seems to implicitly assume that SAEs work well at their stated purpose.

There were a few purposes proposed, and at multiple levels of abstraction, e.g.

The purpose of being the main building block of a mathematical description used in an ambitious mech interp solution
The purpose of being the main building block of decompiled networks
The purpose of taking features out of superposition

I'm g...

Ryan Greenblatt1y*46

I'm curious if you believe that, even if SAEs aren't the right solution, there realistically exists a potential solution that would allow researchers to produce succinct, human-understandable explanations that allow for recovering >75% of the training compute of model components?

There isn't any clear reason to think this is impossible, but there are multiple reasons to think this is very, very hard.

I think highly ambitious bottom-up interpretability (which naturally pursues this sort of goal) seems like a decent bet overall, but seems unlikely to su...

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey1y10

Makes sense! Thanks!

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey1y10

Great! I'm curious, what was it about the sparsity penalty that you changed your mind about?

2Arthur Conmy1y

I previously thought that L1 penalties were just exactly what you wanted for sparse reconstruction. Thinking about your undershooting claim, I came up with a toy example that made it obvious to me that the Anthropic loss function was not optimal: suppose you are role-playing a single-feature SAE reconstructing the number 2, and are given loss equal to the squared error of your guess, plus the norm of your guess. Then, among guesses x>0, the loss (x-2)^2 + x is minimized at x=3/2, not 2.
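The toy example above can be checked numerically. This is only a minimal sketch: it assumes the "norm of your guess" is the absolute value |x| with a penalty coefficient of 1 (the comment does not state a coefficient), and the function name `toy_sae_loss` is made up for illustration.

```python
import numpy as np

def toy_sae_loss(x, target=2.0, l1_coeff=1.0):
    """Squared reconstruction error plus an L1 penalty on the guess.

    l1_coeff=1.0 is an assumption; the comment does not give a coefficient.
    """
    return (x - target) ** 2 + l1_coeff * np.abs(x)

# Evaluate the loss on a fine grid of non-negative guesses.
grid = np.linspace(0.0, 3.0, 3001)
best_guess = grid[np.argmin(toy_sae_loss(grid))]

# Closed form for x > 0: d/dx [(x - 2)^2 + x] = 2(x - 2) + 1 = 0, so x = 3/2.
closed_form = 2.0 - 0.5

print(best_guess, closed_form)
```

The grid minimum and the closed form agree: the L1 term pulls the best guess below the true value, which is the undershooting effect being discussed.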

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey1y*190
Review for 2022 Review

Comments on the outcomes of the post:

I'm reasonably happy with how this post turned out. I think it probably bought the Anthropic/superposition mechanistic interpretability agenda somewhere between 0.1 and 4 counterfactual months of progress, which feels like a win.
I think sparse autoencoders are likely to be a pretty central method in mechanistic interpretability work for the foreseeable future (which tbf is not very foreseeable).
Two parallel works used the method identified in the post (sparse autoencoders - SAEs) or a slight modification of it:
- Cunningham et al.

...

2Arthur Conmy1y

I really appreciated this retrospective, this changed my mind about the sparsity penalty, thanks!

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey2y20

Here is a reference that supports the claim using simulations https://royalsocietypublishing.org/doi/10.1098/rspb.2008.0877

But I think you're right to flag it - other references don't really support it as the main reason for stripes. https://www.nature.com/articles/ncomms4535

Announcing Apollo Research

Lee Sharkey2y10

Thanks Akash!

I agree that this feels neglected.

Markus Anderljung recently tweeted about some upcoming related work from Jide Alaga and Jonas Schuett: https://twitter.com/Manderljung/status/1663700498288115712

Looking forward to it coming out!

'Fundamental' vs 'applied' mechanistic interpretability research

Lee Sharkey2y*10

Bilinear layers - not confident at all! It might make structure more amenable to mathematical analysis so it might help? But as yet there aren't any empirical interpretability wins that have come from bilinear layers.

Dictionary learning - This is one of my main bets for comprehensive interpretability.

Other areas - I'm also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709

A small update to the Sparse Coding interim research report

Lee Sharkey2y20

No theoretical reason - The method we used in the Interim Report to combine the two losses into one metric was pretty cursed. It's probably just better to use L1 loss alone and reconstruction loss alone and then combine the findings. But having plots for both losses would have added more plots without much gain for the presentation. It also seemed to be the method with which it was hardest to discern the difference between full recovery and partial recovery, because the differences were kind of subtle. In future work, some way to use the losses to measure feature recovery will probably be re-introduced. It probably just won't be the way we used in the interim report.

A small update to the Sparse Coding interim research report

Lee Sharkey2y30

I strongly suspect this is the case too!

In fact, we might be able to speed up the learning of common features even further:

Pierre Peigné at SERIMATS has done some interesting work that looks at initialization schemes that speed up learning. If you initialize the autoencoders with a sample of datapoints (e.g. initialize the weights with a sample from the MLP activations dataset), each of which we assume to contain a linear combination of only a few of the ground truth features, then the initial phases of feature recovery are much faster*. We haven't ha...
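The data-based initialization described above could look roughly like the following. This is a hedged sketch and not the scheme from the work mentioned: the function name `init_dictionary_from_data`, the row normalization, and the random stand-in activations are all assumptions made for illustration.

```python
import numpy as np

def init_dictionary_from_data(activations, n_dict, seed=0):
    """Initialize dictionary rows from randomly sampled datapoints.

    activations: array of shape (n_samples, d_model), e.g. MLP activations.
    Returns an array of shape (n_dict, d_model) with unit-norm rows.
    Normalizing the rows is an assumption made here, not something the
    comment specifies.
    """
    rng = np.random.default_rng(seed)
    # Sample distinct datapoints when possible, otherwise with replacement.
    replace = n_dict > activations.shape[0]
    idx = rng.choice(activations.shape[0], size=n_dict, replace=replace)
    dictionary = activations[idx].astype(np.float64)
    norms = np.linalg.norm(dictionary, axis=1, keepdims=True)
    return dictionary / np.maximum(norms, 1e-12)

# Stand-in data: random vectors in place of a real MLP activations dataset.
fake_activations = np.random.default_rng(1).normal(size=(1000, 64))
W_init = init_dictionary_from_data(fake_activations, n_dict=256)
```

The idea is that if each datapoint is a sparse combination of ground truth features, a sampled datapoint already lies close to the span of a few true feature directions, so training starts nearer to a useful dictionary than it would from a random initialization.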

Deep Deceptiveness

Lee Sharkey2y*30

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven't observed others to notice on their own.

I just want to point out that I've written a long list of such obstacles in this article: Circumventing interpretability: How to defeat mind-readers

I believe the example of deep deception that Nate describes in this post is actually a combination of several methods described in that post.

I'll quote the parts of this post that correspond to particular interpretability circumvention methods in the ot...

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey2y10

Thanks for your interest!

The autoencoder losses reported are the train losses. And you're right to point at noise potentially being an issue. It's my strong suspicion that some of the problems in these results are due to there being an insufficient number of data points to train the autoencoders on LM data.

> I would also be interested to test a bit more if this method works on toy models which clearly don't have many features, such as a mixture of a dozen Gaussians, or random points in the unit square (where there is a lot of room "in the corne...

Why almost every RL agent does learned optimization

Lee Sharkey2y21

My usual starting point is “maybe people will make a model-based RL AGI / brain-like AGI”. Then this post is sorta saying “maybe that AGI will become better at planning by reading about murphyjitsu and operations management etc.”, or “maybe that AGI will become better at learning by reading Cal Newport and installing Anki etc.”. Both of those things are true, but to me, they don’t seem safety-relevant at all.

Hm, I don't think this quite captures what I view the post as saying.

Maybe what you’re thinking is: “Maybe Future Company X will program a

...

2Steve Byrnes2y

Thanks! See Section 3 here for why I think it would be a lot worse.

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey2y10

That's correct. 'Correlated features' could ambiguously mean "Feature x tends to activate when feature y activates" OR "When we generate feature direction x, its distribution is correlated with feature y's". I don't know if both happen in LMs. The former almost certainly does. The second doesn't really make sense in the context of LMs since features are learned, not sampled from a distribution.

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey2y20

There should be a neat theoretical reason for the clean power law where L1 loss becomes too big. But it doesn't make intuitive sense to me - it seems like if you just add some useless entries in the dictionary, the effect of losing one of the dimensions you do use on reconstruction loss won't change, so why should the point where L1 loss becomes too big change? So unless you have a bug (or some weird design choice that divides loss by number of dimensions), those extra dimensions would have to be changing something.

The L1 loss on the activations does indee...

[Interim research report] Taking features out of superposition with sparse autoencoders

Lee Sharkey2y20

In the toy datasets, the features have the same scale (uniform from zero to one when active multiplied by a unit vector). However in the NN case, there's no particular reason to think the feature scales are normalized very much (though maybe they're normalized a bit due to weight decay and similar). Is there some reason this is ok?

Hm it's a great point. There's no principled reason for it. Equivalently, there's no principled reason to expect the coefficients/activations for each feature to be on the same scale either. We should probably look into a ...
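To make the scale question concrete, here is a rough sketch of a toy dataset of the kind the quoted comment describes (coefficients uniform from zero to one when active, multiplied by a unit vector), with an optional per-feature scale to mimic un-normalized features. The function name `make_toy_data`, the independent activation probability `p_active`, the Gaussian-sampled directions, and the `feature_scales` argument are all assumptions for illustration, not details taken from the interim report.

```python
import numpy as np

def make_toy_data(n_samples, n_features, d_model, p_active=0.05,
                  feature_scales=None, seed=0):
    rng = np.random.default_rng(seed)
    # Ground truth feature directions: random unit vectors (assumed Gaussian-sampled).
    directions = rng.normal(size=(n_features, d_model))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Each feature is active independently with probability p_active (assumption).
    active = rng.random((n_samples, n_features)) < p_active
    # Active coefficients are uniform on [0, 1); inactive ones are zero.
    coeffs = active * rng.random((n_samples, n_features))
    if feature_scales is not None:
        # Optional per-feature scale, to mimic un-normalized feature scales.
        coeffs = coeffs * np.asarray(feature_scales)
    return coeffs @ directions, coeffs, directions

# Same-scale features, as in the toy setting described in the quote.
data, coeffs, directions = make_toy_data(500, 32, 16)

# Hypothetical spread of feature scales from 0.1 to 10.
scales = np.logspace(-1, 1, 32)
data_scaled, coeffs_scaled, _ = make_toy_data(500, 32, 16, feature_scales=scales)
```

Training the same autoencoder on `data` and on `data_scaled` would be one way to probe how much an L1 penalty on the coefficients depends on the features sharing a scale.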

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey3y20

I agree

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey3y42

This sounds really reasonable. I had only been thinking of a naive version of interpretability tools in the loss function that doesn't attempt to interpret the gradient descent process. I'd be genuinely enthusiastic about the strong version you outlined. I expect to think a lot about it in the near future.

Circumventing interpretability: How to defeat mind-readers

Lee Sharkey3y10

Thanks! Amended.