[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
- In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) penalty would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we'd expect, and use this to derive a better sparsity penalty that improves SAE quality?
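As a quick sketch of the correspondence (the MAP view of sparse coding: the penalty is, up to constants, the negative log prior over activations):

```latex
% Minimizing  ||x - Da||^2 + S(a)  is MAP inference under  p(a) \propto e^{-S(a)},
% so  S(a) = -\log p(a) + \mathrm{const}:
\begin{align*}
\text{Laplace:} \quad p(a) &\propto e^{-\lambda|a|} &\Rightarrow\quad S(a) &= \lambda|a| \quad (L_1) \\
\text{Cauchy:} \quad p(a) &\propto \frac{1}{1+a^2} &\Rightarrow\quad S(a) &= \log(1+a^2)
\end{align*}
```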
This is very interesting! What prior does log(1+|a|)...
[Nix] Toy model of feature splitting
- There are at least two explanations for feature splitting I find plausible:
- Activations lie on higher-dimensional manifolds in feature space, and feature splitting is a symptom of one higher-dimensional, mostly-continuous feature being chunked into discrete features at different resolutions (a toy sketch of this follows the list).
- There is a finite number of highly related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance...
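As a minimal sketch of the first hypothesis (numpy only; k-means centroids stand in for an SAE dictionary, so every name and number here is a hypothetical illustration rather than a real SAE):

```python
import numpy as np

rng = np.random.default_rng(0)

# One continuous underlying "feature": a parameter t along a 1-D curve
# embedded in a 64-dimensional activation space (no wraparound).
t = rng.uniform(0.0, np.pi, size=5000)
basis = np.linalg.qr(rng.normal(size=(64, 2)))[0]          # orthonormal 2-D subspace
acts = np.stack([np.cos(t), np.sin(t)], axis=1) @ basis.T  # shape (5000, 64)

def kmeans(x, k, iters=50):
    """Tiny k-means; centroids play the role of discrete dictionary features."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((x[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = x[assign == j].mean(axis=0)
    return assign

# "Small SAE" vs "large SAE": the same manifold gets chunked at two resolutions.
for k in (4, 16):
    assign = kmeans(acts, k)
    arcs = [np.ptp(t[assign == j]) for j in range(k) if np.any(assign == j)]
    print(f"k={k:2d}: median extent of t covered per feature = {np.median(arcs):.2f}")
# Larger dictionaries give features that each cover a narrower slice of the same
# continuous variable: splitting as resolution, not as a new ontology.
```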
Progress Measures for Grokking via Mechanistic Interpretability (Neel Nanda et al.) - nothing important in mech interp has properly built on this IMO, but there's just a ton of gorgeous results in there. I think it's the most (only?) truly rigorous reverse-engineering work out there.
Totally agree that this has gorgeous results, and this is what got me into mech interp in the first place! Re "most (only?) truly rigorous reverse-engineering work out there": I think the clock and pizza paper seems comparably rigorous, and there's also my recent Compact Pr...
I believe what you describe is effectively Causal Scrubbing. Edit: Note that it is not exactly the same as causal scrubbing, which instead looks at the activations for another input sampled at random.
On our particular model, doing this replacement shows that the noise bound is actually about 4 standard deviations worse than random, probably because the training procedure (sequences chosen uniformly at random) means we care a lot more about large possible maxes than small ones. (See Appendix H.1.2 for some very sparse details.)
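For concreteness, here is a minimal PyTorch sketch of this kind of replacement (a whole-layer resample ablation; real causal scrubbing resamples per hypothesis, and the function and argument names here are hypothetical):

```python
import torch

def resample_replace(model, layer, x, x_rand):
    """Run `x` through `model`, but replace `layer`'s activations with the
    activations that `layer` produces on a randomly sampled input `x_rand`."""
    cache = {}

    def record(module, inputs, output):
        cache["acts"] = output

    handle = layer.register_forward_hook(record)
    with torch.no_grad():
        model(x_rand)              # first pass: record activations on x_rand
    handle.remove()

    def patch(module, inputs, output):
        return cache["acts"]       # returning a value overrides the output

    handle = layer.register_forward_hook(patch)
    with torch.no_grad():
        out = model(x)             # second pass: x with patched activations
    handle.remove()
    return out
```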
On o...
We propose a simple fix: Use sqrt(|a|) instead of |a| in the sparsity penalty (i.e. penalize sum_i sqrt(|a_i|) rather than the L1 norm sum_i |a_i|), which seems to be a Pareto improvement over L1 (at least in some real models, though results might be mixed) in terms of the number of features required to achieve a given reconstruction error.
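A sketch of how the swap looks in an SAE training loss, assuming a standard ReLU SAE (the coefficient and epsilon values are hypothetical choices):

```python
import torch

def sae_loss(x, x_hat, acts, sparsity_coeff=1e-3, eps=1e-8):
    """Reconstruction error plus the proposed sparsity term. `acts` are the
    non-negative feature activations of a ReLU SAE."""
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    # sqrt(|a|) in place of the usual L1 term acts.abs().sum(dim=-1).mean();
    # eps keeps the gradient finite when activations are exactly zero.
    sparsity = torch.sqrt(acts.abs() + eps).sum(dim=-1).mean()
    return recon + sparsity_coeff * sparsity
```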
When I was discussing better sparsity penalties with Lawrence, and the fact that I observed some instability with the sqrt(|a|) penalty in toy models of superposition, he pointed out that the gradient of the sqrt norm explodes near zero, meaning that features with "small errors" that cause them to h...
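Lawrence's point in numbers: d/da sqrt(a) = 1/(2*sqrt(a)), which diverges as a -> 0+, while the L1 gradient stays constant:

```python
import numpy as np

a = np.array([1e-1, 1e-2, 1e-4, 1e-8])
print("d/da |a|      :", np.ones_like(a))    # L1 gradient: constant 1
print("d/da sqrt(a)  :", 0.5 / np.sqrt(a))   # 1.58, 5.0, 50.0, 5000.0
```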
"explanation of (network, dataset)": I'm afraid I don't have a great formalish definition beyond just pointing at the intuitive notion.
What's wrong with "proof" as a formal definition of explanation (of the behavior of a network on a dataset)? I claim that description length works pretty well on "formal proof"; I'm in the process of producing a write-up on results exploring this.
I believe the closest research to this topic is under the heading "Performative Power" (cf., e.g., this arXiv paper). I think "The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power" by Shoshana Zuboff is also a pretty good book that seems related.