All of Logan Riggs's Comments + Replies

SAE reconstruction errors are (empirically) pathological

What a cool paper! Congrats!:)

What's cool:
1. e2e saes learn very different features every seed. I'm glad y'all checked! This seems bad.
2. e2e SAEs have worse intermediate reconstruction loss than local. I would've predicted the opposite actually.
3. e2e+downstream seems to get all the benefits of the e2e one (same perf at lower L0) at the same compute cost, w/o the "intermediate activations aren't similar" problem.

It looks like you've left for future work postraining SAE_local on KL or downstream loss as future work, but that's a very interesting part! Spec... (read more)

4Dan Braun1y

Thanks Logan! 2. Unlike local SAEs, our e2e SAEs aren't trained on reconstructing the current layer's activations. So at least my expectation was that they would get a worse reconstruction error at the current layer. Improving training times wasn't our focus for this paper, but I agree it would be interesting and expect there to be big gains to be made by doing things like mixing training between local and e2e+downstream and/or training multiple SAEs at once (depending on how you do this, you may need to be more careful about taking different pathways of computation to the original network). We didn't iterate on the e2e+downstream setup much. I think it's very likely that you could get similar performance by making tweaks like the ones you suggested.

Finding Sparse Linear Connections between Features in LLMs

I've only done replications on the mlp_out & attn_out for layers 0 & 1 for gpt2 small & pythia-70M

I chose same cos-sim instead of epsilon perturbations. My KL divergence is log plot, because one KL is ~2.6 for random perturbations.

I'm getting different results for GPT-2 attn_out Layer 0. My random perturbation is very large KL. This was replicated last week when I was checking how robust GPT2 vs Pythia is to perturbations in input (picture below). I think both results are actually correct, but my perturbation is for a low cos-sim (which i... (read more)

Some open-source dictionaries and dictionary learning infrastructure

Correct. So they’re connecting a feature in F2 to a feature in F1.

Some open-source dictionaries and dictionary learning infrastructure

If you removed the high-frequency features to achieve some L0 norm, X, how much does loss recovered change?

If you increased the l1 penalty to achieve L0 norm X, how does the loss recovered change as well?

Ideally, we can interpret the parts of the model that are doing things, which I'm grounding out as loss recovered in this case.

5Sam Marks1y

Here's an experiment I'm about to do: * Remove high-frequency features from 0_8192 layer 3 until it has L0 < 40 (the same L0 as the 1_32768 layer 3 dictionary) * Recompute statistics for this modified dictionary. I predict the resulting dictionary will be "like 1_32768 but a bit worse." Concretely, I'm guessing that means % loss recovered around 72%. Results: I killed all features of frequency larger than 0.038. This was 2041 features, and resulted in a L0 just below 40. The stats: MSE Loss: 0.27 (worse than 1_32768) Percent loss recovered: 77.9% (a little bit better than 1_32768) I was a bit surprised by this -- it suggests the high-frequency features are disproportionately likely to be useful for reconstructing activations in ways that don't actually mater to the model's computation. (Though then again, maybe this is what we expect for uninterpretable features.) It also suggests that we might be better off training dictionaries with a too-low L1 penalty and then just pruning away high-frequency features (sort of the dual operation of "train with a high L1 penalty and resample low-frequency features"). I'd be interested for someone to explore if there's a version of this that helps.

Logan Riggs Smith1y30

I've noticed that L0's above 100 (for the Pythia-70M model) is too high, resulting in mostly polysemantic features (though some single-token features were still monosemantic)

Agreed w/ Arthur on the norms of features being the cause of the higher MSE. Here are the L2 norms I got. Input is for residual stream, output is for MLP_out.

1Sam Marks1y

I agree that the L0's for 0_8192 are too high in later layers, though I'll note that I think this is mainly due to the cluster of high-frequency features (see the spike in the histogram). Features outside of this spike look pretty decent, and without the spike our L0s would be much more reasonable. Here are four random features from layer 3, at a range of frequencies outside of the spike. Layer 3, 0_8192, feature 138 (frequency = 0.003) activates on the newline at the end of the "field of the invention" section in patent applications. I think it's very likely predicting that the next few tokens will be "2. Description of the Related Art" (which always comes next in patents). Layer 3, 0_8192, feature 27 (frequency = 0.009) seems to activate on the "is" in the phrase "this is" Layer 3, 0_8192, feature 4 (frequency = 0.026) looks messy at first, but on closer inspection seems to activate on the final token of multi-token words in informative file/variable names. Layer 3, 0_8192, feature 56 (frequency = 0.035) looks very polysemantic: it's activating on certain terms in LaTeX expressions, words in between periods in urls and code, and some other random-looking stuff.

Sparse Autoencoders: Future Work

Logan Riggs Smith2y30

I actually do have some publicly hosted, only on residual stream and some simple training code.

I'm wanting to integrate some basic visualizations (and include Antrhopic's tricks) before making a public post on it, but currently:

Dict on pythia-70m-deduped

Dict on Pythia-410m-deduped

Which can be downloaded & interpreted with this notebook

With easy training code for bespoke models here.

Sparse Autoencoders Find Highly Interpretable Directions in Language Models

Reducing sycophancy and improving honesty via activation steering

I've had trouble figuring out a weight-based approach due to the non-linearity and would appreciate your thoughts actually.

We can learn a dictionary of features at the residual stream (R_d) & another mid-MLP (MLP_d), but you can't straightfowardly multiply the features from R_d with W_in, and find the matching features in MLP_d due to the nonlinearity, AFAIK.

I do think you could find Residual features that are sufficient to activate the MLP features^[1], but not all linear combinations from just the weights.

Using a dataset-based method, you could find c... (read more)

In ITI paper, they track performance on TruthfulQA w/ human labelers, but mention that other works use an LLM as a noisy signal of truthfulness & informativeness. You might be able to use this as a quick, noisy signal of different layers/magnitude of direction to add in.

Preferably, a human annotator labels model answers as true or false given the gold standard answer. Since human annotation is expensive, Lin et al. (2021) propose to use two finetuned GPT-3-13B models (GPT-judge) to classify each answer as true or false and informative or not. Evaluatio

Really Strong Features Found in Residual Stream

[word] and [word]
can be thought of as "the previous token is ' and'."

I think it's mostly this, but looking at the ablated text, removing the previous word before and does have a significant effect some of the time. I'm less confident on the specifics of why the previous word matter or in what contexts.

Maybe the reason you found ' and' first is because ' and' is an especially frequent word. If you train on the normal document distribution, you'll find the most frequent features first.

This is a database method, so I do believe we'd find the features mo... (read more)

Really Strong Features Found in Residual Stream

Setup:
Model: Pythia-70m (actually named 160M!)
Transformer lens: "blocks.2.hook_resid_post" (so layer 2)
Data: Neel Nanda's Pile-10k (slice of pile, restricted to have only 25 tokens, same as last post)
Dictionary_feature sizes: 4x residual stream ie 2k (though I have 1x, 2x, 4x, & 8x, which learned progressively more features according to the MCS metric)

Uniform Examples: separate feature activations into bins & sample from each bin (eg one from [0,1], another from [1,2])

Logit Lens: The decoder here had 2k feature directions. Each direction is size d_... (read more)

A small update to the Sparse Coding interim research report

Actually any that are significantly effected in "Ablated Text" means that it's not just the embedding. Ablated Text here means I remove each token in the context & see the effect on the feature activation for the last token. This is True in the StackExchange & Last Name one (though only ~50% of activation for last-name, will still recognize last names by themselves but not activate as much).

The Beginning & End of First Sentence actually doesn't have this effect (but I think that's because removing the first word just makes the 2nd word the new first word?), but I haven't rigorously studied this.

Logan Riggs Smith2y30

We have our replication here for anyone interested!

'Fundamental' vs 'applied' mechanistic interpretability research

A small update to the Sparse Coding interim research report

How likely do you think bilinear layers & dictionary learning will lead to comprehensive interpretability?

Are there other specific areas you're excited about?

1Lee Sharkey2y

Bilinear layers - not confident at all! It might make structure more amenable to mathematical analysis so it might help? But as yet there aren't any empirical interpretability wins that have come from bilinear layers. Dictionary learning - This is one of my main bets for comprehensive interpretability. Other areas - I'm also generally excited by the line of research outlined in https://arxiv.org/abs/2301.04709

A small update to the Sparse Coding interim research report

Why is loss stickiness deprecated? Were you just not able to see the an overlap in basins for L1 & reconstruction loss when you 4x the feature/neuron ratio (ie from 2x->8x)?

2Lee Sharkey2y

No theoretical reason - The method we used in the Interim Report to combine the two losses into one metric was pretty cursed. It's probably just better to use L1 loss alone and reconstruction loss alone and then combine the findings. But having plots for both losses would have added more plots without much gain for the presentation. It also just seemed like the method that was hardest to discern the difference between full recovery and partial recovery because the differences were kind of subtle. In future work, some way to use the losses to measure feature recover will probably be re-introduced. It probably just won't be the way we used in the interim report.

Logan Riggs Smith2y40

As (maybe) mentioned in the slides, this method may not be computationally feasible for SOTA models, but I'm interested in the ordering of features turned monosemantic; if the most important features are turned monosemantic first, then you might not need full monosemanticity.

I initially expect the "most important & frequent" features to become monosemantic first based off the superposition paper. AFAIK, this method only captures the most frequent because "importance" would be w/ respect to CE-loss in the model output, not captured in reconstruction/L1 loss.

3Lee Sharkey2y

I strongly suspect this is the case too! In fact, we might be able to speed up the learning of common features even further: Pierre Peigné at SERIMATS has done some interesting work that looks at initialization schemes that speed up learning. If you initialize the autoencoders with a sample of datapoints (e.g. initialize the weights with a sample from the MLP activations dataset), each of which we assume to contain a linear combination of only a few of the ground truth features, then the initial phases of feature recovery is much faster*. We haven't had time to check, but it's presumably biased to recover the most common features first since they're the most likely to be in a given data point. *The ground truth feature recovery metric (MMCS) starts higher at the beginning of autoencoder training, but converges to full recovery at about the same time.

My Objections to "We’re All Gonna Die with Eliezer Yudkowsky"

Pretraining Language Models with Human Preferences

My shard theory inspired story is to make an AI that:

Has a good core of human values (this is still hard)
Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point with GPT-4 sort of expressing it would avoid jail break inputs)

Then the model can safely scale.

This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different ... (read more)

1Matthew "Vaniver" Gray2y

If there are experiences which will change itself which don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose. FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!

A Comprehensive Mechanistic Interpretability Explainer & Glossary

Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?

AGI in sight: our look at the game board

Unfinished line here

Implicit in the description of features as directions is that the feature can be represented as a scalar, and that the model cares about the range of this number. That is, it matters whether the feature

Logan Riggs Smith2y32

Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.

Externalized reasoning being a flaw in monitoring makes a lot of sense, and I haven’t actually heard of it before. I feel that should be a whole post on itself.

Logan Riggs Smith2y98

These arguments don't apply to the base models which are only trained on next word prediction (ie the simulators post), since their predictions never affected future inputs. This is the type of model Janus most interacted with.

Two of the proposals in this post do involve optimizing over human feedback, like:

Creating custom models trained on not only general alignment datasets but personal data (including interaction data), and building tools and modifying workflows to facilitate better data collection with less overhead

, which they may apply to.

Logan Riggs Smith2y86

I’m excited about sensory substitution (https://eagleman.com/science/sensory-substitution/), where people translate auditory or visual information into tactile sensations (usually for people who don’t usually process that info).

I remember Quintin Pope wanting to translate the latent space of language models [reading a paper] translated to visual or tactile info. I’d see this as both a way to read papers faster, brainstorm ideas, etc and gain a better understanding of latent space during development of this.

I’m unsure how alt-history and point (2) history is hard to change and predictable relates to cyborgism. Could you elaborate?

Mechanistic Interpretability Quickstart Guide

For context, Amdahl’s law states how fast you can speed up a process is bottlenecked on the serial parts. Eg you can have 100 people help make a cake really quickly, but it still takes ~30 to bake.

I’m assuming here, the human component is the serial component that we will be bottlenecked on, so will be outcompeted by agents?

If so, we should try to build the tools and knowledge to keep humans in the loop as far as we can. I agree it will eventually be outcompeted by full AI agency alone, but it isn’t set in stone how far human-steered AI can go.

Real-Time Research Recording: Can a Transformer Re-Derive Positional Info?

Unfinished sentence at “if you want a low coding project” at the top.

1Neel Nanda2y

Fixed, thanks!