Erik Jenner

PhD student in AI safety at CHAI (UC Berkeley)

Comments

Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout.

Trying to summarize my current understanding of what you're saying:

  • SAEs themselves aren't meant to be descriptions of (network, dataset). (I'd just misinterpreted your earlier comment.)
  • As a description of just the network, SAEs have a higher description length than a naive neuron-based description of the network.
  • Given a description of the network in terms of "parts," we can get a description of (network, dataset) by listing out which "parts" are "active" on each sample. I assume we then "compress" this description somehow (e.g. grouping similar samples), since otherwise the description would always have size linear in the dataset size?
  • You're then claiming that SAEs are a particularly short description of (network, dataset) in this sense (since they're optimized for not having many parts active).
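
To make the last two bullets concrete, here is a minimal sketch (my own toy reading, not something from your post) of the uncompressed version of that description: for a hypothetical ReLU SAE with illustrative encoder weights W_enc and bias b_enc, count which features are "active" on each sample and charge one feature index per activation. All names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_model, d_sae = 1000, 64, 512            # illustrative sizes
acts = rng.normal(size=(n_samples, d_model))         # network activations on the dataset
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = rng.normal(size=(d_sae,))

feature_acts = np.maximum(acts @ W_enc + b_enc, 0.0)     # ReLU encoder
active_per_sample = (feature_acts > 0).sum(axis=1)       # "parts active" on each sample
naive_bits = active_per_sample.sum() * np.log2(d_sae)    # one feature index per activation
print(active_per_sample.mean(), naive_bits)
```

This is exactly the version that's linear in dataset size, which is why the "compress" step seems to be doing important work.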

My confusion mainly comes down to defining the words in quotes above, i.e. "parts", "active", and "compress". My sense is that they are playing a pretty crucial role and that there are important conceptual issues with formalizing them. (So it's not just that we have a great intuition that's merely annoying to spell out mathematically; I'm not convinced we even have a good intuitive understanding of what these things should mean.)

That said, my sense is you're not claiming any of this is easy to define. I'd guess you have intuitions that the "short description length" framing is philosophically the right one; I probably don't quite share those and feel more confused about how best to think about "short descriptions" if we don't just allow arbitrary Turing machines (basically because deciding what the allowable "parts" or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I'm happy to keep trying a bit more in case you're excited to explain).

Is there some formal-ish definition of "explanation of (network, dataset)" and "mathematical description length of an explanation" such that you think SAEs are especially short explanations? I still don't think I have whatever intuition you're describing, and I feel like the issue is that I don't know how you're measuring description length and what class of "explanations" you're considering.

As naive examples that probably don't work (similar to the ones from my original comment):

  • We could consider any Turing machine that approximately outputs (network, dataset) an "explanation", but it seems very likely that SAEs aren't competitive with short TMs of this form (obviously this isn't a fair comparison)
  • We could consider fixed computational graphs made out of linear maps and count the number of parameters. I think your objection to this is that these don't "explain the dataset"? (but then I'm not sure in what sense SAEs do)
  • We could consider arithmetic circuits that approximate the network on the dataset, and count the number of edges in the circuit to get "description length". This might give some advantage to SAEs if you can get sparse weights in the sparse basis, seems like the best attempt out of these three. But it seems very unclear to me that SAEs are better in this sense than even the original network (let alone stuff like pruning).

Focusing instead on what an "explanation" is: would you say the network itself is an "explanation of (network, dataset)" and just has high description length? If not, then the thing I don't understand is more about what an explanation is and why SAEs are one, rather than how you measure description length.


ETA: On re-reading, the following quote makes me think the issue is that I don't understand what you mean by "the explanation" (is there a single objective explanation of any given network? If so, what is it?) But I'll leave the rest in case it helps clarify where I'm confused.

Assuming the network is smaller yet as performant (therefore presumably doing more computation in superposition), then the explanation of the (network, dataset) is basically unchanged.

The sparsity penalty trains the SAE to activate fewer features for any given datapoint, thus optimizing for shorter mathematical description length

I'm confused by this claim and some related ones; sorry if this comment is correspondingly confused and rambly.

It's not obvious at all to me that SAEs lead to shorter descriptions in any meaningful sense. We get sparser features (and maybe sparser interactions between features), but in exchange, we have more features and higher loss. Overall, I share Ryan's intuition here that it seems pretty hard to do much better than the total size of the network parameters in terms of description length.
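
For concreteness, here is the naive "disk space" calculation I have in mind (illustrative, made-up sizes rather than any particular model): storing the network's weights vs. storing those weights plus an SAE at every layer. On this measure, adding SAEs can only make the description longer, so the claim has to be about some other measure.

```python
d_model, d_mlp, n_layers = 768, 3072, 12        # assumed transformer-ish sizes
d_sae = 16 * d_model                            # assumed SAE expansion factor

per_layer = 4 * d_model * d_model + 2 * d_model * d_mlp          # rough attention + MLP weights
network_params = n_layers * per_layer
sae_params = n_layers * (2 * d_model * d_sae + d_sae + d_model)  # encoder, decoder, biases

print(f"network: {network_params:,}  SAEs: {sae_params:,}  ratio: {sae_params / network_params:.2f}")
```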

Of course, the actual minimal description length program that achieves the same loss probably looks nothing like a neural network and is much more efficient. But why would SAEs let us get much closer to that? (The reason we use neural networks instead of arbitrary Turing machines in the first place is that optimizing over the latter is intractable.)

One might say that SAEs lead to something like a shorter "description length of what happens on any individual input" (in the sense that fewer features are active). But I don't think there's a formalization of this claim that captures what we want. In the limit of very many SAE features, we can just have one feature active at a time, but clearly that's not helpful.
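
Here is the degenerate limit spelled out as a toy construction (purely illustrative): a "dictionary" with one direction per datapoint gets perfect reconstruction with exactly one feature active per sample, while explaining nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))             # toy activations (n_samples, d_model)
dictionary = acts.copy()                       # one decoder direction per datapoint
codes = np.eye(len(acts))                      # exactly one feature active per sample
assert np.allclose(codes @ dictionary, acts)   # zero reconstruction error, L0 = 1
```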

If you're fine with a significant hit in loss from decompiling networks, then I'm much more sympathetic to the claim that you can reduce description length. But in that case, I could also reduce the description length by training a smaller model.

You might also be using a notion of "mathematical description length" that's a bit different from what I was thinking of (which is roughly "how much disk space would the parameters take?"), but I'm not sure what it is. One attempt at an alternative would be something like "length of the shortest efficiently runnable Turing machine that outputs the parameters", in order to not penalize simple repetitive structures, but I have no idea how using that definition would actually shake out.

All that said, I'm very glad you wrote this detailed description of your plans! I'm probably more pessimistic than you about it but still think this is a great post.

Thanks! Mostly agree with your comments.

I actually think this is reasonably relevant, and is related to treeification.

I think any combination of {rewriting, using some canonical form} and {treeification, no treeification} is at least possible, and they all seem sort of reasonable. Do you mean the relation is that both rewriting and treeification give you more expressiveness/more precise hypotheses? If so, I agree for treeification, not sure for rewriting. If we allow literally arbitrary extensional rewrites, then that does increase the number of different hypotheses we can make, but these hypotheses can't be understood as making precise claims about the original computation anymore. I could even see an argument that allowing rewrites in some sense always makes hypotheses less precise, but I feel pretty confused about what rewrites even are given that there might be no canonical topology for the original computation.

ETA: We've now written a post that compares causal scrubbing and the Geiger et al. approach in much more detail: https://www.alignmentforum.org/posts/uLMWMeBG3ruoBRhMW/a-comparison-of-causal-scrubbing-causal-abstractions-and

I still endorse the main takeaways from my original comment below, but the list of differences isn't quite right (the newer papers by Geiger et al. do allow multiple interventions, and I neglected the impact that treeification has in causal scrubbing).


To me, the methods seem similar in much more than just the problem they're tackling. In particular, the idea in both cases seems to be:

  • One format for explanations of a model is a causal/computational graph together with a description of how that graph maps onto the full computation.
  • Such an explanation makes predictions about what should happen under various interventions on the activations of the full model, by replacing them with activations on different inputs.
  • We can check the explanation by performing those activation replacements and seeing if the impact is what we predicted.
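
As a toy sketch of the shared procedure in these bullets (my own minimal example, not the actual causal scrubbing or causal abstraction code): the "explanation" below claims an intermediate activation h depends only on one part of the input, so replacing h with its value on a different input that agrees on that part should leave the output unchanged.

```python
import numpy as np

def full_model(x, patched_h=None):
    h = np.tanh(x[:2].sum())        # claimed by the explanation to depend only on x[:2]
    if patched_h is not None:
        h = patched_h               # activation replacement ("interchange intervention")
    return h + 0.1 * x[2]           # rest of the computation

x_base = np.array([0.3, 0.5, 1.0])
x_source = np.array([0.6, 0.2, -4.0])    # agrees with x_base on the claimed feature (x[:2] sums to 0.8)

h_source = np.tanh(x_source[:2].sum())
print(np.isclose(full_model(x_base, patched_h=h_source), full_model(x_base)))  # True: as predicted
```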

Here are all the differences I can see:

  • In the Stanford line of work, the output of the full model and of the explanation are of the same type, instead of the explanation having a simplified output. But as far as I can tell, we could always just add a final step to the full computation that simplifies the output, which basically bridges this gap.
  • How the methods quantify the extent to which a hypothesis isn't perfect: at least in this paper, the Stanford authors look at the size of the largest subset of the input distribution on which the hypothesis is perfect, instead of taking the expectation of the scrubbed output.
  • The "interchange interventions" in the Stanford papers are allowed to change the activations in the explanation. They then check whether the output after intervention changes in the way the explanation would predict, as opposed to checking that the scrubbed output stays the same. (So along this axis, causal scrubbing just performs a subset of all the interchange interventions.)
  • Apparently the Stanford authors only perform one intervention at a time, whereas causal scrubbing performs all possible interventions at once.

These all strike me as differences in implementation of fundamentally the same idea.

Anyway, maybe we're actually on the same page and those differences are what you meant by "pretty different algorithm". But if not, I'd be very interested to hear what you think the key differences are. (I'm working on yet another approach and suspect more and more strongly that it's very similar to both causal scrubbing and Stanford's causal abstraction approach, so it would be really good to know if I'm misunderstanding anything.)

FWIW, I would agree that the motivation of the Stanford authors seems somewhat different, i.e. they want to use this measurement of explanation quality in different ways. I'm less interested in that difference right now.

Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:

The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]

Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involves only a few variables. These local transformations aren't a problem because if we, say, resample k variables at a time, then transforming fewer than k variables doesn't affect redundant information.

I'm tentatively skeptical that we can split transformations up into these local components. E.g. to me it seems that describing some large number N of particles by their center of mass and the distance vectors from the center of mass is a very reasonable description. But it sounds like you have a notion of "reasonable" in mind that's more specific than the set of all descriptions physicists might want to use.
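
A quick check of why the center-of-mass description worries me (toy 1D version, my own sketch): the linear map from the N positions to (center of mass, offsets from it) is dense, so every new coordinate depends on all N original variables rather than decomposing into local pieces.

```python
import numpy as np

N = 5
T = np.zeros((N, N))
T[0, :] = 1.0 / N                  # first new coordinate: center of mass
T[1:, :] = -1.0 / N                # remaining coordinates: x_i minus the center of mass
T[1:, 1:] += np.eye(N - 1)
print((np.abs(T) > 1e-12).sum(axis=1))   # every row involves all N variables: [5 5 5 5 5]
```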

I also don't see yet how exactly to make this work given local transformations---e.g. I think my version above doesn't quite work because if you're resampling a finite number k of variables at a time, then I do think transforms involving fewer than k variables can sometimes affect redundant information. I know you've talked before about resampling any finite number of variables in the context of a system with infinitely many variables, but I think we'll want a theory that can also handle finite systems. Another reason this seems tricky: if you compose lots of local transformations, for overlapping local neighborhoods, you get a transformation involving lots of variables. I don't currently see how to avoid that.

I'd also offer this as one defense of my relatively low level of formality to date: finite approximations are clearly the right way to go, and I didn't yet know the best way to handle finite approximations. I gave proof sketches at roughly the level of precision which I expected to generalize to the eventual "right" formalizations. (The more general principle here is to only add formality when it's the right formality, and not to prematurely add ad-hoc formulations just for the sake of making things more formal. If we don't yet know the full right formality, then we should sketch at the level we think we do know.)

Oh, I did not realize from your posts that this is how you were thinking about the results. I'm very sympathetic to the point that formalizing things in what is ultimately the wrong setting doesn't help much (e.g. in our appendix, we recommend people focus on the conceptual open problems like finite regimes or encodings, rather than more formalization). We may disagree about how much progress the results to date represent regarding finite approximations. I'd say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).

Existing work has managed to go from pseudocode/circuits to interpretation of inputs mainly by looking at cases where the circuits in question are very small and simple - e.g. edge detectors in Olah's early work, or the sinusoidal elements in Neel's work on modular addition. But this falls apart quickly as the circuits get bigger - e.g. later layers in vision nets, once we get past early things like edge and texture detectors.

I totally agree with this FWIW, though we might disagree on some aspects of how to scale this to more realistic cases. I'm also very unsure whether I get how you concretely want to use a theory of abstractions for interpretability. My best story is something like: look for good abstractions in the model and then for each one, figure out what abstraction this is by looking at training examples that trigger the abstraction. If NAH is true, you can correctly figure out which abstraction you're dealing with from just a few examples. But the important bit is that you start with a part of the model that's actually a natural abstraction, which is why this approach doesn't work if you just look at examples that make a neuron fire, or similar ad-hoc ideas.

More generally, if you're used to academia, then bear in mind the incentives of academia push towards making one's work defensible to a much greater degree than is probably optimal for truth-seeking.

I agree with this. I've done stuff in some of my past papers that was just for defensibility and didn't make sense from a truth-seeking perspective. I absolutely think many people in academia would profit from updating in the direction you describe, if their goal is truth-seeking (which it should be if they want to do helpful alignment research!)

On the other hand, I'd guess the optimal amount of precision (for truth-seeking) is higher in my view than it is in yours. One crux might be that you seem to have a tighter association between precision and tackling the wrong questions than I do. I agree that obsessing too much about defensibility and precision will lead you to tackle the wrong questions, but I think this is feasible to avoid. (Though as I said, I think many people, especially in academia, don't successfully avoid this problem! Maybe the best quick fix for them would be to worry less about precision, but I'm not sure how much that would help.) And I think there's also an important failure mode where people constantly think about important problems but never get any concrete results that can actually be used for anything.

It also seems likely that different levels of precision are genuinely right for different people (e.g. I'm unsurprisingly much more confident about what the right level of precision is for me than about what it is for you). To be blunt, I would still guess the style of arguments and definitions in your posts only works well for very few people in the long run, but of course I'm aware you have lots of details in your head that aren't in your posts, and I'm also very much in favor of people just listening to their own research taste.

both my current work and most of my work to date is aimed more at truth-seeking than defensibility. I don't think I currently have all the right pieces, and I'm trying to get the right pieces quickly.

Yeah, to be clear I think this is the right call, I just think that more precision would be better for quickly arriving at useful true results (with the caveats above about different styles being good for different people, and the danger of overshooting).

Being both precise and readable at the same time is hard, man.

Yeah, definitely. And I think different trade-offs between precision and readability are genuinely best for different readers, which doesn't make it easier. (I think this is a good argument for separate distiller roles: if researchers have different styles, and can write best to readers with a similar style of thinking, then plausibly any piece of research should have a distillation written by someone with a different style, even if the original was already well written for a certain audience. It's probably not that extreme; I think it's often at least possible to find a good trade-off that works for most people, though that's hard.)

I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:

  • If I understand correctly, we'd need a model that we're confident is a mesa-optimizer (and perhaps even deceptive---mesa-optimizers per se might be ok/desirable), but still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are "thresholds" where slight changes have big effects on how dangerous a model is.
  • If there's a very strong inductive bias towards deception, you might have to sample an astronomical number of initializations to get a non-deceptive model. Maybe you can solve the computational problem, but it seems harder to avoid the problem that you need to optimize/select against your deception-detector. The stronger the inductive bias for deception, the more robustly the method needs to distinguish basins.
  • Related to the previous point, it seems plausible to me that whether you get a mesa-optimizer or not has very little to do with the initialization. It might depend almost entirely on other aspects of the training setup.
  • It seems unclear whether we can find fingerprinting methods that can distinguish deception from non-deception, or mesa-optimization from non-mesa-optimization, but which don't also distinguish a ton of other things. The paragraph about how there are hopefully not that many basins makes an argument for why we might expect this to be possible, but I still think this is a big source of risk/uncertainty. For example, the fingerprinting that's actually done in this post distinguishes different base models based on plausibly meaningless differences in initialization, as opposed to deep mechanistic differences. So our fingerprinting technique would need to be much less sensitive, I think?

ETA: I do want to highlight that this is still one of the most promising ideas I've heard recently and I really look forward to hopefully reading a full post on it!

No, I'm not claiming that. What I am claiming is something more like: there are plausible ways in which applying 30 nats of optimization via RLHF leads to worse results than best-of-exp(30) sampling, because RLHF might find a different solution that scores that highly on reward.

Toy example: say we have two jointly Gaussian random variables X and Y that are positively correlated (but not perfectly). I could sample 1000 pairs and pick the one with the highest X-value. This would very likely also give me an unusually high Y-value (how high depends on the correlation). Or I could change the parameters of the distribution such that a single sample will typically have an X-value as high as the 99.9th percentile of the old distribution. In that case, the Y-value I typically get will depend a lot on how I changed the parameters. E.g. if I just shifted the X-component of the mean and nothing else, I won't get higher Y-values at all.
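
Here is that toy example as a quick simulation (my own sketch, arbitrary numbers): best-of-n selection on X also raises Y, while shifting only the X-component of the mean raises X by the same amount and leaves Y untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_trials, n = 0.8, 2000, 1000
cov = np.array([[1.0, rho], [rho, 1.0]])

def sample(n_samples, mean_x=0.0):
    return rng.multivariate_normal([mean_x, 0.0], cov, size=n_samples)

# Best-of-n: in each trial, draw n pairs and keep the one with the highest X-value.
batches = sample(n_trials * n).reshape(n_trials, n, 2)
best = batches[np.arange(n_trials), batches[:, :, 0].argmax(axis=1)]
print("best-of-n:  E[X]=%.2f  E[Y]=%.2f" % (best[:, 0].mean(), best[:, 1].mean()))

# Mean shift: move E[X] up to the same level without touching the X-Y relationship.
shifted = sample(n_trials, mean_x=best[:, 0].mean())
print("mean shift: E[X]=%.2f  E[Y]=%.2f" % (shifted[:, 0].mean(), shifted[:, 1].mean()))
```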

I'm pretty unsure what kinds of parameter changes RLHF actually induces; I'm just saying that parameter updates can destroy correlations in a way that conditioning doesn't. This is with the same amount of selection pressure on the proxy in both cases.

As a caveat, I didn't think of the RL + KL = Bayesian inference result when writing this, I'm much less sure now (and more confused).

Anyway, what I meant: think of the computational graph of the model as a causal graph, then changing the weights via RLHF is an intervention on this graph. It seems plausible there are somewhat separate computational mechanisms for producing truth and for producing high ratings inside the model, and RLHF could then reinforce the high rating mechanism without correspondingly reinforcing the truth mechanism, breaking the correlation. I certainly don't think there will literally be cleanly separable circuits for truth and high rating, but I think the general idea plausibly applies. I don't see how anything comparable happens with filtering.

Thanks! Causal Goodhart is a good point, and I buy now that RLHF seems even worse from a Goodhart perspective than filtering. Just unsure by how much, and how bad filtering itself is. In particular:

In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful

This is the part I'm still not sure about. For example, maybe the simplest/apparently-easiest-to-understand answer that looks good to humans tends to be false. Then if human raters prefer simpler answers (because they're more confident in their evaluations of those), the maximally approved answers might be bad. This is similar to the "truths humans can't be convinced of" that you mention, but with the difference that it's just a matter of how convinced humans are by different answers. We could then be in a situation where both filtering and RLHF suffer a lot from Goodhart's law, and while RLHF might technically be even worse, the difference wouldn't matter in practice since we'd need a solution to the fundamental problem anyway.

I feel like the key question here is how much selection pressure we apply. My sense is that for sufficient amounts of selection pressure, we do quite plausibly run into extremal Goodhart problems like that. But it also seems plausible we wouldn't need to select that hard (e.g. we don't need the single most compelling answer), in which case I agree with what you said.
