Independent alignment researcher
I have signed no contracts or agreements whose existence I cannot mention.
So: do you think that ambitious mech interp is impossible? Do you think that current interp work is going in the wrong direction in terms of achieving ambitious understanding? Or do you think it wouldn't be useful even if achieved?
Mostly I think MI is right that it can do a lot for alignment, but I suspect that many of the best things it can do for alignment it will do in a very dual-use way, and that dual use skews heavily towards capabilities, mostly because capabilities advances are easier and more people are working on them.
At the same time, I suspect many of those dual-use concerns can be mitigated by making your MI research targeted. Not necessarily targeted in the sense that you can do off-the-shelf interventions based on your findings, but targeted in the sense that if the work ever has any use, that use is going to be for alignment, and you can predict broadly what that use will look like.
This also doesn't mean your MI research can't be ambitious. I don't want to criticize people for being ambitious or too theoretical! I want to criticize people for producing knowledge about something which, while powerful, seems powerful in too many directions for producing it publicly to be net-positive.
I agree that if your theory of change for interp goes through "interp solves a concrete problem like deception or sensor tampering or adversarial robustness", then you'd better just try to solve those concrete problems instead of improving interp in general. But I think the case for ambitious mech interp isn't terrible, and so it's worth exploring and investing in anyways.
I don't entirely know what you mean by this. How would we solve alignment without going through a concrete problem? Maybe you think MI will be secondary to that process, and will give us useful information about which problems are necessary to solve? In that case I still don't see why you need ambitious MI: you can just test the different problem classes directly. Maybe you think the different problem classes are too large to test directly. Even then, I still think a more targeted approach would be better, where you generate as much information about those target classes as possible while minimizing information that can be used to make your models more capable, and you selectively report only the results of your investigation which bear on the problem class. Even if the research is exploratory, the results and verification demonstrations can still be targeted.
But again, most mech interp people aren't aiming to use mech interp to solve a specific concrete problem you can exhibit on models today, so it seems unfair to complain that most of the work doesn't lead to novel alignment methods.
Maybe I misspoke. I dislike current MI because I expect large capability improvements before and alongside the alignment improvements; I don't dispute that future alignment improvements will come, just whether they'll be worth it. The reason I brought that up was as motivation for why I think targeted work is better, and why I don't like some people's habit of dismissing worries about MI externalities by appealing to the lack of capabilities advances caused by MI. There have certainly been more attempts at capabilities improvements motivated by MI than there have been attempts at alignment improvements. Regardless of what you think about the future of the field, it's interesting when people make MI discoveries which don't lead to much in the way of capabilities advances.
I personally like activation additions because they give me evidence about how models mechanistically behave in a way which directly tells me which threat models are more or less likely, and they have the potential to make auditing and iteration a lot easier. These are accomplishments which ambitious MI is nowhere close to, and which I expect its methods would have to pay a lot for in terms of capability advances in order to reach. I mention this as evidence for why I expect targeted approaches to be faster and cheaper than ambitious ones, at least if done publicly.
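To make "activation additions" concrete, here is a minimal sketch of the basic idea: take a scaled difference of activations from two contrastive prompts and add it back into the residual stream during generation. This is my own illustration, not Team Shard's exact code; the model, layer index, coefficient, and prompts are arbitrary choices for the example.

```python
# A minimal sketch of an activation addition ("steering vector") on GPT-2.
# Simplified illustration only; layer, coefficient, and prompts are arbitrary.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

LAYER = 6    # which transformer block's output to steer (arbitrary choice)
COEFF = 4.0  # steering strength (hyperparameter)

def residual_after_block(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the last token, after block LAYER."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[LAYER + 1][:, -1, :]   # shape: (1, d_model)

# Steering vector = scaled difference of activations on two contrastive prompts.
steering = COEFF * (residual_after_block("Love") - residual_after_block("Hate"))

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden state.
    return (output[0] + steering,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tokenizer("I think that you are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
handle.remove()
print(tokenizer.decode(out[0]))
```

The published activation-additions work is more careful about which token positions get modified; the relevant point here is just that the intervention targets steering of behavior and values, rather than serving as a general-purpose lens on the whole network.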
Interpretability seems pretty useful for alignment, but it also seems pretty dangerous for capabilities. Overall the field seems net-bad. On an oversimplified model, my general reason for thinking this is that any given interpretability advance can be used either for capabilities or for alignment. Alignment is harder than improving model capabilities, and has fewer people working on it. So even if the marginal interpretability advance would be net-good for alignment were the two fields similar in size and difficulty, we should still expect it to end up used for capabilities.
Lots of people like pointing out that better interpretability almost never leads to long-term improvements in model capabilities, but it has led to just as few improvements in model alignment! And the number & quality of papers or posts using interpretability methods for capabilities vastly exceeds the number & quality of those using interpretability methods for alignment.
The only example of interpretability leading to novel alignment methods I know of is shard theory's recent activation additions work (notably work that is not so useful if Nate & Eliezer are right about AGI coherence). In contrast, it seems like all the papers using interpretability to advance capabilities rely on Anthropic's transformer circuits work.
These are two interesting case studies, and more work should probably be done comparing their relative merits. But in the absence of that, my explanation for the difference in outcomes is this:
Anthropic's work was highly exploratory, while Team Shard's was highly targeted. Anthropic tried to understand the transformer architecture and training process in general, while shard theory tried to understand values and only values. If capabilities work is easier than alignment work, it should not be too surprising that an unfocused approach differentially helps capabilities, while an approach focused on values differentially helps alignment. The unfocused approach will gather a wide range of low-hanging fruit, but little low-hanging fruit is alignment-related, so most of the fruit gathered will be capabilities-related.
This is why I'm pessimistic about most interpretability work: it just isn't focused enough! And it's why I'm optimistic about interpretability (and interpretability-adjacent) work focused on explicitly understanding the value systems of our ML systems, and how those can be influenced.
So a recommendation for those hoping to work on interpretability and have it be net-positive: focus on understanding the values of models! Or at least on other directly alignment-relevant parts of models.
For example, I mostly expect a solution to superposition to be net-negative, in the same way that transformer circuits is net-negative, though at the same time I also expect such a solution to have lots of alignment benefits in the short term. If AGI is further off, a solution to superposition ends up net-negative; the closer AGI is to now, the more positive a superposition solution becomes.
Another sort of interpretability advance I'm worried about: locating the optimization algorithms operating inside neural networks. I admit these advances would provide large alignment boosts, but that seems inconsequential compared to their potential for large boosts to capabilities. Such advances may be necessary for alignment though, so I'm happier in a world where they are not so widely publicized, and are given only to the superintelligence alignment wings of AGI labs [EDIT: and a group of researchers outside the labs, all in a way such that nobody shares them with people who may use the knowledge to advance capabilities].
I don't think the conclusion follows from the premises. People often learn new concepts after studying a subject, and it seems likely (to me) that when studying human cognition we'd at first be confused, because our previous concepts weren't sufficient to understand it, and then slowly stop being confused as we built & understood concepts suited to the subject. If an AI's thoughts are like human thoughts, then given a lot of time to understand them, what you describe doesn't rule out the AI's thoughts being comprehensible.
The mere existence of concepts we don't yet know about in a subject doesn't mean we can't learn those concepts. Most subjects involve concepts that are new to us when we start studying them.
Counterintuitively, it may be easier for an organization (e.g. Redwood Research) to get a $1 million grant from Open Phil than it is for an individual to get a $10k grant from the LTFF. The reason is that both grants probably require a similar amount of administrative effort, and a well-known organization is probably more likely than an individual to be trusted to use the money well, so the decision is easier to make. This example illustrates how decision-making and grant-making processes are probably just as important as the total amount of money available.
A priori, and from talking with some grant-makers, I'd think the relevant split is between people & orgs who are well-known to the grant-makers and those who are not, rather than between individuals and organizations. Why do you think the split is people vs. orgs?
This seems like an underestimate, because you don't consider whether the first "AGI" will indeed make it so we only get one chance. If it can only self-improve via more gradient steps, then humanity has a greater chance than if it can self-improve via prompt engineering or direct modification of its weights or latent states. Shard theory seems to have nonzero opinions on the fruitfulness of the non-data methods.
I think this type of criticism applies to an even wider range of fields than you might immediately imagine (though to varying degrees, and with greater or lesser obviousness or direct correspondence to the SGD case). Some examples:
Despite what economists say, the economy doesn't try to maximize welfare, or even net dollar-equivalent wealth. It rewards firms which are able to make a profit, in proportion to how much profit they make, and penalizes firms which aren't able to make a profit. Firms which would technically be profitable, but which have no local profit incentive gradient pointing towards them (factoring in the existence of rich people and lenders, neither of whom are perfect expected-profit maximizers), generally will not come into existence.
Individual firms also don't (only) try to maximize profit. Some parts of them may maximize profit, but most are just structures of people built from local social capital and economic capital incentive gradients.
Politicians don't try to (only) maximize win-probability.
Democracies don't try to (only) maximize voter approval.
Evolution doesn't try to maximize inclusive genetic fitness.
Memes don't try to maximize inclusive memetic fitness.
Academics don't try to (only) maximize status.
China doesn't maximize allegiance to the CCP.
I think there's a general tendency for people to look at the local updates in a system (when the system has humans as decision nodes, the local updates are called incentive gradients), somehow perform some integration-analogue to recover a function which would produce those local updates, then find a local minimum of that "integrated" function and claim the system is at that minimum, or is well approximated by the system at that minimum. In empirical systems this tendency is usually constrained by common sense learned through experience with the system, but in less and less empirical systems (like the economy, or SGD), people get more and more crazy, because they have less learned common sense to guide them when making the analysis.
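One way that move can fail, as a toy illustration of my own (not an example from the discussion above): the local update rule may not be the gradient of any function at all, in which case there is no "integrated" objective whose minima you can reason about.

```python
# Toy example: a local update rule that is NOT the gradient of any function.
# "Integrating" it into an objective and reasoning about that objective's
# minima would be a category error: this system never settles, it just orbits.
import numpy as np

def local_update(state: np.ndarray) -> np.ndarray:
    x, y = state
    return np.array([-y, x])  # rotational field; nonzero curl, so no potential exists

state = np.array([1.0, 0.0])
step = 0.01
for _ in range(10_000):
    state = state + step * local_update(state)

# The state circles the origin (slowly spiraling outward due to the discrete
# step size) rather than converging to any purported minimum.
print(state, np.linalg.norm(state))
```

And even when such a function does exist, local updates only tell you where the system drifts locally, not that it sits at any global optimum.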
This is true, but it indicates a radically different stage of training at which we should expect to find deception, compared to deception being an intrinsic value. It also possibly expands the kinds of reinforcement schedules we may want to use, compared to the worlds where deception crops up at the earliest opportunity (though pseudo-deception may still occur, where behaviors correlated with successful deception get reinforced).
John usually does not make his plans with an eye toward making things easier. His plan previously involved values because he thought they were strictly harder than corrigibility. If you solve values, you solve corrigibility. Similarly, if you solve abstraction, you solve interpretability, shard theory, value alignment, corrigibility, etc.
I don’t know all the details of John’s model here, but it may go something like this: if you solve corrigibility, and then find out corrigibility isn’t sufficient for alignment, you may expect your corrigible agent to help you build your value-aligned agent.
I've seen this; their examples don't seem so subtle to me compared with the alternatives.
For example:
You can clearly see a cat in the center of the left image!