

Part 1 of 12 in the Engineer’s Interpretability Sequence.

If we want to reduce near and long term risks from AI, we should care a lot about interpretability tools. This is a very uncontroversial claim to make inside the AI safety community. Almost every agenda for safe advanced AI incorporates interpretability in some way. The key value of interpretability tools is that they aid in human oversight by enabling open-ended evaluation. 

Short of actually deploying a system, any method of evaluating it can only be a proxy for its actual performance. The most common way to evaluate a model is by its performance in some test set or environment. But test sets alone can fail to reveal – and often incentivize – undesirable solutions involving overfitting, biases, deception, etc. This highlights the need for other ways to evaluate models, and an interpretability toolbox full of effective tools may go a long way. 

Some of the seeds of the AI safety community’s interest in interpretability were planted by Distill in 2017. But 2022 was an inflection point with a massive new surge in interest and work on interpretability tools. Anthropic was founded a little over a year ago. ARC started less than a year ago. Redwood has begun to push for much more interpretability work, including with the REMIX program. We are seeing a number of pushes to get many more people involved in interpretability work. And as someone on the ground, I have subjectively observed a surge in interest over 2022. Nor has the popularity of interpretability been limited to the AI safety community. There is now so much work in interpretability that there is a dataset of 5,199 interpretability papers (Jacovi, 2023). See also a survey of 300+ of them from some coauthors and me (Räuker et al., 2022).

Growth in the interpretability literature by year from Jacovi (2023).

But despite all this work, interpretability research has limitations. One of the goals of this sequence is to argue that:

Interpretability research both within the AI safety space and at large is not very productive and may be on course to stay this way. 

This is intentionally baitey, and I mean to make this point with a large amount of detail and nuance over the course of this sequence. But one striking thing about interpretability research is that:

For all the interpretability work that exists, there is a significant gap between this research and engineering applications.

This is not to say that purely exploratory work is not good and necessary. But the problem of AI safety is an engineering problem at its core. If one of our main goals for interpretability research is to help us with aligning highly intelligent AI systems in high stakes settings, shouldn’t we be seeing tools that are more helpful in the real world? Hence the name of this sequence: The Engineer’s Interpretability Sequence (EIS). 

This sequence will start with foundations, engage with existing work, and build toward an agenda. There will be 12 parts.

  1. EIS I: Intro
  2. EIS II: What is “Interpretability”? 
  3. EIS III: Broad Critiques of Interpretability Research
  4. EIS IV: A Spotlight on Feature Attribution/Saliency 
  5. EIS V: Blind Spots in AI Safety Interpretability Research
  6. EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety 
  7. EIS VII: A Challenge for Mechanists
  8. EIS VIII: An Engineer’s Understanding of Deceptive Alignment
  9. EIS IX: Interpretability and Adversaries
  10. EIS X: Continual Learning, Modularity, Compression, and Biological Brains
  11. EIS XI: Moving Forward
  12. EIS XII: Summary

In the coming days, I plan to post a new installment every day or so. Thanks to my labmates, advisor, friends, and many others in the interpretability community for lots of good conversations and inspiration in the past year. Thanks to Rio Popper for feedback on this intro post. I'll be thanking others on a per-post basis later on. However, to be 100% clear, all opinions, hot takes, and mistakes are my own. 

In the coming posts, I will discuss dozens of takes on a variety of topics. And I’ve done my best to make sure that those takes are good. But if and when some of them are not good, I hope that coming to understand why will be useful. I look forward to corrections and alternative points of view in the comments. Feedback throughout will be welcome. Thanks!

Questions

  • Is there anything in particular you would like to see discussed later in this sequence?
  • How truly pre-paradigmatic do you think interpretability research is? Is it still time to explore concepts and techniques, or should we be focusing more on benchmarks and real-world applications?
  • What things about interpretability research make you optimistic or pessimistic?
  • Are you working on anything to make interpretability work more engineering-relevant?
Comments

We are seeing a number of pushes to get many more people involved in interpretability work

Context: I work at Redwood. You linked to REMIX here, but I wouldn't necessarily argue for more people doing interpretability on the margin (and I think Buck probably roughly agrees with me here). I think it's plausible that too much effort is going to interp at the margin. I'm personally far more worried about interpretability work being well directed and high quality than about the number of people involved. (It seems like I potentially agree with you on this point, based on what this post is implying.)

Edit: it seems like you define interpretability very broadly, to the point where I'm a bit confused about what is or isn't interpretability work. This comment should be interpreted to refer to interpretability as 'someone (humans or AIs) getting a better understanding of how an AI works (often with a mechanistic connotation)'.

Interesting to know that about the plan. I had assumed that REMIX was in large part about getting more people into this type of work. But I'm interested in the conclusions and current views on it. Is there a post reflecting on how it went and what lessons were learned from it?

Don't think there have been public writeups, but here are two relevant Manifold markets:

Is there anything in particular you would like to see discussed later in this sequence?

It seems like you're trying to convince people to do interpretability research differently or to work on other types of research.

If so, I think that it might be worth engaging with people's cruxes. This can be harder than laying out general arguments, but it would make the sequence more useful.

That said, I don't really know what people's cruxes for working in interp are, and as far as I know this sequence already includes lots of discussion along these lines.

If one of our main goals for interpretability research is to help us with aligning highly intelligent AI systems in high stakes settings, shouldn’t we be seeing tools that are more helpful in the real world?

There are various reasons you might not see tools which are helpful right now. Here are some overly conjunctive examples:

  1. There's a clear and well-argued plan for the tools/research to build into tools/research which reduce X-risk, but this plan requires additional components which don't exist yet. So, these components are being worked on. Ideally, there would be some reason to think that acquiring these components is achievable.
  2. There's a rough idea that 'this sort of thing has to be helpful', and people are iterating on stuff which seems like it pushes on making interpretability eventually useful (even if it isn't making it more useful now and the path is unclear).
  3. People are working in the space for reasons other than maximally reducing X-risk (e.g., interpretability is cool or high status or is what I do to get paid).

I wouldn't be particularly worried if (1) were the dominant explanation: clear plans should make it obvious if the research isn't on track to reduce X-risk and should directly justify the lack of current tools.

I'm somewhat worried about (2) resulting in very poor allocation of resources. Further, it's hard to know whether things are going well (because all you had was a rough intuition). But it's not really clear that this is a bad bet to make overall.

I think reasons like (2) are very common for people working on interp targeting X-risk.

My guess is that for people saying they're working on X-risk, we shouldn't be super worried about the failure modes associated with (3) (or similar reasons).

I think that (1) is interesting. This sounds plausible, but I do not know of any examples of this perspective being fleshed out. Do you know of any posts on this?

I would argue that ARC's research is justified by (1) (roughly speaking). Sadly, I don't think that there are enough posts on their current plans for this to be clear or easy for me to point at. There might be some posts coming out soon.

I don't know if they'd put it like this, but IMO solving/understanding superposition is an important part of being able to really grapple with circuits in language models, and this is why it's a focus of the Anthropic interp team.

At least based on my convos with them, the Anthropic team does seem like a clear example of this, insofar as you think understanding circuits in real models with more than one MLP layer in them is important for interp: superposition just stops you from using the standard features-as-directions approach almost entirely!
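
For readers less familiar with the superposition point being referenced here, the sketch below (my own illustration, not part of the original comment) shows why packing more features than dimensions breaks the naive features-as-directions readout: with three features represented as non-orthogonal directions in a two-dimensional hidden space, projecting onto each feature's direction picks up interference from the others even when those features are inactive.

```python
# Minimal illustrative sketch of superposition (an assumed toy setup, not any
# particular model): 3 "features" packed into a 2-dimensional hidden space.
import numpy as np

d_model = 2
feature_dirs = np.array([
    [1.0, 0.0],
    [-0.5, np.sqrt(3) / 2],
    [-0.5, -np.sqrt(3) / 2],
])  # three unit vectors in 2D; they cannot all be orthogonal
assert feature_dirs.shape[1] == d_model

# An input on which only feature 0 is active.
feature_values = np.array([1.0, 0.0, 0.0])
activation = feature_values @ feature_dirs  # the hidden activation vector

# Naive features-as-directions readout: project onto each feature direction.
readout = feature_dirs @ activation
print(readout)  # approximately [1.0, -0.5, -0.5]: features 1 and 2 read as
                # nonzero even though they are "off" (interference).
```

With orthogonal feature directions (no superposition), the same readout would recover exactly [1, 0, 0].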

I'm hopeful that Redwood (where I work) moves toward having a clear and well-argued plan or directly useful techniques (perhaps building up from more toy problems).

ARC started less than a year ago

FWIW, I wouldn't describe ARC's work as interpretability. My best understanding is that they aren't directly targeting better human understanding of how AIs work (though this may happen indirectly). (I'm pretty confident in this, but maybe someone from ARC will correct me : ) )

Edit: it seems like you define interpretability very broadly, to the point where I'm a bit confused about what is or isn't interpretability work. This comment should be interpreted to refer to interpretability as 'someone (humans or AIs) getting a better understanding of how an AI works (often with a mechanistic connotation)'.

Thanks! I discuss in the second post of the sequence why I lump ARC's work in with human-centered interpretability. 

If we want to reduce near and long term risks from AI, we should care a lot about interpretability tools. This is a very uncontroversial claim to make inside the AI safety community. Almost every agenda for safe advanced AI incorporates interpretability in some way. The key value of interpretability tools is that they aid in human oversight by enabling open-ended evaluation.

Hmm, I actually don't think this is uncontroversial if by 'interpretability' you mean mechanistic interpretability. I think there's a pretty plausible argument that doing anything other than running your AI (and training it) will end up being irrelevant. And this argument could extend to thinking that the expected value of working on (mechanistic) interpretability is considerably lower than that of other domains.

If by interpretability you mean 'understand what the AI is doing via any means', then it seems very likely to be useful and widely used (see here for instance, but the idea of trying to understand what the model is doing by interacting with it is very basic). I'm not currently sure what research should be done in this domain, but there are evals projects iterating on this sort of work.

Oh, it seems like you're reluctant to define interpretability, but if anything you lean toward a very broad definition. Fair enough; I certainly agree that "methods by which something novel about a system can be better predicted or described" are important.

How do you anticipate and strategize around dual-use concerns, particularly for basic/blue-sky interpretability-enabling research?

I think that my personal thoughts on capabilities externalities are reflected well in this post.

I'd also note that this concern isn't very unique to interpretability work but applies to alignment work in general. And in comparison to other alignment techniques, I think that the downside risks of interpretability tools are most likely lower than those of stuff like RLHF. Most theories of change for interpretability helping with AI safety involve engineering work at some point in time, so I would expect that most interpretability researchers have similar attitudes to this on dual use concerns. 

In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I will talk about engineering relevance in this sequence, I don't have big advancements in mind so much as stuff like fairly simple debugging work. 

In general, a tool being engineering-relevant does not imply that it will be competitive for setting a new SOTA on something risky. So when I will talk about engineering relevance in this sequence, I don't have big advancements in mind so much as stuff like fairly simple debugging work. 

Fwiw this does not seem to be in the Dan Hendrycks post you linked!

Correct. I intended the 3 paragraphs in that comment to be separate thoughts. Sorry.