Introduction to the sequence: Interpretability Research for the Most Important Century

Evan R. Murphy

This is the first post in a sequence exploring the argument that interpretability is a high-leverage research activity for solving the AI alignment problem.

This post contains important background context for the rest of the sequence. I'll give an overview of one of Holden Karnofsky's (2022) Important, actionable research questions for the most important century^[1], which is the central question we'll be engaging with in this sequence. I'll also define some terms and compare this sequence to existing works.

If you're already very familiar with Karnofsky (2022)^[1] and interpretability, then you can probably skip to the second post in this sequence: Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

The Alignment Research Activities Question

This sequence is being written as a direct response to the following question from Karnofsky (2022)^[1]:

“What relatively well-scoped research activities are particularly likely to be useful for longtermism-oriented AI alignment?” (full question details)

I'll refer to this throughout the sequence as the Alignment Research Activities Question.

Context on the question and why it matters

In the details linked above for the Alignment Research Activities Question, Holden first discusses two categories of alignment research which are lacking in one way or another. He then presents a third category with some particularly desirable properties:

“Activity that is [1] likely to be relevant for the hardest and most important parts of the problem, while also being [2] the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)”

He refers to this as "category (3)", but I'll use the term High-leverage Alignment Research since it's more descriptive and we'll be referring back to this concept often throughout the sequence.

We want to know more about which alignment research is in this category. Why? Further excerpts from Karnofsky (2022)^[1] to clarify:

“I think anything we can clearly identify as category (3) [that is, High-leverage Alignment Research] is immensely valuable, because it unlocks the potential to pour money and talent toward a relatively straightforward (but valuable) goal.
[...]
I think there are a lot of people who want to work on valuable-by-longtermist-lights AI alignment research, and have the skills to contribute to a relatively well-scoped research agenda, but don’t have much sense of how to distinguish category (3) from the others.

There’s also a lot of demand from funders to support AI alignment research. If there were some well-scoped and highly relevant line of research, appropriate for academia, we could create fellowships, conferences, grant programs, prizes and more to help it become one of the better-funded and more prestigious areas to work in.

I also believe the major AI labs would love to have more well-scoped research they can hire people to do.”

I won't be thoroughly examining research directions other than interpretability, except in cases where a hypothetical interpretability breakthrough would advance another research direction toward a potential solution to the alignment problem. So I don't expect this sequence to produce a complete comparative answer to the Alignment Research Activities Question.

But by investigating whether interpretability research is High-leverage Alignment Research, I hope to put together a fairly comprehensive analysis of interpretability research that could be useful to people considering investing their money or time into it. I also hope that someone trying to answer the larger Alignment Research Activities Question could use my work on interpretability in this sequence as part of a more complete, comparative analysis across different alignment research activities.

So in the next post, Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios, I'll be exploring whether interpretability has property #1 of High-leverage Alignment Research. That is, whether interpretability is "likely to be relevant for the hardest and most important parts of the [AI alignment] problem."

Then, in a later post of this sequence, I'll explore whether interpretability has property #2 of High-leverage Alignment Research. That is, whether interpretability is "the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)"

A note on terminology

First of all, what is interpretability?

I’ll borrow a definition (actually two) from Christoph Molnar’s Interpretable Machine Learning (the superscript numbers here are Molnar's footnotes, not mine - you can find what they refer to by following the link):

“A (non-mathematical) definition of interpretability that I like by Miller (2017)³ is: Interpretability is the degree to which a human can understand the cause of a decision. Another one is: Interpretability is the degree to which a human can consistently predict the model’s result⁴. The higher the interpretability of a machine learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made. A model is better interpretable than another model if its decisions are easier for a human to comprehend than decisions from the other model.”

I also occasionally use the word “transparency” instead of “interpretability”, but I mean these to be synonymous.

Comparison to existing works

This is the first post I’m aware of attempting to answer the Alignment Research Activities Question since Karnofsky (2022)^[1] put it forth.

However, there are several previous posts which explore interpretability at a high level and its possible impact on alignment. Many of the ideas in this post therefore aren’t original: they either draw from these earlier works or were arrived at independently.

Here are some of the relevant posts, and my comments on how they compare to the present sequence:

Neel Nanda’s A Longlist of Theories of Impact for Interpretability. Neel proposes a list of 20 possible impacts for interpretability and briefly describes them. He puts forth a wide range of possible impacts, from technical alignment solutions to norm-setting and cultural shifts following from interpretability research. There’s also an interesting linked spreadsheet where he conducted a survey among several researchers and had them rate the plausibility of each theory.

The second post in this sequence is similar to Neel's post in that it explores potential impacts of interpretability on alignment. My post covers a smaller number of scenarios in greater depth, mostly limiting the type of potential impacts to solving technical alignment. I evaluate each scenario's impact on different aspects of alignment. My post references Neel's as well as his spreadsheet.
Beth Barnes’ Another list of theories of impact for interpretability. Beth provides a list of interpretability theories of impact similar to Neel’s above, but focusing on technical alignment impacts. It explores some interesting scenarios I hadn't seen mentioned before elsewhere. The second post in this sequence focuses on a similar number of interpretability scenarios; some of my scenarios overlap with Beth's.
Mark Xu's Transparency Trichotomy. This is a useful exploration of 3 general approaches to producing interpretability in AI systems: transparency via inspection, transparency via training, and transparency via architecture. Mark goes into more depth on each of these and on some ways they could converge or assist each other. We reference these 3 approaches throughout the present sequence.
jylin04’s Transparency and AGI safety. jylin04 argues that there are four motivations for working on transparency. Three of the motivations are about safety or robustness. The remaining motivation is, interestingly, about using transparency to improve forecasting. jylin04 also reviews the Circuits program and discusses future directions for interpretability research.

Motivation #2 from jylin04's post is about how important interpretability seems for solving inner alignment. We will see this theme recur throughout the second post of the present sequence, where interpretability is identified as being capable of great positive impacts on inner alignment across 7 scenarios and a wide range of analyzed techniques.
Evan Hubinger’s An overview of 11 proposals for building safe advanced AI. Hubinger’s post heavily influenced the second post in this sequence, as did several of his other works. Hubinger proposes four important components of alignment, which I borrow in order to evaluate the impacts of 7 different interpretability scenarios on alignment. It's interesting to note that although Hubinger’s post wasn’t specifically about interpretability, every single one of the 11 alignment proposals he evaluated turns out to depend on interpretability in an important way.^[2]
How can Interpretability help Alignment? by RobertKirk, Tomáš Gavenčiak, and flodorner. Kirk et al.’s post explores interactions between interpretability and several alignment proposals. It also has some discussion aimed at helping individual researchers decide what to work on within interpretability.

The present sequence has a lot in common with the Kirk et al. post. The second post in this sequence similarly considers the impact of interpretability on many different alignment proposals. Later in the sequence, I plan to evaluate whether interpretability research exhibits property #2 of High-leverage Alignment Research. Property #2 is concerned with helping individual researchers find direction in the bottom-up fashion that Kirk et al. have done, but it is also concerned with the possibility of onboarding researchers in a more systematic and potentially top-down manner.
Open Philanthropy’s RFP on Interpretability, written by Chris Olah. The second post in this sequence is similar to the Open Phil RFP in terms of offering some ambitious goals and milestones for interpretability and in attempting to attract researchers to the field.

It's interesting to note that the RFP's aspirational goal wasn’t actually aspirational enough to make the list of scenarios in the next post of this sequence. (However, many elements in Scenario 1: Full understanding of arbitrary neural networks were inspired by it.) This makes sense when you consider that the purpose of the RFP was to elaborate concrete near-term research directions for interpretability. By contrast, the second post in this sequence requires a more bird's-eye view of interpretability endgames in order to evaluate whether interpretability has property #1 of High-leverage Alignment Research.

Another difference is that I, unlike Olah in the post for Open Phil, am not directly offering anyone money to work on interpretability! ;)
Paul Christiano’s Comments on OpenPhil's Interpretability RFP. Here Paul is responding to the previously summarized RFP by Chris Olah. Paul is excited about interpretability but advocates prioritizing in-depth understanding of very small parts.

In a sense, the second post in this sequence is gesturing in the opposite direction, highlighting aspirational goals and milestones for interpretability that would be game changers and have plausible stories for solving AGI alignment. However, the ideas aren’t really at odds at all. Paul’s suggestions may be an important tactical component of eventually realizing one or more of the 7 interpretability endgame scenarios considered in the next post of this sequence.
Opinions on Interpretable Machine Learning and 70 Summaries of Recent Papers by Peter Hase and lifelonglearner (Owen Shen). This is an amazingly thorough post on the state of interpretability research as of mid-2021. Borrowing from Rohin’s summary for the Alignment Newsletter, ‘The authors provide summaries of 70 (!) papers on [interpretability], and include links to another 90. I’ll focus on their opinions about the field in this summary. The theory and conceptual clarity of the field of interpretability has improved dramatically since its inception. There are several new or clearer concepts, such as simulatability, plausibility, (aligned) faithfulness, and (warranted) trust. This seems to have had a decent amount of influence over the more typical “methods” papers.’

I’d like to say that I thoroughly internalized these concepts and leveraged them in the scenarios analysis which follows. However, that’s not the case, so I’m calling this out as a limitation of the present writing. While these concepts are evidently catching on in the mainstream interpretability community focused on present-day systems, I have some skepticism about how applicable they will be to aligning advanced AI systems like AGI, though they may be useful there as well. I would also like to spend more time digging into this list of papers to better understand the cutting edge of interpretability research, which could help inform the numerous “reasons to be optimistic/pessimistic” analyses I make later in this sequence.

What’s next in this series?

The next post, Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios, explores whether interpretability has property #1 of High-leverage Alignment Research. That is, whether interpretability is "likely to be relevant for the hardest and most important parts of the [AI alignment] problem."

Acknowledgments

Many thanks to Joe Collman, Nick Turner, Eddie Kibicho, Donald Hobson, Logan Riggs Smith, Ryan Murphy and Justis Mills (LessWrong editing service) for helpful discussions and feedback on earlier drafts of this post.

Thanks also to the AGI Safety Fundamentals Curriculum, which is an excellent course I learned a great deal from leading up to writing this, and for which I started this sequence as my capstone project.

Read the next post in this sequence: Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios

^[1]
Karnofsky, Holden (2022): Important, actionable research questions for the most important century
Sometimes when I quote Karnofsky (2022), I'm referring directly to the link above to the post on the Effective Altruism Forum. Other times I'm referring to text that only appears in the associated Appendix 1: detailed discussion of important, actionable questions for the most important century that Holden provides, which is on Google Docs.
The "most important century" part of the present sequence's name also draws its inspiration from Karnofsky (2022) and an earlier blog post series by the same author.
^[2]
3 of the 11 proposals explicitly have “transparency tools” in the name. 5 more of them rely on relaxed adversarial training. In Evan Hubinger’s Relaxed adversarial training for inner alignment, he explains why this technique ultimately depends on interpretability as well:
“...I believe that one of the most important takeaways we can draw from the analysis presented here, regardless of what sort of approach we actually end up using, is the central importance of transparency. Without being able to look inside our model to a significant degree, it is likely going to be very difficult to get any sort of meaningful acceptability guarantees. Even if we are only shooting for an iid guarantee, rather than a worst-case guarantee, we are still going to need some way of looking inside our model to verify that it doesn't fall into any of the other hard cases.”
Then there is Microscope AI, which is an alignment proposal based entirely around interpretability. STEM AI relies on transparency tools to solve inner alignment issues in Hubinger’s analysis. Finally, in proposal #2 which utilizes intermittent oversight, he clarifies that the overseers will be "utilizing things like transparency tools and adversarial attacks."
