Karnofsky, Holden (2022): Important, actionable research questions for the most important century
Sometimes when I quote Karnofsky (2022), I'm referring directly to the link above to the post on the Effective Altruism Forum. Other times I'm referring to text that only appears in the associated Appendix 1: detailed discussion of important, actionable questions for the most important century that Holden provides, which is on Google Docs.
The "most important century" part of the present sequence's name also draws its inspiration from Karnofsky (2022) and an earlier blog post series by the same author.
3 of the 11 proposals explicitly have “transparency tools” in the name. 5 more of them rely on relaxed adversarial training. In Evan Hubinger’s Relaxed adversarial training for inner alignment, he explains why this technique ultimately depends on interpretability as well:
“...I believe that one of the most important takeaways we can draw from the analysis presented here, regardless of what sort of approach we actually end up using, is the central importance of transparency. Without being able to look inside our model to a significant degree, it is likely going to be very difficult to get any sort of meaningful acceptability guarantees. Even if we are only shooting for an iid guarantee, rather than a worst-case guarantee, we are still going to need some way of looking inside our model to verify that it doesn't fall into any of the other hard cases.”
Then there is Microscope AI, which is an alignment proposal based entirely around interpretability. STEM AI relies on transparency tools to solve inner alignment issues in Hubinger’s analysis. Finally, in proposal #2 which utilizes intermittent oversight, he clarifies that the overseers will be "utilizing things like transparency tools and adversarial attacks."
This is the first post in a sequence exploring the argument that interpretability is a high-leverage research activity for solving the AI alignment problem.
This post contains important background context for the rest of the sequence. I'll give an overview of one of Holden Karnofsky's (2022) Important, actionable research questions for the most important century[1], which is the central question we'll be engaging with in this sequence. I'll also define some terms and compare this sequence to existing works.
If you're already very familiar with Karnofsky (2022)[1] and interpretability, then you can probably skip to the second post in this sequence: Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios
The Alignment Research Activities Question
This sequence is being written as a direct response to the following question from Karnofsky (2022)[1]:
I'll refer to this throughout the sequence as the Alignment Research Activities Question.
Context on the question and why it matters
In the details linked above for the Alignment Research Activities Question, Holden first discusses two categories of alignment research which are lacking in one way or another. He then presents a third category with some particularly desirable properties:
He refers to this as "category (3)", but I'll use the term High-leverage Alignment Research since it's more descriptive and we'll be referring back to this concept often throughout the sequence.
We want to know more about which alignment research is in this category. Why? Further excerpts from Karnofsky (2022)[1] to clarify:
I won't be thoroughly examining other research directions besides interpretability, except in cases where a hypothetical interpretability breakthrough is impacting another research direction toward a potential solution to the alignment problem. So I don't expect this sequence to produce a complete comparative answer to the Alignment Research Activities Question.
But by investigating whether interpretability research is High-leverage Alignment Research, I hope to put together a fairly comprehensive analysis of interpretability research that could be useful to people considering investing their money or time into it. I also hope that someone trying to answer the larger Alignment Research Activities Question could use my work on interpretability in this sequence as part of a more complete, comparative analysis across different alignment research activities.
So in the next post, Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios, I'll be exploring whether interpretability has property #1 of High-leverage Alignment Research. That is, whether interpretability is "likely to be relevant for the hardest and most important parts of the [AI alignment] problem."
Then, in a later post of this sequence, I'll explore whether interpretability has property #2 of High-leverage Alignment Research. That is, whether interpretability is "the sort of thing that researchers can get up to speed on and contribute to relatively straightforwardly (without having to take on an unusual worldview, match other researchers’ unarticulated intuitions to too great a degree, etc.)"
A note on terminology
First of all, what is interpretability?
I’ll borrow a definition (actually two) from Christoph Molnar’s Interpretable Machine Learning (the superscript numbers here are Molnar's footnotes, not mine - you can find what they refer to by following the link):
I also occasionally use the word “transparency” instead of “interpretability”, but I mean these to be synonymous.
Comparison to existing works
This is the first post I’m aware of attempting to answer the Alignment Research Activities Question since Karnofsky (2022)[1] put it forth.
However, there are several previous posts which explore interpretability at a high level and its possible impact on alignment. Many of the ideas in this post are therefore not original: they either draw from these earlier works or were arrived at independently.
Here are some of the relevant posts, and my comments on how they compare to the present sequence:
The second post in this sequence is similar to Neel's post in that it explores potential impacts of interpretability on alignment. My post covers a smaller number of scenarios in greater depth, mostly limiting the type of potential impacts to solving technical alignment. I evaluate each scenario's impact on different aspects of alignment. My post references Neel's as well as his spreadsheet.
Motivation #2 from jylin04's post is about how important interpretability seems for solving inner alignment. We will see this theme recur throughout the second post of the present sequence, where interpretability is identified as being capable of great positive impacts on inner alignment across 7 scenarios and a wide range of analyzed techniques.
The present sequence has a lot in common with the Kirk et al. post. The second post in this sequence similarly considers the impact of interpretability on many different alignment proposals. Later in the sequence, I plan to evaluate whether interpretability research exhibits property #2 of High-leverage Alignment Research. Property #2 is concerned with helping individual researchers find direction in the bottom-up fashion that Kirk et al. have done, but it is also concerned with the possibility of being able to onboard researchers in a more systematic and potentially top-down manner.
It's interesting to note that the RFP's aspirational goal wasn’t actually aspirational enough to make the list of scenarios in the next post of this sequence. (However, many elements in Scenario 1: Full understanding of arbitrary neural networks were inspired by it.) This makes sense when you consider that the purpose of the RFP was to elaborate concrete near-term research directions for interpretability. By contrast, the second post in this sequence requires a more bird's-eye view of interpretability endgames in order to evaluate whether interpretability has property #1 of High-leverage Alignment Research.
Another difference is that I, unlike Olah in the post for Open Phil, am not directly offering anyone money to work on interpretability! ;)
In a sense, the second post in this sequence is gesturing in the opposite direction, highlighting aspirational goals and milestones for interpretability that would be game changers and have plausible stories for solving AGI alignment. However, the ideas aren’t really at odds at all. Paul’s suggestions may be an important tactical component of eventually realizing one or more of the 7 interpretability endgame scenarios considered in the next post of this sequence.
What’s next in this series?
The next post, Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios, explores whether interpretability has property #1 of High-leverage Alignment Research. That is, whether interpretability is "likely to be relevant for the hardest and most important parts of the [AI alignment] problem."
Acknowledgments
Many thanks to Joe Collman, Nick Turner, Eddie Kibicho, Donald Hobson, Logan Riggs Smith, Ryan Murphy and Justis Mills (LessWrong editing service) for helpful discussions and feedback on earlier drafts of this post.
Thanks also to the AGI Safety Fundamentals Curriculum, which is an excellent course I learned a great deal from leading up to writing this, and for which I started this sequence as my capstone project.
Read the next post in this sequence: Interpretability’s Alignment-Solving Potential: Analysis of 7 Scenarios