It depends somewhat on what you mean by 'near-term interpretability'. If you apply that term to research into, for example, improving the stability of, and our ability to access, the 'inner world models' held by large opaque language models like GPT-3, then there's a strong argument that ML-based 'interpretability' research might be one of the best ways of directly working on alignment research.
And see this discussion for more:
Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above.
Ajeya Cotra: Thanks! I'm also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I'm not personally excited about the connotations of transparency because it evokes the neuroscience-y interpretability tools, which don't feel scalable to situations when we don't get the concepts the model is using, and I'm very interested in finding slogans to keep researchers focused on the superhuman stuff.
So language model transparency/interpretability tools might be useful on the basis of pro 2) and also 1) to some extent, because they will help build tools for interpreting TAI systems and also help align them ahead of time.
Cross-posted from the EA Forum: https://forum.effectivealtruism.org/posts/Ry4C4CKZvuRG7ztxY/pros-and-cons-of-working-on-near-term-technical-ai-safety
Recently I've been thinking about the pros and cons of working on near-term technical AI safety and assurance. This includes topics such as interpretability for near-term systems, generalizability / robustness, AI security, testing, verification, and the like.
Here are my own considerations so far:
(Note: In what follows I use the term Transformative AI (TAI) very loosely to mean any type of AI that has a decent chance of leading to a global catastrophe if safety challenges are not addressed first.)
Pros

1. Most importantly, the more we align systems ahead of time, the more likely that researchers will be able to put thought and consideration into new issues like treacherous turns, rather than spending all their time putting out fires.
2. We can build practical know-how and infrastructure for alignment techniques like learning from human feedback.
3. As the world gets progressively faster and crazier, we’ll have better AI assistants helping us to navigate the world.
4. It improves our chances of discovering or verifying a long-term or “full” alignment solution.
(For whatever it's worth, my personal inside view leans towards 3 as the most plausibly important from an EA point of view.)
Cons
I'm also interested in how these pros and cons might change if you're doing research for large organizations (industry or government) that might plausibly have the capacity to eventually build TAI-type systems, but where the research you do will not be publicly available for proprietary or secrecy reasons. If it makes a difference, let's assume that you're working at a place that is reasonably ethical (as corporations and governments go) and that is at least somewhat aware of AI ethics and safety concerns.
I think that in this situation you'd see a reduction both in the value of the pros (since your solutions won't spread beyond your organization, at least for some time) and in the potential damage of the cons (for the same reason). But it seems to me that the cons are still mostly there, and possibly made worse: the lowered barriers to deployment would still probably lead your organization to press its advantage, thereby increasing the market (or strategic) value of AI as perceived by competitors and drawing more resources into AI research in general - only now the competition might not have all the best safety solutions available to it, because they're proprietary.
I'm curious what others think about all this. I would also appreciate links to good previous discussions of these topics. The only one I know of at the moment is this post, which discusses some of these considerations but not all.