I think this is really exciting and I’m very interested to see how it goes. I think the current set of problems and methodologies is solid enough that participants have a reasonable shot at making meaningful progress within a month. I also expect this to be a useful way to learn about language models and to generally be in a better position to think about alignment.
I think we’re still a long way from understanding model behavior well enough that we could e.g. rule out deceptive alignment, but it feels to me like recent work on LM interpretability is making real progress towards that goal, and I can imagine having large teams studying frontier models closely enough to robustly notice deceptive alignment well in advance by the time we have transformative AI.
I'm really excited about this program! Super curious to see what comes out of it - I expect I'll learn a lot whether it goes well, or struggles to get traction. And I want to see more of this kind of ambitious scalable alignment effort!
If you're interested in getting into mechanistic interpretability work, you should definitely apply to it.
I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no, but I am broadly a bit confused about when this is a commitment for.
Also, are people going through as cohorts or will they start with the training week whenever they show up, not necessarily in-sync with anyone else?
Also, is the idea to be doing self-directed research by default, or research in collaboration with Redwood staff by default? I don't know what my default action is day-to-day during this program. Do I have to come in with a bunch of research plans already?
Thanks for the questions :)
I was trying to figure out whether someone who is just here for the month of November should apply. I think the answer is no,
Probably no.
but I am broadly a bit confused about when this is a commitment for.
Yeah we haven't totally settled this yet; the application form asks a lot of questions about availability. I think the simplest more specific answer is "you probably have to be available in January, and it would be cool if you were available earlier and wanted to get here earlier and do this for longer".
Also, are people going through as cohorts or will they start with the training week whenever they show up, not necessarily in-sync with anyone else?
Not totally settled. We'll probably have most people at a big final cohort in January, and we'll try to have people who arrive earlier show up at synced times so that they can do the training week with others.
Also, is the idea to be doing self-directed research by default, or research in collaboration with Redwood staff by default? I don't know what my default action is day-to-day during this program. Do I have to come in with a bunch of research plans already?
The default is to do research directed by Redwood staff. You do not need to come in with any research plans.
This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We’re excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel.
REMIX participants will work to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology to formalize and evaluate interpretability hypotheses. We hope to produce many more explanations of model behaviors akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks. We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems (more in Goals and FAQ).
Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field. We also think participants will learn skills valuable for many styles of interpretability research, and also for ML research more broadly.
Apply here by Sunday, November 13th [DEADLINE EXTENDED] to be a researcher in the program. Apply sooner if you’d like to start early (details below) or receive an earlier response.
Some key details:
Feel free to email programs@rdwrs.com with questions.
Goals
Why you should apply:
Research results. We are optimistic about the research progress you could make during this program (more below). Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field.
Skill-building. We think this is a great way to gain experience working with language models and interpreting/analyzing their behaviors. The skills you’ll learn in this program will be valuable for many styles of interpretability research, and also for ML research more broadly.
Financial support & community. This is a paid opportunity, and a chance to meet and connect with other researchers interested in interpretability.
Why we’re doing this:
Research output. We hope this program will produce research that is useful in multiple ways:
Training and hiring. We might want to hire people who produce valuable research during this program.
Experience running large collaborative research projects. It seems plausible that at some point it will be useful to run a huge collaborative alignment project. We’d like to practice this kind of thing, in the hope that the lessons learned are useful to us or others.
See “Is this research promising enough to justify running this program?” and “How useful is this kind of interpretability research for understanding models that might pose an existential risk?”
Why do this now?
We think our recent progress in interpretability makes it a lot more plausible for us to reliably establish mechanistic explanations of model behaviors, and therefore get value from a large, parallelized research effort.
A unified framework for specifying and validating explanations. Previously, a big bottleneck on parallelizing interpretability research across many people was the lack of a clear standard of evidence for proposed explanations of model behaviors (which made us expect the research produced to be pretty unreliable). We believe we’ve recently made some progress on this front, developing an algorithm called “causal scrubbing” which allows us to automatically derive an extensive set of tests for a wide class of mechanistic explanations. This algorithm is only able to reject hypotheses rather than confirming them, but we think that this still makes it way more efficient to review the research produced by all the participants.
Improved proofs of concept. We now have several examples where we followed our methodology and were able to learn a fair bit about how a transformer was performing some behavior.
Tools that allow complicated experiments to be specified quickly. We’ve built a powerful library for manipulating neural nets (and computational graphs more generally) for doing intervention experiments and getting activations out of models. This library allows us to do experiments that would be quite error-prone and painful with other tools.
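To give a feel for the kind of test described above, here is a minimal sketch of a resampling-style check on a toy function. This is not Redwood's causal scrubbing algorithm or library; the names `model`, `scrubbed_loss`, and `relevant_idx` are hypothetical and invented for illustration. The idea it illustrates is only the general one: if a hypothesis says some inputs are irrelevant, replacing them with values from other examples should not change the output.

```python
import random

def model(x):
    # Toy stand-in for a network (an assumption for illustration):
    # the output depends on x[0] and x[1] and ignores x[2].
    return 2 * x[0] + x[1]

def scrubbed_loss(model, dataset, relevant_idx, rng):
    """Mean squared change in output when every coordinate that the
    hypothesis calls irrelevant is resampled from a random other example."""
    total = 0.0
    for x in dataset:
        donor = rng.choice(dataset)
        scrubbed = [x[i] if i in relevant_idx else donor[i] for i in range(len(x))]
        total += (model(x) - model(scrubbed)) ** 2
    return total / len(dataset)

rng = random.Random(0)
dataset = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]

# Hypothesis A: only x[0] and x[1] matter. Scrubbing x[2] changes nothing.
loss_good = scrubbed_loss(model, dataset, {0, 1}, rng)
# Hypothesis B: only x[0] matters. Scrubbing x[1] changes the output.
loss_bad = scrubbed_loss(model, dataset, {0}, rng)

print("hypothesis A loss:", loss_good)
print("hypothesis B loss:", loss_bad)
```

Hypothesis B is rejected because its scrubbed loss is nonzero. Hypothesis A passes, but so would the vacuous hypothesis that every coordinate matters, which is one way to see why a test of this kind can reject hypotheses but cannot confirm them.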
Who should apply?
We're most excited about applicants comfortable working with (basic) Python, any of PyTorch/TensorFlow/Numpy, and linear algebra. Quickly generating hypotheses about model mechanisms and testing them requires some competence in these domains.
If you don’t understand the transformer architecture, we’ll require that you go through preparatory materials, which explain the architecture and walk you through building one yourself.
We’re excited about applicants with a range of backgrounds; prior experience in interpretability research is not required. The primary activity will be designing, running, and analyzing results from experiments which you hope will shed light on how a model accomplishes some task, so we’re excited about applicants with experience doing empirical science in any field (e.g. economics, biology, physics). The core skill we’re looking for here, among people with the requisite coding/math background, is something like rigorous curiosity: a drive to thoroughly explore all the ways the model might be performing some behavior and narrow them down through careful experiments.
What is doing this sort of research like?
Mechanistic interpretability is an unusual empirical scientific setting in that controlled experimentation is relatively easy, but there’s relatively little knowledge about the kinds of structures found in neural nets.
Regarding the ease of experimentation:
Regarding the openness of the field:
REMIX participants will pursue interpretability research akin to the investigations Redwood has done recently into induction heads, indirect object identification (IOI) in small language models, and balanced parenthesis classification, all of which will be released publicly soon. You can read more about behavior selection criteria here.
The main activities will be:
The mechanisms for behaviors we’ll be studying are often surprisingly complex, so careful experimentation is needed to accurately characterize them. For example, the Redwood researchers investigating the IOI behavior found that removing the influence of the circuit they identified as primarily responsible had surprisingly little effect on the model’s ability to do IOI. Instead, other heads in the model substantially changed their behavior to compensate for the excision. As the researchers write, “Both the reason and the mechanism of this compensation effect are still unclear. We think that this could be an interesting phenomenon to investigate in future work.”
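A toy sketch can make the compensation effect concrete. Everything here is invented for illustration: the two "heads", the 0.8 factor, and the function `logit_diff` are hypothetical and are not taken from the IOI work or from any real model. The sketch only shows why a single ablation can understate how much a component matters when another component steps in.

```python
def logit_diff(ablate_primary=False, ablate_backup=False):
    """Toy score for the correct answer, built from two hypothetical heads."""
    # Primary head: writes a fixed signal of 1.0 unless ablated.
    primary = 0.0 if ablate_primary else 1.0
    # Backup head: stays quiet when the primary signal is present and
    # contributes up to 0.8 when the primary signal is missing.
    backup = 0.0 if ablate_backup else 0.8 * max(0.0, 1.0 - primary)
    return primary + backup

baseline = logit_diff()
without_primary = logit_diff(ablate_primary=True)
without_both = logit_diff(ablate_primary=True, ablate_backup=True)

print("baseline:", baseline)
print("primary ablated:", without_primary)
print("primary and backup ablated:", without_both)
```

In this toy, ablating the primary head alone costs only a small part of the score, while ablating both heads removes all of it. An experiment that stopped after the first ablation would wrongly conclude that the primary head does little.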
Here’s how a Redwood researcher describes this type of research:
This research also involves a reasonable amount of linear algebra and probability theory. Researchers will be able to choose how deep they want to delve into some of the trickier math we’ve used for our interpretability research (for example, it turns out that one technique we’ve used is closely related to Wick products and Feynman diagrams).
Schedule
The program will start out with a week of training using our library for computational graph rewrites and investigating model behaviors using our methodology. This week will have a similar structure to MLAB (our machine learning bootcamp), with pair programming and a prepared curriculum. We’re proud to say that past iterations of MLAB have been highly reviewed – the participants in the second iteration gave an average score of 9.2/10 to the question “How likely are you to recommend future MLAB programs to a friend/colleague?”
An approximate schedule for week one:
In future weeks, you’ll split your time between investigating behaviors in these models, communicating your findings to the other researchers, and reading/learning from/critiquing other researchers’ findings.
Miscellaneous Notes
FAQ
What if I can’t make these dates?
We encourage you to submit an application even if you can’t make the dates; we have some flexibility, and might make exceptions for exceptional applicants. We’re planning to have some participants start as soon as possible to test-drive our materials, practice our research methodology, and generally help us structure this research program so it goes well.
Am I eligible if I’m not sure I want to do interpretability research long-term?
Yes.
What’s the application process?
You fill out the form, complete some TripleByte tests that assess your programming abilities, then do an interview with us.
Can you sponsor visas?
Given this program is a research sprint rather than a purely educational program, and given the fact that we plan to offer stipends for participants, we can’t guarantee sponsorship of the right-to-work visas required for international participants to be in person. If you are international but studying at a US university, we are optimistic about getting a CPT for you to be able to participate.
However, we still encourage international candidates to apply. We’ll try to evaluate on a case-by-case basis, and for exceptional candidates, depending on your circumstances, there may be alternatives, like trying to sponsor a visa to have you join later or participating remotely for some period.
Is this research promising enough to justify running this program?
Buck’s opinion:
How useful is this kind of interpretability research for understanding models that might pose an existential risk?
This research might end up not being very useful. Here’s Buck’s description of some reasons why this might be the case:
Overall, we think that mechanistic interpretability is one of the most promising research directions for helping prevent AI takeover. Our hope is that mature interpretability techniques will let us distinguish between two ML systems that each behave equally helpfully during training – even having exactly the same input/output behavior on the entire training dataset – but where one does so because it is deceiving us and the other does so “for the right reasons.”[1]
Our experience has been that explaining model behaviors supports both empirical interpretability work – guiding how we engineer interpretability tools and providing practical knowhow – and theoretical interpretability work – for example, leading to the development of the causal scrubbing algorithm. We expect many of the practical lessons that we might learn would generalize to more advanced systems, and we expect that addressing the theoretical questions that we encounter along the way would lead to important conceptual progress.
Currently, almost no interesting behaviors of ML models have been explained – even for models that are tiny compared with frontier systems. We have been working to change this, and we’d like you to help.
For other perspectives on this question, see Chris Olah’s description of the relevance of thorough, small-model interpretability here and Paul Christiano’s similar view here.
Apply here by November 8th to be a researcher in the program, and apply sooner if you want to start ASAP. Sooner applications are also more likely to receive sooner responses. Email programs@rdwrs.com with questions.
[1] The problem of distinguishing between models which behave identically on the training distribution is core to the ELK problem.