Thanks for sharing! I think the paper is cool (though massively buries the lede). My summary:
Yeah, sorry, I should have made clear that the story that I tell in the post is not contained in the linked paper. Rather, it's a story that David Bau sometimes tells during talks, and which I wish were wider-known. As you note, the paper is about the problem of taking specific images and relighting them (not of generating any image at all of an indoor scene with unlit lamps), and the paper doesn't say anything about prompt-conditioned models. As I understand things, in the course of working on the linked project, Bau's group noticed that they couldn't get scenes with unlit lamps out of the popular prompt-conditioned generative image models.
Ah, thanks for the clarification! That makes way more sense. I was confused because you mentioned this in a recent conversation, I excitedly read the paper, and then couldn't see what the fuss was about (your post prompted me to re-read and notice section 4.1, the good section!).
Another thought: The main thing I find exciting about model editing is when it is surgical - it's easy to use gradient descent to find ways to intervene on a model, while breaking performance everywhere else. But if you can really localise where a concept is represented in the model and apply it there, that feels really exciting to me! Thus I find this work notably more exciting (because it edits a single latent variable) than ROME/MEMIT (which apply gradient descent)
This post advertises an illustrative example of model editing that David Bau is fond of, and which I think should be better known. As a reminder, David Bau is a professor at Northeastern; the vibe of his work is "interpretability + interventions on model internals"; examples include the well-known ROME, MEMIT, and Othello papers.
Consider the problem of getting a generative image model to produce an image of a bedroom containing unlit lamps (i.e. lamps which are turned off). Doesn't sound particularly interesting. Let's try the obvious prompts on DALL-E-2.
Doesn't work great: my first three attempts only got one borderline hit, at the expense of turning the entire bedroom dark (which isn't really what I had in mind).
As David Bau tells things, even after putting further effort into engineering an appropriate prompt, they weren't able to get what they wanted: a normal picture of a bedroom with lamps which are not turned on. (Apparently the captions for images containing unlit lamps don't often mention the lamps.)
Despite not being able to get what they wanted with prompt engineering, they were able to get what they wanted via interpretability + model editing. Namely, they took a GAN which was trained to produce images of bedrooms, and then:
The result is shown below.
I like this example because it provides a template in a very toy setting for how model editing could be useful for alignment. The basic structure of this situation is:
This example also showcases some key weaknesses of this technique, which would need to be addressed for model editing to become a viable alignment strategy:
Overall, I'm not super bullish on the usefulness of model editing for alignment. But I do think it's intriguing, and it seems to have been useful in at least one case (though not necessarily one which is very analogous to x-risky cases of misalignment). Overall, I think that work like this is a risky bet, with the advantage that some of its failure modes might differ from the failure modes of other alignment techniques.
[Thanks to Xander Davies and Oam Patel for discussion.]
That is, I find it very likely that DALL-E-2 "can" produce images of bedrooms with unlit lamps, for whatever reasonable interpretation you'd like of what that "can" means.
The general vibe here is "the effectiveness of model editing scales with our ability to generate accurate interpretability hypotheses, whereas the effectiveness of finetuning scales with something different." In more detail:
* Finetuning relies on being able to effectively evaluate outputs, whereas model editing might not. For example, suppose that we would like to get a language model to write good economics papers. Finetuning for human evaluations of economics papers could jointly optimize for papers which are more true as well as more persuasive. On the other hand, if our interpretability were good enough to separate out the mechanisms for writing truly vs. writing persuasively, then we could hope to make a model edit which results in more true but not more persuasive papers output.
* Finetuning might require large datasets, whereas model editing might not. Again in the economics paper writing example, finetuning a model to write good economics papers might require a large number of evaluations. On the other hand, we might be able to generate accurate interpretability hypotheses using only a small number of dataset examples (and possibly no dataset examples at all).
* Model editing might transfer better out-of-distribution. If the interpretability hypothesis on which we base a model edit is good, then the model edit ought to have the predicted effect on a broad distribution of inputs. On the other hand, fine-tuning only guarantees that we get the desired behavior on the fine-tuning distribution.