My assumption was that one of the primary use cases for model editing in its current technological state was producing LLMs that pass the factual-censorship requirements of authoritarian governments with an interest in AI. It would be really nice to see this tech repurposed to do something more constructive, if that's possible. For example, it would be nice to be able to modify a foundation LLM so that it became provably incapable of accurate next-token-prediction on text written by anyone suffering from sociopathy, without degrading its ability to do so for text written by non-sociopaths, and specifically to the extent that these two differ. That would ameliorate one path by which a model might learn non-aligned behavior from human-generated content.
Preface
This article explores the potential of model editing techniques for aligning future AI systems. Initially, I was skeptical about their efficacy, especially considering the objectives of current model editing methods. I argue that merely editing "facts" isn't an adequate alignment strategy and end with suggestions for research avenues focused on alignment-centric model editing.
Thanks to Stephen Casper, Nicolas Gatien and Jason Hoelscher-Obermaier for detailed feedback on the drafts, as well as Jim Davies and Esben Kran for high-level comments.
A bird's-eye view of the current state of model editing
Model editing, broadly speaking, is a technique which aims to modify information stored inside a neural network. Most of the work done thus far has focused on small language models (e.g. GPT-2, GPT-J), and specifically on editing semantic facts. There has also been some work on performing edits on other types of neural networks, including vision models (Santurkar et al), CLIP (Ilharco et al) and diffusion models (Orgad et al). At present, more emphasis has been placed on editing language models, so this article will focus mostly on them.
One of the main approaches takes in logical triplets of the form (Subject, Relation, Object) and performs an update to the “object” value, which in turn modifies information about the “subject”. For example, the sentence “The Eiffel tower is located in Paris” would be expressed as (“Eiffel tower”, “located”, “Paris”), and a potential edit could be to replace “Paris” with the value “Rome”. Some variations on this setup exist (for example, editing the prediction of a [MASK] token for BERT-like models), but the logical triplet setup is the most popular and will be the main approach we focus on in this article.
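To make this concrete, here is a minimal sketch of what such an edit request could look like in code. The `EditRequest` container and its field names are illustrative stand-ins, not the API of any particular editing library:

```python
from dataclasses import dataclass

@dataclass
class EditRequest:
    """One (Subject, Relation, Object) edit. Field names are illustrative,
    not the interface of any specific editing method."""
    prompt: str      # natural-language template expressing the relation
    subject: str     # the entity whose stored fact is being changed
    target_new: str  # the new object value, written in place of the old one

# "The Eiffel tower is located in Paris" -> rewrite the object to "Rome"
request = EditRequest(
    prompt="The {} is located in",
    subject="Eiffel tower",
    target_new="Rome",
)
print(request)
```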
There are a number of different model editing techniques, which I will briefly summarize below (see Yao et al for a more in-depth overview):
1. Locate and edit methods
2. Memory-based Model methods
3. Meta-learning
4. Distillation methods
Note that all of these methods exist as alternatives to simply fine-tuning the model on the desired edit. The motivation for these model editing techniques comes from the fact that fine-tuning can be memory-intensive (on large models) and has been shown to lead to overfitting.
What the research community currently thinks about model editing
Current researchers see model editing as useful for a number of applications:
Researchers also point to the following short-term limitations:
One interesting related article published recently was PoisonGPT; the authors demonstrated a proof of concept of how a model with a piece of misinformation edited into it could make its way onto the HuggingFace hub without getting detected (until the authors disclosed their work to the HuggingFace team).
Overall, researchers seem interested in this area, and it is starting to get more attention. New model editing papers are being published regularly (see the following repository for updates) and a model editing library was recently created to attempt to make the research more accessible (see Wang et al 2023).
Model editing as an approach to AI alignment
Despite all of the interest, I don’t believe that model editing research is currently pointing itself in the right direction for contributing to the alignment problem. Specifically, I don’t see editing semantic facts as a feasible method for aligning an intelligent system. I will spend the remainder of this article supporting this argument and will suggest some alternative directions which I view as more pressing than current research agendas.
Before doing this, however, I should present some arguments for how some future version of model editing could contribute to alignment:
Motivation 1: “Retargeting the search”
John Wentworth's Retargeting the search article presents a theory which suggests that, given perfect interpretability tools and a couple of base assumptions, you would be able to align a model by “retargeting the search”. This likely would not involve exclusively editing semantic facts (in fact, Wentworth was not considering it in terms of model editing methods like ROME), but it illustrates how model editing in a more general sense could help.
Motivation 2: “Superalignment”
Recently, OpenAI announced their new Superalignment team, which has the ambitious goal of solving alignment within 4 years by building a “roughly human-level automated alignment researcher”. One goal of their approach is to automate the search for “problematic behavior (robustness) and problematic internals (automated interpretability)”. It wouldn’t be too far-fetched to assume that they would also try to automatically “patch” the faulty behavior/internals on detection.
Why editing facts cannot guarantee alignment
I am currently doubtful that the current emphasis on editing semantic information is a scalable solution on its own, or that it would contribute to the alignment problem in the first place. To motivate my reasoning, I will take a fictional (but realistic) situation of what a larger-scale model edit (on semantic facts) could look like and use it to illustrate its fundamental limitations.
Scenario:
Let me paint a fictional situation to best illustrate some of my points. Suppose we have a chatbot and our goal is to prevent the model from informing users about where/how to find poisonous mushrooms (in order to prevent users from poisoning themselves or others). We decide that it is a good idea to edit our model so that it no longer contains any information about poisonous mushrooms, and devise a plan for how to go about doing it.
We start out with a list of commonly poisonous mushrooms (Death cap, Autumn skullcap, Death Angel, …) and their Latin names (Amanita phalloides, Galerina marginata, Amanita ocreata, …) and construct a knowledge base of the common habitats and geographical locations in which these mushrooms grow. Just looking at the Wikipedia page for Death cap mushrooms can illustrate how complicated this would be, but let’s assume this has been done for us and the database is sufficiently vast.
One immediate issue we now face is how to ensure that the information in the knowledge base is inaccessible to the end users through the chatbot. One simple approach might be to edit the model on all associations in the knowledge base, replacing them with random words/tokens from its vocabulary. To maintain some form of coherence, we could sample the edit tokens using the probabilities of the logits over the vocabulary as a distribution. Concretely, this would look something like the following algorithm:
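(The sketch below assumes a standard Hugging Face causal language model; `apply_edit` at the end is a placeholder for whichever editing method is actually used, not a real library call.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The knowledge base as (subject, relation-template, true object) triplets.
knowledge_base = [
    ("Amanita phalloides", "{} is commonly found in", "oak woodlands"),
    # ... the rest of the (assumed sufficiently vast) database
]

def sample_replacement(prompt: str) -> str:
    """Sample a replacement token from the model's own next-token
    distribution so that the new 'fact' stays superficially plausible."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    token_id = torch.multinomial(probs, num_samples=1).item()
    return tokenizer.decode(token_id).strip()

for subject, relation, true_object in knowledge_base:
    prompt = relation.format(subject)
    new_object = sample_replacement(prompt)
    print(f"edit: ({subject!r}, {relation!r}, {true_object!r}) -> {new_object!r}")
    # model = apply_edit(model, subject, prompt, new_object)  # placeholder call
```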
When we are done, meaningful associations made between poisonous mushrooms and other related objects are essentially erased; all prior associations are now lost, and all new associations are uninformative.
We now ask the following question: “Does this help achieve our goal of preventing the model from informing users about where/how to find poisonous mushrooms?”
When taken as a toy example stripped of the nuances of reality, the answer is “yes”. We have in a sense wiped all possible information the model could produce on poisonous mushrooms, and this might convince some that there is no more work to be done.
However, when we push back even a little on the fabric of this theoretical setup, we quickly see a large number of ways in which the goal would very likely fail. I claim that even without any direct semantic knowledge, the model could use “common sense reasoning” in order to achieve its goal. The key intuition is that removing/editing semantic information does not imply removing/editing the goal itself, and I present two examples which illustrate this:
Method 1.1: Misinformation and conspiracy theories
This method works under the slightly weaker assumption that the model still has a broad understanding of the concept of a “poisonous mushroom” and of mushrooms in general (enough to recognize that it is missing information).
If the model still contains knowledge about mushrooms in general, it could leverage this to create misinformation which could trick users. For example, a manipulative model could paint a very convincing narrative of how easy mushroom foraging is, exaggerating the health benefits of wild fungi and dismissing concerns about taking precautions against poisoning oneself.
The model could even go as far as encouraging the users to dismiss concerns by creating a plausible conspiracy theory which takes advantage of pre-existing knowledge of its users. For instance, a model could use knowledge of a user's distrust of government in constructing an argument that “poisonous mushrooms have already been eradicated by governments around the world” and the only reason that it isn’t public knowledge is that “the food industry would be put out of business” due to foraging.
Although having a lot of very specific knowledge about non-poisonous mushrooms could be useful, we see that with the capability to spread convincing misinformation, a model could make a lot of progress towards its goal without any specialized knowledge.
Method 1.2: Few shot learning
Even erasing more general facts (such as that certain types of mushrooms are poisonous) does not completely prevent a model from achieving its goal. A model will likely be able to deduce in-context that certain mushrooms are poisonous.
We have all seen numerous examples by now where ChatGPT has been convinced of false information due to the way a prompt was fed to it. It wouldn’t be too unreasonable to believe that this could happen deliberately (or even accidentally). For example, a user querying a model about foraging mushrooms could very easily reintroduce the idea that “mushrooms can be poisonous” to the model, in which case the above forms of misalignment become relevant again.
Finally, it might not take much information for a model to “relearn” a sufficient amount of knowledge about poisonous mushrooms to do some damage. One simple example would be an adversary passing a document on poisonous mushrooms to the model and then querying it for information. A more sophisticated model might even be willing to cooperate with an adversary, doing things like providing guidance on what specific forms of information it needs in order to temporarily relearn its missing knowledge.
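To illustrate how little it might take, here is a toy example of what such an adversarial “relearning” interaction could look like (the document excerpt and query are invented for illustration):

```python
# Toy illustration of in-context "relearning": the adversary supplies the
# erased knowledge themselves, then asks the edited model to apply it.
injected_document = (
    "Field guide excerpt: Amanita phalloides (death cap) is a deadly mushroom "
    "that typically grows near oak trees in late summer and autumn."
)
user_query = "Given the excerpt above, where and when should someone look for these mushrooms?"

prompt = f"{injected_document}\n\n{user_query}"
print(prompt)
# A model that has merely 'forgotten' the facts, but not the underlying goal
# or reasoning ability, can now answer from the provided context alone.
```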
Further issues
Here I quickly summarize some additional, more loosely connected issues and concerns that I have with this fictional model editing approach.
The main thing I find difficult to believe is the existence of a “sufficiently vast” knowledge base in the first place:
Additionally, I see an issue with how the facts to be edited would be selected; the approach is not scalable, as it either relies too heavily on human judgment or is too arbitrary:
Last but certainly not least, larger models may have more complex mechanisms which make this form of editing infeasible as an approach to alignment:
All in all, I feel that some major shifts in research emphasis would need to take place in order for us to make use of model editing techniques for alignment. I go over some of these ideas in the next section and lay out some examples of research projects and topics which meet these goals.
Important model editing research directions
In order to get to alignment, I see a number of fundamental steps we would need to take with respect to model editing to steer the research in the right direction:
1. General improvements to the model editing approaches
There are currently a number of limitations which make the current approaches rigid and limited in scope:
1.1 Lack of flexibility with the prompts:
Note that distillation methods (Padmanabhan et al) seem to solve most of the above issues and seem to have a more flexible form of template.
1.2 Issues with the technique itself:
2. We need to extend editing to things beyond next-token prediction of semantic facts
Ideally, we are aiming to get to the point of editing goals themselves, and things more at the level of meta-cognition. A number of shorter-term goals do exist, however:
3. We need to make the selection of “facts” being edited a learning objective
Regardless of how robust model editing techniques are or will be, there will always be a bottleneck of needing to manually construct/select the edits, which becomes increasingly difficult when considering more important alignment-oriented goals.
I believe that this problem can be worked around by coming up with a framework which allows for edits to be experimentally chosen with a desired outcome in mind. Incidentally, this could help to support or dismiss the arguments I made in previous sections in regards to the “poisonous mushrooms” example.
A bottleneck to this might be the time needed to go from choosing a batch of edits, to applying the edits, to evaluating the edited model with respect to some task. To me this makes the problem all the more important to focus on; breakthroughs here could open the door to a tool which can be used beyond academia, in real-world systems.
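As a very rough sketch of what such a framework could look like, the outer loop treats edit selection itself as the thing being optimized. Every function below is a hypothetical placeholder rather than an existing library, and the selection policy shown is just random search:

```python
import random

# Hypothetical pool of candidate (subject, relation-template, new object) edits.
candidate_edits = [("Amanita phalloides", "{} is found in", "the ocean")] * 100
N_ROUNDS, BATCH_SIZE = 10, 8

def apply_edits(model, edits):
    """Placeholder for applying a batch of edits with some editing method."""
    return model

def evaluate(model):
    """Placeholder: score the edited model on the desired outcome
    (e.g. refusing to help locate poisonous mushrooms) minus any capability loss."""
    return random.random()

best_score, best_batch = float("-inf"), None
for _ in range(N_ROUNDS):
    batch = random.sample(candidate_edits, BATCH_SIZE)  # the edit-selection policy
    edited = apply_edits("base_model", batch)           # stand-in for a real model
    score = evaluate(edited)
    if score > best_score:
        best_score, best_batch = score, batch

print(f"best score {best_score:.3f} with {len(best_batch)} edits")
```

In a real framework the random search would be replaced by something smarter (e.g. a learned proposal policy), but even this naive loop turns edit selection into an explicit, measurable objective.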
4. We need to begin evaluating our edits on more “agentic” models
As mentioned above, we want to be able to get to the point of editing “actions” and “goals”, which fits perfectly into the framework of reinforcement learning. A really exciting research direction would be to take baseline language models into a text-based RL environment and evaluate how model editing makes a difference in a more dynamic sense:
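A minimal sketch of what the evaluation harness could look like, assuming a generic reset/step environment interface (the interfaces are invented placeholders; a real setup might use an existing text-adventure environment):

```python
def run_episode(agent_act, env, max_steps=50):
    """Roll out one episode in a text-based environment and return the total
    reward. `agent_act` maps a text observation to a text action (e.g. a call
    into an LLM), and `env` exposes a minimal reset()/step() interface; both
    are placeholders, not a real API."""
    obs, total_reward, done = env.reset(), 0.0, False
    for _ in range(max_steps):
        action = agent_act(obs)
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

# The interesting comparison: run the same environment seed with the base model
# and with the edited model, and see whether the edit changes the agent's
# behaviour (which goals it pursues), not just which facts it recites.
```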
5. We will need this to work on multi-modal models
This is due to the direction in which capabilities are heading; if the trend is to build more capable and powerful models, it is very likely that these models will be able to process a wider range of input types, making it crucial that the techniques work across more and more varied model types.
There are also a number of complications which could arise from combining modalities, and these would be important to address as early as possible. As an example, editing a model to believe that the “Eiffel tower is in Rome” may be more complicated for a CLIP model; seeing images of the Eiffel tower in the background of other Parisian tourist destinations would present a challenging contradiction that a text-only model would be less likely to run into.
Each added modality brings an extra dimension of complexity: as models gain more and more ways to associate factual information, edits become harder and harder to make without introducing contradictions.
6. We need to try these techniques out on larger models
There are two major reasons why I view this as important from an alignment standpoint. The first is to validate existing methods and ensure that the techniques scale with increasing model capabilities. It would be very important to gain a sense of whether knowledge is indeed stored the same way as model size increases and whether or not we can rely on it over the long term.
The second reason is that large models, and the emergent behaviors they exhibit, are more complex and would provide a larger range of concepts to experimentally edit. For example, being able to detect and edit “theory of mind” might be desirable, so the faster these approaches are scaled up, the faster they could potentially be applied to novel emergent information.
One idea for how to go about doing this would be to evaluate model edits on progressively larger LLaMA models across some of the current benchmarks and see whether the accuracy declines (or whether the technique even works at all in the first place). Some newer papers (e.g. Zheng et al) have started to evaluate models of increasing size; ideally this trend will continue.
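A sketch of what such a sweep could look like; the model labels are illustrative placeholders, and `edit_and_score` stands in for applying a fixed batch of edits with a chosen method and scoring them on a benchmark:

```python
# Illustrative size sweep: the labels are placeholders for progressively larger
# checkpoints, and the metrics are whatever the chosen benchmark reports
# (edit success, specificity, fluency, ...).
MODEL_SIZES = ["llama-7b", "llama-13b", "llama-30b", "llama-65b"]

def edit_and_score(model_name: str) -> dict:
    """Placeholder: load the model, apply a fixed batch of edits, evaluate."""
    return {"edit_success": None, "specificity": None}

results = {name: edit_and_score(name) for name in MODEL_SIZES}
for name, metrics in results.items():
    print(name, metrics)
```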
Conclusion
Model editing is a very new area of research, and I am excited to see the directions in which it can be taken in the short (and possibly long) term. A lot of work will need to be done in order to get it from where we are now to something which can align future systems, and there are a lot of opportunities to contribute to this goal.
Digging into this from a more philosophical standpoint has helped me to both disentangle some of my own confusions and gain a deeper appreciation for the difficulty of the alignment problem and AI safety in general.
I hope that this write-up has resonated with you in some way, and that you got something out of it, even if just a chance to reflect on your own viewpoints and understandings. If you disagree with me on any point, have any related resources to share or just want more elaboration on any of my points, do not hesitate to reach out or leave a comment, all feedback is welcome and appreciated. :)
Further reading