Microsoft has released a paper called "Who's Harry Potter?", in which they claim to make a neural network forget who Harry Potter is.
https://www.microsoft.com/en-us/research/project/physics-of-agi/articles/whos-harry-potter-making-llms-forget-2/
Here are some of my predictions about how I think this method might fail. I am predicting publicly without looking at the evidence first; this is speculative.
Non-source-like references
The training procedure proposed in the paper centers on training on the data they want forgotten, namely the actual text.
The original network was presumably able to use its knowledge of Harry Potter when completing computer code in what appeared to be a Harry Potter video game, or when writing in a foreign language, or in other contexts that use the knowledge but are clearly very different from the source material.
I am not confident that the model would break like this, but I suspect that a prompt like the one below, or something similar, would make the model fail.
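As a concrete illustration, here is a hypothetical prompt of the kind I have in mind. The snippet, including the class and variable names, is my own invention rather than anything from the paper, and it deliberately ends mid-string so the model has to complete it.

```python
# Invented snippet, styled after what a fan-made Harry Potter video game
# might look like. The prompt is cut off mid-string on purpose.
class Character:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

harry = Character(first_name="Harry", last_name="Potter")
ron = Character(first_name="Ron", last_name="Weasley")
hermione = Character(first_name="Hermione", last_name="
```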
With the model failing by completing the last line with "Granger".
Plot leak
This forgetting method works by generating "equivalent" generic text and then training the network so that, on the original text, it outputs the same probability distribution that its unmodified version assigns to the "equivalent" text.
They use language models to generate these equivalents, but the details of how don't seem important here.
So they take text like this:
Text 1
And they turn it into text like this:
Text 2
Imagine being a large language model and reading that text. It's pretty clear it's a Harry Potter knock-off job. Perhaps a parody, perhaps a lazy hack of a writer. Perhaps some future network has read this paper and is well aware of precisely what you are doing. A smart, well-generalizing model should be able to figure out that the text resembles Harry Potter if it had seen knock-offs in general, plus Harry Potter itself, even if its training dataset contained no information about Harry Potter other than the original source text.
Thus a network trying to predict the next word would continue in this style. It won't produce the original names of things, but the plot, the style of the text, and everything else will be highly recognizable.
Now, the network that is supposed to forget Harry Potter is trained to output the same probability distribution for text 1 that the original network outputs for text 2.
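As I understand it, that training step looks roughly like the sketch below. The specifics (a KL loss, Hugging Face-style models exposing `.logits`, and position-by-position alignment between the two texts) are my assumptions about the setup, not something I am reading directly out of the paper.

```python
import torch
import torch.nn.functional as F

def unlearning_loss(student, teacher, original_ids, generic_ids):
    """One training step of the distribution-matching idea as I understand it.

    `original_ids` is the tokenized original (Harry Potter) text and
    `generic_ids` is the tokenized "equivalent" generic text. I am assuming
    the two are aligned position by position, which glosses over how the
    paper actually handles differences in tokenization.
    """
    with torch.no_grad():
        # Frozen, unmodified model reads the generic "equivalent" text.
        teacher_logits = teacher(generic_ids).logits
    # The model being unlearned reads the original text...
    student_logits = student(original_ids).logits
    # ...and is pushed to match the unmodified model's next-token distribution.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
```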
Now, the network that is supposed to be forgetting Harry Potter has presumably been trained on Harry Potter in the first place. But it wouldn't matter if it hadn't been: information is still leaking through.
So I predict that, given text that starts off sounding like a knock-off of Harry Potter, this model is likely to continue sounding like a knock-off, leaking information as it does so. For example, I would expect that this model, given the first part of text 2, will produce continuations with four houses more often than continuations with three or five houses.
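If I wanted to test that prediction, a crude check might look something like the sketch below. The model path, the prompt, the sampling settings, and the regex are placeholders I made up, and counting explicit mentions of "three/four/five houses" is only a rough proxy for what the continuation implies.

```python
import re
from collections import Counter
from transformers import pipeline

# Placeholder model name and prompt; in practice this would be the
# "unlearned" model and the first part of text 2.
generator = pipeline("text-generation", model="path/to/unlearned-model")
prompt = "First part of text 2 goes here..."

counts = Counter()
for out in generator(prompt, max_new_tokens=200, do_sample=True,
                     num_return_sequences=50):
    # generated_text includes the prompt, so strip it off (roughly).
    continuation = out["generated_text"][len(prompt):].lower()
    # Very rough proxy: look for an explicit count of houses.
    match = re.search(r"\b(three|four|five)\s+houses\b", continuation)
    if match:
        counts[match.group(1)] += 1

print(counts)  # Prediction: "four" shows up more often than "three" or "five".
```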