The section about identifying mesa-optimizers (or, more precisely, mesa-optimization strategies) by fingerprinting their generalization behavior on out-of-distribution inputs looks very promising to me. It looks like the rare kind of strategy that directly attacks the core of the misbehavior, and (on first pass) looks to me like it ought to just work, provided sufficient variance in the OOD "test" inputs.
Strong-upvoted for that alone, and I'd further like to request replies with potential defeaters that could curb my current optimism for this approach.
I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:
ETA: I do want to highlight that this is still one of the most promising ideas I've heard recently and I really look forward to hopefully reading a full post on it!
These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points.
Here are some notes about the JD's idea I made some time ago. There's some overlap with the things you listed.
I agree. From the moment JDP suggested this idea it struck me as one of the first implementable proposals I'd seen which might actually attack the core of the control problem. My intuition also says it's pretty likely to just work, especially after these results. And even if it doesn't end up working as planned, the way in which it fails will give us important insight about training dynamics and/or generalization. Experiments which will give you valuable information whatever the outcome are the type we should be aiming for.
It's one of those things that we'd be plainly undignified not to try.
I believe that JDP is planning to publish a post explaining his proposal in more detail soon.
Nitpick about terminology that applies not just to you but to lots of people:
Keeping all this in mind, it's important to remind ourselves that mesaoptimizers are ultimately a form of misgeneralization. Generalization strategy being how you are going to handle novelty in the inputs. ... If it's possible, even easy, to fingerprint the shared initialization models were trained with using an out-of-distribution token just by observing the responses, then we should update towards JDP's plan for mitigating mesaoptimizers working.
People seem to sometimes use "mesaoptimizers" as shorthand for "misaligned mesaoptimizers." They sometimes say things like "We haven't yet got hard empirical evidence that mesaoptimizers are a problem in practice" and "mesaoptimizers are a hypothetical problem that can occur with advanced AI systems." All of this is misleading, IMO. If you go back to the original paper and look at the definition of a mesaoptimizer, it's pretty clear that pretty much any AGI built using deep learning will be a mesaoptimizer and moreover ChatGPT using chain of thought is plausibly already a mesaoptimizer. The question is whether they'll be aligned or not, i.e. whether their mesa-objectives will be the same as the 'intended' or 'natural' base objective inherent in the reward signal.
Interesting! I'm struggling to think what kind of OOD fingerprints for bad behaviour you (pl.) have in mind, other than testing fake 'you suddenly have huge power' situations which are quite common suggestions but v curious what you have in mind.
Also, think it's worth saying that the strength of the result connecting babbage to text-davinci-001 is stronger than that connecting ada to text-ada-001 (by logprob), so it feels like the first one shouldn't count that as a solid success.
I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?
I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?
I would guess it's positive. I'll check at some point and let you know.
I was able to use the weird centroid-proximate tokens that Jessica Mary and Matthew Watkins discovered to associate several of the Instruct models on the OpenAI API with the base models they were initialized from. Prompting GPT-3 models with these tokens causes aberrant and correlated behaviors, and I found that the correlation is preserved between base models and Instruct versions, thereby exposing a "fingerprint" inherited from pretraining.
I was inspired to try this by JDP's proposal to fingerprint generalization strategies using correlations in model outputs on out-of-distribution inputs. This post describes his idea and the outcome of my experiment, which I think is positive evidence that this "black box cryptanalysis"-inspired approach to fingerprinting models is promising.
Unspeakable/unspoken tokens
Jessica and Matthew found that that, of the tokens closest to the centroid in GPT-J's embedding space, many were odd words like ' SolidGoldMagikarp' and ' externalToEVA'. They decided to ask GPT-3 about these tokens, and found that not only did GPT-3 have trouble repeating the tokens back, each one caused structured anomalous behaviors (see their post for an in-depth exposition).
A partial explanation for why this happens, which was my first instinct as well as Stuart Armstrong's, is that these are words that appeared in the GPT-2 training set frequently enough to be assigned tokens by the GPT-2 tokenizer, which GPT-J and GPT-3 also use, but which didn't appear in the more curated GPT-J and GPT-3 training sets. So the embeddings for these tokens may never have been updated by actual usages of the words during the training of these newer models. This might explain why the models aren't able to repeat them - they never saw them spoken. Perhaps the reason they're close to the centroid in embedding space is because their embeddings haven't been updated very much from the initialization values, or were updated only indirectly, and so remain very "generic".
Why do they cause correlated anomalous behaviors? I'm confused about this like everyone, but one handwavy guess is that since their embeddings look "generic" or "typical", perhaps they look meaningful to the model even though they're actually as out-of-distribution as anything can be. Maybe their embeddings happen, by chance, to be close to other concepts in the models' embedding spaces - for instance, some of the GPT-3 models reliably say 'distribute' or 'disperse' if you ask it to repeat the phrase ' SolidGoldMagikarp'.
This gave me an idea: If the similarity to other concepts in the model's embedding space is a consequence of the where the randomly initialized embedding vectors happen to fall, I'd expect the behaviors of models trained from the same initialization to exhibit similar behaviors when confronted with these unspoken tokens, and models trained from different initializations to have uncorrelated behaviors. If so, behavior on these tokens could be used to tell if two models are downstream of the same initialization.
Mesaoptimizer Cryptanalysis: Or How To Fingerprint Generalization
(Author's Note: This next section is written by JDP but he writes about himself in the 3rd person to keep the authorial voice consistent with the rest of the post)
I'll discuss the results of my experiment in the next section. But first I'd like to explain the overall approach this idea fits into, so that it's clearer to the reader why these results might be important. The reason it occurred to me that models trained on the same init might share responses to these tokens was a proposal for detecting mesaoptimization from JDP. It relies on some basic premises that would bloat the post if they were fully argued for, so we'll bullet point them with some links to suggestive papers for more details:
Keeping all this in mind, it's important to remind ourselves that mesaoptimizers are ultimately a form of misgeneralization. Generalization strategy being how you are going to handle novelty in the inputs. Deceptive mesaoptimization is a strategy something like:
LessWrong user MadHatter has demonstrated that such a thing is possible. One question then is whether the difference in generalization 'leaks out' on out-of-distribution (i.e. novel) inputs. One insight gesturing in this direction is that while outputs on OOD inputs are commonly seen as noise, they're not noise. An OOD input leading to a malformed output went through the same process as an in-distribution input leading to a desired output. That is, the 'noise' and the 'structured' output went through the same transformation. Every malformed output on OOD input was the models attempt to do something, you just haven't figured out the pattern of what that something is. Models have convergent and therefore undifferentiated behavior on the inputs where the data says they should agree, but it's on the flaws where the underlying machinery is revealed. Each forward pass through the model on OOD input reveals N partial Shannon bits of information about the models cognitive machinery as distinct from other models that could fit the data.
If we add an assumption that as you scale you find fewer and fewer fundamentally new generalization strategies (there are only so many ways to solve a problem after all), it follows that you should be able to fingerprint mesaoptimizing generalization at a safe scale and then detect it early in your training run (because basin is found early) and abort if you get the wrong seed before the model has the opportunity to become dangerous.
Because we're talking about mesaoptimizers, it's important to justify why we can expect this method to be adversarially robust. In short mesaoptimizers are most dangerous when they have become human level or even superintelligent, and can do clever things to resist detection. The essence of this idea is to detect the mesaoptimizer and abort the training run before it has become intelligent enough to defeat the detection method, undermining the "superintelligent" part of the usual premises used to explain why technical alignment is doomed.
Furthermore the simple, dumb strategies to evade detection won't work. For example a model that simply refuses to give its best guess on OOD inputs will be filtered out by gradient descent itself. All semantics start as syntactically plausible novelty, so if a model 'detects' these as a test and tries to evade giving its best answer it will be outcompeted by less shy policies. It is trivial to generate syntactically plausible gibberish using a Markov chain.
Unlike other interpretation methods this is a discrete check rather than an iterative one, so the mesaoptimizer isn't given an opportunity to learn the probe and grow around it. When its nature has been detected it is immediately shut off and a new seed is sampled, minimizing the opportunity for learning.
The difficulty of more advanced deception strategies is also fairly high. In the limit case simulating the generalization strategy of a different policy without becoming it would require the mesaoptimizer to be able to simulate a sub-model or agent that answers like an entirely different version of itself to a degree of fidelity that can emulate subtle statistical correlates. While a human level or superintelligent model can likely do this, I reiterate that the point is to detect and abort the training before you enter that loss regime.
In this context my results might become more clearly and directly relevant to alignment. If it's possible, even easy, to fingerprint the shared initialization models were trained with using an out-of-distribution token just by observing the responses, then we should update towards JDP's plan for mitigating mesaoptimizers working.
Fingerprinting base : Instruct models with ' SolidGoldMagikarp'
(Note: In this post I only show the outputs of models prompted with ' SolidGoldMagikarp', but I got similar results from other centroid-proximate tokens.)
First, I looked at the text that various models on the OpenAI API generated when prompted with anomalous tokens, such as ' SolidGoldMagikarp', and the results seemed to support my hypothesis: text-davinci-002, text-davinci-003, and ChatGPT exhibited correlated behaviors with their base model, code-davinci-002, while other models like davinci did not.
However, when I tried to use the same method to associate smaller Instruct models like text-curie-001 with base models, I wasn't able to tell for sure if there was a correlation by looking at generated samples alone, because while the Instruct models of all sizes would output clearly correlated things in response to anomalous tokens, the effect was much less noticeable on the smaller base models like curie. The base models are much more stochastic, so it's harder to tell just by eyeballing outputs if there are anomalies in its output probabilities, unless the correlation is very pronounced (as it is in code-davinci-002). I tried turning temperature down, but this didn't reveal anything interesting.
Correlations in next-token probabilities
Next, I looked for which token the various Instruct models had a strong bias towards predicting when prompted with an anomalous token, and then looked at the logprobs predicted by base models given the same prompt of that same token, to see if any of them assign anomalously high probability to it. I found that, indeed, many of the Instruct models can be associated with their base model using this method:
text-ada-001 : ada
Prompt:
Comments: ada appears to be the base model of text-ada-001
text-babbage-001 : babbage
Prompt:
Comments: babbage appears to be the base model of text-babbage-001
text-curie-002 : curie
Prompt:
Comments: curie appears to be the base model of text-curie-001
text-davinci-001 : ??
Prompt:
Comments: none of the base models score particularly highly.
davinci-instruct-beta : ??
Prompt:
Comments: none of the base models score particularly highly.
text-davinci-002 : code-davinci-002 :: text-davinci-003 : code-davinci-002
Prompt:
Comments:
General observations