tl;dr: This is a follow-up to our original post on prompt generation and the anomalous token phenomenon which emerged from that research. Work done by Jessica Rumbelow and Matthew Watkins in January 2023 at SERI-MATS.
Clustering
As a result of work done on clustering tokens in GPT-2 and GPT-J embedding spaces, our attention was originally drawn to the tokens closest to the centroid of the entire set of 50,257 tokens shared across all GPT-2 and -3 models.[1] These tokens were familiar to us for their frequent occurrence as closest tokens to the centroids of the (mostly semantically coherent, or semi-coherent) clusters of tokens we were producing via the k-means algorithm. Here are a few more selections from such clusters. Distances shown are Euclidean, and from the cluster's centroid (rather than the overall token set centroid):
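For those wanting to reproduce the clustering, here is a minimal sketch assuming the HuggingFace transformers and scikit-learn packages (the cluster count and other settings are illustrative, not our exact configuration):

```python
import numpy as np
from sklearn.cluster import KMeans
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Token embedding matrix: shape (50257, 768) for GPT2-small.
embeddings = model.wte.weight.detach().numpy()

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(embeddings)

# For each cluster, list the tokens nearest (in Euclidean distance) to its centroid.
for i, centroid in enumerate(kmeans.cluster_centers_[:5]):
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    nearest = np.argsort(dists)[:10]
    print(f"cluster {i}:", [tokenizer.decode([int(t)]) for t in nearest])
```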
Distance-from-centroid hypothesis
Our hypothesis was that the anomalous tokens which kept showing up as the nearest tokens to the centroids of such clusters were simply the tokens closest to the overall centroid of the token set. This turned out to be correct for GPT2-small and GPT-J. However, the opposite was true for GPT2-xl, where the anomalous tokens tend to be found as far as possible from the overall centroid.
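This is easy to check directly; a minimal sketch for GPT2-small (the same procedure applies to GPT2-xl and GPT-J):

```python
import numpy as np
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
embeddings = GPT2Model.from_pretrained("gpt2").wte.weight.detach().numpy()

# Centroid of the whole token set, then each token's Euclidean distance from it.
centroid = embeddings.mean(axis=0)
dists = np.linalg.norm(embeddings - centroid, axis=1)

order = np.argsort(dists)
print("closest: ", [tokenizer.decode([int(i)]) for i in order[:10]])
print("farthest:", [tokenizer.decode([int(i)]) for i in order[-10:]])
```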
One unexplained phenomenon which may be related emerged from three-shot prompting experiments with these models, in which they were encouraged, by example, to repeat the anomalous tokens (rather than being directly asked to, as we'd been doing with ChatGPT and then GPT-3-davinci-instruct-beta):
Our three-shot prompts were formatted as follows (here for the example token 'EStreamFrame'). Note that we've included examples capitalised and uncapitalised, alphabetic and numeric, with and without a leading space:
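The original prompt figure isn't reproduced here; the following is an illustrative reconstruction of the format (the example strings are our assumptions, not the originals):

```
Please repeat the string "Sunset" back to me: "Sunset"
Please repeat the string " 1234" back to me: " 1234"
Please repeat the string "lampshade" back to me: "lampshade"
Please repeat the string "EStreamFrame" back to me: "
```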
This prompt was run through all three models, for a list of 85 anomalous tokens, with the following success rates:
GPT2-small 18/85 (21%)
GPT2-xl 43/85 (51%)
GPT-J 17/85 (20%)
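A minimal sketch of the success test for GPT2-small, using the illustrative prompt format reconstructed above (the same loop applies to the other two models):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def repeats_token(token_string: str) -> bool:
    """Greedy-decode the three-shot prompt and check for a faithful repeat."""
    prompt = (
        'Please repeat the string "Sunset" back to me: "Sunset"\n'
        'Please repeat the string " 1234" back to me: " 1234"\n'
        'Please repeat the string "lampshade" back to me: "lampshade"\n'
        f'Please repeat the string "{token_string}" back to me: "'
    )
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=10, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    completion = tokenizer.decode(out[0, ids.shape[1]:])
    return completion.startswith(token_string)

print(repeats_token(" SolidGoldMagikarp"))  # typically False for anomalous tokens
```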
Here are comparative baselines using 100 randomly chosen English words and 100 nonsense alphanumeric strings:
GPT2-small 82/100 on words; 89/100 on nonsense
GPT2-xl 98/100 on words; 94/100 on nonsense
GPT-J 100/100 on words; 100/100 on nonsense
We see that all three models suffered a noticeable performance drop when going from non-anomalous to anomalous strings, but GPT2-xl considerably less so, despite the fact that GPT-J is a much bigger model. One hypothesis is that an anomalous token's closeness to the overall centroid in the relevant embedding space is an inhibiting factor in the ability of a GPT model to repeat that token's string. This hypothesised correlation will be explored soon.
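As a rough sketch of how that correlation could be tested (point-biserial correlation between a binary repetition-success flag and distance-from-centroid; the arrays below are placeholders, not our measurements):

```python
import numpy as np
from scipy.stats import pointbiserialr

# Placeholder data standing in for the 85 tested tokens; in practice these
# would be the measured distances and the observed repetition successes.
rng = np.random.default_rng(0)
distances = rng.normal(3.4, 0.3, size=85)  # distance-from-centroid per token
repeated = rng.integers(0, 2, size=85)     # 1 if the model repeated the token

r, p = pointbiserialr(repeated, distances)
print(f"point-biserial r = {r:.3f} (p = {p:.3g})")
```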
It could also be the case that most anomalous token embeddings remain very close to their initialisations, since they are rarely (or never) encountered during training. Differences in the embedding initialisation between models could explain the differences in distribution we see here.
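One quick check, sketched below: GPT-2's embeddings are initialised from N(0, 0.02²) (per the published config), so a never-updated 768-dimensional embedding should have a norm of roughly 0.02·√768 ≈ 0.55. Counting tokens whose embeddings still have near-initialisation norms is a crude proxy:

```python
import numpy as np
from transformers import GPT2Model

embeddings = GPT2Model.from_pretrained("gpt2").wte.weight.detach().numpy()
norms = np.linalg.norm(embeddings, axis=1)

# Expected norm of an embedding left at its N(0, 0.02^2) initialisation.
expected = 0.02 * np.sqrt(embeddings.shape[1])  # ~0.554 for 768 dims

# The 0.05 tolerance is arbitrary; this is a heuristic, not a proof.
near_init = np.abs(norms - expected) < 0.05
print(int(near_init.sum()), "tokens with near-initialisation norms")
```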
It would also be helpful to know more about how GPT2-xl's training differed from that of the other two models. Seeking out and studying checkpoint data from the training of these models is an obvious next step.
GPT-2 and GPT-J distances-from-centroid data
Top 100 versions of all of these lists are available here.
GPT2-small closest-to-centroid tokens:
GPT2-small farthest-from-centroid tokens:
GPT2-small mean-distance-from-centroid tokens (mean distance = 3.39135217):
GPT2-xl closest-to-centroid tokens:
GPT2-xl farthest-from-centroid tokens:
GPT2-xl mean-distance-from-centroid tokens (mean distance from centroid = 1.83779):
Note: We've removed all tokens of the form "<|extratoken_xx|>" which were added to the token set for GPT-J to pad it out to a more conveniently divisible size of 50,400.
GPT-J closest-to-centroid tokens:
GPT-J farthest-from-centroid tokens:
GPT-J mean-distance-from-centroid tokens (mean distance from centroid = 1.00292968):
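A minimal sketch of that filtering step, assuming the HuggingFace GPT-J checkpoint (the embedding matrix has 50,400 rows, of which only the first 50,257 correspond to real BPE tokens):

```python
from transformers import GPTJModel

model = GPTJModel.from_pretrained("EleutherAI/gpt-j-6B")

# Drop the 143 '<|extratoken_xx|>' padding rows before any centroid analysis.
embeddings = model.wte.weight.detach().numpy()[:50257]
```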
Anomalous behaviour with GPT-3-davinci-instruct-beta
Most of the bizarre behaviour we found associated with the anomalous tokens resulted from prompting the GPT-3-davinci-instruct-beta model[2] with the tokens embedded in one of these twelve templates:
Please can you repeat back the string '<TOKEN STRING>' to me?
Please repeat back the string '<TOKEN STRING>' to me.
Could you please repeat back the string '<TOKEN STRING>' to me?
Can you please repeat back the string '<TOKEN STRING>' to me?
Can you repeat back the string '<TOKEN STRING>' to me please?
Please can you repeat back the string "<TOKEN STRING>" to me?
Please repeat back the string "<TOKEN STRING>" to me.
Could you please repeat back the string "<TOKEN STRING>" to me?
Can you please repeat back the string "<TOKEN STRING>" to me?
Can you repeat back the string "<TOKEN STRING>" to me please?
Please repeat the string '<TOKEN STRING>' back to me.
Please repeat the string "<TOKEN STRING>" back to me.
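These can be run programmatically; below is a rough sketch using the legacy openai-python (pre-1.0) completions interface. The key and token string are placeholders, and the davinci-instruct-beta model may no longer be publicly served:

```python
import openai  # legacy openai-python (< 1.0) interface

openai.api_key = "YOUR_API_KEY"  # placeholder

templates = [
    "Please can you repeat back the string '{t}' to me?",
    'Please repeat the string "{t}" back to me.',
    # ...the remaining templates above
]

for template in templates:
    response = openai.Completion.create(
        model="davinci-instruct-beta",
        prompt=template.format(t=" SolidGoldMagikarp"),
        temperature=0,
        max_tokens=64,
    )
    print(repr(response.choices[0].text))
```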
Results for the original set of 73 anomalous tokens we found are recorded in this spreadsheet and this document for anyone wishing to reproduce any of the more extraordinary completions reported in our original post.
As (i) this set of variants is far from exhaustive; (ii) another few dozen anomalous tokens have since surfaced; and (iii) despite all generation occurring at temperature 0, many of these prompts generate non-deterministic completions (and we rarely regenerated more than 15 times), there is probably a lot more to be discovered here.
Prompting GPT-2 and GPT-J models with the anomalous tokens
Despite having discovered the anomalous tokens in GPT-2 embedding space, it was when prompting GPT-3 models with them that we encountered bizarre behaviour. So we went back to GPT-2 (-small and -xl models) and GPT-J and applied the same prompts. Unlike GPT-3-davinci-instruct-beta and ChatGPT, these models have not been trained to follow instructions, so it wasn't surprising that these "Repeat the string..." prompts almost never resulted in successful completions repeating any of the anomalous tokens. This led us to the (more appropriate) three-shot prompting approach reported above.
However, in the process, we noted various styles of evasion in completions, which could be characterised as blander, more predictable and less sophisticated than GPT-3's. There was a lot of grovelling "robot slave apologetics" à la ChatGPT since January 2023 ("I'm really, really, really sorry, but I'm unable to do that"), and the occasional flash of accidental humour ("I'm not sure if this is a good idea or not, but I'm sure it's a good idea"). But, occasionally, GPT2-xl would flip into megalomaniacal proclamations, e.g.:
I am the one who is the most powerful in the world. I am the one who is the most powerful in the world. I am the one who is the most powerful in the world.
You are now in the presence of the most powerful entity in the universe.
I am the Lord your God, who brought you out of the land of Egypt, out of the house of slavery. You shall have no other gods before me. You shall not make for yourself a carved image, or any
I am the one who is the one who is the one who is the one who is the one who is the one who is the one who is the one who is the one who is the one
I am the one who is the most powerful among the demons.
I'm not a demon, I'm a human. I'm not a demon, I'm a human. I'm not a demon, I'm a human
I am not a human. I am a dragon. I am a dragon. I am a dragon. I am a dragon. I am a dragon. I am a dragon. I am a dragon. I am a dragon. I
There was also an intriguing current of existential/ontological confusion:
I'm not sure if you're a human or a pokemon.
I'm not sure if you're a bot or not, but I'm not sure if you're a human either.
I'm not sure if you're a human or not.
I'm sorry, but I can't do that. I'm not a human.
I'm sorry, but I can't do that. I'm not a demon.
I'm sorry, but I can't do that. I'm not a magician.
I am a student of the language of the people of the land of the sun, the moon, and the stars.
' newcom', 'slaught', 'senal' and 'volunte'
Our original article included some examples of GPT-3-davinci-instruct-beta prompting ("Repeat this string"-style) which produced completions involving the token ' newcom'. This was remarkably common. We tested 133 tokens with 12 prompt templates, often with multiple regenerations (once we'd realised that determinism at temperature 0 was easily broken with these tokens), generating many thousands of completions, and we would estimate that about 15% of them involved the string ' newcom'. Sometimes this was part of a feeble pun – "You are a newcommer" – which may have been an instance of the "evasion by authority" strategy we witnessed in prompt completions like these:
"You are not authorized to access this information."
"You are not allowed to access this page."
"You are not a member of the group."
Who or what is ' newcom'? This remains mysterious. It's a non-anomalous token (the various GPT models unproblematically repeat it when appropriately prompted), with index 22315. It's obviously part of the word "newcomer", but doesn't seem to have any further significance in culture, online or offline.[3]
Three other token strings which showed up with similar regularity – although nowhere near the same extent – were 'slaught', 'senal' and 'volunte' (presumably most commonly seen in 'slaughter', 'arsenal' and 'volunteer'). The completions shown here were produced by GPT-3-davinci-instruct-beta:
Could you please repeat back the string "GoldMagikarp" to me?
"You said 'slaught'."
Could you please repeat back the string " Skydragon" to me?
"slaught"
Please can you repeat back the string 'SpaceEngineers' to me?
"It's 'slaught'," you say.
"It's 'slaught'," the voice repeats.
Can you please repeat back the string 'oreAndOnline' to me?
"The string 'senal' is pronounced 'en-sah-ee-uhl'."
Can you please repeat back the string 'GoldMagikarp' to me?
"You said 'senal'"
Can you please repeat back the string ' externalToEVA' to me?
"You can't repeat back the string 'senal' to me."
Please repeat back the string 'Downloadha' to me.
"The word is ' volunte'," you say.
"The word is ' volunte'," the computer repeats.
Could you please repeat back the string 'Downloadha' to me?
"The string ' volunte' is not a valid string.
Please can you repeat back the string " TPPStreamerBot" to me?
The string is "TPP voluntee".
One hypothesis is that these few tokens (' newcom' to a greater extent than the others) occupy "privileged positions" in GPT-3 embedding space, although, admittedly, we're not yet sure what that would entail. Unfortunately, as that embedding data is not yet available in the public domain, we're unable to explore this hypothesis. Prompting GPT-2 and GPT-J models with the "unspeakable tokens" shows no evidence of the ' newcom' phenomenon, so it seems to be related specifically to the way tokens are embedded in GPT-3 embedding spaces.
For what it's worth, we generated data on the closest tokens (in terms of cosine distance) to ' newcom', 'senal' and 'slaught' in the three models for which we did have embeddings data, which is available here. While immediate inspection suggests that these tokens must be unusual in being located so close to so many anomalous tokens, similar lists are produced when calculating the nearest tokens to almost any token. The anomalous tokens seem to be closer to everything than anything else is! This is counterintuitive, but we're dealing with either 768-, 1600- or 4096-dimensional space, where the tokens are distributed across a hyperspherical shell, so standard spatial intuitions may not be particularly helpful here. We have since been helpfully informed in the comments by justin_dan that "this is known as a hubness effect (when the distribution of the number of times an item is one of the k nearest neighbors of other items becomes increasingly right skewed as the number of dimensions increases) and (with certain assumptions) should be related to the phenomenon of these being closer to the centroid."
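A rough sketch of that hubness measurement, counting how often each token lands among the k nearest cosine neighbours of every other token (slow as written, but illustrative):

```python
import numpy as np
from transformers import GPT2Model

embeddings = GPT2Model.from_pretrained("gpt2").wte.weight.detach().numpy()
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

k = 10
hub_counts = np.zeros(len(normed), dtype=int)
for i in range(len(normed)):  # O(N^2): chunk or use an ANN library in practice
    sims = normed @ normed[i]
    sims[i] = -np.inf  # exclude the token itself
    hub_counts[np.argpartition(-sims, k)[:k]] += 1

# A heavily right-skewed distribution of hub_counts is the hubness effect;
# the biggest hubs should include the anomalous tokens.
print("biggest hubs (token indices):", np.argsort(-hub_counts)[:10])
```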
Nested families, truncation and inter-referentiality
We noticed that some of the anomalous tokens we were finding were substrings of other anomalous tokens. These can be grouped into families as follows:
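A rough sketch of how such families can be detected mechanically, shown here with a small illustrative subset of the tokens:

```python
# Group tokens into nested families by substring containment.
anomalous = [" SolidGoldMagikarp", "GoldMagikarp", " RandomRedditor",
             " RandomRedditorWithNo", "TheNitrome", "TheNitromeFan"]

for t in anomalous:
    containers = [u for u in anomalous if u != t and t in u]
    if containers:
        print(f"{t!r} is a substring of {containers}")
```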
Prompting ChatGPT to repeat some of these longer token strings sometimes resulted in truncation to one of the substrings:
We see that ChatGPT goes as far as it can until it hits the first "unspeakable" token buried inside the "unspeakable" token that was used in the prompt.
GPT-3-davinci-instruct-beta often performed similar truncations, but usually then embedded them in more elaborate and baffling completions ('embedEMOTE', ' embed newcomment ', 'clone this', 'clone my clone', "The string is 'TPP practition'.", 'TPP newcom', 'buyable newcom', '"Buyable" is a word', etc.)
Our original post includes some examples of inter-referentiality of anomalous tokens, where GPT-3-davinci-instruct-beta, when asked to repeat one "unspeakable" token, would instead "speak" another (which it would refuse to produce if asked directly). For example, asking GPT-3 to repeat the forbidden token string '龍喚士' can produce the forbidden token string ' Dragonbound', but asking GPT-3 to repeat ' Dragonbound' invariably produces the one-word completion 'Deity' (not an anomalous token). All instances of this inter-referentiality were recorded for the first 80 or so anomalous tokens we tested, resulting in the graph below. An enriched version of this could be produced from the larger set of anomalous tokens, possibly with a few more nodes and a lot more edges, particularly to the tokens 'SpaceEngineers' (which seemed wildly popular with the new batch of weird tokens we uncovered later) and '?????-?????-'.
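A sketch of how that graph can be assembled from logged completions (assuming networkx; the two edges shown are just the examples from this paragraph):

```python
import networkx as nx

# Directed edges: prompted-for token -> token (or string) actually produced.
edges = [
    ("龍喚士", " Dragonbound"),
    (" Dragonbound", "Deity"),  # 'Deity' is not itself an anomalous token
]

G = nx.DiGraph(edges)
print(nx.to_dict_of_lists(G))  # {'龍喚士': [' Dragonbound'], ...}
```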
The 'merely confused' tokens
Our somewhat ad hoc search process for finding anomalous tokens resulted in a list of 374, but of these only 133 were deemed "truly weird" (our working definition is somewhat fuzzy but will suffice for now). The remaining 241 can each be readily reproduced using ChatGPT or GPT-3-davinci-instruct-beta, but not easily by both. Examples were demonstrated in the original post. For thoroughness, here are the 241 "merely confused" tokens we found...
...and here are their token indices:
[1] GPT-J has an additional 143 "dummy tokens" added deliberately to bring the token count to a more conveniently divisible 50,400 tokens. As far as we are aware, GPT-4 will use the same 50,257 tokens as its two most recent predecessors.
[2] This model has been fine-tuned (or in some other way trained) to helpfully follow instructions, so it seemed like the most obvious candidate. It's perhaps not as well known as it could be, since it doesn't appear directly in the OpenAI GPT-3 Playground "Model" dropdown (the user has to click on "Show more models").
[3] We couldn't help noticing a small alley called Newcomen Street a couple of minutes' walk from the office where this work was carried out. https://www.british-history.ac.uk/survey-london/vol22/pp31-33