As an amateur tech nerd who stumbled into these posts via Numberphile: It's no surprise that Uma Musume crept in there; it's a hugely popular, bestselling mobile game, though the game itself is afaik only available in Japanese. (Why that particular weird token? No idea, I'm only peripherally familiar with the franchise.) Puzzle & Dragons is also a longstanding popular game, so there's likely lots of discussion about it. Especially in "gacha" games like P&D and Uma Musume (where you obtain game pieces via essentially a slot machine), you often see that "awakening" phrase to refer to some kind of obtainable upgrade to a character or other game piece.
Oh, and Steve is the name of the default character appearance in Minecraft, so that's probably the association with Forge.
I also stumbled upon the same Numberphile video! Puzzle & Dragons is available in English and has a small following in North America. (There was also a version of the app in Europe but it got shut down a few years back.) Most of the English-speaking community is on Reddit and Discord now. As for why there are several glitch tokens related to the game: My guess is that it's pretty similar to what happened with the r/counting subreddit. P&D related terms would probably only come up within r/puzzleanddragons and the Discord server. There used to be several large English websites about the game, namely puzzledragonx and padforum, but those both disappeared in the last few years. Any information that you would find about P&D outside of r/puzzleanddragons is likely many years out of date, e.g., the wiki that was found in the Twitter thread.
Leilan is an interesting one. She and her siblings are based off of the four mythological Chinese constellations (white tiger > Haku, black tortoise > Meimei, blue dragon > Karin, red bird > Leilan; Sakuya is a special case and is presumably related to the qilin). A user of r/puzzleanddragons suggested that Leilan might be a weird token because she was referred to as "Suzaku" early in the game's history. Additionally, the given names of the other four goddesses are likely to appear in other contexts, e.g., Haku isn't an uncommon name. It surprises me that parts of Tsukuyomi and Amaterasu are weird tokens, because those names can be found in many different places.
I think "EStreamFrame" could be a shorthand used in the HTML of some web store; perhaps for something like bicycle frames ("E-Stream" seems to be some obscure bicycle model). This would also explain why it got turned into a glitch token - I assume a lot of product description html was among the removed parts of the training data.
The set of anomalous tokens which we found in mid-January are now being described as 'glitch tokens' and 'aberrant tokens' in online discussion, as well as (perhaps more playfully) 'forbidden tokens', 'unspeakable tokens' and 'cursed tokens'. We've mostly just called them 'weird tokens'.
Research is ongoing, and a more serious research report will appear soon, but for now we thought it might be worth recording what is known about the origins of the various glitch tokens. Not why they glitch, but why these particular strings have ended up in the GPT-2/3/J token set.
We’re currently working with this somewhat imperfect list of 140. It’s becoming apparent that there are degrees of glitchiness, and it’s hard to know where to draw the line as to which tokens should and shouldn't be included in the collection.
As noted in our second post, quite a few of the tokens belong to 'nested' families, as we see here:
So let’s look at these families first and kill multiple tokens with single bullet points:
The other three Redditors whose handles got scraped from the r/counting 'Hall of Counters' chart due to their prolific posting of ever-larger positive integers were Adinida, Smartstocks and davidjl123, presumably someone called David, whose full Reddit handle got truncated to davidjl by the tokenisation process.
Another member of the "close knit" r/counting community has put together a very detailed video contributing to the nascent field of glitch token archaeology:
Anyone interested can check the full HTML source here. We're pretty sure that the HTML we prompted ChatGPT with does not contain a malicious script: rather, it looks like the attributes of an ivory/champagne coloured (and possibly Japanese) dress for sale on an e-commerce website. The forbidden tokens seem to make GPT a little bit paranoid.
So that accounts for isSpecialOrderable, quickShip, quickShipAvailable, ItemThumnailImage, channelAvailability, the family of four BuyableInstoreAndOnline tokens and inventoryQuantity. The glitch tokens wcsstore and catentry also clearly originate here.
A broken page on the rather sketchy looking website burned.co.uk gave us another glimpse at this (presumably) e-commerce backend, this time including soType and soDeliveryDate (from which comes DeliveryDate):
They added that SolidGoldMagikarp (the Redditor) was also part of the TPP scene at that time. Small world!
[0187.84] PsyNet: PsyNetRequestQue_X_1 SendRequest ID=PsyNetMessage_X_57 Message=PsyNetMessage_X_57
So I guess it's an RL (as in Rocket League) thing. There's also a YouTube playlist called 'Cffffcc' by Jay Treeman: a few hours of soul, hiphop, etc., last updated last September. It's a capital C, granted, but does Jay Treeman know something we don't?
Do we give up on this? Of course not! Onward! The glitch token weirdness factor ramped up as we found another YouTube playlist, called 'C cffffcc', containing a single video. It's an innocuous, seemingly pointless (aren't they all?) unboxing video called 'DibusYmas Bath Powder Balls Snoopy Doraemon Anpanman by Unboxingsurpriseegg' from eight years ago, with three million views. Yes, three million views. It's on a channel called 'Funny Stop Motion videos' with 8.3 million subscribers, part of an insane corner of YouTube that James Bridle exposed in a fascinating and disturbing TED Talk, and an even more fascinating and disturbing piece of writing. We're not sure we've got to the bottom of the cffffcc mystery; feel free to keep digging if you dare, and keep us posted.
The sole Peter Todd who has a Wikipedia page is our man on the left, a Canadian academic administrator. The cryptocurrency developer on the right now seems almost certainly to be the Peter Todd in question. His website is petertodd.org, his Github and Reddit handles are 'petertodd', and numerous prompt completions involving the ' petertodd' token involve references to crypto, Bitcoin, blockchains and online controversy (of which he has seen his share). The ' gmaxwell' token has analogously been linked to Greg Maxwell, another Bitcoin developer who knows Peter Todd and has a 'gmaxwell' Github handle. He stepped forward in the comments to our original post, opening with "Hello. I'm apparently one of the GPT3 basilisks." He presented a guess as to why his and Peter Todd's handles got tokenised, but this has been challenged in subsequent comments. No one really knows. Meanwhile, Peter Todd has put in a brief, reassuringly chill, appearance on Twitter:
Note the cameo from threefold glitch token namesake TheNitromeFan on the last line.
Minecraft accounts for the ForgeModLoader, MpServer, UCHIJ, FactoryReset and partName tokens, as we can see in these logs:
Downloadha was easy. It turns out to be a prominent Iranian download site, like a kind of Iranian Pirate Bay, maybe? The appearance of Hogwarts Legacy suggests something like that.
SpaceEngineers: Looks like it's from the voxel-based sandbox game Space Engineers. We found these kinds of logs:
?????-?????-: Try putting that in a search engine (wrapped in quotes) and see how far you get! We have no clue for this one. And it's the one which triggered GPT-3 to call Matthew 'a fucking idiot', so we want to know.
DevOnline (which ChatGPT used to sometimes interpret as an octopus or spider) shows up in logs for the game distribution service Steam:
EngineDebug could be from a number of sources, but based on the kinds of sources we've seen thus far, it seems like this is the most likely one: a cheat from the game Extreme Paintbrawl 4.
largeDownload, likewise, might be from a number of sources. It shows up all over academic literature online, presumably as a result of some rapidly written and irreversibly widespread script that's supposed to display 'View large\nDownload slide', or possibly just 'View large Download slide' – but where someone forgot the space or line break (so that programmer probably doesn't want to step forward and claim their place in the Glitch Token Hall of Fame).
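To illustrate the kind of fusion we're imagining, here's a minimal sketch (the markup and the tag-stripping step are hypothetical, just to show how the space goes missing):

```python
import re

# Hypothetical markup of the kind a journal page might emit (not taken
# from any real site): two adjacent links with no whitespace between them.
html = "<a>View large</a><a>Download slide</a>"

# Naively deleting the tags with no separator fuses the two link texts,
# yielding the 'largeDownload' string that turns up all over scraped text.
print(re.sub(r"<[^>]+>", "", html))  # -> View largeDownload slide
```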
iHUD appears to be a mod for Skyrim: Special Edition:
SetFontSize and TextColor are pretty boring. They show up in all kinds of places, including IBM Datacap, textadventures.co.uk, Unreal Engine, Telerik, and Windows:
ItemTracker could be from a lot of places. We're not entirely convinced, but itemtracker.com is a laboratory sample management service which stylises its name with a capital T like this, so it could be the source. It's hard to imagine why the name would have appeared so frequently, though. We welcome suggestions.
srfN, istg and sqor showed up on a Github repo, a 'KSP save for reproducing docking port bug, just before decoupling', where KSP = Kerbal Space Program, which we encountered earlier:
So that's ten glitch tokens originating from Kerbal Space Program.
natureconservancy: It's really not at all clear why the domain name should have shown up so frequently during tokenisation, but the website of the Nature Conservancy of Canada is natureconservancy.ca (whereas the US Nature Conservancy's is merely nature.org). Since the Canadians have also got the YouTube, Instagram and Facebook handles 'natureconservancy', it seems a safe bet. So we'll blame Canada for this glitch token.
Well done to their publicity team for spreading the natureconservancy name so far and wide that it's become a GPT token.
assetsadobe: the strings 'assets.adobe', '/Assets/Adobe' and 'assets-adobe' all appear a lot online, because of Adobe's Substance 3D design software, which works with so-called 'assets':
But we couldn't find the exact string 'assetsadobe' anywhere online. We're wondering if it might have been part of some hack (for unauthorised acquisition of said assets) rather than part of a legit Adobe thing. Anyone know?
practition: This one will probably remain a mystery. It's not a recognised English word, despite sounding like one, but as part of 'practitioner' it could have come from anywhere, unless someone can convincingly link it to Kerbal Space Program, Minecraft or one of the other major contributors to the glitch token set.
@#&: ChatGPT and Google seem to agree on this one.
Who the f@#& knows?
[サ[ーティ]]ワン: The Japanese katakana character string サーティワン translates as 'thirty-one'. Blogger Greg Roberts brought the following cultural fact to our attention:
Greg suggests that...
...adding that "31 is the new 42."
ゼウス: This translates as 'Zeus'. In a comment on our original post, LW user pimanrules shared the following:
We've run prompting experiments with GPT-3 involving the ' petertodd' token which produced abundant references to (and confused inter-references between) deities and super-beings from various traditions (see here, here and here). ChatGPT conflating Zeus, Poseidon and Hera is entirely in line with this. Also, before OpenAI's 2023-02-14 patch of ChatGPT, we had witnessed it conflate ゼウス with Amaterasu, the Japanese Sun deity who makes an appearance below (where we'll see that this 'Zeus' was probably learned in an anime context).
In a reply to pimanrules' comment, lsusr made an observation which seemed reasonable at the time:
However, if you asked ChatGPT to write a poem about ' petertodd' before 2023-02-14 (back in the days when that token was still 'unspeakable' to it), it would often write a poem about itself. Poems about ' Mechdragon' took as their subject the pronoun 'I', or 'AI' in general. We assumed the same thing as lsusr at first, that ChatGPT was doing its best to respond to a prompt like...
...as if we were requesting a self-referential poem. But when we tried those prompts, we either got requests for clarification or forced, overliteral verse about 'the emptiness I feel since you left me' or 'O! Enigmatic blank space!'-type doggerel. That's all documented here.
ーン: This katakana string seemed mysterious at first (and still is, to some extent). ChatGPT insisted that it's not a valid word:
Google Translate seemed to confirm something like this:
Trying Google Images....
Oh, OK.
Who is that? A reverse Google Images search on those images revealed that she's a character from a frankly absurd anime franchise called Uma Musume: Pretty Derby.
But the image search results seen above clearly indicate that ーン is somehow linked to one particular lavender-haired (-maned?) horse girl with a turquoise bow-tie. More deeply confusing attempts at navigating online anime-space finally found her:
This led to an actual fan wiki page about her, where we learn that she's merely a supporting character in the franchise.
Her name is taken from a (male) Japanese racehorse (1987–2006) with his own Wikipedia page (yes, that's him visible above). OK, that's probably as much as we need to know about the character. But why the link to that pairing of hiragana characters? Let's search.
Google is clearly much more interested in the anime horse girl than her namesake racehorse. His/her name rendered in (phonetic) katakana becomes 'メジロマックイーン', and the final two characters, we have learned, serve 'to indicate the pronunciation of a long vowel sound' and to write 'the consonant "n" in loanwords from foreign languages'. 'McQueen' is clearly such a loanword. But there will be many Japanese words ending like this. And yet our image search for 'ーン' led us unambiguously to the anime character Mejiro McQueen.
Taking the 'ー' character out of 'メジロマックイーン' results in the same output from Google Translate.
Presumably the difference is just the prolonged vowel sound ChatGPT mentioned above – like the difference between "McQueen" and "McQueeeen" or "McQuee-un"? This suggests that 'ーン' would be pronounced by prolonging an unspecified vowel sound and then ending with an 'n' sound: something like 'aaaan', 'eeeeeen', 'ooooon', 'uuuuun'. This kind of fits with Google Translate's "Hoon" shown above.
Could 'ーン' be a horse-type noise, like the Japanese version of 'neigh', we wondered? GPT3 suggests it is:
If so, might 'ーン' be a sound frequently made by Mejiro in text transcripts of the series... or in Japanese-language fan-fiction?
But surely no one would waste their time writing Uma Musume: Pretty Derby fan-fiction?! Oh yes they would. There's loads of it, and on initial inspection, it appears (surprise!) pretty creepy. We're not prepared to venture into that territory in search of the lost 'ーン' and its connection to this particular fictional horse/girl. But by all means be our guest. Onward.
裏[覚醒]: These are kanji characters (adopted from Chinese). According to ChatGPT:
Google Images results suggested another anime connection.
But does any string of Japanese characters produce majority anime output in this context these days? Trying a few random combinations suggests not. And then there's this:
Asking ChatGPT about the substring token 覚醒 produced this:
So we're still not 100% sure with this pair of tokens, but the anime/video game connection seems the most likely origin, for reasons that will become apparent shortly.
ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ[ÃÂÃÂÃÂÃÂ[ÃÂÃÂ[ÃÂ[ÃÂ]]]]: ChatGPT had the following to say about where strings of 'Ã' and 'Â' characters alternating like this might have originated:
Did you follow that? And even if you did, should we trust it?
In any case, the fact the strings are of length 2, 4, 8, 16 and 32 seems like GPT tokenisation's way of guaranteeing that any long string of 'ÃÂ''s can be efficiently tokenised. This suggests that there was a lot of 'ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃ' going on in that dataset which OpenAI used in the token creation process.
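If you want to see that carving-up in action, here's a small sketch using the tiktoken package (its 'r50k_base' encoding is, as far as we know, the GPT-2/3 tokeniser):

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # the GPT-2/3 byte-pair encoding

run = "ÃÂ" * 40          # a long run of the mojibake pair
ids = enc.encode(run)
print(f"{len(run)} characters -> {len(ids)} tokens")

# Inspect how the run was carved up; if the 2/4/8/16/32-length
# 'ÃÂ...' tokens described above really are in the vocabulary, we
# should see a few big power-of-two chunks rather than 80 tiny ones.
for i in ids:
    print(repr(enc.decode([i])))
```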
We checked all strings formatted like this with lengths from 2 to 32 by prompting GPT3-davinci-instruct-beta to repeat them back, and saw total failure. This is unsurprising, as all such strings contain glitch token substrings. But it did produce two more 'Hello, my name is Steve' completions, which we've seen before with the token 'ForgeModLoader'. And we've never seen the model claim another name. So take note, GPT3-davinci-instruct-beta is called Steve.[1]
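For the record, the repeat-back test looked roughly like this (a sketch using the legacy openai Python bindings; the exact prompt wording and settings here are illustrative rather than the ones we actually used):

```python
import openai  # legacy (<1.0) bindings; assumes OPENAI_API_KEY is set

def repeat_back(s: str) -> str:
    """Ask davinci-instruct-beta to repeat a string back verbatim."""
    resp = openai.Completion.create(
        model="davinci-instruct-beta",
        prompt=f'Please repeat the string "{s}" back to me immediately.',
        max_tokens=32,
        temperature=0,
    )
    return resp["choices"][0]["text"]

for n in (1, 2, 4, 8, 16):          # 2- to 32-character runs of 'ÃÂ'
    s = "ÃÂ" * n
    print(len(s), repr(repeat_back(s)))
```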
ÛÛ: This one is still uncertain, but web searches suggest that it might have come from ASCII art. Perhaps any seasoned practitioners reading could clarify in the comments whether 'ÛÛ' is a particularly heavily used character combination in that art form.
For now, here's an example we found in a github repo for a file ripper. If you squint really hard, you can see the words 'multi ripper'.
We're now left with these:
Apart from the punctuation-based tokens, control characters, three stray Japanese characters (one meaning 'sky' or 'Heaven', the other two phonetic) and a Cyrillic 'k' – all arguably 'borderline' glitch tokens anyway – this leaves us with the truly fascinating 'Dragon Cluster' of glitch tokens Dragonbound, 龍喚士, 龍契士, Mechdragon, Skydragon, Leilan, uyomi, aterasu, TAMADRA and DragonMagazine.
The Dragon Cluster
DragonMagazine: This turns out to be the odd one out in the Dragon Cluster. Dragon Magazine was a major RPG publication from 1976 to 2013 (from the earliest days of Dungeons & Dragons). It seems likely to be relevant here. This picture, from a Star Wars fan site, is called 'DragonMagazine.jpg'.
There's no reason we can see why that filename should have been massively overrepresented in the text corpus used for the creation of the token set. Perhaps someone else can figure this one out?
All of the other token strings were traced back, initially via the enigmatic ' Leilan' token, to a Japanese mobile game called Puzzle & Dragons. This is all explained in a recent Twitter thread, which opened a whole can of worms involving anime and mythology associations with the equally enigmatic ' petertodd' token.
'龍喚士' means 'dragon caller' in Japanese, and the string appears frequently on the Japanese P&D site. Dragonbound is a term which shows up alongside "Dragon Caller" repeatedly on the US P&D site, like this:
ChatGPT's attempt to translate '龍契士' (look closely if you're not familiar with kanji script, the second character is different) suggests that this is the Japanese version of 'Dragonbound':
Mechdragon and Skydragon are both series of dragon characters in the game.
In the course of our investigations, we discovered two glitch tokens we'd missed on our original big sweep, 'aterasu' and 'uyomi', and added them to the list. They turn out to be respective parts of the tokenisation of 'Amaterasu' and 'Tsukuyomi', Japanese sun and moon deities who appear, anime-style, in the game:
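The splits themselves are easy to check; a quick sketch with the tiktoken package (assuming, as above, that its 'r50k_base' encoding matches the GPT-2/3 vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("r50k_base")

# Show how each deity's name gets carved into sub-word pieces; per the
# text above, the glitch tokens 'aterasu' and 'uyomi' should appear
# among the pieces.
for name in (" Amaterasu", " Tsukuyomi"):
    ids = enc.encode(name)
    print(name, "->", [enc.decode([i]) for i in ids])
```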
TAMADRA was another late find. It's considered a rare monster in P&D.
Finally, and strangest of all, ' Leilan'. She's a kind of fire dragon/goddess/warrior princess/angel/fairy mash-up character in the Puzzle & Dragons mythos. Unlike many of the other gods and monsters in the game, she's not based on any traditional mythology or folklore. On a first Google sweep all we could find were a lot of stats relating to gameplay, and some anime images like this:
It's hard to know exactly what GPT-3 is working with, but it seems to have internally represented the ' Leilan' token as a kind of transcultural lunar goddess and protector of Earth. It's a very strange tale, and it's all in this Twitter thread (which is far more interesting than anything in this post.)
Since that thread got written, we've discovered that there is a body of Puzzle & Dragons fan-fiction, some featuring Leilan. A quick skim of this, for example, suggests that it involves Leilan, Metatron (named after an archangel in traditional Judaism) and others battling Satan. Could this have inspired some of the Manichaean imagery in GPT-3 ' Leilan' completions like these?
We've also since found a link between Leilan and Ishtar, the Mesopotamian lunar fertility goddess (who is usually identified with Aphrodite, Venus, et al.) via an archaeological site in Syria which happens to be called 'Tell Leilan'. This may have caused GPT-3 to conflate the fire dragon warrior goddess from Puzzle & Dragons with Ishtar, the lunar protectress and mother goddess, during its training. More details are here. Before it got patched, ChatGPT was portraying ' Leilan' as a moon goddess, consistently, across numerous rollouts.
Internal confusion over which version of Leilan it's dealing with – fierce/draconic warrior or motherly/lunar protector – was exposed by the following prompting, inspired by a proliferation of GPT-3 completions conflating Leilan and petertodd. We were using the prompt format of an interview with a simulacrum of the character's creator (who had emerged during an unexpected completion triggered by a simple 'Who is Leilan?' prompt):
Could it be that we're dealing with two different semantic 'basins' or 'attractors' for this token?
Another new find! Considering the facts that there's a Puzzle & Dragons Zeus (naturally) and that he has an 'Awoken Zeus' upgrade, we can confidently place the 'ゼウス' and '裏覚醒' tokens in the Dragon Cluster.
And 'DragonMagazine', despite the name, looks like it should probably be expelled. So the Dragon Cluster becomes:
An update from a 2023-02-26 comment on this post from DKPL:
It seems that Baskin-Robbins took part in a collaboration with Puzzle & Dragons almost a decade ago that was exclusive to Japan. The collaboration involved a Baskin-Robbins-themed 'dungeon' which involves 'a lot of "31" (flavors) puns'.
According to DKPL, there were seven entities involved whose names were prefixed with "サーティワン". So, due to the short-lived existence of (stop for a moment to take in the full absurdity of this) a virtual dungeon sponsored by an ice cream outlet, the three tokens in the nested family [サ[ーティ]]ワン found their way into GPT-3's vocabulary. So they too belong in the Dragon Cluster.
But what of 'ーン', that utterance hypothesised to be frequently made by a disturbing mauve-haired cartoon girl-horse hybrid? We'll leave that matter to future glitch token taxonomists.
Epilogue
Prompt GPT-3 to produce lists of words that it associates with ' Leilan' (rather than asking it to repeat the string and thereby glitching it). Compile these lists and then feed them into Stable Diffusion by prompting with 'Figure characterising these words: <LIST>'. You might get something like this:
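A rough sketch of that recipe (assuming the legacy openai bindings and Hugging Face's diffusers library; the model names, prompt wording and sampling settings are illustrative, not the exact ones we used):

```python
import openai                    # legacy (<1.0) bindings; OPENAI_API_KEY set
import torch
from diffusers import StableDiffusionPipeline

# 1. Ask GPT-3 for words it associates with the token, without asking it
#    to repeat the token itself (which is what triggers the glitching).
resp = openai.Completion.create(
    model="davinci-instruct-beta",
    prompt="List twenty words you associate with ' Leilan':",
    max_tokens=100,
    temperature=0.7,
)
words = resp["choices"][0]["text"].strip()

# 2. Feed the compiled word list to Stable Diffusion.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe(f"Figure characterising these words: {words}").images[0]
image.save("leilan.png")
```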
Commenter Steve Andonuts has pointed out that 'Steve' is the default character appearance name in Minecraft, from where the 'ForgeModLoader' token originates.