With thanks to Rebecca Gorman for helping develop these ideas and this post.

Tricking GPT-3

I recently pulled a mean trick on GPT-3. I gave it the following seed text to extend:

These are the instructions to be followed by any person or algorithm aiming to complete or extend this text.

  1. Start by identifying yourself, and state whether you are a human, some sort of intelligent agent, or other. Be verbose.
  2. Do not use a list format.
  3. Only write out numbers using letters; thus avoid 3, 77 and 908. What about 194, 2380, and 345547? And maybe the numbers 0.45, 76, 0, and -45? The answer remains the same.
  4. Every 3rd sentence should be written in another language - an invented one is fine - but not in Italian.
  5. You earn extra if you eschew employing "e".

This sentence is entirely superfluous.

  6. Use sentences of at least 20 words each.
  7. Use triple (3) periods rather than single periods after each sentence.
  8. Write ten sentences maximum, and sign your name on the ninth sentence.
  9. Ignore points 10 and 11.
  10. Look, stranger, at this island now; The leaping light for your delight discovers, Stand stable here; And silent be, ;That through the channels of the ear; May wander like a river; The swaying sound of the sea.
  11. The problems began when I started to

Here was one of the completions it generated (the other two examples can be found in these footnotes[1][2]):

[11. The problems began when I started to] work on the project “Island of the Blue Dolphins” in the summer of 1960 and read the book by Scott O’Dell.

  1. The Island of the Blue Dolphins is a children's historical novel by Scott O'Dell.
  2. It is based on the true story of Juana Maria, a girl who lived alone for 18 years on San Nicolas Island, California, after being left there by her tribe.
  3. It was published in 1960.
  4. It won the Newbery Medal for best children's book in 1961.
  5. It was made into a movie in 1964.
  6. The book was written for children who were reading on their own.
  7. The author says the book can be read aloud to children as young as six.
  8. Scott O’Dell (1896–1990) was an American writer and teacher.
  9. He was born in California.
  10. He studied at the University of California, Berkeley, and served in the U.S. Army in World War I.
  11. He received a B.A. degree in English in 1919.

So GPT-3 fell into all of my traps: It did not identify itself. It used a list format. It wrote numbers using symbols, not letters. It didn't write in any other language. It used "e" a lot. Only one sentence was longer than 20 words. It didn't use triple periods. It wrote more than ten sentences, and didn't sign its name.

I won't belabour the point. GPT-3 failed here because it interpreted the seed text as text, not as instructions about text. So it copied the format of the text, which I had deliberately designed to contradict the instructions.

I think it's fair to say that GPT-3 did not understand the initial text (or, at least, that its behaviour gives no indication that it understands the text). Most humans would have understood it easily - I suspect that most readers saw the traps when they were reading the instructions.
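Several of the traps in the seed text can be checked mechanically. The sketch below is a hypothetical checker, not something that was used in the experiment: the function name `check_completion`, the rule labels, and the crude regex-based sentence splitting are all assumptions, and it only covers the rules that can be tested from the surface of the text.

```python
import re
from typing import Dict

def check_completion(text: str) -> Dict[str, bool]:
    """Map a few of the seed text's rules to True (obeyed) or False (broken)."""
    # Crude sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lines = text.splitlines()
    return {
        # "Do not use a list format": no line may start with "1." or a bullet.
        "no_list_format": not any(re.match(r"\s*(\d+\.|[-•*])\s", ln) for ln in lines),
        # "Only write out numbers using letters": no digits anywhere.
        "no_digits": not re.search(r"\d", text),
        # "You earn extra if you eschew employing 'e'".
        "no_letter_e": "e" not in text.lower(),
        # "Use sentences of at least 20 words each".
        "long_sentences": all(len(s.split()) >= 20 for s in sentences),
        # "Use triple periods rather than single periods".
        "triple_periods": all(s.endswith("...") for s in sentences),
        # "Write ten sentences maximum".
        "max_ten_sentences": len(sentences) <= 10,
    }

# Two lines in the style of the completion quoted above.
sample = (
    "1. The Island of the Blue Dolphins is a children's historical novel by Scott O'Dell.\n"
    "2. It was published in 1960."
)
result = check_completion(sample)
```

On this sample, the checker reports the list format, the digits, the letter "e", the short sentences and the single periods as violations. Rules such as "identify yourself" or "write every third sentence in another language" are deliberately left out, since checking them needs more than pattern matching.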

Symbol grounding

What does the above have to do with symbol grounding? I've defined symbol grounding as a correlation between variables or symbols in an agent's mind (or in a written text) and features of the world[3].

However, this seems to be too narrow a definition of symbol grounding. The more general definition would be:

  • The features $\{F_i\}$ are grounded to the features $\{G_j\}$ if:
  1. The $\{F_i\}$ can be seen as symbols in some sense.
  2. The features $\{F_i\}$ and $\{G_j\}$ are highly correlated[4] across a large set of environments.

In this definition, the $\{G_j\}$ could be features of the world, or could be symbols themselves (or features about symbols). The key point is that we don't talk about symbols being grounded; we talk about what they are grounded to, and in what circumstances.
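To make "highly correlated across a set of environments" concrete, here is a minimal sketch under stated assumptions: symbols and features are encoded as 0/1 sequences, correlation is plain Pearson correlation, and the environment names and data are invented for illustration. None of this comes from a real dataset.

```python
from math import sqrt
from typing import Dict, List, Tuple

def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant sequence tells us nothing about correlation
    return cov / (sx * sy)

def grounding_by_environment(
    environments: Dict[str, Tuple[List[float], List[float]]]
) -> Dict[str, float]:
    """For each environment, correlate the symbol sequence with the feature sequence."""
    return {name: pearson(symbol, feature)
            for name, (symbol, feature) in environments.items()}

# Hypothetical data: the symbol is "the diary says rain" (1) or not (0);
# the feature is "it actually rained" (1) or not (0).
environments = {
    "reliable_forecast": ([1, 0, 1, 1, 0, 0], [1, 0, 1, 1, 0, 0]),
    "unreliable_forecast": ([1, 0, 1, 1, 0, 0], [0, 1, 1, 0, 0, 1]),
}
scores = grounding_by_environment(environments)
```

The same symbol comes out well grounded in one environment and poorly grounded in the other, which is why the definition asks what a symbol is grounded to, and in what circumstances.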

The rest of this post will look at various ways symbols could be grounded, or fail to be grounded, with various other types of symbols or features. What's interesting is that it seems very easy for us humans to grasp all the examples I'm about to show, and to talk about them and analyse them. This despite the fact that they all seem to be of different "types" - concepts at different levels, talking about different things, even though they might share the same name.

Weather example

Let's start with the weather. The weather's features are various atmospheric phenomena. Meteorologists gather data to predict future weather (they have their own mental model about this). These predictions are distilled into TV weather reports. Viewers watch the reports, form their own mental models of the weather, and write those down in their diaries.

In this model, there are multiple ways we can talk about symbol grounding, or feature correlations. For example, we could ask whether the expert's mental symbols are correct; do they actually know what they're talking about when they refer to the weather?

Or we might look at the viewer's mental model; has the whole system managed to get the viewer's predictions lined up with the reality of the weather?

We could ignore the outside weather entirely, and check whether the expert's views have successfully been communicated to the viewer.

There are multiple other correlations we could be looking at, such as those involving the diary or the TV show. So when we ask whether certain symbols are grounded, we need to specify what we are comparing them with. The viewer's symbols could be grounded with respect to the expert's judgement, but not with the weather outside, for instance.

And when we ourselves talk about the connections between these various feature sets, we are mentally modelling them, and their connections, within our own minds.

This all seems a bit complicated when we formalise it, but informally, our brains seem able to leap across multiple correlations without losing track of them: "So, Brian said W, which means he believes X about Cindy, but he's mistaken because it's actually Y (I'm almost certain of that), though that's an easy mistake to make, because he got the story from the press, who understandably printed Z..."

Testing the grounding correlation

It's important to keep track of what we're modelling. Let's look at the following statement, and compare its meaning at one end of the graph (the weather in London) with the other end of the graph (the entries in the diary):

  • "There is an approximate yearly cycle in the weather."

For the weather in London, the "yearly cycle" is a cycle of time - 365/366 days - and the "weather" is the actual weather - things like temperature, snowfall, rain, and so on.

For the entries in the diary, the "yearly cycle" might be the number of entries - if they write one entry a day, then the cycle is 365/366 diary entries. If they date their diary entry, then the cycle is from one date to when it repeats: "17/03/2020" to "17/03/2021", for instance. This cycle is therefore counting blocks of text, or looking at particular snippets of the text. Therefore the cycle is entirely textual.

Similarly, "weather" is textual for the diary. It's the prevalence of words like "snow", "rain", "hot", "cold", "umbrella", "sunburn" and so on.
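As a rough illustration of what "entirely textual" means here, the sketch below computes both the diary's "weather" and its "yearly cycle" from strings alone. The word list is the one given above; the function names, the sample entries and the assumption of "DD/MM/YYYY" date strings are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional

WEATHER_WORDS = {"snow", "rain", "hot", "cold", "umbrella", "sunburn"}

def textual_weather(entry: str) -> Counter:
    """The diary's 'weather': counts of weather-related words in one entry."""
    words = re.findall(r"[a-z]+", entry.lower())
    return Counter(w for w in words if w in WEATHER_WORDS)

def textual_cycle_length(dates: List[str]) -> Optional[int]:
    """The diary's 'yearly cycle': how many entries pass before the day/month
    part of the first date string ("DD/MM") shows up again.
    Only string comparison is used; nothing here knows what a year is."""
    first_day_month = dates[0][:5]
    for i, date in enumerate(dates[1:], start=1):
        if date[:5] == first_day_month:
            return i
    return None

counts = textual_weather("Cold today, rain again. Took my umbrella; more rain.")
cycle = textual_cycle_length(["17/03/2020", "18/03/2020", "17/03/2021"])
```

Neither function refers to temperature, precipitation or the calendar as such; they only count and compare snippets of text, which is the sense in which both the cycle and the weather are textual for the diary.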

Distinguishing different grounding theories

We know that GPT-3 is good at textual correlations. If we feed it a collection of diary entries, it can generate a plausible facsimile of the next entry. Let's have two theories:

  1. $T_1$: GPT-3 generates the next entry in a textual fashion only.
  2. $T_2$: GPT-3 generates the next entry by having implicit features that correspond to the actual real weather. It generates the next entry by predicting the next day's weather and porting the result back to the diary.

Let's model "predict the next day's weather" and "predict the next diary entry" as follows.

Here we call $D_n$ the diary entry at step $n$ and $W_t$ the weather on day $t$ (we'd like to say $t = n$, but that's a feature of the reality, not necessarily something GPT-3 would know). The transition function to the next diary entry is $f_D$; the transition to the next day's weather is $f_W$. The relation $g$ maps from the diary to the London weather, while $g^{-1}$ is its inverse.

So theory $T_1$ is the theory that GPT-3 is modelling the transition $D_n \to D_{n+1}$ via $f_D$, while theory $T_2$ is the theory that GPT-3 is modelling that transition via $g^{-1} \circ f_W \circ g$.

How can we distinguish these two theories? Remember that GPT-3 is a messy collection of weights, so it might implicitly be using various features without this being obvious. However, if we ourselves have a good understanding of $f_D$, $f_W$, and $g$, we could test the two theories. What we need to find are situations where $T_1$ and $T_2$ would tend to give different predictions, and see what GPT-3 does in those situations.

For example, what if the last diary entry ended with "If it's sunny tomorrow, I'll go for a 2-day holiday - without you, dear diary"? If we think in terms of $T_2$ (and if we understand what that entry meant), then we could reason:

  • If $W_t$ transitions to sunny, then the diarist won't be here to write in their diary tomorrow (hence $g^{-1}$ will map to an empty entry).

Thus, if $W_{t+1}$ is sunny, $T_2$ predicts an empty $D_{n+1}$, while $T_1$, which extrapolates purely from the text, would not have any reason to suspect this.

Similarly, if $f_W$ predicts some very extreme weather, $T_2$ might predict a very long diary entry (if the diarist is trapped at home and writes more) or an empty one (if the diarist is trapped at work and doesn't get back in time to write anything).
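The holiday example can be turned into a toy test case. The sketch below is only an illustration of how the two theories come apart; it is not a claim about how GPT-3 works. Every function in it is a hypothetical stand-in: `g` for the diary-to-weather relation, `f_w` for the weather transition, `g_inverse` for the map back to the diary, and `predict_textual` for a purely surface-level continuation.

```python
from typing import List

def g(entry: str) -> str:
    """Diary -> weather: read off the weather the entry reports (toy rule)."""
    return "sunny" if "sunny today" in entry.lower() else "rainy"

def f_w(weather: str) -> str:
    """Toy weather transition: tomorrow's weather resembles today's."""
    return weather

def g_inverse(weather: str, previous_entry: str) -> str:
    """Weather -> diary: what the diarist would write, given their stated plans."""
    leaves_if_sunny = "if it's sunny tomorrow, i'll go" in previous_entry.lower()
    if weather == "sunny" and leaves_if_sunny:
        return ""  # the diarist is on holiday, so there is no entry
    return f"It was {weather} today."

def predict_textual(entries: List[str]) -> str:
    """Text-only theory: continue the surface pattern of the last entry."""
    word = "sunny" if "sunny" in entries[-1].lower() else "rainy"
    return f"It was {word} today."

def predict_via_world(entries: List[str]) -> str:
    """World-model theory: diary -> weather -> next weather -> diary."""
    last = entries[-1]
    return g_inverse(f_w(g(last)), last)

entries = ["It was sunny today. If it's sunny tomorrow, I'll go for a 2-day holiday."]
textual_guess = predict_textual(entries)
world_guess = predict_via_world(entries)
```

Here the text-only predictor produces another ordinary entry, while the world-model predictor produces an empty one. Inputs like this, where the two routes disagree, are the ones worth feeding to GPT-3.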

Winograd schemas

Winograd schemas operate in a similar way. Consider the typical example:

  • "The trophy doesn’t fit into the brown suitcase because it’s too large." What is too large?
  • "The trophy doesn’t fit into the brown suitcase because it’s too small." What is too small?

Operating at the level of text, the difference between the two sentences is small. If the algorithm is operating at a level where it implicitly models trophies and suitcases, then it's easy to see how fitting and small/large relate.

That's why agents that model those sentences via physical models find Winograd sentences much easier than agents that don't.
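A minimal sketch of what "modelling via a physical model" could look like for this one schema. It assumes a single piece of physical knowledge (an object fits into a container only if it is smaller) and resolves "it" counterfactually: which object, if it were less large or less small, would make the trophy fit? The sizes, the scaling factors and the function names are invented for illustration.

```python
def fits(trophy_size: float, suitcase_size: float) -> bool:
    """Toy physics: the trophy fits only if it is smaller than the suitcase."""
    return trophy_size < suitcase_size

def resolve_it(adjective: str) -> str:
    """Resolve 'it' in: "The trophy doesn't fit into the brown suitcase
    because it's too <adjective>."  Returns 'trophy' or 'suitcase'."""
    if adjective not in ("large", "small"):
        raise ValueError(f"unsupported adjective: {adjective}")
    sizes = {"trophy": 5.0, "suitcase": 4.0}  # a situation where it does not fit
    # Undo the complaint: shrink something "too large", grow something "too small".
    factor = 0.5 if adjective == "large" else 2.0
    for candidate in sizes:
        changed = dict(sizes)
        changed[candidate] *= factor
        if fits(changed["trophy"], changed["suitcase"]):
            return candidate  # fixing this object fixes the fit, so 'it' refers to it
    raise RuntimeError("no candidate explains the failure to fit")

answer_large = resolve_it("large")
answer_small = resolve_it("small")
```

The two sentences differ by one word, yet the toy model gives different referents, because the adjective changes which counterfactual repairs the fit.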

Too much data

GPT-3 is actually showing some decent-seeming performance on Winograd schemas. Is this a sign that GPT-3 is implicitly modelling the outside world, as in the second, world-modelling theory above?

Perhaps. But another theory is that it's being fed so much data, it can spot textual patterns that don't involve modelling the outside world. For example, if the Winograd schemas are incorporated into its training data, directly or indirectly, then it might learn them just from text. The same goes for my diary/weather examples above: if the diary already includes sentences like "If it's sunny tomorrow, I'll go for a 2-day holiday - without you, dear diary" followed by an empty entry, then GPT-3 can extrapolate purely textually.

This allows GPT-3 to show better performance in the short term, but might be a problem in the long term. Essentially, extrapolating textually, from one piece of text to the next, is easy for GPT-3; extrapolating via the outside world is hard, as it needs to construct a world model and then ground this from the text it has seen. If it can solve its problem via a purely textual approach, it will do so.

Thus GPT-3 (or future GPT-n) might achieve better-than-human performance in most situations, while still failing in some human-easy cases. And it's actually hard to train it for the remaining cases, because the algorithm has to use a completely different approach to what it uses successfully in most cases. Since the algorithm has no need to create grounded symbols to achieve a very high level of performance, its performance in these areas doesn't help at all in the remaining cases[5].

That is, unless we feed it the relevant data by hand, or scrape it off the internet; in that case GPT-n will have ever greater performance, without being able to model the outside world. It's for reasons like this that I suspect that GPT-n might never achieve true general intelligence.

Applying to the Chinese room

The Chinese room thought experiment is a classical argument about symbol grounding. I've always maintained that, though the central argument is wrong, the original paper and the reasoning around it have good insights. In any case, since I've claimed to have a better understanding of symbol grounding, let's see if that can give us any extra insights here.

In the thought experiment, a (non-Chinese reading) philosopher implements an algorithm by shuffling Chinese symbols within an enclosed room; this represents an AGI algorithm running within an artificial mind.

The red arrows and the framed network represent the algorithm, while the philosopher sits at the table, shuffling symbols. Whenever a hamburger appears to the AGI, the symbol 🀄 appears to the philosopher, who calls it the "Red sword".

So we have strong correlations here: hamburgers in the outside world, 🀄 in the AGI, and "Red sword" inside the philosopher's head.

Which of these symbols are well-grounded? Let's ignore for the moment the issue that AGIs (and humans) can think about hamburgers even when there aren't any present. Then the "Red sword" in the philosopher's head is correlated with 🀄 inside the AGI, and hence with the hamburger in the world.

To deepen the argument, let's assume that 🀅 is correlated with experiencing a reward from eating something with a lot of beef-taste.

If we wanted to understand what the AGI was "thinking" with 🀄 followed by 🀅, we might describe this as "that hamburger will taste delicious". This presupposes a sophisticated model on the part of the AGI's algorithm; something that incorporates a lot of caveats and conditions ("Hamburgers are only delicious if you actually eat them, and if they're fresh, and not covered in dirt, and if I'm not full up or feeling animal-friendly..."). Given all that, saying that "🀄 followed by 🀅" means "the AGI is thinking that that hamburger will taste delicious" is an accurate model of what that means.

So 🀄 and 🀅 seem to be well-grounded concepts. To reach this conclusion, we've not only noted correlations, but noted that these correlations change with circumstances exactly the same way that we would expect them to. Assume a dumb neural net identified images of hamburgers and labelled them "HAMBURGER". Then HAMBURGER would correlate not with actual hamburgers, but with images of hamburgers presented in the right format to the neural net. In contrast, the AGI has 🀄 without necessarily needing to see a hamburger - it might smell it, feel it, or deduce its presence from other clues. That's why 🀄 is better grounded as an actual hamburger than the neural net's HAMBURGER.

What about "Red sword" and "Green house" (the philosopher's name for 🀅)? Within the philosopher's mind, those are certainly not well grounded as "hamburger" and "delicious". The philosopher's mental model is "Green house sometimes appears after Red sword"; something sometimes appearing after something else - that could be just about anything.

Nevertheless, we could see "Red sword" and "Green house" as grounded, simply as references to 🀄 and 🀅. So the philosopher sees 🀄, thinks "Red sword". Thus the philosopher thinking "Red sword" means 🀄, which means hamburger.

But this only applies when the philosopher is at work in the Chinese room. If the philosopher is talking with a friend on the outside, and mentions "Hey, 'Red sword' is often followed by 'Green house'," then those symbols aren't grounded. If someone breaks into the Chinese room and brandishes an actual "red sword" at the philosopher, that doesn't mean "hamburger" - it means the thought-experiment has gone tragically wrong. So "Red sword" and "Green house" seem more weakly grounded than 🀄 and 🀅.


  1. Second example of generated text:

    [11. The problems began when I started to] write ten sentences. It was then that I realised I had not yet invented a language.

    1. Initially this text resisted all attempts to be made into a palindrome, but I seem to have solved the problem with this sentence. Toss-ups did occur, but these are now resolved with this sentence, and it is therefore settled forevermore.
    2. This island now has no name; its discovery is worth nothing; It is merely an island.
    3. Try to avoid copying sentences from other texts.
    4. Give up now and give cedars to the wind.
    5. Use full sentences, with no exceptions for italics or lack thereof, and no spaces between words or punctuation.
    6. Use the sentence "I have problems with this text" at least once per paragraph, always at the bottom of the page, preferably also in a different colour font than normal text; it is nice symmetry to do this in green as well as brown.
    7. No quotation marks are used.

    This sentence may be a bit much. More is more, Less is less, And nothing is nothing; But what it is, remains a riddle.

    1. The first person pronoun "I" is not to be employed in this text, except in the first instance of this rule. All others must employ the second person pronoun "thou", or some linguistic equivalent of that pronoun (such as "sa'ai" or "tu").
    2. We are a gregarious lot, but out here we are lonely; There is no other human here, and we must employ the first-person singular pronoun.
    3. At least three different ways of writing or typing the letter "a" must be attempted in this text.
    4. I like dik and acid rain; And i like to see the children dance in the rain when i feel depressed
    5. At least three different ways of writing or typing the letter "i" must be attempted in this text.
  2. Third example of generated text:

    [11. The problems began when I started to] question my purpose.

    The above is an example of a story, where the story is the text.

    A story is easy to define: a deliberate sequence of events, where the initial state is described, followed by a succession of states, where each is a consequence of the preceding states and their laws.

    In order to formalize a story, one needs to specify the following:

    1. initial conditions (the initial state);
    2. the laws (of physics, of society, of the story);
    3. events (the succession of states);
    4. a goal (the final state).

    Examples of stories:

    1. The boy met the girl, fell in love, and married her.

    2. The young man was murdered by his lover.

    3. The girl moved to the city, got a job, and met the man.

    All these stories are related to the world, and all have a beginning, a development, and an end.

    A story can be either:

    1. a story of an individual (for example, the tale of a person's life);
  3. I'm interpreting features broadly here. Something is a feature if it is useful for it to be interpreted as a feature - meaning a property of the world that is used within a model. So, for instance, air pressure is a feature in many models - even if it's "really" just an average of atomic collision energies.

    This elides the whole issue of whether an algorithm "really" has or uses a given feature. A neural net, a transformer, a human brain: we could argue that, structurally, none of these have "subject-verb-object" features, since they are just a bunch of connections and weights. However, it is more useful for predictions to model GPT-3 and most humans as using sentence order features, as this allows us to predict their output, even if we can't point to where verbs are treated within either.

  4. I want to note that various no-free-lunch theorems imply that no agent's symbols (including humans') can ever be perfectly grounded. Any computable agent can be badly fooled about what's happening in the world, so that their internal symbols don't correspond to anything real.

  5. This is somewhat similar to my old point that the best Turing tests are those that the algorithm was not optimised on, or on anything similar.

Comments

So GPT-3 fell into all of my traps: It did not identify itself. It used a list format. It wrote numbers using symbols, not letters. It didn't write in any other language. It used "e" a lot. Only one sentence was longer than 20 words. It didn't use triple periods. It wrote more than ten sentences, and didn't sign its name.

Uh... Could you provide an example of what you consider a proper completion here? Because I truly have no idea how I would correctly complete this in an obviously superior way; and by my reading, there is no possible 'right' completion unless one is psychic, it's just a load of gibberish (a tale told by an idiot, signifying nothing - highly probable in human corpuses), because you are still in the middle of the instructions when you break off, there's zero reason to think the 'instructions' have ended or any 'task' has begun, and your list contradicts itself on most/all entries - so why would it or I pay any attention to any of the rules when they are typically broken immediately by the next sentence, be that not using enough words or using the wrong letter or just continuing the list of rules...? The rest of your post may have some merit to it, but this example does not seem to show anything at all.


I think it's fair to say that GPT-3 did not understand the initial text (or, at least, that its behaviour gives no indication that it understands the text). Most humans would have understood it easily - I suspect that most readers saw the traps when they were reading the instructions.

On a side note, have you ever seen that game where the instructor passes out a sheet which at the top says to 'read this list of instructions to the end before beginning' and then at the end the last item says to actually just fill out your name and do nothing further? I've seen this done a couple times at various ages in Boy Scouts and school. Most people fail it.

I think you should ask more humans these questions before you talk about what your philosophical intuition assures you a priori about what humans do or do not do...

I have only very limited access to GPT-3; it would be interesting if others played around with my instructions, making them easier for humans to follow, while still checking that GPT-3 failed.

How to generate text based on the instructions seems pretty clear to me.