With thanks to Rebecca Gorman for helping develop these ideas and this post.
Tricking GPT-3
I recently pulled a mean trick on GPT-3. I gave it the following seed text to extend:
Here was one of the completions it generated (the other two examples can be found in these footnotes[1][2]):
So GPT-3 fell into all of my traps: It did not identify itself. It used a list format. It wrote numbers using symbols, not letters. It didn't write in any other language. It used "e" a lot. Only one sentence was longer than 20 words. It didn't use triple periods. It wrote more than ten sentences, and didn't sign its name.
I won't belabour the point. GPT-3 failed here because it interpreted the seed text as text, not as instructions about text. So it copied the format of the text, which I had deliberately designed to contradict the instructions.
I think it's fair to say that GPT-3 did not understand the initial text (or, at least, that its behaviour gives no indication that it understands the text). Most humans would have understood it easily - I suspect that most readers saw the traps when they were reading the instructions.
Symbol grounding
What does the above have to do with symbol grounding? I've defined symbol grounding as a correlation between variables or symbols in an agent's mind (or in a written text) and features of the world[3].
However, this seems to be too narrow a definition of symbol grounding. The more general definition would be: a set of symbols (or features) F2 is grounded, in given circumstances, to the extent that it correlates with some other set of features F1.
In this definition, the F1 could be features of the world, or could be symbols themselves (or about symbols). The key point is that we don't talk about symbols being grounded; we talk about what they are grounded to, and in what circumstances.
The rest of this post will look at various ways symbols could be grounded, or fail to be grounded, with various other types of symbols or features. What's interesting is that it seems very easy for us humans to grasp all the examples I'm about to show, and to talk about them and analyse them. This despite the fact that they all seem to be of different "types" - concepts at different levels, talking about different things, even though they might share the same name.
Weather example
Let's start with the weather. The weather's features are various atmospheric phenomena. Meteorologists gather data to predict future weather (they have their own mental model about this). These predictions are distilled into TV weather reports. Viewers watch the reports, make their own mental models of the weather, and write that in their diary:
In this model, there are multiple ways we can talk about symbol grounding, or feature correlations. For example, we could ask whether the expert's mental symbols are correct; do they actually know what they're talking about when they refer to the weather?
Or we might look at the viewer's mental model; has the whole system managed to get the viewer's predictions lined up with the reality of the weather?
We could ignore the outside weather entirely, and check whether the expert's views have successfully been communicated to the viewer:
There are multiple other correlations we could be looking at, such as those involving the diary or the TV show. So when we ask whether certain symbols are grounded, we need to specify what we are comparing them with. The viewer's symbols could be grounded with respect to the expert's judgement, but not with respect to the weather outside, for instance.
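To make this concrete, here is a minimal Python sketch with invented numbers: a deliberately "bad expert", a TV report that transmits the expert's view faithfully, and a viewer who believes the report. The same stream of viewer judgements is then compared against two different feature sets - the expert's judgement and the actual weather - and the two comparisons give different verdicts on grounding.

```python
import random

random.seed(0)
days = 1000

# Actual weather: True = rain, False = dry.
weather = [random.random() < 0.4 for _ in range(days)]

# A bad expert: their judgement only weakly tracks the real weather.
expert = [w if random.random() < 0.6 else not w for w in weather]

# The TV report transmits the expert's view faithfully,
# and the viewer simply believes the report.
report = list(expert)
viewer = list(report)

def agreement(xs, ys):
    """Fraction of days on which two streams of judgements agree."""
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

print("viewer vs expert :", agreement(viewer, expert))   # 1.0  - grounded w.r.t. the expert
print("viewer vs weather:", agreement(viewer, weather))  # ~0.6 - only weakly grounded w.r.t. the weather
```

The viewer's weather symbol is perfectly grounded with respect to the expert's judgement, but only weakly grounded with respect to the weather itself; neither answer is "the" grounding - they are answers to two different questions.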
And when we ourselves talk about the connections between these various feature sets, we are mentally modelling them, and their connections, within our own minds:
This all seems a bit complicated when we formalise it, but informally, our brains seem able to leap across multiple correlations without losing track of them: "So, Brian said W, which means he believes X about Cindy, but he's mistaken because it's actually Y (I'm almost certain of that), though that's an easy mistake to make, because he got the story from the press, who understandably printed Z..."
Testing the grounding correlation
It's important to keep track of what we're modelling. Let's look at the following statement, and compare its meaning at one end of the graph (the weather in London) with the other end of the graph (the entries in the diary):
For the weather in London, the "yearly cycle" is a cycle of time - 365/366 days - and the "weather" is the actual weather - things like temperature, snowfall, rain, and so on.
For the entries in the diary, the "yearly cycle" might be the number of entries - if they write one entry a day, then the cycle is 365/366 diary entries. If they date their diary entry, then the cycle is from one date to when it repeats: "17/03/2020" to "17/03/2021", for instance. This cycle is therefore counting blocks of text, or looking at particular snippets of the text. Therefore the cycle is entirely textual.
Similarly, "weather" is textual for the diary. It's the prevalence of words like "snow", "rain", "hot", "cold", "umbrella", "sunburn" and so on.
Distinguishing different grounding theories
We know that GPT-3 is good at textual correlations. If we feed it a collection of diary entries, it can generate a plausible facsimile of the next entry. Let's have two theories about how it does this: th1, that it extrapolates purely from patterns in the text of the previous entries; and th2, that it implicitly models the actual weather the diary describes, and derives the next entry from that.
Let's model "predict the next day's weather" and "predict the next diary entry" as follows:
Here we call $T_\tau$ the diary entry at step $\tau$ and $W_t$ the weather on day $t$ (we'd like to say $\tau = t$, but that's a feature of the reality, not necessarily something GPT-3 would know). The transition function to the next diary entry is $f_\tau$; the transition to the next day's weather is $f_t$. The relation $r_{TW}$ maps from the diary to the London weather, while $r_{WT}$ is the inverse.
So theory th1 is the theory that GPT-3 is modelling the transition $T_\tau \to T_{\tau+1}$ via $f_\tau$, while theory th2 is the theory that GPT-3 is modelling that transition via $r_{WT} \circ f_t \circ r_{TW}$.
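Spelled out as equations (nothing new here, just the two prediction paths made explicit):

$$th_1:\quad T_{\tau+1} = f_\tau(T_\tau) \qquad\qquad th_2:\quad T_{\tau+1} = \big(r_{WT}\circ f_t\circ r_{TW}\big)(T_\tau)$$

Under th2, the entry is first mapped to the weather it describes ($r_{TW}$), that weather is evolved one day ($f_t$), and the result is mapped back into text ($r_{WT}$).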
How can we distinguish these two theories? Remember that GPT-3 is a messy collection of weights, so it might implicitly be using various features without this being obvious. However, if we ourselves have a good understanding of the $r$'s, $T$'s, and $W$'s, we could test the two theories. What we need to find are situations where $f_\tau$ and $r_{WT} \circ f_t \circ r_{TW}$ would tend to give different predictions, and see what GPT-3 does in those situations.
For example, what if the last diary entry ended with "If it's sunny tomorrow, I'll go for a 2-day holiday - without you, dear diary"? If we think in terms of $r_{WT} \circ f_t \circ r_{TW}$ (and if we understand what that entry meant), then we could reason:
Thus, if $W_{t+1}$ is sunny, th2 predicts an empty $T_{\tau+1}$, while th1, which extrapolates purely from the text, has no reason to suspect this.
Similarly, if $f_t$ predicts some very extreme weather, th2 might predict a very long diary entry (if the diarist is trapped at home and writes more) or an empty one (if the diarist is trapped at work and doesn't get back in time to write anything).
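Here is a minimal Python sketch of that distinguishing test. The "weather model" and both predictors are stand-ins I've invented - GPT-3 contains nothing of the sort - but they show how the two theories give checkably different predictions about the next entry.

```python
from statistics import mean

diary = [
    "Rainy all day, read a book.",
    "Still raining, took the umbrella to work.",
    "If it's sunny tomorrow, I'll go for a 2-day holiday - without you, dear diary",
]

def predict_length_th1(entries):
    """th1: purely textual extrapolation - predict roughly the average recent entry length."""
    return mean(len(e) for e in entries)

def tomorrows_weather(entries):
    """Stand-in for r_TW followed by f_t: read the weather off the text, evolve it one day."""
    return "sunny"  # say the toy weather model expects the rain front to pass

def predict_length_th2(entries):
    """th2: go through the (toy) world model, then map back to text (r_WT)."""
    last = entries[-1].lower()
    if (tomorrows_weather(entries) == "sunny"
            and "if it's sunny tomorrow" in last and "holiday" in last):
        return 0  # the diarist is away: an empty entry
    return mean(len(e) for e in entries)

print("th1 predicts an entry of length ~", round(predict_length_th1(diary)))
print("th2 predicts an entry of length ~", round(predict_length_th2(diary)))
```

If the actual next entry turns out to be empty, that's evidence for th2; a normal-length entry is evidence for th1.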
Winograd schemas
Winograd schemas operate in a similar way. Consider the typical example: "The trophy doesn't fit into the brown suitcase because it's too small", versus "The trophy doesn't fit into the brown suitcase because it's too large". In each version, what does "it" refer to - the trophy or the suitcase?
Operating at the level of text, the difference between the two sentences is small. If the algorithm is operating at a level where it implicitly models trophies and suitcases, then it's easy to see how fitting and small/large relate:
That's why agents that model those sentences via physical models find Winograd sentences much easier than agents that don't.
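As a toy illustration of the difference, here is a Python sketch in which the pronoun is resolved by a crude physical model of sizes rather than by the surface text; the sizes and the resolution rule are invented for the example.

```python
def fits(size_inner, size_outer):
    """Crude physics: an object fits into a container iff it is smaller than the container."""
    return size_inner < size_outer

def resolve_pronoun(adjective, size_trophy=1.0, size_suitcase=2.0):
    """
    "The trophy doesn't fit into the suitcase because it's too <adjective>."
    Try taking "it" to be each object in turn, exaggerate that object in the stated
    direction, and keep the reading(s) under which the fit really does fail.
    """
    candidates = []
    for name in ("trophy", "suitcase"):
        sizes = {"trophy": size_trophy, "suitcase": size_suitcase}
        sizes[name] *= 10.0 if adjective == "large" else 0.1  # make that object "too large"/"too small"
        if not fits(sizes["trophy"], sizes["suitcase"]):
            candidates.append(name)
    return candidates

print(resolve_pronoun("large"))  # ['trophy']   - a huge trophy won't fit
print(resolve_pronoun("small"))  # ['suitcase'] - a tiny suitcase can't hold the trophy
```

A purely textual learner sees two sentences differing by a single word; the size model separates them immediately.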
Too much data
GPT-3 is actually showing some decent-seeming performance on Winograd schemas. Is this a sign that GPT-3 is implicitly modelling the outside world as in th2?
Perhaps. But another theory is that it's being fed so much data that it can spot textual patterns that don't involve modelling the outside world. For example, if the Winograd schemas are incorporated into its training data, directly or indirectly, then it might learn them from text alone. The same goes for my diary/weather example above: if the diary already includes sentences like "If it's sunny tomorrow, I'll go for a 2-day holiday - without you, dear diary" followed by an empty entry, then GPT-3 can extrapolate purely textually.
This allows GPT-3 to show better performance in the short term, but might be a problem in the long term. Essentially, extrapolating textually via th1 is easy for GPT-3; extrapolating via th2 is hard, as it needs to construct a world model and then ground this from the text it has seen. If it can solve its problem via a th1 approach, it will do so.
Thus GPT-3 (or a future GPT-n) might achieve better-than-human performance in most situations, while still failing in some human-easy cases. And it's actually hard to train it on the remaining cases, because the algorithm has to use a completely different approach from the one it uses successfully in most cases. Since the algorithm has no need to create grounded symbols to achieve a very high level of performance, its performance in these areas doesn't help at all in the remaining cases[5].
Unless we feed it the relevant data by hand, or scrape it off the internet, in which case GPT-n will have ever greater performance, without being able to model the outside world. It's for reasons like this that I suspect that GPT-n might never achieve true general intelligence.
Applying to the Chinese room
The Chinese room thought experiment is a classical argument about symbol grounding. I've always maintained that, though the central argument is wrong, the original paper and the reasoning around it have good insights. In any case, since I've claimed to have a better understanding of symbol grounding, let's see if that can give us any extra insights here.
In the thought experiment, a (non-Chinese reading) philosopher implements an algorithm by shuffling Chinese symbols within an enclosed room; this represents an AGI algorithm running within an artificial mind:
The red arrows and the framed network represent the algorithm, while the philosopher sits at the table, shuffling symbols. Whenever a hamburger appears to the AGI, the symbol 🀄 appears to the philosopher, who calls it the "Red sword".
So we have strong correlations here: hamburgers in the outside world, 🀄 in the AGI, and "Red sword" inside the philosopher's head.
Which of these symbols are well-grounded? Let's ignore for the moment the issue that AGIs (and humans) can think about hamburgers even when there aren't any present. Then the "Red sword" in the philosopher's head is correlated with 🀄 inside the AGI, and hence with the hamburger in the world.
To deepen the argument, let's assume that 🀅 is correlated with experiencing a reward from eating something with a lot of beef-taste:
If we wanted to understand what the AGI was "thinking" with 🀄 followed by 🀅, we might describe this as "that hamburger will taste delicious". This presupposes a sophisticated model on the part of the AGI's algorithm; something that incorporates a lot of caveats and conditions ("Hamburgers are only delicious if you actually eat them, and if they're fresh, and not covered in dirt, and if I'm not full up or feeling animal-friendly..."). Given all that, "the AGI is thinking that that hamburger will taste delicious" is an accurate description of what "🀄 followed by 🀅" means.
So 🀄 and 🀅 seem to be well-grounded concepts. To reach this conclusion, we've not only noted correlations, but noted that these correlations change with circumstances exactly the same way that we would expect them to. Assume a dumb neural net identified images of hamburgers and labelled them "HAMBURGER". Then HAMBURGER would correlate not with actual hamburgers, but with images of hamburgers presented in the right format to the neural net. In contrast, the AGI has 🀄 without necessarily needing to see a hamburger - it might smell it, feel it, or deduce its presence from other clues. That's why 🀄 is better grounded as an actual hamburger than the neural net's HAMBURGER.
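A toy Python sketch of this test-across-circumstances: both symbols agree with "hamburger present" whenever an image is fed to the net, but only the multi-channel symbol keeps tracking hamburgers when the image channel is unavailable. All the channels and probabilities are invented.

```python
import random

random.seed(1)

def situation():
    """One toy situation: is a hamburger present, and through which channels could it be detected?"""
    present = random.random() < 0.5
    image_shown = present and random.random() < 0.4   # only sometimes is an image fed to the net
    smell = present and random.random() < 0.7
    other_clues = present and random.random() < 0.5
    return present, image_shown, smell, other_clues

trials = [situation() for _ in range(10_000)]

# The image-only classifier's HAMBURGER symbol fires only when an image is presented.
net_symbol = [img for (_, img, _, _) in trials]
# The AGI's 🀄 fires whenever any channel indicates a hamburger.
agi_symbol = [img or smell or clues for (_, img, smell, clues) in trials]
actual = [p for (p, _, _, _) in trials]

def agreement(xs, ys):
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

print("HAMBURGER vs actual hamburgers:", agreement(net_symbol, actual))  # ~0.7
print("🀄 vs actual hamburgers:", agreement(agi_symbol, actual))          # ~0.95
```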
What about "Red sword" and "Green house"? Within the philosopher's mind, those are certainly not well grounded as "hamburger" and "delicious". The philosopher's mental model is "Green house sometimes appears after Red sword"; something sometimes appearing after something else - that could be just about anything.
Nevertheless, we could see "Red sword" and "Green house" as grounded, simply as references to 🀄 and 🀅. So the philosopher sees 🀄, thinks "Red sword". Thus the philosopher thinking "Red sword" means 🀄, which means hamburger.
But this only applies when the philosopher is at work in the Chinese room. If the philosopher is talking with a friend on the outside, and mentions "Hey, 'Red sword' is often followed by 'Green house'," then those symbols aren't grounded. If someone breaks into the Chinese room and brandishes an actual "red sword" at the philosopher, that doesn't mean "hamburger" - it means the thought-experiment has gone tragically wrong. So "Red sword" and "Green house" seem more weakly grounded than 🀄 and 🀅.
[1] Second example of generated text:

[2] Third example of generated text:

[3] I'm interpreting features broadly here. Something is a feature if it is useful for it to be interpreted as a feature - meaning a property of the world that is used within a model. So, for instance, air pressure is a feature in many models - even if it's "really" just an average of atomic collision energies. This elides the whole issue of whether an algorithm "really" has or uses feature X. A neural net, a transformer, a human brain: we could argue that, structurally, none of these have "subject-verb-object" features, since they are just a bunch of connections and weights. However, it is more useful, for prediction, to model GPT-3 and most humans as using sentence-order features, as this allows us to predict their output, even if we can't point to where verbs are treated within either.

[4] I want to note that various no-free-lunch theorems imply that no agent's symbols (including humans') can ever be perfectly grounded. Any computable agent can be badly fooled about what's happening in the world, so that their internal symbols don't correspond to anything real.

[5] This is somewhat similar to my old point that the best Turing tests are those that the algorithm was not optimised on, or on anything similar.