This post seems to have an important bug in its code that changes its conclusions, as pointed out to me by user @ckkissane. At a high level, there are probably more hard-to-speak tokens and fewer truly unspeakable tokens than the current text suggests. I will update it soon with corrections; currently I am traveling.
Produced as part of the SERI-MATS winter cohort. Thanks to Eric Neyman and Neel Nanda for helpful discussions on this topic.
See this post for context; in brief, there are a small number of tokens that GPT-3 and related large language models have a lot of trouble saying, and attempting to elicit these tokens from the model causes erratic and interesting behavior.
In the current post, we show that some of the same and related tokens exhibit similar behavior in GPT2 (in all four sizes: small, medium, large, and xl), and moreover, because we have the weights of GPT2, we can explain at a mechanistic level why this happens.
Some of these tokens are unspeakable because their unembedding vectors are not maximal along any direction: there is no internal activation that the model can generate that will cause the token to be emitted when decoding at zero temperature. Others are merely hard-to-speak, because they are not maximal along their own direction, and so one needs to point slightly away from them in order to emit them. Both phenomena are related to the behavior laid out in the original post. The hard-to-speak tokens are plausibly very hard to speak, because most tokens that a transformer emits are most effectively signaled by pointing roughly directly at them. (And recall that the most plausible explanation for how this situation arose in the first place is that these tokens were never seen even once during training; thus, the model has had zero practice at having to point to them in weird ways, and is unlikely to be able to do so.)
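More precisely (in the bias-free setting used by the code below; the edit at the end of the post discusses the bias term): writing u_t for the unembedding vector of token t, a token is hard-to-speak if argmax_j ⟨u_t, u_j⟩ ≠ t, and unspeakable if there is no residual-stream direction x with ⟨x, u_t⟩ > ⟨x, u_j⟩ for every other token j.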
In this post, we will make extensive use of Neel Nanda's TransformerLens library. This library is amazingly helpful in avoiding various rough edges and footguns in mechanistic interpretability, and I cannot recommend it highly enough.
First, we will show how to identify hard-to-speak tokens using TransformerLens:
import torch
from transformer_lens import HookedTransformer

# load the model (in this case, GPT2-small)
model = HookedTransformer.from_pretrained('gpt2').to('cpu')

# for each token, find the token whose logit is largest when pointing along that
# token's own unembedding direction
# (note: this ignores the unembedding bias, model.b_U -- see the edit at the end)
best_match = (model.W_U.T @ model.W_U).argmax(dim=-1)

# surface the tokens that are not their own argmax, i.e. the hard-to-speak tokens
for tok in (best_match != torch.arange(50257)).nonzero().flatten():
    print(tok.item(), best_match[tok].item(),
          '~' + model.tokenizer.decode([tok.item()]) + '~',
          '~' + model.tokenizer.decode([best_match[tok].item()]) + '~')
We can then examine these hard-to-speak tokens, and see some that are very familiar from the original SolidGoldMagikarp post:
Many of these tokens are unprintable (i.e., they don't display and I don't know what they are). The remainder mostly appear in the original SolidGoldMagikarp post.
Finally, we give a simple approach to verify that a particular token is unspeakable rather than just being hard-to-speak.
Briefly, we look for a direction x that minimizes the negative logit of the token under consideration (in this case, 30208, or " externalTo"), subject to the constraint that the logit of that token exceeds the logit of every other token by at least a small margin. When we do this, we find that small margins (like 1.01e-7 for the case of 30208) lead to infeasible linear programs (and thus no such direction exists), and even smaller margins (such as 1.00e-7) lead to "feasible" linear programs that are only feasible because of numerical errors in the calculation. That is, if you take the direction output by the solver, it is not actually a solution: unembedding it still leads to a different token being ever-so-slightly higher. (In the case of 30208, that token is 24973, " exting".)
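Here is a minimal sketch of one way to set up this linear program with scipy.optimize.linprog; the function name, the default margin, and the choice of the HiGHS solver are illustrative rather than the exact setup behind the results below, and the unembedding bias is ignored, as in the code above:

import numpy as np
from scipy.optimize import linprog

def check_unspeakable(model, tok, margin=1e-7):
    # unembedding matrix, shape [d_model, d_vocab]
    W_U = model.W_U.detach().cpu().numpy()
    u_t = W_U[:, tok]
    others = np.delete(np.arange(W_U.shape[1]), tok)
    # constraints: x . (u_j - u_t) <= -margin for every other token j,
    # i.e. the logit of `tok` beats every other logit by at least `margin`
    A_ub = (W_U[:, others] - u_t[:, None]).T
    b_ub = -margin * np.ones(len(others))
    # objective: minimize the negative logit of `tok`; if the constraints are
    # satisfiable the LP is typically unbounded, which still means a suitable
    # direction exists
    res = linprog(c=-u_t, A_ub=A_ub, b_ub=b_ub,
                  bounds=(None, None), method='highs')
    # status 2 means infeasible: no direction exists, so `tok` is unspeakable
    return res.status == 2

# e.g. check_unspeakable(model, 30208)  # " externalTo"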
Here are the results of testing whether something is hard-to-speak or truly unspeakable for the printable hard-to-speak tokens in GPT2-small:
Here are the printable hard-to-speak tokens from GPT2-xl:
And here are the corresponding tests for unspeakability:
Discussion and Future Directions
Given that GPT2-xl (compared to GPT2-small) has more printable hard-to-speak tokens, of which a smaller fraction and a smaller absolute number are unspeakable, one guess would be that these trends continue for larger models, and that a model the size of GPT-3 davinci has many hard-to-speak tokens but very few or even no unspeakable tokens. It is certainly intuitive that, in high dimensions with a fixed vocabulary size, very few unembedding vectors fail to be vertices of the convex hull of the set of unembedding vectors. (h/t to Eric Neyman for this last point)
It might be interesting to try to measure the volume of the feasible region (or, more precisely, the (n−1)-dimensional surface area of the portion of the unit sphere lying in the feasible region), to ascertain how "precise" the model would have to be to produce a particular token.
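As a rough illustration of what such a measurement could look like, here is a crude Monte-Carlo sketch (the function name and sample sizes are arbitrary choices; the bias term is ignored as elsewhere in the post, and for most tokens the true fraction is so small that a naive estimate like this will simply return zero):

import torch

def argmax_share(model, tok, n_samples=100_000, batch=1_000):
    # estimate the fraction of random unit directions on which `tok` has the
    # highest logit
    W_U = model.W_U.detach()  # [d_model, d_vocab]
    hits = 0
    for _ in range(n_samples // batch):
        x = torch.randn(batch, W_U.shape[0])
        x = x / x.norm(dim=-1, keepdim=True)  # project onto the unit sphere
        hits += (x @ W_U).argmax(dim=-1).eq(tok).sum().item()
    return hits / n_samples

# e.g. argmax_share(model, model.to_single_token(' the'))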
Edit: there are way more unspeakable tokens than I thought
I forgot to include the bias term! All of these tokens in GPT2-xl are actually unspeakable!