I think that taking a theoretical computer science perspective on GPT-3 might help clarify some parts of the debate.
My background is in theoretical computer science, and that's my preferred perspective for making sense of the world. So in this post, I try to rephrase the task we ask GPT-3 to solve in terms of search and optimization problems, and I propose different criteria for solving such a task. I thus treat GPT-3 as a black box, and only ask how to decide whether it is good at its task or not.
Epistemic status: probably nothing new, but maybe an interesting perspective. This is an attempt to understand the intuitions behind different statements about whether GPT-3 "solves some task".
The $\mathrm{Prompt}_p$ Search Problem
Let's fix a language (English for example). Then a very basic description of the task thrown at GPT-3 is that it's given a prompt $p$ and it needs to give a follow-up text $x$ such that $x$ is a good answer to the prompt. For example, $p$ might be a question about parentheses or addition, or a story to complete, and $x$ is GPT-3's answer. I thus define $\mathrm{Prompt}_p$ to be this search problem for a given prompt $p$; I also write $\mathrm{Prompt}_p$ for its set of good answers.
Now, what does it mean for an algorithm $A$ to solve $\mathrm{Prompt}_p$? Multiple possibilities come to mind:
(Deterministic Solution) For every execution $t$ of $A$, writing $A_t(p)$ for the value of $A$ on prompt $p$ at the end of $t$, we have $A_t(p) \in \mathrm{Prompt}_p$. This criterion has the advantage of simplicity, but it's also completely unrealistic -- even a human might fail to give a decent answer to a prompt.
(Nondeterministic Solution) There is an execution $t$ of $A$ that satisfies $A_t(p) \in \mathrm{Prompt}_p$. Here, we only require that $A$ might give a correct answer, without any other guarantee. This is obviously too weak, as monkeys bashing on keyboards will eventually give a good answer -- yet they are not "solving" it.
(Probabilistic Solution) For a meaningful distribution on the executions of $A$ (for example a uniform distribution or a distribution based on simplicity), we have $\Pr[A_t(p) \in \mathrm{Prompt}_p] > l$, for some $l \in \,]0,1[$. This looks more realistic, as it allows for a margin of error. What about the value of $l$? I think we can go with any constant $l > 1/2$, just like for probabilistic Turing machines: it ensures that we can make the probability of success as high as we want by repeating the call to $A$ a polynomial number of times (so a polynomial number of rerolls).
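The probabilistic criterion can be sketched in code. This is only a minimal illustration, not a claim about how GPT-3 is or should be evaluated: it assumes we can sample executions of the algorithm and that we have an oracle deciding whether an answer is good. The names `estimate_success_probability`, `prob_some_reroll_is_good`, `toy_algorithm`, and `toy_is_good_answer` are all hypothetical, and the toy algorithm is a stand-in for a black box.

```python
import random


def estimate_success_probability(algorithm, is_good_answer, prompt, trials=1000, seed=0):
    """Monte Carlo estimate of Pr[A_t(p) in Prompt_p] over sampled executions."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    successes = 0
    for _ in range(trials):
        answer = algorithm(prompt, rng)  # one execution t of A on prompt p
        if is_good_answer(prompt, answer):
            successes += 1
    return successes / trials


def prob_some_reroll_is_good(l, k):
    """Chance that at least one of k independent rerolls is good,
    if each single call is good with probability l."""
    return 1 - (1 - l) ** k


# Hypothetical toy algorithm: answers "4" about 70% of the time, "5" otherwise.
def toy_algorithm(prompt, rng):
    return "4" if rng.random() < 0.7 else "5"


# Hypothetical oracle for membership in Prompt_p (here: exact match on "4").
def toy_is_good_answer(prompt, answer):
    return answer == "4"


estimate = estimate_success_probability(toy_algorithm, toy_is_good_answer, "2 + 2 =")
print(estimate > 0.5)                    # probabilistic-solution check with l = 1/2
print(prob_some_reroll_is_good(0.5, 3))  # 0.875
```

Note that the estimate is empirical, so it only approximates the probability in the criterion. The reroll formula also only says that a good answer appears among the rerolls; picking it out still requires some way to recognise good answers.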
Okay, so GPT-3 is solving $\mathrm{Prompt}_p$ iff it is a probabilistic solution to it. Is it enough?
Not exactly. In some cases, we don't want a binary choice between good and bad answers; there might be a scale of answers which are more and more correct. That is to say, it should be an optimization problem.
The $\mathrm{BestPrompt}_p$ Optimization Problem
To go from a search problem to an optimization problem, we need a measure of how good the solution is. Let's say we have a function $\mathrm{measure}_p : \mathrm{Answers} \to \mathbb{R}^+$. The optimization problem $\mathrm{BestPrompt}_p$ can then be defined with our probabilistic criterion in multiple ways:
(Best Answer) $A$ solves $\mathrm{BestPrompt}_p$ if $\Pr[\forall x \in \mathrm{Answers} : \mathrm{measure}_p(A_t(p)) \geq \mathrm{measure}_p(x)] > l$, with $l > 1/2$. So $A$ must give one of the best possible answers with high probability. It's less obviously unrealistic than the deterministic case, but I would argue it's still too strong: humans might fail to give the best possible answer, whatever that means in this context.
(Good Enough Answer) $A$ solves $\mathrm{BestPrompt}_p$ if $\Pr[\mathrm{measure}_p(A_t(p)) \geq \mathrm{thresh}] > l$, with $l > 1/2$ and $\mathrm{thresh} > 0$. This asks $A$ to give good enough answers with high probability. The threshold depends on the $\mathrm{measure}_p$ function, but if the latter captures what we look for in the answer, there should be some point after which we're satisfied with every answer.
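As a rough sketch of the two criteria above, assuming a measure function is available and that the answer set is finite for the Best Answer case. The function names and the toy measure (closeness to 42 for an addition prompt) are hypothetical, chosen only for illustration.

```python
import random


def solves_good_enough(algorithm, measure, prompt, thresh, l=0.5, trials=1000, seed=0):
    """Empirical check of Pr[measure_p(A_t(p)) >= thresh] > l."""
    rng = random.Random(seed)
    hits = sum(measure(prompt, algorithm(prompt, rng)) >= thresh for _ in range(trials))
    return hits / trials > l


def solves_best_answer(algorithm, measure, prompt, answers, l=0.5, trials=1000, seed=0):
    """Empirical check of the Best Answer criterion over a finite set `answers`.

    Beating every x in `answers` is the same as reaching the maximum measure.
    """
    best = max(measure(prompt, x) for x in answers)
    return solves_good_enough(algorithm, measure, prompt, best, l, trials, seed)


# Hypothetical measure for the prompt "17 + 25 =": 1.0 for 42, lower further away.
def toy_measure(prompt, answer):
    return 1 / (1 + abs(answer - 42))


# Hypothetical noisy algorithm: exactly right three times out of five on average.
def toy_algorithm(prompt, rng):
    return rng.choice([41, 42, 42, 42, 43])


print(solves_good_enough(toy_algorithm, toy_measure, "17 + 25 =", thresh=0.5))
print(solves_best_answer(toy_algorithm, toy_measure, "17 + 25 =", answers=range(100)))
```

With `thresh=0.5` every sampled answer passes, since being off by one still scores 0.5. The Best Answer check is stricter, because only the exact answer 42 reaches the maximum measure.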
This new criterion feels much closer to what I want GPT-3 to do when I give it a prompt. But I don't know how to write a measure function that captures what I want (after all, that would be a utility function for the task!). We can do without one, by instead comparing answers with each other. I probably have a couple of answers in mind when writing my prompt; and when I receive an answer, I feel that comparing it to one of my expected answers should be possible.
In consequence, I give the following criterion. Assume the existence of a set of expected answers $\mathrm{expected}_p$ and of a comparison relation between answers $\geq_p$. Then
(As Good As Expected) $A$ solves $\mathrm{BestPrompt}_p$ if $\Pr[\exists x \in \mathrm{expected}_p : A_t(p) \geq_p x] > l$, with $l > 1/2$.
I am personally happy with that criterion. And I think it is rather uncontroversial (with maybe the subtlety that some people would prefer a ∀ in place of the ∃).
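Here is a minimal sketch of the "As Good As Expected" criterion, with the choice between ∃ and ∀ exposed as a parameter. Everything in it is an assumption made for illustration: the toy task (listing primes below 20), the toy comparison relation (counting correct primes), and all function names.

```python
import random

PRIMES_BELOW_20 = {2, 3, 5, 7, 11, 13, 17, 19}
ORDERED_PRIMES = sorted(PRIMES_BELOW_20)


def solves_as_good_as_expected(algorithm, expected, at_least_as_good, prompt,
                               l=0.5, trials=1000, seed=0, quantifier=any):
    """Empirical check of Pr[exists x in expected_p : A_t(p) >=_p x] > l.

    Pass quantifier=all to get the stricter "for all x" variant instead.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answer = algorithm(prompt, rng)
        if quantifier(at_least_as_good(answer, x) for x in expected):
            hits += 1
    return hits / trials > l


# Hypothetical comparison relation: an answer is at least as good as another
# if it lists at least as many correct primes.
def toy_at_least_as_good(answer, other):
    def score(s):
        return len(s & PRIMES_BELOW_20)
    return score(answer) >= score(other)


# Hypothetical algorithm: lists the first n primes, with n uniform in 2..8.
def toy_algorithm(prompt, rng):
    n = rng.randint(2, 8)  # randint includes both endpoints
    return set(ORDERED_PRIMES[:n])


prompt = "List as many primes below 20 as you can:"
expected = [{2, 3, 5}, PRIMES_BELOW_20]  # a modest and a complete expected answer

exists_variant = solves_as_good_as_expected(
    toy_algorithm, expected, toy_at_least_as_good, prompt)
forall_variant = solves_as_good_as_expected(
    toy_algorithm, expected, toy_at_least_as_good, prompt, quantifier=all)
print(exists_variant, forall_variant)
```

In this toy setup the ∃ variant only requires matching the modest expected answer, which happens in about six executions out of seven, while the ∀ variant requires matching the complete one, which happens in about one out of seven. So the two quantifiers can disagree on the same algorithm.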
But we're not done yet. This only constrains what the algorithm does for a fixed prompt. A big part of the discussion of GPT-3 revolves around the number of prompts for which GPT-3 should give an interesting answer. This requires a step back.
Solving the Task
When evaluating GPT-3, what I really care about is its answers for a specific task (like addition or causality). And this task can be encoded through many prompts. Let's say that we have a set $P$ of prompts that somehow capture our task (from our perspective). What constraints should an algorithm $A$ satisfy to be said to solve the task? We already have a meaningful criterion for answering a prompt correctly: the "As Good As Expected" one. What's left to decide is for which prompts in $P$ the algorithm $A$ should satisfy this criterion.
(All) Some critics of GPT-3 seem to argue that it should behave correctly for every prompt. It's a pretty strong condition, but it might be possible given that the condition for each prompt is probabilistic.
(One) On the other hand, gwern among others seems to defend the position that what matters is that some prompt exists for which GPT-3 answers correctly (according to our per-prompt criterion).
(Some fraction) And there's the position in the middle, where there's a proportion $r \in \,]0,1[$ such that $\frac{\mathrm{nbPromptsSolved}}{|P|} \geq r$.
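The three positions can be compared side by side with a small sketch. It assumes we already know, for each prompt in P, whether the per-prompt criterion holds; the function `solves_task`, its `mode` parameter, and the example prompts are hypothetical.

```python
def solves_task(solved, mode, r=None):
    """Decide whether a task counts as solved.

    `solved` maps each prompt in P to True/False, i.e. whether the per-prompt
    criterion holds for it. `mode` is "all", "one", or "fraction".
    """
    if not solved:
        raise ValueError("the prompt set P must be non-empty")
    nb_prompts_solved = sum(solved.values())  # True counts as 1
    if mode == "all":
        return nb_prompts_solved == len(solved)
    if mode == "one":
        return nb_prompts_solved >= 1
    if mode == "fraction":
        if r is None or not 0 < r < 1:
            raise ValueError("r must lie strictly between 0 and 1")
        return nb_prompts_solved / len(solved) >= r
    raise ValueError(f"unknown mode: {mode!r}")


# Hypothetical per-prompt results for an addition task: 3 of 4 prompts solved.
results = {
    "17 + 25 =": True,
    "What is 17 plus 25?": True,
    "Q: 17+25? A:": True,
    "seventeen and twenty-five make": False,
}

print(solves_task(results, "all"))              # False
print(solves_task(results, "one"))              # True
print(solves_task(results, "fraction", r=0.5))  # True
print(solves_task(results, "fraction", r=0.9))  # False
```

The same per-prompt results give different verdicts depending on the position taken, which is one way to see how people can look at the same outputs and disagree.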
Contrary to the part about solving a prompt, the answer doesn't look obvious to me here. I also believe that much of the disagreement about GPT-3's competence is based on a disagreement about this part.
Conclusion
Thinking about GPT-3, and the sort of problem we want it to solve, through the lens of theoretical computer science helped me understand the debate surrounding it slightly better. I hope it can be helpful to you too. And I'm also very curious about any errors you might find in this post, whether conceptual or technical.