Can you get AGI from a Transformer?

This was a really interesting read. I'm glad you decided to post it.

I find all of the pieces fairly interesting and plausible in isolation, but I think the post either underexplains or understates the case for how it all fits together. As written, the post roughly says:

first section: there are probably some kinds of computation which can't be efficiently supported (though such computations might not actually matter)
second section: here's one particular kind of computation which seems pretty central to how humans think (though I don't really see a case made here that it's necessary)
third section: some things which might therefore be difficult

I do think the case can be strengthened a lot, especially for low sample efficiency and the difficulty of inventing new concepts. Here's a rough outline of the argument I would make:

Sample efficiency is all about how well we approximate Bayesian reasoning. This is one of the few places where both the theory and the empirics have a particularly strong case for Bayes.
Bayesian reasoning, on the sort of problems we're talking about, means generative models. So if we want sample efficiency, then generative models (or a good approximation thereof) are a necessary element.
GPT-style models do not have "basin of attraction"-style convergence, where they learn the general concept of generative models and can then easily create new generative models going forward. They have to converge to each new generative model the hard way.

That last step is the one I'd be most uncertain about, but it's also a claim which is "just math", so it could be checked by either analysis or simulation if we know how the models in question are embedded.

[-]Vaniver5y100

There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty.

Sure... but humans can't do those either, without an exorbitant performance penalty! Does this imply that humans alone aren't general intelligences (and thus the threshold we should be worried about is lower), or that they're not actually important for general intelligence?

will eventually be surpassed in AGI-type capabilities by a different kind of information processing

"Surpassed" seems strange to me; I'll bet that the first AGI system will have a very GPT-like module, that will be critical to its performance, that will nevertheless not be "the whole story." Like, by analogy to AlphaGo, the interesting thing was the structure they built around the convnets, but I don't think it would have worked nearly as well without the convnets.

[-]Steven Byrnes5y30

Sure... but humans can't do those either, without an exorbitant performance penalty!

Well, a big part of this post is an argument that the human neocortex is doing a different type of information processing than a DNN, with the neocortex's algorithms being more similar to the algorithms underlying probabilistic programming, message-passing, etc. Therefore I don't accept the premise that in general, if a DNN can't do a certain type of information processing efficiently, then neither can the human brain.

Do you think DNNs and human brains are doing essentially the same type of information processing? If not, how did you conclude "humans can't do those either"? Thanks!

[-]Vaniver5y10

Do you think DNNs and human brains are doing essentially the same type of information processing? If not, how did you conclude "humans can't do those either"? Thanks!

Sorry for the late reply, but I was talking from personal experience. Multiplying matrices is hard! Even for extremely tiny ones, I was sped up tremendously by pencil and paper. It was much harder than driving a car, or recognizing whether a image depicts a dog or not. Given the underlying computational complexity of the various tasks, I can only conclude that I'm paying an exorbitant performance penalty for the matmul. (And I'm in the top few percentiles of calculation ability, so this isn't me being bad at it by human standards.)

The general version of this is Moravec's Paradox.

[edit] Also if you look at the best training I'm aware of to solve a simpler arithmetic problems (the mental abacus method), it too demonstrates this sort of exorbitant performance penalty. They're exapting the ability to do fine motions in 3d space to multiply and add!

[-]Steven Byrnes5y10

Oh OK I think I misunderstood you.

So the context was: I think there's an open question about the extent to which the algorithms underlying human intelligence in particular, and/or AGI more generally, can be built from operations similar to matrix multiplication (and a couple other operations). I'm kinda saying "no, it probably can't" while the scaling-is-all-you-need DNN enthusiasts are kinda saying "yes, it probably can".

Then your response is that humans can't multiply matrices in their heads. Correct? But I don't think that's relevant to this question. Like, we don't have low-level access to our own brains. If you ask GPT-3 (through its API) to simulate a self-attention layer, it wouldn't do particularly well, right? So I don't think it's any evidence either way.

"Surpassed" seems strange to me; I'll bet that the first AGI system will have a very GPT-like module, that will be critical to its performance, that will nevertheless not be "the whole story." Like, by analogy to AlphaGo, the interesting thing was the structure they built around the convnets, but I don't think it would have worked nearly as well without the convnets.

I dunno, certainly that's possible, but also sometimes new algorithms outright replace old algorithms. Like GPT-3 doesn't have any LSTM modules in it, let alone HHMM modules, or syntax tree modules, or GOFAI production rule modules. :-P

[-]Vaniver5y10

Ah, I now suspect that I misunderstood you as well earlier: you wanted your list to be an example of "what you mean by DNN-style calculations" but I maybe interpreted as "a list of things that are hard to do with DNNs". And under that reading, it seemed unfair because the difficulty that even high-quality DNNs have in doing simple arithmetic is mirrored by the difficulty that humans have in doing simple arithmetic.

Similarly, I agree with you that there are lots of things that seem very inefficient to implement via DNNs rather than directly (like MCTS, or simple arithmetic, or so on), but it wouldn't surprise me if it's not that difficult to have a DNN-ish architecture that can more easily implement MCTS than our current ones. The sorts of computations that you can implement with transformers are more complicated than the ones you could implement with convnets, which are more complicated than the ones you could implement with fully connected nets; obviously you can't gradient descent a fully connected net into a convnet, or a convnet into a transformer, but you can still train a transformer with gradient descent.

It's also not obvious to me that humans are doing the more sophisticated thinking 'the smart way' instead of 'the dumb way'. Suppose our planning algorithms are something like MCTS; is it 'coded in directly' like AlphaGo's, or is it more like a massive transformer that gradient-descented its way into doing something like MCTS? Well, for things like arithmetic and propositional logic, it seems pretty clearly done 'the dumb way', for things like planning and causal identification it feels more like an open question, and so I don't want to confidently assert that our brains are doing it the dumb way. My best guess is they have some good tricks, but won't be 'optimal' according to future engineers who understand all of this stuff.

[-]Steven Byrnes5y10

I slightly edited that section header to make it clearer what the parenthetical "(matrix multiplications, ReLUs, etc.)" is referring to. Thanks!

I agree that it's hard to make highly-confident categorical statements about all current and future DNN-ish architectures.

I don't think the human planning algorithm is very much like MCTS, although you can learn to do MCTS (just like you can learn to mentally run any other algorithm—people can learn strategies about what thoughts to think, just like they can strategies about what actions to execute). I think the built-in capability is that compositional-generative-model-based processing I was talking about in this post.

Like, if I tell you "I have a banana blanket", you have a constraint (namely, I just said that I have a banana blanket) and you spend a couple seconds searching through generative models until you find one that is maximally consistent with both that constraint and also all your prior beliefs about the world. You're probably imagining me with a blanket that has pictures of bananas on it, or less likely with a blanket made of banana peels, or maybe you figure I'm just being silly.

So by the same token, imagine you want to squeeze a book into a mostly-full bag. You have a constraint (the book winds up in the bag), and you spend a couple seconds searching through generative models until you find one that's maximally consistent with both that constraint and also all your prior beliefs and demands about the world. You imagine a plausible way to slide the book in without ripping the bag or squishing the other content, and flesh that out into a very specific action plan, and then you pick the book up and do it.

When we need a multi-step plan, too much to search for in one go, we start needing to also rely on other built-in capabilities like chunking stuff together into single units, analogical reasoning (which is really just a special case of compositional-generative-model-based processing), and RL (as mentioned above, RL plays a role in learning to use metacognitive problem-solving strategies). Maybe other things too.

I don't think causality per se is a built-in feature, but I think it comes out pretty quickly from the innate ability to learn (and chunk) time-sequences, and then incorporate those learned sequences into the compositional-generative-model-based processing framework. Like, "I swing my foot and then kick the ball and then the ball is flying away" is a memorized temporal sequence, but it's also awfully close to a causal belief that "kicking the ball causes it to fly away". (...at least in conjunction with a second memorized temporal sequence where I don't kick the ball and it just stays put.) (See also counterfactuals.)

I'm less confident about any of this than I sound :)

[-]Gurkenglas5y30

Re claim 1: If you let it use the page as a scratch pad, you can also let it output commands to a command line interface so it can outsource these hard-to-emulate calculations to the CPU.

[-]Davidmanheim5y20

I'm unsure that GPT3 can output, say, a ipython notebook to get the values it wants.

That would be really interesting to try...

[-]gwern5y00

I'm going to stop at your very first claim and observe: MuZero.

[-]LGS5y20

You are aware that MuZero has tree search hardcoded into it, yes? How does that contradict claim 1?

[-]gwern5y10

MuZero does not do MCTS and still outperforms.

[-]LGS5y110

It does do (a variant of) MCTS. Check it out for yourself. The paper is here:

https://arxiv.org/pdf/1911.08265.pdf

Appendix B, page 12:

"We now describe the search algorithm used by MuZero. Our approach is based upon Monte-Carlo tree search with upper confidence bounds, an approach to planning that converges asymptotically to the optimal policy in single agent domains and to the minimax value function in zero sum games [22]."

[-]gwern4y*10

No. I am very familiar with the paper, and MuZero does not use MCTS, nor does it support the claims of OP.

First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.*

Second, using MCTS for the planning is not necessary. As they note, any kind of planning algorithm, not just MCTS would work ("For example, a naive search could simply select the k step action sequence that maximizes the value function. More generally, we may apply any MDP planning algorithm to the internal rewards and state space induced by the dynamics function.")

Third, NNs absolutely can plan in a 'pure' fashion: TreeQN (which they cite) constructs its own tree which it does its own planning/exploration over in a differentiable fashion. What more do you want? I feel that we should at least acknowledge that TreeQN exists, wasn't insuperably hard to create, and, inasmuch as it runs on current hardware at all, doesn't seem to entail 'a factor of a million slowdown'. (VIN/VPN/Predictron might count as examples here too? There's a lot of model-based RL work which make the NN learn part of the planning process, like Imagination-based Planner or MCTSnets.)

Fourth, planning is not necessary at all for the NN to compute results just as strong as tree search would: just like regular AlphaZero, the policy network on its own, with no rollouts or trees involved of any sort, is very strong, and they show that it increases greatly in strength over training. We also have the scaling law work of Andy Jones, verifying the intuition that anything tree search does can be efficiently distilled into a non-tree-search model trained for longer. (I would also point out the steeply diminishing returns to both depth & number of iterations: AlphaZero or Master, IIRC, used only a few TPUs because the tree-search was a simple one which only descended a few plies; you can also see in the papers like the MuZero appendix referenced that most of the play strength comes from just a few iterations, and they don't even evaluate at more than 800, IIRC. It seems like what tree search does qualitatively is correct the occasional blind spot where the NN thinks forward a few moves for its best move and goes 'oh shit! That's actually a bad idea!'. It's not doing anything super-impressive or subtle. It's just a modest local policy iteration update, if you will. But the NN is what does almost all of the work.) This alone is completely fatal to OP's claims that tree search is an example of useful algorithms neural nets cannot do and that adding orders of magnitude more compute would not make a difference (it totally would - the exact scaling exponent for Go/ALE is unknown but I'd bet that anything you can do with MuZero+tree-search can be done with a larger MuZero's policy alone given another order or three of compute).

So, MuZero does not use MCTS; the symbolic tree planning algorithm(s) it uses are not that important; to the extent that explicit tree planning is useful it can be done in a pure neural fashion; and relatively modest (as these things go) increases in compute can obviate the need for even pure neural tree search.

This refutes Byrne's use of tree search as an example of "Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (= matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty." Tree search is not an example because it already has been cast into DNN form without exorbitant performance penalty.

* for more on what AlphaZero MCTS "really" is, https://arxiv.org/abs/2007.12509 & https://arxiv.org/abs/1804.04577 come to mind.

[-]Steven Byrnes4y*20

Thanks for the comment!

First, that's not MCTS. It is not using random rollouts to the terminal states (literally half the name, 'Monte Carlo Tree Search'). This is abuse of terminology (or more charitably, genericizing the term for easier communication): "MCTS" means something specific, it doesn't simply refer to any kind of tree-ish planning procedure using some sort of heuristic-y thing-y to avoid expanding out the entire tree. The use of a learned latent 'state' space makes this even less MCTS.

Yeah even when I wrote this, I had already seen claims that the so-called MCTS is deterministic. But DeepMind and everyone else apparently calls it MCTS, and I figured I should just follow the crowd, and maybe this is just one of those things where terminology drifts in weird directions and one shouldn't think too hard about it, like how we say "my phone is ringing" when it's actually vibrating. :-P

Looking into it again just now, I'm still not quite sure what's going on. This person says AlphaZero switches from random to non-random after 15 moves. And this person says AlphaZero is totally deterministic but "MCTS" is still the proper term, for reasons that don't make any sense to me. I dunno and I'm open to being educated here. Regardless, if you tell me that I should call it "tree search" instead of "MCTS", I'm inclined to take your word for it. I want to be part of the solution not part of the problem :-D

NNs absolutely can plan in a 'pure' fashion: TreeQN (which they cite) constructs its own tree which it does its own planning/exploration over in a differentiable fashion.

That's an interesting example. I think I need to tone down my claim a bit (and edit the post). Thank you. I will now say exactly what I'm making of that example:

Here is a spectrum of things that one might believe, from most-scaling-hypothesis-y to least:

If you take literally any DNN, and don't change the architecture or algorithm or hyperparameters at all, and just scale it up with appropriate training data and loss functions, we'll get AGI. And this is practical and realistic, and this is what will happen in the near future to create AGI.
If you take literally any DNN, and don't change the architecture or algorithm or hyperparameters at all, and just scale it up with appropriate training data and loss functions, we'll get AGI in principle … but in practice obviously both people and meta-learning algorithms are working hard to find ever-better neural network architectures / algorithms / hyperparameters that give better performance per compute, and they will continue to do so. So in practice we should expect AGI to incorporate some tweaks compared to what we might build today.
Those kinds of tweaks are not just likely for economic reasons but in fact necessary to get to AGI
…and the necessary changes are not just "tweaks", they're "substantive changes / additions to the computation"
…and they will be such a big change that the algorithm will need to involve some key algorithmic steps that are not matrix multiplications, ReLUs, etc.
More than that, AGI will not involve anything remotely like a DNN, not even conceptually similar, not even as one of several components.

My impression is that you're at #2.

I put almost no weight on #6, and never have. In fact I'm a big believer in AGI sharing some aspects of DNNs, including distributed representations, and gradient descent (or at least "something kinda like gradient descent"), and learning from training data, and various kinds of regularization, etc. I think (uncoincidentally) that the family of (IMO neocortex-like) learning algorithms I was talking about in the main part of this post would probably have all those aspects, if they scale to AGI.

So my probability weight is split among #2,3,4,5.

To me, the question raised by the TreeQN paper is: should I shift some weight from #5 to #4?

When I look at the TreeQN paper, e.g. this source code file, I think I can characterize it as "they did some tree-structure-specific indexing operations". (You can correct me if I'm misunderstanding.) Are "tree-structure-specific indexing operations" included in my phrase "matrix multiplications, ReLUs, etc."? I dunno. Certainly stereotypical DNN code involves tons of indexing operations; it's not like it looks out of place! On the other hand, it is something that humans deliberately added to the code.

I guess in retrospect it was kinda pointless for me to make an argument for #5. I shouldn't have even brought it up. In the context of this post, I could have said: "The space of all possible learning algorithms is much vaster than Transformers—or even "slight tweaks on Transformers". Therefore we shouldn't take it for granted that either Transformers or "slight tweaks on Transformers" will necessarily scale to AGI—even if we believe in The Bitter Lesson."

And then a different (and irrelevant-to-this-post) question is whether "matrix multiplications, ReLUs, etc." (whatever that means) is a sufficiently flexible toolkit to span much of the space of all possible useful learning algorithms, in an efficiently-implemented way. My change from yesterday is: if I interpret this "toolkit" to include arbitrary indexing & masking operations, and also arbitrary control flow—basically, if this "toolkit" includes anything that wouldn't look out of place in today's typical DNN source code—then this is a much broader space of (efficiently-implemented) algorithms than I was mentally giving it credit for. This makes me more apt to believe that future AGI algorithms will be built using tools in this toolkit, but also more apt to believe that those algorithms could nevertheless involve a lot of new ideas and ingenuity, and less apt to believe that it's feasible for something like AutoML-Zero to search through the whole space of things that you can do with this toolkit, and less apt to describe the space of things you can build with this toolkit as "algorithms similar to DNNs". For example, here is a probabilistic program inference algorithm that's (at least arguably/partly) built using this "toolkit", and I really don't think of probabilistic program inference as "similar to DNNs".

[-]gwern4y40

Yeah, I didn't want to just nitpick over "is this tree search a MCTS or not", which is why I added in #2-4, which address the steelman - even if you think MuZero is using MCTS, I think that doesn't matter because one doesn't need any tree search at all, so a fortiori that question doesn't matter.

(I also think the MuZero paper is generally confusing and poorly-written, and that's where a lot of confusion is coming from. I am not the only person to read it through several times and come away confused about multiple things, and people trying to independently reimplement MuZero tell me that it seems to leave out a lot of details. There's been multiple interesting followup papers, so perhaps reading them all together would clarify things.)

Yes, so on your spectrum of #1-6, I would put myself at closer to 3 than 2. I would say that while we have the global compute capacity now to scale up what are the moral equivalents of contemporary models to what the scaling laws would predict is human-equivalence (assuming, as seems likely but far from certain, that they more or less hold - we haven't seen any scaling law truly break yet), at the hundreds of trillions to quadrillion parameter regime of Transformers or MLPs, this is only about the compute for a single training run. The hardware exists and the world is wealthy enough to afford it if it wanted to (although it doesn't).

But we actually need the compute for the equivalent of many runs. The reason hardware progress drives algorithmic software progress is because we are absolutely terrible at designing NNs, and are little more than monkeys banging at giant black boxes with trial-and-error, confabulating or retrospectively cherrypicking theories to explain the observed results. Thus we need enough compute to blow on enough runs that a grad student can go 'what if I added a shortcut connection? Oh' or 'these MLP things never work beyond 3 or 4 layers, everyone knows that... but what if I added any kind of normalization, the way we normalize every other kind of NN? Oh' and figure out the right detail which makes it Just Work.

So, we will need a lot of algorithmic efficiency beyond the bare minimum of '1 training run, once', to afford all the slightly-broken training runs.

(Unless we get 'lucky' and the prototyping small runs are so accurate and the code so solid that you can prototype at a tiny scale and do 1 run; I tend to disbelieve this because there's so many issues that always come up as you move several magnitudes, both at just the code level and training.)

On the other hand, it is something that humans deliberately added to the code.

/shrug. If you don't like TreeQN example, I have others! Just keep making the NN deeper (and/or more recurrent, same thing really, when unrolled...), and it'll keep approximating the value function better at fairly modest additional cost compared to 'real' tree search. (After all, the human brain can't have any symbolic discrete tree in it either, it just passes everything forward for the initial glance and then recurs for System 2 thinking through the game tree.)

I see symbolic vs neural as a bias-variance continuum, per the Bitter Lesson: symbolic learns quickly for little compute, but then it tops out, and eventually, the scissors cross, and the more neural you go, the better it gets. So the question ultimately becomes one of budgets. What's your budget? How much constant-factor performance optimization and ultimate ceiling do you need, and how much hand-engineering of that specialized complicated symbolic architecture are you willing to buy? If you have little compute and don't mind attaining less than superhuman performance and buying a lot of complicated domain-specific code, you will move far down the symbolic end; if you have lots of compute and want the best possible generic code...

and less apt to believe that it's feasible for something like AutoML-Zero to search through the whole space of things that you can do with this toolkit, and less apt to describe the space of things you can build with this toolkit as "algorithms similar to DNNs".

But that's where the scaling laws become concerning. Can AutoML-Zero successfully search for "code to implement MCTS with pUCT exploration heuristic and domain-specific tuned hyperparameters with heavy playouts using a shallow MLP for value approximation"? Probably no. That's complex, specialized, and fragile (a half-working version doesn't work at all). Can AutoML-Zero learn "add 10 moar layers to $DEFAULT_NN lol"? ...Probably yes.

[-]gwern4y30

One paper I forgot that bears especially on the question of why you use planning: "On the role of planning in model-based deep reinforcement learning", Hamrick et al 2020:

Model-based planning is often thought to be necessary for deep, careful reasoning and generalization in artificial agents. While recent successes of model-based reinforcement learning (MBRL) with deep function approximation have strengthened this hypothesis, the resulting diversity of model-based methods has also made it difficult to track which components drive success and why. In this paper, we seek to disentangle the contributions of recent methods by focusing on three questions: (1) How does planning benefit MBRL agents? (2) Within planning, what choices drive performance? (3) To what extent does planning improve generalization? To answer these questions, we study the performance of MuZero (Schrittwieser et al., 2019), a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. We show that deep, precise planning is often unnecessary to achieve high reward in many domains, with 2-step planning exhibiting surprisingly strong performance even in Go. (3) Planning alone is insufficient to drive strong generalization. These results indicate where and how to utilize planning in reinforcement learning settings, and highlight a number of open questions for future MBRL research.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

33

Can you get AGI from a Transformer?

33

Introduction

Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (= matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty.

Background Claim 2: We can't just brush aside Background Claim 1 by appealing to the "Bitter Lesson"

Generative-model-based information processing: Background and motivation

Predicting inputs with generative models (a.k.a. “analysis by synthesis” or “probabilistic programming”)

What’s so great about generative-model-based processing?

So, at the end of the day, is the generative-model-based approach essential for superintelligent AGI?

Back to Transformers and GPT-N

Conclusion