The two major points I take away:
Scaling Just Works: as blasé as we may now be about seeing 'lines go straight', I continue to be shocked in my gut that they do just keep going straight, and that something like Gato can be as straightforward as 'just train a 1.2b-param Transformer on half a thousand different tasks, homes, nbd' and it works exactly like you'd think, and the scaling curve looks exactly like you'd expect. It is shocking how unshocking the results are, conditional on a shocking thesis (the scaling hypothesis). So many S-curves and paradigms hit an exponential wall and explode, but DL/DRL still have not. We should keep in mind that every time we scale up, we get another opportunity to observe scaling explode in a giant fireball, and it doesn't.
Multi-task learning is indeed just another blessing of scale: as they note, it used to be that learning multiple Atari games in parallel was really hard. It did not work at all; you got negative transfer even within ALE. People thought very hard and ran lots of experiments to create things like PopArt less than 4 years ago, where it was a triumph that, thanks to careful engineering, a single checkpoint could play just the ALE-57 games with mediocre performance.
Decision Transformer definitely made 'multi-task learning is a blessing of scale' the default hypothesis, but no one had actually shown this; the past DT and other work (aside from MetaMimic) were all rather low-n and low-k, so you could wonder whether tasks would interfere at some point or learning would break down, requiring fancy engineering like MoEs to enable learning at all. (Similarly, Andy Jones showed nice scaling laws for DRL, and I scraped together a few examples like Ms. Pac-Man, but nothing across really complicated tasks or many tasks.)
Now you can throw in not just ALE but DMLab, Meta-World, Procgen — hell, let's just throw in a bunch of random Internet-scraped text and images and captions and treat those as 'reinforcement learning tasks' too, why not? — and to make them all play together you do... nothing, really: you just train on them all simultaneously with a big model in about the dumbest way possible, and it works fine (a toy sketch of the recipe below).
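To make "the dumbest way possible" concrete, here is a minimal sketch (PyTorch, not DeepMind's code) of the recipe: serialize every task — Atari trajectories, text, captions — into one shared token vocabulary, sample batches uniformly across tasks, and train a single autoregressive Transformer with plain next-token cross-entropy. Everything here (names, sizes, the random-tensor stand-ins for data) is illustrative; Gato's actual tokenization (mu-law encoding for continuous values, image patches, etc.) is more involved.

```python
import random
import torch
import torch.nn as nn

VOCAB, D_MODEL, CTX = 32_000, 512, 256  # toy sizes, far below 1.2B params

class TinyDecoder(nn.Module):
    """A small decoder-only Transformer: one model, one head, all tasks."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):
        # causal mask so each position attends only to the past
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=mask)
        return self.head(h)

def sample_batch(task_streams, batch_size=8):
    """Uniformly mix tasks: each row is a window from a randomly chosen
    task's pre-tokenized stream. No task weighting, no per-task heads."""
    rows = []
    for _ in range(batch_size):
        stream = random.choice(task_streams)            # pick a task at random
        start = random.randrange(len(stream) - CTX - 1)
        rows.append(stream[start : start + CTX + 1])
    return torch.stack(rows)

model = TinyDecoder()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
# stand-ins for tokenized ALE / DMLab / text / caption streams
task_streams = [torch.randint(0, VOCAB, (10_000,)) for _ in range(5)]

for step in range(100):
    batch = sample_batch(task_streams)
    logits = model(batch[:, :-1])                       # predict next token
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the sketch is what's *absent*: no task IDs, no task-specific losses, no gating or MoE routing — just one loss over one mixed stream, which is roughly why "it works fine" is so surprising given the PopArt-era engineering it replaces.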
(Also, if one had any doubts, DM is now fully scale-pilled.)
If anyone was wondering whether DM planned to follow it up in the obvious way, given the obvious implications of its obvious generality and obvious scalability, Hassabis says on the Lex Fridman podcast: "It's just the beginning, really; it's our most general agent, one could call it, so far... that itself can be scaled up massively, more than we've done so far. Obviously we're in the middle of doing that."
From the abstract, emphasis mine:
(Will edit to add more as I read. ETA: 1a3orn posted first.)
The section on broader implications is interesting. Selected quote:
They also do some scaling analysis and, yup, you can make it smarter by making it bigger (sketched below).
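For a hedged illustration of what reading off such a trend looks like: fit a power law, score ≈ a·N^b, by linear regression in log-log space. The parameter counts below are Gato's three model sizes (79M, 364M, 1.18B); the scores are made-up placeholders for illustration, not the paper's numbers.

```python
import numpy as np

params = np.array([79e6, 364e6, 1.18e9])  # Gato's three model sizes
scores = np.array([0.40, 0.52, 0.63])     # hypothetical normalized returns

# fit log(score) = b * log(N) + log(a)
b, log_a = np.polyfit(np.log(params), np.log(scores), 1)
a = np.exp(log_a)
print(f"score ~ {a:.3g} * N^{b:.3f}")

# naive extrapolation to 10x the largest model (purely illustrative)
print(f"at 11.8B params: {a * (11.8e9) ** b:.2f}")
```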
What do I think about all this?
Eh, I guess it was already priced in. I think I, along with most people in the AI safety community, would have predicted this. I'm a bit surprised that it works as well as it does with only 1.2B parameters, though.