PhD Student at Mila, Montreal
The Q-Learning baseline is a model-free control for MuZero: it shares MuZero's implementation details (network architecture, replay ratio, training details, etc.) while removing its model-based components (details in Sec. A.2). Some key differences you'd find vs. a typical Q-learning implementation:
I would guess that the larger networks and higher sample reuse have the biggest effect relative to standard Q-learning implementations.
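To make "sample reuse" concrete, here is a minimal sketch of Q-learning with a replay buffer, where `updates_per_step` controls how many updates are done per environment step. This is not the implementation from the paper: the 5-state chain environment, the tabular Q-function, the uniform-random behaviour policy, and all names and hyperparameters are assumptions chosen purely for illustration.

```python
import random


def q_learning_with_replay(num_env_steps, updates_per_step, seed=0):
    """Tabular Q-learning on a toy 5-state chain (illustrative only).

    updates_per_step is the sample-reuse knob: the number of replayed
    transitions used for Q-updates after every environment step.
    """
    rng = random.Random(seed)
    n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
    gamma, lr = 0.9, 0.5                # hypothetical hyperparameters
    q = [[0.0] * n_actions for _ in range(n_states)]
    buffer = []                         # replay buffer of transitions
    num_updates = 0
    state = 0

    for _ in range(num_env_steps):
        # Q-learning is off-policy, so a uniform-random behaviour policy is fine.
        action = rng.randrange(n_actions)
        if action == 1:
            next_state = min(state + 1, n_states - 1)
        else:
            next_state = max(state - 1, 0)
        done = next_state == n_states - 1   # rightmost state is terminal
        reward = 1.0 if done else 0.0
        buffer.append((state, action, reward, next_state, done))
        state = 0 if done else next_state   # reset the episode at the goal

        # Sample reuse: several updates from replayed data per environment step.
        for _ in range(updates_per_step):
            s, a, r, s2, d = rng.choice(buffer)
            target = r if d else r + gamma * max(q[s2])
            q[s][a] += lr * (target - q[s][a])
            num_updates += 1

    return q, num_updates


if __name__ == "__main__":
    for reuse in (1, 8):
        q_values, n = q_learning_with_replay(num_env_steps=200, updates_per_step=reuse)
        print(f"updates_per_step={reuse}: {n} updates, Q(start, right)={q_values[0][1]:.3f}")
```

With the same number of environment steps, a larger `updates_per_step` extracts more updates from the same data, which is the sense of "higher sample reuse" above.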
The ProcGen competition may also have used the easy difficulty mode, whereas our paper uses the hard difficulty mode.
We do actually train/evaluate on the full distribution (see Figure 5, rightmost panel). The MuZero+SSL versions (especially reconstruction) continue to be a lot more sample-efficient even on the full distribution, and MuZero itself seems to be quite a bit more sample-efficient than PPO/PPG.
Worth noting that they already use BERT in Search. https://blog.google/products/search/search-language-understanding-bert/
Thanks, glad you liked it. I really like the recent RL directions from OpenAI too! It would be interesting to see model-based RL used for the "RL as fine-tuning" paradigm: making large pre-trained models more aligned/goal-directed efficiently by simply searching over a reward function learned from humans.