AI ALIGNMENT FORUM

Ankesh Anand

PhD Student at Mila, Montreal

Comments
EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised
Ankesh Anand · 4y

Thanks, glad you liked it! I really like the recent RL directions from OpenAI too. It would be interesting to see model-based RL used for the "RL as fine-tuning" paradigm: making large pre-trained models more aligned/goal-directed efficiently by simply searching over a reward function learned from humans.
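For instance, the simplest version of that search is best-of-n sampling against the reward model. A rough sketch, where `policy.sample` and `reward_model.score` are assumed interfaces rather than any particular library's API:

```python
def best_of_n(policy, reward_model, prompt, n=16):
    # Sample n candidate continuations from a frozen pre-trained model and
    # return the one the learned reward model scores highest; the policy's
    # weights are never updated, so all the steering comes from the search.
    candidates = [policy.sample(prompt) for _ in range(n)]        # assumed interface
    scores = [reward_model.score(prompt, c) for c in candidates]  # assumed interface
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```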

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised
Ankesh Anand · 4y

The Q-Learning baseline is a model-free counterpart of MuZero. It shares MuZero's implementation details (network architecture, replay ratio, training details, etc.) while removing the model-based components (details in Sec. A.2). Some key differences you'd find vs. a typical Q-learning implementation:

  • Larger network architecture: a 10-block ResNet, compared to a few conv layers in typical implementations.
  • Higher sample reuse: with a reanalyse ratio of 0.95, both MuZero and Q-Learning use each replay buffer sample an average of 20 times (1/(1 − 0.95) = 20). The target network is updated every 100 training steps.
  • Batch size of 1024, plus smaller details like using categorical reward and value predictions, similar to MuZero.
  • We also have a small model-based component that predicts the reward at the next time step, which lets us decompose Q(s,a) into reward and value predictions just like MuZero (a rough sketch of this follows the list).
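A minimal sketch of that decomposition, assuming MuZero-style categorical predictions over a discrete support; the function names, shapes, and discount value here are illustrative, not our actual code:

```python
import torch
import torch.nn.functional as F

def q_value(reward_logits, value_logits, support, gamma=0.997):
    # Q(s, a) = E[r_hat(s, a)] + gamma * E[v_hat(s')], with the reward and
    # value heads each predicting a categorical distribution over `support`
    # (shape [num_atoms]), as in MuZero. Logits: [batch, num_atoms].
    reward = (F.softmax(reward_logits, dim=-1) * support).sum(-1)
    value = (F.softmax(value_logits, dim=-1) * support).sum(-1)
    return reward + gamma * value

# Example with hypothetical sizes: batch of 4, 601-atom support.
support = torch.linspace(-300.0, 300.0, 601)
q = q_value(torch.randn(4, 601), torch.randn(4, 601), support)
```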

I would guess larger networks + higher sample reuse have the biggest effect size compared to standard Q-learning implementations. 

The ProcGen competition also might have used the easy difficulty mode, whereas our paper uses the hard difficulty mode.

EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised
Ankesh Anand · 4y

We do actually train/evaluate on the full distribution (see Figure 5, rightmost). The MuZero+SSL versions (especially reconstruction) continue to be a lot more sample-efficient even on the full distribution, and MuZero itself seems to be quite a bit more sample-efficient than PPO/PPG.

Are we in an AI overhang?
Ankesh Anand · 5y

Worth noting that they already use BERT in Search. https://blog.google/products/search/search-language-understanding-bert/
