This is a linkpost for https://arxiv.org/abs/2210.14215
This model is a proof of concept of a powerful implicit mesa-optimizer, which is evidence toward the claim that "current architectures could easily be inner misaligned".
Notably, the model was trained across multiple episodes so that it could pick up on the RL algorithm's improvement over time.
That said, the usual inner-misalignment story is that the model tries to gain more reward in future episodes by forgoing reward in earlier ones, and I don't think this is evidence for that.
The authors train transformers to imitate the trajectories of reinforcement learning (RL) algorithms. They find that the transformers learn to do in-context RL (that is, the transformers implement an RL algorithm); the authors check this by having the transformers solve new RL tasks. Indeed, the transformers can sometimes do better than the RL algorithms they're trained to imitate.
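To make the setup concrete, here is a minimal, runnable sketch of the algorithm-distillation idea, not the authors' implementation: a small causal transformer is trained to imitate the action stream of a simple source learner (epsilon-greedy on toy bandit tasks), then rolled out on a held-out task where any improvement has to come from in-context computation rather than weight updates. The bandit environment, the epsilon-greedy source algorithm, and all hyperparameters below are illustrative assumptions; the paper uses richer environments and full across-episode histories.

```python
# Minimal sketch of algorithm distillation on toy bandits (assumptions, not the paper's code).
import torch
import torch.nn as nn

N_ARMS, HORIZON = 5, 60            # tiny bandit tasks, short learning histories (assumed)
D_MODEL, N_TASKS, EPOCHS = 64, 200, 5

def source_algorithm_history(probs, horizon=HORIZON, eps=0.3):
    """Run epsilon-greedy on one bandit task; return its (actions, rewards) learning history."""
    counts, values = torch.zeros(N_ARMS), torch.zeros(N_ARMS)
    actions, rewards = [], []
    for _ in range(horizon):
        a = int(torch.randint(N_ARMS, (1,))) if torch.rand(1) < eps else int(values.argmax())
        r = float(torch.rand(1) < probs[a])
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]
        actions.append(a); rewards.append(r)
    return torch.tensor(actions), torch.tensor(rewards)

class ADTransformer(nn.Module):
    """Causal transformer over (action, reward) tokens; predicts the source learner's next action."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(N_ARMS + 1, D_MODEL)        # one-hot action + scalar reward
        self.pos = nn.Parameter(torch.zeros(HORIZON, D_MODEL))
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, N_ARMS)

    def forward(self, actions, rewards):
        x = torch.cat([nn.functional.one_hot(actions, N_ARMS).float(),
                       rewards.unsqueeze(-1)], dim=-1)
        h = self.embed(x) + self.pos[: x.shape[1]]
        t = x.shape[1]
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)  # causal mask
        return self.head(self.encoder(h, mask=mask))       # logits for the next action

model = ADTransformer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
histories = [source_algorithm_history(torch.rand(N_ARMS)) for _ in range(N_TASKS)]
for _ in range(EPOCHS):
    for actions, rewards in histories:
        # Distillation target: the source algorithm's action at step t+1, given history up to t.
        logits = model(actions[None, :-1], rewards[None, :-1])
        loss = nn.functional.cross_entropy(logits[0], actions[1:])
        opt.zero_grad(); loss.backward(); opt.step()

# In-context evaluation on a held-out task: the transformer's own actions and rewards are
# fed back in; if it distilled the learning algorithm, mean reward should rise over the
# history with no weight updates at all.
probs = torch.rand(N_ARMS)
a0 = int(torch.randint(N_ARMS, (1,)))
actions, rewards = [a0], [float(torch.rand(1) < probs[a0])]
with torch.no_grad():
    for _ in range(1, HORIZON):
        logits = model(torch.tensor([actions]), torch.tensor([rewards]))
        a = int(logits[0, -1].argmax())
        actions.append(a); rewards.append(float(torch.rand(1) < probs[a]))
print("mean reward, first vs. second half:",
      sum(rewards[:HORIZON // 2]) / (HORIZON // 2),
      sum(rewards[HORIZON // 2:]) / (HORIZON // 2))
```

In this bandit toy each timestep acts as a one-step episode, so the across-episode history collapses to the raw action/reward stream; the key point carried over from the paper is that the model is only ever trained to predict the source learner's actions, and any improvement on the held-out task is the distilled learning algorithm running in-context.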
This seems like more evidence for the "a generative model can contain agents" point.
Abstract: