All of abhayesian's Comments + Replies

> Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining which is then amortized over all runtimes.

Do you have a reference for this?  I have a hard time believing that this is generally true of anything other than toy models trained on toy tasks.  I think you're referencing this paper, which trains a shallow attention-only transformer where they get rid of the no...
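
For context on the construction being discussed: below is a minimal numerical sketch of the kind of equivalence these in-context-learning-as-gradient-descent papers rely on, assuming the setting is in-context linear regression read out by a softmax-free (linear) attention operation. The dimensions, learning rate, and variable names are illustrative assumptions, not details from the paper, and nothing here is a claim about what large pretrained transformers actually do.

```python
# Sketch (assumptions, not the paper's code): for in-context linear regression,
# the prediction after one gradient-descent step from W = 0 coincides with the
# output of a single linear (softmax-free) attention readout over the context.

import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 5, 32, 0.1                        # input dim, context size, step size (illustrative)

W_true = rng.normal(size=(1, d))             # ground-truth linear map
X = rng.normal(size=(n, d))                  # in-context inputs  x_1..x_n
Y = X @ W_true.T                             # in-context targets y_1..y_n
x_q = rng.normal(size=(d,))                  # query input

# One gradient-descent step on L(W) = 1/2 * sum_i ||W x_i - y_i||^2, starting from W = 0.
W0 = np.zeros((1, d))
grad = (X @ W0.T - Y).T @ X                  # dL/dW evaluated at W = 0
W1 = W0 - lr * grad
pred_gd = (W1 @ x_q).item()

# The same prediction written as linear attention over the context:
# scores are raw dot products x_i . x_q (no softmax), values are the targets y_i.
scores = X @ x_q                             # shape (n,)
pred_attn = lr * (scores @ Y).item()         # lr * sum_i (x_i . x_q) * y_i

print(pred_gd, pred_attn)                    # identical up to floating-point error
assert np.isclose(pred_gd, pred_attn)
```

The identity is just that one GD step from zero gives W_1 = lr * sum_i y_i x_i^T, so the query prediction W_1 x_q is a dot-product-weighted sum of the context targets, which is exactly what a linear attention head can compute; whether softmax attention in trained models approximates this is the point under dispute.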