All of abhayesian's Comments + Replies

> Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining which is then amortized over all runtimes.

Do you have a reference for this?  I have a hard time believing that this is generally true of anything other than toy models trained on toy tasks.  I think you're referencing this paper, which trains a shallow attention-only transformer where they get rid of the no...
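
For context on the construction being discussed: below is a minimal numerical sketch of the kind of equivalence these in-context-learning-as-gradient-descent papers rely on, assuming the setting is in-context linear regression read out by a softmax-free (linear) attention operation. The dimensions, learning rate, and variable names are illustrative assumptions, not details from the paper, and nothing here is a claim about what large pretrained transformers actually do.

```python
# Sketch (assumptions, not the paper's code): for in-context linear regression,
# the prediction after one gradient-descent step from W = 0 coincides with the
# output of a single linear (softmax-free) attention readout over the context.

import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 5, 32, 0.1                        # input dim, context size, step size (illustrative)

W_true = rng.normal(size=(1, d))             # ground-truth linear map
X = rng.normal(size=(n, d))                  # in-context inputs  x_1..x_n
Y = X @ W_true.T                             # in-context targets y_1..y_n
x_q = rng.normal(size=(d,))                  # query input

# One gradient-descent step on L(W) = 1/2 * sum_i ||W x_i - y_i||^2, starting from W = 0.
W0 = np.zeros((1, d))
grad = (X @ W0.T - Y).T @ X                  # dL/dW evaluated at W = 0
W1 = W0 - lr * grad
pred_gd = (W1 @ x_q).item()

# The same prediction written as linear attention over the context:
# scores are raw dot products x_i . x_q (no softmax), values are the targets y_i.
scores = X @ x_q                             # shape (n,)
pred_attn = lr * (scores @ Y).item()         # lr * sum_i (x_i . x_q) * y_i

print(pred_gd, pred_attn)                    # identical up to floating-point error
assert np.isclose(pred_gd, pred_attn)
```

The identity is just that one GD step from zero gives W_1 = lr * sum_i y_i x_i^T, so the query prediction W_1 x_q is a dot-product-weighted sum of the context targets, which is exactly what a linear attention head can compute; whether softmax attention in trained models approximates this is the point under dispute.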