All of deepthoughtlife's Comments + Replies

I had an idea when reading it that I think is pretty interesting. You mention that both models that grok a small amount of data repeated many times and models trained on a great deal of data are highly general. Repeated data during training is also mentioned as a significant negative for large models. These are very much in tension.

My idea is this. Split the training data into two parts, one vastly larger than the other. First, train a model on a small amount of data many times in a way designed to make it grok the task, such as with weight decay. Second, train...

Neel Nanda
To be clear, the paper I cite on data quality focuses on how repeated data is bad for generalisation. From the model's perspective, the only thing it cares about is train loss (and maybe simplicity), and repeated data is great for train loss! The model doesn't care whether it generalises, only whether generalisation is a "more efficient" solution. Grokking happens when the amount of data is such that the model marginally prefers the correct solution, but there's no reason to expect that the amount of repeated data that screws over models is exactly the amount at which the correct solution is better. Though the fact that larger models are messed up by fewer repeated data points is fascinating - I don't know if this is a problem with my hypothesis, or just a statement about the relative complexity of different circuits in larger vs smaller models.

Your experiment idea is interesting, I'm not sure what I'd expect to happen! I'd love to see someone try it, and am not aware of anyone who has (the paper I cite is vaguely similar - there they train the model on the repeated data and unrepeated data shuffled together, and compare it to a model trained on just the unrepeated data). Though I do think that if this is a real task there wouldn't be a single amount of data that leads to general grokking; rather, the amount of data needed to grok varies heavily between different circuits.
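For anyone wanting to try this, here is a rough sketch of just the data side, covering only what is stated in this thread: the small/large split, the small part repeated many times, and the shuffled-together mixture used as a comparison. All function names and sizes are made up for illustration, and nothing here is taken from the cited paper's code. The model, the weight decay setting, and the second training phase are deliberately left out.

```python
import random


def split_small_large(examples, n_small, seed=0):
    """Shuffle the data and split it into a small part and a much larger part."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[:n_small], shuffled[n_small:]


def repeated_stream(small, repeats, seed=0):
    """The small part seen `repeats` times, reshuffled on every pass."""
    rng = random.Random(seed)
    stream = []
    for _ in range(repeats):
        epoch = list(small)
        rng.shuffle(epoch)
        stream.extend(epoch)
    return stream


def mixed_stream(small, large, repeats, seed=0):
    """Comparison setup: repeated and unrepeated data shuffled together."""
    rng = random.Random(seed)
    stream = list(small) * repeats + list(large)
    rng.shuffle(stream)
    return stream


# Hypothetical sizes, chosen only to make the sketch concrete.
examples = list(range(1000))
small, large = split_small_large(examples, n_small=10)
phase_one = repeated_stream(small, repeats=50)
mixed = mixed_stream(small, large, repeats=50)
unrepeated_only = list(large)  # baseline: just the unrepeated data
```

The three streams can then be fed to the same training loop, so the only thing that differs between runs is how the repeated data is presented.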