Forget OOD for a minute; ERM can't even learn to avoid spurious correlations that have counterexamples in the training data. Datasets like Waterbirds (used in that previously linked paper) are good toy datasets for figuring this out. I think we need to solve this problem first before trying to figure out OOD generalization.
I expect that these kinds of problems could mostly be solved by scaling up data and compute (although I haven't read the paper). However, the argument in the post is that even if we did scale up, we couldn't solve the OOD generalisation problems.
Forget OOD for a minute; ERM can't even learn to avoid spurious correlations that have counterexamples in the training data. Datasets like Waterbirds (used in that previously linked paper) are good toy datasets for figuring this out. I think we need to solve this problem first before trying to figure out OOD generalization.