Review
This is a linkpost for https://arxiv.org/abs/2305.11206
Abstract
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases;
I'm not sure how well this metric tracks what people actually care about: performance on particular downstream tasks (e.g. passing a law exam, writing bug-free code, automating alignment research, etc.).
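For concreteness, here is roughly how an "equivalent or strictly preferred" rate is computed from pairwise human judgments. This is a minimal sketch with made-up labels, not the paper's evaluation code; one thing it makes obvious is that ties count toward the headline number:

```python
from collections import Counter

# Hypothetical pairwise judgments on shared prompts: "lima" or "gpt4"
# marks a strict preference, "tie" marks equivalent responses.
judgments = ["lima", "tie", "gpt4", "gpt4", "lima", "tie", "gpt4"]

counts = Counter(judgments)
total = len(judgments)

# "Equivalent or strictly preferred" counts ties in LIMA's favor, so the
# headline rate can exceed the strict win rate by a wide margin.
rate = (counts["lima"] + counts["tie"]) / total
print(f"equivalent-or-preferred: {rate:.0%} (strict wins: {counts['lima'] / total:.0%})")
```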
Implications
Data Quality & Capabilities
Along with TinyStories and QLoRA, this paper makes me increasingly convinced that data quality is all you need. That definitely seems to be the case for finetuning, and it may be the case for base-model training as well. Better scaling laws through a higher-quality corpus?
Also, for those who haven't updated: it seems very likely that GPT-4 equivalents will be essentially free to self-host and tune within a year. Plan for this!
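If you want to plan concretely, the LIMA recipe itself is just plain supervised finetuning of an open base model on a small curated prompt/response set (the paper uses ~1,000 examples). Below is a minimal sketch using the HuggingFace transformers Trainer; the model name, dataset path, and most hyperparameters are placeholders, and a real run on consumer hardware would add parameter-efficient tricks like QLoRA:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "huggyllama/llama-7b"  # placeholder base model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Small, carefully curated prompt/response pairs (LIMA uses ~1,000).
# "curated.jsonl" is a placeholder: {"prompt": ..., "response": ...} lines.
data = load_dataset("json", data_files="curated.jsonl")["train"]

def tokenize(ex):
    text = ex["prompt"] + "\n" + ex["response"] + tokenizer.eos_token
    out = tokenizer(text, truncation=True, max_length=2048)
    out["labels"] = out["input_ids"].copy()  # standard causal-LM loss
    return out

data = data.map(tokenize, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lima-style-ft",
        num_train_epochs=15,            # the paper trains for 15 epochs
        save_strategy="epoch",          # keep every checkpoint; see below
        per_device_train_batch_size=1,  # illustrative, not the paper's setup
        learning_rate=1e-5,
    ),
    train_dataset=data,
)
trainer.train()
```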
Perplexity != Quality
The paper observes that held-out perplexity keeps rising over the course of finetuning even as generation quality improves. Because of this, the authors manually select checkpoints between the 5th and 10th epochs (out of 15) using the held-out 50-example development set.
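A sketch of what that selection step could look like in code: generate from each saved checkpoint on the dev prompts and rank by generation quality rather than perplexity. The paths follow the training sketch above, and the scoring function is an explicit stand-in (the paper's selection is manual):

```python
import glob
from transformers import pipeline

# Held-out development prompts (the paper uses 50; two shown for brevity).
dev_prompts = [
    "Explain the difference between supervised finetuning and RLHF.",
    "Write a short apology email for missing a meeting.",
]

def quality_score(response: str) -> float:
    # Stand-in heuristic so the loop runs end to end. In practice you
    # would plug in human ratings or an LLM judge here.
    return min(len(response.split()), 200) / 200.0

best_score, best_ckpt = float("-inf"), None
for ckpt in sorted(glob.glob("lima-style-ft/checkpoint-*")):
    # Load the base tokenizer explicitly in case the checkpoint lacks one.
    gen = pipeline("text-generation", model=ckpt, tokenizer="huggyllama/llama-7b")
    score = 0.0
    for p in dev_prompts:
        out = gen(p, max_new_tokens=256, do_sample=False)[0]["generated_text"]
        score += quality_score(out[len(p):])  # score only the continuation
    score /= len(dev_prompts)
    # Deliberately NOT selecting on dev-set perplexity: the point of this
    # section is that lower perplexity does not imply better generations.
    if score > best_score:
        best_score, best_ckpt = score, ckpt

print("selected checkpoint:", best_ckpt)
```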