Apparently this post has been nominated for the review. I assume this is implicitly a nomination for the book rather than for my summary of it. If so, I think the post itself serves as a review of the book, and I continue to stand by the claims within.
_The Alignment Problem: Machine Learning and Human Values_, by Brian Christian, was just released. This is an extended summary + opinion; a version without the quotes from the book will go out in the next Alignment Newsletter.
Summary:
This book starts off with an explanation of machine learning and problems that we can currently see with it, including detailed stories and analysis of:
- The gorilla misclassification incident
- The faulty reward in CoastRunners
- The gender bias in language models
- The failure of facial recognition models on minorities
- The COMPAS controversy (leading up to impossibility results in fairness; a numeric sketch of one such result follows this list)
- The neural net that thought asthma reduced the risk of pneumonia
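To make the COMPAS discussion concrete, here is a minimal numeric sketch (mine, not the book's) of one impossibility result in fairness, in the spirit of Chouldechova (2017): if a risk score has the same predictive value and false-negative rate for two groups whose base rates differ, its false-positive rates for those groups must differ. The prevalences and accuracy numbers below are made up purely for illustration.

```python
def false_positive_rate(prevalence, ppv, fnr):
    """Implied FPR when PPV and FNR are held fixed at a given prevalence.

    Follows from the definitions of PPV, FNR, and FPR on a confusion matrix:
    FPR = prevalence / (1 - prevalence) * (1 - PPV) / PPV * (1 - FNR)
    """
    return prevalence / (1 - prevalence) * (1 - ppv) / ppv * (1 - fnr)

# Same predictive value (PPV) and miss rate (FNR) for both groups, but
# different (hypothetical) base rates of the predicted outcome:
for group, prevalence in [("group A", 0.3), ("group B", 0.5)]:
    print(group, round(false_positive_rate(prevalence, ppv=0.7, fnr=0.2), 3))
# group A 0.147
# group B 0.343
# The implied FPRs differ, so all three quantities cannot be equalized at once.
```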
It then moves on to agency and reinforcement learning, covering, from a more historical and academic perspective, how we arrived at ideas such as temporal difference learning, reward shaping, curriculum design, and curiosity, across the fields of machine learning, behavioral psychology, and neuroscience. While the connections aren't always explicit, a knowledgeable reader can connect the academic examples given in these chapters to the ideas of specification gaming and mesa optimization that we talk about frequently in this newsletter. Chapter 5 especially highlights that agent design is not just a matter of specifying a reward: often, the reward alone will do ~nothing, and the main requirement for getting a competent agent is to provide good shaping rewards or a good curriculum (a minimal code sketch of reward shaping follows the list below). Just as in the previous part, Brian traces the intellectual history of these ideas, providing detailed stories of (for example):
- BF Skinner's experiments in training pigeons
- The invention of the perceptron
- The success of TD-Gammon, and later AlphaGo Zero
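As a concrete illustration of the point about shaping rewards, here is a minimal sketch (mine, not the book's) of potential-based reward shaping in the style of Ng et al. (1999): the agent gets a bonus for making progress toward the goal, and because the bonus is a difference of potentials, it does not change which policies are optimal. The toy grid-world and potential function are assumptions made purely for illustration.

```python
# Hypothetical 1-D grid-world: states are the integers 0..9, with the goal at 9.
GOAL = 9
GAMMA = 0.99  # discount factor

def potential(state):
    """Potential Phi(s): negative distance to the goal (any Phi is admissible)."""
    return -(GOAL - state)

def shaped_reward(state, next_state, env_reward):
    """Potential-based shaping: r'(s, s') = r(s, s') + gamma * Phi(s') - Phi(s)."""
    return env_reward + GAMMA * potential(next_state) - potential(state)

# The environment reward is 0 everywhere except the goal, so an unshaped agent
# gets no learning signal; the shaping term rewards every step of progress.
print(shaped_reward(state=3, next_state=4, env_reward=0.0))  # ~ +1.05 (toward goal)
print(shaped_reward(state=4, next_state=3, env_reward=0.0))  # ~ -0.94 (away from goal)
```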
The final part, titled "Normativity", delves much more deeply into the alignment problem. While the previous two parts are partially organized around AI capabilities -- how to get AI systems that optimize for their objectives -- this last one tackles head-on the problem that we want AI systems that optimize for our (often-unknown) objectives, covering such topics as imitation learning, inverse reinforcement learning, learning from preferences, iterated amplification, impact regularization, calibrated uncertainty estimates, and moral uncertainty.
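For a mechanical sense of what "learning from preferences" involves, here is a minimal sketch (mine, not the book's) of fitting a linear reward model to pairwise trajectory comparisons under a Bradley-Terry / logistic model, as in preference-based RL. The features, the simulated "human", and all constants are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def featurize(trajectory):
    """Sum the per-state feature vectors of a trajectory into one vector."""
    return np.sum(trajectory, axis=0)

def preference_grad(traj_a, traj_b, prefer_a, w):
    """Gradient of the cross-entropy loss under the Bradley-Terry model
    P(a preferred to b) = sigmoid(r(a) - r(b)), with linear reward r = w . f."""
    f_a, f_b = featurize(traj_a), featurize(traj_b)
    p_a = 1.0 / (1.0 + np.exp(-(w @ (f_a - f_b))))
    return (p_a - prefer_a) * (f_a - f_b)

w = np.zeros(4)  # reward weights to be learned

# Toy data: the simulated "human" prefers whichever trajectory accumulates more
# of the first feature -- the hidden objective the reward model should recover.
for _ in range(500):
    traj_a = [rng.normal(size=4) for _ in range(5)]
    traj_b = [rng.normal(size=4) for _ in range(5)]
    prefer_a = float(featurize(traj_a)[0] > featurize(traj_b)[0])
    w -= 0.1 * preference_grad(traj_a, traj_b, prefer_a, w)

print(w)  # the first component should dominate the learned reward
```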
Opinion:
I really enjoyed this book, primarily because of the tracing of the intellectual history of various ideas. While I knew of most of these ideas, and often also who initially came up with them, it's much more engaging to read the detailed stories of _how_ each person came to develop the idea; Brian's book delivers this again and again, functioning like a well-organized literature survey that is also fun to read because of its great storytelling. I struggled a fair amount in writing this summary, because I kept wanting to somehow communicate the writing style; in the end I decided not to try, and instead to give a few example passages from the book in this post.
Passages:
Note: Quotations of this length from the book are not generally permitted; I have specifically gotten permission to include them.
Here’s an example of agents with evolved inner reward functions, which lead to the inner alignment problems we’ve previously worried about:
Maybe everyone but me already knows this, but here’s one of the best examples I’ve seen about the benefits of transparency:
Finally, on the importance of reward shaping: