In this week's AI alignment newsletter was a link to a paper titled "Safe Policy Learning from Observations". The paper uses the word "safe" or "safety" to describe ML systems 34 times, but reading through I didn't get a very clear sense of what the authors mean by "safe" or why "safe" was an appropriate word to describe the property they care about. The main clues came in these lines of the introduction:
Safe policy learning is an ambiguous term [García and Fernández, 2015]. Safety can be an absolute measure, e.g. worst-case criterion [Tamar et al., 2013], or relative to a baseline policy [Ghavamzadeh et al., 2016].
From the first referenced paper, "A Comprehensive Survey on Safe Reinforcement Learning", it becomes clear that here "safety" means the kind of safety people care about at factories, in offices, on streets, and in the air: avoiding doing things that lead to injury or death. The safety concerns raised are all cases of what we around these parts call Goodharting, but generally in the context of AI being used in places where over-optimization on the wrong thing could lead to somebody getting hurt. There is not, however, any reference to concerns about existential risks.
This reminds me of something I often forget: many more people are working on AI safety in the context of prosaic safety than in the context of existential safety. Nonetheless these folks may do work that is of value to those concerned about existential risk, even if only incidentally. For myself, their work seems not that relevant since I'm not directly engaged with machine learning, but I can imagine how it may be for others.
However, for whatever work on commonplace safety may contribute towards existential AI safety, I can't help but wonder how much it really helps. Yes, prosaic safety lends respectability to AI safety as a whole, so that concern about existential risks from AI seems less unusual, but how much does it occlude work on existential safety or make it appear that more is being done than really is? I have no real answers to this, and I'm sure others have already considered the idea in more detail (although I can only recall discussing it in person and so have nothing to link), but realizing that I live in a bubble of concern about x-risks from AI surprised me enough that it seemed worth saying something, in case others are caught in similar bubbles.