I've thought about (concepts related to) the fragility of value quite a bit over the last year, and so I returned to Katja Grace's But exactly how complex and fragile? with renewed appreciation (I'd previously commented only a very brief microcosm of this review). I'm glad that Katja wrote this post and I'm glad that everyone commented. I often see private Google docs full of nuanced discussion which will never see the light of day, and that makes me sad, and I'm happy that people discussed this publicly.

I'll split this review into two parts, since the nominations called for review of both the post and the comments:
I think this post should be reviewed for its excellent comment section at least as much as for the original post, and also think that this post is a pretty central example of the kind of post I would like to see more of.
~ habryka
Summary
I think this was a good post. I think Katja shared an interesting perspective with valuable insights and that she was correct in highlighting a confused debate in the community.
That said, I think the post and the discussion are reasonably confused. The post sparked valuable lower-level discussion of AI risk, but I don't think that the discussion clarified AI risk models in a meaningful way.
The problem is that people are debating "is value fragile?" without realizing that value fragility is a sensitivity measure: given some initial state and some dynamics, how sensitive is the human-desirability of the final outcomes to certain kinds of perturbations of the initial state?
Left unremarked by Katja and the commenters, value fragility isn't intrinsically about AI alignment. What matters most is the extent to which the future is controlled by systems whose purposes are sufficiently entangled with human values. This question reaches beyond just AI alignment.
They also seem to be debating an under-specified proposition. Different perturbation sets and different dynamics will exhibit different fragility properties, even though we're measuring with respect to human value in all cases. For example, perturbing the training of an RL agent learning a representation of human value is different from perturbing the utility function of an expected utility maximizer.
Setting loose a superintelligent expected utility maximizer is different from setting loose a mild optimizer (e.g. a quantilizer), even if they're both optimizing the same flawed representation of human value; the dynamics differ. As another illustration of how dynamics matter for value fragility, imagine if recommender systems had been deployed within a society which already adequately managed the impact of ML systems on its populace. In that world, we might have ceded less of our agency and attention to social media; we would therefore have firmer control over the future, and value would be less fragile with respect to the training process of these recommender systems.
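To make "the dynamics differ" concrete, here is a minimal toy sketch in Python. Everything in it (the numbers, the single mis-scored outcome, the uniform base distribution) is my own illustrative assumption, not something from the post or its comments: both agents optimize the same slightly-wrong proxy, but the hard maximizer reliably lands on the one catastrophically mis-scored outcome, while the quantilizer usually does not.

import random

random.seed(0)

# Toy model, entirely my own construction: outcomes are numbered options.
# The true utility is smooth, and the proxy agrees with it everywhere except
# one "Goodharted" outcome that a small specification error makes the proxy
# overrate, while in truth it is catastrophic.
N = 1000
true_utility = {x: x / N for x in range(N)}
proxy_utility = dict(true_utility)
proxy_utility[137] = 2.0    # the perturbation: one outcome the proxy overrates
true_utility[137] = -10.0   # ...which is actually very bad

def maximize(utility):
    """Hard optimizer: deterministically pick the proxy-best outcome."""
    return max(utility, key=utility.get)

def quantilize(utility, q=0.05):
    """Mild optimizer: sample uniformly from the top q fraction of outcomes
    (a quantilizer with a uniform base distribution)."""
    ranked = sorted(utility, key=utility.get, reverse=True)
    return random.choice(ranked[: max(1, int(q * len(ranked)))])

# Identical perturbation of the proxy, different dynamics, different outcome.
print("maximizer's true value:  ", true_utility[maximize(proxy_utility)])
print("quantilizer's avg value: ",
      sum(true_utility[quantilize(proxy_utility)] for _ in range(1000)) / 1000)

The perturbation is held fixed here; only the dynamics change, and with them the verdict on how fragile value is.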
The Post
But exactly how complex and fragile? and its comments debate whether "value is fragile." I think this is a bad framing because it hides background assumptions about the dynamics of the system being considered. This section motivates a more literal interpretation of the value fragility thesis, demonstrating its coherence and its ability to meaningfully decompose AI alignment disagreements. The next section will use this interpretation to reveal how the comments largely failed to explore key modelling assumptions. This, I claim, helped prevent discussion from addressing the cruxes of disagreements.
The post and discussion both seem to slip past (what I view as) the heart of 'value fragility', and it seems like many people are secretly arguing for and against different propositions. Katja says:
it is hard to write down what kind of future we want, and if we get it even a little bit wrong, most futures that fit our description will be worthless.
But this leaves hidden a key step:
it is hard to write down the future we want, feed the utility function punchcard into the utility maximizer and then press 'play', and if we get it even a little bit wrong, most futures that fit our description will be worthless.
Here is the original 'value is fragile' claim:
Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth.
Eliezer claims that if the future is not shaped by such a goal system, it will contain little of worth. He does not explicitly claim, in that original essay, that we have to/will probably build an X-maximizer AGI, where X is an extremely good (or perfect) formalization of human values (whatever that would mean!). Nor does he explicitly claim that we will mold a mind from shape Y and that that probably goes wrong. He's talking about goal systems charting a course through the future, and about how sensitive the outcomes are to that process.
Let's ground this out. Imagine you're acting, but you aren't quite sure what is right. For a trivial example, you can eat bananas or apples at any given moment, but you aren't sure which is better. There are a few strategies you could follow: preserve attainable utility for lots of different goals (preserve the fruits as best you can); retain option value where your normative uncertainty lies (don't toss out all the bananas or all of the apples); etc.
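As a crude illustration of the first strategy in Python (hypothetical numbers, with a simple maximin rule standing in for "preserve attainable utility"; none of this is from the post):

# Toy sketch, my own construction: score each action by how much utility would
# remain attainable afterwards under each candidate goal, then hedge by
# keeping the worst-case attainable utility high.

attainable = {                       # action -> attainable utility per goal
    "toss_all_apples":  {"bananas_are_better": 10, "apples_are_better": 0},
    "toss_all_bananas": {"bananas_are_better": 0,  "apples_are_better": 10},
    "keep_both":        {"bananas_are_better": 9,  "apples_are_better": 9},
}

hedged_choice = max(attainable, key=lambda a: min(attainable[a].values()))
print(hedged_choice)  # -> keep_both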
But what if you have to commit to an object-level policy now, a way-of-steering-the-future now, without being able to reflect more on your values? What kind of guarantees can you get?
In Markov decision processes, if you're maximally uncertain, you can't guarantee you won't lose at least half of the value you could have achieved for the unknown true goal (I recently proved this for an upcoming paper). Relatedly, perfectly optimizing an ϵ-incorrect reward function only bounds regret to 2ϵ per time step (see also Goodhart's Curse). The main point is that you can't pursue every goal at once. It doesn't matter whether you use reinforcement learning to train a policy, or whether you act randomly, or whether you ask Mechanical Turk volunteers what you should do in each situation. Whenever your choices mean anything at all, no sequence of actions can optimize all goals at the same time.
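(For readers wondering where the 2ϵ figure comes from, here is the standard argument as I would reconstruct it; the sup-norm assumption on the reward error is mine, not a quote from any cited source. Suppose the proxy reward R' satisfies |R(s,a) − R'(s,a)| ≤ ϵ at every state-action pair. Then any fixed policy's value under R and under R' differs by at most ϵ per time step, i.e. by at most ϵ/(1−γ) in total with discount γ. Writing π* for the R-optimal policy and π' for the R'-optimal policy:

V(π*; R) − V(π'; R) = [V(π*; R) − V(π*; R')] + [V(π*; R') − V(π'; R')] + [V(π'; R') − V(π'; R)] ≤ ϵ/(1−γ) + 0 + ϵ/(1−γ),

since the middle bracket is at most zero, π' being optimal for R'. So perfectly optimizing an ϵ-incorrect reward costs at most 2ϵ per time step relative to the true optimum.)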
So there has to be something which differentially pushes the future towards "good" things and away from "bad" things. That something could be 'humanity', or 'aligned AGI', or 'augmented humans wielding tool AIs', or 'magically benevolent aliens' - whatever. But it has to be something, some 'goal system' (as Eliezer put it), and it has to be entangled with the thing we want it to optimize for (human morals and metamorals). Otherwise, there's no reason to think that the universe weaves a "good" trajectory through time.
Hence, one might then conclude
Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will not be optimized for human morals and metamorals.
But how do we get from "will not be optimized for" to "will contain almost nothing of worth"? There are probably a few ways of arguing this; the simplest may be:
our universe has 'resources'; making the universe decently OK-by-human-standards requires resources which can be used for many other purposes; most purposes are best accomplished by not using resources in this way.
This is not an argument that we will deploy utility maximizers with a misspecified utility function, and that that will be how our fragile value is shattered and our universe is extinguished. The thesis holds merely that
Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth.
As Katja notes, this argument is secretly about how the "forces of optimization" shape the future, and not necessarily about AIs or anything. The key point is to understand how the future is shaped, and then discuss how different kinds of AI systems might shape that future.

Concretely, I can claim 'value is fragile' and then say 'for example, if we deployed a utility-maximizer in our society but we forgot to have it optimize for variety, people might loop a single desirable experience forever.' But on its own, the value fragility claim doesn't center on AI.
[Human] values do not emerge in all possible minds. They will not appear from nowhere to rebuke and revoke the utility function of an expected paperclip maximizer.
Touch too hard in the wrong dimension, and the physical representation of those values will shatter - and not come back, for there will be nothing left to want to bring it back.
And the referent of those values - a worthwhile universe - would no longer have any physical reason to come into being.
Let go of the steering wheel, and the Future crashes.
Katja (correctly) implies that concluding that AI alignment is difficult requires extra arguments beyond value fragility:
... But if [the AI] doesn’t abruptly take over the world, and merely becomes a large part of the world’s systems, with ongoing ability for us to modify it and modify its roles in things and make new AI systems, then the question seems to be how forcefully the non-alignment is pushing us away from good futures relative to how forcefully we can correct this. And in the longer run, how well we can correct it in a deep way before AI does come to be in control of most decisions. So something like the speed of correction vs. the speed of AI influence growing.
As I see it, Katja and the commenters mostly discuss their conclusions about how AI+humanity might steer the future (how hard it will be to achieve the requisite entanglement with human values), instead of debating the truth value of the 'value fragility' claim which Eliezer made. They discuss points which are relevant to AI alignment, but which are distinct from the value fragility claim. No one remarks that this claim has a truth value independent of how we go about AI alignment, or of how hard it is for AI to further our values.
Value fragility quantifies the robustness of outcome value to perturbation of the "motivations" of key actors within a system, given certain dynamics. This may become clearer as we examine the comments. This insight allows us to decompose debates about "value fragility" into questions like:
In what ways is human value fragile, given a fixed optimization scheme?
In other words: given fixed dynamics, to what classes of perturbations is outcome value fragile?
What kinds of multi-agent systems tend to veer towards goodness and beauty and value?
In other words: given a fixed set of perturbations, what kinds of dynamics are unusually robust against these perturbations?
What kinds of systems will humanity end up building if we take no further action? This question brings in our beliefs about how the alignment pressures we are likely to apply will interact with value fragility.
I think this is much more enlightening than debating
VALUE_FRAGILE_TO_AI == True?
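A minimal sketch of that decomposition in code, under assumptions of my own (every name and number here is hypothetical, not from the post):

# Value fragility as a sensitivity measure rather than a boolean (my framing).
# `dynamics` maps a (possibly perturbed) initial state to a final outcome, and
# `human_value` scores outcomes by human desirability.

def fragility(dynamics, perturbations, initial_state, human_value):
    """Worst-case drop in outcome value over a class of perturbations."""
    baseline = human_value(dynamics(initial_state))
    return max(baseline - human_value(dynamics(perturb(initial_state)))
               for perturb in perturbations)

# Trivial usage with hypothetical stand-ins: the state is a number, the
# dynamics square it, value prefers small outcomes, and perturbations nudge
# the initial state a little.
nudges = [lambda s, d=d: s + d for d in (-1, 0, 1)]
print(fragility(lambda s: s * s, nudges, 3, lambda o: -o))  # -> 7

# Question 1 above fixes `dynamics` and asks which `perturbations` make this
# number large; question 2 fixes `perturbations` and asks which `dynamics`
# keep it small. "VALUE_FRAGILE_TO_AI == True?" collapses both arguments into
# a single bit.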
The Comments
If no such decomposition takes place, I think debate is just too hard and opaque and messy, and I think some of this messiness spilled over into the comments. Locally, each comment is well thought-out, but it seems (to me) that cruxes were largely left untackled.
To concretely point out something I consider somewhat confused, johnswentworth authored the top-rated comment:
I think [Katja's summary] is an oversimplification of the fragility argument, which people tend to use in discussion because there's some nontrivial conceptual distance on the way to a more rigorous fragility argument.
The main conceptual gap is the idea that "distance" is not a pre-defined concept. Two points which are close together in human-concept-space may be far apart in a neural network's learned representation space or in an AGI's world-representation-space. It may be that value is not very fragile in human-concept-space; points close together in human-concept-space may usually have similar value. But that will definitely not be true in all possible representations of the world, and we don't know how to reliably formalize/automate human-concept-space.
The key point is not "if there is any distance between your description and what is truly good, you will lose everything", but rather, "we don't even know what the relevant distance metric is or how to formalize it". And it is definitely the case, at least, that many mathematically simple distance metrics do display value fragility.
This is a good point. But what exactly happens between "we write down something too distant from the 'truth'" and the result? The AI happens. But this part, the dynamics, is kept invisible.
So if you think that there will be fast takeoff via utility maximizers (a la AIXI), you might say "yes, value is fragile", but if I think it'll be more like slow CAIS with semi-aligned incentives making sure nothing goes too wrong, I reply "value isn't fragile." Even if we agree on a distance metric! This is how people talk past each other.
Crucially, you have to realize that you can hold the value fragility considerations - how vulnerable the outcomes are to the aforementioned perturbations - separate from your parameter values for e.g. AI timelines.
Many other comments seem off the mark in a similar way. That said, I think that Steve Byrnes left an underrated comment:
Corrigibility is another reason to think that the fragility argument is not an impossibility proof: If we can make an agent that sufficiently understands and respects the human desire for autonomy and control, then it would presumably ask for permission before doing anything crazy and irreversible, so we would presumably be able to course-correct later on (even with fast/hard takeoff).
The reason that corrigibility-like properties are so nice is that they let us continue to steer the future through the AI itself; its power becomes ours, and so we remain the "goal system with detailed reliable inheritance from human morals and metamorals" shaping the future.
Conclusion
The problem is that people are debating "is value fragile?" without realizing that value fragility is a sensitivity measure: given some initial state and some dynamics, how sensitive is the human-desirability of the final outcomes to certain kinds of perturbations of the initial state?
Left unremarked by Katja and the commenters, value fragility isn't intrinsically about AI alignment. What matters most is the extent to which the future is controlled by systems whose purposes are sufficiently entangled with human values. This question reaches beyond just AI alignment.
I'm glad Katja said "Hey, I'm not convinced by this key argument", but I don't think it makes sense to include But exactly how complex and fragile? in the review.
Thanks to Rohin Shah for feedback on this review. Further discussion is available in the children of this comment.