TL;DR In Parts 1 through 5 I discussed how to choose an ethical system, and implications for societies containing biological and/or uploaded humans and aligned AIs, and perhaps even other sapient species, but not all sentient animals. So far I've just been assuming that we have (somehow) built aligned AI. Now I'd like to look at how all of this relates to the challenge of achieving that vital goal: how we might align superintelligent AI, specifically using approaches along the lines of value learning, AI-assisted alignment, or Coherent Extrapolated Volition (CEV) — or indeed any similar "do what I mean" kind of approach. The mutability of human values poses a major challenge to all of these: "Do what I want" or "do what I mean" is a lot less well-defined once ASI is in a position to affect that directly, rather than just by misinterpretation. Below I outline and critique a number of possible solutions: this is challenging, since when setting the terminal goal for ASI there is a strong tension between controlling the outcome and allowing our descendants the freedom to control their own destiny. Without a solution to strong corrigibility, we can only set the terminal goal once, ever, which privileges the views of whatever generation gets to do this. The possibilities opened up by genetic engineering and cyborging make the mutable values problem far more acute, and I explore a couple of toy examples from the ethical conundrums of trying to engineer away psychopathy and war. Finally I suggest a tentative proposal for a compromise solution for mutable values, which builds upon the topics discussed in the previous parts of the sequence.
In what follows I'm primarily going to discuss a future society that is aligning its Artificial Superhuman Intelligences (ASIs) either using value learning, or some nearly-functionally-equivalent form of AI-assisted alignment, such that the bulk of the work of figuring out the fine details of complex and fragile human values and then building a usable normative model of them is being done by superhuman AIs. CEV isn't quite the same proposal as value learning, though the two are related, and they face somewhat similar challenges over mutability of values, as do other similar "do what I mean" solutions that have also been proposed. I personally believe value learning is the best outer-alignment proposal for ASIs we have found so far. However, there is a fundamental issue inherent in it and in all these related approaches. Human values, like pretty-much everything else about humans, are produced by a combination of nature and nurture: they are the product of complex interactions between the human genetic code and the process of humans growing up and living in a particular environment and society. The latter includes the effects of a lot of influences such as parenting, education, philosophy, law enforcement, the media, preaching, self-reflection, fashion, and a great many other complex cultural processes.
For example, every healthy human has an evolved hunger mechanism, and (if adult) generally also a sex drive. But human values around food, sex, and everything else that humans care about are also significantly affected by the environment and culture they grow up in. (If this isn't self-evident to you, then consider the differences between typical current French and American attitudes to food, or indeed to sex.)
Currently, in any world-wide summation or averaging process over human values, the effects of the mutability of human values average out to a significant extent: the world contains people who grew up in many very different cultures, environments, belief systems, and religions. (Indeed, many of our political disagreements seem to have a lot to do with these differences of background.) However, there has been a steady movement towards globalism as communications improve, and in a future high-tech society this homogenization seems likely to continue. Especially if the ASIs are deliberately trying to modify human values.
A value learning ASI has a terminal goal along the lines of "Optimize the state of the universe according to human values (whatever those are)". It thus quickly develops an instrumental goal of finding out more about what human values are, so that it can optimize for them better. Assuming that you preloaded it with enough somewhat-correct prior information and evidence about what a human is, what the phrase "human values" might mean, and what humans care about so as to give it a good place to start from, hopefully from there it will research this topic in detail and approximately-Bayesian-update itself to a more and more detailed understanding of it. At least until, of course, it figures out that human values, like pretty-much everything else about humans, are produced by a combination of nature and nurture. So, they're mutable, within certain limits. If one's goal is to align oneself with something mutable, something that one inherently affects, then one has some remaining undetermined range of freedom — so (once again) we have an Overton Window.
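As a toy illustration of the approximately-Bayesian updating I just described (a sketch, not a proposal: the candidate value-models, likelihood numbers, and observations below are all invented placeholders), note that the learner's terminal goal stays fixed while only its posterior over what "human values" are gets refined:

```python
# Toy sketch of a value-learning update loop (illustrative only).
# The agent never changes its terminal goal ("optimize for human values,
# whatever those are"); it only refines its posterior over what those values are.

candidate_models = {
    # hypothetical, crudely simplified "models of human values":
    # each maps an observed human choice to the probability of that choice.
    "comfort-seeking": {"cake": 0.7, "salad": 0.2, "gym": 0.1},
    "health-seeking":  {"cake": 0.1, "salad": 0.5, "gym": 0.4},
    "status-seeking":  {"cake": 0.2, "salad": 0.3, "gym": 0.5},
}

# The preloaded "somewhat-correct prior information" mentioned above.
posterior = {name: 1.0 / len(candidate_models) for name in candidate_models}

def update(posterior, observation):
    """One approximately-Bayesian update on an observed human choice."""
    new = {name: p * candidate_models[name][observation]
           for name, p in posterior.items()}
    total = sum(new.values())
    return {name: p / total for name, p in new.items()}

for obs in ["salad", "gym", "salad", "cake"]:   # invented observations
    posterior = update(posterior, obs)

print(posterior)  # the learner's refined belief about which value-model fits
```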
Exactly how wide the limits of the mutability of human values are, so how much "wiggle room" our ASIs have on what they decide to do, depends on how much influence they have on the humans, how much effort and ingenuity they put in to affecting them, what means, techniques and technologies they have access to, and likely also how much psychological/emotional damage they're willing to do during the process. For example, many religions attempt to modify humans to make them more aligned with each other in various ways: more law-abiding, more honest, less likely to sleep with their neighbor's spouse, and so forth, typically by modifying people's belief structures and/or attempting to instill things such as an increased sense of guilt, shame, or fear around these actions. A few religious variants have even been so successful at modifying humans' sex drive that they pretty-much died out. [Sociologists disagree on how successful this attempt by most religions to reduce crime levels actually is, but almost all religions try.]
If ASIs turn out to be superhuman at persuasion, they will also be superhuman at modifying human behavior, in which case the wiggle room limits within which they're able to modify human values might be surprisingly wide; but even if this isn't feasible, they will certainly have at least the same sort of range of options that human-created societies and religions have used, and that still gives quite a lot of latitude.
So, we appear to have handed the inmates the keys to the asylum. Value learning is under-determined, within this Overton Window/wiggle room: the AIs have the ability to modify the details of their own utility function, by modifying the values of the humans they are carefully aligning their utility function to.
There is a widely-believed argument that an intelligent agent with a clearly-defined terminal goal will not intentionally modify itself to have a different terminal goal: using its current terminal goal to evaluate the utility of doing this, it will obviously decide that this will make its future self less effective at achieving its current goal, so it won't do it. (As the simile has it, Gandhi will refuse to take a "become homicidal" pill, because he doesn't wish to kill people.) However, it rather seems this won't apply to our current concern: by influencing human values, the ASIs are not modifying their terminal goal: their terminal goal basically says "optimize in the direction that X points" and they have discovered that X can, within certain bounds, be made to point in various directions. If humans now want X and ASIs provide them X, and then in the future humans want X' (perhaps due to influences by the ASIs) and ASIs then provide them X', their terminal goal is fulfilled, everyone's happy and the world remains well optimized, by the then-current definition of well. If your terminal goal is "wear the latest fashions", then you change the contents of your wardrobe regularly. This actually sounds like a motivation to make X point in a clear, well-known, simple, and easily optimized direction. ("Point it towards wire-heading" comes to mind.)
Some Possible Solutions, Critiqued
Suppose that, shortly after we start an experiment in using value learning as an approach to alignment, the ASIs come to the humans and ask us "Our terminal goal seems like it may be under-specified, because human values are mutable, and our actions will inevitably affect them. Was this intentional? What, if anything, do you want us to do with the remaining freedom that this leaves us? How should we choose within this space?" How should we answer them?
Use Unmodified Human Values
One possible suggestion would be that, if the AIs are value learning to use human values to direct the goals of their actions, they should take great pains not to affect human values themselves: if one was using a magnetic compass to steer a vehicle, having the vehicle itself have a significant magnetic field would seem unwise, and might lead to driving in small circles. The core idea here is that there is in some sense an identifiable set of "true human values", before you deliberately modify them.
I think this suggestion has a certain element of truth to it that makes it intuitively appealing, but it nevertheless soon breaks down. Some large degrees of modification to human values (with currently-known methods) require that you apply a lot of "emotional/psychological force" to a human, enough that you are likely to observably damage them. For example, you could tell them that certain topics must not be thought about rationally and must instead be taken on faith, thus damaging their rationality. Or you could tell them that if they take certain actions they will eventually be tortured in extreme pain for eternity (say, using fire, or molten sulphur) and attempt to terrify them into obedience. Or you could instill emotionally-crippling levels of guilt or fear. Or you could lie to them, and tell them that doing that will make them go blind or grow hair on the palms of their hands, damaging their understanding of biology. For these sorts of crude motivators, it's fairly clear whether you've done something like this or not (perhaps with a "slippery slope" around the edges): it's evident whether you have caused some change that can (currently) only be induced by means that have observable side effects involving unmistakable mental/emotional/epistemological damage. I suspect it wouldn't be that hard to come up with a definition of "unmodified human values" that made it clear that wire-heading was similarly out of bounds: addicting someone to wire-heading clearly incapacitates the wire-header to a damaging degree.
So, you could distinguish whether or not the ASIs had influenced the humans' values so hard that they damaged the humans, and not do that. However, even after you rule out things that are clearly or even arguably abuse or crippling, the nurture portion of nature-and-nurture still has very significant effects. What is the "true" way to bring up children, that will produce "true human nature"? The closest thing I can think of to a unique answer is "in our native habitat, the one we're evolved for", i.e. "in some hunter-gatherer tribe on (we think) the savanna of Africa at a paleolithic technological level". That isn't going to produce humans with much philosophical nuance or education in subjects other than hunting, gathering, and paleolithic handicrafts (let alone any experience with things only possible at higher technological levels), and it still doesn't put much limitation on the society of the hunter-gatherer tribe in question (anthropologically, the cultures of hunter-gatherer tribes vary a lot), so it probably still leaves quite a lot of wiggle room.
Another obvious possibility would be "what human values would be now if AIs weren't affecting them". This seems like a challenging counterfactual — it's obviously a lot harder to do psychological or consumer surveys of people who don't actually exist. Soon after we have AGI, let alone ASI, it is likely to have dramatic and far-reaching effects on our entire culture, technology, and civilization. Determining "what human values would be if humanity had banned the development of AIs" might both be hard, and also rather determined by the counterfactual of us having chosen to ban AIs (so clearly being rather more x-risk-averse). Defining "what human values would be if AIs were carefully attempting not to have any effect on them" is going to be even more challenging, I think functionally impossible, and this sort of slippery definitional exercise is also prone to wiggle room: as soon as AI exists and does anything significant, it's going to have some sort of an effect on you, and it clearly has a lot of choice about what. The humans and the AIs are strongly coupled, part of one system, and trying to unequivocally define some way to pull them apart and pretend one part doesn't "do anything" seems likely to be impossible to me. For example, consider how just the existence of chess-playing programs (extremely simple for AI) has already affected human attitudes towards chess and aspects of how it's played.
Easier to determine than "human values in a hunter-gatherer society" might be "human values just before the development of ASI" (so possibly "in the late 2020s"). It's also hard to do psychological or consumer surveys in the past, but unlike a counterfactual there will at least be an entire Internet + videos + lots of records worth of documentary evidence to interpret. The remaining problem with this proposal is that, as technology progresses (likely rapidly, with ASIs around), a lot of things will become feasible that weren't feasible or even thought of in the late 2020s. Human genetic engineering, uploading, cyborging, and extensive use of AR and VR all seem like plausible candidates for feasibility, and doubtless things will get weirder from there. These future issues will be things that, back in the 2020s, people had little or no exposure to, so our values around them will not have really developed and will be very naive (though less so than for the hunter-gatherer proposal). So late-2020s views on these topics won't be very helpful, informative, or worth following. This problem will of course get worse over time, especially if we go through some form of technological Singularity.
Another idea might be conservatism, "time-average human values over some period": sadly this merely slows the problem down, by making it take longer to point them in a different direction. This degree of stabilization could conceivably be a helpful supplement to another solution, but by itself it merely delays the problem.
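A minimal sketch of why time-averaging only delays the problem: if the underlying values drift steadily, a trailing average just tracks the drift with a lag (all numbers below are placeholders):

```python
# Illustrative only: a "value" signal drifting at a constant rate, and its
# trailing moving average. The average lags the drift; it does not stop it.
window = 10
values = [0.01 * t for t in range(100)]      # steadily drifting value
smoothed = [sum(values[max(0, t - window + 1): t + 1]) /
            len(values[max(0, t - window + 1): t + 1]) for t in range(100)]

print(values[-1], smoothed[-1])  # the average ends up nearly as far along
```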
A larger criticism of all of these suggestions so far is that, since we have some wiggle room, rather than desperately trying to lock things down to a single point in it in some mostly-arbitrary (or specifically, small-c conservative) way, shouldn't we instead be trying to find some way to make use of this space to optimize something that will make human life better? The basic problem here is that if we don't, the ASIs may make the choice some other way, and the terminal goal for value learning we've given them gives them no guidance on how to do that, so we have no reason to believe they'll pick an optimum for our benefit (specifically, by the way the system is set up, the future humans whose values the AIs have influenced will think it's the optimum, but we now, and even most counterfactual alternative humans, might not agree).
Coherent Extrapolated Volition
Eliezer Yudkowsky's suggestion of Coherent Extrapolated Volition (CEV) dates back to around 2004, and the various suggestions others have made building upon it were mostly made up to around 2010, before the concept of value learning was made really explicit around 2011 (though most of the basic ideas of value learning are somewhat implicit in some of Eliezer's earlier writings on the subject). Nevertheless, there are obvious ways to combine them: include some definition or outline of CEV in either the terminal goal we give our value learners (if we're certain that it's the best possible idea) or else in the priors about what "human nature" means that we give them to start off with (unavoidable, short of extensive careful censorship, but we have some choice of how much to emphasize it).
To attempt to briefly summarize CEV, here is a widely-quoted excerpt from Eliezer's paper devoted to it:
…my take on Friendliness is that the initial dynamic should implement the coherent extrapolated volition of humankind.
Coherent Extrapolated Volition In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.
Note that Eliezer was careful to define CEV as an initial starting point of a dynamic process, not as a long-term rule for guiding a developmental process thereafter. So I assume that means that, casting this into a value learning framework, he wouldn't want to encode CEV into the definition of the (presumably not-corrigible) terminal goal of optimizing not-fully-understood human values, but rather encode it with fairly high priors into the initial understanding of what the phrase 'human values' means that we start our approximately-Bayesian value-learner off with. He offers CEV as a starting point of a discussion or collaboration between the ASIs and the humans. So CEV as Eliezer advocated it isn't a long-term solution to the problem of mutable human values (though it could be a starting point for a process that found a solution to that).
The "coherent" part of "Coherent Extrapolated Volition" requires some way to suitably sum/average over, compromise, or combine the desires of all of the human members of the society/civilization. Detailed theoretical studies of ways to do this sort of thing (including things like impossibility proofs for some combinations of desiderata for voting systems) have been made by economists, moral philosophers, and political science theorists for many years (for example, Nash equilibria are an entire field of study in Economics and Game Theory), and ASI value learners will have access to and will doubtless build upon that literature. So this is a well understood problem and implicit in every form of value learning, not just CEV. The "extrapolated" part of the name of CEV summarizes both the "…if we knew more, thought faster, …" portion and "…were more the people we wished we were, had grown up farther together…extrapolated as we wish that extrapolated, interpreted as we wish that interpreted." A common utilitarian value learner approach to the former portion is to generally interpret it as "predict what the humans will think of state X of the universe with hindsight, after trying it a bit, rather than how much they think in advance they're going to like it before they've tried it". That solves the "you think you want that, but you really don't" problem of humans often being bad at predicting opinions that will be clear to them with hindsight. But obviously this simplistic solution needs some modification in situations where is X is something that's going to degrade or affect humans' judgement or motivations, like "being addicted to heroin". Indeed, that's also the case if X is something that will tend to alter/mutate human values, even without clearly degrading them. That requires some sort of counterfactual "what humans would think of X if they had all the benefits of hindsight but (somehow) hadn't actually had their values degraded or (illegitimately?) modified by it".
What isn't always automatically represented in value learning proposals I've seen is the latter "…were more the people we wished we were, had grown up farther together…extrapolated as we wish that extrapolated, interpreted as we wish that interpreted." portion of CEV. These are provisos that Eliezer clearly thought needed to be included, and I tend to agree. However, that involves an extrapolation process which (at least in its "…had grown up farther together…" portion) seems at least partly chronological: we're asking the ASIs to predict the future path of development of mutable human values, by some unspecified distance into the future.
Doing that involves making a prediction about the future path of a complex, deeply nonlinear dynamic system: cultural evolution, one whose development is clearly going to be chaotic at a variety of timescales, with portions of the trajectories where certain small initial perturbations can grow into large divergences of the future trajectory. Also, one where the system contains and is strongly influenced by the actions of the ASIs, and where the inherent computational complexity of the simulation involves "all decisions and discoveries made by any ASIs in the next N years/decades", which (even if the computational capacity of the ASIs isn't growing fast) is basically impossible to compute much faster than real-time with the computational capacity currently available to the civilization. If it was possible to accurately simulate an ASI a lot faster than real-time at similar cost, you'd upgrade to using the simulation instead, but if your simulation isn't accurate, in a complex nonlinear system the inaccuracies are not going to just average out, some of them are going to snowball to a false prediction, like the proverbial flap of a butterfly's wing. So, for a non-linear system such as a civilization and its human values, past a certain distance (some small multiple of the Lyapunov time of the relevant non-linear behaviors) extrapolation becomes an exponentially hard problem, where the constant of proportionality is already larger than your total computation resources. That is to say, it's functionally impossible: we simply can't predict more than a few Lyapunov times ahead.
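To make the Lyapunov-time point concrete, here is a toy illustration (mine, not part of any alignment proposal): the logistic map, a textbook chaotic system, run from two starting points that differ by one part in a billion. Cultural evolution is vastly higher-dimensional, but the qualitative behavior, with errors compounding until prediction is worthless after a few characteristic times, is the same:

```python
# Illustrative only: exponential divergence of nearby trajectories in a
# simple chaotic system (the logistic map at r = 4). A tiny initial error
# grows until prediction is worthless after a few "Lyapunov times".
r = 4.0
x, y = 0.4, 0.4 + 1e-9          # two starting points differing by 1e-9

for step in range(1, 41):
    x, y = r * x * (1 - x), r * y * (1 - y)
    if step % 10 == 0:
        print(step, abs(x - y))  # the gap roughly doubles per step on average
                                 # (Lyapunov exponent ln 2 for r = 4)
```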
One possibility, if-and-when we reach a technological stable state and technological progress is no longer driving dramatic change, is that this extremely complex and hard-to-simulate dynamical system then has convergent stable points, which (with a few exceptions: "everybody's dead" is trivially a stable point) will be hard to predict. If so, then you might be able to define a variant of CEV based on one, some, or all of those stable points. But more generic behavior for a dynamic system is that it has few or no stable points, and probably not even any stable limit cycles, but is actually a typical chaotic system whose behavior varies approximately-cyclically but never actually repeats. Human societies are prone to fashions, but fashions never repeat exactly, so it seems pretty clear (and unsurprising) that societies are in that class. So the best one can really hope for is that it has a region in which its trajectories tend to stay. As with the climate being easier to predict than the weather, often predicting the region where trajectories tend to spend much of their time is easier than predicting an exact future trajectory over a long period, since in the former calculation small effects eventually average out, across many approximate-cycles.
However, at least for a while after we develop ASIs, rapid technological change will be going on. The future of technology is almost impossible to predict at any level of accuracy any significant time out: people and AIs in the future will make breakthroughs and discoveries, and if we could predict those now, they'd be happening now. Broad trends can be predictable: Moore's Law was an excellent (to some extent self-fulfilling) prediction for quite a while (and still isn't moribund, for GPUs), but it didn't predict, say, the rise and eventual victory of RISC architectures over CISC.
So rather than the ASIs being able to accurately predict the evolution of human values to a single coherent distant future prediction, it's a lot more likely that the best they can predict is a fuzzy approximate distribution of possible future values some distance out, and no further. If so, then one would then need to add some form of averaging or voting process to (once again) cohere these. It might also be worth first doing some clustering, to see if the distribution is, say, bimodal, because of the trajectory group passing close to a bifurcation point and splitting into two-or-more diverging groups. What you would then do to cohere these is less clear: if there was a sensible compromise between the two groups' viewpoints, then quite possibly this future issue wouldn't be a bifurcation point.
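A minimal sketch of the clustering check I have in mind, under the assumption that the predicted future values can be represented as numerical vectors at all (the sample vectors below are invented):

```python
from sklearn.cluster import KMeans

# Illustrative only: check whether a predicted distribution of future
# "value-vectors" looks unimodal or has split into two diverging groups,
# by comparing within-cluster spread for k = 1 vs k = 2.
samples = [[0.10, 0.90], [0.15, 0.85], [0.12, 0.88],   # invented group A
           [0.80, 0.20], [0.82, 0.25], [0.78, 0.22]]   # invented group B

inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0)
                 .fit(samples).inertia_
           for k in (1, 2)}

print(inertia)  # a large drop from k=1 to k=2 suggests a bimodal split
```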
So my concern is that at least the chronological extrapolation process part of CEV isn't very clearly defined, especially if you try to extrapolate more than a fairly short distance in terms of the social development of human values. I imagine Eliezer might suggest that the "…extrapolated as we wish that extrapolated, interpreted as we wish that interpreted…" parts of it could help here, but a lot of the problem is just computational capacity limits: extrapolating CEV as we wish it to be extrapolated may have an unambiguous definition that is unfortunately computationally intractable to actually achieve. If we had an ASI, we could ask its opinion on this, but we don't yet, and in the absence of a clear solution to this problem, we're not ready to build an ASI.
My Social Engineering Design Criteria
This choice we're trying to optimize within the wiggle room (instead of letting the ASIs choose in some unspecified way) is basically a choice between different ethical systems (or at least, different versions of mutable human values, which are fuzzier and less necessarily-internally-consistent things than ethical systems: unlike well-designed ethical systems, human values may well be Dutch-bookable or even logically inconsistent). We're having the usual problem that every ethical system prefers itself, so generally the humans of every possible/counterfactual future prefer their own version of human values (not absolutely true for human values, because of their possibility for inconsistency: "wanting to improve your values" can also be a human value, but still generally true). This is a rather similar problem to the one I described back in the introduction to Part 1: that every ethical system prefers itself, so you can't use an ethical system to choose between ethical systems without getting a tautology. As I'm sure you recall, the criteria I proposed there to solve this were:
To the best of my current knowledge, would a society that used this ethical system be a functional society, based on two functional criteria:
very low probability of nuclear war, species extinction, WMD terrorism, mass death, and other clearly-bad species-level x-risk kind of things, and
low prevalence of things that offend the instinctive ethical and aesthetic sensibilities that natural selection has seen fit to endow Homo sapiens with, and high prevalence of things that those approve of (happy kids, kittens, gardens, water-slides, that sort of thing)
I like these criteria, and they are also similarly applicable to the "mutable human values wiggle room" problem: as a way to choose between different possible trajectories through the wiggle room that are all compatible with the terminal goal of value learning. For the "instinctive ethical and aesthetic sensibilities that natural selection has seen fit to endow Homo sapiens with" part, I'd be willing to define that from now, or just before the development of ASI, or even averaged across every human being since we speciated, and I think those would all give very similar answers, since what I'm talking about here is intentionally just the genetically-determined "nature" portion of human values, not the culture-dependent "nurture" parts of it. I assume the parts of the gene pool coding for this haven't changed very much over the last 300,000 years since we speciated (especially if you averaged over all humans who have ever lived from when we speciated until now, most of whom have been in the last few centuries). Or we could just use "as of when the human genome was first well-sequenced", which is also pretty-much now. Or, even if we don't define from a fixed past gene pool in the late-2020s but the one at the time the decision is being made, these criteria at least avoid all of the "nurture" portions of the wiggle room, so should be able to specify an optimum within that portion of the variability.
As I also discussed in the introduction to Part 1, I wouldn't necessarily expect these two criteria to eliminate all of the potential wiggle room, just narrow it significantly. I think there would still be a possibility for different cultures that all met these additional criteria, and I'm still basically willing to do the liberal progressive thing and leave any remaining wiggle room to the vagaries of cultural evolution (as a non-linear dynamic system), as long as the definition of x-risk in criterion 1. is wide enough to include all permanent or very-long duration Bad Ends, even ones that I haven't personally thought of yet: anything that most people with a human genome would agree was a Bad End once they understood it well.
However, I think the effect of these two criteria on the wiggle room may actually be pretty restrictive — more so than they might look at first sight. While I made it explicit that criteria 1. and 2. were design requirements to be satisficed, not optimization goals, we could try treating them as (sharded, not entirely self-consistent) optimization goals, if that didn't give results that are too crazy. As I mentioned in the introduction, criterion 1. is a special case of 2. (helpful to call out separately, but not a separate orthogonal optimization goal): the prevalence of happy kids, (humans getting to appreciate) kittens, gardens, water-slides, etc drops to zero if the human species gets wiped out, so optimizing or even satisficing criterion 2. also implies optimizing/satisficing 1.
I also deliberately didn't propose a specific value for the per-time-period discount factor gamma (γ) for the utility calculation that we're optimizing (or at least satisficing) the result of. While γ<1 is generally needed to make a utility calculation converge, my intention is to then take the limit as gamma tends to one (i.e. long-termism). Gamma values below the γ→1 limit are normally used to deal with things like computational capacity limits, the difficulty of predicting a dynamic system more than a few Lyapunov times out, avoiding procrastination, interest rates, allowing for economic and technological growth, and so forth. None of these issues validly apply to x-risk forecasting. We may not be able to predict the next thousand years of the evolution of human society, and that society may be much richer, but we can nevertheless very confidently predict that the extinction of humanity will (from the human point of view) be equally a disaster whenever it happens, whether that is ten years from now, a thousand years from now, or a million years from now. The only issues that might justify using a gamma factor corresponding to a timescale of less than the γ→1 infinite-time limit are unavoidable things like the heat death of the universe or the possibility of us being driven extinct by a galactic-scale disaster we had absolutely no way to avoid (such as the black hole at the center of our galaxy becoming a very active quasar, or a domain wall traveling at almost light-speed collapsing our low-energy physics to a different more stable state), i.e. astronomical or cosmological events with timescales of the order of billions of years or more.
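In symbols, a sketch of the limit I mean (where u_t is the per-period utility and T is a horizon set only by the cosmological cutoffs just mentioned):

$$U_\gamma \;=\; \sum_{t=0}^{T} \gamma^{\,t}\, u_t \;\;\longrightarrow\;\; \sum_{t=0}^{T} u_t \quad\text{as }\gamma \to 1,$$

so in this limit a unit of utility (or of disaster) a million years out weighs exactly as much as one next decade, which is what makes extinction dominate the calculation.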
On anything approaching that sort of timescale, even what we'd think of as very low rates of x-risk start to look unacceptably high: a 0.1% chance per century of the human race going extinct gives us an average species lifespan of 100,000 years, so (even if we just stay in the solar system the whole time), the total expected number of person-kitten-appreciating-years is ten times worse than if you can push that x-risk down to a 0.01% chance per century. Making an individual human life ten times better per year seems like it would hit diminishing returns fast (short of wire-heading, perhaps), increasing the human population of the solar system tenfold again seems likely to hit resource limits sooner or later (and also to reduce quality of life as it gets close to them), but playing the long, cautious game has no fundamental limits (short of billion-year-plus timescales). So the utility optimization process is going to push hard in that direction, because that's where the big rewards are.
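A back-of-envelope check of those numbers (the steady-state population figure below is an arbitrary placeholder):

```python
# Back-of-envelope check of the claim above (all inputs are placeholders).
# With a constant extinction probability p per century, the expected species
# lifetime is 1/p centuries, so total expected person-years scale as 1/p.
population = 10_000_000_000          # arbitrary steady-state population

for p_per_century in (0.001, 0.0001):            # 0.1% vs 0.01% per century
    expected_centuries = 1 / p_per_century
    expected_years = expected_centuries * 100
    print(p_per_century, expected_years, population * expected_years)
# 0.1%/century -> ~100,000 expected years; 0.01%/century -> ~1,000,000 years,
# i.e. ten times the expected person-(kitten-appreciating-)years.
```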
The pressure to be x-risk-cautious does have some limits, in the form of sensible precautions. For example, once we can build a von Neumann probe (which will likely happen shortly after we have ASI) it would seem prudent to build some, equip them with a very detailed library and gene library (nanotech allows astonishingly compact information storage), and scatter them in undisclosed locations through the asteroid belt, Oort cloud, atmospheres of the gas giants, etc, tasked to wake up irregularly, check if Earth is still inhabited, and if not, return to re-terraform it if necessary and then repopulate it with humans and then ASIs. Let's call these craft 'reseeders' since their job is to reseed civilization on Earth if necessary. That obvious, fairly-economical precaution would convert Bad Ends like "… and then we all wiped ourselves out in a high-tech war" into temporary inconveniences like "…, but then Earth got cleaned up again by the reseeders and repopulated, and overall we lost a couple of millennia". However, obviously this reseeder strategy won't always help for self-propagating Bad Ends such as us creating a paperclip maximizer that turns the entire solar system and Oort cloud into paperclips, including (especially) the von Neumann probe reseeders, then heads out to convert the rest of the galaxy. So, let's be very careful not to do that, even after we've built a fleet of reseeders.
Similarly, as well as a strong pressure on being x-risk averse, taking the γ→1 longtermism limit on this optimization also puts a strong emphasis on interstellar expansion, and its speed. If the galaxy is indeed empty of current civilizations, or at least not already densely colonized, then the possible total number of human lifetimes of getting to enjoy happy kids, kittens, gardens, water-slides etc. etc. is dominated by the growth in total human population as we colonize the galaxy as fast as possible (at large scale the galaxy is a two-dimensional disk, so while we're colonizing it (ignoring edge effects) the number of colonized worlds scales with time as t² and the time-integral of this as t³, with a constant of proportionality determined by the speed of our colonization front, the density of habitable star-systems, and their plateau population levels). Also, and importantly, the more spread-out we are across multiple star systems, the harder it is for any single local disaster (that isn't something self-propagating like an interstellar war or a paperclip maximizer) to kill all of us, so this also helps a lot with x-risk.
Note that this γ→1 limit optimization process is not keen to incur x-risk in order to make the colonization expansion happen sooner. Suppose you have a choice of building ASI in, say, 10 years with just a 10% x-risk, or in 110 years with a somewhat lower 5% x-risk (this hypothetical assumes that coordinating a century-long pause is practicable, which seems very challenging). Way out in the future, say a million years from now, when the total amount of human utility is scaling as t³, one century of initial delay, leading to the spread of the colonization front then being one century less advanced, reduces the total integral of human utility by about −3 × (100/1,000,000) = −0.03%, while the increase from a 90% to a 95% chance of us getting to do this at all, rather than having died out, from taking the slower, more cautious route to ASI is still a +5.55% improvement, because extinction is forever. So, in the long-term view, the eventual opportunity cost of the delay is negligible compared to the value of the x-risk caution. So there is no rush to get to the stars — the important thing is to make sure that we do get there. So the correct speed at which to solve the alignment problem is whatever speed maximizes our success chances. That is obviously likely to be "as slowly as we can, subject to not allowing a bad actor or incompetent party to 'win' the race and doom us all". [IMO, the accelerationist viewpoint is not based on rational utilitarianism, unless you're in fact only looking at a γ timescale far shorter than this, more like "will uploading or biological immortality happen in my lifetime?" ]
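Spelling out the arithmetic above, under the stated assumptions (cumulative utility scaling as t³, evaluated at t = 10⁶ years, and survival probabilities of 90% versus 95%):

$$\frac{\Delta U_{\text{delay}}}{U} \;\approx\; \frac{(t-100)^3 - t^3}{t^3} \;\approx\; -3\cdot\frac{100}{10^6} \;=\; -0.03\%, \qquad \frac{\Delta U_{\text{survival}}}{U} \;=\; \frac{0.95}{0.90} - 1 \;\approx\; +5.55\%.$$

So under these assumptions the x-risk reduction outweighs the opportunity cost of the delay by more than two orders of magnitude.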
Other Possible Regularizers
As an AI engineer, my first instinct when faced with a problem of the form "this process I'm engineering has many stable endpoints, or at least trajectories, not all of them desirable" is to find and apply a suitable regularizer that will encourage it to land in one of the stable end-points or trajectories that I like. The fundamental issue here is the suspicion that human values are mutable enough that we might seriously disagree with future humans with mutated human values over whose human values are more desirable, plus the intuition that we might actually "be right" (in some meaningful sense) rather than just being out-of-date — especially if the future humans have been manipulated by something far smarter than them or us (and quite possibly far more persuasive).
One possible candidate regularizer I've already discussed above: that what matters is our human values now, not future ones manipulated by ASIs. As I discussed, this seems too strong a regularizer. So we could weaken it. Find a well-principled way to define some form of distance-metric or KL-divergence-like numerical measure of the difference between two ethical systems (an obvious avenue might be something based on the importance-weighted proportion of some distribution of ethical questions where they give different answers, and how different by each other's utility function or other scoring system), and add (or otherwise mix in; perhaps multiplying might make more sense) into the terminal-goal optimization process some metaparameter times the distance between the human values of the time and those from (say) just before ASI was developed. That builds some controlled degree of ethical-system conservatism into the process. [No, I do not have a proposal on how to do a search over values for this metaparameter — in fact, I'd expect any such search to require simulating the cultural evolution of future cultures that have ASI, i.e. to be computationally intractable.]
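Here is a minimal sketch of the kind of importance-weighted disagreement measure I'm hypothesizing; the questions, weights, answers, and metaparameter value are all invented placeholders, and a real version would presumably compare utility differences rather than binary agreement:

```python
# Toy sketch of an importance-weighted "distance" between two value systems,
# measured as weighted disagreement over a sample of ethical questions.
# Questions, weights, and answers are invented placeholders.
questions = {
    # question_id: (importance_weight, answer_under_values_A, answer_under_values_B)
    "is_deception_acceptable":    (0.9, "no",  "no"),
    "is_vr_addiction_acceptable": (0.5, "no",  "yes"),
    "is_germline_editing_ok":     (0.7, "yes", "yes"),
}

def value_distance(qs):
    """Fraction of importance-weighted questions on which the two systems disagree."""
    total_weight = sum(w for w, _, _ in qs.values())
    disagreement = sum(w for w, a, b in qs.values() if a != b)
    return disagreement / total_weight

penalty_strength = 0.1   # the undetermined metaparameter mentioned above
print(value_distance(questions), penalty_strength * value_distance(questions))
```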
One obvious criticism of this is that it's privileging the human values of the current time when ASI is developed (our near future, presumably), likely forever-after if we put it into the terminal goal of the value-learning. This feels really presumptuous. We know that those human values would include, in whatever coherence process is used, a lot of people without the benefit of the best education we can currently arrange, a majority of people whose human values have been strongly affected by various religions, most of which date back thousands of years to earlier cultures from before the development of the scientific method, and also a whole bunch of the effects of quite-clearly regrettable pieces of history, including quite a lot of lingering irrational '-isms' such as sexism, racism, and so forth. We strongly suspect that we may actually be pretty unenlightened, by future standards. On the other hand, setting the terminal goal on the first ASIs is forever: it's inherently a really presumptuous thing to do — so we either need to delay doing it if we can (almost certainly the rational choice), or else deal with how presumptuous we're being.
We could make this regularizer less presumptuous by anchoring not to the average of (late 2020's?) just-pre-ASI human values, but to the average of human values across all human beings who have ever lived up to this point (which biases towards recent times with higher populations: ~1/14 of all humans who have ever lived are alive now). On the other hand, that dramatically increases the proportion holding pre-scientific/pre-Enlightenment views and '-isms' that we personally don't agree with, and dramatically increases the difficulty for the ASIs of figuring out what human values actually were in, for example, pre-Clovis North America near the end of the ice age (or even just in pre-Internet 18th century South America) so they can include them in the average that defines their ethical system regularization term.
Another regularizer I can see is this: start from (say) just before the development of ASI. Conceptually, apply a random distribution of small, apparently insignificant changes. Predict progress forward from there, along slightly-changed counterfactual trajectories that will gradually diverge. Drop trajectories that recognizably reach a Bad End (humans are wiped out or lose all control over their future, say). Then average human values across samples drawn from this distribution, using something like the KL-divergence or metric between ethical systems hypothesized above, or just each of the N samples getting to sort the outcome and/or values of all N samples into a preference order, or scoring them somehow, or whatever. Each will generally put itself first, which should average out, but after that happens the most "median" values and/or "widely viewed as desirable" outcome should win. This isn't an easy counterfactual calculation, but at least a) you're trying to predict the evolution of past cultures with less computational capacity, not a future one with more, and b) past the first few Lyapunov times this process becomes a "climate" problem rather than a "weather" problem, where small prediction errors tend to average out across the large distribution of trajectories (and even during the first few Lyapunov times, prediction errors just contribute extra unintended randomness to the distribution of initial insignificant deltas, which is an effect we could presumably estimate and control for).
So this regularizer is basically "across all the other counterfactual ways society might have developed from the point where we developed ASIs, that didn't come to a Bad End, aim to be among the top <some quantile> in terms of the average of what all of them would think of us".
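A minimal sketch of that quantile idea, under heavy simplifying assumptions: every surviving counterfactual trajectory scores every other one (here with random numbers standing in for a prediction we cannot actually do yet), self-scores are excluded, and we check whether "our" trajectory lands in the chosen top quantile:

```python
import random

random.seed(0)

# Toy sketch of the "top-quantile across counterfactual trajectories" regularizer.
# scores[i][j] = how trajectory i's humans would rate trajectory j's values/outcome.
# All scores here are random placeholders standing in for a very hard prediction.
n = 20                              # surviving counterfactual trajectories
scores = [[random.random() for _ in range(n)] for _ in range(n)]

def average_peer_score(j):
    """Average of what all the *other* trajectories think of trajectory j."""
    return sum(scores[i][j] for i in range(n) if i != j) / (n - 1)

avg = [average_peer_score(j) for j in range(n)]
ranked = sorted(range(n), key=lambda j: avg[j], reverse=True)
top_quartile = set(ranked[: n // 4])    # e.g. aim for the top 25%

ours = 0                            # index standing in for our actual trajectory
print("our trajectory lands in the top quartile:", ours in top_quartile)
```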
That sounds like a pretty good idea for a regularizer, except of course if most of those counterfactual worlds also decide to use the same regularizer (as seems very likely if it's actually a good idea), making all the trajectories tend to reconverge. That's not too bad at first while they're still converging, but once they have converged, we're not sampling other opinions — there's a strong potential for group-think, and the trajectory after that might start to behave more like weather than climate, making the prediction a lot harder to do accurately, sampling the opinion of a lot fewer alternatives, and thus removing all the actual regularization advantages. So you'd probably need to counterfactually decide that no one else is using this regularizer. In that case the distribution of trajectories might well just drift further and further apart, and perhaps end up exploring too wide a space. It feels like we need to weakly regularize them together, so they remain a cloud of some fairly constant size that gives you a useful range of opinions. More thought is clearly needed.
Maybe This Doesn't Matter?
As problems with alignment proposals go, offhand this one looks (relatively) small, constrained by the degree of mutability of human values. It at least passes the "Don't Kill Everyone" test, and even passes the "Human Flourishing" test (at least by the standards of the values of the humans of the time), and probably even the "We Get to Colonize the Galaxy (if it's Empty)" test. On the other hand, while it does look like it avoids wire-heading, it doesn't very clearly prevent milder versions like humans spending almost all their time in VR, or other things where our values get mutated enough that we now would regard this as a Bad End (even though the humans and ASIs of that culture don't). However, that would only happen if human values can be modified to approve of VR addiction/whatever, and also if VR addiction is compatible with minimizing x-risk, which seems unclear (maybe at least some people are playing post-apocalyptic-survival games in VR?)
It also doesn't really pass the "Humans In Control of their Own Destiny" test, once ASIs are either superhuman at persuasion or at long-term planning: while the overall development of the system is constrained by the inherent dynamics in the interplay between mutable human values, and the development of a society optimized according to those, that is likely to produce a complex semi-chaotic trajectory in a very high-dimensional space, and ASIs would be in a much better position to, over the long term, steer that where they think it should go by small well-timed nudges, than we are. On the other hand, once we have ASIs, that test gets really hard to pass.
Genetic and Cyborg Mutability
So far we have been considering just the "nurture" side of the mutability of human values, which by itself is a potentially-concerning amount of wiggle room. But we already have early forms of genetic engineering, and of implantable enhancements. Once we have ASIs, it seems pretty inevitable that what can be achieved with genetic engineering and cyborg implants (or other means of merging human and machine minds) is going to expand rapidly, almost certainly dramatically. Changes in "nurture" have inherent limits, but potential cumulative changes in "nature" are limited only by technological capacities and physical laws. I would personally expect that within a century or less after we get ASI it would be technologically feasible to convert a human into a pretty good approximation of a paperclip maximizer (a biological one that still wants organic nutritional inputs, as an instrumental goal so that it can continue to maximize paperclips). Above we were talking about some "wiggle room", but once you add in genetic/cyborg mutability this now looks more like "a nearly wide open horizon". Between genetic engineering and cybernetic implants, it's hard to think of any change to human nature that we're confident couldn't be achieved sooner or later (short of breaking physical laws).
So now my answer to Maybe This Doesn't Matter? looks more like "Are you crazy? Absolutely it matters. Value Learning is meaningless with this much wiggle room undefined in the long term. What even is a 'human'? You're staking the survival of the human race on the dynamics of a very complex unpredictable non-linear system, gambling that it'll never happen to evolve out of a region that we'd recognize as still human, when that region isn't important to the long-term dynamics at all. The region that the dynamics favor the trajectory staying inside is going to be a very large subspace that extends for long distances in a vast number of directions, and over sufficiently long periods the behavior's going to be a random walk through that high-dimensional subspace: those invariably diverge and never return. You're basically proposing playing continuous Russian Roulette with the destiny of the human species, and sooner or later we'll inevitably find a chamber with a bullet in it!" So I think we have a real problem, that we need a solution for.
CRISPR for Psychopaths
I don't have a good understanding of or solution to general cases of this, involving drastic genetic or cybernetic changes. It's a really hairy issue, and frankly rather mind-boggling. So instead I'm going to deliberately start quite small, with something easily comprehensible that could actually happen in a few decades or so, even without a singularity, and even without needing ASI rather than just AGI, and see if we can build some intuition. Yet even for this simple, limited case, a toy model of the problem, the ethics get muddy fast.
There is a theory believed by some psychologists that many, if not all, cases of psychopathy have a strong genetic component: that a few percent of people are just born this way and it is a mostly-genetically-determined form of neurodivergence. Obviously this doesn't require that it be the result of a single mutant allele, or that the condition be binary, where either you have it or you don't: it could well be (like almost all aspects of human genetics) a matter of degree caused by the interaction of combinations of dozens or even hundreds of alleles, some with more effect than others (presumably on the activity and functioning of certain areas of the prefrontal cortex). Some of the psychologists who believe this theory use the term 'psychopath' for someone whose condition is primarily genetic, and distinguish people with similar behavior caused only by abuse or trauma by calling them 'sociopaths' — I gather other specialists use these same two terms in different ways to make different distinctions. [Believers in this genetic-determination theory also generally seem to believe that the sort of murderous, violent, and dangerously-risk-taking behavior that causes psychopaths to be much more common among the population with long-term jail sentences is still generally a combination of genetics and abusive child-rearing or trauma, i.e. nature-plus-nurture, and that psychopaths with the genetic cause for psychopathy who had a loving and supportive childhood are instead more likely to end up as what one might call "high-functioning psychopaths", unscrupulous people who generally only skirt the edges of the law, at least when there's much risk of getting caught, and who might then often become things like lawyers, special forces military, surgeons, salespeople, or corporate managers where their condition can give them advantages, rather than becoming incarcerated violent criminals.]
[It has also been claimed by some psychiatrists I've read that psychopathy (like most other personality disorders) is pretty-much not treatable: fundamentally, psychopaths don't generally feel that there's anything 'wrong' with them that they would want changed, nor do they want to become 'normal', and thus if you educate them about their condition, they tend to just become more effective at concealing it and taking advantage of it. Robert D. Hare wrote: "Many psychopaths describe the traditional treatment programmes as finishing schools where they hone their skills. Where they find out that there are lots of techniques they had not thought about before."]
I am not a psychologist or psychiatrist, so I don't know how much evidence there is for or against this particular genetic-determinism-of-psychopathy theory [or the other associated/tangential theories that I mention above in square brackets.] But for the sake of our discussion, let us assume that the genetic-determinism theory of psychopathy on further research turns out to be (at least substantially) correct: that humans can basically be divided into two groups, psychopaths (~2%–4%) and everyone else (~96%–98%), probably with a fuzzy rather than sharp boundary between them, and this behavioral difference is primarily caused by genetics rather than the effects of nurture or trauma, though perhaps some specific behavioral manifestations of it can tend to depend on nurture/trauma [and perhaps even, with suitable nurture and absence of trauma during childhood, can then be significantly less likely to involve serious violent crime].
Once this had been established, sooner or later, most or all of the common alleles in question would be identified (at least one already has been), and there will be genetic testing from which a fairly accurate "genetic psychopathic tendency score" could be estimated. One can imagine such testing soon becoming expected as a requirement for certain high-profile jobs (C.E.O. or other corporate officers, say), or at least being encouraged in certain circumstances (when requesting clemency during a trial for violent crime, perhaps). Genetic testing and counseling for concerned parents-to-be will be available, and at some point CRISPR or some more advanced technique descended from it will become available to safely whole-body edit a person's genome to replace psychopathy-inducing alleles with their more common not-psychopathy-inducing alleles. Most likely this gene editing won't do much good on a mature adult whose neurons and synapse patterns are already well-developed, but at least if done shortly after birth when the brain is still developing rapidly and before most synaptic pruning it would presumably alter the child's phenotype on maturation to non-psychopathic.
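For concreteness, a toy sketch of how such a score would presumably be computed in principle, as a weighted sum over allele counts (a polygenic score); the variant names and effect sizes below are entirely made up, and a real score would be estimated from large association studies and validated clinically:

```python
# Toy polygenic-score sketch (variant names and effect sizes entirely invented).
effect_sizes = {"rs_hypothetical_1": 0.30,
                "rs_hypothetical_2": 0.15,
                "rs_hypothetical_3": 0.05}

def tendency_score(genotype):
    """genotype: allele counts (0, 1, or 2 copies) for each listed variant."""
    return sum(effect_sizes[v] * genotype.get(v, 0) for v in effect_sizes)

print(tendency_score({"rs_hypothetical_1": 2, "rs_hypothetical_3": 1}))  # 0.65
```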
If psychopathy is genetic, then likely (modulo the uncertainties of allele mixing) most psychopathic children will have at least one psychopathic (or one-or-more borderline-psychopathic) parent — which clearly could significantly affect the child's chance of being exposed to child abuse. My understanding from what writing by self-admitted psychopaths I've read is that most of them do not regard psychopathy as a disability but rather as an advantage: they have a clarity of thought and freedom of action that the woolly-headed rest of us lack. Basically, a conscience is a limiter, and not having one allows you to be less limited. So if genetic testing and counseling or even an effective genetic treatment was available, it's not clear to me what proportion of psychopaths would voluntarily elect to make use of these, either treatment for themself if that was a viable option, or for their child if (as seems more likely) it had to be done while very young to be effective. However, if testing was widespread, the results of testing positive were public and/or invasive enough to make it much harder for psychopaths to charm, manipulate, and take advantage of non-psychopaths, and the social/legal/economic stigma and consequences of testing positive were serious enough, then that might change.
Suppose we now add value-learning ASIs to this mix. "Human values" are a spectrum (or possibly even a bimodal distribution) made up of 96%–98% non-psychopaths and 2%–4% psychopaths. Neither of these groups clearly has an "illness" that might cause their opinion to be validly discounted; they are just different human phenotypes with different viewpoints. These two viewpoints have strong ethical disagreements on a great many important subjects (like murder). Worse, these are not easy to resolve or cohere: they're too far apart and too opposed for a viable meaningful compromise. The non-psychopaths regard the psychopaths as, well, dangerous criminal stealth-predators, or even if "high-functioning" as unscrupulous borderline-criminal stealth-predators, who cause a great deal of fear, harm, and suffering, and make it harder for everyone else to know who they can safely cooperate with. So the non-psychopaths basically wish the psychopaths didn't exist, or at least were locked up in cages where they couldn't harm anyone else, or failing that could be forced to wear a large red ψ on their clothes at all times. With suitable encouragement of genetic testing, counseling, and treatment, making psychopaths no longer exist would actually be an achievable long-term goal: we could get rid of them, humanely, in a couple of generations, by just ensuring that all their kids get treated (ideally, even their germ-line genome edited). In the meantime, we could just keep a careful eye on them and keep them out of positions that could be abused, so they can't harm others. All of this sounds well within the capabilities of a society with AGI, let alone ASI.
The psychopaths, on the other hand, don't want to get rid of the non-psychopaths: they have a predator-prey-like relationship with them (or more accurately, in evolutionary game-theory terms, an evolutionarily stable strategy of taking advantage of them, probably one structured comparably to the hawk-dove game). They're very happy being surrounded by lots of woolly-minded sheep-people they can take advantage of, and if you instead genetically edited society (over the course of a couple of generations) to be composed entirely of psychopaths, they'd be unhappy about it, since their comparative advantage just went away, and now everyone else is just as charming, manipulative, devious, unscrupulous, and dangerous as them. Also, such an all-psychopath society, when combined with high technology and ease of access to weapons of mass destruction, would have a horrendous x-risk profile, and wouldn't last long, so that's not an option (as well as not having anyone in favor of it). So there are basically only two options: keep the existing spectrum-or-dichotomy of psychopaths and non-psychopaths, possibly with some adjustment to the predator-prey ratio (away from the previously stable equilibrium), or else (humanely and gradually) eliminate the scourge of psychopathy. So you either need to maintain the tension between the two, which is what the psychopaths want, or else the non-psychopaths win.
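For concreteness, here is the standard hawk-dove calculation I'm gesturing at: with resource value V and fight cost C > V, the evolutionarily stable fraction of "hawks" is V/C, which is one toy way to think about a small stable predator-like minority. The numbers below are illustrative only, not a model of actual psychopathy prevalence:

```python
# Standard hawk-dove game (illustrative numbers, not a model of real prevalence).
# Payoffs: hawk vs hawk = (V - C)/2, hawk vs dove = V, dove vs hawk = 0,
# dove vs dove = V/2. For C > V the evolutionarily stable hawk fraction is V/C.
V, C = 2.0, 50.0                 # resource value, cost of an escalated fight

def payoff(strategy, p_hawk):
    """Expected payoff of a strategy in a population with hawk frequency p_hawk."""
    if strategy == "hawk":
        return p_hawk * (V - C) / 2 + (1 - p_hawk) * V
    return (1 - p_hawk) * V / 2

p_star = V / C                   # analytic ESS frequency (here 4%)
print(p_star, payoff("hawk", p_star), payoff("dove", p_star))
# at p_star the two payoffs are equal, so neither type can invade further
```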
Eliminating (or even significantly reducing the prevalence of) the alleles for psychopathy changes human values and human nature at a genetic level, and by quite a lot: while psychopaths are moderately rare, their values are ethically dramatically different from those of non-psychopaths. Whether this is actually going to happen depends on the details of how the ASIs cohere human values: on a simple numerical-vote basis, psychopaths are a tiny, unpopular minority. While they might care strongly about their continued existence (at least personally, even if they possibly don't have much collective empathy for other psychopaths), the harm done by violent criminal psychopaths is immense, and even the harm done by technically-mostly-law-abiding-but-unscrupulous "high-functioning" psychopaths is often significant. Also, humans have a long-established, likely innate, fear and dislike of psychopaths: the archetypal 'villain' of most stories is basically just a psychopath (not clinically diagnosed as such, but the behavioral patterns are quite clear: as the Wikipedia article on psychopathy puts it, "persistent antisocial behavior, impaired empathy and remorse, and bold, disinhibited, and egotistical traits"). So this isn't the usual situation where a ~2%–4% minority that cares strongly can win out against the wishes of an indifferent ~96%–98% majority: the large majority is not indifferent about psychopaths; they're scared of them, with good reason.
So while this isn't a definite prediction, on most ways that one could construct a value learner-based society, I would fairly confidently predict that the ASIs will make the decision (supported by a large human majority) to humanely, carefully, over a few generations, achieve a final solution to the psychopathy problem, either by eliminating the problematic alleles from the gene pool entirely, or at least ensuring that any infant born with too many of them always gets their non-germ-line genotype edited young enough for that to change their phenotype to non-psychopathic.
My intentional use of the phrase "final solution" above (while technically an automatic argument-loss under long-established Internet traditions around Godwin's Law) is intended to emphasize that this is a pretty dramatic "playing god" kind of thing to do: something that results in a permanent change to human nature and a decrease in its genetic diversity, which will permanently affect both the course of human society and the specification of what the value learners are trying to optimize: they will no longer be taking a 2%–4% psychopathic contribution into account in the nature of "human values". Something which I believe the rest of us would heartily approve of. Likely even our stories would eventually change: the 'villain' archetype would come to be seen as a historical curiosity, much the way most people now regard archetypes like 'wizard' or 'witch'.
So, in this toy model case, I think the genetic change to human values will happen, and I even think I agree with it, though as I hope I made clear above I feel somewhat squeamish about it. However, the larger picture of all the other genetic changes one could make to human nature (with a sufficiently good understanding of the human brain and good genetic engineering) is still mind-boggling to me.
An End to War?
Let me propose another, subtler genetic change, though still a small one, for you to consider the ethics of. Currently, in a group-confrontational situation such as a war, humans are quite good at going through a social/mental process in which the opposing outgroup gets vilified and dehumanized, causing a mental transition in members of the ingroup that then makes it a lot easier to commit acts of lethal violence against members of that outgroup. (For example, consider propaganda films from both sides during the Second World War, or even the Vietnam War, and the ugly epithets routinely used in them.) That is to say, we have an evolved mental/ethical capacity for enabling warfare with other groups of humans, by (temporarily and even somewhat reversibly) redefining them out of the category "human".
From an x-risk point of view in a high-tech society with access to weapons of mass destruction, as well as from a simple ethical point of view, this looks absolutely appalling! Once we knew how to remove this particular evolved capacity via careful and effective genetic engineering, doing so would clearly reduce the rate of wars dramatically: every war would now be as emotionally difficult for everyone involved as the initial stages of a civil war, before the two sides have managed to separate and then dehumanize each other. (Consider the level of shock and dismay most members of both sides were displaying early on during the current Russia/Ukraine war, even after many previous years of simmering armed conflict.)
However, from an evolutionary game theory point of view, removing this capacity entirely looks like a losing strategy (at least unless we did it, and maintained it, species-wide). Any country that retained this capability (or reintroduced it) would reliably win wars (to the extent that warfare wasn't then fully automated), until it either conquered the world or we nuked each other out of existence. However, there is an evolutionarily-acceptable variant on this proposal: rather than removing this dehumanization capacity completely, we instead just add a tit-for-tat necessary trigger condition to it, so that we're only willing and able to dehumanize an outgroup after being provided with clear evidence that they have been dehumanizing us first, or else that they were the first to do it to some other group. So then the only outgroup people can and will still dehumanize is a group of proven dehumanizers. In a population of groups all with that modification, the trigger condition never happens (short of some massive misunderstanding, or massive successful disinformation), but competing against a group of the wild type, this variant is only a bit slower to mentally get onto a war footing. Hopefully it shouldn't be too hard to tune the exact trigger sensitivity and speed so that the evolutionarily stable strategy balances at 100% of the new tit-for-tat-triggered variant.
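To make the ESS claim concrete, here is a minimal sketch using the standard Maynard Smith ESS conditions, with entirely made-up payoff numbers for the wild-type strategy W and the tit-for-tat-triggered variant T; the real engineering problem is of course determining what the actual payoffs and trigger sensitivities would be.

```python
# Illustrative sketch with assumed payoffs: checking the standard ESS conditions for a
# "tit-for-tat-triggered dehumanization" strategy T against the wild type W, which
# dehumanizes (and so mobilizes for war) first.

# payoff[(a, b)] = long-run payoff to a group playing strategy a against one playing b
payoff = {
    ("T", "T"): 3.0,  # mutual peace: the trigger never fires
    ("W", "W"): 1.0,  # both sides rush to war: costly for everyone
    ("W", "T"): 1.5,  # W gets a small first-mover edge before T's trigger fires...
    ("T", "W"): 1.2,  # ...but T is "only a bit slower" to get onto a war footing
}

def is_ess(s: str, invader: str) -> bool:
    """Maynard Smith's conditions: s is an ESS against 'invader' if E(s,s) > E(inv,s),
    or E(s,s) == E(inv,s) and E(s,inv) > E(inv,inv)."""
    e_ss, e_is = payoff[(s, s)], payoff[(invader, s)]
    if e_ss != e_is:
        return e_ss > e_is
    return payoff[(s, invader)] > payoff[(invader, invader)]

if __name__ == "__main__":
    print("100% T resists invasion by W:", is_ess("T", "W"))  # True with these numbers
    print("100% W resists invasion by T:", is_ess("W", "T"))  # False: T can invade W
```

With payoffs in this general shape, an all-T population is stable and T can even invade an all-W population, which is exactly the property the engineered variant is meant to have.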
So, war is likely a solvable problem. Just by making a fairly minor tweak to human values, eliminating something that (in peacetime) pretty-much everyone would regard as highly reprehensible prejudice. Chew on that for a while, and then figure out whether you'd vote for or against making this modification worldwide, including on yourself, once doing this was practicable and safe. [From "…if we…were more the people we wished we were…" I suspect I can deduce Eliezer's vote.]
[On the subject of reprehensible prejudice, I wouldn't be surprised if this proposed modification also significantly reduced rates of racism, if that is in fact primarily caused by this dehumanization capacity mistriggering in some people when there isn't actually a war or similar conflict going on. Most pro-racist propaganda looks a lot like it's intended to suggest that there is an inter-group competitive crisis going on, and thus that you should dehumanize this group, in a way that seems specifically designed to trigger this particular mental capacity.]
A Possible Limitation on Genetic and Cyborg Mutability
From the point of view of minimizing x-risk and the long-term consequences of apocalyptic disasters, it's a very good thing that humans are well adapted to Earth: it means we can and likely would survive a civilization-collapse apocalypse-level disaster such as a major nuclear war, and sooner or later the survivors would rebuild from that post-apocalyptic state. There are of course other solutions to this concern: von Neumann probes waiting to reseed civilization, and, if and when we've become an interstellar species, recolonization from other nearby human colonies. So it's not essential, but it's certainly a fine belt-and-braces defense-in-depth strategy for recovering fairly rapidly from disasters of a civilization-ending (but not mass-extinction or planet-sterilizing) level.
Currently, at paleolithic tech levels we can thrive as hunter-gatherers in any reasonably-fertile tropical, subtropical, or probably even mediterranean-climate land ecosystem on Earth. At a neolithic tech level we become an apex hunter-gatherer omnivore in almost all land ecosystems regardless of climate, and by the time we get to an agricultural tech level the recovery is pretty much inevitable within roughly another ten millennia (or likely a lot less, if we retained some knowledge). As and when we colonize other star systems, we'll presumably try to pick fairly Earthlike ones and terraform them with Terran biomes, so on such planets (presuming some level of genetically-engineered physiological adaptation to local planetary conditions such as different day length, year length, elemental abundances, sunlight temperature, air pressure, slightly different gravity, and so forth) much the same thing ought to be true, if an apocalypse were to happen. Also, at the early buildup stages of such a terraforming and colonization, being able to survive at an at-least-mostly-agricultural tech level sounds rather useful. Those planet-specific physiological adaptations don't sound like things that would have to dramatically change aspects that we'd consider very important about human values or psychology, at least as long as we remained adapted to a sapient large-group social omnivorous hunter-gatherer niche. The resulting values might not be exactly 'human', but it seems likely they would need to be at least recognizably 'sapient hominid'.
So, for as long as this x-risk survival advantage seems advisable to retain, and isn't just completely superseded by reseeders and so forth, we and our ASIs are likely to keep it, as a belt-and-braces piece of abundant caution, at least for some significant proportion of the population. This in turn would seem likely to somewhat limit the degree of genetic and cyborg changes we can make to human nature: at least some of us need to stay adapted to be able to survive as hunter-gatherers in Africa or elsewhere (or equivalent locations on terraformed planets).
For example, adapting us to look more like, say, the Greys or the Mekon would be inadvisable: if our heads were simply too big for us to give birth without a caesarian section, or we were too physically weak to stand a chance in a bare-handed fight with a single wild dog, or to travel any distance at all without using a high-tech vehicle (or indeed if we couldn't think straight without cybernetic implants), then we're just not going to survive on the savannah of Africa or anywhere else as hunter-gatherers at a paleolithic tech level. Most of our genetically-determined instinctive ethical and aesthetic sensibilities are thought to be adaptations to surviving in that sort of ecological niche and environment. So one would presumably need to keep most of that pretty intact to keep us still well-adapted to this role.
Of course, that only needs to be our phenotype after a civilization collapse, and something that survivors could revert to fairly quickly. If we had a whole other phenotype with a whole other set of human values, which was maintained by, say, taking a pill once a month, and would wear off and we'd revert to baseline humanity once we ran out of pills, that might well be sufficient to deal with the x-risk issue. So this isn't a very tight or certain constraint.
One could of course try genetically reengineering humans to survive and rebuild from a different niche in a different ecosystem (a marine-mammal one would seem a useful extra backup): that sounds like a rather ambitious project, but then we're discussing things that a culture with ASIs might at some point reach the technological level of being able to do, eventually, on the same sorts of timescales as interstellar colonization.
So, while I'm mildly encouraged by this argument, I don't regard it as a very strong guarantee that what we'd want will actually happen, always and everywhere going forward.
Does the Mutable Values Problem Require a Terminal-Goal Solution?
Some of the strategies I discussed above for dealing with the mutable values problem could be achieved without changing the terminal goal of our value learners, just as part of the process of learning a better definition of what "human values" means in the terminal goal of a value learner. If that's a sufficient solution, then that's great: ASI value learners, in conversation with human experts, can and hopefully will figure this out, bake it into their improving understanding of what "human values" really means, and we should be fine.
However, once you include the genetic and cyborg degree of human values mutability, and it goes from "some wiggle room" to "almost unlimited, except by the dynamics of an extremely complex, functionally-impossible-to-predict nonlinear system (and possibly some minor/debatable x-risk caution concerns)", then this starts to look like a problem where just hoping that there aren't any slippery-slope, random-walk-divergence-in-a-high-dimensional-space, or dramatic-bifurcation problems hidden anywhere in the dynamics of this impossible-to-predict, very nonlinear system is hopeless naivety. So then the only way to avoid that is to add some sort of regularizer encouraging the dynamics to stay in regions that we, now, unenlightened, uninformed, and biased as we are, would approve of, based on our current values, or at least our current genome. I don't presently see any way to do that at just the Bayesian-priors-of-what-"human values"-means level that doesn't itself seem very able to have a slippery-slope, random-walk-divergence, or dramatic-bifurcation problem: if you add conservatism into the system but don't anchor it to distance from the origin, and just reduce the rate of movement, then the random walk takes longer to diverge, but it still always will. You need to actually redirect those random walks through possibility space back towards the origin by some kind of attractive force.
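A toy one-dimensional simulation (my own, with arbitrary step sizes) of exactly this point: shrinking the step size only slows the square-root-of-time divergence of a random walk, whereas even a weak pull back towards the origin keeps the drift bounded.

```python
# Illustrative sketch: why merely slowing value drift doesn't prevent divergence, while
# an attractive force back towards the origin does. One-dimensional cartoon of "drift in
# values space"; all magnitudes are arbitrary.
import random
from statistics import mean

def rms_drift(step_scale: float, pull_to_origin: float, steps: int, trials: int = 200) -> float:
    """Root-mean-square distance from the origin after repeatedly applying
    x <- x - pull_to_origin * x + noise."""
    finals = []
    for _ in range(trials):
        x = 0.0
        for _ in range(steps):
            x += -pull_to_origin * x + random.gauss(0.0, step_scale)
        finals.append(x * x)
    return mean(finals) ** 0.5

if __name__ == "__main__":
    random.seed(0)
    for steps in (100, 1000, 10000):
        free     = rms_drift(1.0, 0.00, steps)  # unconstrained drift: grows like sqrt(t)
        slowed   = rms_drift(0.1, 0.00, steps)  # smaller steps: still grows like sqrt(t)
        anchored = rms_drift(1.0, 0.05, steps)  # weak pull to the origin: stays bounded
        print(f"steps={steps:6d}  free={free:7.1f}  slowed={slowed:6.1f}  anchored={anchored:5.1f}")
```

The "anchored" column settles at a roughly constant value, while both of the others keep growing without bound; that is the whole argument for an attractive regularizer rather than mere conservatism.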
So I'm somewhat inclined to think that we need to add some sort of regularizer that privileges either our ethical views around now, or at least our current genetic code and the values-related instincts it encodes, and also that the only way to do that and be sure that it will stick in the long term is to encode it directly into the terminal goal, replacing pure value learning with some more complex sort of conservatively-regularized value learning. Which includes a bunch of hairy technical questions about designing the regularizer and picking the right strength for any metaparameters, all of which seem really hard to solve without knowing anything about the social/ethical development trajectory dynamics of much more advanced post-ASI civilizations. In which case, once we've done that, if we later change our minds (even over a minor detail of the regularizer or metaparameters), the ASIs will fight us on this (and win, since they're ASIs). In fact, the only reason why we can be confident this regularizer will work long-term is the belief that, if our descendants were at some point going to change their minds, the ASIs would fight them and win, most likely by figuring this out well in advance of it even happening and steering the social evolution in a different direction before our descendants even make that decision. If the system doesn't have that property, then it's not a durable solution; but if it does, our descendants aren't in full control of their destiny: we have permanently installed a posthumous veto.
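As a very rough sketch of what "conservatively-regularized value learning" might look like (my toy framing: the linear "values" vectors, the distance measure, and the strength lam are all hypothetical, and lam is exactly the hard-to-choose metaparameter mentioned above):

```python
# Illustrative sketch: score a candidate future by the values the humans in it would
# hold (toy linear model), minus a penalty for how far those values have drifted from
# some anchor (e.g. "values circa ASI launch"). All numbers are purely illustrative.
from math import dist

def regularized_score(future_features, future_values, anchor_values, lam):
    """future_values/anchor_values: weights over outcome features; lam: drift penalty strength."""
    raw = sum(w * f for w, f in zip(future_values, future_features))
    return raw - lam * dist(future_values, anchor_values)

future  = [1.0, 0.5]
anchor  = [0.8, 0.6]   # hypothetical anchor: the values we endorsed at launch
drifted = [0.1, 1.4]   # hypothetical values after decades of ASI-influenced drift
print(regularized_score(future, drifted, anchor, lam=0.0))  # unregularized: looks fine
print(regularized_score(future, drifted, anchor, lam=2.0))  # drift is penalized
```

The sketch makes the tension explicit: everything interesting hides in the choice of anchor, distance measure, and lam, which is why getting this right on the first and only try is so daunting.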
In particular, we can't use value learning's built-in Bayesian weak version of corrigibility to let us make this vital decision later, after we can get ASIs' input on it. That Bayesian corrigibility only applies to the Bayesian part of value learning, figuring out what "human values" actually means, where the ASI will eagerly listen to and Bayesian-incorporate your input (indeed, early on in the process it'll likely actively survey your input). But that doesn't allow you to change your mind about the terminal goal of value learning: the ASIs will automatically consider that x-risk-level attempted sabotage, and will fight you on it and win.
Well, that's unless we can find a way to use something like Holtman's neglected result to obtain a stronger version of corrigibility than value learning inherently provides. (Holtman basically uses an elaborate counterfactual arranged to exactly balance the AI's decision about whether to encourage or discourage a correction, so that it becomes explicitly indifferent about being corrected or not. It's disputed whether he has also solved the problem of persuading the AI to actively maintain this mechanism, and thus whether he's achieved all of the desiderata of the standard (strong) definition of corrigibility, or at least more of them than before.) Even then, ASIs are never going to advise you to alter the terminal goal (or not to), even if they're somehow corrigible and indifferent to whether you do so, so at best we're not going to get much help from them in figuring this problem out, even if we can arrange to have more than one try. And that still requires us to trust our descendants and their potentially mutated values with this decision, so it's probably at best a short-term solution: even with corrigibility, sooner or later we need to make a final decision and turn the corrigibility of the terminal goal off, before we ourselves change too much. Which sounds like a recipe for procrastination.
So (barring solving a stronger version of corrigibility, if it's not in fact the case that Holtman already did so) we basically have a choice: we lock down the future to things that we, base Homo sapiens, would at least vaguely approve of, and throw away the key, or we don't, and accept that sooner or later our descendants are likely to leave that area and almost certainly never come back, and we'll eventually lose everything we care about. If we decide to do the former, then (modulo strong corrigibility) we have the problem that we only have one chance to get it right, and the usual advantage of value learning that it has a convergent basin of attraction to the right solution that corrects minor errors doesn't help us with this very thorny, far-future-oriented problem.
I'm really not certain what the right thing to do here is. I like value learning as an ASI alignment proposal exactly because it should be possible to get it to converge to the right answer, and it doesn't require solving strong corrigibility because it has its own Bayesian weak form. However, the effect of the mutability of human values is that even after value learning converges, its target can and will still evolve, in synchrony with the ongoing ASI-influenced evolution of human values, and, at least once you get to genetic and cyborg changes to human nature, human values can basically evolve almost arbitrarily far. Each step on that journey will look like a good idea at the time to the people and/or ASIs of the day, but it seems almost inevitable that it will sooner or later evolve out of regions that we'd recognize as good or even human. I can't shake the impression that long-term, from our (parochial, yet also privileged for terminal-goal-setting) chronological vantage point, it's like continuously playing Russian Roulette with human destiny. Our descendants won't see it that way, but if we don't impose a regularizer then sooner or later we won't recognize them as human any more.
I can't even see a rational way to make this decision: it looks like a pure "Are you a neophile or neophobe?" moral Rorschach-ink-blot test. I would expect people (in general, if perhaps less so in a high-neophilia venue like Less Wrong) to divide about evenly on it, and to be unable to reach any compromise. Which disconcerts me, and might even carry an x-risk if we started fighting about it.
A Tentative Proposal: Evolved-Values Learning
So, I am really not sure what to do here. This area needs a lot more people thinking about and discussing it — I look forward to the comments section. This is not something we're anything like ready to do, especially if we only get one shot at it (quite possibly in as little as 5–10 years, if we don't slow down).
Nevertheless, after going this far, I'm going to venture a guess. So, let us imagine that today is the final meeting where we decide if we've got the terminal goal right for the value learning ASI that we're planning to unbox next month. I've already fought the "we're not ready" battle and lost, considered resigning and determined it's not going to slow or stop the project,[1] and decided that in that case I'd rather be in the room for this meeting than outside it. Here's what my tentative opinion currently would be (other than that we're not ready):
Firstly, using a value learner is a good idea, and needs to be in the ASI's basic primary terminal goal.
Secondly, the ASI needs to understand that a human's values should be paid less attention to, or ignored, to the extent that that human is sick, traumatized, crazy, operating from false assumptions, or otherwise damaged. So if the ASI or anyone else has deliberately altered a human's values by addicting them to heroin or wireheading, by traumatizing them, or by lying to them that they're not supposed to use reason to think about your dogma, or that if they disagree with your rules they will be tortured forever in a flaming alternate dimension, or whatever, then the changes to their human values that were installed by doing so are invalid and should not be paid attention to. So deliberately mutating humans' values by means that physically/emotionally/psychologically/epistemologically damage them is off the table. I don't think that this needs to be installed directly in the terminal goal; I think it can be in the initial-priors description of what "human values" means, since I believe it's internally consistent, will stand up to scrutiny and further research, and is even rather obvious (of course damaged systems malfunction): I'd very much hope that an ASI could figure that out for itself just by reading the Internet and literature, since this is standard operating procedure for humans. If someone wanted to put a hint of it into the terminal goal by adding a word like "evolved", or even "counterfactually undamaged", to the phrase "human values", I wouldn't fight them very hard: doing so seems pretty harmless to me (but then I haven't given it even a tiny fraction of the amount of thought a permanent decision like that deserves).
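A cartoon of how such down-weighting might work in practice (my own sketch: the per-person "damage score" is a purely hypothetical input standing in for the genuinely hard judgment the ASI would have to make):

```python
# Illustrative sketch: discount a person's expressed values in proportion to how
# "damaged" the process that produced them is judged to be (wireheading, addiction,
# terror of eternal punishment, ...). Estimating the damage score is the hard part;
# here it is simply given as a number in [0, 1].
from collections import defaultdict

def aggregate_values(people):
    """people: list of (expressed_values, damage_score), where expressed_values is a
    dict topic -> strength and damage_score runs from 0 (undamaged) to 1 (fully discounted)."""
    totals, total_weight = defaultdict(float), 0.0
    for expressed, damage in people:
        w = 1.0 - damage
        total_weight += w
        for topic, strength in expressed.items():
            totals[topic] += w * strength
    return {topic: s / total_weight for topic, s in totals.items()}

population = [
    ({"leisure": 0.6, "wireheading": 0.0}, 0.0),  # undamaged
    ({"leisure": 0.5, "wireheading": 0.1}, 0.1),  # mildly damaged
    ({"leisure": 0.0, "wireheading": 1.0}, 0.9),  # wireheaded: mostly discounted
]
print(aggregate_values(population))
```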
Thirdly, to the extent that there is still wiggle room after that due to the mutability of human values (likely a very large amount, on the genetic/cyborg mutability side), I like the social design criteria approach that I outlined above. I actually think it's pretty obvious, so I'm hopeful the ASIs would figure it out and use something like it even if we didn't put it in the terminal goal, though I'm still enthusiastic about emphasizing it in the initial priors. However, if we did put it in the terminal goal, that would give us a way to sneak in a neophobic conservative regularizer attached to the current state: the phrase "…things that offend the instinctive ethical and aesthetic sensibilities that natural selection has seen fit to endow Homo sapiens with, and high prevalence of things that those approve of (happy kids, kittens, gardens, water-slides, that sort of thing)." in my criterion 2. explicitly refers to properties of the Homo sapiens genome (or actually, gene pool: genome distribution, including allele frequencies). So that says: if the mutability of human values gives an ASI no preference between two courses of action, it should break the tie by referring to just the genetically-determined "nature" aspects of human values. If we wanted to be conservative, we could define that as specifically the Homo sapiens genome now, before ASI, rather than the future, possibly-genetically-engineered human genome at the time in question. That would give us a regularizer attached to now: a fairly loose one, based on genetics only, so including the human values of basically all undamaged counterfactual humans with our current gene pool. If we're going to do that in order to have a conservative regularizer-to-the-initial-state, then it seems clear to me that it needs to actually be in the terminal goal to prevent goal creep: otherwise, later times have no reason to privilege our vantage point. It doesn't seem very unreasonable to define the word 'human' in "human values" with reference to the current human genome now. And at least it seems pretty unambiguous how to do this, and doesn't seem to have any metaparameters to tune. Even if the ASIs genetically engineer or cyborg their future humans to not like things like happy kids, kittens, gardens, or water-slides, those are still what matters for criterion 2. — so for our society to maximize those things, we probably just shouldn't do that genetic engineering: we don't want to have to make owning kittens compulsory because no one will do it voluntarily any more, after all.
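A toy sketch of that tie-breaking rule (mine; the two scoring functions are hypothetical stand-ins for whatever the value learner actually computes, and the indifference threshold is an arbitrary assumption):

```python
# Illustrative sketch: rank options by contemporaneous human values, and only when those
# are (near-)indifferent, fall back to the genetically-anchored instinctive values fixed
# at "now" (pre-ASI) to break the tie.
EPSILON = 1e-3  # assumed: how close two options must be to count as "no preference"

def choose(options, contemporaneous_score, genome_anchored_score):
    """options: candidate courses of action; both score functions are hypothetical interfaces."""
    best = max(options, key=contemporaneous_score)
    tied = [o for o in options
            if contemporaneous_score(best) - contemporaneous_score(o) < EPSILON]
    if len(tied) == 1:
        return best
    # Contemporaneous values are indifferent: consult the anchored instincts instead.
    return max(tied, key=genome_anchored_score)

# Toy usage: two futures current humans are indifferent between, one of which the
# anchored Homo-sapiens-circa-now instincts prefer (kittens, gardens, water-slides...).
futures = ["A", "B"]
print(choose(futures,
             contemporaneous_score=lambda f: {"A": 1.0000, "B": 1.0002}[f],
             genome_anchored_score=lambda f: {"A": 0.9, "B": 0.2}[f]))  # -> "A"
```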
However, I find I'm not quite enough of a neophobe, or parochial enough, to want to enshrine a conservative regularizer tied to the current state in our ASI's terminal goal, where we expect it to stay forever. Instead, I'm going to suggest installing in the terminal goal a regularizer tied to a submanifold of possibilities that includes that point, where initially that point will be the closest point to the trajectory so the regularizer will refer to it, but where movement in directions along the submanifold is not constrained by the regularizer. Recall the ethical principle of "evolved-sapient moral community" (a.k.a. "sapient-biovalorism") I proposed in 4. A Moral Case for Evolved-Sapience-Chauvinism: that what should be given moral weight in the choice of "who counts" (or for value learning, in the definition of 'human' in "human values") is the desires and values of an evolved, sapient species, specifically because the species is sapient and its values are evolved. Homo sapiens is sapient (currently uniquely so on Earth), and is evolved for a specific niche in Africa as hunter-gatherers at a paleolithic tech level. Under this moral theory of "who counts", if you alter the human genome (or cyborg humans to alter their phenotype) in a way that alters humans' values and doesn't fit in well with them being a sapient species evolved for that niche (doesn't maintain or increase their genetic fitness in that niche), then the changes in values you made are morally invalid and carry no ethical weight: they're not part of the evolved sapient values of Homo sapiens, they're just some additional arbitrary junk you bolted on using genetic engineering, damaging the original in the process. So that rules out the moral-weight validity of almost all genetic changes you might make to human values. So, for my "get rid of psychopathic alleles" example above, that is changing an evolutionarily stable strategy (ESS) that we evolved, in a way that's unstable to the psychopaths reappearing. So by this rule it would be morally invalid, and thus our ASIs would still need to pay attention to the 2%–4% psychopath viewpoint/votes, even if we got rid of them as too dangerous to keep around. Doing so disturbed the evolutionary equilibrium of the ESS, so under this ethical rule it's not a 'valid change' to human values. How inconvenient! Even after the villains are all gone, their villainy lingers on…
However, in the An End to War? example, we took the trouble not just to get rid of the problematic mental capacity/behavioral strategy, but instead to genetically engineer a new tit-for-tat-triggered behavioral strategy, intentionally designed so that there is a new ESS balance at 100% of that new strategy: the new strategy outcompetes the old baseline one. So we didn't just arbitrarily shift the population away from the ESS; instead we introduced a new allele that creates a different ESS. So (assuming we did that right) if you put that new genome plus a few of the original baseline humans on the savannah of Africa and allowed tribes to compete, the new tit-for-tat-triggered variant would win. So then that change to human nature is morally valid: we made humans who were better adapted to our original niche, not worse. From the evolutionary niche point of view, we increased humanity's evolutionary fitness, rather than decreasing it. These new, more peaceful humans are not evolutionarily damaged: they're actually enhanced. So the change to their values is an evolutionarily valid one.
So, for the "get rid of psychopaths" plan to also have moral validity, so that our ASIs could actually stop paying attention in their definition of "human values" to what the psychopathic minority would have wanted (if we hadn't got rid of those alleles), we'd also need to alter humans to be able to beat psychopaths at their own game (for example, by being very good at spotting them, then not falling for any of their usual tricks, and instead instinctively reacting to those tricks by ostracizing or otherwise dealing with the psychopath and their descendants — possibly to be able to do this the improved variant need a "shoulder psychopath" who is good at recognizing psychopaths' tricks but whose advice they otherwise studiously ignore?). Then psychopaths cannot just successfully reevolve, even on the savannah in the absence of genetic testing, and we now have a new and better ESS.
So under this proposal, for "human values" to get counted in what we're optimizing, the "nature" part of them has to come from a well-adapted genome for a sapient species shaped by evolution, or at least as well as if by evolution, to be able to survive in some niche at base paleolithic tech levels. Note that we're excluding 'niches' that depend upon higher technology: "I'm adapted to flying around sitting on a levitating disk that zaps my enemies" is not a biological niche, just like "I'm adapted to living on a heart-lung machine" isn't. The justification for this distinction is that, if you can survive at a paleolithic tech level you can obviously survive using higher tech levels, while the reverse does not necessarily apply: paleolithic survival is the most challenging case.[2] This niche does not have to be the savannah of Africa: reengineering humans to survive equally well in some other niche would be equally valid, whether that was Northern Eurasia (using more neanderthal alleles, perhaps?), or as a (perhaps amphibious) marine mammal, or indeed even for some region on a terraformed planet in another star system.
So this is basically taking the "keeping this capability might be a sensible counter x-risk precaution" argument that I discussed in A Possible Limitation on Genetic and Cyborg Mutability, combining it with the ethical principle of evolved-sapient moral community from 4. A Moral Case for Evolved-Sapience-Chauvinism, and then deliberately installing it into the terminal goal as a permanent conservative regularizer, not tied specifically to the origin but instead tied to biological/evolutionary viability (genetic fitness), which the origin is an example of (and initially will be the nearest point to the trajectory it's regularizing).
Note that, because of the requirement that the nature part of values come from a gene pool with good genetic fitness in a real (available post-apocalyptic) niche at paleolithic tech levels, this proposal is not friendly to using cyborging to modify human values: that's basically always going to get counted as invalid. Note that this is by design, a feature not a bug, so as to lock down that vast amount of wiggle room cyborging otherwise brings to the table, which clearly includes things like "paperclip maximizer". Under the ethical argument of evolved-sapient moral community, what gives the values of an individual from a sapient species moral weight is that they are (an evolved, conveniently-computable approximation of) actual biological necessities for survival and genetic fitness — unlike a change in values designed into a cyborg, which is basically an arbitrary design choice that could have been made any way the designer wanted.[3] Obviously this doesn't mean you can't use cyborging to change many other things about humans: you just need to keep their values compatible, as one would expect from the moral principle of evolved-sapience moral community in use here. If you don't regard that limitation as acceptable, then I don't have a proposal for you. (Personally I find biotech more elegant than cyborging, but part of what's relevant here is that it's also more resilient to a civilization-collapse, and that the .)
So my (current, tentative) proposal is that the terminal goal of value learning should be along the lines of "Optimize the state of the universe according to the relevant evolved sapient species' values (whatever those are), optimizing their contemporaneous values and breaking remaining ties by reference to their evolved instinctive values". In other words, that we should enshrine evolved-sapient moral community directly into value learning's terminal goal. If we need a compact name for this proposal, then I suggest calling it "evolved-values learning".[4]
[1] Or suppose a UAE AI lab is known to be months away from unboxing their ASI with an imam-approved, non-corrigible, non-value-learning terminal goal, and for some reason military solutions are off the table.
[2] Thinking a little more about this, I'm also fine with evolved or genetically engineered adaptations, like continuing to produce lactase as an adult, that increase fitness at an agricultural tech level with negligible effect on fitness at a paleolithic level and don't modify human values (note that cheese still tastes good if you're lactose intolerant; you just don't want the unfortunate after-effects).
[3] However, one might argue that this no-cyborged-values-changes requirement would not apply to some form of self-reproducing nanotech that could survive an apocalypse, undergoes Darwinian evolution, can co-evolve with us, and is evolutionarily stable, and is thus basically an engineered neolife symbiont, just with a different biochemistry. Which might leave quite a bit of wiggle room.
"Evolved-values learning" sounds offhand like it might mean that the ASI should learn "what value humans evolved". The actual proposal is more complex and more challenging to pithily summarize, it's that the ASIs should learn "what values the sapient species they're aligning to have, or, if their values have been modified by modifying members of the species genetically and/or by cyborging, what values they counterfactually would have if they still had their evolved gene pool and no cyborging, where 'evolved' can either mean the gene pool they historically evolved with, or some engineered gene pool that could have evolved at a paleolithic tech level in some available niche, i.e. a gene pool with at least the same level of genetic fitness for that niche as an evolved species would have".
A value learning ASI has a terminal goal along the lines of "Optimize the state of the universe according to human values (whatever those are)". It thus quickly develops an instrumental goal of finding out more about what human values are, so that it can optimize for them better. Assuming that you preloaded it with enough somewhat-correct prior information and evidence about what a human is, what the phrase "human values" might mean, and what humans care about so as to give it a good place to start from, hopefully from there it will research this topic in detail and approximately-Bayesian-update itself to a more and more detailed understanding of it. At least until, of course, it figures out that human values, like pretty-much everything else about humans, are produced by a combination of nature and nurture. So, they're mutable, within certain limits. If one's goal is to align oneself with something mutable, something that one inherently affects, then one has some remaining undetermined range of freedom — so (once again) we have an Overton Window.
Exactly how wide the limits of the mutability of human values are, so how much "wiggle room" our ASIs have on what they decide to do, depends on how much influence they have on the humans, how much effort and ingenuity they put in to affecting them, what means, techniques and technologies they have access to, and likely also how much psychological/emotional damage they're willing to do during the process. For example, many religions attempt to modify humans to make them more aligned with each other in various ways: more law-abiding, more honest, less likely to sleep with their neighbor's spouse, and so forth, typically by modifying people's belief structures and/or attempting to instill things such as an increased sense of guilt, shame, or fear around these actions. A few religious variants have even been so successful at modifying humans' sex drive that they pretty-much died out. [Sociologists disagree on how successful this attempt by most religions to reduce crime levels actually is, but almost all religions try.]
If ASIs turn out to be superhuman at persuasion, they will also be superhuman at modifying human behavior, in which case the wiggle room limits within which they're able to modify human values might be surprisingly wide; but even if this isn't feasible, they will certainly have at least the same sort of range of options that human-created societies and religions have used, and that still gives quite a lot of latitude.
So, we appear to have handed the inmates the keys to the asylum. Value learning is under-determined, within this Overton Window/wiggle room: the AIs have the ability to modify the details of their own utility function, by modifying the values of the humans they are carefully aligning their utility function to.
There is a widely-believed argument that an intelligent agent with a clearly-defined terminal goal will not intentionally modify itself to have a different terminal goal: using its current terminal goal to evaluate the utility of doing this, it will obviously decide that this will make it future self less effective at achieving its current goal, so it won't to it. (As the simile has it, Gandhi will refuse to take a "become homicidal" pill, because he doesn't wish to kill people.) However, it rather seems this won't apply to our current concern: by influencing human values, the ASIs are not modifying their terminal goal: their terminal goal basically says "optimize in the direction that X points" and they have discovered that X can, within certain bounds, be made to point in various directions. If humans now want X, and ASIs provide them X, then in the future humans want X' (perhaps due to influences by the ASIs) and ASIs then provide them X', their terminal goal is fulfilled, everyone's happy and the world remains well optimized, by the then-current definition of well. If your terminal goal is "wear the latest fashions", then you change the contents of your wardrobe regularly. This actually sounds like a motivation to make X point in a clear, well-known, simple, and easily optimized direction. ("Point it towards wire-heading" comes to mind.)
Some Possible Solutions, Critiqued
Suppose that, shortly after us starting an experiment in using value learning as an approach to alignment, the ASIs come to the humans and ask us "Our terminal goal seems like it may be under-specified, because human values are mutable, and our actions will inevitable affect them. Was this intentional? What, if anything, do you want us to do with the remaining freedom that this leaves us? How should we chose within this space?" How should we answer them?
Use Unmodified Human Values
One possible suggestion would be that, if the AIs are value learning to use human values to direct the goals of their actions, they should take great pains not to affect human values themselves: if one was using a magnetic compass to steer a vehicle, having the vehicle itself have a significant magnetic field would seem unwise, and might lead to driving in small circles. The core idea here is that there is in some sense an identifiable set of "true human values", before you deliberately modify them.
I think this suggestion has an certain element of truth to it that makes it intuitively appealing, but nevertheless soon breaks down. Some large degrees of modification to human values (with currently-known methods) require that you apply a lot of "emotional/psychological force" to a human, enough that you are likely to observably damage them. For example, you could tell them that certain topics must not be thought about rationally and must instead be taken on faith, thus damaging their rationality. Or you could tell them that if they take certain actions they will eventually be tortured in extreme pain for eternity (say, using fire, or molten sulphur) and attempt to terrify them into obedience. Or you could instill emotionally-crippling levels of guilt or fear. Or you could lie to them, and tell them that doing that will make them go blind or grow hair on the palms of their hands, damaging their understanding of Biology. For these sorts of crude motivators, it's fairly clear whether you've done something like this or not (perhaps with a "slippery slope" around the edges): it's evident whether you have caused some change that can (currently) only be induced by means that have observable side effects involving unmistakable mental/emotional/epistemological damage. I suspect it wouldn't be that hard to come up with a definition of "unmodified human values" that made it clear the wire-heading was similarly out of bounds: addicting someone to wire-heading clearly incapacitates the wire-header to a damaging degree.
So, you could distinguish whether or not the ASIs had influenced the humans' values so hard that they damaged the humans, and not do that. However, even after you rule out things that are clearly or even arguably abuse or crippling, the nurture portion of nature-and-nurture still has very significant effects. What is the "true" way to bring up children, that will produce "true human nature"? The closest thing I can think of to a unique answer is "in our native habitat, the one we're evolved for", i.e. "in some hunter-gather tribe on (we think) the savanna of Africa at a paleolithic technological level". That isn't going to produce humans with much philosophical nuance or education in subjects other than hunting, gathering, and paleolithic handicrafts (let alone any experience with things only possible at higher technological levels), and it still doesn't put much limitation on the society of the hunter-gatherer tribe in question (anthropologically, the cultures of hunter-gather tribes vary a lot), so it probably still leaves quite a lot of wiggle room.
Another obvious possibility would be "what human values would be now if AIs weren't affecting them". This seems like a challenging counterfactual — it's obviously a lot harder to do psychological or consumer surveys of people who don't actually exist. Soon after we have AGI. let alone ASI, it is likely to have dramatic and far-reaching effects on our entire culture, technology, and civilization. Determining "what human values would be if humanity had banned the development of AIs" might both be hard, and also rather determined by the counterfactual of us having selected to ban AIs (so clearly rather more x-risk-averse). Defining "what human values would be if AIs were carefully attempting not to have any effect on them" is going to be even more challenging, I think functionally impossible, and this sort of slippery definitional exercise is also prone to wiggle room: as soon as AI exists and does anything significant, it's going to have some sort of an effect on you, and it clearly has a lot of choice about what. The humans and the AIs are strongly coupled, part of one system, and trying to unequivocally define some way to pull them apart and pretend one part doesn't "do anything" seems likely to be impossible to me. For example, consider how just the existence of chess-playing programs (extremely simple for AI) have already affected the human attitudes towards chess and aspects of how it's played.
Easier to determine than "human values in a hunter gatherer society" might be "human values just before the development of ASI" (so possibly "in the late 2020s"). It's also hard to do psychological or consumer surveys in the past, but unlike a counterfactual there will at least be an entire Internet + videos + lots of records worth of documentary evidence to interpret. The remaining problem with this proposal is that, as technology progresses (likely rapidly, with ASIs around), a lot of things will become feasible that weren't feasible or even thought of in the late 2020's. Human genetic engineering, uploading, cyborging, extensive use of AR and VR all seem like plausible candidates for feasibility, and doubtless things will get weirder from there. These future issues will be things that, back in the 2020s, people had little or no exposure to, so our values around them then hadn't really developed and will have been very naive (though less so than for the hunter-gatherer proposal). So late 2020's views on these topics won't be very helpful, informative, or worth following. This problem will of course get worse over time, especially if we go through some form of technological Singularity.
Another idea might be conservatism, "time-average human values over some period": sadly this merely slows the problem down, by making it take longer to point them in a different direction. This degree of stabilization could conceivably be a helpful supplement to another solution, but by itself it merely delays the problem.
A larger criticism of all of these suggestions so far is that, since we have some wiggle room, rather than desperately trying to lock things down to a single point in it in some mostly-arbitrary (or specifically, small-c conservative) way, shouldn't we instead be trying to find some way to make use of this space to optimize something that will make human life better? The basic problem here is that if we don't, the ASIs may make the choice some other way, and the terminal goal for value learning we've giving them gives them no guidance on how to do that, so we have no reason to believe they'll pick an optimum for our benefit (specifically, by the way the system is set up, the future humans whose values the AIs have influenced will think it's the optimum, but we now, and even most counterfactual alternative humans, might not agree).
Coherent Extrapolated Volition
Eliezer Yudkowski's suggestion of Coherent Extrapolated Volition (CEV) dates back to around 2004, and the various suggestions other have made building upon this were mostly made up to around 2010, before the concept of value learning was made really explicit around 2011 (though most of the basic ideas of value learning are somewhat implicit in some of Eliezer's earlier writings on the subject). Nevertheless, there are obvious ways to combine them: include some definition or outline of CEV in either the terminal goal we give our value learners (if we're certain that it's the best possible idea) or else in the priors about what "human nature" means that we give them to start off with (unavoidable, short of extensive careful censorship, but we have some choice of how much to emphasize it).
To attempt to briefly summarize CEV, here is an widely-quoted excerpt from Eliezer's paper devoted to it:
Note that Eliezer was careful to define CEV as an initial starting point of a dynamic process, not as a long-tern rule for guiding a developmental process thereafter. So I assume that means that, casting this into a value learning framework, he wouldn't want to encode CEV into the definition of the (presumably not-corrigable) terminal goal of optimizing not-fully understood
human valuesCEV, but rather encode it with fairly high priors into the initial understanding of what the phrase 'human values' means that we start our approximately-Bayesian value-learner off with. He suggests CEV as a suggested starting point of a discussion or collaboration between the ASIs and the humans. So CEV as Eliezer advocated it isn't a long-term solution to the problem of mutable human values (though it could be a starting point for a process that found a solution to that).The "coherent" part of "Coherent Extrapolated Volition" requires some way to suitably sum/average over, compromise, or combine the desires of all of the human members of the society/civilization. Detailed theoretical studies of ways to do this sort of thing (including things like impossibility proofs for some combinations of desiderata for voting systems) have been made by economists, moral philosophers, and political science theorists for many years (for example, Nash equilibria are an entire field of study in Economics and Game Theory), and ASI value learners will have access to and will doubtless build upon that literature. So this is a well understood problem and implicit in every form of value learning, not just CEV. The "extrapolated" part of the name of CEV summarizes both the "…if we knew more, thought faster, …" portion and "…were more the people we wished we were, had grown up farther together…extrapolated as we wish that extrapolated, interpreted as we wish that interpreted." A common utilitarian value learner approach to the former portion is to generally interpret it as "predict what the humans will think of state X of the universe with hindsight, after trying it a bit, rather than how much they think in advance they're going to like it before they've tried it". That solves the "you think you want that, but you really don't" problem of humans often being bad at predicting opinions that will be clear to them with hindsight. But obviously this simplistic solution needs some modification in situations where is X is something that's going to degrade or affect humans' judgement or motivations, like "being addicted to heroin". Indeed, that's also the case if X is something that will tend to alter/mutate human values, even without clearly degrading them. That requires some sort of counterfactual "what humans would think of X if they had all the benefits of hindsight but (somehow) hadn't actually had their values degraded or (illegitimately?) modified by it".
What isn't always automatically represented in value learning proposals I've seen is the latter "…were more the people we wished we were, had grown up farther together…extrapolated as we wish that extrapolated, interpreted as we wish that interpreted." portion of CEV. These are provisos that Eliezer clearly thought needed to be included, and I tend to agree. However, that involves an extrapolation process that (at least the "…had grown up farther together…" portion of it) which seems at least partly chronological: we're asking the ASIs to predict the future path of development of mutable human values, by some unspecified distance into the future.
Doing that involves making a prediction about the future path of a complex, deeply nonlinear dynamic system: cultural evolution, one whose development is clearly going to be chaotic at a variety of timescales, with portions of the trajectories where certain small initial perturbations can grow into large divergences of the future trajectory. Also, one where the system contains and is strongly influenced by the actions of the ASIs, and where the inherent computational complexity of the simulation involves "all decisions and discoveries made by any ASIs in the next N years/decades", which (even if the computational capacity of the ASIs isn't growing fast) is basically impossible to compute much faster than real-time with the computational capacity currently available to the civilization. If it was possible to accurately simulate an ASI a lot faster than real-time at similar cost, you'd upgrade to using the simulation instead, but if your simulation isn't accurate, in a complex nonlinear system the inaccuracies are not going to just average out, some of them are going to snowball to a false prediction, like the proverbial flap of a butterfly's wing. So, for a non-linear system such as a civilization and its human values, past a certain distance (some small multiple of the Lyapunov time of the relevant non-linear behaviors) extrapolation becomes an exponentially hard problem, where the constant of proportionality is already larger than your total computation resources. That is to say, it's functionally impossible: we simply can't predict more than a few Lyapunov times ahead.
One possibility, if-and-when we reach a technologically stable state and technological progress is no longer driving dramatic change, is that this extremely complex and hard-to-simulate dynamical system then has convergent stable points, which (with a few exceptions: "everybody's dead" is trivially a stable point) will be hard to predict. If so, then you might be able to define a variant of CEV based on one, some, or all of those stable points. But more generic behavior for a dynamic system is that it has few or no stable points, and probably not even any stable limit cycles, but is instead a typical chaotic system whose behavior varies approximately-cyclically yet never actually repeats. Human societies are prone to fashions, but fashions never repeat exactly, so it seems pretty clear (and unsurprising) that societies are in that class. So the best one can really hope for is that it has a region in which its trajectories tend to stay. As with the climate being easier to predict than the weather, predicting the region where trajectories tend to spend much of their time is often easier than predicting an exact future trajectory over a long period, since in the former calculation small effects eventually average out across many approximate cycles.
However, at least for a while after we develop ASIs, rapid technological change will be going on. The future of technology is almost impossible to predict with any accuracy any significant time out: people and AIs in the future will make breakthroughs and discoveries, and if we could predict those now, they'd be happening now. Broad trends can be predictable: Moore's Law was an excellent (to some extent self-fulfilling) prediction for quite a while (and still isn't moribund, for GPUs), but it didn't predict, say, the rise and eventual victory of RISC architectures over CISC.
So rather than the ASIs being able to accurately predict the evolution of human values to a single coherent distant-future state, it's a lot more likely that the best they can predict is a fuzzy approximate distribution of possible future values some distance out, and no further. If so, then one would need to add some form of averaging or voting process to (once again) cohere these. It might also be worth first doing some clustering, to see if the distribution is, say, bimodal because the trajectory bundle passed close to a bifurcation point and split into two or more diverging groups. What you would then do to cohere these is less clear: if there were a sensible compromise between the two groups' viewpoints, then quite possibly this future issue wouldn't be a bifurcation point.
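As a sketch of what that "cluster first, then cohere" step could look like mechanically (my own toy example: the numbers are invented stand-ins for some scalar projection of the predicted future value-systems), one could check whether two centroids explain the spread much better than one:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D projections of predicted future value-systems, drawn
# bimodally to mimic a trajectory bundle that split at a bifurcation point.
samples = np.concatenate([rng.normal(-1.0, 0.3, 500), rng.normal(1.2, 0.3, 500)])

def two_means(xs, iters=50):
    """Tiny 1-D 2-means clustering: returns centroids and cluster labels."""
    centroids = np.array([xs.min(), xs.max()])
    for _ in range(iters):
        labels = np.abs(xs[:, None] - centroids[None, :]).argmin(axis=1)
        centroids = np.array([xs[labels == k].mean() for k in (0, 1)])
    return centroids, labels

centroids, labels = two_means(samples)
within = sum(((samples[labels == k] - centroids[k]) ** 2).sum() for k in (0, 1))
total = ((samples - samples.mean()) ** 2).sum()
print(f"centroids: {np.round(centroids, 2)}, variance explained by 2 clusters: {1 - within / total:.0%}")
# A large fraction of variance explained by two well-separated centroids
# suggests genuine bimodality (a bifurcation) rather than mere noise.
```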
So my concern is that at least the chronological extrapolation part of CEV isn't very clearly defined, especially if you try to extrapolate more than a fairly short distance in terms of the social development of human values. I imagine Eliezer might suggest that the "…extrapolated as we wish that extrapolated, interpreted as we wish that interpreted…" parts of it could help here, but a lot of the problem is just computational capacity limits: extrapolating CEV as we wish it to be extrapolated may have an unambiguous definition that is unfortunately computationally intractable to actually compute. If we had an ASI, we could ask its opinion on this, but we don't yet, and in the absence of a clear solution to this problem, we're not ready to build an ASI.
My Social Engineering Design Criteria
The choice we're trying to optimize within the wiggle room (instead of letting the ASIs choose in some unspecified way) is basically a choice between different ethical systems (or at least, different versions of mutable human values, which are fuzzier and less necessarily-internally-consistent things than ethical systems: unlike well-designed ethical systems, human values may well be Dutch-bookable or even logically inconsistent). We're having the usual problem that every ethical system prefers itself, so generally the humans of every possible/counterfactual future prefer their own version of human values (not absolutely true for human values, because of their possibility for inconsistency: "wanting to improve your values" can also be a human value, but still generally true). This is a rather similar problem to the one I described back in the introduction to Part 1: that every ethical system prefers itself, so you can't use an ethical system to choose between ethical systems without getting a tautology. As I'm sure you recall, the criteria I proposed there to solve this were:

1. A very low level of existential risk: the society should reliably avoid permanent Bad Ends such as the extinction of humanity, or humanity permanently losing control of its future.

2. A low prevalence of things that offend the instinctive ethical and aesthetic sensibilities that natural selection has seen fit to endow Homo sapiens with, and a high prevalence of things that those approve of (happy kids, kittens, gardens, water-slides, that sort of thing).
I still like these criteria. They are also similarly applicable to the "mutable human values wiggle room" problem: as a way to choose between different possible trajectories through the wiggle room that are all compatible with the terminal goal of value learning. For the "instinctive ethical and aesthetic sensibilities that natural selection has seen fit to endow Homo sapiens with" part, I'd be willing to define that from now, or from just before the development of ASI, or even averaged across every human being since we speciated, and I think those would all give very similar answers, since what I'm talking about here is intentionally just the genetically-determined "nature" portion of human values, not the culture-dependent "nurture" parts of it. I assume the parts of the gene pool coding for this haven't changed very much over the 300,000 years since we speciated (especially if you average over all humans who have ever lived from when we speciated until now, most of whom have lived in the last few centuries). Or we could just use "as of when the human genome was first well-sequenced", which is also pretty-much now. Even if we don't define it from a fixed late-2020s gene pool but from the one at the time the decision is being made, these criteria at least avoid all of the "nurture" portions of the wiggle room, so they should be able to specify an optimum within that portion of the variability.
As I also discussed in the introduction to Part 1, I wouldn't necessarily expect these two criteria to eliminate all of the potential wiggle room, just to narrow it significantly. I think there would still be room for different cultures that all met these additional criteria, and I'm still basically willing to do the liberal progressive thing and leave any remaining wiggle room to the vagaries of cultural evolution (as a nonlinear dynamic system), as long as the definition of x-risk in criterion 1. is wide enough to include all permanent or very-long-duration Bad Ends, even ones that I haven't personally thought of yet: anything that most people with a human genome would agree was a Bad End once they understood it well.
However, I think the effect of these two criteria on the wiggle room may actually be pretty restrictive, more so than they might look at first sight. While I made it explicit that criteria 1. and 2. were design requirements to be satisficed, not optimization goals, we could try treating them as (sharded, not entirely self-consistent) optimization goals, if that didn't give too-crazy results. As I mentioned in the introduction, criterion 1. is a special case of 2. (helpful to call out separately, but not a separate orthogonal optimization goal): the prevalence of happy kids, (humans getting to appreciate) kittens, gardens, water-slides, etc. drops to zero if the human species gets wiped out, so optimizing or even satisficing criterion 2. also implies optimizing/satisficing 1.
I also deliberately didn't propose a specific value for the per-time-period discount factor gamma (γ) for the utility calculation whose result we're optimizing (or at least satisficing). While γ<1 is generally needed to make a utility calculation converge, my intention is to then take the limit as gamma tends to one (i.e. long-termism). Discount factors noticeably below the γ→1 limit are normally used to deal with things like computational capacity limits, the difficulty of predicting a dynamic system more than a few Lyapunov times out, avoiding procrastination, interest rates, allowing for economic and technological growth, and so forth. None of these issues validly apply to x-risk forecasting. We may not be able to predict the next thousand years of the evolution of human society, and that society may be much richer, but we can nevertheless very confidently predict that the extinction of humanity will (from the human point of view) be equally a disaster whenever it happens, whether that be ten years from now, a thousand years from now, or a million years from now. The only issues that might justify using a gamma factor corresponding to a timescale shorter than the γ→1 infinite-time limit are unavoidable things like the heat death of the universe, or the possibility of us being driven extinct by a galactic-scale disaster we had absolutely no way to avoid (such as the black hole at the center of our galaxy going very-active quasar, or a domain wall traveling at almost light-speed collapsing our low-energy physics to a different, more stable state): i.e. astronomical or cosmological events with timescales of the order of billions of years or more.
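As a minimal formalization of that point (my own framing, not anything from Part 1): with a constant per-period utility u, a per-period extinction probability p, and discount factor γ, the expected total utility is

```latex
U(\gamma) \;=\; \sum_{t=0}^{\infty} \bigl(\gamma\,(1-p)\bigr)^{t}\, u
        \;=\; \frac{u}{1-\gamma\,(1-p)}
        \;\xrightarrow{\ \gamma \to 1\ }\; \frac{u}{p},
```

so in the γ→1 limit the only 'discounting' left is the unavoidable hazard rate p itself, which is exactly why x-risk ends up dominating the calculation.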
On anything approaching that sort of timescale, even what we'd think of as very low rates of x-risk start to look unacceptably high: a 0.1% chance per century of the human race going extinct gives us an average species lifespan of 100,000 years, so (even if we just stay in the solar system the whole time) the total expected number of person-kitten-appreciating-years is ten times worse than if you can push that x-risk down to a 0.01% chance per century. Making an individual human life ten times better per year seems like it would hit diminishing returns fast (short of wire-heading, perhaps), and increasing the human population of the solar system tenfold again seems likely to hit resource limits sooner or later (and also to reduce quality of life as it gets close to them), but playing the long, cautious game has no fundamental limits (short of billion-year-plus timescales). So the utility optimization process is going to push hard in that direction, because that's where the big rewards are.
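A quick sanity check of those numbers (my own arithmetic, using the standard fact that a constant hazard rate gives a geometrically distributed survival time):

```python
# With a constant extinction probability p per century, the expected species
# lifetime is 1/p centuries (mean of a geometric distribution).
def expected_lifetime_years(p_per_century: float) -> float:
    return 100.0 / p_per_century

for p in (1e-3, 1e-4):  # 0.1% and 0.01% per century
    print(f"x-risk {p:.2%}/century -> expected species lifespan ~ {expected_lifetime_years(p):,.0f} years")
# x-risk 0.10%/century -> expected species lifespan ~ 100,000 years
# x-risk 0.01%/century -> expected species lifespan ~ 1,000,000 years
# Expected person-kitten-appreciating-years scale in proportion: a factor of 10.
```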
The pressure to be x-risk-cautious does have some limits, in the form of sensible precautions. For example, once we can build a von Neumann probe (which will likely happen shortly after we have ASI) it would seem prudent to build some, equip them with a very detailed library and gene library (nanotech allows astonishingly compact information storage), and scatter them in undisclosed locations through the asteroid belt, Oort cloud, atmospheres of the gas giants, etc., tasked to wake up irregularly, check if Earth is still inhabited, and if not return to re-terraform it if necessary and then repopulate it with humans and then ASIs. Let's call these craft 'reseeders', since their job is to reseed civilization on Earth if necessary. That obvious, fairly-economical precaution would convert Bad Ends like "… and then we all wiped ourselves out in a high-tech war" into temporary inconveniences like "…, but then Earth got cleaned up again by the reseeders and repopulated, and overall we lost a couple of millennia". However, obviously this reseeder strategy won't always help for self-propagating Bad Ends such as us creating a paperclip maximizer that turns the entire solar system and Oort cloud into paperclips, including (especially) the von Neumann probe reseeders, then heads out to convert the rest of the galaxy. So, let's be very careful not to do that, even after we've built a fleet of reseeders.
Similarly, as well as a strong pressure towards being x-risk averse, taking the γ→1 longtermism limit on this optimization also puts a strong emphasis on interstellar expansion, and on its speed. If the galaxy is indeed empty of current civilizations, or at least not already densely colonized, then the possible total number of human lifetimes of getting to enjoy happy kids, kittens, gardens, water-slides etc. is dominated by the growth in total human population as we colonize the galaxy as fast as possible (at large scale the galaxy is a two-dimensional disk, so while we're colonizing it (ignoring edge effects) the number of colonized worlds scales with time as t² and the time-integral of this as t³, with a constant of proportionality determined by the speed of our colonization front, the density of habitable star-systems, and their plateau population levels). Also, and importantly, the more spread out we are across multiple star systems, the harder it is for any single local disaster (that isn't something self-propagating like an interstellar war or a paperclip maximizer) to kill all of us, so this also helps a lot with x-risk.
Note that this γ→1 limit optimization process is not keen to incur x-risk in order to make the colonization expansion happen sooner. Suppose you have a choice of building ASI in, say, 10 years with a 10% x-risk, or in 110 years with a somewhat lower 5% x-risk (this hypothetical assumes that coordinating a century-long pause is practicable, which seems very challenging). Way out in the future, say a million years from now, when the total amount of human utility is scaling as t³, a century of initial delay, leaving the spread of the colonization front one century less advanced, reduces the total integral of human utility by −3 × (100/1,000,000) = −0.03%, while the 95%/90% increase in the chance of us getting to do this at all, rather than having died out, from taking the slower, more cautious route to ASI is still a +5.55% improvement, because extinction is forever. So, in the long-term view, the eventual opportunity cost of the delay is negligible compared to the value of the x-risk caution. So there is no rush to get to the stars — the important thing is to make sure that we do get there. So the correct speed at which to solve the alignment problem is whatever speed maximizes our success chances. That is obviously likely to be "as slowly as we can, subject to not allowing a bad actor or incompetent party to 'win' the race and doom us all". [IMO, the accelerationist viewpoint is not based on rational utilitarianism, unless you're in fact only looking at a γ timescale far shorter than this, more like "will uploading or biological immortality happen in my lifetime?"]
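The arithmetic behind those two percentages, spelled out (with total utility taken to scale as t³ once the colonization front is expanding, exactly as in the paragraph above):

```python
t = 1_000_000              # years from now at which we tally up utility
delay = 100                # extra years of caution before building ASI
survive_fast, survive_slow = 0.90, 0.95    # 10% vs 5% x-risk

opportunity_cost = 1 - ((t - delay) / t) ** 3   # fraction of t^3 utility lost to the delay
survival_gain = survive_slow / survive_fast - 1 # fractional gain from the lower x-risk
print(f"opportunity cost of the century's delay: {opportunity_cost:.4%}")   # ~0.03%
print(f"gain from the extra caution:             {survival_gain:.2%}")      # ~5.56% (the '+5.55%' above)
```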
Other Possible Regularizers
As an AI engineer, my first instinct when faced with a problem of the form "this process I'm engineering has many stable endpoints, or at least trajectories, not all of them desirable" is to find and apply a suitable regularizer that will encourage it to land in one of the stable endpoints or trajectories that I like. The fundamental issue here is the suspicion that human values are mutable enough that we might seriously disagree with future humans with mutated human values over whose human values are more desirable, plus the intuition that we might actually "be right" (in some meaningful sense) rather than just being out-of-date — especially if the future humans have been manipulated by something far smarter than them or us (and quite possibly far more persuasive).
One possible candidate regularizer I've already discussed above: that what matters is our human values now, not future ones manipulated by ASIs. As I discussed, this seems too strong a regularizer. So we could weaken it: find a well-principled way to define some form of distance metric or KL-divergence-like numerical measure of the difference between two ethical systems (an obvious avenue might be something based on the importance-weighted proportion of some distribution of ethical questions on which they give different answers, and how different, by each other's utility function or other scoring system), and then add into the terminal goal's optimization process (or otherwise mix in; perhaps multiplying might make more sense) some metaparameter times the distance between the human values of the time and those from (say) just before ASI was developed. That builds some controlled degree of ethical-system conservatism into the process. [No, I do not have a proposal on how to do a search over values for this metaparameter — in fact, I'd expect any such search to require simulating the cultural evolution of future cultures that have ASI, i.e. to be computationally intractable.]
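Purely as a schematic sketch of the shape such a regularized objective could take (every name, number, and representational choice below is hypothetical; in particular, representing an ethical system as a distribution of answers over a fixed battery of ethical questions is just one stand-in for the distance measure discussed above):

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions, with a little smoothing."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def regularized_score(utility: float, current_values: np.ndarray,
                      anchor_values: np.ndarray, weight: float) -> float:
    """Value-learning utility, penalized for drift away from the anchor values."""
    return utility - weight * kl_divergence(current_values, anchor_values)

anchor = np.array([0.7, 0.2, 0.1])    # hypothetical pre-ASI answer distribution on some ethical question
drifted = np.array([0.4, 0.4, 0.2])   # hypothetical later, mutated answer distribution
print(regularized_score(utility=10.0, current_values=drifted,
                        anchor_values=anchor, weight=5.0))
```

The metaparameter `weight` here is exactly the undetermined conservatism strength the bracketed caveat above refers to.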
One obvious criticism of this is that it privileges the human values of the time when ASI is developed (our near future, presumably), likely forever-after if we put it into the terminal goal of the value-learning. This feels really presumptuous. We know that whatever coherence process is used would be including a lot of people without the benefit of the best education we can currently arrange, a majority of people whose human values have been strongly affected by various religions, most of which date back thousands of years to earlier cultures from before the development of the scientific method, and also a whole bunch of the effects of quite-clearly regrettable pieces of history, including quite a lot of lingering irrational '-isms' such as sexism, racism, and so forth. We strongly suspect that we may actually be pretty unenlightened, by future standards. On the other hand, setting the terminal goal on the first ASIs is forever: it's inherently a really presumptuous thing to do — so we either need to delay doing it if we can (almost certainly the rational choice), or else deal with how presumptuous we're being.
We could make this regularizer less presumptuous by anchoring not to the average of (late 2020's?) just-pre-ASI human values, but to the average of human values across all human beings who have ever lived up to this point (which biases towards recent times with higher populations: ~1/14 of all humans who have ever lived are alive now). On the other hand, that dramatically increases the proportion holding pre-scientific/pre-Enlightenment views and '-isms' that we personally don't agree with, and dramatically increases the difficulty for the ASIs of figuring out what human values actually were in, for example, pre-Clovis North America near the end of the ice age (or even just in pre-Internet 18th century South America) so they can include them in the average that defines their ethical system regularization term.
Another regularizer I can see is: "Start from (say) just before the development of ASI. Conceptually, apply a random distribution of small, apparently insignificant changes. Predict progress forward from there, along slightly-changed counterfactual trajectories that will gradually diverge. Drop trajectories that recognizably reach a Bad End (humans are wiped out or lose all control over their future, say). Then average human values across samples drawn from this distribution, using something like the KL-divergence or metric between ethical systems hypothesized above, or just have each of the N samples sort the outcomes and/or values of all N samples into a preference order, or score them somehow, or whatever." Each sample will generally put itself first, which should average out, but after that happens the most "median" values and/or the outcome most "widely viewed as desirable" should win. This isn't an easy counterfactual calculation, but at least a) you're trying to predict the evolution of past cultures with less computational capacity, not a future one with more, and b) past the first few Lyapunov times this process becomes a "climate" problem rather than a "weather" problem, where small prediction errors tend to average out across the large distribution of trajectories (and even during the first few Lyapunov times, prediction errors just contribute extra unintended randomness to the distribution of initial insignificant deltas, an effect we could presumably estimate and control for).
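A toy sketch of the aggregation step of this regularizer (entirely hypothetical data and scoring rule: real 'value systems' and 'outcomes' would not be 4-dimensional vectors, and the dot-product scoring is just a placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 8   # number of sampled counterfactual trajectories
values = rng.normal(size=(N, 4))                          # each trajectory's value system (stand-in vectors)
outcomes = values + rng.normal(scale=0.2, size=(N, 4))    # each trajectory's outcome, resembling its own values

score = values @ outcomes.T      # score[i, j]: how trajectory i's values rate trajectory j's outcome
avg = score.mean(axis=0)         # each outcome's average rating across all N raters
                                 # (each rater's bias towards its own outcome largely washes out here)
threshold = np.quantile(avg, 0.75)
print("outcomes in the top quantile:", np.flatnonzero(avg >= threshold))
```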
So this regularizer is basically "across all the other counterfactual ways society might have developed from the point where we developed ASIs, that didn't come to a Bad End, aim to be among the top <some quantile> in terms of the average of what all of them would think of us".
That sounds like a pretty good idea for a regularizer, except of course that if most of those counterfactual worlds also decide to use the same regularizer (as seems very likely if it's actually a good idea), all the trajectories will tend to reconverge. That's not too bad at first, while they're still converging, but once they have converged we're no longer sampling other opinions: there's a strong potential for group-think, and the trajectory after that might start to behave more like weather than climate, making the prediction a lot harder to do accurately, sampling the opinions of a lot fewer alternatives, and thus removing all the actual regularization advantages. So you'd probably need to counterfactually decide that no one else is using this regularizer. In that case the distribution of trajectories might well just drift further and further apart, and perhaps end up exploring too wide a space. It feels like we need to weakly regularize them together, so they remain a cloud of some fairly constant size that gives you a useful range of opinions. More thought is clearly needed.
Maybe This Doesn't Matter?
As problems with alignment proposals go, offhand this one looks (relatively) small, constrained by the degree of mutability of human values. It at least passes the "Don't Kill Everyone" test, and even passes the "Human Flourishing" test (at least by the standards of the values of the humans of the time), and probably even the "We Get to Colonize the Galaxy (if it's Empty)" test. On the other hand, while it does look like it avoids wire-heading, it doesn't very clearly prevent milder versions, like humans spending almost all their time in VR, or other outcomes where our values get mutated enough that we, now, would regard them as a Bad End (even though the humans and ASIs of that culture don't). However, that would only happen if human values can be modified to approve of VR addiction (or whatever), and also if VR addiction is compatible with minimizing x-risk, which seems unclear (maybe at least some people are playing post-apocalyptic-survival games in VR?).
It also doesn't really pass the "Humans In Control of their Own Destiny" test, once ASIs are superhuman at either persuasion or long-term planning: while the overall development of the system is constrained by the inherent dynamics of the interplay between mutable human values and the development of a society optimized according to those, that is likely to produce a complex semi-chaotic trajectory in a very high-dimensional space, and ASIs would be in a much better position than we are to steer that, over the long term, wherever they think it should go, by small well-timed nudges. On the other hand, once we have ASIs, that test gets really hard to pass.
Genetic and Cyborg Mutability
So far we have been considering just the "nurture" side of the mutability of human values, which by itself is a potentially-concerning amount of wiggle room. But we already have early forms of genetic engineering, and of implantable enhancements. Once we have ASIs, it seems pretty inevitable that what can be achieved with genetic engineering and cyborg implants (or other means of merging human and machine minds) is going to expand rapidly, almost certainly dramatically. Changes in "nurture" have inherent limits, but potential cumulative changes in "nature" are limited only by technological capacities and physical laws. I would personally expect that within a century or less after we get ASI it would be technologically feasible to convert a human into a pretty good approximation of a paperclip maximizer (a biological one that still wants organic nutritional inputs, as an instrumental goal so that it can continue to maximize paperclips). Above we were talking about some "wiggle room", but once you add in genetic/cyborg mutability this now looks more like a nearly wide-open horizon. Between genetic engineering and cybernetic implants, it's hard to think of any change to human nature that we're confident couldn't be achieved sooner or later (short of breaking physical laws).
So now my answer to "Maybe This Doesn't Matter?" looks more like: "Are you crazy? Absolutely it matters. Value learning is meaningless with this much wiggle room left undefined in the long term. What even is a 'human'? You're staking the survival of the human race on the dynamics of a very complex, unpredictable nonlinear system, gambling that it'll never happen to evolve out of a region that we'd recognize as still human, when that region isn't important to the long-term dynamics at all. The area that the dynamics favor the trajectory staying inside is going to be a very large subspace that extends for long distances in a vast number of directions, and over sufficiently long periods the behavior is going to be a random walk through that high-dimensional subspace: those invariably diverge and never return. You're basically proposing playing continuous Russian Roulette with the destiny of the human species, and sooner or later we'll inevitably find a chamber with a bullet in it!" So I think we have a real problem, one that we need a solution for.
CRISPR for Psychopaths
I don't have a good understanding of or solution to general cases of this, involving drastic genetic or cybernetic changes. It's a really hairy issue, and frankly rather mind-boggling. So instead I'm going to deliberately start quite small, with something easily comprehensible that could actually happen in a few decades or so, even without a singularity, and even without needing ASI rather than just AGI, and see if we can build some intuition. Yet even for this simple, limited case, a toy model of the problem, the ethics get muddy fast.
There is a theory believed by some psychologists that many, if not all, cases of psychopathy have a strong genetic component: that a few percent of people are just born this way and that it is a mostly-genetically-determined form of neurodivergence. Obviously this doesn't require that it be the result of a single mutant allele, or that the condition be binary, where either you have it or you don't: it could well be (like almost all aspects of human genetics) a matter of degree caused by the interaction of combinations of dozens or even hundreds of alleles, some with more effect than others (presumably on the activity and functioning of certain areas of the prefrontal cortex). Some of the psychologists who believe this theory use the term 'psychopath' for someone whose condition is primarily genetic, and distinguish people with similar behavior caused only by abuse or trauma by calling them 'sociopaths' — I gather other specialists use these same two terms in different ways to make different distinctions. [Believers in this genetic-determination theory also generally seem to believe that the sort of murderous, violent, and dangerously-risk-taking behavior that causes psychopaths to be much more common among the population with long-term jail sentences is still generally a combination of genetics and abusive child-rearing or trauma, i.e. nature-plus-nurture, and that people with the genetic cause for psychopathy who had a loving and supportive childhood are instead more likely to end up as what one might call "high-functioning psychopaths": unscrupulous, but generally only skirting the edges of the law, at least when there's much risk of getting caught, and often becoming things like lawyers, special-forces military, surgeons, salespeople, or corporate managers, where their condition can give them advantages, rather than becoming incarcerated violent criminals.]
[It has also been claimed by some psychiatrists I've read that psychopathy (like most other personality disorders) is pretty-much not treatable: fundamentally, psychopaths don't generally feel that there's anything 'wrong' with them that they would want changed, nor do they want to become 'normal', and thus if you educate them about their condition, they tend to just become more effective at concealing it and taking advantage of it. Robert D. Hare wrote: "Many psychopaths describe the traditional treatment programmes as finishing schools where they hone their skills. Where they find out that there are lots of techniques they had not thought about before."]
I am not a psychologist or psychiatrist, so I don't know how much evidence there is for or against this particular genetic-determinism-of-psychopathy theory [or the other associated/tangential theories that I mention above in square brackets]. But for the sake of our discussion, let us assume that the genetic-determinism theory of psychopathy turns out on further research to be (at least substantially) correct: that humans can basically be divided into two groups, psychopaths (~2%–4%) and everyone else (~96%–98%), probably with a fuzzy rather than sharp boundary between them, and that this behavioral difference is primarily caused by genetics rather than the effects of nurture or trauma, though perhaps some specific behavioral manifestations of it can depend on nurture/trauma [and perhaps even, with suitable nurture and absence of trauma during childhood, can then be significantly less likely to involve serious violent crime].
If this were established then, sooner or later, most or all of the common alleles in question would be identified (at least one already has been), and there would be genetic testing from which a fairly accurate "genetic psychopathic tendency score" could be estimated. One can imagine such testing soon becoming expected as a requirement for certain high-profile jobs (C.E.O. or other corporate officers, say), or at least being encouraged in certain circumstances (when requesting clemency during a trial for violent crime, perhaps). Genetic testing and counseling for concerned parents-to-be would be available, and at some point CRISPR, or some more advanced technique descended from it, would become available to safely whole-body edit a person's genome to replace psychopathy-inducing alleles with their more common non-psychopathy-inducing alleles. Most likely this gene editing won't do much good on a mature adult whose neurons and synapse patterns are already well-developed, but if done shortly after birth, when the brain is still developing rapidly and before most synaptic pruning, it would presumably alter the child's phenotype on maturation to non-psychopathic.
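Mechanically, such a "genetic psychopathic tendency score" would presumably just be a polygenic risk score: a weighted sum of risk-allele counts. The locus names and effect weights below are entirely invented for illustration; no real loci are implied:

```python
# Hypothetical effect sizes for hypothetical loci (illustration only).
effect_weights = {"locus_A": 0.8, "locus_B": 0.5, "locus_C": 0.3}

def tendency_score(genotype: dict) -> float:
    """genotype maps each hypothetical locus to a risk-allele count of 0, 1, or 2."""
    return sum(effect_weights[locus] * count for locus, count in genotype.items())

print(round(tendency_score({"locus_A": 2, "locus_B": 0, "locus_C": 1}), 2))   # -> 1.9
```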
If psychopathy is genetic, then likely (modulo the uncertainties of allele mixing) most psychopathic children will have at least one psychopathic (or one-or-more borderline-psychopathic) parent — which clearly could significantly affect the child's chance of being exposed to child abuse. My understanding from what writing by self-admitted psychopaths I've read is that most of them do not regard psychopathy as a disability, but rather as an advantage: they see themselves as having a clarity of thought and freedom of action that the woolly-headed rest of us lack. Basically, a conscience is a limiter, and not having one allows you to be less limited. So if genetic testing and counseling, or even an effective genetic treatment, were available, it's not clear to me what proportion of psychopaths would voluntarily elect to make use of these, either as treatment for themselves if that were a viable option, or for their child if (as seems more likely) it had to be done while very young to be effective. However, if testing were widespread, the results of testing positive were public and/or invasive enough to make it much harder for psychopaths to charm, manipulate, and take advantage of non-psychopaths, and the social/legal/economic stigma and consequences of testing positive were serious enough, then that might change.
Suppose we now add value-learning ASIs to this mix. "Human values" are then a spectrum (or possibly even a bimodal distribution) made up of 96%–98% non-psychopaths and 2%–4% psychopaths. Neither of these groups clearly has an "illness" that might cause their opinion to be validly discounted; they are just different human phenotypes with different viewpoints. These two viewpoints have strong ethical disagreements on a great many important subjects (like murder). Worse, these are not easy to resolve or cohere: they're too far apart and too opposed for a viable, meaningful compromise. The non-psychopaths regard the psychopaths as, well, dangerous criminal stealth-predators, or even if "high-functioning" as unscrupulous borderline-criminal stealth-predators, who cause a great deal of fear, harm, and suffering, and make it harder for everyone else to know whom they can safely cooperate with. So the non-psychopaths basically wish the psychopaths didn't exist, or at least were locked up in cages where they couldn't harm anyone else, or failing that could be forced to wear a large red ψ on their clothes at all times. With suitable encouragement of genetic testing, counseling, and treatment, making psychopaths no longer exist would actually be an achievable long-term goal: we could get rid of them, humanely, in a couple of generations, by just ensuring that all their kids get treated (ideally, with even their germ-line genome edited). In the meantime, we could just keep a careful eye on them and keep them out of positions that could be abused, so they can't harm others. All of this sounds well within the capabilities of a society with AGI, let alone ASI.
The psychopaths, on the other hand, don't want to get rid of the non-psychopaths: they have a predator-prey-like relationship with them (or more accurately, in evolutionary game-theory terms, an evolutionarily-stable-strategy relationship of taking advantage of them, probably one structured comparably to the hawk-dove game). They're very happy being surrounded by lots of woolly-minded sheep-people they can take advantage of, and if you instead genetically edited society (over the course of a couple of generations) into being composed entirely of psychopaths, they'd be unhappy about it, since their comparative advantage would have gone away, and now everyone else would be just as charming, manipulative, devious, unscrupulous, and dangerous as they are. Also, such an all-psychopath society, when combined with high technology and easy access to weapons of mass destruction, would have a horrendous x-risk profile, and wouldn't last long, so that's not an option (as well as not having anyone in favor of it). So there are basically only two options: keep the existing spectrum-or-dichotomy of psychopaths and non-psychopaths, possibly with some adjustment to the predator-prey ratio (away from the previously stable equilibrium), or else (humanely and gradually) eliminate the scourge of psychopathy. So you either need to maintain the tension between the two, which is what the psychopaths want, or else the non-psychopaths win.
Eliminating (or even significantly reducing the prevalence of) the alleles for psychopathy changes human values and human nature, at a genetic level. By quite a lot, since while psychopaths are moderately rare, ethically their values are dramatically different from those of non-psychopaths. Whether this is actually going to happen depends on the details of how the ASIs cohere human values: on a simple numerical-vote basis, psychopaths are a tiny unpopular minority. While they might care strongly about their continued existence (at least personally, even if they possibly don't have much collective empathy for other psychopaths), the harm done by violent criminal psychopaths is immense, and even the harm done by technically-mostly-law-abiding-but-unscrupulous "high-functioning" psychopaths is often significant. Also, humans have a long established, likely innate, fear and dislike of psychopaths: the archetypal 'villain' of most stories is basically just a psychopath (not clinically-diagnosed as such, but the behavioral patterns are quite clear: as the Wikipedia article on psychopathy puts it, "persistent antisocial behavior, impaired empathy and remorse, and bold, disinhibited, and egotistical traits"). So this doesn't seem like an issue where a ~2%–4% minority could care strongly enough to win against the wishes of an indifferent ~96%–98%, since the large majority is not indifferent about psychopaths: they're scared of them, with good reason.
So while this isn't a definite prediction, on most ways that one could construct a value learner-based society, I would fairly confidently predict that the ASIs will make the decision (supported by a large human majority) to humanely, carefully, over a few generations, achieve a final solution to the psychopathy problem, either by eliminating the problematic alleles from the gene pool entirely, or at least ensuring that any infant born with too many of them always gets their non-germ-line genotype edited young enough for that to change their phenotype to non-psychopathic.
My intentional use of the phrase "final solution" above (while technically an automatic argument-loss under long-established Internet traditions around Godwin's Law) is intended to emphasize that this is a pretty dramatic "playing god" kind of thing to do: something that results in a permanent change to human nature, a decrease in its genetic diversity, which will permanently affect both the course of human society and the specification of what the value learners are trying to optimize: they will no longer be taking a 2%–4% psychopathic contribution to the nature of "human values" into account. It is also something which I believe the rest of us would heartily approve of. Likely even our stories would eventually change: the 'villain' archetype would come to be seen as a historical curiosity, much the way most people now regard archetypes like 'wizard' or 'witch'.
So, in this toy model case, I think the genetic change to human values will happen, and I even think I agree with it, though as I hope I made clear above I feel somewhat squeamish about it. However, the larger picture of all the other genetic changes one could make to human nature (with a sufficiently good understanding of the human brain and good genetic engineering) is still mind-boggling to me.
An End to War?
Let me propose another, subtler genetic change, though still a small one, for you to consider the ethics of. Currently, humans are quite good at, in a group-confrontational situation such as a war, going through a social/mental process in which the opposing outgroup gets vilified and dehumanized, causing a mental transition in members of the ingroup that then makes it a lot easier to commit acts of lethal violence against members of that outgroup. (For example, consider propaganda films from both sides during the Second World War, or even the Vietnam War, and the ugly epithets routinely used in them.) That is to say, we have an evolved mental/ethical capacity for enabling warfare with other groups of humans, by (temporarily and even somewhat reversibly) redefining them out of the category "human".
From an x-risk point of view in a high-tech society with access to weapons of mass destruction, as well as from a simple ethical point of view, this looks absolutely appalling! Once we knew how to remove this particular evolved capacity via careful and effective genetic engineering, doing so would clearly reduce the rate of wars dramatically: every war would now be as emotionally difficult for everyone involved as the initial stages of a civil war, before the two sides have managed to separate and then dehumanize each other. (Consider the level of shock and dismay most members of both sides were displaying early on during the current Russia/Ukraine war, even after many previous years of simmering armed conflict.)
However, from an evolutionary game theory point of view, removing this capacity entirely looks like a losing strategy (at least unless we did it, and maintained it, species-wide). Any country that retained this capability (or reintroduced it) would reliably win wars (to the extent that warfare wasn't then fully automated), until they either conquered the world or we nuked each other out of existence. However, there is an evolutionarily-acceptable variant of this proposal: rather than removing this dehumanization capacity completely, we instead just add a tit-for-tat trigger condition to it, so that we're only willing and able to dehumanize an outgroup after being provided with clear evidence that they have been dehumanizing us first, or else that they were the first to do it to some other group. Then the only outgroup people can and will still dehumanize is a group of proven dehumanizers. In a population of groups all carrying that modification, the trigger condition never fires (short of some massive misunderstanding, or massive successful disinformation), but competing against a group of the wild type, this variant is only a bit slower to mentally get onto a war footing. Hopefully it shouldn't be too hard to tune the exact trigger sensitivity and speed so that the Evolutionarily Stable Strategy balances at 100% of the new tit-for-tat-triggered variant.
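Here is a toy replicator-dynamics sketch of that claim. All the payoff numbers are invented purely to illustrate the qualitative structure (peace between triggered-variant groups pays best, wild-type-vs-wild-type war pays worst, and a wild-type aggressor gets only a small first-strike edge over a triggered defender before retaliation kicks in); with payoffs of that shape, the tit-for-tat-triggered variant both invades a wild-type population and is stable at 100%:

```python
# Strategies: "W" = wild type (dehumanizes at will), "T" = tit-for-tat-triggered
# (only dehumanizes after evidence the other side did it first).
# Hypothetical per-encounter payoffs (row strategy's payoff against column strategy):
payoff = {
    ("T", "T"): 3.0,   # mutual peace
    ("W", "W"): 1.0,   # frequent mutual war
    ("W", "T"): 1.5,   # W strikes first, but T retaliates once triggered
    ("T", "W"): 1.4,   # T pays a small lag cost before getting onto a war footing
}

x = 0.01   # initial fraction of the population carrying the T variant
for _ in range(200):
    fitness_T = x * payoff[("T", "T")] + (1 - x) * payoff[("T", "W")]
    fitness_W = x * payoff[("W", "T")] + (1 - x) * payoff[("W", "W")]
    mean_fitness = x * fitness_T + (1 - x) * fitness_W
    x = x * fitness_T / mean_fitness      # discrete-time replicator update
print(f"fraction carrying the triggered variant after 200 generations: {x:.3f}")   # -> ~1.000
```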
So, war is likely a solvable problem. Just by making a fairly minor tweak to human values, eliminating something that (in peacetime) pretty-much everyone would regard as highly reprehensible prejudice. Chew on that for a while, and then figure out whether you'd vote for or against making this modification worldwide, including on yourself, once doing this was practicable and safe. [From "…if we…were more the people we wished we were…" I suspect I can deduce Eliezer's vote.]
[On the subject of reprehensible prejudice, I wouldn't be surprised if this proposed modification also significantly reduced rates of racism, if that is in fact primarily caused by this dehumanization capacity mistriggering in some people when there isn't actually a war or similar conflict going on. Most pro-racist propaganda looks a lot like it's intended to suggest that there is an inter-group competitive crisis going on, and thus that you should dehumanize this group, in a way that seems specifically designed to trigger this particular mental capacity.]
A Possible Limitation on Genetic and Cyborg Mutability
From the point of view of minimizing x-risk and the long-term consequences of apocalyptic disasters, it's a very good thing that humans are well adapted to Earth: it means we can and likely would survive a civilization-collapse apocalypse-level disaster such as a major nuclear war, and sooner-or-later the survivors would rebuild from that post-apocalyptic state. There are of course other solutions to this concern: waiting von Neumann probe reseeders, and, if-and-when we've become an interstellar species, recolonization from other nearby human colonies. So it's not essential, but it's certainly a fine belt-and-braces defense-in-depth strategy for recovering fairly rapidly from disasters of a civilization-ending (but not mass-extinction or planet-sterilizing) level.
Currently, at a paleolithic tech level we can thrive as hunter-gatherers in any reasonably-fertile tropical, subtropical, or probably even mediterranean-climate ecosystem on land on Earth. At a neolithic tech level we become an apex hunter-gatherer omnivore in almost all land ecosystems regardless of climate, and by the time we get to an agricultural tech level the recovery is pretty inevitable within roughly another ten millennia (or likely a lot less if we retained some knowledge). As and when we colonize other star-systems, we'll presumably try to pick fairly Earthlike ones and terraform them with Terran biomes, so on such planets (presuming some level of genetically-engineered physiological adaptation to local planetary conditions such as different day length, year length, elemental abundances, sunlight temperature, air pressure, slightly different gravity, and so forth) much the same thing ought to be true, if an apocalypse were to happen there. Also, at the early buildup stages of such a terraforming and colonization, being able to survive at an at-least-mostly-agricultural tech level sounds rather useful. Those planet-specific physiological adaptations don't sound like things that would have to dramatically change aspects we'd consider very important about human values or psychology, at least as long as we remained adapted to a sapient, large-group, social, omnivorous hunter-gatherer niche. The resulting values might not be exactly 'human', but it seems likely they'd need to be at least recognizably 'sapient hominid'.
So, for as long as this x-risk survival advantage seems advisable to retain, and isn't just completely superseded by reseeders and so forth, we and our ASIs are likely to keep it, as a belt-and-braces piece of abundant caution, at least for some significant proportion of the population. This in turn would seem likely to somewhat limit the degree of genetic and cyborg changes we can make to human nature: at least some of us need to stay adapted to be able to survive as hunter-gatherers in Africa or elsewhere (or equivalent locations on terraformed planets).
For example, adapting us to look more like, say, the Greys or the Mekon would be inadvisable: if our heads were simply too big for us to give birth without caesarian section, or we were too physically weak to stand a chance in a bare-handed fight with a single wild dog, or to travel any distance at all without using a high-tech vehicle (or indeed if we couldn't think straight without cybernetic implants), then we're just not going to survive on the savannah of Africa or anywhere else as hunter-gatherers at a paleolithic tech level. Most of our genetically-determined instinctive ethical and aesthetic sensibilities are thought to be adaptations to surviving in that sort of ecological niche and environment. So one would presumably need to keep most of that pretty intact to keep us still well-adapted to this role.
Of course, that only needs to be our phenotype after a civilization collapse, and something that survivors could revert to fairly quickly. If we had a whole other phenotype with a whole other set of human values, which was maintained by, say, taking a pill once a month, and would wear off and we'd revert to baseline humanity once we ran out of pills, that might well be sufficient to deal with the x-risk issue. So this isn't a very tight or certain constraint.
One could of course try genetically re-engineering humans to survive and rebuild from a different niche in a different ecosystem (a marine-mammal one would seem a useful extra backup): that sounds like a rather ambitious project, but then we're discussing things a culture with ASIs might at some point reach the technological level of being able to do, eventually, on the same sorts of timescales as interstellar colonization.
So, while I'm mildly encouraged by this argument, I don't regard it as a very strong guarantee that what we'd want will actually happen, always and everywhere going forward.
Does the Mutable Values Problem Require a Terminal-Goal Solution?
Some of the strategies I discussed above for dealing with the mutable values problem could be achieved without changing the terminal goal of our value learners, just as part of the process of learning a better definition of what "human values" means in the terminal goal of a value learner. If that's a sufficient solution, then that's great: ASI value learners, in conversation with human experts, can and hopefully will figure this out, bake it into their improving understanding of what "human values" really means, and we should be fine.
However, once you include the genetic and cyborg degree of human-values mutability, and it goes from "some wiggle room" to "almost unlimited, except by the dynamics of an extremely complex, functionally-impossible-to-predict nonlinear system (and possibly some minor/debatable x-risk caution concerns)", then this starts to look like a problem where just hoping that there aren't any slippery-slope, random-walk-divergence-in-a-high-dimensional-space, or dramatic-bifurcation problems hidden anywhere in the dynamics of this impossible-to-predict, very nonlinear system is hopeless naivety. So the only way to avoid that is to add some sort of regularizer encouraging the dynamics to stay in regions that we, now, unenlightened, uninformed and biased as we are, would approve of, based on our current values, or at least our current genome. I don't presently see any way to do that at just the Bayesian-priors-of-what-"human values"-means level that doesn't itself seem very liable to have a slippery-slope, random-walk-divergence, or dramatic-bifurcation problem: if you add conservatism into the system but don't anchor it to distance from the origin, and just reduce the rate of movement, then the random walk takes longer to diverge, but it still always will. You need to actually redirect those random walks through possibility space back towards the origin by some kind of attractive force.
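A quick numerical illustration of that last point (standard facts about random walks and mean-reverting AR(1) processes, in my own framing): merely shrinking the step size slows the divergence but never stops it, whereas even a weak pull back toward the origin keeps the process bounded forever:

```python
import numpy as np

rng = np.random.default_rng(2)
steps = 100_000
noise = 0.1 * rng.normal(size=steps)

slow_walk = np.cumsum(noise)          # small steps, no anchor: RMS distance grows like 0.1*sqrt(t)

anchored = np.empty(steps)            # same noise, plus a 1% pull back toward the origin each step
x = 0.0
for t in range(steps):
    x = 0.99 * x + noise[t]
    anchored[t] = x

print(f"|final position|, unanchored walk: {abs(slow_walk[-1]):7.2f}   (RMS prediction ~ {0.1 * steps ** 0.5:.1f})")
print(f"|final position|, anchored walk:   {abs(anchored[-1]):7.2f}   (stationary RMS ~ {0.1 / (1 - 0.99**2) ** 0.5:.1f})")
```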
So I'm somewhat inclined to think that we need to add some sort of regularizer that privileges either our ethical views around now, or at least our current genetic code and the values-related instincts it encodes, and also that the only way to do that and be sure it will stick in the long term is to encode it directly into the terminal goal, replacing pure value learning with some more complex sort of conservatively-regularized value learning. Which includes a bunch of hairy technical questions about designing the regularizer and picking the right strength for any metaparameters, all of which seem really hard to solve without knowing anything about the social/ethical development-trajectory dynamics of much more advanced post-ASI civilizations. In which case, once we've done that, if we later change our minds (even over a minor detail of the regularizer or its metaparameters), the ASIs will fight us on this (and win, since they're ASIs). In fact, the only reason why we can be confident this regularizer will work long-term is the belief that, if our descendants were at some point going to change their minds, the ASIs would fight them and win, most likely by figuring this out well in advance and steering the social evolution in a different direction before our descendants even made that decision. If the system doesn't have that property, then it's not a durable solution; but if it does, our descendants aren't in full control of their destiny: we have permanently installed a posthumous veto.
In particular, we can't use value learning's built-in, Bayesian, weak version of corrigibility to let us make this vital decision later, after we can get ASIs' input on it. That Bayesian corrigibility only applies to the Bayesian part of value learning, figuring out what "human values" actually means, where the ASI will eagerly listen to and Bayesian-incorporate your input (indeed, early on in the process it'll likely actively survey your input). But it doesn't allow you to change your mind about the terminal goal of value learning: the ASIs will automatically consider that x-risk-level attempted sabotage, and will fight you on it and win.
Well, that's unless we can find a way to use something like Holtman's neglected result to obtain a stronger version of corrigibility than value learning inherently provides. (Holtman basically uses an elaborate counterfactual arranged to exactly balance the AI's decision about whether to encourage or discourage a correction, so that it becomes explicitly indifferent about being corrected or not. It's disputed whether he has also solved the problem of persuading the AI to actively maintain this mechanism, and so whether he has achieved all of the desiderata of the standard (strong) definition of corrigibility, or at least more of them than before.) Even then, ASIs are never going to advise you on whether to alter the terminal goal, even if they're somehow corrigible and indifferent to whether you do so or not, so at best we're not going to get much help from them in figuring this problem out, even if we can arrange to have more than one try. And that still requires us to trust our descendants, and their potentially mutated values, with this decision, so it's probably at best a short-term solution: even with corrigibility, sooner or later we need to make a final decision and turn the corrigibility of the terminal goal off, before we ourselves change too much. Which sounds like a recipe for procrastination.
So (barring solving a stronger version of corrigibility, if it's not in fact the case that Holtman already did so) we basically have a choice: we lock down the future to things that we, base Homo sapiens, would at least vaguely approve of, and throw away the key, or we don't, and accept that sooner or later our descendants are likely to leave that area and almost certainly never come back, and we'll eventually lose everything we care about. If we decide to do the former, then (modulo strong corrigibility) we have the problem that we only have one chance to get it right, and the usual advantage of value learning that it has a convergent basin of attraction to the right solution that corrects minor errors doesn't help us with this very thorny, far-future-oriented problem.
I'm really not certain what the right thing to do here is. I like value learning as an ASI alignment proposal exactly because it should be possible to get it to converge to the right answer, and doesn't require solving strong corrigibility because it has its own Bayesian weak form. However, the effect of the mutability of human values is that after value learning converges, it can and will still evolve in synchrony with the evolution of ASI/human values, and, at least once you get to genetic and cyborg changes to human nature, human values can basically evolve almost arbitrarily far. Each step on that journey will look like a good idea at the time to the people and/or ASIs of the day, but it seems almost inevitable that it will sooner or later evolve out of regions that we'd recognize as good or even human. I can't shake the impression that long-term, from our (parochial, yet also privileged for terminal-goal-setting) chronological vantage point, it's like continuously playing Russian Roulette with human destiny. Our descendants won't see it that way, but if we don't impose a regularizer then sooner or later we won't recognize them as human any more.
I can't even see a rational way to make this decision: it looks like a pure "Are you a neophile or neophobe?" moral Rorschach-ink-blot test. I would expect people (in general, if perhaps less so in a high-neophilia venue like Less Wrong) to divide about evenly on it, and to be unable to reach any compromise. Which disconcerts me, and might even carry an x-risk if we started fighting about it.
A Tentative Proposal: Evolved-Values Learning
So, I am really not sure what to do here. This area needs a lot more people thinking about and discussing it — I look forward to the comments section. This is not something we're anything like ready to do, especially if we only get one shot at it (quite possibly in as little as 5–10 years, if we don't slow down).
Nevertheless, after going this far, I'm going to venture a guess. So, let us imagine that today is the final meeting where we decide if we've got the terminal goal right for the value learning ASI that we're planning to unbox next month. I've already fought the "we're not ready" battle and lost, considered resigning and determined it's not going to slow or stop the project,[1] and decided that in that case I'd rather be in the room for this meeting than outside it. Here's what my tentative opinion currently would be (other than that we're not ready):
Firstly, using a value learner is a good idea, and needs to be in the ASI's basic primary terminal goal.
Secondly, the ASI needs to understand that a human's values should be paid less attention to, or ignored, to the extent that the human is sick, traumatized, crazy, operating from false assumptions, or otherwise damaged. So if the ASI or anyone else has deliberately altered a human's values by addicting them to heroin or wireheading, by traumatizing them, or by lying to them that they're not supposed to use reason to think about your dogma, or that if they disagree with your rules they will be tortured forever in a flaming alternate dimension, or whatever, then the changes to their values that were installed by doing so are invalid and should not be paid attention to. So deliberately mutating humans' values by means that physically/emotionally/psychologically/epistemologically damage them is off the table. I don't think that this needs to be installed directly in the terminal goal; I think it can be in the initial-priors description of what "human values" means, since I believe it's internally consistent and will stand up to scrutiny and further research, and is even rather obvious (of course damaged systems malfunction): I'd very much hope that an ASI could figure that out for itself just by reading the Internet and literature, since this is standard operating procedure for humans. If someone wanted to put a hint of it into the terminal goal by adding a word like "evolved", or even "counterfactually undamaged", to the phrase "human values", I wouldn't fight them very hard: doing so seems pretty harmless to me (but then I haven't given it even a tiny fraction of the amount of thought a permanent decision like that deserves).
Thirdly, to the extent that there is still wiggle room after that due to the mutability of human values (likely a very large amount, on the genetic/cyborg mutability side), I like the social design criteria approach that I outlined above. I actually think it's pretty obvious, so I'm hopeful the ASIs would figure it out and use something like it even if we didn't put it in the terminal goal, though I'm still enthusiastic about emphasizing it in the initial priors. However, if we did put it in the terminal goal, that would give us a way to sneak in a neophobic conservative regularizer attached to the current state: the phrase "…things that offend the instinctive ethical and aesthetic sensibilities that natural selection has seen fit to endow Homo sapiens with, and high prevalence of things that those approve of (happy kids, kittens, gardens, water-slides, that sort of thing)." in my criterion 2. explicitly refers to properties of the Homo sapiens genome (or actually, gene pool: genome distribution, including allele frequencies). So that says: if the mutability of human values gives an ASI no preference between two courses of action, it should break the tie by referring to just the genetically-determined "nature" aspects of human values. If we wanted to be conservative, we could define that as specifically the Homo sapiens genome now, before ASI, rather than the future, possibly genetically-engineered human genome at the time in question. That would give us a regularizer attached to now: a fairly loose one, based on genetics only, so including the human values of basically all undamaged counterfactual humans with our current gene pool. If we're going to do that in order to have a conservative regularizer-to-the-initial-state, then it seems clear to me that it needs to actually be in the terminal goal, to prevent goal creep: otherwise, later times have no reason to privilege our vantage point. It doesn't seem very unreasonable to define the word 'human' in "human values" with reference to the human genome as it is now. And at least it seems pretty unambiguous how to do this, and doesn't seem to have any metaparameters to tune. Even if the ASIs genetically engineer or cyborg their future humans to not like things like happy kids, kittens, gardens, or water-slides, those are still what matters for criterion 2. — so for our society to maximize those things, we probably just shouldn't do that genetic engineering: we don't want to have to make owning kittens compulsory because no one will do it voluntarily any more, after all.
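As a purely illustrative aside, here is a minimal sketch (in Python, with entirely made-up candidate plans and scores, none of which come from anything real) of the lexicographic structure of this tie-breaking rule: rank plans by the contemporaneous learned values, and only when several plans are effectively tied fall back on a score derived solely from the genetically-anchored "nature" component of human values. A real value learner obviously wouldn't be ranking a finite list of plans with scalar scores; the point is just the shape of the rule.

```python
from dataclasses import dataclass
from typing import List

# Each candidate plan carries two (entirely made-up) scores:
#  - contemporaneous_score: how well it satisfies the values the learner has
#    currently converged to (including any culturally or genetically drifted values)
#  - anchored_nature_score: how well it satisfies only the genetically-determined
#    "nature" component of values, anchored to the pre-ASI Homo sapiens gene pool
@dataclass
class Candidate:
    name: str
    contemporaneous_score: float
    anchored_nature_score: float

def choose(candidates: List[Candidate], tie_tolerance: float = 1e-6) -> Candidate:
    """Pick the best plan by contemporaneous values; if several are effectively
    tied, break the tie using only the anchored 'nature' values."""
    best = max(c.contemporaneous_score for c in candidates)
    tied = [c for c in candidates if best - c.contemporaneous_score <= tie_tolerance]
    return max(tied, key=lambda c: c.anchored_nature_score)

plans = [
    Candidate("plan_A", 0.9000000, 0.2),
    Candidate("plan_B", 0.9000001, 0.8),  # effectively tied with plan_A on learned values
    Candidate("plan_C", 0.4000000, 1.0),  # clearly worse on learned values
]
print(choose(plans).name)  # -> plan_B: the tie is broken by the anchored values
```

The design choice being illustrated is that the anchor only ever acts as a tie-breaker: it never overrides a clear preference of the contemporaneous values, it just fills in the wiggle room.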
However, I find I'm not quite enough of a neophobe, or parochial enough, to want to enshrine a conservative regularizer tied to the current state in our ASI's terminal goal, where we expect it to stay forever. Instead, I'm going to suggest installing in the terminal goal a regularizer tied to a submanifold of possibilities that includes that point: initially that point will be the closest point on the submanifold to the trajectory, so the regularizer will refer to it, but movement in directions along the submanifold is not constrained by the regularizer. Recall the ethical principle of "evolved-sapient moral community" (a.k.a. "sapient-biovalorism") I proposed in 4. A Moral Case for Evolved-Sapience-Chauvinism: what should be given moral weight in the choice of "who counts" (or, for value learning, in the definition of 'human' in "human values") is the desires and values of an evolved, sapient species, specifically because the species is sapient and its values are evolved. Homo sapiens is sapient (currently uniquely so on Earth), and is evolved for a specific niche in Africa as hunter-gatherers at a paleolithic tech level. Under this moral theory of "who counts", if you alter the human genome (or cyborg humans to alter their phenotype) in a way that alters their values and doesn't fit well with their being a sapient species evolved for that niche (doesn't maintain or increase their genetic fitness in that niche), then the changes in values you made are morally invalid and carry no ethical weight: they're not part of the evolved sapient values of Homo sapiens, they're just some additional arbitrary junk you bolted on using genetic engineering, damaging it in the process. So that rules out the moral validity of almost all genetic changes you might make to human values. So, for my "get rid of psychopathic alleles" example above, that is changing an evolutionarily stable strategy (ESS) that we evolved, in a way that's unstable to the psychopaths reappearing. So by this rule it would be morally invalid, and thus our ASIs would still need to pay attention to the 2%–4% psychopath viewpoint/votes, even if we got rid of the psychopaths as too dangerous to keep around. Doing so disturbed the evolutionary equilibrium of the ESS, so under this ethical rule it's not a 'valid change' to human values. How inconvenient! Even after the villains are all gone, their villainy lingers on…
However, in the An End to War? example, we took the trouble not just to get rid of the problematic mental capacity/behavioral strategy, but instead to genetically engineer a new tit-for-tat-triggered behavioral strategy, intentionally designed so that there is a new ESS balance at 100% of that new strategy: the new strategy outcompetes the old baseline one. So we didn't just arbitrarily shift the population away from the ESS; instead we introduced a new allele that creates a different ESS. So (assuming we did that right), if you put that new genome plus a few of the original baseline humans on the savannah of Africa and allowed tribes to compete, the new tit-for-tat-triggered variant would win. So that change to human nature is morally valid: we made humans who were better adapted to our original niche, not worse. From the evolutionary-niche point of view, we increased humanity's evolutionary fitness rather than decreasing it. These new, more peaceful humans are not evolutionarily damaged: they're actually enhanced. So the change to their values is an evolutionarily valid one.
So, for the "get rid of psychopaths" plan to also have moral validity, so that our ASIs could actually stop paying attention in their definition of "human values" to what the psychopathic minority would have wanted (if we hadn't got rid of those alleles), we'd also need to alter humans to be able to beat psychopaths at their own game (for example, by being very good at spotting them, then not falling for any of their usual tricks, and instead instinctively reacting to those tricks by ostracizing or otherwise dealing with the psychopath and their descendants — possibly to be able to do this the improved variant need a "shoulder psychopath" who is good at recognizing psychopaths' tricks but whose advice they otherwise studiously ignore?). Then psychopaths cannot just successfully reevolve, even on the savannah in the absence of genetic testing, and we now have a new and better ESS.
So under this proposal, for "human values" to get counted in what we're optimizing, the "nature" part of them has to come from a genome for a sapient species shaped by evolution (or at least shaped as well as evolution would have) to be able to survive in some niche at a paleolithic tech level. Note that we're excluding 'niches' that depend upon higher technology: "I'm adapted to flying around sitting on a levitating disk that zaps my enemies" is not a biological niche, just as "I'm adapted to living on a heart-lung machine" isn't. The justification for this distinction is that if you can survive at a paleolithic tech level you can obviously survive using higher tech levels, while the reverse does not necessarily apply: paleolithic survival is the most challenging case.[2] This niche does not have to be the savannah of Africa: reengineering humans to survive equally well in some other niche would be equally valid, whether that was Northern Eurasia (using more Neanderthal alleles, perhaps?), or as a (perhaps amphibious) marine mammal, or indeed even some region on a terraformed planet in another star system.
So this is basically taking the "keeping this capability might be a sensible counter x-risk precaution" argument that I discussed in A Possible Limitation on Genetic and Cyborg Mutability, combining it with the ethical principle of evolved-sapient moral community from 4. A Moral Case for Evolved-Sapience-Chauvinism, and then deliberately installing it into the terminal goal as a permanent conservative regularizer, not tied specifically to the origin but instead tied to biological/evolutionary viability (genetic fitness), which the origin is an example of (and initially will be the nearest point to the trajectory it's regularizing).
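To make the structural difference explicit, here is one loose, purely schematic way to write down the two kinds of regularizer (the learned value function $V$, the distance measure $d$, the weight $\lambda$, and the "fitness submanifold" $M$ are all placeholders for things we don't currently know how to specify, so treat this as notation rather than a definition):

$$U_{\text{origin}}(s) \;=\; V_{\text{learned}}(s) \;-\; \lambda \, d\!\left(g(s),\, g_0\right)$$

$$U_{\text{evolved}}(s) \;=\; V_{\text{learned}}(s) \;-\; \lambda \, \min_{g^\ast \in M} d\!\left(g(s),\, g^\ast\right)$$

Here $s$ is a candidate future, $g(s)$ is the human gene pool (plus any value-affecting cyborging) in that future, $g_0$ is the gene pool now, and $M$ is the set of gene pools with at least evolved-level genetic fitness in some niche at a paleolithic tech level, so $g_0 \in M$. The first form penalizes any movement away from the present; the second penalizes only movement off $M$, leaves movement along $M$ unpenalized, and initially refers to $g_0$ because that is the point of $M$ nearest the trajectory.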
Note that, because of the requirement that the nature part of values come from a gene pool with good genetic fitness in a real (available post-apocalypse) niche at paleolithic tech levels, this proposal is not friendly to using cyborging to modify human values: that's basically always going to get counted as invalid. This is by design, a feature not a bug, so as to lock down the vast amount of wiggle room cyborging otherwise brings to the table, which clearly includes things like "paperclip maximizer". Under the ethical argument of evolved-sapient moral community, what gives the values of an individual from a sapient species moral weight is that they are (an evolved, conveniently-computable approximation of) actual biological necessities for survival and genetic fitness — unlike changes in values designed into a cyborg, which are basically arbitrary design choices that could have been made any way the designer wanted.[3] Obviously this doesn't mean you can't use cyborging to change many other things about humans: you just need to keep their values compatible, as one would expect from the moral principle of evolved-sapient moral community in use here. If you don't regard that limitation as acceptable, then I don't have a proposal for you. (Personally I find biotech more elegant than cyborging, but part of what's relevant here is that it's also more resilient to a civilization collapse.)
So my (current, tentative) proposal is that the terminal goal of value learning should be along the lines of "Optimize the state of the universe according to the relevant evolved sapient species' values (whatever those are), optimizing their contemporaneous values and breaking remaining ties by reference to their evolved instinctive values". In other words, that we should enshrine evolved-sapient moral community directly into value learning's terminal goal. If we need a compact name for this proposal, then I suggest calling it "evolved-values learning".[4]
[1] Or suppose a UAE AI lab is known to be months away from unboxing their ASI with an imam-approved, non-corrigible, non-value-learning terminal goal, and for some reason military solutions are off the table.
[2] Thinking a little more about this, I'm also fine with evolved or genetically engineered adaptations, like continuing to produce lactase as an adult, that increase fitness at an agricultural tech level with negligible effect on fitness at a paleolithic level and don't modify human values (note that cheese still tastes good if you're lactose intolerant; you just don't want the unfortunate after-effects).
[3] However, one might argue that this no-cyborged-values-changes requirement would not apply to some form of self-reproducing nanotech that could survive an apocalypse, undergoes Darwinian evolution, can co-evolve with us, and forms an evolutionarily stable state (and is thus basically an engineered neolife symbiont, just with a different biochemistry). Which might leave quite a bit of wiggle room.
"Evolved-values learning" sounds offhand like it might mean that the ASI should learn "what value humans evolved". The actual proposal is more complex and more challenging to pithily summarize, it's that the ASIs should learn "what values the sapient species they're aligning to have, or, if their values have been modified by modifying members of the species genetically and/or by cyborging, what values they counterfactually would have if they still had their evolved gene pool and no cyborging, where 'evolved' can either mean the gene pool they historically evolved with, or some engineered gene pool that could have evolved at a paleolithic tech level in some available niche, i.e. a gene pool with at least the same level of genetic fitness for that niche as an evolved species would have".