Genetic fitness is a measure of selection strength, not the selection target

Kaj_Sotala

Alternative title: "Evolution suggests robust rather than fragile generalization of alignment properties."

A frequently repeated argument goes something like this:

1. Evolution has optimized humans for inclusive genetic fitness (IGF).
2. However, humans didn’t end up explicitly optimizing for genetic fitness (e.g. they use contraception to avoid having children).
3. Therefore, even if we optimize an AI for X (typically something like "human values"), we shouldn’t expect it to explicitly optimize for X.

My argument is that premise 1 is a verbal shorthand that’s technically incorrect, and premise 2 is at least misleading. As for the overall conclusion, I think that the case from evolution might be interpreted as weak evidence for why AI should be expected to continue optimizing human values even as its capability increases.

Summary of how premise 1 is wrong: If we look closely at what evolution does, we can see that it selects for traits that are beneficial for surviving, reproducing, and passing one’s genes to the next generation. This is often described as “optimizing for IGF”, because the traits that are beneficial for these purposes are usually the ones that have the highest IGF. (This has some important exceptions, discussed later.) However, if we look closely at that process of selection, we can see that this kind of trait selection is not “optimizing for IGF” in the sense that, for example, we might optimize an AI to classify pictures.

The model that I’m sketching is something like this: evolution is an optimization function that, at any given time, is selecting for some traits that are in an important sense chosen at random. At any time, it might randomly shift to selecting for some other traits. Observing this selection process, we can calculate the IGF of traits currently under selection, as a measure of how strongly those are being selected. But evolution is not optimizing for this measure; evolution is optimizing for the traits that have currently been chosen for optimization. As a result, there is no reason to expect that the minds created by evolution should optimize for IGF, but there is reason to expect that they would optimize for the traits that were actually under selection. This is something that we observe any time that humans optimize for some biological need.

In contrast, if we were optimizing an AI to classify pictures, we would not be randomly changing the selection criteria the way that evolution does. We would keep the selection criteria constant: always selecting for the property of classifying pictures the way we want. To the extent that the analogy to evolution holds, AIs should be much more likely to just do the thing they were selected for.

Summary of how premise 2 is misleading: It is often implied that evolution selected humans to care about sex, and then sex led to offspring, and it was only recently with the evolution of contraception that this connection was severed. For example:

15. [...] We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game.
– Eliezer Yudkowsky, AGI Ruin: A List of Lethalities

This seems wrong to me. Contraception may be a very recent invention, but infanticide or killing children by neglect is not; there have always been methods for controlling the population size even without contraception. According to the book Anthropology of Childhood, family sizes and the economic value of having children have always been correlated. Children are more of a burden on foragers and foragers correspondingly have smaller family sizes, whereas children are an asset for farmers who have larger family sizes.

Rather than evolution having selected humans for IGF and this linkage then breaking with the invention of contraception, evolution has selected humans to have an optimization function that weighs various factors in considering how many children to have. In forager-like environments, this function leads to a preference for fewer children and smaller family sizes; in farmer-like environments, this function leads to a preference for more children and larger family sizes. Robin Hanson has suggested that modern society is more forager-like than farmer-like and that our increased wealth is causing us to revert to forager-like ways and psychology. To the extent that this argument is true, there has been no breakage between what evolution “intended” and how humans behave; rather, the optimization function that evolution created continues operating the way it always has.

The invention of modern forms of contraception may have made it easier to limit family sizes in a farmer-type culture that had evolved cultural taboos against practices like infanticide. But rather than creating an entirely new evolutionary environment, finding a way to bypass those taboos brought us closer to how things had been in our original evolutionary environment.

If we look at what humans were selected to optimize for, it looks like we are mostly continuing to optimize for those same things. The reason why a minority of people are choosing not to have children is because our evolved optimization function also values things other than children, and we have “stayed loyal” to this optimization function. In the case of an AI that was trained to act according to something like “human values” and nothing else, the historical example seems to suggest that its alignment properties might generalize even more robustly than ours, as it had not been selected for a mixture of many competing values.

Evolution as a force that selects for traits at random

For this post, I skimmed two textbooks on evolution: Evolution (4th edition) by Futuyama & Kirkpatrick and Evolutionary Analysis (5th edition) by Herron & Freeman. The first one was selected based on Googling “what’s the best textbook on evolutionary biology” and the second was selected because an earlier edition was used in an undergraduate course on evolutionary psychology that I once took and I recalled it being good.

As far as I could tell, neither one talked about evolution as a process that optimizes for genetic fitness (though this was a rather light skim so I may have missed it even if it was there). Rather, they cautioned against thinking of evolution as an active agent that “does” anything in the first place. Evolution does increase a population’s average adaptation to its environment (Herron & Freeman, p. 107), but what this means can constantly change as the environment itself changes. At one time in history, a region may have a cold climate, selecting the species there for an ability to deal with the cold; and then the climate may shift to a warmer one, and previously beneficial adaptations like fur may suddenly become a liability.

Another classic example is that of peppered moth evolution. Light-colored moths used to be the norm in England, with dark-colored ones being very rare, as a light coloration was a better camouflage against birds than a dark one. With the Industrial Revolution and the appearance of polluting factories, some cities became so black that dark color became better camouflage, leading to an increase in dark-colored moths relative to the light-colored ones. And once pollution was reduced, the light-colored moths came to dominate again.

If we were modeling evolution as a mathematical function, we could say that it was first selecting for light coloration in moths, then changed to select for dark, then changed to select for light again.

The closest that one gets to something like “evolution optimizing for genetic fitness” is what’s called “the fundamental theorem of natural selection”, which among other things implies that natural selection will cause the mean fitness of a population to increase over time. However, here we are assuming that the thing we are selecting for remains constant. Light-colored moths will continue to become more common over time, up until a dark coloration becomes the trait with higher fitness and the dark coloration starts becoming more common. In both situations we might say that the “mean fitness of the population is increasing”, but this means a different thing in those two situations: in one situation it means selecting for light coloration, and in another situation, it means selecting for dark coloration. The thing that was first being selected for is then being selected against, even as our terminology implies that the same thing is being selected for.

What happened was that the mean fitness of the population went up as a particular coloration was selected for, then a random change (first the increased pollution, then the decreased pollution) caused the mean fitness to fall, and then it started climbing again.
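As a rough illustration of this rise, fall, and rise, here is a minimal Python sketch. The survival probabilities (0.8 for the better-camouflaged morph, 0.5 for the other), the starting frequency, and the phase lengths are made-up numbers chosen for illustration, not data, and the model treats coloration as a simple inherited type rather than modeling real moth genetics.

```python
# Toy model of the peppered moth story.
# All numbers are illustrative assumptions, not measurements.

SURVIVAL = {
    "clean":    {"light": 0.8, "dark": 0.5},
    "polluted": {"light": 0.5, "dark": 0.8},
}

def step(light_freq, environment):
    """Return (new light frequency, mean fitness) after one generation."""
    w_light = SURVIVAL[environment]["light"]
    w_dark = SURVIVAL[environment]["dark"]
    # Mean fitness is the frequency-weighted average survival probability.
    mean_fitness = light_freq * w_light + (1 - light_freq) * w_dark
    return light_freq * w_light / mean_fitness, mean_fitness

def simulate(light_freq, phases):
    """Run (environment, generations) phases back to back, recording each generation."""
    history = []
    for environment, generations in phases:
        for _ in range(generations):
            light_freq, mean_fitness = step(light_freq, environment)
            history.append((environment, light_freq, mean_fitness))
    return history

history = simulate(0.5, [("clean", 10), ("polluted", 20), ("clean", 20)])
for generation, (environment, light_freq, mean_fitness) in enumerate(history):
    print(f"{generation:2d} {environment:9s} light={light_freq:.3f} mean_fitness={mean_fitness:.3f}")
```

Within each phase the mean fitness climbs, and at each change of environment it drops, even though the update rule itself never changes. What changes is which coloration the rule is favoring.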

All else being equal, the fundamental theorem would lead us to expect that the mean fitness of species should increase by a few percent per generation. But all else is not equal: what selection gives, other evolutionary forces take away. The fitness gains made by selection are continuously offset by environments that change in space and time, deleterious mutations, and other factors. (Futuyama & Kirkpatrick, p. 127)

Even taking this into account, evolution does not even consistently increase the mean fitness of the population: sometimes evolution ends up selecting for a decrease in the mean fitness of the population.

The fundamental theorem and the adaptive landscape make assumptions that do not apply exactly to any natural populations. In many cases, though, they give very good approximations that are useful to guide our thinking about evolution. In other cases, the assumptions are violated in ways that make evolution behave very differently. A particularly important situation where the fundamental theorem and adaptive landscape do not apply is when selection is frequency dependent. In some cases, this can cause the mean fitness of a population to decline (Futuyama & Kirkpatrick, p. 128)

An example of frequency-dependent selection leading to lower mean fitness is the case of a bush that produces many fruits (Futuyama & Kirkpatrick, p. 129). Some bushes then evolve a trunk that causes them to cast shade over their neighbors. As a result, those neighbors weaken and die, allowing the bushes that have become trees to get more water and nutrients.

This leads to the trees becoming more common than the bushes. But since trees need to spend much more energy on producing and maintaining their trunk, they don’t have as much energy to spend on growing fruit. When trees were rare and mostly stealing energy from the bushes, this wasn’t as much of a problem; but once the whole population consists of trees, they can end up shading each other. At this point, they end up producing much less fruit from which new trees could grow, so have fewer offspring and thus a lower mean fitness.
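The bush and tree story can be sketched as a toy model. The linear fitness functions below are invented for illustration (they are not from the textbook); they are chosen only so that a tree always out-reproduces a bush at the same tree frequency, while everyone's fruit output falls as trees become common.

```python
# Toy model of frequency-dependent selection (bushes vs. trees).
# The fitness functions are hypothetical and chosen for illustration only.

def bush_fitness(tree_freq):
    """Fruit output of a bush: falls steeply as more neighbors shade it."""
    return 10 - 9 * tree_freq

def tree_fitness(tree_freq):
    """Fruit output of a tree: high while it shades bushes, low once trees shade each other."""
    return 12 - 8 * tree_freq

def step(tree_freq):
    """Return (new tree frequency, mean fitness) after one generation."""
    w_tree = tree_fitness(tree_freq)
    w_bush = bush_fitness(tree_freq)
    mean_fitness = tree_freq * w_tree + (1 - tree_freq) * w_bush
    return tree_freq * w_tree / mean_fitness, mean_fitness

tree_freq = 0.01  # trees start out rare
history = []
for generation in range(30):
    tree_freq, mean_fitness = step(tree_freq)
    history.append((tree_freq, mean_fitness))
    print(f"{generation:2d} trees={tree_freq:.3f} mean_fitness={mean_fitness:.2f}")
```

Under these assumptions, trees are selected for in every generation, yet the mean fitness of the population falls from about 10 toward 4 as they take over.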

This kind of frequency-dependent selection is common. Another example (Futuyama & Kirkpatrick, p. 129) is that of bacteria that evolve both a toxin that kills other bacteria and an antidote against that toxin. Both cost energy to produce, but as long as these bacteria are rare, it’s worth the cost as the toxicity allows them to kill off their competitors.

But once these toxic bacteria establish themselves, there’s no longer any benefit to producing the toxin - all the surviving bacteria are immune to it - so continuing to spend energy on producing it means there’s less energy available for replication. It now becomes more beneficial to keep the antidote production but lose the toxin production: the toxin production goes from being selected for, to being selected against.

Once this selection process has happened for long enough and non-toxin-producing bacteria have come to predominate, the antidote production also becomes an unnecessary liability. Nobody is producing the toxin anymore, so there’s no reason to waste energy on maintaining a defense against it, so the antidote also goes from being selected for to being selected against.

But then what happens once none of the bacteria are producing the toxin or the antidote anymore? Now that nobody has a defense against the toxin, it becomes advantageous to start producing the toxin + antidote combination again, thus killing all the other bacteria that don’t have the antidote… and thus the cycle repeats.
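To make the cycle concrete, here is a minimal Python sketch that asks which strain has the highest fitness in each of three situations. The cost and benefit parameters are hypothetical, and the `KILL_BENEFIT` term (a growth bonus for toxin producers proportional to how many sensitive competitors there are to kill) is a modeling assumption added here, not something taken from the textbook.

```python
# Toy model of the toxin/antidote cycle. All parameters are invented.

TOXIN_COST = 0.2     # growth lost to producing the toxin
ANTIDOTE_COST = 0.1  # growth lost to producing the antidote
KILL_RATE = 0.8      # growth a sensitive strain loses if everyone around it is toxic
KILL_BENEFIT = 0.5   # growth a toxic strain gains if everyone around it is sensitive

def fitnesses(freqs):
    """Fitness of each strain given the current strain frequencies."""
    return {
        # Toxin + antidote: pays both costs, profits from killing sensitive cells.
        "toxic": 1 - TOXIN_COST - ANTIDOTE_COST + KILL_BENEFIT * freqs["sensitive"],
        # Antidote only: pays one cost, immune to the toxin.
        "resistant": 1 - ANTIDOTE_COST,
        # Neither: pays no cost, but is killed by toxin producers.
        "sensitive": 1 - KILL_RATE * freqs["toxic"],
    }

def favored(freqs):
    """Name of the strain with the highest fitness in this population."""
    w = fitnesses(freqs)
    return max(w, key=w.get)

states = {
    "mostly sensitive": {"toxic": 0.01, "resistant": 0.01, "sensitive": 0.98},
    "mostly toxic": {"toxic": 0.98, "resistant": 0.01, "sensitive": 0.01},
    "mostly resistant": {"toxic": 0.01, "resistant": 0.98, "sensitive": 0.01},
}
for label, freqs in states.items():
    print(f"{label}: selection favors the {favored(freqs)} strain")
```

With these numbers, each dominant strain is displaced by the next one in the cycle: a mostly sensitive population favors toxin producers, a mostly toxic one favors the antidote-only strain, and a mostly resistant one favors the strain that produces neither.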

In this section, I have argued that to the extent that evolution is “optimizing a species for fitness”, this actually means different things (selecting for different traits) in different circumstances. Also, evolution optimizing for fitness is more of a rough heuristic than a literal law anyway, since there are many circumstances where evolution ends up lowering the fitness of a population. This alone should make us suspicious of the argument that “evolution selected humans for IGF”; what that means isn't that there's a single thing that was being optimized for, but rather that there was a wide variety of traits that were selected for at different times.

What exactly is fitness, again?

So far I’ve been talking about fitness in general terms, but let’s recap some of the technical details. What exactly is inclusive genetic fitness, again?

There are several different definitions; here’s one set of them.

A simple definition of fitness is that it’s the number of offspring that an individual leaves for the next generation^[1]. Suppose that 1% of a peppered moth’s offspring survive to reproductive age and that the surviving moths have an average of 300 offspring. In this case, the average fitness of these individuals is 0.01 * 300 = 3.

For evolution by natural selection to occur, fitness differences among individuals need to be inherited. In biological evolution, inheritance happens through genes, so we are usually interested in genetic fitness - the fitness of genes. Suppose that these are all light-colored moths in a polluted city. Suppose a gene allele for dark coloration increases the survivability by 0.33 percentage points, for an overall fitness of 0.0133 * 300 ≈ 4. The fitnesses of the alleles are now 3 and 4.

Image from Futuyama & Kirkpatrick. Caption in the original: Genotype A has a fitness of 3, while genotype B has a fitness of 4. Both genotypes start with 10 individuals. (A) The population size of genotype B grows much more rapidly. (B) Plotting the frequencies of the two genotypes shows that genotype B, which starts at a frequency of 0.5, makes up almost 90% of the population just 7 generations later.
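The arithmetic behind these numbers and the caption can be checked with a short Python sketch. The function names are made up for this illustration; the only inputs are the figures already given in the text (1% survival, 300 offspring, fitnesses of 3 and 4, and 10 starting individuals per genotype), and the growth model assumes unchecked multiplication with no other forces acting.

```python
def absolute_fitness(survival_rate, offspring_per_survivor):
    """Expected number of offspring per individual, per the simple definition above."""
    return survival_rate * offspring_per_survivor

def genotype_b_frequency(generations, fitness_a=3, fitness_b=4, start_a=10, start_b=10):
    """Frequency of genotype B after some generations of unchecked growth."""
    count_a = start_a * fitness_a ** generations
    count_b = start_b * fitness_b ** generations
    return count_b / (count_a + count_b)

print(absolute_fitness(0.01, 300))    # the light-colored moths
print(absolute_fitness(0.0133, 300))  # the dark allele; the text rounds this to 4

for generation in range(8):
    print(generation, round(genotype_b_frequency(generation), 3))
```

After 7 generations genotype B makes up about 88% of the population, which matches the caption's "almost 90%".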

Often what matters is the difference in fitness between two alleles: for example, an allele with a fitness of 2 may become more common in the population if its competitor has a fitness of 1, but will become more rare if its competitor has a fitness of 3. Thus it’s common to indicate fitness relative to some common reference, such as the average fitness of the population or the genotype with the highest absolute fitness.
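A minimal Python sketch of this point, under the simplifying assumption of two competing alleles and nothing else going on. The helper names are hypothetical.

```python
def next_frequency(freq, fitness, competitor_fitness):
    """Frequency of an allele after one generation against a single competitor."""
    mean_fitness = freq * fitness + (1 - freq) * competitor_fitness
    # fitness / mean_fitness is the allele's fitness relative to the population average.
    return freq * (fitness / mean_fitness)

def relative_to_best(fitnesses):
    """Express each fitness relative to the highest absolute fitness."""
    best = max(fitnesses.values())
    return {name: w / best for name, w in fitnesses.items()}

print(next_frequency(0.5, 2, 1))  # competitor has fitness 1: the allele spreads
print(next_frequency(0.5, 2, 3))  # competitor has fitness 3: the allele declines
print(relative_to_best({"A": 3, "B": 4}))
```

The same absolute fitness of 2 leads to opposite outcomes depending on the competitor, which is why the relative value is the one that predicts what happens.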

Genetic fitness can be divided into two components. An individual can pass a gene directly onto their offspring - this is called direct fitness. They can also carry a genetic adaptation that causes them to help others with the same adaptation, increasing their odds of survival. For example, a parent may invest extra effort in taking care of their offspring. This is called indirect fitness. The inclusive fitness of a genotype is the sum of its direct and indirect fitness.^[2]

Biological evolution can be defined as “inherited change in the properties of organisms over the course of generations” (Futuyama & Kirkpatrick, p. 7). Evolution by natural selection is when the relative frequencies of a genotype change across generations due to differences in fitness. Note that genotype frequencies can also change across generations for reasons other than natural selection, such as random drift or novel mutations.

Fitness as a measure of selection strength

Let’s look at a case of intentional animal breeding. The details of the math that follows aren’t that important, but I wanted to run through them anyway, just to make it more concrete what “fitness” actually means. Still, you can just skim through them if you prefer.

Suppose that I happen to own a bunch of peppered moths of various colors and happen to like a light color, so I decide to breed them towards being lighter. Now I don’t know the details of how the genetics of peppered moth coloration works - I assume that it might very well be affected by multiple genes. But for the sake of simplicity, let’s just say that there is a single gene with a “light” allele and a “dark” allele.

Call the “light” allele B1 and the “dark” allele B2. B1B1 moths are light, B2B2 moths are dark, and B1B2 / B2B1 moths are somewhere in between (to further simplify things, I’ll use “B1B2” to refer to both B1B2 and B2B1 moths).

Suppose that the initial population has 100 moths. I have been doing breeding for a little bit already, so we start from B1 having a frequency of 0.6, and B2 a frequency of 0.4. The moths have the following distribution of genotypes:

B1B1 = 36

B1B2 = 48

B2B2 = 16

To my eye, all of the moths with the B1B1 genotype look pleasantly light, so I choose to have them all breed. 75% of the moths with the B1B2 genotype look light enough to my eye, and so do 50% of the B2B2 ones (maybe their coloration is also affected by environmental factors or other genes). The rest don’t get to breed.

This gives us, on average, a frequency of 0.675 for the B1 alleles and 0.325 for the B2 alleles in the next generation^[3]. Assuming that each of the moths contributed a hundred gametes to the next generation, we get the following fitnesses for the alleles:

B1: Went from 120 (36 + 36 + 48) to 5400 copies, so the fitness is 5400/120 = 45.

B2: Went from 80 (48 + 16 + 16) to 2600 copies, so the fitness is 2600/80 = 32.5.
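For readers who want to check the numbers, the calculation can be reproduced with a few lines of Python. This is just the arithmetic from the text and footnote 3 under the same assumptions (a single gene, 100 gametes per breeding moth); the variable names are made up for this sketch.

```python
GAMETES_PER_BREEDER = 100  # assumption stated in the text

population = {"B1B1": 36, "B1B2": 48, "B2B2": 16}
breeding_fraction = {"B1B1": 1.0, "B1B2": 0.75, "B2B2": 0.5}
b1_copies = {"B1B1": 2, "B1B2": 1, "B2B2": 0}  # B1 alleles carried per moth

# Allele copies in the starting population (each moth carries two alleles).
before = {
    "B1": sum(n * b1_copies[g] for g, n in population.items()),
    "B2": sum(n * (2 - b1_copies[g]) for g, n in population.items()),
}

# Gametes contributed to the next generation by the moths chosen to breed.
after = {"B1": 0.0, "B2": 0.0}
for genotype, n in population.items():
    gametes = n * breeding_fraction[genotype] * GAMETES_PER_BREEDER
    b1_share = b1_copies[genotype] / 2  # chance that a gamete carries B1
    after["B1"] += gametes * b1_share
    after["B2"] += gametes * (1 - b1_share)

total = after["B1"] + after["B2"]
for allele in ("B1", "B2"):
    print(allele,
          "copies:", before[allele], "->", after[allele],
          "fitness:", after[allele] / before[allele],
          "frequency:", after[allele] / total)
```

Note that nothing in this calculation refers to "fitness" as an input. The only inputs are which moths I let breed; the fitness values come out at the end as a description of the result.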

As the proportion of B1 increases, the average fitness of the population will increase! This is because the more B1 alleles you carry, the more likely it is that you are selected to breed, so B1 carriers have a higher fitness… which means that B1 becomes more common… which increases the average fitness of the moth population as a whole. So in this case, the rule that the average fitness of the population tends to increase over time does apply.
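Here is a minimal Python sketch of that feedback loop over several generations. It assumes, beyond what the text states, that the bred moths mate at random, so that each new generation has genotype frequencies p², 2pq and q² (the starting 36/48/16 split fits this pattern with p = 0.6). "Fitness" is measured here as the chance of being chosen to breed, which differs from the gamete counts above only by a constant factor.

```python
BREEDING_CHANCE = {"B1B1": 1.0, "B1B2": 0.75, "B2B2": 0.5}
B1_SHARE = {"B1B1": 1.0, "B1B2": 0.5, "B2B2": 0.0}  # chance that a gamete carries B1

def step(p):
    """Return (next B1 frequency, mean breeding chance) for current B1 frequency p."""
    q = 1 - p
    genotype_freqs = {"B1B1": p * p, "B1B2": 2 * p * q, "B2B2": q * q}
    mean_fitness = sum(genotype_freqs[g] * BREEDING_CHANCE[g] for g in genotype_freqs)
    b1_gametes = sum(genotype_freqs[g] * BREEDING_CHANCE[g] * B1_SHARE[g] for g in genotype_freqs)
    return b1_gametes / mean_fitness, mean_fitness

p = 0.6
history = []
for generation in range(10):
    next_p, mean_fitness = step(p)
    history.append((p, mean_fitness))
    print(f"{generation} B1={p:.3f} mean_fitness={mean_fitness:.3f}")
    p = next_p
```

The first step reproduces the move from 0.6 to 0.675, and the mean fitness rises every generation simply because lighter moths, the ones I am choosing, make up more of the population.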

But now… wouldn’t it sound pretty weird to describe this process as optimizing for the fitness of the moths?

I am optimizing for having light moths; what the fitness calculation tells us is how much of an advantage the lightness genes have - in other words, how much I am favoring the lightness genes - relative to the darkness genes.

Because we were only modeling the effect of fitness and not e.g. random drift, all of the difference in gene frequencies came from the difference in fitness. This is tautological - it doesn’t matter what you are selecting (optimizing) for, anything that gets selected ends up having the highest fitness, by definition.

Rather than saying that we were optimizing for high fitness, it seems more natural to say that we were optimizing for the trait of lightness and that lightness gave a fitness advantage. The other way around doesn’t make much sense - we were optimizing for fitness and that gave an advantage to lightness? What?

This example used artificial selection because that makes it the most obvious what the actual selection target was. But the math works out the same regardless of whether we’re talking artificial or natural selection. If we say that instead of me deciding that some moths don’t get to breed, the birds and other factors in the external environment are doing it… well, nothing changes about the equations in question.

Was natural selection optimizing for the fitness of the moths? There's a sense in which you could say that since the dark-colored moths ended up having increased fitness compared to the light-colored ones. But it would also again feel a little off to describe it this way; it feels more informative and precise to say that the moths were optimized for having dark color, or to put it more abstractly, for having the kind of a color that fits their environment.

From coloration to behavior

I’ve just argued that if we look at the actual process of evolution, it looks more like optimizing for having specific traits (with fitness as a measure of how strongly they’re selected) rather than optimizing for fitness as such. This is so even though the process of selection can lead to the mean fitness of the population increasing - but as we can see from the math, this just means “if you select for something, then you get more of the thing that you are selecting for”.

In the sections before that, I argued that there’s no single thing that evolution selects for; rather, the thing that it’s selecting is constantly changing.

I think these arguments are sufficient to conclude that the claim “evolution optimized humans for fitness [thus humans ought to be optimizing for fitness]” is shaky.

So far, I have mostly been talking about relatively “static” traits such as coloration, rather than cognitive traits that are by themselves optimizers. So let's talk about cognition. While saying that “evolution optimized humans for genetic fitness, thus humans ought to be optimizing for fitness” seems shaky, the corresponding argument does work if we talk about specific cognitive behaviors that were selected for.

For example, if we say that “humans were selected for caring about their offspring, thus humans should be optimizing for ensuring the survival of their offspring”, then this statement does generally speaking hold - a lot of humans do put quite a lot of cognitive effort into ensuring that their children survive. Or if we say that “humans were selected for exhibiting sexual jealousy in some circumstances, so in some circumstances, they will optimize for preventing their mates from having sex with other humans”, then clearly that statement does also hold.

This gets to my second part of the argument: while it’s claimed that we are now doing something that goes completely against what evolution selected for, contraception at least is a poor example of that. For the most part, we are still optimizing for exactly the things that evolution selected us to optimize for.

Humans still have the goals we were selected for

The desire to have sex was never sufficient for having babies by itself - or at least not for having ones that would survive long enough to reproduce themselves in turn. It was always only one component, with us having multiple different desires relating to children:

A desire to have sex and to enjoy it for its own sake
A desire to have children for its own sake
A desire to take care of and protect children (including ones that are not your own) for its own sake

Eliezer wrote, in “AGI Ruin: A List of Lethalities” that

15. [...] We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously.

This quote seems to imply that

effective contraception is a relatively recent invention
it’s the desire for sex alone that’s the predominant driver for having children (and effective contraception breaks this assumption)
it’s a novel development that we prioritize things-other-than-children so much

All of these premises seem false to me. Here’s why:

On the claim that effective contraception is a relatively recent innovation: even hunter-gatherers have access to effective “contraception” in the form of infanticide, which is commonly practiced among some modern hunter-gatherer societies. Particularly sensitive readers may want to skip the following paragraphs from The Anthropology of Childhood:

The Ache [a Paraguayan foraging society] are particularly direct in disposing of surplus children (approximately one-fifth) because their peripatetic, foraging lifestyle places an enormous burden on the parents. The father provides significant food resources, and the mother provides both food and the vigilant monitoring required by their dangerous jungle environment. Both men and women face significant health and safety hazards throughout their relatively short lives, and they place their own welfare over that of their offspring. A survey of several foraging societies shows a close association between the willingness to commit infanticide and the daunting challenge “to carry more than a single young child on the nomadic round” (Riches 1974: 356).
Among other South American foragers, similar attitudes prevail. The Tapirapé from central Brazil allow only three children per family; all others must be left behind in the jungle. Seasonally scarce resources affecting the entire community dictate these measures (Wagley 1977). In fact, the availability of adequate resources is most commonly the criterion for determining whether an apparently healthy infant will be kept alive (Dickeman 1975). Among the Ayoreo foragers of Bolivia, it is customary for women to have several brief affairs, often resulting in childbirth, before settling into a stable relationship equaling marriage. “Illegitimate” offspring are often buried immediately after birth. During Bugos and McCarthy’s (1984) fieldwork, 54 of 141 births ended in infanticide.

It takes years for a newborn to get to a point where they can take care of themselves, so a simple lack of active caretaking is enough to kill an infant, no modern-age contraceptive techniques required.

On the claim that it’s the desire for sex alone that’s the predominant driver for there being children: again, see infanticide, which doesn’t need to be an active act as much as a simple omission. One needs an active desire to keep children alive.

Also, even though the share of voluntarily childfree people is increasing, it’s still not the predominant choice. One 2022 study found that 22% of the people polled neither had nor wanted to have children - which is a significant amount, but still leaves 78% of people as ones who either have or want to have children. There’s still a strong drive to have children that’s separate from the drive to just have sex.

On the claim that it’s a novel cultural development that we prioritize things other than having children so much: Anthropology of Childhood spends significant time examining the various factors that affect the treatment of children in various cultures. It quite strongly argues that the value of children has always also been strongly contingent on various cultural and economic factors - meaning that it has always been just one of the things that people care about. (In fact, a desire to have lots of children may be more tied to agricultural and industrial societies, where the economic incentives for it are abnormally high.)

Adults are rewarded for having lots of offspring when certain conditions are met. First, mothers must be surrounded by supportive kin who relieve them of much of the burden of childrearing so they can concentrate their energy on bearing more children. Second, those additional offspring must be seen as “future workers,” on farm or in factory. They must be seen as having the potential to pay back the investment made in them as infants and toddlers, and pretty quickly, before they begin reproducing themselves. Failing either or both these conditions, humans will reduce their fertility (Turke 1989). Foragers, for whom children are more of a burden than a help, will have far fewer children than neighboring societies that depend on agriculture for subsistence (LeVine 1988). [...]
In foraging societies, where children are dependent and unproductive well into their teens, fewer children are preferred. In farming societies, such as the Beng, children may be welcomed as “little slaves” (Gottlieb 2000: 87). In pastoral and industrial societies, where young children can undertake shepherding a flock, or do repetitive machine-work, women are much more fertile. And, while the traditional culture of the village affords a plethora of customs and taboos for the protection of the pregnant mother and newborn, these coexist with customs that either dictate or at least quietly sanction abortion and infanticide.

To me, the simplest story here looks something like “evolution selects humans for having various desires, from having sex to having children to creating art and lots of other things too; and all of these desires are then subject to complex learning and weighting processes that may emphasize some over others, depending on the culture and environment”.

Some people will end up valuing children more, for complicated reasons; other people will end up valuing other things more, again for complicated reasons. This was the case in hunter-gatherer times and this is the case now.

But it doesn’t look to me like evolution selected us to desire one thing, and then we developed an inner optimizer that ended up doing something completely different. Rather, it looks like we were selected to desire many different things, with a very complicated function choosing which things in that set each individual ends up emphasizing. Today’s culture might have shifted that function to weigh our desires in a different manner than before, but everything that we do is still being selected from within that set of basic desires, with the weighting function operating the same as it always has.

As I mentioned in the introduction, Robin Hanson has suggested that modern society is more forager-like than farmer-like and that our increased wealth is causing us to revert to forager-like ways and psychology. This would then mean that our evolved weighting function is now exhibiting the kind of behavior that it was evolved to exhibit in a forager-like environment.

We do engage in novel activities like computer games today, but it seems to me like the motivation to play computer games is still rooted in the same kinds of basic desires as the first hunter-gatherers had - e.g. to pass the time, enjoy a good story, socialize, or experience a feeling of competence.

So what can we say about AI?

Well, I would be cautious around reasoning by analogy. I’m not sure we can draw particularly strong claims about the connection to AI. I think that there are more direct and relevant arguments that one can make that do seem worrying, rather than trying to resort to evolutionary analogies.

But it does seem to me that e.g. the evolutionary history for the “sharp left turn” implies the opposite of what has previously been argued. Something like “training an AI for recognizing pictures” or “training an AI for caring about human values” looks a lot more like “selecting humans to care about having offspring” than it looks like “optimizing humans for genetic fitness”. Caring about having offspring is a property that we still seem to pretty robustly carry; our alignment properties continued to generalize even as our capabilities increased.

To the extent that we do not care about our offspring, or even choose to go childfree, it’s just because we were selected to also care about other things - if a process selects humans to care about a mix of many things, them sometimes weighing those other things more does not by itself represent a failure of alignment. This is again in sharp contrast to something like an AI that we tried to exclusively optimize for caring about human well-being. So there’s reason to expect that an AI’s alignment properties might generalize even more than those of existing humans.

Thanks to Quintin Pope, Richard Ngo, and Steve Byrnes for commenting on previous drafts of this essay.

^{^}
Futuyma & Kirkpatrick, p. 60.
^{^}
Futuyma & Kirkpatrick, p. 300.
^{^}
Each B1B1 moth has a 100% chance to “pick” a B1 allele for producing a gamete, each B1B2 moth has a 50% chance to pick a B1 gamete and a 50% chance to pick a B2 gamete, and each B2B2 moth has a 100% chance to pick a B2 allele for producing a gamete. Assuming that each moth that I’ve chosen to breed contributes 100 gametes to the next generation, we get an average of 3600 B1 gametes from the 36 B1B1 moths chosen to breed, 1800 B1 and 1800 B2 gametes from the 36 B1B2 moths chosen to breed, and 800 B2 gametes from the 8 B2B2 moths chosen to breed.
This makes for 3600 + 1800 = 5400 B1 gametes and 1800 + 800 = 2600 B2 gametes, for a total of 8000 gametes. This makes for a frequency of 0.675 for B1 and 0.325 for B2.
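The arithmetic in this footnote can be checked with a short sketch. The genotype counts below are taken from the footnote, with the assumption that 36 heterozygous (B1B2) moths were chosen to breed, since that is the count implied by the stated 1800 + 1800 gametes at 100 gametes per moth. The function and variable names are illustrative only.

```python
# Expected gamete counts from the moths chosen to breed.
# Assumption: 36 B1B2 breeders, the number implied by the stated gamete totals.
GAMETES_PER_MOTH = 100
breeders = {"B1B1": 36, "B1B2": 36, "B2B2": 8}
# Probability that a gamete from each genotype carries the B1 allele.
b1_fraction = {"B1B1": 1.0, "B1B2": 0.5, "B2B2": 0.0}


def expected_gametes(breeders, b1_fraction, gametes_per_moth):
    """Return the expected (B1, B2) gamete counts for the breeding pool."""
    b1 = sum(n * gametes_per_moth * b1_fraction[g] for g, n in breeders.items())
    total = sum(breeders.values()) * gametes_per_moth
    return b1, total - b1


b1, b2 = expected_gametes(breeders, b1_fraction, GAMETES_PER_MOTH)
freq_b1 = b1 / (b1 + b2)
freq_b2 = b2 / (b1 + b2)
print(b1, b2, freq_b1, freq_b2)
```

Changing the entries of `breeders` shows how the next generation's allele frequencies depend only on which moths were picked, regardless of who or what did the picking.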

I agree with most of the factual claims made in this post about evolution. I agree that "IGF is the objective" is somewhat sloppy shorthand. However, after diving into the specific ways the object level details of "IGF is the objective" play out, I am confused about why you believe these details imply what you claim they imply about the sharp left turn / inner misalignment. Overall, I still believe that natural selection is a reasonable analogy for inner misalignment.

I agree fitness is not a single stationary thing. I agree this is prima facie unlike supervised learning, where the objective is typically stationary. However, it is pretty analogous to RL, and especially multi agent RL, and overall I don't think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it's a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded. Therefore I don't see any important disanalogy between evolution and multi agent RL. I have various thoughts on why language models do not make RL analogies irrelevant that I can explain but that's another completely different rabbit hole.
I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now change, potentially drastically. I think the misconception may arise from a closely related claim that some make, which is that AI systems might develop weird arbitrary goals (tiny metallic squiggles) because any goal with sufficient intelligence implies playing the training game and then doing a sharp left turn. However, the claim here is not that the tiny metallic squiggles drive will suddenly appear at some point and replace the "make humans really happy" drive that existed previously. The claim is that the drive for tiny metallic squiggles was always, from the very beginning, the reason why [make humans really happy] was the observed behavior in environment [humans can turn you off if they aren't happy with you], and therefore in a different environment [humans can no longer turn you off], the observed behavior is [kill everyone and make squiggles].
I agree that everything is very complex always. I agree that there are multiple different goals/drives/desires in humans that result in children, of which the sex drive is only one. I agree that humans still have children sometimes, and still want children per se sometimes, but in practice this results in fewer and fewer children than in the ancestral environment over time (I bet even foragers are at least above replacement rate), for exactly the reason that the drives we have always had, because they caused us to survive/reproduce in the past, now correspond much less well to survival and reproduction. I also agree that infanticide exists and occurs (but in the ancestral environment, there are counterbalancing drives like taboos around infanticide). In general, in many cases, simplifying assumptions totally break the analogy and make the results meaningless. I don't think I've been convinced that this is one of those cases.

I don't really care about defending the usage of "fitness as the objective" specifically, and so I don't think the following is a crux and am happy to concede some of the points below for the sake of argument about the object facts of inner alignment. However, for completeness, my take on when "fitness" can be reasonably described as the objective, and when it can't be:

I agree that couched in terms of the specific traits, the thing that evolution does in practice is sometimes favoring some traits and sometimes favoring other traits. However, I think there's an important sense in which these traits are not drawn from a hat: natural selection selects for lighter/darker moths because it makes it easier for the moths to survive and reproduce! If lighter moths become more common whenever light moths survive and reproduce better, and vice versa for dark moths, as opposed to moths just randomly becoming more light or more dark in ways uncorrelated to survival/reproduction, it seems pretty reasonable to say that survival/reproduction is closer to the thing being optimized than some particular lightness/darkness function that varies between favoring lightness and darkness.
I agree it is possible to do artificial selection for some particular trait like moth color and in this case saying that the process optimizes "fitness" (or survival/reproduction) collapses to saying the same thing as the process optimizes moth lightness/darkness. I agree it would be a little weird to insist that "fitness" is the goal in this case, and that the color is the more natural goal. I also agree that the evolutionary equations play out the same way whether the source of pressure is artificial human selection or birds eating the moths. Nonetheless, I claim the step where you argue the two cases are equivalent for the purposes of whether we can consider fitness the objective is the step that breaks down. I think the difference between this case and the previous case is that the causality flows differently. We can literally draw from a hat whether we want light moths or dark moths, and then reshape the environment until fitness lines up with our preference for darkness, whereas in the other case, the environment is drawn from a hat and the color selection is determined downstream of that.

Thank you, I like this comment. It feels very cooperative and like some significant effort went into it, and it also seems to touch the core of some important considerations.

I notice I'm having difficulty responding, in that I disagree with some of what you said, but then have difficulty figuring out my reasons for that disagreement. I have the sense there's a subtle confusion going on, but trying to answer you makes me uncertain whether others are the ones with the subtle confusion or if I am.

I'll think about it some more and get back to you.

So I think the issue is that when we discuss what I'd call the "standard argument from evolution", you can read two slightly different claims into it. My original post was a bit muddled because I think those claims are often conflated, and before writing this reply I hadn't managed to explicitly distinguish them.

The weaker form of the argument, which I interpret your comment to be talking about, goes something like this:

The original evolutionary "intent" of various human behaviors/goals was to increase fitness, but in the modern day these behaviors/goals are executed even though their consequences (in terms of their impact on fitness) are very different. This tells us that the intent of the process that created a behavior/goal does not matter. Once the behavior/goal has been created, it will just do what it does even if the consequences of that doing deviate from their original purpose. Thus, even if we train an AI so that it carries out goal X in a particular context, we have no particular reason to expect that it would continue to automatically carry out the same (intended) goal if the context changes enough.

I agree with this form of the argument and have no objections to it. I don't think that the points in my post are particularly relevant to that claim. (I've even discussed a form of inner optimization in humans that causes value drift that I don't recall anyone else discussing in those terms before.)

However, I think that many formulations are actually implying, if not outright stating, a stronger claim:

In the case of evolution, humans were originally selected for IGF but are now doing things that are completely divorced from that objective. Thus, even if we train an AI so that it carries out goal X in a particular context, we have a strong reason to expect that its behavior would deviate so much from the goal as to become practically unrecognizable.

So the difference is something like the implied sharpness of the left turn. In the weak version, the claim is just that the behavior might go some unknown amount to the left. We should figure out how to deal with this, but we don't yet have much empirical data to estimate exactly how much it might be expected to go left. In the strong version, the claim is that the empirical record shows that the AI will by default swerve a catastrophic amount to the left.

(Possibly you don't feel that anyone is actually implying the stronger version. If you don't and you would already disagree with the stronger version, then great! We are in agreement. I don't think it matters whether the implication "really is there" in some objective sense, or even whether the original authors intended it or not. I think the relevant thing is that I got that implication from the posts I read, and I expect that if I got it, some other people got it too. So this post is then primarily aimed at the people who did read the strong version to be there and thought it made sense.)

You wrote:

I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now change, potentially drastically.

If we are talking about the weak version of the argument, then yes, I agree with everything here. But I think the strong version - where our behavior is implied to be completely at odds with our original behavior - has to implicitly assume that things like an art-creation drive are something novel.

Now I don't think that anyone who endorses the strong version (if anyone does) would explicitly endorse the claim that our art-creation drive just appeared out of nowhere. But to me, the strong version becomes pretty hard to maintain if you take the stance that we are mostly still executing all of the behaviors that we used to, and it's just that their exact forms and relative weightings are somewhat out of distribution. (Yes, right now our behavior seems to lead to falling birthrates and lots of populations at below replacement rates, which you could argue was a bigger shift than being "somewhat out of distribution", but... to me that intuitively feels like it's less relevant than the fact that most individual humans still want to have children and are very explicitly optimizing for that, especially since we've only been in the time of falling birthrates for a relatively short time and it's not clear whether it'll continue for very long.)

I think the strong version also requires one to hold that evolution does, in fact, consistently and predominantly optimize for a single coherent thing. Otherwise, it would mean that our current-day behaviors could be explained by "evolution doesn't consistently optimize for any single thing" just as well as they could be explained by "we've experienced a left turn from what evolution originally optimized for".

However, it is pretty analogous to RL, and especially multi agent RL, and overall I don't think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it's a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded.

I agree that there are contexts where it would be analogous to that. But in that example, AlphaGo is still being rewarded for winning games of Go, and it's just that the exact strategies it needs to use differ. That seems different than e.g. the bacteria example, where bacteria are selected for exactly the opposite traits - either selected for producing a toxin and an antidote, or selected for not producing a toxin and an antidote. That seems to me more analogous to a situation where AlphaGo is initially being rewarded for winning at Go, then once it starts consistently winning it starts getting rewarded for losing instead, and then once it starts consistently losing it starts getting rewarded for winning again.
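To make the contrast concrete, here is a toy sketch of the two regimes. It is not a model of AlphaGo or of any real training setup: the learner is reduced to a single hypothetical number (`p_win`, its propensity to win) that drifts toward whatever outcome is currently rewarded, and the thresholds and step size are arbitrary assumptions.

```python
def simulate(steps, flip_target, lr=0.05, high=0.9, low=0.1):
    """Toy learner whose win propensity drifts toward the rewarded outcome.

    flip_target=False: winning is always rewarded (the ordinary Go case).
    flip_target=True: the rewarded outcome flips once the learner reliably
    achieves it (a hypothetical analogue of the bacteria example).
    """
    p_win = 0.5
    reward_wins = True
    flips = 0
    history = []
    for _ in range(steps):
        # Move toward whichever outcome is rewarded right now.
        p_win += lr if reward_wins else -lr
        p_win = min(1.0, max(0.0, p_win))
        if flip_target:
            if reward_wins and p_win >= high:
                reward_wins, flips = False, flips + 1
            elif not reward_wins and p_win <= low:
                reward_wins, flips = True, flips + 1
        history.append(p_win)
    return history, flips


fixed_history, fixed_flips = simulate(200, flip_target=False)
flipping_history, flipping_flips = simulate(200, flip_target=True)
print(fixed_history[-1], fixed_flips)
print(min(flipping_history), max(flipping_history), flipping_flips)
```

With a fixed target the learner settles at always winning. With a flipping target it oscillates indefinitely, and no single outcome describes what it was "trained for" over the whole run.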

And I don't think that that kind of a situation is even particularly rare - anything that consumes energy (be it a physical process such as producing a venom or a fur, or a behavior such as enjoying exercise) is subject to that kind of an "either/or" choice.

Now you could say that "just like AlphaGo is still rewarded for winning games of Go and it's just the strategies that differ, the organism is still rewarded for reproducing and it's just the strategies that differ". But I think the difference is that for AlphaGo, the rewards are consistently shaping its "mind" towards having a particular optimization goal - one where the board is in a winning state for it.

And one key premise on which the "standard argument from evolution" rests is that evolution has not consistently shaped the human mind in such a direct manner. It's not that we have been created with "I want to have surviving offspring" as our only explicit cognitive goal, with all of the evolutionary training going into learning better strategies to get there by explicit (or implicit) reasoning. Rather we have been given various motivations that exhibit varying degrees of directness in how useful they are for that goal - from "I want to be in a state where I produce great art" (quite indirect) to "I want to have surviving offspring" (direct), with the direct goal competing with all the indirect ones for priority. This is unlike AlphaGo, which does have the cognitive capacity for direct optimization toward its goal, and for which that goal has been the sole reward criterion all along.

This is also a bit hard to put a finger on, but I feel like there's some kind of implicit bait-and-switch happening with the strong version of the standard argument. It correctly points out that we have not had IGF as our sole explicit optimization goal because we didn't start by having enough intelligence for that to work. Then it suggests that because of this, AIs are likely to also be misaligned... even though, unlike with human evolution, we could just optimize them for one explicit goal from the beginning, so we should expect our AIs to be much more reliably aligned with that goal!

I think the main crux is that in my mind, the thing you call the "weak version" of the argument simply is the only and sufficient argument for inner misalignment and very sharp left turn. I am confused precisely what distinction you draw between the weak and strong version of the argument; the rest of this comment is an attempt to figure that out.

My understanding is that in your view, having the same drive as before means also having similar actions as before. For example, if humans have a drive for making art, in the ancestral environment this means drawing on cave walls (maybe this helped communicate the whereabouts of food in the ancestral environment). In the modern environment, this may mean passing up a more lucrative job opportunity to be an artist, but it still means painting on some other surface. Thus, the art drive, taking almost the same kinds of actions it ever did (maybe we use acrylic paints from the store instead of grinding plants into dyes ourselves), no longer results in the same consequences in amount of communicating food locations or surviving and having children or whatever it may be. But this is distinct from a sharp left turn, where the actions also change drastically (from helping humans to killing humans).

I agree this is more true for some drives. However, I claim that the association between drives and behaviors is not true in general. I claim humans have a spectrum of different kinds of drives, which differ in how specifically the drive specifies behavior. At one end of the spectrum, you can imagine stuff like breathing or blinking where it's kind of hard to even say whether we have a "breathing goal" or a clock that makes you breathe regularly--the goal is the behavior, in the same way a cup has the "goal" of holding water. At this end of the spectrum it is valid to use goal/drive and behavior interchangeably. At the other end of the spectrum are goals/drives which are very abstract and specify almost nothing about how you get there: drives like desire for knowledge and justice and altruism and fear of death.

The key thing that makes these more abstract drives special is that because they do not specifically prescribe actions, the behaviors are produced by the humans reasoning about how to achieve the drive, as opposed to behaviors being selected for by evolution directly. This means that a desire for knowledge can lead to reading books, or launching rockets, or doing crazy abstract math, or inventing Anki, or developing epistemology, or trying to build AGI, etc. None of these were specifically behaviors that evolution could have reinforced in us--the behaviors available in the ancestral environment were things like "try all the plants to see which ones are edible". Evolution reinforced the abstract drive for knowledge, and left it up to individual human brains to figure out what to do, using the various Lego pieces of cognition that evolution built for us.

This means that the more abstract drives can actually suddenly just prescribe really different actions when important facts in the world change, and those actions will look very different from the kinds of actions previously taken. To take a non-standard example, for the entire history of the existence of humanity up until quite recently, it just simply has not been feasible for anyone to contribute meaningfully to eradicating entire diseases (indeed, for most of human history there was no understanding of how diseases actually worked, and people often just attributed it to punishment of the gods or otherwise found some way to live with it, and sometimes, as a coping mechanism, to even think the existence of disease and death necessary or net good). From the outside it may appear as if for the entire history of humanity there was no drive for disease eradication, and then suddenly in the blink of an evolutionary timescale eye a bunch of humans developed a disease eradication drive out of nowhere, and then soon thereafter suddenly smallpox stopped existing (and soon potentially malaria and polio). These will have involved lots of novel (on evolutionary timescale) behaviors like understanding and manufacturing microscopic biological things at scale, or setting up international bodies for coordination. In actuality, this was driven by the same kinds of abstract drives that have always existed like curiosity and fear of death and altruism, not some new drive that popped into being, but it involved lots of very novel actions steering towards a very difficult target.

I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals (and there could be multiple). I think there may be a general communication issue where there is a type of person that likes to boil problems down to their core, which is usually some very simple setup, but then neglects to actually communicate why they believe this particular abstraction captures the thing that matters.

I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment. (And winning states of the board always looking like having more territory encircled seems analogous to surviving and reproducing always looking like having a lot of children.)

I think there is also a disagreement about what AlphaGo does, though this is hard to resolve without better interpretability -- I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go. In fact, the biggest gripe I have with most empirical alignment research is that I think models today fail to have sufficiently abstract drives, quite possibly for reasons related to why they are kind of dumb today and why things like AutoGPT mysteriously have failed to do anything useful whatsoever. But this is a spicy claim and I think not that many other people would endorse this.

I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals

I agree that they don't depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding to were ones like the following:

The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities. (A central AI alignment problem: capabilities generalization, and the sharp left turn)

15. [...] We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. [...]
16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. (AGI Ruin: A List of Lethalities)

Those arguments are explicitly premised on humans having been optimized for IGF, which is implied to be a single thing. As I understand it, your argument is just that humans now have some very different behaviors from the ones they used to have, omitting any claims of what evolution originally optimized us for, so I see it as making a very different sort of claim.

To respond to your argument itself:

I agree that there are drives for which the behavior looks very different from anything that we did in the ancestral environment. But does very different-looking behavior by itself constitute a sharp left turn relative to our original values?

I would think that if humans had experienced a sharp left turn, then the values of our early ancestors should look unrecognizable to us, and vice versa. And certainly, there do seem to be quite a few things that our values differ on - modern notions like universal human rights and living a good life while working in an office might seem quite alien and repulsive to some tribal warrior who values valor in combat and killing and enslaving the neighboring tribe, for instance.

At the same time... I think we can still basically recognize and understand the values of that tribal warrior, even if we don't share them. We do still understand what's attractive about valor, power, and prowess, and continue to enjoy those kinds of values in less destructive forms in sports, games, and fiction. We can read Gilgamesh or Homer or Shakespeare and basically get what the characters are motivated by and why they are doing the things they're doing. An anthropologist can go to a remote tribe to live among them and report that they have the same cultural and psychological universals as everyone else and come away with at least some basic understanding of how they think and why.

It's true that humans couldn't eradicate diseases before. But if you went to people very far back in time and told them a story about a group of humans who invented a powerful magic that could destroy diseases forever and then worked hard to do so... then the people of that time would not understand all of the technical details, and maybe they'd wonder why we'd bother bringing the cure to all of humanity rather than just our tribe (though Prometheus is at least commonly described as stealing fire for all of humanity, so maybe not), but I don't think they would find it a particularly alien or unusual motivation otherwise. Humans have hated disease for a very long time, and if they'd lost any loved ones to the particular disease we were eradicating they might even cheer for our doctors and want to celebrate them as heroes.

Similarly, humans have always gone on voyages of exploration - e.g. the Pacific islands were discovered and settled long ago by humans going on long sea voyages - so they'd probably have no difficulty relating to a story about sorcerers going to explore the moon, or of two tribes racing for the glory of getting there first. Babylonians had invented the quadratic formula by 1600 BC and apparently had a form of Fourier analysis by 300 BC, so the math nerds among them would probably have some appreciation of modern-day advanced math if it was explained to them. The Greek philosophers argued over epistemology, and there were apparently instructions on how to animate golems (arguably AGI-like) around by the late 12th/early 13th century.

So I agree that the same fundamental values and drives can create very different behavior in different contexts... but if it is still driven by the same fundamental values and drives in a way that people across time might find relatable, why is that a sharp left turn? Analogizing that to AI, it would seem to imply that if the AI generalized its drives in that kind of way when it came to novel contexts, then we would generally still be happy about the way it had generalized them.

This still leaves us with that tribal warrior disgusted with our modern-day weak ways. I think that a lot of what is going on with him is that he has developed particular strategies for fulfilling his own fundamental drives - being a successful warrior was the way you got what you wanted back in that day - and internalized them as a part of his aesthetic of what he finds beautiful and what he finds disgusting. But it also looks to me like this kind of learning is much more malleable than people generally expect. One's sense of aesthetics can be updated by propagating new facts into it, and strongly-held identities (such as "I am a technical person") can change in response to new kinds of strategies becoming viable, and generally many (I think most) deep-seated emotional patterns can at least in principle be updated. (Generally, I think of human values in terms of a two-level model, where the underlying "deep values" are relatively constant, with emotional responses, aesthetics, identities, and so forth being learned strategies for fulfilling those deep values. The strategies are at least in principle updatable, subject to genetic constraints such as the person's innate temperament that may be more hardcoded.)

I think that the tribal warrior would be disgusted by our society because he would rightly recognize that we have the kinds of behavior patterns that wouldn't bring glory in his society and that his tribesmen would find it shameful to associate with, and also that trying to make it in our society would require him to unlearn a lot of stuff that he was deeply invested in. But if he was capable of making the update that there were still ways for him to earn love, respect, power, and all the other deep values that his warfighting behavior had originally developed to get... then he might come to see our society as not that horrible after all.

I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment.

I don't think the actual victory states look substantially different? They're all ones where AlphaGo has more territory than the other player, even if the details of how you get there are going to be different.

I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go.

Yeah, I would expect this as well, but those random drives would still be systematically shaped in a consistent direction (that which brings you closer to a victory state).

This is a great post! Thank you for writing it.

There's a huge amount of ontological confusion about how to think of "objectives" for optimization processes. I think people tend to take an inappropriate intentional stance and treat something like "deliberately steering towards certain abstract notions" as a simple primitive (because it feels introspectively simple to them). This background assumption casts a shadow over all future analysis, since people try to abstract the dynamics of optimization processes in terms of their "true objectives", when there really isn't any such thing.

Optimization processes (or at least, evolution and RL) are better thought of in terms of what sorts of behavioral patterns were actually selected for in the history of the process. E.g., @Kaj_Sotala's point here about tracking the effects of evolution by thinking about what sorts of specific adaptations were actually historically selected for, rather than thinking about some abstract notion of inclusive genetic fitness, and how the difference between modern and ancestral humans seems much smaller from this perspective.

I want to make a similar point about reward in the context of RL: reward is a measure of update strength, not the selection target. We can see as much by just looking at the update equations for REINFORCE (from page 328 of Reinforcement Learning: An Introduction):
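For reference, the REINFORCE update in its standard textbook form is sketched below. The notation ($\theta$ for policy parameters, $\alpha$ for the step size, $G_t$ for the return from step $t$, $\pi$ for the policy) follows the usual Sutton and Barto conventions; the boxed algorithm in the book also carries a discount factor $\gamma^t$, omitted here for the undiscounted case.

```latex
\theta_{t+1} \;=\; \theta_t \;+\; \alpha \, G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)
```

The return $G_t$ enters only as a scalar that rescales the step taken along $\nabla_\theta \ln \pi(A_t \mid S_t, \theta_t)$, the gradient of the log-probability of the action that was actually taken.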

The reward^[1] is literally a (per step) multiplier of the learning rate. You can also think of it as providing the weights of a linear combination of the parameter gradients, which means that it's the historical action trajectories that determine what subspaces of the parameters can potentially be explored. And due to the high correlations between gradients (at least compared to the full volume of parameter space), this means it's the action trajectories, and not the reward function, that provide most of the information relevant for the NN's learning process.
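A minimal sketch of the "multiplier" reading, assuming a softmax policy over two actions whose parameters are just the logits. The function names `grad_log_prob` and `reinforce_step` and all numeric values are illustrative choices, not taken from the textbook or from any library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_step(logits, action, ret, alpha=0.1):
    """One REINFORCE update: the return `ret` only rescales the step size."""
    grad = grad_log_prob(logits, action)
    return [x + alpha * ret * g for x, g in zip(logits, grad)]


logits = [0.0, 0.0]
# Same action taken each time; only the return differs.
delta_0 = [n - o for n, o in zip(reinforce_step(logits, 0, 0.0), logits)]
delta_1 = [n - o for n, o in zip(reinforce_step(logits, 0, 1.0), logits)]
delta_2 = [n - o for n, o in zip(reinforce_step(logits, 0, 2.0), logits)]
```

With the action held fixed, the direction of the update is set entirely by which action was taken; doubling the return doubles the step, and a return of zero produces no update at all.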

From Survival Instinct in Offline Reinforcement Learning:

on many benchmark datasets, offline RL can produce well-performing and safe policies even when trained with "wrong" reward labels, such as those that are zero everywhere or are negatives of the true rewards. This phenomenon cannot be easily explained by offline RL's return maximization objective. Moreover, it gives offline RL a degree of robustness that is uncharacteristic of its online RL counterparts, which are known to be sensitive to reward design.

Trying to preempt possible confusion:

I expect some people to object that the point of the evolutionary analogy is precisely to show that the high-level abstract objective of the optimization process isn't incorporated into the goals of the optimized product, and that this is a reason for concern because it suggests an unpredictable/uncontrollable mapping between outer and inner optimization objectives.

My point here is that, if you want to judge an optimization process's predictability/controllability, you should not be comparing some abstract notion of the process's "true outer objective" to the result's "true inner objective". Instead, you should consider the historical trajectory of how the optimization process actually adjusted the behaviors of the thing being optimized, and consider how predictable that thing's future behaviors are, given past behaviors / updates.

@Kaj_Sotala argues above that this perspective implies greater consistency in human goals between the ancestral and modern environments, since the goals evolution actually historically selected for in the ancestral environment are ~the same goals humans pursue in the modern environment.

For RL agents, I am also arguing that thinking in terms of the historical action trajectories that were actually reinforced during training implies greater consistency, as compared to thinking of things in terms of some "true goal" of the training process. E.g., Goal Misgeneralization in Deep Reinforcement Learning trained a mouse to navigate to cheese that was always placed in the upper right corner of the maze and found that it would continue going to the upper right even when the cheese was moved.

This is actually a high degree of consistency from the perspective of the historical action trajectories. During training, the mouse continually executed the action trajectories that navigated it to the upper right of the board, and continued to do the exact same thing in the modified testing environment.
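The consistency claim can be illustrated with a deliberately simplified toy, which is not a reproduction of the paper's setup. The assumptions here are a hypothetical 3x3 grid without inner walls, a tabular softmax policy that only sees the agent's own position (in the paper the agent sees the whole maze), and a single fixed training trajectory standing in for on-policy sampling so that the result is deterministic.

```python
import math

ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
NAMES = list(ACTIONS)
SIZE = 3  # hypothetical 3x3 grid


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(policy, state, action, ret, alpha=0.5):
    """Return-weighted log-probability gradient step on one state's logits."""
    logits = policy.setdefault(state, [0.0] * len(NAMES))
    probs = softmax(logits)
    for i, name in enumerate(NAMES):
        grad = (1.0 if name == action else 0.0) - probs[i]
        logits[i] += alpha * ret * grad


def step(state, action):
    """Move one cell, clipping at the grid boundary."""
    dx, dy = ACTIONS[action]
    x = min(max(state[0] + dx, 0), SIZE - 1)
    y = min(max(state[1] + dy, 0), SIZE - 1)
    return (x, y)


def greedy_rollout(policy, start, horizon=4):
    """Follow the highest-logit action in each visited state."""
    state = start
    for _ in range(horizon):
        logits = policy.get(state, [0.0] * len(NAMES))
        state = step(state, NAMES[logits.index(max(logits))])
    return state


# Training: the cheese is always in the upper right, so the same
# trajectory is reinforced (return 1) in every episode.
train_cheese = (2, 2)
trajectory = ["right", "right", "up", "up"]
policy = {}
for _ in range(50):
    state = (0, 0)
    for action in trajectory:
        reinforce_update(policy, state, action, ret=1.0)
        state = step(state, action)

# Test: the cheese is moved to the upper left.
test_cheese = (0, 2)
final_state = greedy_rollout(policy, (0, 0))
```

Here `final_state` is the upper right corner, the old cheese location, not the new one. The updates only ever strengthened the "go right, then up" actions, so that is the behavior that persists after the cheese moves.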

^[1] Technically it's the future return in this formulation, and current SOTA RL algorithms can be different / more complex, but I think this perspective is still a more accurate intuition pump than notions of "reward as objective", even for setups where "reward as a learning rate multiplier" isn't literally true.

I think this is really lucid and helpful:

I expect some people to object that the point of the evolutionary analogy is precisely to show that the high-level abstract objective of the optimization process isn't incorporated into the goals of the optimized product, and that this is a reason for concern because it suggests an unpredictable/uncontrollable mapping between outer and inner optimization objectives.
My point here is that, if you want to judge an optimization process's predictability/controllability, you should not be comparing some abstract notion of the process's "true outer objective" to the result's "true inner objective". Instead, you should consider the historical trajectory of how the optimization process actually adjusted the behaviors of the thing being optimized, and consider how predictable that thing's future behaviors are, given past behaviors / updates.
@Kaj_Sotala argues above that this perspective implies greater consistency in human goals between the ancestral and modern environments, since the goals evolution actually historically selected for in the ancestral environment are ~the same goals humans pursue in the modern environment.

I expect you to be making a correct and important point here, but I don't think I get it yet. I feel confused because I don't know what it would mean for this frame to make false predictions. I could say "Evolution selected me to have two eyeballs" and I go "Yep I have two eyeballs"? "Evolution selected for [trait with higher fitness]" and then "lots of people have trait of higher fitness" seems necessarily true?

I feel like I'm missing something.

Oh. Perhaps it's nontrivial that humans were selected to value a lot of stuff, and (different, modern) humans still value a lot of stuff, even in today's different environment? Is that the point?

Does this comment help clarify the point?

In the sections before that, I argued that there’s no single thing that evolution selects for; rather, the thing that it’s changing is constantly changing itself.

"The thing that it's selecting for is itself constantly changing"?

Thanks, edited:

I argued that there’s no single thing that evolution selects for; rather, the thing that it’s selecting is constantly changing.

This seems to be making the same sort of deepity that Turntrout is making in his 'reward is not the optimization target', in taking a minor point about model-free RL approaches not necessarily building in any explicit optimization/planning for reward into their policy, and then people not understanding it because it ducks the major issue, while handwaving a lot of points. (Especially bad: infanticide is not a substitute for contraception because pregnancy is outrageously fatal and metabolically expensive, which is precisely why the introduction of contraception has huge effects everywhere it happens and why hunter-foragers have so many kids while contemporary women have fewer than they want to. Infanticide is just about the worst possible form of contraception short of the woman dying. I trust you would not argue that 'suicide is just as effective contraceptive as infanticide or condoms' using the same logic - after all, if the mother is dead, then there's definitely no more kids...)

In particular, this fundamentally does not answer the challenge I posed earlier by pointing to instances of sperm bank donors who quite routinely rack up hundreds of offspring, while being in no way special other than having a highly-atypical urge to have lots of offspring. You can check this out very easily in seconds and verify that you could do the same thing with less effort than you've probably put into some video games. And yet, you continue to read this comment. Here, look, you're still reading it. Seconds are ticking away while you continue to forfeit (I will be generous and pretend that a LWer is likely to have median number of kids) much more than 10,000% more fitness at next to no cost of any kind. And you know this because you are a model-based RL agent who can plan and predict the consequences of actions based solely on observations (like of text comments) without any additional rewards, you don't have to wait for model-free mechanisms like evolution to slowly update your policy over countless rewards. You are perfectly able to predict that if the status quo lasted for enough millennia, this would stop being true; men would gradually be born with a baby-lust, and would flock to sperm donation banks (assuming such things even still existed under the escalating pressure); you know what the process of evolution would do and is doing right now very slowly, and yet, using your evolution-given brain, you still refuse to reap the fitness rewards of hundreds of offspring right now, in your generation, with yourself, for your genes. How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)? Certainly if AGI were as well-aligned with human values as we are with inclusive fitness, that doesn't seem to bode very well for how human values will be fulfilled over time as the AGI-environment changes ever more rapidly & at scale - I don't know what the 'masturbation, porn, or condom of human values' is, and I'd rather not find out empirically how diabolically clever reward hacks can be when found by superhuman optimization processes at scale targeting the original human values process...

infanticide is not a substitute for contraception

I did not mean to say that they would be exactly equivalent nor that infanticide would be without significant downsides.

How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)?

"Inner optimizers diverging from the optimization process's reward function" sounds to me like humans were already donating to sperm banks in the EEA, only for an inner optimizer to wreak havoc and sidetrack us from that. I assume you mean something different, since under that interpretation of what you mean the answer would be obvious - that we don't need to invoke inner optimizers because there were no sperm banks in the EEA, so "that's not the kind of behavior that evolution selected for" is a sufficient explanation.

How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)?

I think it's inappropriate to use technical terms like "reward function" in the context of evolution, because evolution's selection criteria serve vastly different mechanistic functions from eg a reward function in PPO.^[1] Calling them both a "reward function" makes it harder to think precisely about the similarities and differences between AI RL and evolution, while invalidly making the two processes seem more similar. That is something which must be argued for, and not implied through terminology.

^[1] And yes, I wish that "reward function" weren't also used for "the quantity which an exhaustive search RL agent argmaxes." That's bad too.

I agree fitness is not a single stationary thing. I agree this is prima facie unlike supervised learning, where the objective is typically stationary. However, it is pretty analogous to RL, and especially multi agent RL, and overall I don't think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it's a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded. Therefore I don't see any important disanalogy between evolution and multi agent RL. I have various thoughts on why language models do not make RL analogies irrelevant that I can explain but that's another completely different rabbit hole.
I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically. I think the misconception may arise from a closely related claim that some make, which is that AI systems might develop weird arbitrary goals (tiny metallic squiggles) because any goal with sufficient intelligence implies playing the training game and then doing a sharp left turn. However, the claim here is not that the tiny metallic squiggles drive will suddenly appear at some point and replace the "make humans really happy" drive that existed previously. The claim is that the drive for tiny metallic squiggles was always, from the very beginning, the reason why [make humans really happy] was the observed behavior in environment [humans can turn you off if they aren't happy with you], and therefore in a different environment [humans can no longer turn you off], the observed behavior is [kill everyone and make squiggles].
I agree that everything is very complex always. I agree that there are multiple different goals/drives/desires in humans that result in children, of which the sex drive is only one. I agree that humans still have children sometimes, and still want children per se sometimes, but in practice this results in fewer and fewer children than in the ancestral environment over time (I bet even foragers are at least above replacement rate), precisely because the drives that we have always had, which we have because they caused us to survive/reproduce in the past, now correspond much less well to survival/reproduction. I also agree that infanticide exists and occurs (but in the ancestral environment, there are counterbalancing drives like taboos around infanticide). In general, in many cases, simplifying assumptions totally break the analogy and make the results meaningless. I don't think I've been convinced that this is one of those cases.

I agree that, couched in terms of the specific traits, the thing that evolution does in practice is sometimes favoring some traits and sometimes favoring other traits. However, I think there's an important sense in which these traits are not drawn from a hat: natural selection selects for lighter/darker moths because it makes it easier for the moths to survive and reproduce! If lighter moths become more common whenever light moths survive and reproduce better, and vice versa for dark moths, as opposed to moths just randomly becoming more light or more dark in ways uncorrelated with survival/reproduction, it seems pretty reasonable to say that survival/reproduction is closer to the thing being optimized than some particular lightness/darkness function that varies between favoring lightness and darkness.
I agree it is possible to do artificial selection for some particular trait like moth color, and in this case saying that the process optimizes "fitness" (or survival/reproduction) collapses to saying the same thing as the process optimizes moth lightness/darkness. I agree it would be a little weird to insist that "fitness" is the goal in this case, and that the color is the more natural goal. I also agree that the evolutionary equations play out the same way whether the source of pressure is artificial human selection or birds eating the moths. Nonetheless, I claim the step where you argue the two cases are equivalent for the purposes of whether we can consider fitness the objective is the step that breaks down. I think the difference between this case and the previous case is that the causality flows differently. We can literally draw from a hat whether we want light moths or dark moths, and then reshape the environment until fitness lines up with our preference for darkness, whereas in the other case, the environment is drawn from a hat and the color selection is determined downstream of that.
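The moth example can be made concrete with the textbook one-locus haploid selection recursion, p' = p * w_light / (p * w_light + (1 - p) * w_dark). This is a standard population-genetics model brought in purely for illustration; the fitness values and the environment names `clean` and `sooty` are hypothetical, and nothing here comes from the thread itself.

```python
def next_frequency(p_light, w_light, w_dark):
    """Frequency of light moths after one generation of selection."""
    mean_fitness = p_light * w_light + (1.0 - p_light) * w_dark
    return p_light * w_light / mean_fitness


def run(p_light, environments):
    """Apply the same recursion across a sequence of (w_light, w_dark) pairs."""
    history = [p_light]
    for w_light, w_dark in environments:
        p_light = next_frequency(p_light, w_light, w_dark)
        history.append(p_light)
    return history


# Hypothetical relative fitnesses: clean bark favors light moths,
# sooty bark favors dark moths by the same factor.
clean = (1.0, 0.5)
sooty = (0.5, 1.0)
history = run(0.5, [clean] * 5 + [sooty] * 5)
```

The update rule is identical in both environments; only the fitness values change, and with them the color that spreads. The same recursion applies whether those fitness values come from birds or from a human breeder, so the equations alone do not settle which description of "the target" is the more natural one.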

Thank you, I like this comment. It feels very cooperative and like some significant effort went into it, and it also seems to touch the core of some important considerations.

I'll think about it some more and get back to you.

The weaker form of the argument, which I interpret your comment to be talking about, goes something like this:

The original evolutionary "intent" of various human behaviors/goals was to increase fitness, but in the modern day these behaviors/goals are executed even though their consequences (in terms of their impact on fitness) are very different. This tells us that the intent of the process that created a behavior/goal does not matter. Once the behavior/goal has been created, it will just do what it does even if the consequences of that doing deviate from their original purpose. Thus, even if we train an AI so that it carries out goal X in a particular context, we have no particular reason to expect that it would continue to automatically carry out the same (intended) goal if the context changes enough.

However, I think that many formulations are actually implying, if not outright stating, a stronger claim:

In the case of evolution, humans were originally selected for IGF but are now doing things that are completely divorced from that objective. Thus, even if we train an AI so that it carries out goal X in a particular context, we have a strong reason to expect that its behavior would deviate so much from the goal as to become practically unrecognizable.

You wrote:

I agree that humans (to a first approximation) still have the goals/drives/desires we were selected for. I don't think I've heard anyone claim that humans suddenly have an art creating drive that suddenly appeared out of nowhere recently, nor have I heard any arguments about inner alignment that depend on an evolution analogy where this would need to be true. The argument is generally that the ancestral environment selected for some drives that in the ancestral environment reliably caused something that the ancestral environment selected for, but in the modern environment the same drives persist but their consequences in terms of [the amount of that which the ancestral environment was selecting for] now changes, potentially drastically.

However, it is pretty analogous to RL, and especially multi agent RL, and overall I don't think of the inner misalignment argument as depending on stationarity of the environment in either direction. AlphaGo might early in training select for policies that do tactic X initially because it's a good tactic to use against dumb Go networks, and then once all the policies in the pool learn to defend against that tactic it is no longer rewarded.

I don't think any of these arguments depend crucially on whether there is a sole explicit goal of the training process, or if the goal of the training process changes a bunch. The only thing the argument depends on is whether there exist such abstract drives/goals

I agree that they don't depend on that. Your arguments are also substantially different from the ones I was criticizing! The ones I was responding to were ones like the following:

The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities. (A central AI alignment problem: capabilities generalization, and the sharp left turn)

15. [...] We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. [...]
16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. (AGI Ruin: A List of Lethalities)

To respond to your argument itself:

I am confused by your AlphaGo argument because "winning states of the board" looks very different depending on what kinds of tactics your opponent uses, in a very similar way to how "surviving and reproducing" looks very different depending on what kinds of hazards are in the environment.

I predict that AlphaGo is actually not doing that much direct optimization in the sense of an abstract drive to win that it reasons about, but rather has a bunch of random drives piled up that cover various kinds of situations that happen in Go.

Yeah, I would expect this as well, but those random drives would still be systematically shaped in a consistent direction (that which brings you closer to a victory state).
