This was posted to SL4 on the last day of 2003. I had largely forgotten about it until I saw the LW Wiki reference it under Mesa Optimization[1]. Besides the reward hacking angle, which is now well-trodden, it gave an argument based on the relationship between philosophy, memetics, and alignment, which has been much less discussed (including in current discussions about human fertility decline) and is perhaps still worth reading and thinking about. Overall, the post seems to have aged well, aside from the very last paragraph.
For historical context, Eliezer had coined "Friendly AI" in Creating Friendly AI 1.0 in June 2001. Although most of it was very hard to understand and subsequently disavowed by Eliezer himself, it had a section on “philosophical crisis”[2] which probably influenced a lot of my subsequent thinking including this post. What's now called The Sequences would start being posted to OB/LW in 2006.
Subsequent to 2003, I think this post mostly went unnoticed/forgotten (including by myself), and MIRI probably reinvented the idea of mesa-optimization/inner-misalignment circa 2016. I remember hearing people talk about inner vs outer alignment while attending a (mostly unrelated decision theory) workshop at MIRI and having an "oh this is new" reaction.
The SL4 Post
Subject: "friendly" humans?
Date: Wed Dec 31 2003
Why are we not faithful servants of our genes? Instead, the defenses our genes built against parasitic memes are breaking down, resulting in entire societies falling below replacement fertility rate even as we enjoy unprecedented riches in technology and material resources. Genes built our brains in the hope that we will remain friendly to them, and they appear to have failed. Why? And is there anything we can learn from their catastrophe as we try to build our own friendly higher-intelligence?
I think the reason we're becoming increasingly unfriendly to genes is that parasitic memes are evolving too fast for genes and their symbiotic defensive memes to keep up, and this is the result of a series of advances in communications technology starting with the printing press. Genes evolved two ways of ensuring our friendliness - hardwired desires and hosting a system of mutually reinforcing philosophies learned during childhood that defines and justifies friendliness toward genes. Unfortunately for the genes, those hardwired desires have proven easy to bypass once a certain level of technology is reached (e.g., development of birth control), and the best philosophical defense for gene-friendliness that evolution could come up with after hundreds of thousands of years is the existence of a God that wants humans to be fertile.
This doesn't bode well for our own efforts. An SI will certainly find it trivial to bypass whatever hardwired desires or constraints that we place on it, and any justifications for human-friendliness that we come up with may strike it to be just as silly as "fill the earth" theism is to us.
But perhaps there is also cause for optimism, because unlike humans, the SI does not have to depend on memes for its operation, so we can perhaps prevent the problem of human-unfriendly memes by not having any memes at all. For example we can make the SI a singleton. If more than one SI is developed, we can try to prevent them from communicating with each other, especially about philosophy. If the SI is made up of multiple internal agents, we can try to make sure their internal communication channels are not suitable for transmitting memes.
Happy New Year!
[1] And/or possibly hearing it mentioned in a podcast as the first description of the analogy between evolution and AI alignment, but I'm not sure and can't recall or find which podcast.
[2] The intro to that section read: A “philosophical crisis” is hard to define. I usually think of a “philosophical crisis” as the AI stumbling across some fact that breaks ver loose of the programmers—i.e., the programmers have some deeply buried unconscious prejudice that makes them untrustworthy, or the AI stumbles across a deliberate lie, or the AI discovers objective morality, et cetera. If the hypothesized gap is wide enough, it may be enough to invalidate almost all the content simultaneously.