When I'm arguing points like orthogonality and fragility of value, I've occasionally come across rejoinders that I'll (perhaps erroneously) summarize:

Superintelligences are not spawned fully-formed; they are created by some training process. And perhaps it is in the nature of training processes, especially training processes that involve multiple agents facing "social" problems or training processes intentionally designed by humans with friendliness in mind, that the inner optimizer winds up embodying the spirit of niceness and compassion.

Like, perhaps there just aren't all that many ways for a young mind to grow successfully in a world full of other agents with their own desires, and in the face of positive reinforcement for playing nicely with those agents, and negative reinforcement for crossing them. And perhaps one of the common ways for such a young mind to grow, is for it to internalize into its core goals the notions of kindness and compassion and respect-for-the-desires-of-others, in a manner broadly similar to humans. And, sure, this isn't guaranteed, but perhaps it's common enough that we can get young AI minds into the right broad basin, if we're explicitly trying to.

One piece of evidence for this view is that there aren't simple tweaks to human psychology that would make humans significantly more reproductively successful. Sociopathy isn't at fixation. Humans can in fact sniff out cheaters, and can sniff out people who want to make deals but who don't actually really care about you, and those people do less well. Actually caring about people in a readily-verifiable way is robustly useful in the current social equilibrium.

If it turns out to be easy-ish to instill similar sorts of caring into AI, then such an AI might not share human tastes in things like art or humor, but that might be fine, because it might embody broad cosmopolitan virtues — virtues that inspire it to cooperate with us to reach the stars, and not oppose us when we put a sizable portion of the stars toward Fun.

(Or perhaps we'll get even luckier still, and large swaths of human values will turn out to have pretty-wide basins that we can get the AI into if we're trying, so that it does share our sense of humor and laughs alongside us as we travel together to the stars!)

This view is an amalgam of stuff that I tentatively understand Divia Eden, John Wentworth and the shard theory advocates to be gesturing at.

I think this view is wrong, and I don't see much hope here. Here's a variety of propositions I believe that I think sharply contradict this view:

  1. There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.
  2. The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly "nice".
  3. Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI's mind to work differently from how you hope it works.
  4. The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.

Expanding on 1): 

There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.

We have niceness/kindness/compassion because our nice/kind/compassionate ancestors had more kids than their less-kind siblings. The work that niceness/kindness/compassion was doing ultimately grounded out in more children. Presumably that reproductive effect factored through a greater ability to form alliances, lowering the bar required for trust, etc.

It seems to me like "partially adopt the values of others" is only one way among many to get this effect, with others including but not limited to "have a reputation for, and a history of, honesty" and "be cognitively legible" and "fully merge with local potential allies immediately".


Expanding on 2):

The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly "nice".

I think this perspective is reflected in Three Worlds Collide and "Kindness to Kin". Even if we limit our attention to minds that solve the "trust is hard" problem by adopting some of their would-be collaborators’ objectives, there are all sorts of parameters controlling precisely how this is done.

Like, how much of the other's objectives do you adopt, and to what degree?

How long does patience last?

How do you guard against exploitation by bad actors and fakers?

What sorts of cheating are you sensitive to?

What makes the difference between "live and let live"-style tolerance, and chumminess?

If you look at the specifics of how humans implement this stuff, it's chock full of detail. (And indeed, we should expect this a priori from the fact that the niceness/kindness/compassion cluster is a mere correlate of fitness. It's already a subgoal removed from the simple optimization target; it would be kinda surprising if there were only one way to form such a subgoal and if it weren’t situation-dependent!)

If you take a very dissimilar mind and fill out all the details in a very dissimilar way, the result is likely to be quite far from what humans would recognize as “niceness”!

In humans, power corrupts. Maybe in your alien AI mind, a slightly different type of power corrupts in a slightly different way, and next thing you know, it's stabbing you in the back and turning the universe towards its own ends. (Because you didn’t know to guard against that kind of corruption, because it’s an alien behavior with an alien trigger.)

I claim that there are many aspects of kindness, niceness, etc. that work like this, and that are liable to fail in unexpected ways if you rely on this as your central path to alignment.


Expanding on 3): 

Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI's mind to work differently from how you hope it works.

It looks pretty plausible to me that humans model other human beings using the same architecture that they use to model themselves. This seems pretty plausible a priori as an algorithmic shortcut (a human and its peers are both human, so machinery for self-modeling will also tend to be useful for modeling others), and also seems pretty plausible a priori as a way for evolution to stumble into self-modeling in the first place ("we've already got a brain-modeler sitting around, thanks to all that effort we put into keeping track of tribal politics").

Under this hypothesis, it's plausibly pretty easy for imaginations of others’ pain to trigger pain in a human mind, because the other-models and the self-models are already in a very compatible format.[1]

By contrast, an AI might work internally via an architecture that is very different from our own emotional architectures, with nothing precisely corresponding to our "emotions", and many different and distinct parts of the system doing the work that pain does in us. Such an AI is much less likely to learn to model humans in a format that's overlapped with its models of itself, and much less able to have imagined-pain-in-others coincide with the cognitive-motions-that-do-the-work-that-pain-does-in-us. And so, on this hypothesis, the AI entirely fails to develop empathy.

I'm not trying to say "and thus AIs will definitely not have empathy — checkmate"; I'm trying to use this as a single example of a more general fact: an AI, by dint of having a different cognitive architecture than a human, is liable to respond to similar training incentives in very different ways.

(Where, in real life, it will have different training incentives, but even if it did have the same incentives, it is liable to respond in different ways.)

Another, more general instance of the same point: Niceness/kindness/compassion are instrumental-subgoal correlates-of-fitness in the human ancestral environment, that humans latch onto as terminal goals in a very specific way, and the AI will likely latch onto instrumental-subgoals as terminal in some different way, because it works by specific mechanisms that are different than the human mechanism. And so the AI likely gets off the train even before it fails to get empathy in exactly the same way that humans did, because it's already in some totally different and foreign part of the "adopt instrumental goals" part of mindspace. (Or suchlike.)

And more generally still: Once you start telling stories about how the AI works internally, and see that details like whether the human-models and the self-models share architecture could have a large effect on the learned-behavior in places where humans would be learning empathy, then the rejoinder "well, maybe that's not actually where empathy comes from, because minds don't actually work like that" falls pretty flat. Human minds work somehow, and the AI's mind will also work somehow, and once you can see lots of specifics, you can see ways that the specifics are contingent. Most specific ways that a mind can work, that are not tightly analogous to the human way, are likely to cause the AI to learn something relevantly different, where we would be learning niceness/kindness/compassion.

Insofar as your only examples of minds are human minds, it's easy to imagine that perhaps all minds work similarly. And maybe, similarly, if all you knew was biology, you might expect that all great and powerful machines would have the squishy nature, with most of them being tasty if you cook them long enough in a fire. But the more you start understanding how machines work, the more you see how many facts about the workings of those machines are contingent, and the less you expect vehicular machines to robustly taste good when cooked. (Even if horses are the best vehicle currently around!)


Expanding on 4): 

The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.

Suppose you shape your training objectives with the goal that they're better-achieved if the AI exhibits nice/kind/compassionate behavior. One hurdle you're up against is, of course, that the AI might find ways to exhibit related behavior without internalizing those instrumental-subgoals as core values. If ever the AI finds better ways to achieve those ends before those subgoals are internalized as terminal goals, you're in trouble.

And this problem amps up when the AI starts reflecting. 

E.g.: maybe those values are somewhat internalized as subgoals, but only when the AI is running direct object-level reasoning about specific people. Whereas when the AI thinks about game theory abstractly, it recommends all sorts of non-nice things (similar to real-life game theorists). And perhaps, under reflection, the AI decides that the game theory is the right way to do things, and rips the whole niceness/kindness/compassion architecture out of itself, and replaces it with other tools that do the same work just as well, but without mistaking the instrumental task for an end in-and-of-itself.

Lest this example feel completely implausible, imagine a human who quite enjoys dunking on the outgroup and being snide about it, but with a hint of doubt that eventually causes them — on reflection — to reform, and to flinch away from snideness. The small hint of doubt can be carried pretty far by reflection. The fact that the pleasure of dunking on the outgroup is louder, is not much evidence that it's going to win as reflective ability is amplified.

Another example of this sort of dynamic in humans: humans are able to read some philosophy books and then commit really hard to religiosity or nihilism or whatever, in ways that look quite misguided to people who understand the Law. This is a relatively naive mistake, but it's a fine example of the agent's alleged goals being very sensitive to small differences in how it resolves internal inconsistencies about abstract ("philosophical") questions.

A similar pattern can get pretty dangerous when working with an AGI that acts out its own ideals, and that resolves “philosophical” questions very differently than we might — and thus is liable to take whatever analogs of niceness/kindness/compassion initially get baked into it (as a correlate of training objectives), and change them in very different ways than we would.

E.g.: Perhaps the AI sees that its “niceness” binds only when there's actually a smiling human in front of its camera, and not in the case of distant humans that it cannot see (in the same way that human desire to save a drowning child binds only in specific contexts). And perhaps the AI uses slightly different reflective resolution methods than we would, and resolves this conflict not by generalizing niceness, but by discarding it.

And: all these specific examples are implausible, sure. But again, I'm angling for a more general point here: once the AI is reflecting, small shifts in reflection-space (like "let's stop being snide") can produce large shifts in behavior-space.

So even if by some miracle the vast differences in architecture and training regime only produce minor (and survivable) differences between human niceness/kindness/compassion and the AI's ad-hoc partial internalizations of instrumental objectives like "be legibly cooperative to your trading partners", similarly-small differences in its reflective-stabilization methods are liable to result in big differences at the reflective equilibrium.


 

  1. ^

    I suspect I'm one of the people that caused Steven to write up his quick notes on mirror-neurons, because I was trying to make this point to him, and I think he misunderstood me as saying something stupid about mirror neurons. ETA: nope!

4 comments

I suspect I'm one of the people that caused Steven to write up his quick notes on mirror-neurons, because I was trying to make this point to him, and I think he misunderstood me as saying something stupid about mirror neurons.

Nope, I don’t remember you ever saying or writing anything stupid (or anything at all) about mirror neurons. That post was not in response to anything in particular and has no hidden agenda. :-)

…our ancestral environment…

I strongly agree that it’s a bad idea to try to get nice AGIs by doing a blind evolution-like outer-loop search process in an environment where multiple AGIs might benefit from cooperation—see Section 8.3.3.1 here for my three reasons why (which seem complementary to yours).

However, I don’t think that blind evolution-like outer-loop search processes are an ingredient in either shard theory or “alignment by default”.

At least in the shard theory case, the shard theory people seem very clear that when they talk about humans, they’re thinking about within-lifetime learning, not human evolution. For example, they have a post that says “Evolution is a bad analogy for AGI” right in the title!! (I agree btw.)

Expanding on 3):…

OK, now it seems that the post is maybe shifting away from evolution and towards within-lifetime learning, which I like.

In that case, I think there are innate drives that lead (non-psychopathic) humans to feel various social instincts, some of which are related to “niceness”. I think it would be valuable to understand exactly how these innate drives work, and that’s why I’ve been spending 80% of my time doing that. There are a few reasons that it seems valuable. At the very least, this information would give us examples to ground the yet-to-be-invented science that (we hope) will issue predictions like “If an AGI has innate drives X, and training environment Y, it will ‘grow up’ into a trained AGI that wants to do Z”.

A stronger claim (which I don’t endorse) would be “We should put those exact same niceness-related innate drives, built the exact same way, into an AGI, and then we’ve solved alignment!” That seems like almost definitely a very bad plan to me. (See here.) The thing about empathy that you mentioned is one reason. Likewise, for all I know right now, the innate drives are implemented in a way that depends on having a human body and growing up at human speed in a human family etc.

However, if we understand how those innate drives work in humans, then we don’t have to slavishly copy them into an AGI. We can tailor them. Or we can come up with superficially-quite-different approaches that wind up in a similar place. Alignment-by-default would be in that “superficially quite different” category, I think? (As for shard theory, I’m a bit hazy on exactly what the plan is.)

Expanding on 4): 

I want to register strong agreement that this is an area where things can go awry.

Not central to the substance of your claims, but it seems like a good quick improvement to me. I think this post would be more aptly named "Niceness is contingent", or some such. "Niceness is unnatural" is false. Niceness values occur often and reliably in nature, and so they are natural. 

(Perhaps will respond more substantively later, once I judge myself to have a clearer picture of what you're arguing.)

This post makes a pretty straightforward and important point, and I've referenced it a few times since then. It hasn't made a huge impact, and it isn't the best explanation, but I think it's a good one that covers the basics, and I think it could be linked to more frequently.

Whether or not details (and lots of specific detail arguments) matter hinges on the sensitivity argument (which is an argument about basins?) in general, so I'd like to see that addressed directly. What are the arguments for high sensitivity worlds other than anthropics? What is the detailed anthropic argument?