Eliezer has talked about thingspace - an abstract space in which objects lie, positioned according to their properties on some scales. In thingspace, there are some more-or-less natural categories, such as the "bird" category, that correspond to clusters of objects. Drawing the boundary, or carving reality at the joints, can be thought of as using definitions that efficiently separate these clusters[1].

He said about the thingspace of birds:

The central clusters of robins and sparrows glowing brightly with highly typical birdness; satellite clusters of ostriches and penguins [further from the center] glowing more dimly with atypical birdness, and Abraham Lincoln a few megaparsecs away and glowing not at all.

That, however, is how things stand in the world as it is. What if, for some reason, we had a lot of power in the world and wanted to break these neat-ish clusters? How would we go about it?

Remaking the clusters

Suppose first we wanted to break some of those satellites away. What if we wanted penguins to be clearly distinct from birds? Penguins are land-waddlers and good swimmers, living in cold climates. So let's fill out this category, and call it, maybe, Panguens.

Panguens all live in cold climates, waddle on land, and are good swimmers. Most are mammals, and give birth to live young. Some are more like the duck-billed platypus, and lay eggs (though they suckle their young). And all of them, including penguins, can reproduce with each other (how? genetic engineering magic). Thus there are a lot of intermediate species, including some that are almost-penguins except for a few details of their anatomy. All have fins of some sort, but only some have penguin-like wing-fins.

Panguens are clearly a cluster, quite a tight one as animals go (given they can reproduce with each other). And they are clearly not birds. And penguins clearly belong inside them, rather than with birds; at most, some pedantic classifier would label penguins as "bird-like panguens".
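To make the move concrete, here is a minimal sketch of the same idea in a toy thingspace. Everything in it is an assumption made for illustration: the five features, the coordinates, and the use of a nearest-neighbour rule as a stand-in for "which cluster does this belong to". It only shows that adding new things near penguins changes the answer without changing penguins at all.

```python
import math

def nearest_label(point, labelled):
    """Return the label of the labelled example closest to `point`."""
    return min(labelled, key=lambda item: math.dist(point, item[0]))[1]

# Hypothetical features: (feathers, flies, lays_eggs, swims_well, cold_climate)
PENGUIN = (1, 0, 1, 1, 1)

# The world as it is: penguins' nearest neighbours are other birds.
actual_world = [
    ((1, 1, 1, 0, 0), "bird"),    # robin
    ((1, 1, 1, 0, 0), "bird"),    # sparrow
    ((1, 0, 1, 0, 0), "bird"),    # ostrich
    ((0, 0, 0, 0, 0), "mammal"),  # dog
]

# The filled-in world: new species crowd the space around penguins.
panguens = [
    ((0, 0, 0, 1, 1), "panguen"),    # live-bearing, seal-like panguen
    ((0, 0, 1, 1, 1), "panguen"),    # egg-laying, platypus-like panguen
    ((0.5, 0, 1, 1, 1), "panguen"),  # almost-penguin
]

before = nearest_label(PENGUIN, actual_world)
after = nearest_label(PENGUIN, actual_world + panguens)
print(before, "->", after)
```

The penguin's own coordinates never change between the two calls; only its neighbourhood does.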

So, by filling in extra possibilities in thingspace, we have moved penguins to a new category. Graphically, it looks like starting with:

and moving to:

What if we wanted to make Abraham Lincoln into a bird[2]? We can connect the categories of birds and humans by creating all sorts of bird-human hybrids. But this is not quite enough. Bird-humans is certainly a reasonable category in this world. But humans are very close to other mammals, especially great apes; Lincoln is clearly a bird-human, but also clearly a mammal. How can we excise him from this category?

Well, we might just go around killing all the apes and monkeys, making the bird-human category more natural. But we don't need to do that. We could instead introduce mammal-lizard hybrids: all mammal species (except humans) have a smooth transition to lizards, filled with intermediate species. Now the new natural definition of "mammal" includes "must have a hybrid version with lizards". All mammals fit this definition - except for humans, who have their own thing with birds. It is much more natural to divide the "bird-humans" cleanly from the "mammal-lizards", and note that Lincoln is a bird-human that shares some features with some mammal-lizards.

The more we want to separate humans from mammals, the more we increase the category "mammals" in directions away from humans, and increase the category "bird-humans" in directions away from other mammals. Graphically we start with:

and move to:

That image, however, underestimates the distance between humans and other mammals, since we have added an extra feature - "has a hybrid version with lizards" - that other mammals have and humans don't. So humans and mammals have moved further apart in thingspace, along this feature.
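The effect of the extra feature can be checked with a little arithmetic. The sketch below uses made-up coordinates on two made-up scales, and assumes plain Euclidean distance as the measure of separation in thingspace; none of these choices come from the original argument.

```python
import math

# Hypothetical positions on two illustrative scales (hairiness, brain size).
human = (0.2, 0.9)
chimp = (0.8, 0.5)
before = math.dist(human, chimp)

# Add the new axis: 1.0 if the species has a hybrid version with lizards.
human_ext = human + (0.0,)
chimp_ext = chimp + (1.0,)
after = math.dist(human_ext, chimp_ext)

print(f"distance before: {before:.3f}, after: {after:.3f}")
```

Under these assumptions the squared distance grows by exactly one unit, the squared gap along the new axis, even though nothing about humans or chimps themselves has changed.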

Why we should care: AIs, examples, and definitions

This is an amusing mental exercise, but what's the point? The point is that this is another variant of the standard failure mode of powerful AIs.

Let's go with that old chestnut, and assume that we have a superintelligence dedicated to "making humans happy". For that to work, we have to define "human" and "happy". We could give an intrinsic definition of these, by saying that humans are "featherless bipeds" (with broad flat nails) and happy humans are those that are "smiling, with higher serotonin, dopamine, oxytocin, and endorphins".

Think of these in terms of thingspace. These carve it up, separating the featherless bipeds from the other animals and things, and the happy ones from those less happy. In our world as it stands, these are pretty natural cuts: almost all humans are featherless bipeds, and almost all featherless bipeds (with broad flat nails) are humans. The happiness definition is similarly decent.

But all that falls apart as soon as the AI starts to optimise things. Consider tiny toy robots, each with a painted smile and some hormones around its CPU. These fit the definitions we've made, but are clearly not "happy humans". That's because the definitions are decent cuts in the space of actual things, but terrible cuts in the space of potential things. So the more powerful the AI becomes, the more it can enter potential space, and the more useless the definitions become.
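As a rough sketch of that failure, the intrinsic definitions can be written as predicates over feature records. The field names, the hormone threshold, and the example records are all hypothetical, chosen only to mirror the definitions given earlier; in particular, the robot's moulded "nails" are an assumption added so that it satisfies the featherless-biped test.

```python
def is_human(thing):
    """Intrinsic definition: a featherless biped with broad flat nails."""
    return (not thing["feathers"]) and thing["legs"] == 2 and thing["broad_flat_nails"]

def is_happy(thing):
    """Intrinsic definition: smiling, with elevated hormone levels."""
    return thing["smiling"] and thing["hormone_level"] > 1.0  # threshold is made up

def is_happy_human(thing):
    return is_human(thing) and is_happy(thing)

# Actual things: the definitions cut reasonably well here.
person = {"feathers": False, "legs": 2, "broad_flat_nails": True,
          "smiling": True, "hormone_level": 1.5}
chicken = {"feathers": True, "legs": 2, "broad_flat_nails": False,
           "smiling": False, "hormone_level": 0.5}

# A potential thing: built by an optimiser to satisfy the predicates.
toy_robot = {"feathers": False, "legs": 2, "broad_flat_nails": True,  # moulded plastic
             "smiling": True,       # painted on
             "hormone_level": 9.0}  # hormones sloshing around the CPU

for name, thing in [("person", person), ("chicken", chicken), ("toy robot", toy_robot)]:
    print(name, is_happy_human(thing))
```

The predicates never look at anything we actually care about, so nothing stops the robot from passing them.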

Relative definitions also suffer: if you define what a human is in terms of collections of cells, for instance, you incentivise the AI to change what cells correspond to.

The approach of reward extrapolation/model splintering is to rely more on extrinsic definitions: the category of humans is the set of humans, the category of happy humans is the set of happy humans. Then the AI is supposed to extend both of these categories in ways similar to how we would do so, dealing conservatively with the inevitable problems as the defining features break down.


  1. This is quite similar to unsupervised learning in ML. ↩︎

  2. This is among my favourite sentences I have ever written. ↩︎


Yes - nice post - feels to me like another handle on the pointers problem.

Thanks for the link.