What is a definition? Philosophy has, ironically, a large number of definitions of definitions, but three of them are especially relevant to ML and AI safety.

There is the intensional definition, where concepts are defined logically in terms of other concepts (“bachelors are unmarried males”). There is also the extensional definition, which proceeds by listing all the members of a set (“the countries in the European Union are those listed here”).

Much more relevant, though with a less developed philosophical analysis, is the ostensive definition. This is where you point out examples of a concept, and let the viewer generalise from them. This is in large part how we all learnt concepts as children: examples and generalisation. In many cultures, children have a decent grasp of “dog” just from actual and video examples - and that’s the definition of “dog” we often carry into adulthood.

We can use ostensive definitions for reasoning and implications. For example, consider the famous syllogism: “Socrates is human” and “humans are mortal” imply “Socrates is mortal”. “Socrates is human” means that we have an ostensive definition of what humans are, and Socrates fits it. Then “humans are mortal” means that we’ve observed that the set of “humans” seems to be mainly a subset of the set of “mortals”. So we can ostensively define humans as mortal (note that we are using definitions as properties: having the property of “being mortal” means falling inside the ostensive definition of “mortals”). And so we can conclude that Socrates is likely mortal, without waiting till he’s dead.
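To make the set-based reading concrete, here is a minimal sketch in Python; the example sets and the 0.5 threshold are invented purely for illustration and are not part of the argument:

```python
# Toy illustration of the syllogism in terms of example sets.
humans_observed  = {"Socrates", "Plato", "Xanthippe"}   # ostensive examples of "human"
mortals_observed = {"Plato", "Xanthippe", "a sparrow"}  # entities we have seen die

# "Humans are mortal": the observed humans are (mainly) a subset of the observed
# mortals, so we attach the property "mortal" to the ostensive category "human".
overlap = len(humans_observed & mortals_observed) / len(humans_observed)

if overlap > 0.5:   # crude threshold for "mainly a subset"
    # Socrates fits the ostensive definition of "human", so we import the
    # property "mortal" for him without waiting for the observation.
    print("Socrates is likely mortal")
```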

Distinctions: telling what from non-what

There’s another concept that I haven’t seen articulated, which is what I’ll call the “distinction”. This does not define anything, but is sufficient to distinguish members of a set from non-members.

To formalise "the distinction", let $W$ be the universe of possible objects, and $E \subseteq W$ the “environment” of objects we expect to encounter. An ostensive definition starts with a list $S$ of examples, and generalises to a “natural” category $C$ with $S \subseteq C \subseteq E$ - we are aiming to "carve reality at the joints", and get a natural extension of the examples. So, for example, $E$ might be the entities in our current world, $S$ might be the examples of dogs we’ve seen, and $C$ the set of all dogs.

Then, for any set $C \subseteq E$, we can define the “distinction” $d_C : E \to \{0, 1\}$, which maps $C$ to 1 (“True”) and its complement $E \setminus C$ to 0 (“False”). So $d_{\text{dog}}$ would be a distinction that identifies all the dogs in our current world.
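In code, a distinction is nothing more than an indicator function on the environment. A minimal sketch, with the environment and category represented as explicit toy sets (invented for illustration; a real category would rarely be listable like this):

```python
# A "distinction" d_C for a category C within an environment E is just the
# indicator function of C: 1 for members, 0 for everything else in E.

def make_distinction(category):
    """Build d_C from a set C (represented extensionally here, for simplicity)."""
    def d(x):
        return 1 if x in category else 0
    return d

environment = {"labrador", "poodle", "black rock", "cat"}   # E: toy environment
dogs = {"labrador", "poodle"}                               # C: the "natural" category

d_dog = make_distinction(dogs)
print([(x, d_dog(x)) for x in sorted(environment)])
# [('black rock', 0), ('cat', 0), ('labrador', 1), ('poodle', 1)]
```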

Mis-definitions

A lot of confusion around definitions seems to come from mistaking distinctions for definitions. To illustrate, consider the idea of defining maleness as "possessing the Y chromosome". As a distinction, it's serviceable: there's a strong correlation between having that chromosome and being ostensively male.

But it is utterly useless as a definition of maleness. For instance, it would imply that nobody before the 20th century had any idea what maleness was. Oh, sure, they may have referred to something as "maleness" - something to do with genitalia, voting rights, or style of hats - but those are mere correlates of the true definition of maleness, which is the Y chromosome. It would also imply that all "male" birds are actually female, and vice-versa.

Scott had a description of maleness here: “Absolutely typical men have Y chromosomes, have male genitalia, appreciate manly things like sports and lumberjackery, are romantically attracted to women, personally identify as male, wear male clothing like blue jeans, sing baritone in the opera, et cetera.”

Is this a definition? I’d say not; it’s a reminder of the properties of our ostensive definitions (apart from the Y chromosome bit - unless we’re geneticists, we’ll never have observed anyone’s Y chromosomes). It could function as the beginning of an ostensive definition for some alien who hadn’t encountered maleness before and would now know what examples to start with. It could also function as a definition of “American maleness” for foreigners who had no association between maleness and blue jeans or lumberjackery. As we’ll see later on, pure ostensive definitions are rare, and are often mixed in with other definition types.

In another example, Plato (allegedly) defined humans as "featherless bipeds", in response to which Diogenes plucked a chicken and proclaimed "Here is Plato's man." Plato then amended the definition to "featherless biped with broad flat nails."

So what happened is that Plato had a distinction for humans in the typical Athenian environment. Diogenes then attacked that distinction with an adversarial example, and Plato responded by refining the distinction. Neither of these distinctions is a definition, though; the fact that Plato immediately amended his sentence after Diogenes’s example shows that. He didn’t stop and consider reasoned arguments as to whether a plucked chicken might be human; instead, he relied on his ostensive definition of human, ruled out the chicken immediately, and then revised his sentence to justify that.

Distinctions, definitions, and extrapolations

When extrapolating to new environments, a key fact is that definitions and distinctions extrapolate differently.

For example, suppose we’d been brought up in an environment where the only dogs we’d ever seen were small black ones:

[Image: examples of small black dogs]

And suppose that we’d not seen any other black animal or indeed any other black object of that size. Then one plausible distinction for “dog” would be “is it small and black?”.

And now we go out into the world, and see a lot more dogs of all shapes, sizes, and colours. We also encounter a lot of black rocks:

We can extend the ostensive definition of dog into this new environment, and would naturally converge on the usual definition of dog. But if we extended the distinction, we would classify black rocks as dogs and white poodles as non-dogs.
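A minimal sketch of that divergence; the feature tuples are invented for illustration:

```python
# Toy objects as (kind, colour, size) tuples.
new_env = [("dog", "black", "small"),
           ("dog", "white", "large"),    # a big white poodle, say
           ("rock", "black", "small")]

# The distinction learnt in the old environment of small black dogs: "small and black".
def learnt_distinction(obj):
    _, colour, size = obj
    return colour == "black" and size == "small"

# The ostensive definition, extended naturally to the new environment: "is it a dog?"
def extended_ostensive(obj):
    kind, _, _ = obj
    return kind == "dog"

for obj in new_env:
    print(obj, "distinction:", learnt_distinction(obj),
          "ostensive:", extended_ostensive(obj))
# The black rock passes the distinction but not the ostensive definition;
# the white poodle fails the distinction even though it is, of course, a dog.
```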

Symbolically, let $S$ generate the ostensive definition $C_E$ in environment $E$. And let $d_{C_E}$ be the corresponding distinction (there is a single distinction if we think of it as a function; there are multiple ones if we think of it as an algorithm that implements that function).

And now let us extend the environment to $E' \supseteq E$. Then the ostensive definition of $S$ in that new environment is $C_{E'}$, the natural extrapolation of $C_E$ to the new environment (we’re presuming there is a single “natural” extrapolation here, but there could be many candidates). The distinction $d_{C_E}$ also has a natural extrapolation; call it $d'$.

But $d'$ can be very different from $d_{C_{E'}}$, the distinction of $C_{E'}$. In summary:

  • The distinction of the natural extrapolation of an ostensive definition can be very different from the natural extrapolation of the distinction of that ostensive definition.

That is the problem that neural nets typically have when going out of distribution. They learn a distinction that fits their previous environment, and extrapolate that distinction, with dramatically wrong results, when we would have wanted them to extrapolate some sort of ostensive definition.

That is akin to proxy rewards and Goodharting. If we care about the production of nails, then "what is the weight of this factory's production?" is a good way of distinguishing productive factories from non-productive ones. However, as soon as we use that as an objective, the factory managers can change their process to just make each nail much heavier, at no benefit to the end consumer - in effect adding adversarial examples that move us to a new environment. Hence "weight of production" is a poor definition of productivity.
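A toy numeric version of that Goodhart effect (all figures invented):

```python
# The proxy "weight of production" vs what we actually care about (usable nails).

def proxy_weight(nails, grams_per_nail):
    """The proxy objective: total weight of production."""
    return nails * grams_per_nail

def true_value(nails):
    """What we actually care about: usable nails delivered."""
    return nails

honest = {"nails": 10_000, "grams_per_nail": 5}
gamed  = {"nails": 2_000, "grams_per_nail": 50}   # fewer, absurdly heavy nails

print(proxy_weight(**honest), true_value(honest["nails"]))   # 50000 10000
print(proxy_weight(**gamed),  true_value(gamed["nails"]))    # 100000 2000
# The gamed factory wins on the proxy while delivering far less real value.
```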

Non-transitivity of definition extrapolation

Let $S$ be the set of small black dogs, as before. Let $E_0$ be the initial environment with only small black dogs. Let $E_1$ be a larger environment with small black dogs, wolves, and Saint Bernards. And let $E_2$ be an even larger environment with all types of dogs and animals.

It is plausible that the ostensive extrapolation of $S$ in $E_1$ might include wolves (another logical option is that it doesn’t include Saint Bernards). So $C_{E_1}$ includes wolves and Saint Bernards. Then imagine that we became comfortable with $C_{E_1}$ as our new definition of “dogs”.

Then the extrapolation of $C_{E_1}$ to $E_2$ would define $C'_{E_2}$, the set of “dog-wolves[1]”. This would be different from the direct extrapolation of $S$ to $E_2$, which would define $C_{E_2}$, which would be the set of dogs, as expected:

If we went the alternative route of not including Saint Bernards in $C_{E_1}$, then $C'_{E_2}$ might end up as the category of “small-dogs”, again different from $C_{E_2}$:

So the end result of extending definitions, finding more examples, extending again, and so on, may depend on the order in which we encounter the successively larger environments.
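Here is a toy sketch of that order-dependence. The 1-D "features", the environments, and the scaling constant are all invented, and the "natural extrapolation" is a crude stand-in: it links objects whose gap is small relative to the typical gap in the current environment (a rough version of "carving at the joints"):

```python
# Order-dependent concept extrapolation, in toy form.
features = {"small black dog": 0.0, "poodle": 0.2, "labrador": 0.4,
            "saint bernard": 0.6, "wolf": 1.6, "fox": 2.8, "cat": 4.2}

def natural_extrapolation(examples, env, k=1.3):
    """Grow the example set by absorbing objects whose distance to a current
    member is below k times the average gap between neighbours in this env."""
    pts = sorted(env, key=features.get)
    gaps = [features[b] - features[a] for a, b in zip(pts, pts[1:])]
    threshold = k * sum(gaps) / len(gaps)     # denser environments -> finer joints
    category = set(examples)
    changed = True
    while changed:
        changed = False
        for x in env:
            if x not in category and any(
                    abs(features[x] - features[y]) <= threshold for y in category):
                category.add(x)
                changed = True
    return category

S  = {"small black dog"}
E1 = {"small black dog", "saint bernard", "wolf"}
E2 = set(features)

via_E1 = natural_extrapolation(natural_extrapolation(S, E1), E2)   # S -> E1 -> E2
direct = natural_extrapolation(S, E2)                              # S -> E2
print(sorted(via_E1))   # includes "wolf": a "dog-wolves" style category
print(sorted(direct))   # just the dogs
```

In the sparse environment $E_1$ the typical gaps are large, so wolves get absorbed into the category; extrapolating that enlarged category to $E_2$ keeps them, while extrapolating $S$ directly in $E_2$ does not.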

A few more complications

Human definitions are a bit more complicated than pure ostensive ones.

Consider again defining dogs. Dogs are living animate beings and persist in time – the dog one day is the same as the dog the next day. But most people don’t learn time persistence via dogs: we learn it via experience with many objects in the world, and then apply that idea to dogs as well.

Similarly, we learn the concept of “living animate beings” from many examples, and then apply that concept to new categories, without re-learning it every time.

Thus we don’t learn the concept $C$ from just the list $S$ of examples. We fit $C$ within a framework of other concepts. So, if someone hasn’t really encountered dogs before but has an otherwise normal upbringing, then upon encountering dogs for the first time, they can fit them into the “living animate beings” set $A$ from their first observations, and import the properties of $A$ to define $C$ (rather than strictly using $S$).

As a second complication, consider the sentence “larger dogs are more aggressive”. If people hear that and believe it, this updates their world-model. They already have (ostensive?) definitions of “larger”, “dogs”, and “aggressive”. Then this intensional relation between size and aggressiveness of dogs can be added to their world-model without them necessarily needing to experience large aggressive dogs. Which is useful (if someone told me that, say, “larger tigers are more aggressive”, I’m very happy to accept that without having to experience it personally).

So both of these examples show that the extension from $S$ to $C$ depends on context. We know of lots of properties about objects in the world, and the relations between them. These relationships between properties can be ostensive (we have observed large aggressive dogs), intensional (we are told that larger dogs are more aggressive), or both (a mix of observation and formal statements that we believe). And properties can themselves be defined in similar ways, and combined in similar ways, and built on other properties defined in similar ways.
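A minimal sketch of such a mixed world-model; the facts, relation names, and the one-level inheritance are all invented for illustration:

```python
# A world-model mixing ostensively learnt facts with intensional statements.
world_model = {
    # ostensive: generalised from observed examples
    ("dog", "is_a", "living animate being"): "ostensive",
    ("living animate being", "has_property", "persists over time"): "ostensive",
    # intensional: accepted on testimony, without direct observation
    ("larger dog", "has_property", "more aggressive"): "intensional",
    ("larger tiger", "has_property", "more aggressive"): "intensional",
}

def properties_of(entity):
    """Collect properties, inheriting through is_a links (one level, for brevity)."""
    props = {p for (s, r, p) in world_model if s == entity and r == "has_property"}
    parents = {p for (s, r, p) in world_model if s == entity and r == "is_a"}
    for parent in parents:
        props |= {p for (s, r, p) in world_model if s == parent and r == "has_property"}
    return props

print(properties_of("dog"))           # inherits "persists over time" from the framework
print(properties_of("larger tiger"))  # believed without personal experience
```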

Using these complications to aid concept extrapolation

The above complications suggest ways to help concept extrapolation, when moving to a larger environment.

Consider an algorithm that aims to find images that are closely related to each other. It has many inputs, including the following two:

We want the algorithm to find that these images are close to each other, even though it has no other images of dogs. The biggest problem is that the two dogs are in different poses and from different angles, and the algorithm has no direct evidence that dogs can change poses or knowledge about what they look like from different angles.

But now assume the algorithm has access to a lot of labeled images of different animals in all sorts of poses, from different angles (or possibly unlabeled videos of animals changing poses, or sequences of images of animals changing poses). Thus it could conclude that “animals change poses” and “animals' appearances typically change in this way from other angles". And it could also notice that dogs look very similar to other animals, and thus that dogs are likely to be able to change poses, and to look a certain way from another angle. And it should therefore conclude that those two images are similar, with colour and texture being the only major differences.

That is what the algorithm could get from semi-supervised learning. But we could also give it this knowledge in a specific intensional fashion.

For example, we could give it labeled videos of animated entities moving in various ways:

Then if we give it the knowledge that the first image is a “dog”, and that “dogs belong to the category of animated entities”, it can similarly deduce that the entity in the first image is capable of changing poses and how it can do so, leading it to rank the photos as similar.
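Either way, the downstream reasoning is much the same. Here is a minimal sketch of that reasoning only, not of a real vision system: the category facts, the resemblance links, and the observed differences are all stand-ins for what a real pipeline would have to compute or be told:

```python
# Which differences between two images are "explained away" by known invariances
# of the category the images appear to belong to?

category_facts = {
    # learnt from labelled images/videos of other animals, or given intensionally
    # ("dogs belong to the category of animated entities")
    "animal": {"pose", "viewing angle"},   # dimensions along which appearance may vary
}
resembles = {"image_1": "animal", "image_2": "animal"}   # e.g. via visual similarity

observed_differences = {"pose", "viewing angle", "colour", "texture"}

def explainable(img_a, img_b, differences):
    """Differences covered by invariances that both images' categories allow."""
    allowed = (category_facts.get(resembles[img_a], set())
               & category_facts.get(resembles[img_b], set()))
    return differences & allowed

explained = explainable("image_1", "image_2", observed_differences)
print("explained away:", explained)                                  # pose, viewing angle
print("remaining differences:", observed_differences - explained)    # colour, texture
```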

[1] The same works if we’ve seen some subset of examples $S_1$ that includes wolves and is enough to generate the same ostensive set in $E_1$; the subsequent extrapolation to $E_2$ then again gives $C'_{E_2}$.

Comments

In classical Chinese philosophy there's the concept of shi-fei or "this not that". A key bit of the idea, among other things, is that all knowledge involves making distinctions, and those distinctions are judgments, and so if you want to have knowledge and put things into words you have to make this-not-that style judgements of distinction to decide what goes in what category.

More recently here on the forum, Abram has written about teleosemantics, which seems quite relevant to your investigations in this post.