"she has often seen a cat without a grin but never a grin without a cat"

Let's have a very simple model. There's a boolean, $C$, which measures whether there's a cat around. There's a natural number, $L$, which counts the number of legs on the cat, and a boolean, $G$, which checks whether the cat is grinning (or not).

There are a few obvious rules in the model, to make it compatible with real life:

  • $\neg C \implies (L = 0)$.
  • $\neg C \implies \neg G$.

Or, in other words, if there's no cat, then there are zero cat legs and no grin.

And that's true about reality. But suppose we have trained a neural net to automatically find the values of $C$, $L$, and $G$. Then it's perfectly conceivable that something might trigger the outputs $\neg C$ and $G$ simultaneously: a grin without any cat to hang it on.
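
To make this concrete, here is a minimal sketch in Python of the toy model and its two consistency rules; the names and structure are illustrative, not anything from a real system.

```python
# Toy model: C (cat present), L (number of cat legs), G (cat grinning),
# plus the two reality-consistency rules from above.
from dataclasses import dataclass

@dataclass
class CatWorld:
    C: bool  # is there a cat?
    L: int   # how many cat legs?
    G: bool  # is the cat grinning?

def consistent(w: CatWorld) -> bool:
    """not C implies L == 0, and not C implies not G."""
    if not w.C and w.L != 0:
        return False
    if not w.C and w.G:
        return False
    return True

# A perfectly ordinary observation: a grinning four-legged cat.
print(consistent(CatWorld(C=True, L=4, G=True)))   # True

# The output a trained net might nonetheless produce: a grin without a cat.
print(consistent(CatWorld(C=False, L=0, G=True)))  # False
```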

Adversarial examples

Adversarial examples often seem to behave this way. Take, for example, this adversarial image of a pig classified as an airliner:

Imagine that the neural net was not only classifying "pig" and "airliner", but other things like "has wings" and "has fur".

Then the "pig-airliner" doesn't have wings, and has fur, which are features of pigs but not airliners. Of course, you could build an adversarial model that also breaks "has wings" and "has fur", but, hopefully, the more features that need to be faked, the harder it would become.

This suggests that, as algorithms get smarter, they will become more adept at avoiding adversarial examples - as long as the ultimate question is clear. In our real world, the categories of pigs and airliners are pretty sharply distinct.

We run into problems, though, if the concepts are less clear - such as what might happen to pigs and airliners if the algorithm optimises them, or how the algorithm might classify underdefined concepts like "human happiness".

Myths and dreams

Define the following booleans: $H_h$ detects the presence of a living human head, $H_b$ a living human body, $J_h$ a living jackal head, and $J_b$ a living jackal body.

In our real world we generally have $H_h = H_b$ and $J_h = J_b$. But set the following values:

$\neg H_h$, $H_b$, $J_h$, $\neg J_b$,

and you have the god Anubis.
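
In the same spirit as the cat sketch above, Anubis is just a feature assignment that violates the real-world constraints $H_h = H_b$ and $J_h = J_b$ (again, purely illustrative code):

```python
# Anubis: a living jackal head on a living human body.
anubis = {"H_h": False, "H_b": True, "J_h": True, "J_b": False}

def real_world_consistent(v: dict) -> bool:
    # In reality, heads and bodies come as matched pairs.
    return v["H_h"] == v["H_b"] and v["J_h"] == v["J_b"]

print(real_world_consistent(anubis))  # False: impossible in reality, fine in myth
```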

Similarly, what is a dragon? Well, it's an entity such that the following are all true:

  • "is a reptile".
  • "is gigantic".
  • "breathes fire".
  • "is flying".

And, even though those features never go together in the real world, we can put them together in our imagination, and get a dragon.

Note that "is flying" seems more fundamental to a dragon than "has wings", thus all the wingless dragons that fly "by magic[1]". Our imagination seem comfortable with such combinations.

Dreams are always bewildering upon awakening, because they also combine contradictory assumptions. But these combinations are often beyond what our imaginations are comfortable with, so we get things like meeting your mother - who is also a wolf - and handing Dubai to her over the tea cups (that contain milk and fear).

"Alice in Wonderland" seems to be in between the wild incoherence of dream features, and the more restricted inconsistency of stories and imagination.


  1. Not that any real creature that size could fly with those wings anyway. ↩︎

Comments

Thanks! Good insights there. Am reproducing the comment here for people less willing to click through:

I haven't read the literature on "how counterfactuals ought to work in ideal reasoners" and have no opinion there. But as for the part where you suggest an empirical description of counterfactual reasoning in humans, I think I basically agree with what you wrote.

I think the neocortex has a zoo of generative models, and a fast way of detecting when two are compatible, and if they are, snapping them together like Legos into a larger model.

For example, the model of "falling" is incompatible with the model of "stationary"—they make contradictory predictions about the same boolean variables—and therefore I can't imagine a "falling stationary rock". On the other hand, I can imagine "a rubber wine glass spinning" because my rubber model is about texture etc., my wine glass model is about shape and function, and my spinning model is about motion. All 3 of those models make non-contradictory predictions (mostly because they're issuing predictions about non-overlapping sets of variables), so the three can snap together into a larger generative model.

So for counterfactuals, I suppose that we start by hypothesizing some core of a model ("a bird the size of an adult blue whale") and then searching out more little generative model pieces that can snap onto that core, growing it out as much as possible in different ways, until you hit the limits where you can't snap on any more details without making it unacceptably self-contradictory. Something like that...
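
As a toy gloss on the "snapping together like Legos" idea (my formalisation, not the commenter's): treat each little model as a partial set of predictions, and call a collection of models compatible when no two of them contradict each other on a shared variable.

```python
# Each generative model is a partial dict of predictions; models snap together
# only if they agree on every variable they both predict. Names are illustrative.
def compatible(*models: dict) -> bool:
    combined = {}
    for m in models:
        for var, val in m.items():
            if var in combined and combined[var] != val:
                return False  # contradictory predictions on a shared variable
            combined[var] = val
    return True

falling    = {"vertical_motion": "accelerating_down"}
stationary = {"vertical_motion": "none"}
rubber     = {"texture": "rubbery"}
wine_glass = {"shape": "stemmed_glass", "function": "holds_wine"}
spinning   = {"rotation": "spinning"}

print(compatible(falling, stationary))           # False: a "falling stationary rock"
print(compatible(rubber, wine_glass, spinning))  # True: a rubber wine glass, spinning
```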