Two failed attempts
I initially defined partial preferences in terms of foreground variables Y and background variables Z.
Then a partial preference would be defined by y+ and y− in Y, such that, for any z∈Z, the world described by (y+,z) would be better than the world described by (y−,z). The idea is that, everything else being equal (i.e. the same z), a world with y+ is better than a world with y−. The other assumption is that, within mental models, human preferences can be phrased as one or more binary comparisons. So if we have a partial preference like P1: "I prefer a chocolate ice-cream to getting kicked in the groin", then (y+,z) and (y−,z) are otherwise identical worlds with a chocolate ice-cream and a groin-kick, respectively.
Note that in this formalism, there are two subsets of the set of worlds, y+×Z and y−×Z, and a map l between them (which just sends (y+,z) to (y−,z)).
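To make this concrete, here is a minimal Python sketch of P1 under this Y×Z picture; the variable names and the toy set of background variables are illustrative assumptions, not from the original post:

```python
# Toy encoding of the original Y x Z formalism for P1 (names are illustrative).
# Worlds are (y, z) pairs: y is the foreground variable, z the background.

Y_PLUS = "chocolate ice-cream"
Y_MINUS = "kicked in the groin"
BACKGROUNDS = ["rainy Tuesday", "sunny Sunday"]   # hypothetical z values

def l(world):
    """Send (y+, z) to the otherwise-identical world (y-, z)."""
    y, z = world
    assert y == Y_PLUS, "l is only defined on the y+ worlds"
    return (Y_MINUS, z)

# The partial preference P1: for every background z, (y+, z) > (y-, z).
for z in BACKGROUNDS:
    better = (Y_PLUS, z)
    worse = l(better)
    print(better, ">", worse)
```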
In a later post, I realised that such a formalism can't capture seemingly simple preferences, such as P2: "n+1 people is better than n people". The problem is that such preferences don't talk about just two subsets of worlds, but many more.
Thus a partial preference was instead defined as a preorder. Now, a preorder is certainly rich enough to include preferences like P2, but it allows for far too many different types of structures, needing a complicated energy-minimisation procedure to turn a preorder into a utility function.
This post presents another formalism for partial preferences, that keeps the initial intuition but can capture preferences like P2.
The formalism
Let W be the (finite) set of all worlds, seen as universes with their whole history.
Let X be a subset of W, and let l be an injective (one-to-one) map from X to W. Define Y=l(X), the image of l, and l−1:Y→X as the inverse.
Then the preference is determined by:
For all x∈X, x>l(x).
If X and Y are disjoint, this just reproduces the original definition, with X=y+×Z and Y=y−×Z.
But it also allows preferences like P2, defining l(x) as something like "the same world as x, but with one fewer person". In that case, l maps some elements of X back into X itself[1].
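As a sanity check on the definition, here is a small Python sketch of P2 in this formalism, with worlds crudely labelled by their population; the labels and the population cap are illustrative assumptions:

```python
# Toy encoding of the new (X, l) formalism for P2 (labels are illustrative).
W = list(range(0, 6))            # worlds labelled by population 0..5
X = [n for n in W if n >= 1]     # worlds on which "one fewer person" is defined

def l(x):
    """'The same world as x, but with one fewer person.'"""
    return x - 1

Y = [l(x) for x in X]            # image of l; it overlaps with X itself

assert len(set(Y)) == len(X)     # l is injective on X
# The partial preference: every x in X is preferred to l(x).
for x in X:
    print(f"{x} people  >  {l(x)} people")
```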
Then for any element x∈X, we can construct its upwards and downwards chain:
…, l−3(x), l−2(x), l−1(x), x, l(x), l2(x), l3(x), ….
These chains end when they cycle: so there is an n and an m so that l−n(x)=lm(x) (equivalently, lm+n(x)=x).
If they don't cycle, the upwards chain ends when there is an l−n(x) which is not an element of Y (hence l−1 is not defined on it), and the downwards chain ends when there is an lm(x) which is not in X (and hence l is not defined on it).
So, for example, for P1, all the chains contain two elements only: x and l(x). For P2, there are no cycles, and the lower chain ends when the population hits zero, while the upper chain ends when the population hits some maximal value.
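Here is a minimal sketch of how one might walk these chains in code, under the assumption that worlds are hashable labels and that l and its inverse are given as Python functions (all names here are mine, not the post's):

```python
def full_chain(x, l, l_inv, X, Y):
    """Return (chain, is_cycle): the elements of x's chain, ordered from the
    most-preferred end downwards, plus a flag saying whether it is a cycle.
    l is defined on X, and l_inv on Y = l(X)."""
    # Downwards chain: x, l(x), l(l(x)), ... while l is still defined.
    down = [x]
    while down[-1] in X:
        nxt = l(down[-1])
        if nxt == x:             # looped back to x: the whole chain is a cycle
            return down, True
        down.append(nxt)
    # Upwards chain: l^-1(x), l^-2(x), ... while l^-1 is still defined.
    up = []
    cur = x
    while cur in Y:
        cur = l_inv(cur)
        up.append(cur)
    return list(reversed(up)) + down, False

# For P2 with X = {1..5}, l(x) = x - 1:
X, Y = set(range(1, 6)), set(range(0, 5))
print(full_chain(3, lambda x: x - 1, lambda y: y + 1, X, Y))
# -> ([5, 4, 3, 2, 1, 0], False)
```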
Utility differences between clearly comparable worlds
Since the worlds of X∪Y decompose into either chains or cycles via l, there is no need for the full machinery for utilities constructed in this post.
One thing we can define unambiguously is the relative utility between two elements of the same chain/cycle (sketched in code after the two cases):
If x and y=ln(x) are in the same cycle, then Ul(x)=Ul(y).
Otherwise, if x and y=ln(x) are in the same chain, then Ul(x)−Ul(y)=n.
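Given a chain produced as above (ordered from the most-preferred end), this relative utility can be read off directly; a sketch, reusing the hypothetical full_chain output from before:

```python
def relative_utility(x, y, chain, is_cycle):
    """U_l(x) - U_l(y), for x and y in the same chain or cycle."""
    if is_cycle:
        return 0                          # every world in a cycle is equally good
    # The chain is ordered best-to-worst, so y = l^n(x) sits n places after x.
    return chain.index(y) - chain.index(x)

chain, is_cycle = [5, 4, 3, 2, 1, 0], False
print(relative_utility(4, 1, chain, is_cycle))   # -> 3, since 1 = l^3(4)
```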
For now, let's normalise these relative utilities to ˆUl by normalising each chain individually (again sketched below); note that if every world in the chain is reachable, this is the same as the mean-max normalisation on each chain:
If x and y=ln(x) are in the same cycle, then ˆUl(x)=ˆUl(y).
Otherwise, if x and y=ln(x) are in the same chain with m total elements in the chain, then ˆUl(x)−ˆUl(y)=n/(m−1).
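A sketch of this per-chain normalisation, with the (assumed, arbitrary) convention that the top of each chain gets value 1 and the bottom value 0, so that adjacent worlds differ by 1/(m−1):

```python
def normalised_utilities(chain, is_cycle):
    """Assign hat-U_l values within one chain/cycle; only differences matter,
    so the choice of zero point here is an arbitrary convention."""
    m = len(chain)
    if is_cycle or m == 1:
        return {w: 0.0 for w in chain}     # cycles (and singletons) are flat
    # chain is ordered best-to-worst: top gets 1, bottom gets 0, step 1/(m-1)
    return {w: (m - 1 - i) / (m - 1) for i, w in enumerate(chain)}

print(normalised_utilities([5, 4, 3, 2, 1, 0], False))
# -> {5: 1.0, 4: 0.8, 3: 0.6, 2: 0.4, 1: 0.2, 0: 0.0}
```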
We could try to extend ˆUl to a global utility function which compares different chains and compares values in chains with values outside of X∪Y. But as we shall see in the next post, this doesn't work when combining different partial preferences.
Interpretation of l
The interpretation of l is something like "this is the key difference in features that causes the difference in world-rankings". So, for P1, l switches out a chocolate ice-cream and substitutes a groin-kick, while for P2, l simply removes one person from the world.
This means that, locally, we can express X∪Y in the same Y×Z formalism as in the first post. Here the Z are the background variables, while Y is a discrete variable that l operates on.
We cannot necessarily express this Y×Z product globally. Consider, for P2, a situation where z0 is an idyllic village, z1 is an Earthbound human population, and z2 a star-spanning civilization with extensive use of human uploads.
And if Y denotes the number of people in each world, it's clear that Y hits a low maximum for z0 (thousands?), can rise much higher for z1 (trillions?), and higher still for z2 (we'd need scientific notation). So though (10^20, z2) makes sense, (10^20, z0) is nonsense. Hence there is no global decomposition of these worlds as Y×Z.
Note that there is a similarity with CP-nets, if we consider this as expressing a preference over population size while keeping other variables constant. ↩︎