One-step hypothetical preferences

Stuart_Armstrong

Human preferences are time-inconsistent, and also contradictory.

That, by itself, is not a huge problem, but it's also the case that few human preferences are present at any given moment. At the moment, I'm focused on finding the best explanation to get my ideas through to you, the reader; I'm not focused on my moral preferences, personal safety desires, political beliefs, or taste in music.

If anyone asked me about those, I could immediately bring them to mind. My answers to standard questions are kinda in the background, accessible but not accessed. Wei Dai made a similar point about translators: they have a lot of trained knowledge that is not immediately accessible to their introspection. And only by giving them the inputs they were trained on (eg words, sentences,...) can you bring that knowledge to the fore.

In this post, I'll try and formalise these accessible preferences, starting with formalising preferences in general.

Basic preferences setup

This section will formalise the setup presented in Alice's example. Let $W$ be the set of all possible worlds. A human makes use of a model $M$. This model contains a lot of variables $\{P_{i}\}$, called properties. Each $P_{i}$ takes values in a domain $D_{i}$.

A basic set $S$ of states in $M$ is a set of possible values for some of the $P_{i}$. Thus $S = \{S_{i}\}$, with $S_{i} \subset D_{i}$. The property $P_{i}$ is unconstrained in $S$ if $S_{i} = D_{i}$. A general set of states is a union of basic $S$; let $\mathcal{S}$ be the set of all these sets of states.

For example, a human could be imagining four of their friends, and the $P_{i j}$ could be whether friend $i$ is sleeping with friend $j$ ($6$ different Boolean $P_{i j} = P_{j i}$), and also whether a third friend $k$ believes two others are sleeping together ($12$ different $P_{i j k} = P_{j i k}$, taking values in $\{$sleeping together, not sleeping together, don't know$\}$).

Then a statement of human gossip like "X is sleeping with Y, but A doesn't realise it; in fact, A thinks that Y is sleeping with Z, which is totally not true!" is encoded as:

$S_{G} = \{P_{X Y} = 1,\ P_{Y Z} = 0,\ P_{X Y A} \in \{\text{"don't know"}, \text{"not sleeping together"}\},\ P_{Y Z A} = \text{"sleeping together"}\}$, with the other $P$s unconstrained.

It's interesting how unintuitive that formulation is, compared with how our brains instinctively parse gossip.
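To make the encoding concrete, here is a minimal Python sketch of the four-friend model and the gossip set $S_{G}$. The representation is my own assumption, not part of the formalism: properties are dictionary keys, a basic set maps every property to its allowed values $S_{i} \subset D_{i}$, and the names `domains`, `basic_set`, `pair` and `unconstrained` are purely illustrative.

```python
from itertools import combinations

FRIENDS = ["X", "Y", "Z", "A"]
BELIEFS = frozenset({"sleeping together", "not sleeping together", "don't know"})

# Domains D_i. Pairs are unordered (P_ij = P_ji), so they are stored as frozensets.
domains = {}
for i, j in combinations(FRIENDS, 2):
    p = frozenset({i, j})
    domains[p] = frozenset({0, 1})          # Boolean P_ij
    for k in FRIENDS:
        if k not in p:
            domains[(p, k)] = BELIEFS       # P_ijk: what k believes about i and j


def pair(a, b):
    """Key for the unordered pair of friends a and b."""
    return frozenset({a, b})


def basic_set(constraints):
    """Build a basic set of states: unlisted properties stay unconstrained."""
    s = dict(domains)
    for prop, allowed in constraints.items():
        allowed = frozenset(allowed)
        if not allowed <= domains[prop]:
            raise ValueError(f"values outside the domain of {prop!r}")
        s[prop] = allowed
    return s


def unconstrained(s):
    """Properties P_i with S_i = D_i."""
    return [p for p in s if s[p] == domains[p]]


# The gossip statement about X, Y, Z and A.
S_G = basic_set({
    pair("X", "Y"): {1},
    pair("Y", "Z"): {0},
    (pair("X", "Y"), "A"): {"don't know", "not sleeping together"},
    (pair("Y", "Z"), "A"): {"sleeping together"},
})
```

With four friends this gives the 6 Boolean pair properties and 12 belief properties mentioned above; $S_{G}$ constrains four of them and leaves the rest free.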

To be of any use, these symbols need to be grounded. This is achieved via a function $g$ that takes a set of states $S$ and maps it to a set of worlds: $g(S) \subset W$.

Finally, the human expresses a judgement about the states of $M$, mentally categorising one set of states as better than another. This is an anti-symmetric partial function $J : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ on pairs of sets of states, which is non-trivial on at least one pair of inputs.

For example, if $S_{G}$ is the gossip set above, and $S_{G}'$ is the same statement with $P_{Y Z A} = \text{"not sleeping together"}$, then a human who values honesty might judge $J(S_{G}, S_{G}') = -1$; ie it is worse if $A$ believes a lie about $Y$ and $Z$.

The sign of $J(S, S')$ informs which set the human prefers; the magnitude is the difficult-to-define weight or intensity of the preference.
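As a rough illustration of what "anti-symmetric partial function" means in practice, here is a small Python sketch. The `Judgement` class is hypothetical, the sets of states are stood in for by string labels, and returning `None` for undefined pairs is just my convention for partiality.

```python
class Judgement:
    """Anti-symmetric partial function J on pairs of (labelled) sets of states."""

    def __init__(self):
        self._values = {}

    def set(self, s, s_prime, value):
        # Anti-symmetry forces J(S, S) = 0.
        if s == s_prime and value != 0:
            raise ValueError("J(S, S) must be 0")
        # Record both orientations so that J(S', S) = -J(S, S') always holds.
        self._values[(s, s_prime)] = value
        self._values[(s_prime, s)] = -value

    def __call__(self, s, s_prime):
        # None means the human has no judgement on this pair.
        return self._values.get((s, s_prime))


J = Judgement()
J.set("S_G", "S_G'", -1.0)   # worse if A believes a lie about Y and Z
```

Here `J("S_G'", "S_G")` is `1.0` by anti-symmetry, while any pair the human has not judged returns `None`.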

Hypotheticals posed to the human

Let $M_{J}$ be the set of possible pairs $(M, J)$ defined in the previous section. Humans rarely consider many $(M, J)$ at the same time. We often only consider one, or zero.

A hypothetical is some possible short intervention - a friend asks them a question, they get an email, a TV in the background shows something salient - that will cause a human to mentally use a model $M$ and pass judgement $J$ within it. Note that this is not the same as Paul Christiano's definition of ascription: we don't actually need the human to answer anything, just to think.

So if $H_{t}$ is the set of possible hypothetical interventions at time $t$, we have a (counterfactual) map $f$ from $H_{t}$ to $M_{J}$.

Now, not all moments are ideal for a human to do much reflection (though a lot of instinctive reactions are also very informative). So it might be good to expand the time a bit, to, say, $T =$ a week, and consider all the models that a human could hypothetically be made to consider in that time.

So let $H_{t}^{T}$ be the set of hypothetical short interventions from time $t$ to $t + T$, given that this intervention is the first in that time period. Then there is a natural map

$f : H_{t}^{T} \to M_{J}$.

Idealised object

The map $f$ is a highly idealised and counterfactual object - there is no way we can actually test a human on the vast number of possible interventions. So the AI would not be tasked with "use $f$ to establish human preferences", but "estimate $f$ to estimate human preferences".

The $f$ will also reveal a lot of contradictions, since humans often have different opinions on the same subject, depending on how the intervention or question is phrased. Different phrasings may trigger different internal models of the same issue, or even different judgements within the same model. And, of course, the same intervention at different times (or by different agents) may trigger different reactions.
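To show what such contradictions might look like in an estimate of $f$, here is a toy Python sketch. Everything in it is made up for illustration: the interventions, the model labels, the numerical judgements, and the `Response` and `contradictions` names. It also assumes, for simplicity, that every judgement on a given pair of sets is recorded in the same orientation $(S, S')$.

```python
from collections import defaultdict
from typing import NamedTuple


class Response(NamedTuple):
    model: str     # label for the model M the intervention triggers
    pair: tuple    # labels for the two sets of states (S, S') being compared
    value: float   # the judgement J(S, S')


# A hypothetical estimate of f: intervention -> summary of the (M, J) it triggers.
estimated_f = {
    "Would you tell A the truth about Y and Z?":
        Response("honesty", ("A deceived", "A informed"), -1.0),
    "Would you spoil A's good mood with gossip?":
        Response("kindness", ("A deceived", "A informed"), 0.5),
    "Is lying to friends acceptable?":
        Response("honesty", ("A deceived", "A informed"), -2.0),
}


def contradictions(f_estimate):
    """Return the pairs (S, S') on which different interventions disagree in sign."""
    by_pair = defaultdict(list)
    for response in f_estimate.values():
        by_pair[response.pair].append(response.value)
    conflicted = []
    for pair, values in by_pair.items():
        signs = {(v > 0) - (v < 0) for v in values if v != 0}
        if len(signs) > 1:
            conflicted.append(pair)
    return conflicted
```

On this toy data, the "honesty" and "kindness" framings disagree on the sign of the judgement, so `contradictions(estimated_f)` flags the pair `("A deceived", "A informed")`.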

But dealing with contradictions is just one of the things that we have to sort out with human preferences.

Minimum modification

I mentioned the interventions should be short; that $T$ should be a short period; and that the interventions in $H_{t}^{T}$ should be the first in that time period. The whole idea is to avoid "modifying" the human too much, or giving the AI too much power to change, rather than reflect, the human's values. The human's reaction should be as close as possible to an unvarnished initial reaction.

There may be other ways of reducing the AI's influence, but it is still useful to get these initial reactions.

One-step hypotheticals

In slight contrast with the previous section, it is very valuable to get the human to reflect on new issues they hadn't considered before. For example, we could introduce them to philosophical thought experiments they hadn't seen before (maybe the trolley problem or the repugnant conclusion, or unusual variants of these), or present ideas that cut across their usual political boundaries, or the boundaries of their categories (eg whether Neanderthals should have human rights if a tribe of them were suddenly discovered today).

This is, in a sense, a minimum extrapolation, the very first tentative step of CEV. We are not asking what the human would think if they were smarter, but instead what they would think if they encountered a novel problem for the first time.

These "one-step hypotheticals" are thus different from the human's everyday current judgement, yet don't involve transforming the human into something else.

EDIT: Avturchin asks whether I expect these one-step hypotheticals to reveal hidden preferences, or to force humans to make a choice, knowing that they might have made a different choice in different circumstances.

The answer is... a bit of both. I expect the hypotheticals to sometimes contradict each other, depending on the phrasing and the timing. I expect them to contradict each other more than more usual questions ("zero-step hypotheticals") do.

But I don't expect the answers to be completely random, either. There will be a lot of information there. And the pattern of different $H_{t}^{T}$ leading to different or contradictory $J$ is relevant, and not random.
