In my research agenda on synthesising human preferences, I didn't mention explicitly using human rationality to sort through conflicting partial preferences.
This was, in practice, deferred to the "meta-preferences about synthesis". In this view, rationality is just one way of resolving contradictory lower-level preferences, and we wouldn't need to talk about rationality, just observe that it existed - often - within the meta-preferences.
Nevertheless, I think we might gain by making rationality - and its issues - an explicit part of the process.
Defining rationality in preference resolution
Explicit choice
We can define rationality in this area by using the one-step hypotheticals. If there is a contradiction between lower-level preferences, then that contradiction is explained to the human subject, and they can render a verdict.
This process can, of course, result in different outcomes depending on how the question is phrased - especially if we allow the one-step hypothetical to escalate to a "hypothetical conversation" where more arguments and evidence are considered.
So the distribution of outcomes would be interesting. If, in cases where most of the relevant argument/evidence is mentioned, the human tends to come down on one side, then that is a strong contender for being their "true" rational resolution of the issues.
However, if instead the human answers in many different ways - especially if the answer changes because of small changes in how the evidence is ordered, how long the human has to think, whether they get all the counter-evidence or not, and so on - then their preference seems to be much weaker.
For example, I expect that most people similar to me would converge on one answer to questions like "does expected lives saved dominate most other considerations in medical interventions?", while having wildly divergent views on "what's the proper population ethics?".
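To make the "distribution of outcomes" idea concrete, here is a minimal sketch of how one might summarise the verdicts gathered from many variants of the hypothetical conversation. Everything in it - the Verdict record, the 80% convergence threshold, the example answers - is my own illustrative assumption, not part of the research agenda itself:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Verdict:
    variant: str          # how the question was phrased / evidence ordered (illustrative)
    answer: str           # which side of the contradiction the human came down on
    full_evidence: bool   # did this variant present most of the relevant evidence?

def rational_resolution(verdicts, threshold=0.8):
    """Return the dominant answer among well-informed variants, or None.

    If the human converges on one answer whenever most of the relevant
    evidence is presented, treat that as a candidate "true" resolution;
    if the answers scatter with small changes in framing, return None
    (the preference looks weak).
    """
    informed = [v.answer for v in verdicts if v.full_evidence]
    if not informed:
        return None
    answer, count = Counter(informed).most_common(1)[0]
    return answer if count / len(informed) >= threshold else None

# Example: answers scatter across framings, so no strong resolution is found.
verdicts = [
    Verdict("evidence A first", "maximise lives saved", True),
    Verdict("evidence B first", "respect patient autonomy", True),
    Verdict("short time to think", "maximise lives saved", False),
]
print(rational_resolution(verdicts))  # None: only 50% agreement among informed variants
```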
Should this matter?
Another use of rationality could be to ask the human explicitly whether certain aspects of their preferences should matter. Many humans seem to have implicit biases, whether racial or otherwise; many humans believe that it is wrong to have these biases, or at least wrong to let them affect their decisions[1].
Thus another approach for rationality is to query the subject as to whether some aspect should be affecting their decisions or not (because humans only consider a tiny space of options at once, it's better to ask "should X, Y, and Z be relevant", rather than "are A, B, and C the only things that should be relevant?").
These kinds of rationality questions can then be treated in the same way as above.
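Here is a small sketch of that query pattern, with a stubbed-out subject standing in for the actual one-step hypothetical; the feature names and the ask() interface are my own assumptions:

```python
def endorsed_features(features, ask):
    """Keep only the features the subject explicitly endorses as relevant.

    Asks "should X be relevant?" feature by feature, rather than asking the
    subject to enumerate everything that should be relevant (humans only
    consider a tiny space of options at once).
    """
    return {f for f in features if ask(f"Should '{f}' be relevant to this decision?")}

# Stubbed-out subject who rejects letting an implicit bias affect the decision:
answers = {
    "Should 'medical need' be relevant to this decision?": True,
    "Should 'applicant's accent' be relevant to this decision?": False,
}
print(endorsed_features(["medical need", "applicant's accent"], answers.get))
# {'medical need'}
```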
Weighting rationality
Despite carving out a special place for "rationality", the central thrust of the research agenda remains: a human's rational preferences will dominate their other preferences only if they put great weight on their own rationality.
Real humans don't always change their views just because they can't currently figure out a flaw in an argument; nor would we want them to, especially if their own rationality skills are limited or underused.
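As a purely illustrative sketch of that weighting - the linear blend and the weight parameter are my own assumptions, not the agenda's actual synthesis procedure:

```python
def synthesise(raw_utility, rational_utility, rationality_weight):
    """Blend an unreflective preference with its rational resolution.

    rationality_weight is how much weight the human themselves places on
    their own rationality; at 0 the rational resolution never affects the
    ordering of outcomes, which is the same as not having it at all.
    """
    w = max(0.0, min(1.0, rationality_weight))
    return (1 - w) * raw_utility + w * rational_utility

# A human who puts little weight on argument mostly keeps their raw preference:
print(synthesise(raw_utility=1.0, rational_utility=-1.0, rationality_weight=0.1))  # 0.8
```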
Having preferences that never affect any decisions at all is, in practice, the same as not having those preferences: they never affect the ordering of possible universes. ↩︎