Intertheoretic utility comparison: simple theory

Stuart_Armstrong

There's been a lot of work on how to reach agreement between people with different preferences or values. In practice, reaching agreement can be tricky, because of issues of extortion/trade and how the negotiations actually play out.

To put those issues aside, let's consider a much simpler case: where a single agent is uncertain about their own utility function. Then there is no issue of extortion, because the agent's opponent is simply itself.

This type of comparison is called intertheoretic, rather than interpersonal.

A question of scale

It would seem that if the agent believed with probability $p$ that it followed utility $u$ , and $1 - p$ that it followed utility $v$ , then it should simply follow utility $w = (p) u + (1 - p) v$ .

But this is problematic, because $u$ and $v$ are only defined up to positive affine transformations. Translations are not a problem: sending $u$ to $u + c$ sends $w$ to $w + p c$ . But scalings are: sending $u$ to $r u$ does not usually send $w$ to any scaled version of $w$ .
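A small numerical illustration may help here. The sketch below uses made-up utilities over three hypothetical strategies (the numbers, and the helper names `mix` and `best`, are assumptions for illustration only). It checks that translating $u$ leaves the preferred strategy under $w$ unchanged, while rescaling $u$ can change it.

```python
def mix(p, u, v):
    """Pointwise mixture w = p*u + (1-p)*v over the same list of strategies."""
    return [p * a + (1 - p) * b for a, b in zip(u, v)]

def best(w):
    """Index of the strategy that maximises w (first one on ties)."""
    return max(range(len(w)), key=lambda i: w[i])

# Hypothetical utilities over three strategies s0, s1, s2 (illustrative numbers).
u = [1.0, 0.0, 0.6]
v = [0.0, 1.0, 0.6]
p = 0.5

w = mix(p, u, v)                          # roughly [0.5, 0.5, 0.6]
w_shift = mix(p, [x + 10 for x in u], v)  # translation: w + p*10
w_scale = mix(p, [3 * x for x in u], v)   # scaling u by r = 3: roughly [1.5, 0.5, 1.2]

best_original = best(w)       # s2, the compromise strategy
best_shifted = best(w_shift)  # still s2
best_scaled = best(w_scale)   # s0, the favourite of u
```

Since $u$ and $3u$ belong to the same utility class, the mixture depends on which representative is picked, and this dependence is what the rest of the post tries to resolve.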

So if we identify $[u]$ as the equivalence class of utilities equivalent to $u$ , then we can write $w = (p) u + (1 - p) v$ , but it's not meaningful to write $[w] = (p) [u] + (1 - p) [v]$ .

For clarity, we'll call things like $u$ (which map worlds to real values) utility functions, while $[u]$ will be called utility classes.

The setup

This is work done in collaboration with Toby Ord, Owen Cotton-Barratt, and Will MacAskill. We had some slightly different emphases during that process. In this post, I'll present my preferred version, while adding the more general approach at the end.

We will need the structure described in this post:

1. A finite set $S$ of deterministic strategies the agent can take.
2. A set $U$ of utility classes the agent might follow.
3. A distribution $p$ over $U$ , reflecting the agent's uncertainty over its own utility functions.
4. Let $U_{p} \subset U$ be the subset to which $p$ assigns a non-zero weight. We'll assume $p$ puts no weight on trivial, constant utility functions.

We'll assume here that $p$ never gets updated: the agent never sees any evidence that changes its values. The issue of updating $p$ is analysed in the sections on reward learning agents.

We'll be assuming that there is some function $f$ that takes in $S$ and $p$ and outputs a single utility class $f (S, p) \in U$ reflecting the agent's values.

Basic axioms

Relevant data: If the utility classes $[u]$ and $[v]$ have the same values on all of $S$ , then they are interchangeable from $f$ 's perspective. Thus, in the terminology of this post, we can identify $U$ with $k (U) / \sim$ .

This gives $U$ the structure of $S \cup \{0\}$ , where $S$ here denotes a sphere rather than the strategy set, and $0$ corresponds to the trivial utility that takes the same value on every strategy. The topology of $U$ is the standard topology on the sphere, and the only open set containing $0$ is the whole of $U$ .

Then with a reasonable topology on the probability distribution on $U$ -- such as the weak topology? -- this leads to the next axiom:

Continuity: the function $f$ is continuous in $p$ .
Individual normalisation: there is a function $h$ that maps $U$ to individual utility functions, such that $f (S, p) = [\int_{U_{p}} h (u) p]$ (using $p$ as a measure on $U_{p}$ ).

The previous axiom means that all utility classes get normalised individually, then added together according to their weight in $p$ .
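As a minimal sketch of "normalise individually, then mix", the code below uses one particular choice of $h$ : rescaling each utility to mean zero and unit standard deviation over the strategies. That choice, the function names, and the numbers are assumptions made for illustration; the axiom itself does not single out any specific $h$ .

```python
from math import isclose
from statistics import mean, pstdev

def h_variance(u):
    """Hypothetical normaliser h: mean zero, unit standard deviation over S."""
    m, s = mean(u), pstdev(u)
    if s == 0:
        raise ValueError("constant utilities are excluded by assumption")
    return [(x - m) / s for x in u]

def aggregate(p, utilities, h=h_variance):
    """Representative of f(S, p): the p-weighted sum of normalised utilities."""
    normalised = [h(u) for u in utilities]
    n = len(normalised[0])
    return [sum(w * hu[i] for w, hu in zip(p, normalised)) for i in range(n)]

def close(a, b):
    """Elementwise comparison up to floating-point error."""
    return all(isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b))

# Hypothetical utilities over three strategies, with equal weight on each.
u = [1.0, 0.0, 0.6]
v = [0.0, 1.0, 0.6]
p = [0.5, 0.5]

base = aggregate(p, [u, v])
# Replace u by another representative of the same class [u]: 3u + 10.
rescaled = aggregate(p, [[3 * x + 10 for x in u], v])
scale_invariant = close(base, rescaled)

# This particular h also satisfies h[-u] = -h[u].
reflection_holds = close(h_variance([-x for x in u]),
                         [-x for x in h_variance(u)])
```

Because $h$ depends only on the class $[u]$ and not on the representative chosen, the weighted sum is well defined, which the naive mixture of unnormalised utilities is not.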

Symmetry: If $ρ$ is a stable permutation of $S$ , then $f (S, p) \circ ρ = f (S, p \circ ρ)$ .

Symmetry essentially means that the labels of $S$ , or the details of how the strategies are implemented, do not matter.

Utility reflection: $h [u] = - h [- u]$ .
Cloning indifference: If there exist $s_{1}, s_{2} \in S$ such that for all $u$ in $U$ on which $p$ is non-zero, $u (s_{1}) = u (s_{2})$ , then $f (S, p) = f (S - \{s_{1}\}, p)$ .

Cloning indifference means that the normalisation procedure does not care about multiple strategies that are equivalent on all possible utilities: it treats these strategies as if they were a single strategy.
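To make cloning indifference concrete, the sketch below compares two hypothetical normalisers on a strategy set where the last strategy is a clone of the one before it. Both normalisers, the helper names, and the numbers are illustrative assumptions, not proposals taken from the post. A range-based normaliser only looks at the maximum and minimum, so a clone cannot affect it; a mean-and-variance normaliser weights every strategy, so a clone shifts the result.

```python
from math import isclose
from statistics import mean, pstdev

def h_range(u):
    """Hypothetical normaliser: centre at the midrange, scale so max - min = 1."""
    lo, hi = min(u), max(u)
    return [(x - (lo + hi) / 2) / (hi - lo) for x in u]

def h_variance(u):
    """Hypothetical normaliser: mean zero, unit standard deviation."""
    m, s = mean(u), pstdev(u)
    return [(x - m) / s for x in u]

def aggregate(p, utilities, h):
    """p-weighted sum of individually normalised utilities."""
    normalised = [h(u) for u in utilities]
    n = len(normalised[0])
    return [sum(w * hu[i] for w, hu in zip(p, normalised)) for i in range(n)]

def same_class(a, b):
    """True if a and b agree up to a positive affine transformation.

    Both are mapped to a canonical representative with h_range first.
    """
    return all(isclose(x, y, abs_tol=1e-9)
               for x, y in zip(h_range(a), h_range(b)))

# Strategies s0, s1, s2, s3, where s3 is a clone of s2 under both utilities.
u = [1.0, 0.0, 0.6, 0.6]
v = [0.0, 1.0, 0.2, 0.2]
p = [0.5, 0.5]

def clone_indifferent(h):
    """Compare f on the full set (restricted to s0..s2) with f on the set minus s3."""
    with_clone = aggregate(p, [u, v], h)[:-1]
    without_clone = aggregate(p, [u[:-1], v[:-1]], h)
    return same_class(with_clone, without_clone)

range_ok = clone_indifferent(h_range)        # clone does not matter
variance_ok = clone_indifferent(h_variance)  # clone changes the relative weights
```

On these numbers the range-based rule passes the check and the variance-based rule fails it, which shows the axiom has real bite: it rules out some otherwise natural normalisations.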

We might want a stronger result, an independence of irrelevant alternatives. But this clashes with symmetry, so the following axioms attempt to get a weaker version of that requirement.

Relevance axioms

The above axioms are sufficient for the basics, but, as we'll see, they're compatible with a lot of different ways of combining utilities. The following two axioms attempt to put some sort of limitations on these possibilities.

First of all, we want to define events that are irrelevant. In the terminology of this post, let $h a$ be a partial history (ending in an action), with at least two possible observations afterwards: $o$ and $o^{'}$ .

Then $S_{h a} = S_{h a o} \times S_{h a o^{'}}$ . Then if there exists a bijection $σ$ between $S_{h a o}$ and $S_{h a o^{'}}$ such that, for all $u$ with $[u] \in U_{p}$ , $u (s) = u (σ (s))$ , then the observation $o$ versus $o^{'}$ is irrelevant. See here for more on how to define $u$ on $S_{h a o}$ in this context.

Thus irrelevance means that the utilities in $U_{p}$ really do not 'care' about $o$ versus $o^{'}$ , and that the increased strategy set it allows is specious. So if we remove $o$ as a possible observation (substituting $o^{'}$ instead) this should make no difference:

Weak irrelevance: If $o$ versus $o^{'}$ given $h a$ is irrelevant for $p$ , then making $o$ (xor $o^{'}$ ) impossible does not change $f$ .
Strong irrelevance: If $o$ versus $o^{'}$ given $h a$ is irrelevant for $p$ and there is at least one other possible observation $o^{''}$ after $h a$ , then making $o$ (xor $o^{'}$ ) impossible does not change $f$ .

Full theory

In our full analysis, we considered other approaches and properties, and I'll briefly list them here.

First of all, there is a set of prospects/options $O$ that may be different from the set of strategies $S$ . This allows you to add other moral considerations, not just strictly consequentialist expected utility reasoning.

In this context, the $f$ defined above was called a 'rating function', that rated the various utilities. With $O$ , there are two other possibilities, the 'choice function' which selected the best option, and the permissibility function, which lists the options you are allowed to take.

If we're considering options as outputs, rather than utilities, then we can do things like requiring the options to be Pareto only. We could also consider that the normalisation should stay the same if we remove the non-Pareto options or strategies. We might also consider that it's the space of possible utilities that we should care about; so, for instance, if $u (s_{1}) = 1$ , $u (s_{2}) = 0$ and $u (s_{3}) = - 1$ , and similar results hold for all $[u]$ in $U_{p}$ , then we may as well drop $s_{2}$ from the strategy set, as its image is in the mixture of the other strategies.

Finally, some of the axioms above were presented in weaker forms (eg the individual normalisations) or stronger (eg independence of irrelevant alternatives).

You talk like $p$ is countably supported, but everything you've said generalizes to arbitrary probability measures $p$ over $U$ , if you replace "for all $u$ assigned nonzero probability by $p$ " with "for all $u$ in some set assigned probability $1$ by $p$ ".

If you endow $U$ with the quotient topology from $R^{S} / \sim$ , then the only open set containing $0$ is all of $U$ . This is a funny-looking topology, but I think it is ultimately the best one to use. With this topology, every function to $U$ is continuous at any point that maps to $0$ . As a consequence, the assumption "if $f (S, p) \in S$ " in the continuity axiom is unnecessary. More importantly, what topology on the space of probability distributions did you have in mind? Probably the weak topology?

I find independence of irrelevant alternatives more compelling than symmetry, but as long as we're accepting symmetry instead, it probably makes sense to strengthen the assumption to isomorphism-invariance: If $ρ : S_{1} \to S_{2}$ is a bijection, then $f (S_{2}, p) \circ ρ = f (S_{1}, p \circ ρ)$ .

The relevance axioms section is riddled with type errors. $u (s) = u (σ (s))$ only makes sense if $S = S_{\neg X} ⊔ S_{X}$ , which would make sense if $S$ represented a space of outcomes rather than a space of strategies (which seems to me to be a more natural space to pay attention to anyway), or if $X$ is fully under the agent's control, whereas $S = S_{\neg X} \times S_{X}$ makes sense if $X$ is fully observable to the agent. If $X$ is neither fully under the agent's control nor fully observable to the agent, then I don't think either of these make sense. If we're using $\times$ instead of $⊔$ , then formalizing irrelevance seems trickier. The best I can come up with is that $p$ is supported on $u$ of the form $u (s, t) = (1 - q) ~ u (σ (s)) + q ~ u (t)$ , where $q$ is the probability of $X$ . The weak and strong irrelevance axioms also contain type errors, since the types of the output and second input of $f$ depend on its first input, though this can probably be fixed.

I didn't understand any of the full theory section, so if any of that was important, it was too brief.

Yes to your two initial points; I wanted to keep the exposition relatively simple.

Do you disagree with the reasoning presented in the picture-proof? That seems a simple argument against IIA. Isomorphism invariance makes sense, but I wanted to emphasise the inner structure of $S$ .

Updated the irrelevance section to clarify that $X$ is fully observed and happens before the agent takes any actions, and that $u (s)$ should be read as $u (s | \neg X)$ .

The full theory section is to write up some old ideas, to show that the previous axioms are not set in stone but that other approaches are possible and were considered.

Your picture proof looks correct, but it relies on symmetry, and I was saying that I prefer IIA instead of symmetry. I'm not particularly confident in my endorsement of IIA, but I am fairly confident in my non-endorsement of symmetry. In real situations, strategies/outcomes have a significant amount of internal structure which seems relevant and is not preserved by arbitrary permutations.

You've just replaced a type error with another type error. Elements of $U$ are just (equivalence classes of) functions $S \to R$ . Conditioning like that isn't a supported operation.

You're right. I've drawn the set of utility functions too broadly. I'll attempt to fix this in the post.

Ok, I chose the picture proof because it was a particularly simple example of symmetry. What kind of internal structure are you thinking of?

For strategies: This ties back in to the situation where there's an observable event $X$ that you can condition your strategy on, and the strategy space has a product structure $S = S_{X} \times S_{\neg X}$ . This product structure seems important, since you should generally expect utility functions $u$ to factor in the sense that $u (s, t) = q u_{X} (s) + (1 - q) u_{\neg X} (t)$ for some functions $u_{X}$ and $u_{\neg X}$ , where $q$ is the probability of $X$ (I think for the relevance section, you want to assume that whenever there is such a product structure, $p$ is supported on utility functions that factor, and you can define conditional utility for such functions). Arbitrary permutations of $S$ that do not preserve the product structure don't seem like true symmetries, and I don't think it should be expected that an aggregation rule should be invariant under them. In the real world, there are many observations that people can and do take into account when deciding what to do, so a good model of strategy-space should have a very rich structure.

For outcomes, which is what utility functions should be defined on anyway: Outcomes differ in terms of how achievable they are. I have an intuition that if an outcome is impossible, then removing it from the model shouldn't have much effect. Like, you shouldn't be able to rig the aggregator function in favor of moral theory 1 as opposed to moral theory 2 by having the model take into account all the possible outcomes that could realistically be achieved, and also a bunch of impossible outcomes that theory 2 thinks are either really good or really bad, and theory 1 thinks are close to neutral. A natural counter-argument is that before you know which outcomes are impossible, any Pareto-optimal way of aggregating your possible preference functions must not change based on what turns out to be achievable; I'll have to think about that more. Also, approximate symmetries between people's preferences seem relevant to interpersonal utility comparison in practice, in the sense that two people's preferences tend to look fairly similar to each other in structure, but with each person's utility function centered largely around what happens to themselves instead of the other person, and this seems to help us make comparisons of the form "the difference between outcomes 1 and 2 is more important for person A than for person B"; I'm not sure if this way of describing it is making sense.

OK, got a better formalism: https://agentfoundations.org/item?id=1449

I think I've got something that works; I'll post it tomorrow.