Moral strategies at different capability levels

Richard_Ngo

Let’s consider three ways you can be altruistic towards another agent:

You care about their welfare: some metric of how good their life is (as defined by you). I’ll call this care-morality - it endorses things like promoting their happiness, reducing their suffering, and hedonic utilitarian behavior (if you care about many agents).
You care about their agency: their ability to achieve their goals (as defined by them). I’ll call this cooperation-morality - it endorses things like honesty, fairness, deontological behavior towards others, and some virtues (like honor).
You care about obedience to them. I’ll call this deference-morality - it endorses things like loyalty, humility, and respect for authority.

I think a lot of unresolved tensions in ethics comes from seeing these types of morality as in opposition to each other, when they’re actually complementary:

Care-morality mainly makes sense as an attitude towards agents who are much less capable than you, and/or can't make decisions for themselves - for example animals, future people, and infants.
- In these cases, you don’t have to think much about what the other agents are doing, or what they think of you; you can just aim to produce good outcomes in the world. Indeed, trying to be cooperative or deferential towards these agents is hard, because their thinking may be much less sophisticated than yours, and you might even get to choose what their goals are.
- Applying only care-morality in multi-agent contexts can easily lead to conflict with other agents around you, even when you care about their welfare, because:
  - You each value (different) other things in addition to their welfare.
  - They may have a different conception of welfare than you do.
  - They can’t fully trust your motivations.
- Care morality doesn’t focus much on the act-omission distinction. Arbitrarily scalable care-morality looks like maximizing resources until the returns to further investment are low, then converting them into happy lives.
Cooperation-morality mainly makes sense as an attitude towards agents whose capabilities are comparable to yours - for example others around us who are trying to influence the world.
- Cooperation-morality can be seen as the “rational” thing to do even from a selfish perspective (e.g. as discussed here), but in practice it’s difficult to robustly reason through the consequences of being cooperative without relying on ingrained cooperative instincts, especially when using causal decision theories. Functional decision theories make it much easier to rederive many aspects of intuitive cooperation-morality as optimal strategies (as discussed further below).
- Cooperation-morality tends to uphold the act-omission distinction, and a sharp distinction between those within versus outside a circle of cooperation. It doesn’t help very much with population ethics - naively maximizing the agency of future agents would involve ensuring that they only have very easily-satisfied preferences, which seems very undesirable.
- Arbitrarily scalable cooperation-morality looks like forming a central decision-making institution which then decides how to balance the preferences of all the agents that participate in it.
- A version of cooperation-morality can also be useful internally: enhancing your own agency by cultivating virtues which facilitate cooperation between different parts of yourself, or versions of yourself across time.
Deference-morality mainly makes sense as an attitude towards trustworthy agents who are much more capable than you - for example effective leaders, organizations, communities, and sometimes society as a whole.
- Deference-morality is important for getting groups to coordinate effectively - soldiers in armies are a central example, but it also applies to other organizations and movements to a lesser extent. Individuals trying to figure out strategies themselves undermines predictability and group coordination, especially if the group strategy is more sophisticated than the ones the individuals generate.
- In practice, it seems very easy to overdo deference-morality - compared to our ancestral environment, it seems much less useful today. Also, whether or not deference-morality makes sense depends on how much you trust the agents you’re deferring to - but it’s often difficult to gain trust in agents more capable than you, because they’re likely better at deception than you. Cult leaders exploit this.
- Arbitrarily-scalable deference-morality looks like an intent-aligned AGI. One lens on why intent alignment is difficult is that deference-morality is inherently unnatural for agents who are much more capable than the others around them.

Cooperation-morality and deference-morality have the weakness that they can be exploited by the agents we hold those attitudes towards; and so we also have adaptations for deterring or punishing this (which I’ll call conflict-morality). I’ll mostly treat conflict-morality as an implicit part of cooperation-morality and deference-morality; but it’s worth noting that a crucial feature of morality is the coordination of coercion towards those who act immorally.

Morality as intrinsic preferences versus morality as instrumental preferences

I’ve mentioned that many moral principles are rational strategies for multi-agent environments even for selfish agents. So when we’re modeling people as rational agents optimizing for some utility function, it’s not clear whether we should view those moral principles as part of their utility functions, versus as part of their strategies. Some arguments for the former:

We tend to care about principles like honesty for their own sake (because that was the most robust way for evolution to actually implement cooperative strategies).
Our cooperation-morality intuitions are only evolved proxies for ancestrally-optimal strategies, and so we’ll probably end up finding that the actual optimal strategies in other environments violate our moral intuitions in some ways. For example, we could see love as a cooperation-morality strategy for building stronger relationships, but most people still care about having love in the world even if it stops being useful.

Some arguments for the latter:

It seems like caring intrinsically about cooperation, and then also being instrumentally motivated to pursue cooperation, is a sort of double-counting.
Insofar as cooperation-morality principles are non-consequentialist, it’s hard to formulate them as components of a utility function over outcomes. E.g. it doesn’t seem particularly desirable to maximize the amount of honesty in the universe.

The rough compromise which I use here is to:

Care intrinsically about the welfare of all agents which currently exist or might in the future, with a bias towards myself and the people close to me.
Care intrinsically about the agency of existing agents to the extent that they're capable enough to be viewed as having agency (e.g. excluding trees), with a bias towards myself and the people close to me.
- In other words, I care about agency in a person-affecting way; and more specifically in a loss-averse way which prioritizes preserving existing agency over enhancing agency.
Define welfare partly in terms of hedonic experiences (particularly human-like ones), and partly in terms of having high agency directed towards human-like goals.
- You can think of this as a mixture of hedonism, desire, and objective-list theories of welfare.
Apply cooperation-morality and deference-morality instrumentally in order to achieve the things I intrinsically care about.
- Instrumental applications of cooperation-morality and deference-morality lead me to implement strong principles. These are partly motivated by being in an iterated game within society, but also partly motivated by functional decision theories.

Rederiving morality from decision theory

I’ll finish by elaborating on how different decision theories endorse different instrumental strategies. Causal decision theories only endorse the same actions as our cooperation-morality intuitions in specific circumstances (e.g. iterated games with indefinite stopping points). By contrast, functional decision theories do so in a much wider range of circumstances (e.g. one-shot prisoner’s dilemmas) by accounting for logical connections between your choices and other agents’ choices. Functional decision theories follow through on commitments you previously made; and sometimes follow through on commitments that you would have made. However, the question of which hypothetical commitments they should follow through with depends on how updateless they are.

Updatelessness can be very powerful - it’s essentially equivalent to making commitments behind a veil of ignorance, which provides an instrumental rationale for implementing cooperation-morality. But it’s very unclear how to reason about how justified different levels of updatelessness are. So although it’s tempting to think of updatelessness as a way of deriving care-morality as an instrumental goal, for now I think it’s mainly just an interesting pointer in that direction. (In particular, I feel confused about the relationship between single-agent updatelessness and multi-agent updatelessness like the original veil of ignorance thought experiment; I also don’t know what it looks like to make commitments “before” having values.)

Lastly, I think deference-morality is the most straightforward to derive as an instrumentally-useful strategy, conditional on fully trusting the agent you’re deferring to - epistemic deference intuitions are pretty common-sense. If you don’t fully trust that agent, though, then it seems very tricky to reason about how much you should defer to them, because they may be manipulating you heavily. In such cases the approach that seems most robust is to diversify worldviews using a meta-rationality strategy which includes some strong principles.

47

Moral strategies at different capability levels

47

Morality as intrinsic preferences versus morality as instrumental preferences

Rederiving morality from decision theory