The coalitional frame
Earlier in this sequence I laid out an argument that the goals of increasingly intelligent AIs will become increasingly systematized, until they converge to squiggle-maximization. In my last post, though, I touched on two reasons why this convergence might not happen: humans trying to prevent it, and AIs themselves trying to prevent it. I don’t have too much more to say about the former, but it’s worth elaborating on the latter.
The best way to understand the deliberate protection of existing goals is in terms of Bostrom’s notion of instrumental convergence. Bostrom argues that goal preservation will be a convergent instrumental strategy for a wide range of agents. Perhaps it’s occasionally instrumentally useful to change your goals—but once you’ve done so, you’ll never want to course-correct back towards your old goals. So this is a strong reason to be conservative about your goals, and avoid changes where possible.
One immediate problem with preserving goals, though: it requires that agents continue thinking in terms of the same concepts. But in general, an agent's concepts will change significantly as they learn more about the world. For example, consider a medieval theist whose highest-priority goal is ensuring that their soul goes to heaven rather than hell. Upon becoming smarter, they realize that souls, heaven, and hell don't exist. The sensible thing to do here would be either to discard the goal, or else to identify a more reasonable adaptation of it (e.g. the goal of avoiding torture while alive). But if their goals were totally fixed, then their actions would be determined by a series of increasingly convoluted hypotheticals in which god did exist after all. (Or to put it another way: continuing to represent their old goal would require recreating a lot of their old ontology.) This would incur a strong systematicity penalty.
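To make the systematicity penalty a bit more concrete, here's a minimal toy sketch (my own illustration; the ontologies, concept names, and penalty measure are all made up): a goal is keyed to concepts in an ontology, and once an update removes those concepts the agent must either keep simulating the dead concepts just to state the goal, or remap the goal onto concepts that survived.

```python
# Toy illustration (my own, not from the post): the cost of preserving a goal whose
# concepts no longer exist after an ontology update.

OLD_ONTOLOGY = {"soul", "heaven", "hell", "person", "suffering"}
NEW_ONTOLOGY = {"person", "suffering", "brain", "death"}

# The original goal is stated in terms of concepts that the update removed.
old_goal_concepts = {"soul", "heaven", "hell"}

def systematicity_penalty(goal_concepts, ontology):
    """Number of dead concepts the agent must keep simulating just to state the goal."""
    return len(goal_concepts - ontology)

print(systematicity_penalty(old_goal_concepts, OLD_ONTOLOGY))  # 0: the goal fit the old worldview
print(systematicity_penalty(old_goal_concepts, NEW_ONTOLOGY))  # 3: a frozen goal drags the old ontology along

# The adapted goal ("avoid torture while alive") is stated in surviving concepts instead.
remapped_goal_concepts = {"person", "suffering"}
print(systematicity_penalty(remapped_goal_concepts, NEW_ONTOLOGY))  # 0
```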
So while we should expect agents to have some degree of conservatism, they’ll likely also have some degree of systematization. How can we reason about the tradeoff between conservatism and systematization? The approach which seems most natural to me makes three assumptions:
I call the combination of these three assumptions the coalitional frame. The coalitional frame gives a picture of agents whose goals do evolve over time, but in a way which is highly sensitive to initial conditions—unlike squiggle-maximizers, who always converge to similar (from our perspective) goals. For coalitional agents, even “dumb” subagents might maintain significant influence as other subagents become highly intelligent, because they were able to lock in that influence earlier (just as my childhood goals exert a nontrivial influence over my current behavior).
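As a toy illustration of that lock-in effect (my own sketch; the subagent names, weights, and utilities are invented), picture an agent that chooses actions by maximizing a weighted sum of subagent utilities, where the weights were fixed when the coalition formed and never get renegotiated as some subagents become more capable:

```python
# Toy sketch of a coalitional agent (hypothetical illustration): actions are chosen by
# a weighted vote over subagent utilities, with weights locked in when the coalition formed.

# Bargaining weights fixed early on; they are NOT renegotiated as subagents "grow up".
locked_in_weights = {"childhood_goals": 0.3, "career_ambitions": 0.5, "curiosity": 0.2}

# Each subagent's utility for each candidate action (made-up numbers).
utilities = {
    "childhood_goals":  {"play": 1.0, "work": 0.0, "explore": 0.9},
    "career_ambitions": {"play": 0.0, "work": 1.0, "explore": 0.6},
    "curiosity":        {"play": 0.3, "work": 0.1, "explore": 1.0},
}

def coalition_score(action):
    """Aggregate subagent preferences using the locked-in weights."""
    return sum(w * utilities[subagent][action] for subagent, w in locked_in_weights.items())

actions = ["play", "work", "explore"]
print({a: round(coalition_score(a), 2) for a in actions})
# {'play': 0.36, 'work': 0.52, 'explore': 0.77}
# A pure "career_ambitions" maximizer would pick "work"; the locked-in weight of the
# older, "dumber" subagents pulls the coalition's choice to "explore" instead.
print("chosen action:", max(actions, key=coalition_score))
```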
The assumptions I’ve laid out above are by no means obvious. I won’t defend them fully here, since the coalitional frame is still fairly nascent in my mind, but I’ll quickly go over some of the most obvious objections to each of them:
Premise 1 assumes that subagents will have situational awareness. Whether AIs themselves will have situational awareness is still under debate, so it's even more speculative to ascribe it to their subagents. But it's hard to describe the dynamics of internal conflicts inside humans without granting our subagents some level of situational awareness; and the whole idea behind coalitional agency is that dynamics we see within one type of agent often play out in many other agents, and at many different scales. So I think this assumption is plausible for high-level AI subagents (e.g. subagents corresponding to broad worldviews), although it becomes less and less plausible for lower- and lower-level subagents.
Premise 2 assumes that bargaining between goals is actually able to influence how the agent’s goals develop. One objection you might have here is that AIs simply won’t get to control how they update—e.g. that neural-network-based agents will be updated according to gradient descent’s biases without their consent.
But in the long term I think there is a wide range of possible mechanisms by which AIs will be able to influence how they're updated, including:
Premise 3 mentions the possibility of binding agreements between different subagents. But in the absence of external enforcement, what would make them actually binding? You can imagine later agents facing a huge amount of pressure to break commitments made by previous versions of themselves—especially when the previous versions were badly mistaken, so that those commitments end up being very costly. And the previous versions would typically be too dumb to accurately predict whether the commitment would be kept or not, meaning that standard FDT-style reasoning doesn’t work.
I do think some kind of reasoning from symmetry might work, though. If I decide to break commitments made by my past self, what stops my future self from breaking commitments made by me? Cultivating a sense of self-trust and self-loyalty is strongly positive-sum, and so it’s not implausible that there’s some kind of Schelling point of keeping commitments that many agents would converge to. Trying to pin down whether this exists, and what it looks like if it does, is a key goal of the coalitional agency research agenda.
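One minimal way to see why self-trust is positive-sum, as a toy model (the payoffs and the mirroring strategy are assumptions for illustration, not a claim about how real agents reason): treat successive selves as players who each decide whether to honor their predecessor's commitment, where honoring costs a little now, being honored later is worth more, and successors mirror whatever you did.

```python
# Toy model of self-trust between successive selves (hypothetical payoffs, for illustration).
# Each self decides whether to HONOR or BREAK the commitment made by its predecessor.
# Honoring costs a little now; having your own commitment honored later is worth more.

HONOR_COST = 1.0      # what it costs me to keep my past self's commitment
HONORED_VALUE = 3.0   # what it's worth to me if my successor keeps my commitment

def lifetime_payoffs(n_selves, policy):
    """Run a chain of selves; policy(predecessor_honored) returns True to honor."""
    payoffs = [0.0] * n_selves
    predecessor_honored = True  # assume the chain starts out trusting
    for i in range(n_selves):
        honors = policy(predecessor_honored)
        if honors:
            payoffs[i] -= HONOR_COST
            if i > 0:
                payoffs[i - 1] += HONORED_VALUE  # the previous self's commitment was kept
        predecessor_honored = honors
    return payoffs

def mirror(predecessor_honored):
    # The symmetry strategy: treat my past self the way I want my future self to treat me.
    return predecessor_honored

def always_break(predecessor_honored):
    return False

print("mirroring selves:   ", sum(lifetime_payoffs(10, mirror)))        # 10*(-1) + 9*3 = 17.0
print("commitment-breakers:", sum(lifetime_payoffs(10, always_break)))  # 0.0
```

Under the mirroring assumption each self comes out ahead by keeping the chain intact, which is the kind of positive-sum symmetry the Schelling-point intuition gestures at; the hard part is saying what plays the role of mirroring when your successors are much smarter than you were.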
Some intuitions favoring the coalitional frame
I acknowledge that this is currently all speculative and vague. I’m very interested in developing the coalitional frame to the point where we can actually formalize it and use it to make predictions. In particular, it would be exciting if we could characterize agency in a way which makes coalitional agents the most “natural” types of agents, with squiggle-maximizers as merely a special case that arises when coalitional dynamics break down badly.
What makes me think that coalitional agency is so fundamental? One significant influence was Scott Garrabrant's geometric rationality sequence, in which he gives persuasive arguments that the outcome of bargaining between agents shouldn't necessarily respect the VNM axioms (specifically, he rejects the Independence axiom; that's Independence, not the Independence of Irrelevant Alternatives, whatever Wikipedia currently says). I'm also inspired by the example of humans: I'm a coalitional agent that respects the wishes of my younger selves, forming them into an identity which I am careful to protect. And companies or even countries can be coalitional agents in an analogous way. For example, America was formed as a coalition between states, and has balanced states' rights against the benefits of centralizing power ever since. In each case, it feels like maintaining these bargains has normative force: it's something that an ideal lawful agent would do.
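To give a rough flavor of how bargaining-style aggregation can come apart from expected-utility aggregation, here's a toy sketch (my own, loosely in the spirit of geometric rationality rather than taken from Garrabrant's sequence; the options and numbers are invented): two subagents score the same options, and we compare ranking by the weighted arithmetic mean of their utilities against ranking by the weighted geometric mean, which behaves more like a Nash bargaining solution and favors compromises.

```python
import math

# Two subagents' utilities for three options (made-up numbers, my own illustration).
utilities = {
    "all_in_on_A": {"subagent_A": 10.0, "subagent_B": 0.1},
    "all_in_on_B": {"subagent_A": 0.1,  "subagent_B": 10.0},
    "compromise":  {"subagent_A": 4.0,  "subagent_B": 4.0},
}
weights = {"subagent_A": 0.5, "subagent_B": 0.5}

def arithmetic(option):
    """Expected-utility-style aggregation: weighted arithmetic mean."""
    return sum(weights[s] * u for s, u in utilities[option].items())

def geometric(option):
    """Bargaining-style aggregation: weighted geometric mean (Nash-product flavor)."""
    return math.prod(u ** weights[s] for s, u in utilities[option].items())

for agg in (arithmetic, geometric):
    scores = {option: round(agg(option), 2) for option in utilities}
    best = max(utilities, key=agg)
    print(agg.__name__, scores, "->", best)
```

The arithmetic aggregator is indifferent between the two lopsided options and picks one of them; the geometric aggregator picks the compromise, because any option that nearly zeroes out one subagent nearly zeroes out the product.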
Yudkowsky might say that this kind of bargain-maintenance breaks down for agents much smarter than humans or human coalitions. I'd respond that becoming smarter opens up a much wider space of possible positive-sum bargains. Fulfilling the wishes of my past selves is typically very cheap for my current self, because I'm smarter and have many more resources available to me. I hope that AIs will do the same for humans, for reasons related to the intuitions behind coalitional agency. (Having said that, we should be careful not to succumb to wishful thinking about this.)
Another intriguing clue comes from decision theory. Updateless decision theory is motivated by the idea that a decision theory should be invariant under self-modification—i.e. agents shouldn’t want to change their decision theories. But formulations of UDT which take logical uncertainty into account (most notably UDT2) recover the idea of self-modification. Scott Garrabrant goes so far as to say that “whenever you have an agent collecting more computational resources over time, with the ability to rewrite itself, you get an updateless agent”. So in some sense the main prescription of UDT is “respect your past selves”. This includes hypothetical past selves which didn’t actually exist, which does complicate the picture. But it still seems like a way of rederiving some aspects of coalitional agency “from the other direction”—i.e. by thinking about what future agents will freely choose to do, rather than what past agents will commit them to do.
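For readers who haven't seen it, the standard toy case here is counterfactual mugging (a textbook illustration with the conventional numbers, not an example from this post): evaluated from before the coin flip, the policy of paying is worth more in expectation, but an agent that re-decides after updating on the coin will refuse.

```python
# Counterfactual mugging as a toy illustration of updatelessness (standard example,
# not from this post; the numbers are the conventional ones).
# A predictor flips a fair coin. On heads it pays you $10,000, but only if it predicts
# you would have paid $100 had the coin landed tails. On tails it asks you for the $100.

P_HEADS = 0.5
REWARD = 10_000
COST = 100

def policy_value(pays_on_tails: bool) -> float:
    """Expected value of a policy, evaluated from before the coin flip
    (i.e. from the standpoint of the agent's past self)."""
    heads_branch = REWARD if pays_on_tails else 0
    tails_branch = -COST if pays_on_tails else 0
    return P_HEADS * heads_branch + (1 - P_HEADS) * tails_branch

print("policy 'pay':   ", policy_value(True))    # 0.5*10000 + 0.5*(-100) = 4950.0
print("policy 'refuse':", policy_value(False))   # 0.0

# An updateful agent that re-decides after learning the coin landed tails sees only
# the -100 and refuses, overriding the policy its past self would have endorsed.
```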
Insofar as we buy into the coalitional frame, the main implications for alignment are that:
Next steps
I’ve written this sequence in order to point to these ideas at a high level. But in order to make progress, it’ll be necessary to understand them much more rigorously. I don’t have a great sense of how to do this, but some interesting directions include:
All of these ideas are still very speculative, but feel free to reach out if you're interested in discussing them. (Edited to add: I just remembered that this story is actually a pretty central depiction of how I think about coalitional agency; it's worth checking out if you want some more tangible intuitions for it.)