However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:
- Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
- Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
- Form an EUM which is totally indifferent about the cake allocation between them (which would allocate cakes arbitrarily, and could be swayed by the tiniest incentive to give all Alice’s cakes to Bob, or vice versa)
None of these is very satisfactory!
I think this exact example is failing to really inhabit the mindset of a true linear(!) returns EUM agent. If Alice has literally linear returns, then Alice is totally happy to accept a deal which gets Alice 2x as many cakes + epsilon in 50% of worlds and nothing otherwise.
Correspondingly, if Alice and Bob have ex-ante exactly identical expected power, and it is ex-ante as easy to make cake for either of them, then I think the agent they would build together would be something like: an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for.
From Alice's perspective this gets twice as many cakes + epsilon (due to being more efficient) in 50% of worlds and is thus a nice trade.
(If the marginal cost of giving a cake to Alice vs Bob increases with number of cakes, then you'd give some to both.)
If Alice/Bob had diminishing returns, then adding the utility functions with some bargained weighting is also totally fine and will get you some nice split of cake between them.
If we keep their preferences, but make them have different cake production abilities or marginal costs of providing cakes for them, then you just change the weights (based on some negotiation), not the linearity of the addition. And yes, this means that in many worlds (where one agent always has lower than ex-ante relative marginal cake consumption cost), one of the agents gets all the cake. But ex-ante they got a bit more in expectation!
I'm much more sympathetic to other objections to aggregations of EUM agents being EUM, like ontology issues, imperfect information (and adverse selection), etc.
I was a bit lazy in how I phrased this. I agree with all your points; the thing I'm trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion "Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for":
EUM treats these as messy details. Coalitional agency treats them as hints that EUM is missing something.
EDIT: another thing I glossed over is that IIUC Harsanyi's theorem says the aggregation of EUMs should have a weighted average of utilities, NOT a probability distribution over weighted averages of utilities. So even flipping a coin isn't technically kosher. This may seem nitpicky but I think it's yet another illustration of the underlying non-robustness of EUM.
I've now edited that section. Old version and new version here for posterity.
Old version:
None of these is very satisfactory! Intuitively speaking, Alice and Bob want to come to an agreement where respect for both of their interests is built in. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to weighted averages. The best they can do is to agree on a probabilistic mixture of EUMs—e.g. tossing a coin to decide between option 1 and option 2—which is still very inflexible, since it locks in one of them having priority indefinitely.
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to follow through on commitments they made about which decision procedure to follow (or even hypothetical commitments).
New version:
These are all very unsatisfactory. Bob wouldn’t want #1, Alice wouldn’t want #2, and #3 is extremely non-robust. Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually). And even if they do, whoever loses the coin toss will have a strong incentive to renege on the deal.
We could see these issues merely as the type of frictions that plague any idealized theory. But we could also see them as hints about what EUM is getting wrong on a more fundamental level. Intuitively speaking, the problem here is that there’s no mechanism for separately respecting the interests of Alice and Bob after they’ve aggregated into a single agent. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to (a probability distribution over) weighted averages of their utilities. This makes aggregation very risky when Alice and Bob can’t consider all possibilities in advance (i.e. in all realistic settings).
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to lock in values like fairness based on prior agreements (or even hypothetical agreements).
Coalitional agency seems like an unnecessary constraint on the design of a composite agent, since an individual agent could just (choose to) listen to other agents and behave the way their coalition would endorse, thereby effectively becoming a composite agent without being composite "by construction". The step where an agent chooses which other (hypothetical) agents to listen to makes constraints on the nature of agents unnecessary: the choice to listen to some agents and not others can impose any constraints that particular agent cares about, so an "agent" could be as vague as a "computation" or a program.
(Choosing to listen to a computation means choosing a computation based on considerations other than its output, committing to use its output in a particular way without yet knowing what it's going to be, and carrying out that commitment once the output becomes available, regardless of what it turns out to be.)
This way we can get back to individual rationality, figuring out how an agent should choose to listen to which other agents/computations when coming up with its own beliefs and decisions. But actually occasionally listening to those other computations is the missing step in most decision theories, which would take care of interaction with other agents (both actual and hypothetical).
Good post. But I thought about this a fair bit and I think I disagree with the main point.
Let's say we talk about two AIs merging. Then the tuple of their expected utilities from the merge had better be on the Pareto frontier, no? Otherwise they'd just do a better merge that gets them onto the frontier. Which specific point on the frontier is a matter of bargaining, but the fact that they want to hit the frontier isn't, it's a win-win. And the merges that get them to the frontier are exactly those that output a EUM agent. If the point they want to hit is in a flat region of the frontier, the merge will involve coinflips to choose which EUM agent to become; and if it's curvy at that point, the merge will be deterministic. For realistic agents who have more complex preferences than just linearly caring about one cake, I expect the frontier will be curvy, so deterministic merge into a EUM agent will be the best choice.
Found this interesting and useful. Big update for me is that 'I cut you choose' is basically the property that most (all?) good self therapy modalities use afaict. In that the part or part-coalition running the therapy procedure can offer but not force things, since its frames are subtly biasing the process.
Discussions of how to aggregate values and probabilities feel disjoint. The Jeffrey-Bolker formulation of expected utility presents the preference data as two probability distributions over the same sample space, so that the expected utility of an event is reconstructed as the ratio of the event's measures given by the two priors. (The measure that goes into the numerator is "shouldness", and the other one remains "probability".)
This gestures at a way of reducing the problem of aggregating values to the problem of aggregating probabilities. In particular, markets seem to be easier to set up for probabilities than for expected utilities, so it might be better to set up two markets that are technically the same type of thing, one for probability and one for shouldness, than to target expected utility directly. Values of different agents are incomparable, but so are priors, so any fundamental issues with aggregation seem to remain unchanged by this reformulation. These can't be "prediction" markets, since resolution is not straightforward and somewhat circular, grounded in what the coalition will settle on eventually, but logical induction has to deal with similar issues already.
I recently left OpenAI to pursue independent research. I’m working on a number of different research directions, but the most fundamental is my pursuit of a scale-free theory of intelligent agency. In this post I give a rough sketch of how I’m thinking about that. I’m erring on the side of sharing half-formed ideas, so there may well be parts that don’t make sense yet. Nevertheless, I think this broad research direction is very promising.
This post has two sections. The first describes what I mean by a theory of intelligent agency, and some problems with existing (non-scale-free) attempts. The second outlines my current path towards formulating a scale-free theory of intelligent agency, which I’m calling coalitional agency.
Theories of intelligent agency
By a “theory of intelligent agency” I mean a unified mathematical framework that describes both understanding the world and influencing the world. In this section I’ll outline the two best candidate theories of intelligent agency that we currently have (expected utility maximization and active inference), explain why neither of them is fully satisfactory, and outline how we might do better.
Expected utility maximization
Expected utility maximization is the received view of intelligent agency in many fields (I’ll abbreviate it as EUM, and EUM agents as EUMs). Idealized EUMs have beliefs in the form of probability distributions, and goals in the form of utility functions, as specified by the axioms of probability theory and utility theory. They choose whichever strategy leads to the most utility in expectation; this is typically modelled as a process of search or planning.
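As a toy illustration, here is a minimal sketch of an idealized EUM in Python; the actions, outcomes, probabilities, and utilities are all made up for illustration rather than taken from anywhere in particular:

```python
# Minimal sketch of an idealized expected utility maximizer.
# Actions, outcomes, probabilities, and utilities are made up for illustration.

def expected_utility(action, beliefs, utility):
    """Expected utility of an action, given beliefs[action][outcome] = probability."""
    return sum(p * utility[outcome] for outcome, p in beliefs[action].items())

def choose(actions, beliefs, utility):
    """An idealized EUM picks whichever action has the highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, beliefs, utility))

# Beliefs: a probability distribution over outcomes for each action.
beliefs = {
    "bake a cake": {"cake": 0.8, "no cake": 0.2},
    "buy a cake":  {"cake": 0.95, "no cake": 0.05},
}
# Goals: a utility function over outcomes.
utility = {"cake": 1.0, "no cake": 0.0}

print(choose(list(beliefs), beliefs, utility))  # -> "buy a cake"
```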
EUM is a very productive framework in simple settings—like game theory, bargaining theory, microeconomics, etc. It’s particularly useful for describing agents making one-off decisions between a fixed set of choices. However, it’s much more difficult to use EUM to model agents making sequences of choices over time, especially when they learn and update their concepts throughout that process. The two points I want to highlight here:
So we might hope that a theory of deep learning, or reinforcement learning, or deep reinforcement learning, will help fill in EUM’s blind spots. Unfortunately, theoretical progress has been slow on all of these—they’re just too broad to say meaningful things about in the general case.
Active inference
Fortunately, there’s another promising theory which comes at it from a totally different angle. Active inference is a theory born out of neuroscience. Where EUM starts by assuming an agent already has beliefs and goals, active inference gives us a theory of how beliefs and goals are built up over time.
The core idea underlying active inference is predictive coding. Predictive coding models our brains as hierarchical networks where the lowest level is trying to predict our sensory inputs, the next-lowest level is trying to predict the lowest level, and so on. The higher up the hierarchy you go, the more abstract and compressed the representations become. The lower levels might represent individual “pixels” seen by our retinas, then higher levels lines and shapes, then higher levels physical objects like dogs and cats, then even higher levels abstract concepts like animals and life.
This is, of course, similar to how artificial neural networks work (especially ones trained by self-supervised learning). The key difference: predictive coding tells us that, in the brain, the patterns recognized at each level are determined by reconciling the bottom-up signals and the top-down predictions. For example, after looking at the image below, you might not perceive any meaningful shapes within it. But if you have a strong enough top-down prediction that the image makes sense (e.g. because I’m telling you it does), then that prediction will keep being sent down to lower layers responsible for identifying shapes, until they discover the dog. This explains the sharp shifts in our perceptions when looking at such images: at first we can’t see the dog at all, but when we find it, it jumps into focus, and afterwards we can’t unsee it.
Predictive coding is a very elegant theory. And what’s even more elegant is that it explains actions in the same way—as very strong top-down predictions which override the default states of our motor neurons. Specifically, we can resolve conflicts between beliefs and observations either by updating our beliefs, or by taking actions which make the beliefs come true. Active inference is an extension of predictive coding in which some beliefs are so rigid that, when they conflict with observations, it’s easier to act to change future observations than it is to update those beliefs. We can call these hard-to-change beliefs “goals”, thereby unifying beliefs and goals in a way that EUM doesn’t.
This is a powerful and subtle point, and one which is often misunderstood. I think the best way to fully understand this point is in terms of perceptual control theory. Scott Alexander gives a good overview here; I’ll also explain the connection at more length in a follow-up post.
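To make the "goals as rigid beliefs" idea slightly more concrete, here is a deliberately crude toy sketch (my own framing with made-up numbers, not a faithful implementation of active inference): a flexible belief resolves a prediction error mostly by updating itself, while a very rigid belief resolves it mostly by driving action that changes future observations.

```python
# Toy sketch of "goals as rigid beliefs" (not a faithful active inference model;
# "rigidity" here is a crude stand-in for precision).

class Belief:
    def __init__(self, predicted_temp, rigidity):
        self.predicted_temp = predicted_temp  # what this belief predicts
        self.rigidity = rigidity              # in [0, 1]; near 1.0 acts like a goal

    def reconcile(self, world):
        """Resolve the conflict between prediction and observation, either by
        updating the belief or by acting on the world (or a mix of both)."""
        error = world["temperature"] - self.predicted_temp
        self.predicted_temp += (1 - self.rigidity) * error  # update the belief...
        world["temperature"] -= self.rigidity * error       # ...or act on the world

world = {"temperature": 15.0}

ordinary_belief = Belief(predicted_temp=18.0, rigidity=0.1)   # mostly updates itself
goal_like_belief = Belief(predicted_temp=21.0, rigidity=0.95) # mostly acts on the world

ordinary_belief.reconcile(world)
goal_like_belief.reconcile(world)
print(ordinary_belief.predicted_temp, goal_like_belief.predicted_temp, world["temperature"])
# The ordinary belief drifts toward the observation; the goal-like belief barely
# moves, and instead the world is pushed toward it.
```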
Towards a scale-free unification
Active inference is a beautiful theory—not least because it includes EUM as a special case. Active inference represents goals as probability distributions over possible outcomes. If we interpret the logarithm of each probability as that outcome’s utility (and set aside the value of information) then active inference agents choose actions which maximize expected utility. (One intuition for why such an interpretation is natural comes from Scott Garrabrant's geometric rationality.)
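In symbols (a rough sketch of the correspondence, with the epistemic/information-gain term of expected free energy set aside): if the goal distribution over outcomes is p*(o) and we define utility as u(o) := log p*(o), then choosing the policy whose predicted outcomes best match the goal distribution is just choosing the policy with the highest expected utility:

```latex
\pi^{*} \;=\; \arg\max_{\pi} \; \mathbb{E}_{o \sim q(o \mid \pi)}\!\left[\log p^{*}(o)\right]
        \;=\; \arg\max_{\pi} \; \mathbb{E}_{o \sim q(o \mid \pi)}\!\left[u(o)\right]
```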
So what does expected utility maximization have to add to active inference? I think that what active inference is missing is the ability to model strategic interactions between different goals. That is: we know how to talk about EUMs playing games against each other, bargaining against each other, etc. But, based on my (admittedly incomplete) understanding of active inference, we don’t yet know how to talk about goals doing so within a single active inference agent.
Why does that matter? One reason: the biggest obstacle to a goal being achieved is often other conflicting goals. So any goal capable of learning from experience will naturally develop strategies for avoiding or winning conflicts with other goals—which, indeed, seems to happen in human minds.
More generally, any theory of intelligent agency needs to model internal conflict in order to be scale-free. By a scale-free theory I mean one which applies at many different levels of abstraction, remaining true even when you “zoom in” or “zoom out”. I see so many similarities in how intelligent agency works at different scales (on the level of human subagents, human individuals, companies, countries, civilizations, etc) that I strongly expect our eventual theory of it to be scale-free.
But active inference agents are cooperative within themselves while having strategic interactions with other agents; this privileges one level of analysis over all the others. Instead, I propose, we should think of active inference agents as being composed of subagents who themselves compete and cooperate in game-theoretic ways. I call this approach coalitional agency; in the next section I characterize my current understanding of it from two different directions.
Two paths towards a theory of coalitional agency
The core idea of coalitional agency is that we should think of agents as being composed of cooperating and competing subagents; and those subagents as being composed of subsubagents in turn; and so on. The broad idea here is not new—indeed, it’s the core premise of Minsky’s Society of Mind, published back in 1986. But I hope that thinking of coalitional agency as incorporating elements of both EUM and active inference will allow progress towards a formal version of the theory.
In this section I’ll give two different characterizations of coalitional agency: one starting from EUM and trying to make it more coalitional, and the other starting from active inference and trying to make it more agentic. More specifically, the first poses the question: if a group of EUMs formed a coalition, what would it look like? The second poses the question: how could active inference agents be more robust to conflict between their internal subagents?
From EUM to coalitional agency
If a group of EUMs formed a coalition, what would it look like? EUM has a standard answer to this: the coalition would be a linearly-aggregated EUM. In this section I first explain why the standard answer is unsatisfactory. I then give an alternative answer: the coalition should be an incentive-compatible decision procedure.
Aggregating into EUMs is very inflexible
In the EUM framework, any non-EUM agent is incoherent in the sense of violating the underlying axioms of probability theory and/or utility theory. So insofar as EUM has predictive power, it predicts that competent coalitions will also be EUMs. But which EUMs? The standard answer is given by Harsanyi’s utilitarian theorem, which shows that (under reasonable-seeming assumptions) an aggregation of EUMs into a larger-scale EUM must have a utility function that’s a weighted average of the subagents’ utilities.
However, this strongly limits the space of possible aggregated agents. Imagine two EUMs, Alice and Bob, whose utilities are each linear in how much cake they have. Suppose they’re trying to form a new EUM whose utility function is a weighted average of their utility functions. Then they’d only have three options:
- Form an EUM which would give Alice all the cakes (because it weights Alice’s utility higher than Bob’s)
- Form an EUM which would give Bob all the cakes (because it weights Bob’s utility higher than Alice’s)
- Form an EUM which is totally indifferent about the cake allocation between them (which would allocate cakes arbitrarily, and could be swayed by the tiniest incentive to give all Alice’s cakes to Bob, or vice versa)
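A toy calculation (my own sketch, with made-up numbers) makes the collapse to these corner cases concrete: unless the weights are exactly equal, the aggregate strictly prefers giving every cake to the higher-weighted agent.

```python
# Toy illustration: a weighted average of two linear-in-cake utilities
# always pushes all of the cake to one side (numbers are made up).

TOTAL_CAKES = 10

def aggregate_utility(alice_cakes, w_alice):
    bob_cakes = TOTAL_CAKES - alice_cakes
    return w_alice * alice_cakes + (1 - w_alice) * bob_cakes  # both utilities are linear

for w_alice in (0.6, 0.4, 0.5):
    best = max(range(TOTAL_CAKES + 1), key=lambda a: aggregate_utility(a, w_alice))
    print(f"weight on Alice = {w_alice}: the aggregate EUM gives Alice {best}/10 cakes")
# 0.6 -> Alice gets everything; 0.4 -> Bob gets everything;
# 0.5 -> every allocation is tied, so the choice is arbitrary (max just picks the first).
```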
These are all very unsatisfactory. Bob wouldn’t want #1, Alice wouldn’t want #2, and #3 is extremely non-robust. Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually). And even if they do, whoever loses the coin toss will have a strong incentive to renege on the deal.
We could see these issues merely as the type of frictions that plague any idealized theory. But we could also see them as hints about what EUM is getting wrong on a more fundamental level. Intuitively speaking, the problem here is that there’s no mechanism for separately respecting the interests of Alice and Bob after they’ve aggregated into a single agent. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to (a probability distribution over) weighted averages of their utilities. This makes aggregation very risky when Alice and Bob can’t consider all possibilities in advance (i.e. in all realistic settings).
Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to lock in values like fairness based on prior agreements (or even hypothetical agreements).
Coalitional agents are incentive-compatible decision procedures
The space of decision procedures is very broad; can we say more about which decision procedures rational agents should commit to? One key desideratum for commitments is that it’s easy to trust that they’ll be kept. Consider the example above of flipping a coin to decide between options 1 and 2. This is fair, but it sets up strong incentives for whoever loses the coinflip to break their commitment, since they will not get any benefit from keeping it.
And it’s even worse than that, because in general the only way to find out another agent’s utilities is to ask them, and they could just lie. From the god’s-eye perspective you can build an EUM which averages subagents’ utilities; from the perspective of the agents themselves, you can’t. In other words, EUMs constructed by taking a weighted average of subagents’ utilities are not incentive-compatible.
EUMs which can't guarantee each other's honesty will therefore want to aggregate into incentive-compatible decision procedures which each agent does best by following. Perhaps the best-known incentive-compatible decision procedure is the fair cake-cutting algorithm, also known as “I cut you choose”. This is a much simpler and more elegant way to split cakes than the example I gave above of Alice and Bob aggregating into a single EUM.
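Here is a minimal sketch of the procedure, with a one-dimensional "cake" on the interval [0, 1] and made-up valuation functions. The incentive-compatibility property it illustrates: the cutter can guarantee themselves half the cake (by their own lights) by equalizing the two pieces, and the chooser does best by simply taking whichever piece they prefer, with no need for either to trust the other.

```python
# "I cut, you choose" over a cake modelled as the interval [0, 1].
# The valuation functions below are made up for illustration; each maps a
# piece (a, b) to how much that player values it, with the whole cake worth 1.

def cutter_value(a, b):
    return b - a  # the cutter values the cake uniformly

def chooser_value(a, b):
    # The chooser only cares about the right-hand half (say, the icing).
    return 2 * max(0.0, min(b, 1.0) - max(a, 0.5))

def cut_and_choose(cutter_value, chooser_value):
    # Cutter's safe strategy: cut so both pieces are worth 1/2 to them
    # (found here by bisection).
    lo, hi = 0.0, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if cutter_value(0.0, mid) < 0.5:
            lo = mid
        else:
            hi = mid
    cut = (lo + hi) / 2
    # Chooser's dominant strategy: take whichever piece they value more.
    left, right = (0.0, cut), (cut, 1.0)
    chooser_piece = left if chooser_value(*left) >= chooser_value(*right) else right
    cutter_piece = right if chooser_piece == left else left
    return cutter_piece, chooser_piece

print(cut_and_choose(cutter_value, chooser_value))
# Cutter keeps roughly (0, 0.5), chooser takes roughly (0.5, 1); each gets at
# least half the cake by their own valuation.
```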
Now, cake-cutting is one very specific type of problem, and we shouldn’t expect there to be incentive-compatible decision procedures with such nice properties for all problems. Nevertheless, there’s a very wide range of possibilities to explore. Some of the simplest possible incentive-compatible decision procedures include:
These decision procedures each give subagents some type of control over the outputs—and, importantly, a type of control that generalizes to a range of problems beyond the ones they were able to consider during bargaining.
Which incentive-compatible decision procedure?
The question is then: how should subagents choose which incentive-compatible bargaining procedure to adopt? The most principled answer is that they should use bargaining theory. However, this is a little different from how bargaining is traditionally framed: bargaining doesn’t typically produce ways of organizing the bargainers—instead it produces an object-level answer to whatever problem the bargainers face.
This makes sense when you have a single decision to make. But when bargainers face many possible future decisions, bargaining over outcomes requires specifying which outcome to choose in every possible situation. This is deeply intractable in realistic settings, where bargainers can’t predict every possible scenario they might face.
In those settings it is much more tractable to bargain over methods of making decisions which generalize beyond the problems that the bargainers are currently aware of. I don’t know of much work on this, but the same idealized bargaining solutions (e.g. the Nash bargaining solution) should still apply in principle. The big question is whether there’s anything interesting to be said about the relationship between incentive-compatible decision procedures and bargaining solutions. For example, are there classes of incentive-compatible decision procedures which make it especially easy for agents to identify which one is near the optimal bargaining solution? On a more theoretical level, one tantalizing hint is that the ROSE bargaining solution is also constructed by abandoning the axiom of independence—just as Garrabrant does in his rejection of EUM above. This connection seems worth exploring further.
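For reference, the Nash bargaining solution mentioned above picks, from the feasible set F of payoff pairs, the outcome maximizing the product of the agents' gains over their disagreement payoffs (d_A, d_B):

```latex
(u_A^{*}, u_B^{*}) \;=\; \arg\max_{(u_A, u_B) \in F,\; u_A \ge d_A,\; u_B \ge d_B} \; (u_A - d_A)(u_B - d_B)
```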
To finish, I’ve summarized many of the claims from this section in the following table:
What do I mean by “hard to design or reason about”? One nice thing about EUMs is that their behavior is extremely easy to summarize: they do whatever’s best for their goals according to their beliefs. But we can’t talk about decision procedures in the same way. Individual subagents may have goals and beliefs, but the decision procedure itself doesn’t: it just processes those subagents into a final decision.
Fortunately, there’s a way to rescue our intuitive idea that agents should have beliefs and goals. It’ll involve talking about much more complex incentive-compatible decision procedures, though. So first I’ll turn to the other direction in which we can try to derive coalitional agency: starting from active inference.
From active inference to coalitional agency
I just gave an account of coalitional agents in which they’re built up from individual EUMs. In this section I’ll do the opposite: start from an active inference agent and modify it until it looks more like a coalitional agent.
More specifically, consider a hierarchical generative model containing beliefs/goals, where higher layers predict lower layers, and lower layers send prediction errors up to higher layers. Let’s define a subagent as a roughly-internally-consistent cluster of beliefs and goals within that larger agent. Note that this definition is a matter of degree: if we apply a high bar for internal consistency, then each subagent will be small (e.g. beliefs and desires about a single object) whereas a lower bar will lead to larger subagents (e.g. a whole ideology).
Subagents with different beliefs and goals will tend to make different predictions (including “predictions” about which actions they want the agent to take). What modifications do we need to make to our original setup for it to be robust to strategic dynamics between those subagents?
Predicting observations via prediction markets
When multiple subagents make conflicting predictions, the standard approach is to combine them by taking a precision-weighted average. Credit is then assigned to each subagent for the prediction in proportion to how confident it was. But this is not incentive-compatible: subagents can benefit by strategizing about how the other subagents will respond, and changing their responses accordingly.
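Concretely, if subagent i reports an estimate μ_i with precision (inverse variance) π_i, the standard combination rule is the precision-weighted average:

```latex
\hat{\mu} \;=\; \frac{\sum_i \pi_i \, \mu_i}{\sum_i \pi_i}
```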
There are various incentive-compatible ways to elicit predictions from multiple agents (many of which are discussed by Neyman). However, the most elegant incentive-compatible method for aggregating predictions is a prediction market. Each trader on a prediction market can choose to buy shares in propositions it thinks are overpriced and sell shares in propositions it thinks are underpriced. This allows subagents to specialize into different niches within the overall agent. It also incentivizes them to arbitrage away any logical inconsistency they notice. These dynamics are modeled by the Garrabrant induction framework.
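As a toy illustration of the "beliefs as prices, influence as wealth" picture (a simple Kelly-betting market I'm using for illustration, not Garrabrant induction itself; the traders and numbers are made up): log-utility traders bet their wealth on a binary proposition, the market-clearing price is the wealth-weighted average of their beliefs, and resolution shifts wealth toward the traders who predicted well.

```python
# Toy market: log-utility (Kelly) traders betting on a binary proposition.
# Not Garrabrant induction itself; just an illustration of prices as aggregate
# beliefs and wealth as influence. All traders and numbers are made up.

traders = [
    {"wealth": 1.0, "belief": 0.9},   # confident specialist
    {"wealth": 2.0, "belief": 0.5},   # wealthy but uninformed
    {"wealth": 0.5, "belief": 0.2},   # contrarian
]

def clearing_price(traders):
    # With log utility, the market-clearing price is the wealth-weighted average belief.
    total_wealth = sum(t["wealth"] for t in traders)
    return sum(t["wealth"] * t["belief"] for t in traders) / total_wealth

def resolve(traders, proposition_true):
    """After resolution, each trader's wealth is multiplied by the ratio of their
    probability for the realized outcome to the market price of that outcome."""
    price = clearing_price(traders)
    for t in traders:
        if proposition_true:
            t["wealth"] *= t["belief"] / price
        else:
            t["wealth"] *= (1 - t["belief"]) / (1 - price)

print(round(clearing_price(traders), 3))            # the market's aggregate "belief"
resolve(traders, proposition_true=True)
print([round(t["wealth"], 2) for t in traders])     # accurate traders gain influence
```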
Choosing actions via auctions
Given my discussion above about actions being in some sense predictions of future behavior, we might think that actions should be chosen by prediction markets too. However, there’s a key asymmetry: if I expect a complex plan to happen, I can profit by predicting any aspect of it. But if I want a complex plan to happen, I need to successfully coordinate every aspect of it. So, unlike predictions of observations, predictions of actions need to have some mechanism for giving a single plan control over many different actuators.
In active inference, the mechanism by which this occurs is called expected free energy minimization. I’m honestly pretty confused about how expected free energy minimization works, but I strongly suspect that it’s not incentive-compatible. In particular, the discontinuity involved in picking the single highest-value plan seems like it’d induce incentives to overestimate your own plan’s value. However, Demski et al.’s BRIA framework solves this problem by requiring subagents to bid for the right to implement a plan and receive the corresponding reward. Rational subagents will never bid more than the reward they actually expect. So my hunch is that something like this auction system would be the best way to adjust our original setup to make it incentive-compatible.
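A minimal sketch of the auction idea (my own simplification, not the BRIA mechanism itself): subagents bid from their own budgets for control of the actuators; the winner pays its bid and is later credited with whatever reward its plan actually produces, so systematically overbidding drains a subagent's budget and hence its future influence.

```python
# Toy auction for the right to execute a plan (a simplification, not BRIA).
# Each subagent bids from its own budget; the winner pays its bid and is later
# credited with the reward its plan actually earns, so a subagent that
# systematically overbids loses budget (and hence future influence).

subagents = [
    {"name": "cautious_planner",      "bid": 0.4, "budget": 10.0},
    {"name": "overconfident_planner", "bid": 0.9, "budget": 10.0},
]

def run_auction(subagents):
    winner = max(subagents, key=lambda s: s["bid"])
    winner["budget"] -= winner["bid"]   # the winner pays its bid
    return winner

for step in range(3):
    winner = run_auction(subagents)
    realized_reward = 0.5               # suppose the winning plan only ever delivers 0.5
    winner["budget"] += realized_reward # ...which is what the winner is credited with

print([(s["name"], round(s["budget"], 2)) for s in subagents])
# The overconfident planner wins every auction but loses 0.4 of budget each time,
# while the cautious planner's budget is untouched.
```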
Aggregating values via voting
The last important component of decision-making is evaluating plans (whether in advance or in hindsight). What happens when different subagents disagree on which goals or values the plans should be evaluated in terms of? Again, the standard approach is to take a precision-weighted average of their evaluations, but this still has all the same incentive-compatibility issues. And unlike predictions, values have no ground truth feedback signal, meaning that prediction markets don’t help.
So I expect that the most appropriate way to aggregate goals/values is via a voting system. This is also the conclusion reached by Newberry and Ord, who model idealized moral decision-making in terms of a parliament in which subagents vote on what values to pursue. Specifically, they propose using random ballot voting, in which each voter’s favorite option is selected with probability proportional to their vote share. This voting algorithm has three particularly notable features:
Putting it all together
I’ve described two paths towards a theory of coalitional agency. On one path, we start from expected utility maximizers and aggregate them to form coalitional agents, via those EUMs bargaining about which decision procedures to use. The problem is that the resulting decision procedure may be incoherent in the sense that it can’t be ascribed beliefs or goals. On the other path, we make interactions between active inference subagents more incentive-compatible by using prediction markets, auctions, and voting (or similar mechanisms) to manage internal conflict.
What I’ll call the coalitional agency hypothesis is the idea that these two paths naturally “meet in the middle”—specifically, that EUMs doing (idealized) bargaining about which decision procedure to use would in many cases converge to something like my modified active inference procedure. If true, we’d then be able to talk about that procedure’s “beliefs” (the prices of its prediction market) and “goals” (the output of its voting procedure).
One line of work which supports the coalitional agency hypothesis is Critch’s negotiable reinforcement learning framework, under which EUMs should bet their influence on any disagreements about the future they have with other agents, so that they end up very powerful if (and only if) their predictions are right. I interpret this result as evidence that (some version of) prediction markets are the default outcome of bargaining over incentive-compatible decision procedures.
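One way to gesture at the flavor of that result (a toy sketch of my own, not Critch's actual framework): agents stake influence weights on their predictive disagreements, and the weights are updated by the likelihood of each observation under each agent's beliefs, so whoever predicts better ends up with most of the influence.

```python
# Toy "bet your influence" dynamics (a sketch, not Critch's actual framework).
# Two agents disagree about a coin's bias; their influence weights are updated
# by the likelihood of each observation under their respective beliefs.

import random

random.seed(0)
weights = {"optimist": 0.5, "pessimist": 0.5}
beliefs = {"optimist": 0.8, "pessimist": 0.3}   # each agent's P(heads)
true_p = 0.75

for _ in range(50):
    heads = random.random() < true_p
    for name, p in beliefs.items():
        weights[name] *= p if heads else (1 - p)
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}

print({k: round(v, 3) for k, v in weights.items()})
# The agent whose predictions were closer to reality ends up with most of the influence.
```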
But all of this work is still vague and tentative. I’d very much like to develop a more rigorous formulation of coalitional agency. This would benefit greatly from working with collaborators (especially those with strong mathematical skills). So I’ll finish with two calls to action. If you’re a junior(ish) researcher and you want to work with me on any of this, apply to my MATS fellowship. If you’re an experienced researcher and you’d like to chat or otherwise get involved (potentially by joining a workshop series I’ll be running on this) please send me a message directly.
Thanks to davidad, Jan Kulveit, Emmett Shear, Ivan Vendrov, Scott Garrabrant, Abram Demski, Martin Soto, Laura Deming, Aaron Tucker, Adria Garriga, Oliver Richardson, Madeleine Song and others for helping me formulate these ideas.