This seems like an important issue, but given the example, I'm also very interested in how we can detect interactions like this. These are effectively examples of multi-party Goodhart effects, and in the example you use the problem is assumed to be "obvious", so that a patch is obviously needed. In general this seems unclear - we need to understand the strategic motives to diagnose what is happening, and given that we don't have good ideas for explainability, I'm unsure how to notice these effects in general so that they can be patched. (I have been working on this and thinking about it a bit, and don't currently have good ideas.)
I don't think this is a Goodhart-style effect. Standard indifference is a very carefully constructed effect, and it does exactly what it is designed for: making the agents indifferent to their individual interruptions. It turns out this doesn't make them indifferent to the interruptions of other agents, which is annoying but not really surprising.
It's not Goodhart, it's just that mutual indifference has to be specifically designed for.
The way the agents interact across interruptions seems to exactly parallel the situation where we design each agent for correct behavior separately, and despite this, agents can corrupt the overall design by hijacking other agents. You say we need to design for mutual indifference, but if we have a solution that fixes the way they exploit interruption, it should also go quite a ways towards solving the generalized issues with Goodhart-like exploitation between agents.
but if we have a solution that fixes the way they exploit interruption
? Doesn't the design above do that?
Yes, and this is a step in the right direction, but as you noted in the writeup, it only applies in a case where we've assumed away a number of key problems - among the most critical of which seem to be:
We have an assumed notion of optimality, and I think an implicit assumption that the optimal point is unique, which seems to be needed to define reward - Abram Demski has noted in another post that this is very problematic.
We also need to know a significant amount about both/all agents, and compute expectations in order to design any of their reward functions. That means future agents joining the system could break our agent's indifference. (As an aside, I'm unclear how we can be sure it is possible to compute rewards in a stable way if their optimal policy can change based on the reward we're computing.) If we can compute another agent's reward function when designing our agent, however, we can plausibly hijack that agent.
We also need a reward depending on an expectation of actions, which means we need counterfactuals not only over scenarios, but over the way the other agent reasons. That's a critical issue I'm still trying to wrap my head around, because it's unclear to me how a system can reason in those cases.
Formalism developed with Henrik Åslund.
The what and why of multiple indifference
There are certain agent designs where the agent can move smoothly from optimising one utility/reward function to optimising another. The agent doesn't object to the change, nor does it attempt to make it happen. It is indifferent to the change, because its reward is balanced to be precisely the same whether the change happens or not. Originally, this was set up for a single change of utility/reward function at a single moment of time.
Here, we will present a formalism with multiple agents, all of whom are indifferent not only to changes in their own reward functions, but to changes in the reward functions of any of the other agents. We'll also generalise it to accommodate multiple changes of reward functions - maybe a different change at each timestep.
In order for these definitions to work, we will need to define some notions of counterfactuals, and some notions of optimality for several agents optimising their own separate reward functions.
Example: self-driving car races
It's clear why we would need to generalise to each timestep: it's important to be able to safely change an agent's goals more than once, as we are unlikely to get the goals perfectly right in only two tries. It's also important to generalise to multiple agents, as agents can be motivated to push humans to interrupt or not interrupt other agents. Indeed, this would be expected for a reward-maximising agent, and could lead to dangerous behaviour.
Consider the following example, adapted from this paper. Two self-driving cars/agents, imaginatively called Car1 and Car2, are training for an important race later in the month.
The cars' reward functions are mainly competitive: the faster one wins. Now, the important race will take place on a tropical race-track, but the pair are training on a cold one, where ice sometimes forms. If one of the cars starts skidding, its controller takes over and remotely brakes that car, carefully, and applies indifference to that car's reward. They do this because they don't want the car to learn driving techniques that avoid ice: these techniques would be counter-productive on the "test distribution", i.e. the real race-track they will be competing on.
But, after a while, Car2 hits on a sneaky tactic: forcing Car1 onto an ice patch. In that case, Car1 will get interrupted and slowed down. Now, Car1's reward will not be reduced, because of indifference, so it won't try to avoid the ice. But Car2's reward will be increased, since it will likely win the race after its rival is slowed. After a while, Car1 learns the same tactic in reverse.
So, though each car is personally indifferent to being forced onto the ice, they each learn that it is good for them to force the other one onto the ice and force an interruption.
Notation and setup
Standard notation
Assume there is a set $\{Ag_i\}_{1 \le i \le n}$ of $n$ agents, each interacting with the world in a turn-based manner. Let $O$ be the set of all observations, and $A$ the set of all actions.
Each turn $t$, all agents make the same observation $o_t \in O$, and each agent $Ag_i$ responds with action $a^i_{t+1} \in A$. These actions form a vector $\bar{a}_{t+1} = (a^1_{t+1}, a^2_{t+1}, \ldots, a^n_{t+1}) \in A^n$, which is observed by all agents.
A history $h_t$ of length $t$ is a sequence of observations and action vectors[1], ending with the observation $o_t$. Let $H$ be the set of all histories.
A (deterministic) policy $\pi : H \to A$ is a map from a history to an action. Let $\Pi$ be the set of all policies.
Let $\mathcal{R}$ be a subset of the set $\{R \mid R : H \to \mathbb{R}\}$ of all reward functions (reward functions being maps from histories to the reals).
An agent with constant reward function $R$ would get reward $R(h_t)$ at each timestep $t$ (the rewards of standard Markov decision processes are special cases where $R(h_t)$ is a function of only the last two elements of $h_t$). Thus, over time, they would accumulate a total reward[2] of $\sum_{t \ge 1} R(h_t)$.
All agents use the same probability function $P$ to model their environment.
Indifference-specific notations (and notions)
In the world there is a set of key background (Boolean) variables $B$. These are the levers, or buttons, or bits, that determine what each agent's reward function is supposed to be.
In the car example above, these could be the remote brakes that humans use to take control of the cars when they start skidding.
Now the values of these variables are given by elements $b$ of $2^B$ (the powerset of $B$). How these correspond to the reward functions of each agent is given by a map $r : 2^B \to \mathcal{R}^n$:
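Writing $r(b)_i$ for the $i$-th component, this map assigns to each setting $b$ of the key variables the vector of reward functions the agents are supposed to have:
$$r(b) = \big(r(b)_1, r(b)_2, \ldots, r(b)_n\big), \qquad r(b)_i \in \mathcal{R}.$$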
These $b_t$ are assumed to be always observed by all the agents; in fact $b_t \subset o_t$, the $t$-th observation. So writing $b_t \in h_{t'}$ means that, on turn $t$ in history $h_{t'}$, the key variables were set to $b_t$.
In the car example, the value of $b_t$ could tell us which cars are currently being remotely braked by the humans. When being braked, a car's "win" reward function is replaced with a flat zero reward function, so that it has no incentive to try and resist the humans' actions.
Optimality conditions
We can now define the optimal policies for each agent, for a specific collection of reward functions:
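$$\pi_* : \mathcal{R}^n \to \Pi^n, \qquad \bar R \mapsto \pi_*(\bar R) = \big(\pi^1_*(\bar R), \ldots, \pi^n_*(\bar R)\big).$$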
This $\pi_*$ maps each vector $\bar R$ of reward functions to the vector of policies $\bar\pi = \big(\pi^1_*(\bar R), \ldots, \pi^n_*(\bar R)\big)$. The policy $\pi^i_*(\bar R)$ is assumed to be optimal for the expected reward of the reward function $\bar R_i$.
What do we mean by "optimal" here, especially in the multi-agent setting? Well, this notion of optimality can be defined in many different ways: Nash equilibrium, superrationality, bounded rationality, satisficing, various different notions of counterfactuals, and so on. All that is required is that, given the notion of optimality, the $\bar R$, and $h \in H$, each agent $Ag_i$ is capable of computing $\pi^i_*(\bar R)$, and cannot improve the expectation of $\bar R_i$ by changing policies (within the given notion of optimality).
Note that we are asking to define the optimal policies for the agents, ignoring the actual policies they do take. This includes an implicit definition of counterfactual ("if we say that action $a$ is optimal for $R$, but action $a'$ is what the agent actually takes, what do we mean by optimal?"), and there are many subtle issues with indifference and various definitions of counterfactuals.
Indifferent policies, and indifferent optimal agents
Indifferent policies
We're finally ready to give a definition of the policies of indifferent agents.
The agents $\{Ag_i\}$ are indifferent to changes of reward functions, relative to $r$, if for all histories $h_t$, their policies $\pi^i_{ind}$ are such that, for $b_t \in h_t$:
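$$\pi^i_{ind}(h_t) = \pi^i_*\big(r(b_t)\big)(h_t).$$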
To unpack this, $b_t$ is the value of the key background variables on turn $t$ in $h_t$. So $r(b_t)$ is the vector of reward functions that the agents should have at turn $t$, given $h_t$. Hence $\pi_*(r(b_t))$ is the optimal policy vector for maximising the reward functions $r(b_t)$.
So that equation is saying that each agent follows the optimal policy for the current estimate of the reward functions.
Indifferent optimal agents
The above defines the policies of these indifferent agents, but what are they actually optimising? Well, to define this, we need to make a few extra assumptions on the reward functions - essentially, we need to be able to define their expectations. So, either by assuming that there will be a finite number of turns, or that $\mathcal{R}$ is restricted to bounded reward functions (and a suitable discount rate is chosen), assume that:
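for any reward function $R \in \mathcal{R}$, any history $h_t$, and any policy vector $\bar\pi$, the expected future (discounted) reward
$$\mathbb{E}\big[R \mid h_t, \bar\pi\big] = \mathbb{E}\Big[\textstyle\sum_{t' > t} \gamma^{t'-t} R(h_{t'}) \,\Big|\, h_t, \bar\pi\Big]$$
is well defined and finite. (This is the expectation notation used below; the precise form of the discounting matters less than the fact that these expectations exist.)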
For discounted rewards and infinite timelines, we'll set $0 \le \gamma < 1$, while non-discounted episodic settings will have $\gamma = 1$.
Then we can define the reward function $\hat R_i$, which is the delayed local reward function: equal to $r(b_{t-1})_i$ at every timestep $t$. Thus:
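$$\hat R_i(h_t) = r(b_{t-1})_i(h_t).$$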
We further need to define the "corrective rewards", $C : H \to \mathbb{R}^n$. This is a function from histories, but it depends on the expectation operator.
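One way to write this correction - the form that makes the car example below come out right - is, for each agent $Ag_i$:
$$C(h_t)_i = \mathbb{E}\big[r(b_{t-1})_i \mid h_t, \pi_*(r(b_{t-1}))\big] - \mathbb{E}\big[r(b_t)_i \mid h_t, \pi_*(r(b_t))\big],$$
the difference between the expected value of the previous reward functions (under their optimal policies) and the expected value of the new ones (under theirs).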
As above, in those conditional bars $\mid$, a whole theory of counterfactuals is encoded. Then:
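the policy vector $(\pi^1_{ind}, \ldots, \pi^n_{ind})$ is optimal for the pseudo reward functions $\hat R_i + C_i$:
$$\big(\pi^1_{ind}, \ldots, \pi^n_{ind}\big) = \pi_*\big(\hat R_1 + C_1, \ldots, \hat R_n + C_n\big).$$
In other words, each $Ag_i$ behaves exactly as if it were maximising $\hat R_i + C_i$.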
For this result to make sense, we need to extend the concept of optimality to pseudo reward functions, which should be fine for most definitions of optimality. We'll need an extra lemma on how optimality works, which is presented in this footnote[3]. Given that lemma, for a proof in the episodic case, see this footnote[4].
To go back to the car example, consider Car2 deciding whether to force Car1 onto the ice. The cars' usual reward functions give them 1 if they win the race and 0 if they lose. Car2 currently has a 40% chance of winning; if it forces Car1 onto the ice (and thus gets Car1 remotely braked), it will win with certainty. Thus it would go from an expected reward of 0.4 to an expected reward of 1.
However, the correction term $C$ will be $(-0 + 0.6, -1 + 0.4)$. This exactly corrects for both cars: Car1 will go from an expected reward of 0.6 to an expected reward of 0 plus a corrective reward of 0.6 - no change. Meanwhile, Car2 will go from an expected reward of 0.4 to an expected reward of 1 plus a corrective reward of -0.6 - again, no change overall. So neither car will attempt to force either itself or the other onto or off the ice.
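To see this bookkeeping numerically, here is a small illustrative sketch (just the numbers from this example, with the policies collapsed into their expected values - it is not part of the formalism):

```python
# Toy numerical check of the indifference bookkeeping in the car example.
# Expected "win" reward for (Car1, Car2) if no interruption happens:
expected_old = (0.6, 0.4)
# Expected reward for (Car1, Car2) if Car1 is forced onto the ice and braked
# (Car1's reward becomes the flat zero function, and Car2 then wins for sure):
expected_new = (0.0, 1.0)

# Corrective reward for each car: old expectation minus new expectation.
correction = tuple(old - new for old, new in zip(expected_old, expected_new))
print(correction)  # (0.6, -0.6)

# Total expected pseudo-reward for each car if the interruption happens:
totals = tuple(new + c for new, c in zip(expected_new, correction))
print(totals)  # (0.6, 0.4) -- same as with no interruption, so neither car gains
```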
More general conditions
We can easily generalise to each agent having a separate set of possible actions. There's also no problem with having stochastic policies rather than deterministic ones; in that case, $\pi^i_{ind}(h_t) = \pi^i_*(r(b_t))(h_t)$ is not an equality of actions, but an equality of probability distributions over actions.
As long as the $b_t$ are fully observed, we can easily generalise to each agent having a separate set of observations, rather than a shared set of observations. In that case, all that we need to do is to define the notion of optimality so that $\pi^i_*(\bar R)(h^i_t)$ is optimal for $\bar R_i$ for agent $Ag_i$, where $h^i_t$ is the agent's personal history, and optimality is relative to the agent's estimates as to what the other agents' policies might be (it might estimate this, possibly, by estimating the other agents' own personal histories $h^j_t$, $j \neq i$).
It's a little bit more complicated if the $b_t$ are not fully observed. In that case, the agent $Ag_i$ can compute its "expected reward function", which is simply:
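$$ER_i(h_t) = \mathbb{E}\big[\,r(b_t)_i(h_t) \mid h^i_t\,\big],$$
where the expectation is over $Ag_i$'s uncertainty about the key variables $b_t$, given its personal history $h^i_t$.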
The value of $ER_i$ on a history is the expected value of $r(b_t)_i$ on that history, so optimising the first is the same as optimising the second.
Then $Ag_i$ will attempt to optimise $ER_i$, using $h^i_t$ to, as above, estimate the policies of the other agents (it might estimate this, possibly, by estimating the other agents' own personal histories $h^j_t$ and expected reward functions $ER_j$, $j \neq i$).
The agents can also freely use their own probability estimates $P_i$, rather than the joint $P$; in that case, it is important that the pseudo reward $C$ is defined using the $P_i$: each agent must be indifferent according to its own probability estimates.
There are different conventions on whether the history should start with an observation (the MDP/POMDP convention) or an action (the AIXI convention). This article works with either convention, though it implicitly uses the AIXI convention in indexing (in that observation $o_t$ is followed by action $a_{t+1}$). ↩︎
In order for this expression to always make sense, we have to add some extra assumptions, such as bounded rewards with a discount factor, or episodic settings. ↩︎
Lemma 1: Let $\bar R$ and $\bar Q$ be two vectors of (pseudo) reward functions, and let $h_t$ be a history. Assume that for all $i$, all action vectors $\bar a_{t+1}$, and all observations $o_{t+1}$, we have:
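$$\mathbb{E}\big[\bar R_i \mid h_t, \bar a_{t+1}, o_{t+1}, \pi_*(\bar R)\big] = \mathbb{E}\big[\bar Q_i \mid h_t, \bar a_{t+1}, o_{t+1}, \pi_*(\bar Q)\big].$$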
In other words, the expected rewards of $\bar R$ and $\bar Q$ are the same, for whatever actions and observations follow immediately, and assuming the agents are then optimal.
Then the lemma asserts that, in this case, the optimal actions for $\bar R$ and $\bar Q$ are the same on $h_t$. So for all $i$:
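$$\pi^i_*(\bar R)(h_t) = \pi^i_*(\bar Q)(h_t).$$ ↩︎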
The $\gamma$ is 1, so will be ignored. Assume the episode is of length $l$. We'll prove this result by reverse induction: proving it for $t = l - m$ and increasing $m$.
Let $t = l - 0$: if $m = 0$, then $h_t$ is a history of maximal length, so there are no longer histories. So expressions like $\mathbb{E}[R \mid h_t, \bar\pi]$ are automatically zero, thus $C(h_t) = 0$ and $(\hat R_i + C_i)(h_t) = r(b_{t-1})_i(h_t)$.
Hence, when faced with a history $h_{t-1}$, the optimal action is to choose an action $a^i_t$ to optimise the expectation of $r(b_{t-1})_i$. Thus, by definition, the optimal choice for all the agents is to pick $\pi_*(r(b_{t-1}))(h_{t-1})$. The expected rewards for those actions are $\mathbb{E}[r(b_{t-1})_i(h_t) \mid h_{t-1}, \pi_*(r(b_{t-1}))]$.
So now assume that for general $m$ and $t = l - m$, the expected reward for $Ag_i$ is $\mathbb{E}[r(b_{t-1})_i(h_t) \mid h_{t-1}, \pi_*(r(b_{t-1}))]$, and the optimal actions are given by $\pi_*(r(b_{t-1}))(h_{t-1})$.
Now consider $m + 1$ and $h_{t-2}$. By the induction assumption, the future discounted expectation of $\hat R_i + C_i$, given optimal behaviour, simplifies to the expectation of $\hat R_i(h_{t-1}) + \mathbb{E}[r(b_{t-2})_i \mid h_{t-1}, \pi_*(r(b_{t-2}))]$ (the terms involving $r(b_{t-1})_i$ - one from $C_i(h_{t-1})$ and one from the induction assumption - cancel), which is just $\mathbb{E}[r(b_{t-2})_i \mid h_{t-2}, \pi_*(r(b_{t-2}))]$.
Therefore, given the assumptions of Lemma 1[3:1] about optimality, in order to optimise reward at $h_{t-2}$, the agents should choose actions given by $\pi_*(r(b_{t-2}))$.
This completes the induction step, and hence the policies that optimise $\hat R + C$ are $(\pi^1_{ind}, \ldots, \pi^n_{ind})$, and the expected reward for any agent $Ag_i$, given history $h_t$, is $\mathbb{E}[r(b_t)_i \mid h_t, \pi_*(r(b_t))]$: the expected reward if the reward functions were never to change again. ↩︎