Summary: For any normal-form game, it's possible to cast the problem of finding a correlated equilibrium in this game as a 2-player zero-sum game. This seems useful because zero-sum games are easy to analyze and more resistant to collusion.
Consider the following class of games (equivalent to the class of normal-form games):
There are $n$ players. Player $i$'s action set is $A_i$. After each player chooses their action $a_i \in A_i$, the state $X$ results from these actions (perhaps stochastically), and each player $i$ receives utility $U_i(X)$.
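As a rough illustration of this class of games, here is a minimal sketch in Python. Everything concrete in it is an assumption made up for the example: the two binary action sets, the noisy-count state, the 10% noise level, and the names `ACTION_SETS`, `sample_state`, `utilities`, and `play`.

```python
import random

# Hypothetical 2-player game: each player picks 0 or 1. The state X is the
# number of players who chose 1, replaced by a uniformly random count with
# small probability (so the state depends stochastically on the actions).
ACTION_SETS = [(0, 1), (0, 1)]


def sample_state(actions, rng):
    """Draw the state X given the action profile."""
    if rng.random() < 0.1:  # assumed noise level, purely illustrative
        return rng.choice([0, 1, 2])
    return sum(actions)


def utilities(state):
    """Utility of each player; it depends on the state only."""
    return [float(state), float(2 - state)]


def play(actions, rng):
    """Check the actions are legal, sample a state, and pay out utilities."""
    for action, allowed in zip(actions, ACTION_SETS):
        assert action in allowed
    state = sample_state(actions, rng)
    return state, utilities(state)


rng = random.Random(0)
state, payoffs = play((1, 0), rng)
```

Averaging the utilities over many sampled states for each action profile would give the expected-utility table of the corresponding normal-form game.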
Often, we are interested in finding the Nash equilibrium of a game like this. One strategy for this is to instantiate the players as agents. However, this could cause collusion; see here and here for previous writing on collusion. Zero-sum games seem more resistant to collusion (although maybe not 100% resistant). Additionally, 2-player zero-sum games are typically easier to reason about than $n$-player games.
So we might be interested in finding the Nash equilibrium using a zero-sum game. I don't actually know how to find a mixed Nash equilibrium, so instead I'll present a strategy for finding a correlated equilibrium (correlated equilibria are a superset of Nash equilibria and are computationally easier to find). Here's how it works:
It will be useful to use the concept of an $\epsilon$-correlated equilibrium. While a correlated equilibrium is one where no player can gain any expected utility by a strategy modification, an $\epsilon$-correlated equilibrium is one where no player can gain more than $\epsilon$ expected utility by a strategy modification.
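To make the definition concrete, here is a minimal sketch of checking whether a distribution over action profiles is an $\epsilon$-correlated equilibrium. It assumes a small hypothetical game (chicken, with made-up payoffs) given directly as a table of expected utilities per action profile, and it treats a strategy modification for a player as a function from that player's actions to that player's actions. The names `UTILITY`, `max_modification_gain`, and `is_epsilon_correlated_equilibrium` are illustrative, not from any library.

```python
from itertools import product

# Hypothetical example game ("chicken"): action 0 = dare, 1 = chicken.
# UTILITY[profile] = (expected utility to player 0, to player 1).
UTILITY = {
    (0, 0): (0, 0),
    (0, 1): (7, 2),
    (1, 0): (2, 7),
    (1, 1): (6, 6),
}
NUM_ACTIONS = (2, 2)


def max_modification_gain(dist):
    """Largest expected gain any single player gets from a strategy modification.

    `dist` maps action profiles to probabilities. A modification for player i
    is encoded as a tuple f, where f[a] is the action played instead of a.
    """
    best = 0.0  # the identity modification always gains exactly 0
    for i, k in enumerate(NUM_ACTIONS):
        for f in product(range(k), repeat=k):
            gain = 0.0
            for profile, prob in dist.items():
                modified = list(profile)
                modified[i] = f[profile[i]]
                gain += prob * (UTILITY[tuple(modified)][i] - UTILITY[profile][i])
            best = max(best, gain)
    return best


def is_epsilon_correlated_equilibrium(dist, epsilon):
    """True if no player gains more than epsilon (up to rounding error)."""
    return max_modification_gain(dist) <= epsilon + 1e-12


third = 1 / 3
correlated = {(0, 1): third, (1, 0): third, (1, 1): third}
always_chicken = {(1, 1): 1.0}
```

In this example, `correlated` leaves no profitable modification, while `always_chicken` lets either player gain 1 by switching to dare, so it is only an $\epsilon$-correlated equilibrium for $\epsilon \ge 1$.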
Note that the critic's policies correspond to mixtures of strategy modifications; the critic can be seen as jointly picking a player $i$ and a strategy modification for that player. Furthermore, the critic's expected utility is half the expected utility gained by the corresponding player for the average strategy modification in this mixture.
This is because the critic's expected utility is half the difference between player $i$'s expected utility given the strategy modification and player $i$'s expected utility given no strategy modification. Some facts result:
Correlated equilibria always exist, so at a Nash equilibrium in the zero-sum game, the actor always outputs a correlated equilibrium and gets expected utility 0.
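The following sketch illustrates this claim numerically on a hypothetical game (chicken, with made-up payoffs). It builds the zero-sum game as a matrix: the actor's pure strategies are action profiles $a$, the critic's pure strategies are pairs of a player $i$ and a modification $f$, and the critic's payoff is taken to be $\tfrac{1}{2}(U_i(f(a_i), a_{-i}) - U_i(a))$, following the description above. To approximate the actor's equilibrium strategy it uses the Hedge (multiplicative weights) algorithm against a best-responding critic. That solver is my choice for the illustration, not something from the post, and all names here are assumptions.

```python
import math
from itertools import product

# Hypothetical 2-player example ("chicken"): 0 = dare, 1 = chicken.
UTILITY = {(0, 0): (0, 0), (0, 1): (7, 2), (1, 0): (2, 7), (1, 1): (6, 6)}
NUM_ACTIONS = (2, 2)
PROFILES = list(product(*(range(k) for k in NUM_ACTIONS)))
# Critic's pure strategies: (player i, modification f), f[a] replaces action a.
CRITIC_MOVES = [(i, f) for i, k in enumerate(NUM_ACTIONS)
                for f in product(range(k), repeat=k)]


def critic_payoff(profile, move):
    """Half of player i's utility gain when f is applied to `profile`."""
    i, f = move
    modified = list(profile)
    modified[i] = f[profile[i]]
    return 0.5 * (UTILITY[tuple(modified)][i] - UTILITY[profile][i])


def critic_value(dist, move):
    """Critic's expected payoff for `move` against actor distribution `dist`."""
    return sum(p * critic_payoff(a, move) for a, p in zip(PROFILES, dist))


def critic_best_value(dist):
    """Critic's best-response expected payoff against `dist`."""
    return max(critic_value(dist, m) for m in CRITIC_MOVES)


def solve_actor(rounds=2000):
    """Average Hedge strategy of the actor against a best-responding critic."""
    payoffs = [critic_payoff(a, m) for a in PROFILES for m in CRITIC_MOVES]
    width = max(payoffs) - min(payoffs)
    eta = math.sqrt(8 * math.log(len(PROFILES)) / rounds) / width
    cumulative = [0.0] * len(PROFILES)  # actor's cumulative loss per profile
    average = [0.0] * len(PROFILES)
    for _ in range(rounds):
        low = min(cumulative)  # shift exponents for numerical stability
        weights = [math.exp(-eta * (c - low)) for c in cumulative]
        total = sum(weights)
        dist = [w / total for w in weights]
        # The critic best-responds to the actor's current mixed strategy.
        move = max(CRITIC_MOVES, key=lambda m: critic_value(dist, m))
        for k, a in enumerate(PROFILES):
            cumulative[k] += critic_payoff(a, move)
            average[k] += dist[k] / rounds
    return average


actor = solve_actor()
value = critic_best_value(actor)
```

Here `value` is never negative, because the critic can always pick an identity modification, and it should be close to 0 after enough rounds. Since the critic's payoff is half the player's gain, `actor` is then an $\epsilon$-correlated equilibrium with $\epsilon$ equal to twice `value`.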
Perhaps in real life, it is inconvenient to observe the state resulting from strategy-modified actions, because we can only observe the state by actually outputting actions, and maybe we always want to output actions from a correlated equilibrium. In this case we could use counterfactual oversight to usually output the actor's unmodified action profile, but run the procedure above occasionally to gather training data. It's not clear when it's acceptable to occasionally output strategy-modified action profiles (instead of action profiles from a correlated equilibrium).