While attempting to construct a UDT-like variant of AIXI, the algorithm I wrote down turned out to have a bunch of problems that blocked attempts to prove it had various nice properties. This will be the first post in a sequence about the various nonobvious things that go wrong when you try to make UDT-AIXI. (Check back here for more links)
Notation and Setup:
To begin with, the basic setting was one where the Turing machines in the environment had oracle access to the policy. In the first attempt, with O as the set of observations and O∗ as the set of all finite strings of observations, policies had the type signature π:O∗→ΔA, and environments had the type signature O∗×π→ΔO. Environments functioned by having the option of making oracle calls to the policy with various observation strings to see what action was returned. If the policy randomizes on that input, the action is simply drawn from the distribution π(→o).
Note that we're just using O∗ instead of the AIXI input type of (O×A)∗. This is because the actions the agent sees on the input tape can be treated as just another sort of observation, and this generalization permits looking at cases where the agent may not have an internal sense of which action it took, or cases where the agent gets to peek at what it did in hypothetical situations that didn't happen.
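To pin down these type signatures, here's a minimal Python sketch of the setup as I'm reading it; the names and the toy environment are mine, not part of the original formalism. A policy maps an observation string to a distribution over actions, and an environment is a program that may make oracle calls to the policy on observation strings of its choosing.

```python
import random
from typing import Callable, Dict, Tuple

Obs = str                      # a single observation
Action = str                   # a single action
ObsString = Tuple[Obs, ...]    # an element of O*: a finite string of observations

# A policy pi : O* -> Delta(A), represented as a map from observation
# strings to {action: probability} dictionaries.
Policy = Callable[[ObsString], Dict[Action, float]]

def sample(dist: Dict[Action, float]) -> Action:
    """Draw one action from a probability distribution over actions."""
    actions, weights = zip(*dist.items())
    return random.choices(actions, weights=weights)[0]

def toy_environment(history: ObsString, policy: Policy) -> Dict[Obs, float]:
    """An environment of type O* x pi -> Delta(O): it may make oracle calls
    to the policy on observation strings of its choosing, then returns a
    distribution over the next observation."""
    action = sample(policy(history))   # oracle call; re-samples on every call
    if action == "left":
        return {"win": 0.9, "lose": 0.1}
    return {"win": 0.5, "lose": 0.5}
```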
Policy Search and Nash Equilibria:
Glossing over issues of nonhalting environments, we can reframe the UDT search for the best policy as the result of an infinite game, where each player is indexed by an observation string →o∈O∗ and responds with a (probability distribution over) move(s) in A. Each cell corresponds to a deterministic policy of type O∗→A, and the utility in that cell is the expected reward over the standard mixture of environments when those environments make oracle calls to the selected policy. All players conveniently have the same utility function, which makes things easier. The optimal deterministic policy is a Nash equilibrium, because any deviation by any of the players would result in an equal or lower expected utility.
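As a toy illustration of this game-matrix picture (a finite stand-in I've made up, since the real game has one player per observation string in O∗): enumerate every deterministic policy, score each one as its expected reward over a small environment mixture that makes oracle calls to it, and the best cell is the optimal deterministic policy.

```python
from itertools import product

OBS_STRINGS = [(), ("x",), ("y",)]   # a finite stand-in for O*
ACTIONS = ["0", "1"]

def env_match(policy: dict) -> float:
    """A made-up environment: queries the policy on two observation strings
    and rewards agreement."""
    return 1.0 if policy[("x",)] == policy[("y",)] else 0.0

def env_prefer_one(policy: dict) -> float:
    """Another made-up environment: rewards answering '1' on the empty history."""
    return 1.0 if policy[()] == "1" else 0.0

ENV_MIXTURE = [(0.5, env_match), (0.5, env_prefer_one)]   # (prior weight, environment)

def cell_utility(policy: dict) -> float:
    """Expected reward of one cell (one deterministic policy) over the mixture."""
    return sum(weight * env(policy) for weight, env in ENV_MIXTURE)

# Each cell of the game is an assignment of an action to every observation string.
cells = [dict(zip(OBS_STRINGS, choice)) for choice in product(ACTIONS, repeat=len(OBS_STRINGS))]
best = max(cells, key=cell_utility)
print(best, cell_utility(best))   # the optimal deterministic policy and its utility
```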
This turns the problem of selecting a policy into an equilibrium selection problem, because even when all players have the same utility function, there can be inadequate equilibria. Consider a multiplayer game of stag hunt where rabbits are shared equally among the group, but it's still better for everyone to team up and capture the stag than for everyone to catch a rabbit. The all-rabbit equilibrium is a Nash equilibrium that isn't the best Nash equilibrium. This case seems simple to resolve, but there are much more difficult examples, such as a MAX-3-SAT game where each hypothetical version of the agent in Omega's head has incompatible observation strings, some short, some very long, and each is responsible for fixing the value of one variable.
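For concreteness, here's a quick check (with toy numbers of my own choosing) that the all-rabbit profile in a common-payoff stag hunt is a Nash equilibrium even though the all-stag profile is strictly better for everyone.

```python
from itertools import product

PLAYERS = 3
ACTIONS = ["stag", "rabbit"]

def common_payoff(profile) -> float:
    """Common payoff: the stag (worth 10, shared equally) is caught only if
    everyone hunts it; otherwise the caught rabbits (worth 1 each) are shared
    equally among the group."""
    if all(a == "stag" for a in profile):
        return 10 / PLAYERS
    return sum(1 for a in profile if a == "rabbit") / PLAYERS

def is_nash(profile) -> bool:
    """No player can raise the (common) payoff by unilaterally deviating."""
    for i in range(PLAYERS):
        for alt in ACTIONS:
            deviated = profile[:i] + (alt,) + profile[i + 1:]
            if common_payoff(deviated) > common_payoff(profile):
                return False
    return True

all_rabbit = ("rabbit",) * PLAYERS
all_stag = ("stag",) * PLAYERS
print(is_nash(all_rabbit), common_payoff(all_rabbit))   # True, 1.0   -- an inadequate equilibrium
print(is_nash(all_stag), common_payoff(all_stag))       # True, ~3.33 -- the best equilibrium
```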
Problem 1: Improper Randomization
Setting the Vingean and game theory issues to one side, there's another thing that's going wrong here. The stated setup of how the policy interacts with the environment is incompatible with this Nash equilibrium view of UDT.
As a simple example of what's going wrong, consider a single-turn game against an environment that makes two queries to the policy, and returns a reward of 1 if the answers are the same, and 0 if they are different. In standard game theory, randomizing between two actions with identical payoffs should yield that same expected payoff. But here there's a penalty for randomizing! Because each call to the policy returns an action sampled from the distribution π(→o), identical oracle calls to the policy may return different outputs. The proper way to view this, to make it compatible with the Nash equilibrium view, is to sample the action first, and then plug it into the open slots in the environment. You only randomize once.
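A quick numerical sketch of that matching game, assuming my reading of the setup is right: if each of the environment's two oracle calls independently re-samples the 50/50 policy, the expected reward is only about 0.5, while sampling the action once and plugging that single sample into both query slots always gets reward 1.

```python
import random

def policy() -> str:
    """A policy that randomizes 50/50 between two equally-good actions."""
    return random.choice(["a", "b"])

def environment(answer_to_call_1: str, answer_to_call_2: str) -> float:
    """Reward 1 if the two oracle calls returned the same action, else 0."""
    return 1.0 if answer_to_call_1 == answer_to_call_2 else 0.0

def per_call_randomization() -> float:
    # Each oracle call independently re-samples the policy's distribution.
    return environment(policy(), policy())

def sample_once() -> float:
    # Sample the action once, then plug that single sample into both slots.
    action = policy()
    return environment(action, action)

N = 100_000
print(sum(per_call_randomization() for _ in range(N)) / N)  # ~0.5: penalty for randomizing
print(sum(sample_once() for _ in range(N)) / N)             # 1.0: randomizing once is safe
```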
This isn't quite as trivial as it looks, because Reflective Oracles have this "intrinsic randomness" thing going on: two identical oracle calls to the same agent may return different outputs. In Abram's words, reflective oracles are annoyingly realist about probability; they act as if an agent can randomize in such a way as to take different actions in the same information state. The proper way to do this is to sample a deterministic policy first, and then plug it into the slots in the environment. When thinking about a situation where an agent (possibly yourself) is randomizing, you may not know what action it will take, but you should think that the agent will take the same action in an identical state.
The obvious hack is to cache the result every time the environment makes an oracle call, because if you tried to compute the policy sample up-front, it would probably take an infinite amount of information to specify. But if you apply this hack and cache what has been said about the policy so far, you run into another issue with non-halting environments. Because Reflective Oracle-AIXI calls an oracle on an environment to get the next bit, instead of just running the environment (maybe the environment doesn't halt, and you still need to get the next bit somehow), you don't get access to the policy sample the environment used. If the partial policy sample used by the environment on turn t takes less than f(t) bits to encode (where f is computable), you can use the oracle to recover it and carry that information to the next turn, but again this fails in full generality. The problem arises when you've got an environment that, on some turn, with some positive probability, doesn't halt and makes infinitely many oracle calls. Then there's no way (that I know of, yet) to guarantee that the behavior of the policy on future turns is consistent with what it did on that turn.
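Here's a minimal sketch of the caching hack as I understand it (the class and names are mine, not actual Reflective-Oracle machinery): each observation string gets an action sampled lazily on first query, and that answer is reused afterwards, so a single run of the environment sees a consistent partial policy sample. Nothing in this sketch addresses the problem in the last two sentences: if the environment makes infinitely many calls on some turn, the partial sample isn't guaranteed to be recoverable for future turns.

```python
import random
from typing import Callable, Dict, Tuple

ObsString = Tuple[str, ...]

class CachedPolicyOracle:
    """Lazily samples a deterministic policy from pi: each observation string
    is assigned an action the first time it is queried, and that answer is
    reused for every later query in the same run."""

    def __init__(self, pi: Callable[[ObsString], Dict[str, float]]):
        self.pi = pi
        self.cache: Dict[ObsString, str] = {}   # the partial policy sample so far

    def query(self, obs_string: ObsString) -> str:
        if obs_string not in self.cache:
            actions, weights = zip(*self.pi(obs_string).items())
            self.cache[obs_string] = random.choices(actions, weights=weights)[0]
        return self.cache[obs_string]

oracle = CachedPolicyOracle(lambda obs: {"a": 0.5, "b": 0.5})
# Identical queries within a run now agree, unlike per-call randomization.
assert oracle.query(("o1",)) == oracle.query(("o1",))
```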
Problem 2: Can't Build Desired Oracle
What if we then attempt to get a new notion of oracle where we have a probability distribution over samples? That might have promise, because only a single sample would be used. For instance, the type of a sample would be M→{0,1}, and we'd be looking for a fixed point in Δ(M→{0,1}), i.e., a distribution over samples such that the probability of drawing a sample that maps M to 0 is the same as the probability of M outputting 0 when it selects a sample from the given distribution.
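Spelled out under my reading of this proposal, with M ranging over the machines that can be queried, the fixed-point condition on a distribution μ over samples would look something like:

```latex
% A distribution \mu \in \Delta(\mathbb{M} \to \{0,1\}) is the desired kind of
% fixed point when, for every machine M, the probability that a drawn sample
% maps M to 0 equals the probability that M itself outputs 0 when run with a
% sample drawn from \mu:
\Pr_{s \sim \mu}\big[\, s(M) = 0 \,\big]
  \;=\;
\Pr_{s \sim \mu}\big[\, M^{s} \text{ outputs } 0 \,\big]
\qquad \text{for all } M.
```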
The existence of such a fixed point doesn't straightforwardly follow from the Kakutani fixed-point theorem, because one of its preconditions is that the set of interest is a nonempty, compact, convex subset of a locally convex Hausdorff space. Fixing some bijection between M→{0,1} and [0,1], so that we're looking for a fixed point in Δ([0,1]), we run into the issue that this space isn't compact under any of the natural topologies, and even restricting to computable functions from M→{0,1} (and bijecting that with N), ΔN isn't compact under any of the natural topologies either.
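The standard witness for the ΔN half of this (a fact I'm confident of, spelled out for concreteness): the point masses escape to infinity, so no subsequence of them converges to a probability distribution.

```latex
% Consider the point masses \delta_n \in \Delta(\mathbb{N}). For any bounded
% test function f vanishing at infinity,
\mathbb{E}_{\delta_n}[f] \;=\; f(n) \;\xrightarrow[n \to \infty]{}\; 0,
% so the only candidate limit of any subsequence is the zero measure, which is
% not a probability distribution. Probability mass escapes to infinity (the
% family is not tight), so \Delta(\mathbb{N}) is not compact in the weak topology.
```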
Due to this issue, the earlier post about oracles that result in correlated equilibria is incorrect: the space of interest suffers from this same lack of compactness, and the topological preconditions for applying Kakutani weren't checked. The result still holds up for finitely many players with finitely many moves, because there we don't need a distribution over N, only a distribution over finitely many integers, so the fixed-point theorem goes through in that case.