When describing the failure mode, you have the approximate-Solomonoff agent try to predict s_2 (by assigning approximately uniform probability, since it can't invert the hash in O(n^2) time), and then plug that distribution into the reward function to compute the "expected reward" of action 1 (which comes out very small, since only one s_2 can be correct).
However, this problem would be circumvented if the agent did things the opposite way - first try to predict the reward (by assigning probability 0.9 to reward 1 under action 1, based on the frequency data), and only then try to invert the hash and fail.
There might be a more general rule here: orderings of prediction that fail later rather than earlier give better results. Or: the more approximate an answer is, the less precedence it should have in the final prediction.
This is still a bit unsatisfying - it's not abstract reasoning (at least not obviously) - but I think the equivalent would look more like abstract reasoning if the underlying predictor had to use a smarter search over a smaller hypothesis space than Solomonoff induction does.
I think the issue presented in the post is that the Solomonoff hypothesis cannot be sampled from, even though its probability density can be computed. If we try to compute the expected reward of our action by sampling, we run into the curse of dimensionality: a single point contributes most of the reward. A Solomonoff inductor would correctly find a probability density function under which h(s_2) = s_1 holds with high probability.
However, I think that if we ask the Solomonoff predictor to predict the reward directly, then it will correctly arrive at a model that predicts the rewards. So we can fix the presented agent.
Yes, I think part of the issue is the gap between being able to sample s_2 given s_1 and being able to evaluate the density of s_2 given s_1. However, I am not sure how we should score hypotheses that assign density to future observations (instead of predicting bits one at a time). We will have difficulty computing the probability such hypotheses assign to the observations seen so far.
Predicting the rewards directly seems to fix this issue. I don't know if this solution can be generalized to environments other than betting environments.
Summary: we examine a game involving hash functions and show that an algorithm based on resource-bounded Solomonoff induction is outperformed by a simple strategy. This game has some parallels to Vingean reflection.
Betting environments
Consider the following family of environments (which we might call "betting environments"). In each iteration:

1. The environment samples an observation s_1 and shows it to the agent.
2. The agent takes an action a ∈ {0, 1}.
3. The environment samples a second observation s_2 and shows it to the agent.
4. The agent receives reward r(s_1, s_2, a), where r is a known function of the observations and the action.
The idea of these environments is that the agent's action affects nothing except reward, so acting well in these environments only requires predicting the next reward correctly given the agent's action. The second observation has no apparent effect on what the agent's strategy should be, but we'll see later why it is included. First, we will see how one general approach to problems like this fails.
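To make the protocol concrete, here is a minimal Python sketch of the interaction loop, assuming the structure described above; the class and method names (`sample_s1`, `sample_s2`, `reward`, and the agent's `act`/`observe` interface) are illustrative, not part of the original formalism.

```python
class BettingEnvironment:
    """Abstract interaction protocol for a betting environment.

    One iteration: the agent observes s1, takes an action, observes s2,
    and receives reward r(s1, s2, action). The action affects only the reward.
    """

    def sample_s1(self):
        raise NotImplementedError

    def sample_s2(self, s1):
        raise NotImplementedError

    def reward(self, s1, s2, action):
        raise NotImplementedError


def run_episode(env, agent, n_iterations):
    """Run the betting-environment protocol; the agent's act/observe interface is assumed."""
    total = 0.0
    for _ in range(n_iterations):
        s1 = env.sample_s1()
        action = agent.act(s1)
        s2 = env.sample_s2(s1)
        reward = env.reward(s1, s2, action)
        agent.observe(action, s2, reward)
        total += reward
    return total
```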
A resource-bounded algorithm for betting environments
Consider the following resource-bounded variant of Solomonoff induction. Each hypothesis is represented as a program q that takes as input a bit string and returns a probability (interpreted as the probability that the next bit will be 1). As in ordinary Solomonoff induction, the prior probability of a hypothesis q is 2^{-length(q)}. The probability that hypothesis q assigns to an observation history b_1 b_2 ... b_n is ∏_{i=1}^{n} (if b_i = 1 then q(b_{1:i-1}) else 1 - q(b_{1:i-1})), i.e. the product of the conditional probabilities of each bit given the previous bits. We restrict attention to programs q that run in time O(n^2), where n is the length of the input.
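As a concreteness check, here is a small Python sketch of how a hypothesis would be scored under this variant; the function and argument names are my own, and the O(n^2) resource bound is not enforced here.

```python
import math

def log_score(q, bits, program_length):
    """Log of (prior x likelihood) for a hypothesis q on a bit string.

    q: a callable mapping a prefix (list of 0/1 bits) to P(next bit = 1),
       standing in for the program.
    program_length: length(q) in bits, giving the 2^{-length(q)} prior.
    Assumes q never returns exactly 0 or 1.
    """
    log_prior = -program_length * math.log(2)
    log_likelihood = 0.0
    for i, b in enumerate(bits):
        p_one = q(bits[:i])  # conditional probability that bit i is 1
        log_likelihood += math.log(p_one if b == 1 else 1.0 - p_one)
    return log_prior + log_likelihood
```

Hypotheses compete on this prior-times-likelihood score; the "winning" hypotheses discussed below are, roughly, the ones whose score stays competitive as the history grows.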
Using this variant of Solomonoff induction, we can develop an AIXI-like algorithm for acting in betting environments. Specifically, when choosing an action, this algorithm does the following:

1. Condition the resource-bounded Solomonoff mixture on the interaction history so far, including the current observation s_1.
2. For each action a, estimate the expected reward E[r(s_1, s_2, a)] under the mixture's predictive distribution for s_2 (e.g. by sampling s_2 and averaging the rewards).
3. Take the action with the highest estimated expected reward, except that with some probability it takes a random action instead.
The purpose of the random action is to learn the correct distribution for each counterfactual reward r(s_1, s_2, a) in the long run. For example, when choosing the i-th action, we could pick a random action with probability 1/(i+1). This way, the rate of exploration approaches 0 over time, but the total number of exploration steps approaches infinity.
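Below is a minimal sketch of the action-selection step under my reading of the algorithm above: expected reward for each action is estimated by Monte Carlo sampling of s_2 from the predictor, and exploration happens with probability 1/(i+1). The helper names (`sample_s2`, `r`) are assumptions.

```python
import random

def choose_action(i, s1, sample_s2, r, actions=(0, 1), n_samples=1000):
    """Choose the i-th action in a betting environment.

    sample_s2: draws s_2 from the current posterior-weighted hypothesis,
               conditioned on the history and on s1 (assumed to exist).
    r: the known reward function r(s1, s2, a).
    """
    if random.random() < 1.0 / (i + 1):
        return random.choice(actions)  # exploration step
    estimates = {}
    for a in actions:
        samples = [r(s1, sample_s2(s1), a) for _ in range(n_samples)]
        estimates[a] = sum(samples) / n_samples  # Monte Carlo expected reward
    return max(estimates, key=estimates.get)
```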
Failure in an unpredictable environment
Here's an example of a betting environment that this algorithm fails on. Suppose we have some family of hash functions h_i for each natural number i, such that h_i returns strings of length i. Assume that inverting h_i takes time Ω(2^i). Now consider the following betting environment. On iteration n:

1. With probability 0.9, the environment samples a random string s_2 and sets s_1 = h_n(s_2); otherwise, s_1 and s_2 are sampled independently at random, so that h_n(s_2) ≠ s_1 with overwhelming probability.
2. Action 0 always yields reward 0.5. Action 1 yields reward 1 if h_n(s_2) = s_1, and reward 0 otherwise.
Immediately, it is clear that always choosing action 1 will yield 0.9 expected reward per iteration. This is the optimal strategy for an agent that lacks the computational resources to tell anything about r(s_1, s_2, 1) given s_1. I believe that if h_i is an appropriate family of hash functions, this will be true for all polynomial-time algorithms in the long run.
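For intuition, here is a toy simulation consistent with the description above. Truncated SHA-256 stands in for h_n (a real construction would need stronger inversion-hardness guarantees), and the exact sampling details are my assumptions; always choosing action 1 averages about 0.9 reward per round.

```python
import hashlib
import random

def h(n, s):
    """Stand-in for h_n: the first n bits of SHA-256 of s."""
    digest = hashlib.sha256(s).digest()
    bits = ''.join(f'{byte:08b}' for byte in digest)
    return bits[:n]

def step(n):
    """One round of the hash betting environment (details are assumptions)."""
    s2 = random.getrandbits(128).to_bytes(16, 'big')  # the hidden string
    if random.random() < 0.9:
        s1 = h(n, s2)                                  # hash matches 90% of the time
    else:
        s1 = ''.join(random.choice('01') for _ in range(n))  # almost surely a mismatch
    reward_0 = 0.5                                     # payoff of action 0
    reward_1 = 1.0 if h(n, s2) == s1 else 0.0          # payoff of action 1
    return s1, s2, reward_0, reward_1

rounds = 10000
print(sum(step(32)[3] for _ in range(rounds)) / rounds)  # ~0.9 for "always action 1"
```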
Now let's see how our resource-bounded algorithm does on this problem. After a large number of iterations, any winning hypothesis for our resource-bounded Solomonoff induction variant will satisfy:

- it assigns close to the correct conditional distribution to s_1, given the history so far;
- it assigns close to the correct conditional distribution to the reward, given the history, s_1, s_2, and the action taken.
This is because these conditions state that the hypothesis has close to the correct conditional distributions for s_1 and r. If a hypothesis fails to satisfy one of these properties, we can replace it with a variant that does satisfy the property, yielding a hypothesis that assigns infinitely higher log-probability to the data (in the long run) without being much more complicated.
This only leaves the distribution of s_2 given s_1. Correctly predicting s_2 would require inverting the hash function. Given that the hypothesis must run in polynomial time, if we take many samples of s_2 given s_1 using the hypothesis, we should expect an (exponentially) small fraction of them to satisfy h_n(s_2) = s_1. So any hypothesis will predict that, with high probability, h_n(s_2) ≠ s_1.
Given this, our algorithm will take action 0: under its winning hypotheses, action 0 achieves expected reward 0.5, while action 1 achieves an exponentially small expected reward (even though its true expected reward is 0.9). Obviously, this behavior is highly pathological.
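Continuing the toy simulation above, the failure is easy to reproduce: a polynomial-time hypothesis cannot concentrate its samples on the preimage of s_1, so a sampling-based estimate of action 1's reward comes out near zero even though the true expected reward is 0.9. This reuses `h` and `step` from the earlier sketch; sampling s_2 uniformly at random stands in for sampling from a resource-bounded hypothesis.

```python
s1, _, _, _ = step(32)  # observe s_1 for one round

# Sample s_2 values the way a poly-time hypothesis effectively must:
# without being able to invert h_n, essentially none of them hit the preimage.
samples = [random.getrandbits(128).to_bytes(16, 'big') for _ in range(10000)]
estimated_reward_action_1 = sum(h(32, s2) == s1 for s2 in samples) / len(samples)

print(estimated_reward_action_1)  # ~0.0, so the agent prefers action 0's 0.5
```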
Conclusion
As an alternative to this algorithm, consider the same algorithm except that it completely ignores observation s_2. This new algorithm will correctly predict that, about 90% of the time, action 1 yields reward 1. Therefore, it will take action 1. Intuitively, this seems like the correct behavior in general. Since winning hypotheses will predict reward given the action and s_1 well, subject to resource bounds, they can be used in a straightforward way for estimating expected reward for each action.
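In this particular environment, ignoring s_2 effectively reduces to tracking the empirical reward frequencies for each action. The sketch below is a frequency-based stand-in for the modified inductor, not the inductor itself; the class name and interface are illustrative and match the interaction loop sketched earlier.

```python
import random
from collections import defaultdict

class RewardPredictingAgent:
    """Estimate expected reward per action from observed frequencies, ignoring s_2."""

    def __init__(self, actions=(0, 1)):
        self.actions = actions
        self.reward_sums = defaultdict(float)  # total reward seen per action
        self.counts = defaultdict(int)         # times each action was taken
        self.t = 0

    def act(self, s1):
        """Choose an action; s1 is unused because reward frequencies suffice here."""
        self.t += 1
        untried = [a for a in self.actions if self.counts[a] == 0]
        if untried or random.random() < 1.0 / self.t:
            return random.choice(untried or list(self.actions))  # exploration
        return max(self.actions, key=lambda a: self.reward_sums[a] / self.counts[a])

    def observe(self, action, s2, reward):
        """Update reward statistics; s_2 is deliberately ignored."""
        self.reward_sums[action] += reward
        self.counts[action] += 1
```

Run against the hash betting environment, this agent quickly converges to taking action 1 and so averages about 0.9 reward per round.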
The problem with our original algorithm parallels a problem with naïve approaches to Vingean reflection. If the agent's environment contains a hash function inverter, then it may encounter a situation similar to the betting environment in this post: it will see the hash code of a string before the string itself. Since the agent is unable to predict the behavior of the hash function inverter, it must use abstract reasoning to decide how much utility to expect from each action, rather than actually simulating the hash function inverter. The same reasoning sometimes applies to smart agents other than hash function inverters.
For future research, it will be useful to see how strategies for solving the hash-function betting environment (i.e. predicting an expected reward of 0.9 for action 1) might translate into strategies for Vingean reflection.