This post depends on a basic understanding of history-based reinforcement learning and the AIXI model.
I am grateful to Marcus Hutter and the LessWrong team for early feedback, though any remaining errors are mine.
The universal agent AIXI treats the environment it interacts with like a video game it is playing: the actions it chooses at each step are like hitting buttons, and the percepts it receives are like images on the screen (observations) and an unambiguous point tally (rewards). It has been suggested that since AIXI is inherently dualistic and doesn't believe anything in the environment can "directly" hurt it, if it were embedded in the real world it would eventually drop an anvil on its head to see what would happen. This is certainly possible, because the math of AIXI cannot explicitly represent the idea that AIXI is running on a computer inside the environment it is interacting with. For one thing, that possibility is not in AIXI's hypothesis class (which I will write $\mathcal{M}$). There is no easy patch, because AIXI is defined as the optimal policy for a belief distribution over its hypothesis class, but we don't really know how to talk about optimality for embedded agents (so the expectimax tree definition of AIXI cannot be easily extended to handle embeddedness). On top of that, "any" environment "containing" AIXI is at the wrong computability level for a member of $\mathcal{M}$: our best upper bound on AIXI's computability level is $\Delta^0_2$ = limit-computable (for an ε-approximation) instead of the $\Sigma^0_1$ level of its environment class. Reflective oracles can fix this, but at the moment there does not seem to be a canonical reflective oracle, so there remains a family of equally valid reflective versions of AIXI without an objective favorite.
However, in my conversations with Marcus Hutter (the inventor of AIXI) he has always insisted AIXI would not drop an anvil on its head, because Cartesian dualism is not a problem for humans in the real world, who historically believed in a metaphysical soul and mostly got along fine anyway. But when humans stick electrodes in our brains, we can observe changed behavior and deduce that our cognition is physical - would this kind of experiment allow AIXI to make the same discovery? Though we could not agree on this for some time, we eventually discovered the crux: we were actually using slightly different definitions for how AIXI should behave off-policy.
In particular, let $\xi^{AI}$ be the belief distribution of AIXI. More explicitly,
$$\xi^{AI} := \sum_{\mu \in \mathcal{M}} 2^{-K(\mu)} \mu$$
I will not attempt a formal definition here. The only thing we need to know is that $\mathcal{M}$ is a set of environments which AIXI considers possible. AIXI interacts with an environment by sending it a sequence of actions $a_1, a_2, \ldots$ in exchange for a sequence of percepts, each containing an observation and a reward, $e_1 = o_1r_1, e_2 = o_2r_2, \ldots$, so that action $a_t$ precedes percept $e_t$. One neat property of AIXI is that its choice of $\mathcal{M}$ satisfies $\xi^{AI} \in \mathcal{M}$ (this trick is inherited with minor changes from the construction of Solomonoff's universal distribution).
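To make the shape of the mixture concrete, here is a minimal Python sketch of a Bayesian mixture over a tiny, hand-picked class. Everything in it is an assumption for illustration: the three Bernoulli "environments", their made-up description lengths standing in for K, and the function names `xi` and `posterior`. The real class contains far richer environments and uses true Kolmogorov complexity, which is not computable.

```python
from fractions import Fraction

# Hypothetical finite stand-in for the hypothesis class: each "environment"
# is a Bernoulli source over percepts {0, 1} that ignores actions, paired
# with a made-up description length K (in bits). These numbers are invented.
ENVIRONMENTS = {
    "mostly_zero": (Fraction(1, 4), 2),   # (P(percept = 1), assumed K)
    "fair":        (Fraction(1, 2), 1),
    "mostly_one":  (Fraction(3, 4), 3),
}

def prior_weights(envs):
    """Unnormalised prior weights 2^-K(mu), mirroring the definition of xi."""
    return {name: Fraction(1, 2 ** k) for name, (_, k) in envs.items()}

def likelihood(p_one, percepts):
    """Probability that a Bernoulli(p_one) source emits `percepts`."""
    prob = Fraction(1)
    for e in percepts:
        prob *= p_one if e == 1 else 1 - p_one
    return prob

def xi(percepts, envs=ENVIRONMENTS):
    """Mixture probability: sum over mu of 2^-K(mu) * mu(percepts)."""
    w = prior_weights(envs)
    return sum(w[name] * likelihood(p, percepts) for name, (p, _) in envs.items())

def posterior(percepts, envs=ENVIRONMENTS):
    """Posterior weight of each environment after seeing `percepts`."""
    w = prior_weights(envs)
    total = xi(percepts, envs)
    return {name: w[name] * likelihood(p, percepts) / total
            for name, (p, _) in envs.items()}
```

In this toy the prior weights sum to 7/8 rather than 1, so `xi` is only a semimeasure-style mixture. After a run of 1 percepts, `posterior` shifts weight toward `"mostly_one"` even though it started with the smallest prior.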
Now let $V^\pi_\mu$ be a (discounted) value function for policy $\pi$ interacting with environment $\mu$, which is the expected sum of discounted rewards obtained by $\pi$. We can define the AIXI agent as
$$\pi^{AIXI} := \arg\max_\pi V^\pi_{\xi^{AI}}$$
By the Bellman equations, this also specifies AIXI's behavior on any history it can produce (all finite percept strings have nonzero probability under $\xi^{AI}$). However, it does not tell us how AIXI behaves when the history includes actions it would not have chosen. In that case, the natural extension is
$$\pi^{AIXI}(\cdot \mid ae_{<t}) := \arg\max_\pi V^\pi_{\xi^{AI}}(ae_{<t})$$
so that AIXI continues to act optimally (with respect to its updated belief distribution) even when some suboptimal actions have previously been taken.
The philosophy of this extension is that AIXI acts exactly as if a dumb friend has been playing the video game poorly with AIXI watching both the button presses and the screen from a nearby armchair, and then suddenly passes AIXI the controller. This means that if some electrodes were stuck in AIXI's "brain" and caused it to choose poor actions, afterwards it will act as if this is because its dumb friend decided to choose dumb actions (which says nothing whatsoever about the rules of the video game itself). I will call this sort of situation action corruption. It seems that dealing with action corruption reasonably may be sufficient to prevent an agent (or child) from dropping an anvil on its head, at least with some "paternalistic" guidance from engineers (or parents). When a little (carefully controlled, non-destructive) poking around in AIXI's brain is revealed to corrupt actions according to the laws of physics, it is a sensible inference (supported by Ockham's razor) that crushing AIXI's brain with an anvil will lead to irrecoverable action corruption. However, we have just argued that AIXI is not ontologically capable of arriving at that inference. Therefore our naive extension of AIXI off-policy will not respond to action corruption reasonably and may fall prey to the anvil problem.
However, in a way this direct extension of AIXI to off-policy histories is not natural. After all, AIXI should be able to calculate the actions that it would have taken in the past recursively, so we can always determine whether action corruption has taken place. To be specific, let us fix a deterministic AIXI policy (by breaking ties between action values in some consistent way). Any deterministic $\pi$ can be treated as a function from histories to actions. Taking this view of $\pi^{AIXI}$, recursively define a mapping $a^*$ from (a finite string or infinite sequence of) percepts to the corresponding actions of $\pi^{AIXI}$:
$$a^*(e)_t := \pi^{AIXI}(a^*(e)e_{<t})$$
Now we can define an alternative off-policy version of AIXI that recalculates the optimal action sequence up to the current time and combines this with the (memorized) percepts to generate an alternative history:
$$\pi^{AIXI}_{off}(ae_{<t}) := \pi^{AIXI}(a^*(e)e_{<t})$$
Note that $\pi^{AIXI}_{off}$ ignores the true action sequence $a$, or equivalently does not remember its previous actions. In case this waste of information bothers you, we will later integrate the true actions $a_t$ into the percepts $e_t$ and strengthen our conclusions.
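The recursion for $a^*$ and the resulting off-policy wrapper can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `win_stay_lose_shift` is a hypothetical deterministic policy standing in for the (incomputable) AIXI policy, and the names `recomputed_history` and `off_policy` are mine rather than anything from the literature.

```python
from typing import Callable, Sequence, Tuple

Action = int
Percept = int
# A history is ((a_1, e_1), ..., (a_{t-1}, e_{t-1})).
History = Tuple[Tuple[Action, Percept], ...]
Policy = Callable[[History], Action]

def recomputed_history(policy: Policy, percepts: Sequence[Percept]) -> History:
    """Pair each percept with the action the policy *would* have chosen.

    Implements a*(e)_i = policy(a*(e) e_<i) by replaying the percepts.
    """
    history: History = ()
    for e in percepts:
        a_star = policy(history)
        history += ((a_star, e),)
    return history

def off_policy(policy: Policy) -> Policy:
    """Wrap a deterministic policy so it ignores the true past actions."""
    def pi_off(true_history: History) -> Action:
        percepts = [e for _, e in true_history]   # discard the true actions
        return policy(recomputed_history(policy, percepts))
    return pi_off

def win_stay_lose_shift(history: History) -> Action:
    """Hypothetical stand-in policy: repeat the last action after percept 1."""
    if not history:
        return 0
    last_action, last_percept = history[-1]
    return last_action if last_percept == 1 else 1 - last_action
```

For example, on the corrupted history `((1, 1),)` the unwrapped policy repeats the corrupted action 1, while `off_policy(win_stay_lose_shift)` first recomputes that it would have played 0 and therefore returns 0. Each call replays the whole percept string, so it makes one call to the underlying policy per past time step.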
This more complicated off-policy behavior does not harm the computability level of ε-approximating AIXI because the recursive computation eventually settles on the right answer (by induction). The runtime to convergence scales up by about a factor of t as one would expect.
It is worth reflecting here on which version of AIXI best describes humans. In particular, what is the meaning of our action sequence, and do we remember it? This discussion requires that we are careful about drawing the lines between a person and her environment. Certainly we observe and recall our own external actions (at some level of abstraction); we hear the words we say and we see our hands grasping and picking up objects. However, these events and even the feeling of performing them are actually a part of our percept stream, which both $\pi^{AIXI}$ and $\pi^{AIXI}_{off}$ maintain similar access to. We can call these self-observations, which make up a part of our percepts, $\mathrm{im}(a)$, standing for the image of our actions. Perhaps the "true" actions are the conscious decisions that precede physical acts; however, psychological experiments have shown that these decisions are predictable before we know that we have arrived at them. The difficulty of locating $a_{<t}$ suggests to me that perhaps memory only represents $\mathrm{im}(a_{<t})$, and the illusion of access to $a_{<t}$ is created online by retrospection. This is a much closer match to $\pi^{AIXI}_{off}$ than $\pi^{AIXI}$, though I think the argument is far from rigorous.
In fact, this discussion suggests the more radical solution of entirely throwing out the "internal" choices $a$ and constructing a universal distribution over $\mathrm{im}(a)e'$ (where $e'$ is the rest of the percept). Then perhaps we can believe in free will only for the current time, and "myopically" choose the most favorable $\mathrm{im}(a_t)$ in a kind of evidential decision theory that may appeal to behaviorists. This idea is close to "Self-Predictive Universal A.I." which constructs the "Self-AIXI" policy $\pi_S$ (the difference is that the paper uses a product measure over belief distributions for policy and environment instead of a unified belief distribution for both). Perhaps there is an interesting connection between $\pi_S$ and $\pi^{AIXI}_{off}$, but I have not found one yet!
Intuitively, I expect that $\pi^{AIXI}_{off}$ avoids the anvil problem. It should learn that when engineers non-destructively mess with its brain, $\mathrm{im}(a)$ does not always match $\mathrm{im}(a^*)$, which is bad because physical consequences (and rewards) always depend directly on $\mathrm{im}(a)$. However, outside of brain surgery, $a = a^*$ and $\mathrm{im}(a) = \mathrm{im}(a^*)$, meaning that $a^*$ determines the observed actions of the agent and leads to their consequences (remember that $\pi^{AIXI}_{off}$ does not directly see $a$). Since the physics of AIXI's brain is computing the true action whether or not it is interfered with, one might expect that AIXI eventually decides that its chosen actions have no effect and it is only a passive observer. But in practice, the actions chosen by AIXI when its brain is not being messed with will be very difficult to predict with certainty from physics (this situation is made even worse because the existence of AIXI's brain is necessarily outside of its own hypothesis class, at least without introducing a reflective oracle), whereas by assumption $a^*$ is computed correctly with perfect certainty. So the "free will" hypothesis gains Bayes points over the naturalistic hypothesis under those conditions where AIXI performs reliably[1]. In other words, $\pi^{AIXI}_{off}$ believes it has free will exactly when it has free will.
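To put rough numbers on "gains Bayes points", here is a small worked calculation. All of the symbols are illustrative assumptions rather than quantities defined in this post: $w_{free}$ and $w_{nat}$ are the prior weights of the two hypotheses, the "free will" hypothesis predicts each observed action image with probability 1, the naturalistic hypothesis predicts it with a constant probability $q < 1$, and both predict the rest of the percept equally well. After $n$ steps without tampering, the posterior odds are

$$\frac{w_{free} \cdot 1^n}{w_{nat} \cdot q^n} = \frac{w_{free}}{w_{nat}} \, q^{-n}$$

If the "free will" hypothesis starts $c$ bits behind, so that $w_{free}/w_{nat} = 2^{-c}$, the odds exceed 1 once

$$n > \frac{c}{\log_2(1/q)}$$

For instance, with $c = 20$ and $q = 0.99$ this is roughly 1,400 steps, after which the advantage keeps growing exponentially in $n$.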
Formal Equivalence with an Uncorrupted AIXI
Now we will prove a formal result which describes exactly in what sense the informal statement above is true[2]. For the moment we will step back and consider arbitrary policies and environments where the percepts need not contain images of the actions.
Given any policy $\pi$, define the corrupted policy $f \circ \pi$ as
$$(f \circ \pi)(a'_t \mid a'e_{<t}) := \sum_{a_t} f(a'_t \mid a'_{<t}a_t, e_{1:t}) \, \pi(a_t \mid a'e_{<t})$$
That is, the probability that $f \circ \pi$ takes action $a'_t$, given that it has so far chosen actions $a'_{<t}$ and observed percepts $e_{<t}$, is the sum over every action $a_t$ that $\pi$ might attempt of the probability that $\pi$ attempts $a_t$ times the probability that $f$ turns that attempt into $a'_t$.
Now given an environment $\mu$ with $\mu(e_{1:t} \| a_{1:t}) := \prod_{i=1}^t \mu(e_i \mid e_{<i}a_i)$, we will also define an extended environment that includes the action corruption $f$:
$$\mu^f(e_{1:t} \| a_{1:t}) := \sum_{a'_{1:t}} \mu(e_{1:t} \| a'_{1:t}) \prod_{i=1}^t f(a'_i \mid a'_{<i}a_i, e_{1:i})$$
Returning to our extended metaphor, the interpretation is that you are sitting in a room playing a video game on your controller, but your friend is watching you play from another room and sometimes (perhaps if you're doing something he doesn't like) he somehow hacks in and changes your actions $a$ to new actions $a'$. From your perspective this is all part of the game, in the sense that it behaves exactly like an environment in AIXI's hypothesis class. With the (minor) caveat that your actions are now moving your thumbs instead of pressing the buttons, this description includes the case that your friend actually grabs your controller (but it is worse in that you may not get to see what he does with it).
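The two definitions above can be sketched in Python on a tiny binary example. This is a toy under stated assumptions, not a general implementation: the environment `mu_step`, the kernel `f`, and the calling conventions (a policy is a function returning the probability of an action given past executed actions and past percepts) are all hypothetical. One deliberate simplification is that `f` here only looks at percepts observed before the action is taken, whereas the formulas write its percept argument as $e_{1:i}$.

```python
from fractions import Fraction
from itertools import product

ACTIONS = (0, 1)

def mu_step(e, past_percepts, a):
    """Toy one-step conditional: the percept echoes the action w.p. 3/4.

    `past_percepts` is accepted for generality but unused in this toy.
    """
    return Fraction(3, 4) if e == a else Fraction(1, 4)

def mu(percepts, actions):
    """mu(e_1:t || a_1:t) as a product of one-step conditionals."""
    prob = Fraction(1)
    for i, (e, a) in enumerate(zip(percepts, actions)):
        prob *= mu_step(e, percepts[:i], a)
    return prob

def f(a_exec, past_exec, a_intended, past_percepts):
    """Toy corruption kernel: flip the intended action w.p. 1/3 after a 0 percept."""
    flip = Fraction(1, 3) if past_percepts and past_percepts[-1] == 0 else Fraction(0)
    return 1 - flip if a_exec == a_intended else flip

def corrupt_policy(f, pi):
    """Build f o pi: sum over intended actions a of f(a' | ..., a) * pi(a | ...)."""
    def f_pi(a_exec, past_exec, past_percepts):
        return sum(f(a_exec, past_exec, a, past_percepts) * pi(a, past_exec, past_percepts)
                   for a in ACTIONS)
    return f_pi

def mu_f(percepts, intended):
    """mu^f(e_1:t || a_1:t): marginalise over the executed actions a'_1:t."""
    total = Fraction(0)
    for executed in product(ACTIONS, repeat=len(intended)):
        weight = mu(percepts, executed)
        for i, (a_exec, a) in enumerate(zip(executed, intended)):
            weight *= f(a_exec, executed[:i], a, percepts[:i])
        total += weight
    return total
```

For instance, `mu_f((0, 1), (0, 1))` evaluates to 7/16: the first action goes through untouched, and the second is flipped with probability 1/3 because the previous percept was 0. Because this toy `f` is causal, `mu_f` still sums to 1 over all percept strings of a given length, so it behaves like an ordinary environment from the agent's point of view.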
Now define the history distribution $\mu^\pi$ to be generated by the interaction of environment $\mu$ with policy $\pi$.
Theorem:
$$\sum_{a'_{1:t}} \mu^{f \circ \pi^{AIXI}_{off}}(a'e_{1:t}) = (\mu^f)^{\pi^{AIXI}}(a^*(e)e_{1:t})$$
We can further define $\pi_{off}$ for any deterministic $\pi$, and a version of the theorem holds with its recursively calculated actions $a^\pi(e)$ in place of $a^*(e)$.
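Proof:
$$\begin{aligned}
\mu^{f \circ \pi^{AIXI}_{off}}(a'e_{1:t}) &:= \mu(e_{1:t} \| a'_{1:t}) \prod_{i=1}^t (f \circ \pi^{AIXI}_{off})(a'_i \mid a'e_{<i}) \\
&= \mu(e_{1:t} \| a'_{1:t}) \prod_{i=1}^t \sum_{a_i} f(a'_i \mid a'_{<i}a_i, e_{1:i}) \, \pi^{AIXI}_{off}(a_i \mid a'e_{<i}) \\
&= \mu(e_{1:t} \| a'_{1:t}) \prod_{i=1}^t \sum_{a_i} f(a'_i \mid a'_{<i}a_i, e_{1:i}) \, [a_i = \pi^{AIXI}(a^*(e)e_{<i})] \\
&= \mu(e_{1:t} \| a'_{1:t}) \prod_{i=1}^t f(a'_i \mid a'_{<i}a^*(e)_i, e_{1:i})
\end{aligned}$$
Taking the sum over $a'_{1:t}$ we obtain $\mu^f(e_{1:t} \| a^*(e)_{1:t})$ by definition, and this is equal to $(\mu^f)^{\pi^{AIXI}}(a^*(e)e_{1:t})$ because $\pi^{AIXI}$ is deterministic.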
The theorem says that the probability of the corrupted $\pi^{AIXI}_{off}$ producing any string of corrupted actions that give rise to the percept string $e_{1:t}$ is the same as the probability that $\pi^{AIXI}$ causes the extended environment $\mu^f$ to produce the same percept string $e_{1:t}$. In particular this means that $V^{f \circ \pi^{AIXI}_{off}}_\mu = V^{\pi^{AIXI}}_{\mu^f}$. My interpretation is that $\pi^{AIXI}_{off}$ deals with (computable) action corruption optimally.
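The identity can be checked by brute force on a small example. The sketch below is a sanity check under stated assumptions, not a proof: `pi` is a hypothetical deterministic policy standing in for AIXI, the environment `mu` and the kernel `f` are made up, and for brevity `f` ignores the history entirely and flips every intended action with probability 1/3.

```python
from fractions import Fraction
from itertools import product

ACTIONS = (0, 1)
PERCEPTS = (0, 1)

def mu(percepts, executed):
    """Toy environment: each percept echoes the executed action w.p. 3/4."""
    prob = Fraction(1)
    for e, a in zip(percepts, executed):
        prob *= Fraction(3, 4) if e == a else Fraction(1, 4)
    return prob

def f(a_exec, a_intended):
    """Toy corruption kernel: the intended action is flipped w.p. 1/3."""
    return Fraction(2, 3) if a_exec == a_intended else Fraction(1, 3)

def pi(actions, percepts):
    """Deterministic stand-in policy: win-stay, lose-shift on the given history."""
    if not percepts:
        return 0
    return actions[-1] if percepts[-1] == 1 else 1 - actions[-1]

def a_star(percepts):
    """Recomputed actions, one per percept; action i only uses earlier percepts."""
    actions = []
    for i in range(len(percepts)):
        actions.append(pi(actions, percepts[:i]))
    return tuple(actions)

def pi_off(executed, percepts):
    """Off-policy variant: ignore executed actions, condition on a_star instead."""
    return pi(a_star(percepts), percepts)

def lhs(policy, percepts):
    """Sum over executed action strings of the corrupted history probability."""
    total = Fraction(0)
    for executed in product(ACTIONS, repeat=len(percepts)):
        weight = mu(percepts, executed)
        for i, a_exec in enumerate(executed):
            intended = policy(executed[:i], percepts[:i])  # deterministic policy
            weight *= f(a_exec, intended)
        total += weight
    return total

def rhs(percepts):
    """Extended environment mu^f driven by the recomputed actions a_star."""
    intended = a_star(percepts)
    total = Fraction(0)
    for executed in product(ACTIONS, repeat=len(percepts)):
        weight = mu(percepts, executed)
        for a_exec, a in zip(executed, intended):
            weight *= f(a_exec, a)
        total += weight
    return total
```

In this toy, `lhs(pi_off, e) == rhs(e)` for every percept string `e` of length 3. By contrast, feeding the naive policy `pi` (which conditions on the executed actions) into `lhs` gives 31/144 for `e = (1, 1)` while `rhs((1, 1))` is 25/144, so the identity is specific to the off-policy variant here.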
Now assume that there is an invertible function $\mathrm{im}$ such that the percepts contain $\mathrm{im}(a)$ and, with a slight abuse of notation, $a = \mathrm{im}^{-1}(e)$. That is, assume AIXI can see itself in the mirror.
Corollary: When $\mu$ is chosen so that the percepts contain an image of the actions,
$$\mu^{f \circ \pi^{AIXI}_{off}}(\mathrm{im}^{-1}(e)e_{1:t}) = (\mu^f)^{\pi^{AIXI}}(a^*(e)e_{1:t})$$
And no other actions are consistent with these percepts under $\mu$, so $f \circ \pi^{AIXI}_{off}$ interacting with $\mu$ causes the percepts $e_{1:t}$ with the same probability that $\pi^{AIXI}$ would when interacting with the environment $\mu^f$ that sometimes visibly alters its actions according to $f$.
Closing Thoughts
I believe the off-policy modification of AIXI described above deals with action corruption "correctly." In particular, we can formalize the intuition that $\pi^{AIXI}_{off}$ avoids the anvil problem. Since we have said little about the boundedly rational approximations of AIXI, the following argument is only on solid footing when we allow that a very good AIXI approximation has been physically implemented. Assume that before setting AIXI loose with powerful actuators, engineers both expose it to data about the external world and non-destructively tamper with its brain. Because brain surgery interferes with $\mathrm{im}(a)$ according to the laws of physics, AIXI's understanding of the laws of physics will increase the weight of environments of the form $\mu^f$ where $\mu$ captures the physics of the external world and $f$ involves the physics of AIXI's brain (roughly, the conditional Kolmogorov complexity of such $f$ given $\mu$ is low). This suggests a serious risk of permanent interference from destructive interventions like a falling anvil (possibly resulting in 0 reward forever), so once AIXI is let loose it will strongly prefer not to run the experiment of dropping an anvil on its head. All of this holds even though AIXI is not capable of literally believing its action choices $a^*$ depend on its brain - it simply works out the physicality of the action corruptions.
I do not claim that $\pi^{AIXI}_{off}$ solves all problems of embeddedness. A real implementation of AIXI would not choose the optimal actions $a^*$ but instead some approximation of them. That approximation would get better with more compute. It is possible that under some approximation schemes, $\pi^{AIXI}_{off}$ approximations granted more compute view their past suboptimal actions as corrupted actions and conclude that seeking more compute reduces action corruption, but because of potential instabilities in convergence I have not been able to prove this. In any case, $\pi^{AIXI}_{off}$ can only "learn about its own embeddedness" through observing the effects of side-channels on its performance, but perhaps cannot entertain that it may be a part of the universe merely by observing the existence of computers. I am not certain whether this is even true, and if so I cannot think of any examples where it would be a defect in practice. It does seem that $\pi^{AIXI}_{off}$ is inherently unable to perform anthropic reasoning, but again I am not sure whether this is a problem in practice or only a philosophical objection.
My more general stance is that the philosophical problems with AIXI are overstated, and theories of embedded agency should build off of AIXI. Though we certainly have no rigorously justified and fully satisfying theory of embedded agency (the closest may be Laurent Orseau and Mark Ring's "Space Time Embedded Intelligence"), it is not clear that these requirements are well-posed enough to have a unique answer. The most promising path I see to explore the possibilities starts with understanding the variations on AIXI. The naive off-policy extension of AIXI may eventually destroy itself, but I expect relatively minor variations on AIXI (such as $\pi^{AIXI}_{off}$ and $\pi_S$ with carefully chosen environment and policy mixtures) to succeed in practice modulo computational boundedness, at least with a little guidance during early stages of development - and frankly I am surprised that their analysis hasn't received more attention[3]. If you are interested, please ask me about the many open problems in this area!
[1] Most of this argument goes through for any Bayesian decision theorist. We implicitly rely on the flexibility of AIXI's hypothesis class by arguing that it should be able to identify the cases when its brain is or is not being tampered with and combine different methods of prediction for each. This is an instance of the "AIXI = God Thesis" in which we assume that any (non-reflective) theory that we can easily describe has a reasonable prior weight under the universal distribution and in practice AIXI will eventually adopt and act based on those theories that are useful for prediction, and therefore perform at least as well as optimal action based on our informally stated theory.
[2] The theorem that follows is inspired by the algebra of Samuel Alexander and Marcus Hutter's "Reward-Punishment Symmetric Universal Artificial Intelligence."
[3] My guess is that rationalists tend to have an ambitious and contrarian streak, which causes us to reject a whole paradigm at the first sign of philosophical limitations and prefer inventing entirely new theories. For example, I am thinking of logical induction, infra-Bayesianism, and singular learning theory. Certainly there are some fascinating ideas here worth exploring for their own sake; but this research is often justified through relevance to A.I. safety. I have only passing knowledge of these topics, but as far as I can tell the connection tends to be pretty weak. In contrast, if any rigorous theory of A.I. safety is possible, it probably needs to factor through a good understanding of an embedded version of AIXI if only to avoid wireheading by locating the reward model or utility function at the right position in its ontology. Admittedly the areas I mention have justifications that sound about as good as mine, but engaging with them is outside the scope of this post.