In 2013, Eliezer Yudkowsky and Marcello Herreshoff published "Tiling Agents for Self-Modifying AI, and the Löbian Obstacle." It is worth understanding because it:

  • Is a very well-written paper.
  • Expresses an idea that is non-obvious, and still relevant to alignment today.
  • Provides insight into what Eliezer and Marcello thought was valuable to work on in the time preceding publication.

When I first read this paper, I badly misunderstood it. Because the material is not particularly approachable for someone not well-versed in logic, I was confidently wrong for at least a month. This post summarizes my understanding of what the Löbian Obstacle is (having re-read the paper), and why I think it is still an important idea a decade after its publication.


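An agent $A^1$ occupies a fully-known, deterministic and closed environment. $A^1$ has a goal $G$ that is either satisfied or not by an outcome, and $A^1$ prefers that it be satisfied. An action $b_i \in \mathit{Acts}^0$ performed by an agent created by $A^1$, hereafter referred to as $A^0$, must therefore satisfy the statement:

$$\bar{b}_i \Rightarrow A^0 \Vdash \bar{b}_i \rightarrow G$$

Here $\bar{b}_i$ denotes the actual performance of $b_i$, and $\Vdash$ denotes cognitive belief in the succeeding statement. Suppose $A^1$ can verify by inspection of $A^0$'s design that this will hold, as in:

$$A^1 \Vdash \bar{A}^0 \rightarrow \left(\forall b_i \in \mathit{Acts}^0 : \bar{b}_i \rightarrow \Box_0 \lceil \bar{b}_i \rightarrow G \rceil\right)$$

where $\bar{A}^0$ denotes the actual construction of $A^0$, and $\Box_0 \lceil \phi \rceil$ refers to a proof of $\phi$ from the axioms of $A^0$, $\mathit{Axm}^0$. Even so, $A^1$ cannot conclude that $G$ will be satisfied, as that would require:

$$A^1 \Vdash \forall b_i : \left(\Box_0 \lceil \bar{b}_i \rightarrow G \rceil \rightarrow (\bar{b}_i \rightarrow G)\right)$$

For this to be so, $\mathit{Axm}^1$ would have to prove that if some proof of $\phi$ exists in $\mathit{Axm}^0$, then $\phi$ must be true. "Tiling Agents for Self-Modifying AI, and the Löbian Obstacle" shows this to be impossible when $\mathit{Axm}^0$ is as strong as $\mathit{Axm}^1$: by Löb's theorem, a consistent formal system cannot prove its own soundness schema.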
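For readers who want the logical step spelled out, here is a short sketch of why a soundness schema is out of reach. The symbols $T$ (a consistent theory extending Peano Arithmetic, standing in for the axioms shared by both agents) and $\Box_T$ (its provability predicate) are my own notation for this illustration, not the paper's.

```latex
\begin{align*}
&\text{L\"ob's theorem, for any sentence } \phi: &&
  T \vdash \Box_T \lceil \phi \rceil \rightarrow \phi
  \;\Longrightarrow\; T \vdash \phi \\
&\text{Soundness schema instance at } \phi = \bot: &&
  T \vdash \Box_T \lceil \bot \rceil \rightarrow \bot \\
&\text{Applying L\"ob's theorem to that instance:} &&
  T \vdash \bot
\end{align*}
```

So a consistent $T$ cannot prove even the single instance for $\bot$ (this is Gödel's second incompleteness theorem), let alone the full schema. The obstacle bites when the parent agent wants its successor to reason from axioms as strong as its own.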
 


The above was a brief paraphrasing of section two of the original paper, which contains many additional details and complete proofs. How the Löbian Obstacle relates to simulators is my current topic of research, and this section will make the case that this is an important component of designing safe simulators.

We should first consider that simulating an agent is not distinguishable from creating one, and that consequently the implications of creating dangerous agents should generalize to their simulation. Hubinger et al. (2023) have stated similar concerns, and provide a more detailed examination of the argument.

It is also crucial to understand that simulacra are not necessarily terminating, and may themselves use simulation as a heuristic for solving problems. This could result in a kind of hierarchy of simulacra. In advanced simulators capable of very complex simulations, we might expect a complex network of simulacra bound by acausal trade and 'complexity theft,' whereby one simulacrum tries to obtain more simulation complexity as a form of resource acquisition or recursive self-improvement.

I expect this to happen. Lower-complexity simulacra may still be more intelligent than their higher-complexity counterparts, and as a simulator's simulacra count may grow exponentially, so does the likelihood that at least one simulacrum attempts complexity theft.
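To make the growth claim concrete, here is a toy calculation rather than a model of any real simulator. It assumes, purely for illustration, that every simulacrum spawns the same number of children (`branching`), that the hierarchy has a fixed `depth`, and that each simulacrum independently attempts theft with the same probability `p`. None of these assumptions come from the paper, and the function names are hypothetical.

```python
import math


def simulacra_count(branching: int, depth: int) -> int:
    """Total simulacra when each one spawns `branching` children, down to `depth` levels."""
    return sum(branching ** level for level in range(1, depth + 1))


def p_any_theft(p: float, branching: int, depth: int) -> float:
    """Probability that at least one simulacrum attempts theft.

    Assumes independent attempts with identical probability p (a toy assumption).
    Computes 1 - (1 - p)^n in a numerically stable way.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    n = simulacra_count(branching, depth)
    if p == 1.0:
        return 1.0 if n > 0 else 0.0
    return -math.expm1(n * math.log1p(-p))
```

Under these assumptions, with `p = 0.001` and `branching = 2`, the probability is about 0.06 at depth 5 and about 0.87 at depth 10: a per-simulacrum risk that looks negligible quickly stops being negligible as the hierarchy deepens.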

If we want safe simulators, we need the resulting, potentially abyssal simulacra hierarchy to be aligned all the way down. Without being able to thwart the Löbian Obstacle, I doubt a formal guarantee is attainable. If we could thwart it, we may only need to simulate one aligned tiling agent, for which we might settle for a high-certainty informal guarantee of alignment if short on time. I outlined how I thought that could be done here, although I've advanced the theory considerably since and will post an updated write-up soon.

If we can't reliably thwart the Löbian Obstacle, we should consider alternatives:

  • Can we reliably attain high-certainty informal guarantees of alignment for arbitrarily deep simulacra hierarchies?
  • Is limiting the depth of simulacra hierarchies possible?