Truly logical counterfactuals really only make sense in the context of bounded rationality. That is, cases where there is a logically necessary proposition, but the agent cannot determine it within their resource bounds. Essentially all aspects of bounded rationality have no satisfactory treatment as yet.
The prisoners' dilemma question does not appear to require dealing with logical counterfactuals. It is not logically contradictory for two agents to make different choices in the same situation, or even for the same agent to make different decisions given the same situation, though the setup of some scenarios may make it very unlikely or even direct you to ignore such possibilities.
The assumption here is that we can implement a system that deterministically computes mathematical functions with no side effects. We can get much closer for this than we can for leaking information outward, but we still fail to do this perfectly. There are real-world exploits that have broken this abstraction to cause undesired behaviour, and which could be used to gather information about some properties of the real world.
For example, cosmic rays cause bit errors and we can already write software that observes them with high probability of not crashing. We can harden our hardware such as by adding error correction, but this reduces the error rate without eliminating it. There are also CPU bugs, heat-related errors, and DRAM charge levels that have been exploited in real applications, and no doubt many other potential vectors that we haven't discovered yet.
It would certainly be extremely difficult to discern much about the real world purely via such channels, but it only takes one exploitable bug that opens a completely unintended inward channel for the game to be over.
It seems to me that there are some serious practical problems in trying to train this sort of behaviour. After all, a successful execution shuts the system off and it never updates on the training signal. You could train it for something like "when the date from the clock input exceeds the date on input SDD, output high on output SDN (which in the live system will feed to a shutdown switch)", but that's a distant proxy. It seems unlikely to generalize correctly to what you really want, which is much fuzzier.
For example, what you really want is more along the lines of determining the actual date (by unaltered human standards) and comparing with the actual human-desired shutdown date (without manipulating what the humans want), and actually shut down (by means that don't harm any humans or anything else they value). Except that this messy statement isn't nearly tight enough, and a superintelligent system would eat the world in a billion possible ways even assuming that the training was done in a way that the system actually tried to meet this objective.
How are we going to train a system to generalize to this sort of objective without it already being Friendly AGI?
Yes, I'm referring to 5a/5b in conjunction with self-reflection as a deduction rule, since the agent is described as being able to use these to derive new propositions.
There is also a serious problem with 5a itself: the agent needs to try to prove that some new proposition P is "consistent with its model of the world". That is, for its existing axioms and derivation rules T, prove that T+P does not derive a contradiction.
If T is consistent, then this is impossible by Gödel's second incompleteness theorem. Hence for any P, Step 5a will always exhaust all possible proofs of up to whatever bound and return without finding such a proof. Otherwise T is inconsistent and it may be able to find such a proof, but it is also obvious that its proofs can't be trusted and not at all surprising that it will take the wrong route.
This looks full of straightforward errors.
Firstly, the agent's logical deduction system is unsound. It includes something comparable with Peano arithmetic (or else Löb's theorem can't be applied), and then adds a deduction rule "if P can be proved consistent with the system-so-far then assume P is true". But we already know that for any consistent extension T of Peano arithmetic there is at least one proposition G for which both G and ~G are consistent with T. So both of these are deducible using the rule. Now, the agent might not find the contradiction, because...
This system can only find proofs up to a length of a million characters. The requirements of Löb's theorem include a formal system that is an extension of PA and is closed under deduction: that is, derives proofs of unbounded length. So we can't apply Löb's theorem to this system anyway.
If they have source code, then they are not perfectly rational and cannot in general implement LDT. They can at best implement a boundedly rational subset of LDT, which will have flaws.
Assume the contrary: Then each agent can verify that the other implements LDT, since perfect knowledge of the other's source code includes the knowledge that it implements LDT. In particular, each can verify that the other's code implements a consistent system that includes arithmetic, and can run the other on their own source to consequently verify that they themselves implement a consistent system that includes arithmetic. This is not possible for any consistent system.
The only way that consistency can be preserved is that at least one cannot actually verify that the other has a consistent deduction system including arithmetic. So at least one of those agents is not a LDT agent with perfect knowledge of each other's source code.
We can in principle assume perfectly rational agents that implement LDT, but they cannot be described by any algorithm and we should be extremely careful in making suppositions about what they can deduce about each other and themselves.