Then, is considered to be a precursor of in universe when there is some -policy s.t. applying the counterfactual " follows " to (in the usual infra-Bayesian sense) causes not to exist (i.e. its source code doesn't run).
A possible complication is, what if implies that creates / doesn't interfere with the creation of ? In this case might conceptually be a precursor, but the definition would not detect it.
Can you plea...
So, let me try to summarize and check my understanding. In the first part of the post, you show that most random reward functions are not corrigible. This looks correct.
In the second part, you want to prove that VNM-coherence is incompatible with corrigibility in this universe, and I don't think I follow. So, suppose that R(A_blue),R(B_blue),R(C_blue)>max(R(A_red),R(B_red),R(C_red)). Now we change the dynamics so that the human will not correct the agent by default, but can be manipulated into it. Then we need to add states A_black and C_black, and arro...
Thank you for explaining this! But then how can this framework be used to model humans as agents? People can easily imagine outcomes worse than death or destruction of the universe.