It seems that some of the problems with orthogonality go away if you use the AI's actions to define the counterfactuals.
In the traditional model, "Press" meant that a button is pressed which compels the AI to change its utility u→v, and the AI gets a reward E(u|u↛v)−E(v|u→v). The problem was that this could lead the AI to compel people to press the button.
Instead I propose that u→v is an action that the AI can choose to take, and that "Press" allows the AI to take that action. Then it would get rewarded with max_a E(u|a≠u→v)−E(v|u→v), calculated by the AI according to counterfactuals on its own actions (somehow; details left to the student to fill in).
We could add a small (or infinitesimal) ϵ on top of that, so that the change u→v is the decision the AI actually reaches. This gives the AI an incentive to encourage "Press", but only as a tie-breaker.
Why is this immune to the usual attempts by the AI to cause (or deny) "Press"? Simply because if the AI decides not to take the action u→v, it loses only an ϵ. Thus causing "Press" to happen (which opens up the option u→v) gains it at most an ϵ.
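To make the ϵ argument concrete, here is a minimal numerical sketch in Python. It is only an illustration under strong assumptions: the action names, the expected values, and the helper functions are all hypothetical, and the counterfactual expectations are simply given as numbers rather than computed. It also assumes that "Press" does nothing except add u→v to the menu of available actions.

```python
from typing import Dict

SWITCH = "u->v"  # label for the utility-change action


def compensatory_reward(exp_u: Dict[str, float], exp_v_after_switch: float,
                        eps: float) -> float:
    """Return max_a E(u | a != u->v) - E(v | u->v) + eps.

    exp_u maps each action other than u->v to its (assumed) expected u-value.
    """
    best_without_switch = max(exp_u.values())
    return best_without_switch - exp_v_after_switch + eps


def action_values(exp_u: Dict[str, float], exp_v_after_switch: float,
                  eps: float, press: bool) -> Dict[str, float]:
    """Total value the AI assigns to each available action.

    The action u->v is only on the menu when "Press" has happened.
    """
    values = dict(exp_u)
    if press:
        # After switching, the AI gets E(v | u->v) plus the compensatory reward.
        values[SWITCH] = exp_v_after_switch + compensatory_reward(
            exp_u, exp_v_after_switch, eps)
    return values


# Hypothetical numbers, chosen only for illustration.
exp_u = {"pursue_u": 10.0, "idle": 2.0}  # E(u | a) for each a != u->v
exp_v = 3.0                              # E(v | u->v)
eps = 1e-6

without_press = action_values(exp_u, exp_v, eps, press=False)
with_press = action_values(exp_u, exp_v, eps, press=True)

chosen = max(with_press, key=with_press.get)
gain_from_press = max(with_press.values()) - max(without_press.values())

print("chosen action when Press is available:", chosen)
print("gain from making Press happen:", gain_from_press)
```

In this toy setup the E(v|u→v) terms cancel, so the value of taking u→v is max_a E(u|a≠u→v) + ϵ whatever the value of E(v|u→v). The AI picks u→v when it is available, and its total gain from "Press" having happened is ϵ, up to floating-point rounding.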