Why does the AI even "want" failure mode 3? If it's an RL agent, it's not "motivated to maximize its reward", it's "motivated to use generalized cognitive patterns that in its training runs would have marginally maximized its reward". Failure mode 3 is the peak of an entirely separate mountain from the one RL is climbing, and I think a well-designed box setup can (more-or-less "provably") prevent any cross-peak bridges in the form of cognitive strategies that undermine this.
That is to say: yes, it can (or at least, it's not provable that it can't) imagine a way to break the box, and it can know that the reward it would actually get from breaking the box would be "infinite", but it can be successfully prevented from "feeling" the infinite-ness of that potential reward, because the RL procedure itself doesn't consider a broken-box outcome to be a valid target of cognitive optimization.
Now, this creates a new failure mode, where it hacks its own RL optimizer. But that just makes it unfit, not dangerous. Insofar as something goes wrong to let this happen, it would be obvious and easy to deal with, because it would be optimizing for thinking it would succeed and not for succeeding.
(Of course, that last sentence could also fail. But at least that would require two simultaneous failures to become dangerous; and it seems in principle possible to create sufficient safeguards and warning lights around each of those separately, because the AI itself isn't subverting those safeguards unless they've already failed.)