All of Frank_R's Comments + Replies

I have a question about the conjecture at the end of Direction 17.5. Let U be a utility function with values in [0,1] and let f be a strictly increasing function. Then U and f∘U have the same maxima. f can be non-linear. Therefore, I wonder if the condition in the conjecture should be weaker.

Moreover, I ask myself if it is possible to modify U by a small amount at a place far away from the optimal policy such that π∗ is still optimal fo...

Vanessa Kosoy
No, because it changes the expected value of the utility function under various distributions.

Good catch, the conjecture as stated is obviously false. For example, we can take U2 to be the same as U1 everywhere except after some action which π∗ doesn't actually take, where we make it identically 0. Some possible ways to fix it:

* Require the utility function to be of the form U: O^ω → [0,1] (i.e. not depend on actions).
* Use (strictly) instrumental reward functions.
* Weaken the conclusion so that we're only comparing U1 and U2 on-policy (but this might be insufficient for superimitation).
* Require π∗ to be optimal off-policy (but it's unclear how this can generalize to finite g).
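
To make the first point of the reply concrete, here is a minimal toy sketch. The outcomes, the utility values, and the transform f(x) = x² are all assumptions chosen for illustration and do not come from the thread. It shows that a strictly increasing, non-linear f leaves the best single outcome unchanged but can reverse the ranking of two lotteries by expected utility.

```python
# Hypothetical toy example: three outcomes with utilities in [0, 1].
U = {"a": 0.0, "b": 0.6, "c": 1.0}


def f(x):
    # Assumed transform: strictly increasing and non-linear on [0, 1].
    return x ** 2


# The transformed utility function f∘U.
fU = {outcome: f(value) for outcome, value in U.items()}


def expected(util, lottery):
    # Expected utility of a lottery given as {outcome: probability}.
    return sum(prob * util[outcome] for outcome, prob in lottery.items())


sure_b = {"b": 1.0}            # outcome b with certainty
gamble = {"a": 0.5, "c": 0.5}  # fair coin flip between a and c

# The maximizing outcome is the same for U and f∘U.
best_outcome_U = max(U, key=U.get)
best_outcome_fU = max(fU, key=fU.get)

# The ranking of the two lotteries by expected utility is not preserved.
prefers_sure_under_U = expected(U, sure_b) > expected(U, gamble)
prefers_sure_under_fU = expected(fU, sure_b) > expected(fU, gamble)

print(best_outcome_U, best_outcome_fU)
print(prefers_sure_under_U, prefers_sure_under_fU)
```

In this sketch U prefers the sure outcome while f∘U prefers the gamble, so an agent maximizing expected utility under uncertainty can behave differently under the two functions even though their maxima coincide.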