I have a question about the conjecture at the end of Direction 17.5. Let U1 be a utility function with values in [0,1] and let f:[0,1]→[0,1] be a strictly increasing function. Then U1 and U2 = f∘U1 have the same maxima. f can be non-linear, e.g. f(x) = x^2. Therefore, I wonder whether the condition u(y) = αv(y) + β should be weakened.
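To make the worry concrete, here is a minimal sketch (the argmax derivation below is my own notation, not the post's):

```latex
% Minimal sketch: a strictly increasing reparametrization preserves
% maximizers without being affine. Take f(x) = x^2 on [0,1].
\[
  U_1(y) \le U_1(y') \iff U_1(y)^2 \le U_1(y')^2
  \quad \text{(since } x \mapsto x^2 \text{ is strictly increasing on } [0,1]\text{)},
\]
\[
  \text{so } \operatorname*{arg\,max}_y U_1(y) \;=\; \operatorname*{arg\,max}_y \,(f \circ U_1)(y),
\]
\[
  \text{yet } x^2 = \alpha x + \beta \text{ holds for at most two values of } x,
  \text{ so } f \circ U_1 \ne \alpha U_1 + \beta \text{ whenever } U_1 \text{ takes three or more values.}
\]
```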
Moreover, I wonder whether it is possible to modify U1 by a small amount at a place far away from the optimal policy such that π is still optimal for the modified utility function.
No, because such a modification changes the expected value of the utility function under various distributions.
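Spelling out the reasoning (V_U(π) and μ_π below are my notation for a policy's value and the distribution over destinies it induces, not the post's):

```latex
% Optimality compares expected utilities across ALL policies, each inducing
% its own distribution over destinies, so a modification that pi* never
% "sees" can still matter.
\[
  V_U(\pi) \;=\; \mathbb{E}_{y \sim \mu_\pi}\!\left[ U(y) \right],
  \qquad
  \pi^* \text{ optimal for } U \;\iff\; V_U(\pi^*) \ge V_U(\pi) \text{ for all } \pi.
\]
% Changing U only outside the support of mu_{pi*} leaves V_U(pi*) unchanged,
% but it can raise V_U(pi) for another policy pi that does reach the modified
% region, and then pi* is no longer optimal.
```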
Good catch, the conjecture as stated is obviously false: for example, we can take U2 to be the same as U1 everywhere except after some action which π∗ doesn't actually take, where we make it identically 0 (a formal sketch follows the list below). Some possible ways to fix it:
* Require the utility function to be of the form U: O^ω → [0,1] (i.e. not depend on actions).
* Use (strictly) instrumental reward functions.
* Weaken the conclusion so that we're only comparing U1 and U2 on-policy (but this might be insufficient for superimitation).
* Require π∗ to be optimal off-policy (but it's unclear how this can generalize to finite g).
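A formal sketch of the counterexample above (my notation: h⊑y means history h is a prefix of destiny y, and V_U(π) is the expected utility of π under U):

```latex
% U_2 agrees with U_1 except after an action a (at a history h) that pi*
% never takes.
\[
  U_2(y) \;=\;
  \begin{cases}
    0      & \text{if } ha \sqsubseteq y, \\
    U_1(y) & \text{otherwise.}
  \end{cases}
\]
% U_2 <= U_1 pointwise, and U_2 = U_1 on every destiny realizable under pi*, so
\[
  V_{U_2}(\pi) \;\le\; V_{U_1}(\pi) \;\le\; V_{U_1}(\pi^*) \;=\; V_{U_2}(\pi^*)
  \quad \text{for every policy } \pi,
\]
\[
  \text{hence } \pi^* \text{ is still optimal for } U_2,
  \text{ yet } U_2 \ne \alpha U_1 + \beta \text{ in general.}
\]
```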