"most agents stay alive in Pac-Man and postpone ending a Tic-Tac-Toe game", but only in the limit of farsightedness (γ→1).
I think there are two separable concepts at work in these examples: the success of an agent, and the agent's choices as determined by the reward functions and farsightedness.
If we compare two agents, one in the limit of farsightedness (γ→1) and the other with half that farsightedness (γ→1/2), then I expect the first agent to be more successful across a uniform distribution of reward functio...
By farsightedness, I mean the value of the discount factor γ∈[0,1), with which the agent geometrically discounts rewards at future time steps. That is, the reward r received k steps in the future is discounted as γ^k r. My theorems assume that, given the reward function R, the agent computes the optimal policy (set) for R at discount rate γ.
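To make the role of γ concrete, here is a minimal Python sketch of geometric discounting and of computing an optimal first action by value iteration. The toy MDP, the state and action names, and the helper functions are all hypothetical illustrations, not anything taken from the theorems; rewards are assumed to be attached to states and transitions are assumed deterministic.

```python
# Hypothetical toy MDP (an assumption for illustration): deterministic moves,
# reward attached to states. "grab" gives one large reward and then nothing;
# "loop" gives a small reward forever.
TRANSITIONS = {
    "start": {"grab": "big", "loop": "small_loop"},
    "big": {"wait": "dead"},
    "dead": {"wait": "dead"},
    "small_loop": {"stay": "small_loop"},
}
REWARD = {"start": 0.0, "big": 1.0, "dead": 0.0, "small_loop": 0.3}


def discounted_return(rewards, gamma):
    """Sum of gamma**k * r_k for a reward sequence r_0, r_1, ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


def optimal_first_action(transitions, reward, gamma, start, sweeps=2000):
    """Run synchronous value iteration, then pick the best action at `start`."""
    values = {s: 0.0 for s in transitions}
    for _ in range(sweeps):
        # V(s) = R(s) + gamma * max over successors of V(s')
        values = {
            s: reward[s] + gamma * max(values[n] for n in transitions[s].values())
            for s in transitions
        }
    # The reward at `start` is shared by all actions, so compare successor values.
    return max(transitions[start], key=lambda a: values[transitions[start][a]])
```

In this toy example the same reward function yields different optimal choices at different discount factors: `optimal_first_action(TRANSITIONS, REWARD, 0.5, "start")` picks "grab", while γ=0.9 picks "loop", because the loop is worth 0.3/(1−γ), which exceeds 1 only when γ>0.7.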
There's a different (intuitive) notion of farsightedness, in which the agent can only compute policies within a k-neighborhood of the current state. I think this is the notion you're referring to. In this case, gaining power is a good heuristic, as you say.
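One way to read the k-neighborhood notion is as depth-limited lookahead: the agent evaluates only what it can reach within k steps. The sketch below is an assumption about how that could be formalized, with a hypothetical corridor environment and hypothetical helper names; it is not a definition from the post.

```python
# Hypothetical corridor (an assumption for illustration): a small reward is
# available right away, a larger one sits three steps from the start.
CORRIDOR = {
    "start": {"short": "near", "long": "c1"},
    "near": {"stay": "near"},
    "c1": {"go": "c2"},
    "c2": {"go": "far"},
    "far": {"stay": "far"},
}
CORRIDOR_REWARD = {"start": 0.0, "near": 0.2, "c1": 0.0, "c2": 0.0, "far": 1.0}


def lookahead_value(transitions, reward, state, k, gamma):
    """Best discounted return from `state` using at most k further steps."""
    if k == 0:
        return reward[state]
    best_next = max(
        lookahead_value(transitions, reward, nxt, k - 1, gamma)
        for nxt in transitions[state].values()
    )
    return reward[state] + gamma * best_next


def bounded_choice(transitions, reward, start, k, gamma):
    """Action at `start` that looks best when planning only k >= 1 steps ahead."""
    return max(
        transitions[start],
        key=lambda a: lookahead_value(
            transitions, reward, transitions[start][a], k - 1, gamma
        ),
    )
```

With γ=0.9, a horizon of k=1 or k=2 selects "short" because the distant reward is invisible, while k=3 selects "long". This separates the planning horizon k from the discount factor γ, which is held fixed throughout.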
If an agent is randomly placed in a given distribution of randomly connected points, I see why there are diminishing returns on seeking more power, but that return is never 0, is it?
You say:
I think there are two separable concepts at work in these examples: the success of an agent, and the agent's choices as determined by the reward functions and farsightedness.
If we compare two agents, one in the limit of farsightedness (γ→1) and the other with half that farsightedness (γ→1/2), then I expect the first agent to be more successful across a uniform distribution of reward functio...