AI ALIGNMENT FORUM

Thanks for the comment! Note that we use the state-action visitation distribution, so the trajectories we consider contain actions as well as states. This makes it possible to invert the map from policies to visitation distributions and recover the policy (as long as all states are visited). Using only state trajectories, it would indeed be impossible to recover the policy.
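To make the inversion concrete, here is a minimal sketch, assuming a discounted state-action visitation distribution d(s, a) = (1 - γ) Σ_t γ^t P(s_t = s, a_t = a). Under that assumption, the policy can be read off as π(a | s) = d(s, a) / Σ_a' d(s, a') in every state with nonzero visitation. The toy MDP, the numbers, and the function names `occupancy` and `recover_policy` are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy MDP: 2 states, 2 actions. All numbers are illustrative.
GAMMA = 0.9
# P[s][a][s2] = probability of landing in s2 after taking action a in state s
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
MU0 = [1.0, 0.0]                # initial state distribution
PI = [[0.3, 0.7], [0.6, 0.4]]   # PI[s][a] = probability of action a in state s


def occupancy(policy, horizon=2000):
    """Approximate the discounted state-action visitation distribution d[s][a]
    by summing over a long (truncated) horizon."""
    n_s, n_a = len(P), len(P[0])
    d = [[0.0] * n_a for _ in range(n_s)]
    state_dist = MU0[:]
    weight = 1.0 - GAMMA
    for _ in range(horizon):
        nxt = [0.0] * n_s
        for s in range(n_s):
            for a in range(n_a):
                p_sa = state_dist[s] * policy[s][a]
                d[s][a] += weight * p_sa
                for s2 in range(n_s):
                    nxt[s2] += p_sa * P[s][a][s2]
        state_dist = nxt
        weight *= GAMMA
    return d


def recover_policy(d):
    """Invert d back to a policy by normalising each state's row.
    Fails for unvisited states, where the policy is not identifiable."""
    policy = []
    for s, row in enumerate(d):
        total = sum(row)
        if total == 0.0:
            raise ValueError(f"state {s} is never visited; policy not recoverable")
        policy.append([x / total for x in row])
    return policy


d = occupancy(PI)
recovered = recover_policy(d)
```

Summing `d` over actions gives the state-only visitation distribution, which discards exactly the per-state action ratios that `recover_policy` relies on.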

