I think this assumes implicitly that P(A|ref) is uniformly distributed over all the 10,000 options. In a video game I‘d think more that the ”reference” is always to output 0s since the player isn’t interacting. Then The KL divergence could be arbitrarily large. But it’s not really clear in general how to interpret the reference distribution, perhaps someone can clarify?
I think this assumes implicitly that P(A|ref) is uniformly distributed over all the 10,000 options. In a video game I‘d think more that the ”reference” is always to output 0s since the player isn’t interacting. Then The KL divergence could be arbitrarily large. But it’s not really clear in general how to interpret the reference distribution, perhaps someone can clarify?