How do we train a reinforcement learning system to imitate a human who produces complex outputs, such as strings? Existing approaches are not entirely satisfactory.
Concretely, suppose there is some set of questions. A human can answer each question with a string. We have a set of (question, answer) pairs as training data. The task is to train a model that, given a question, produces an answer similar to the one a human would give. Here are some existing approaches to this problem:
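As a minimal sketch of this setup (the names `QAPair` and `Imitator` below are illustrative, not part of the problem statement), the training data and the model's interface might look like:

```python
# Minimal sketch of the problem setup; names and example data are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    question: str
    answer: str  # the answer a human gave to this question


# Training data: a set of (question, answer) pairs.
training_data: List[QAPair] = [
    QAPair("What is the capital of France?", "Paris"),
    QAPair("Name a prime number greater than 10.", "Eleven"),
]


class Imitator:
    """A model that, given a question, should produce an answer similar to a human's."""

    def answer(self, question: str) -> str:
        raise NotImplementedError
```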
In a generative adversarial model, one agent (the imitator) outputs an answer and another agent (the distinguisher) tries to determine if the answer was produced by the imitator or by an actual human. The imitator gets a score for fooling the distinguisher.
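A schematic of this adversarial setup might look like the sketch below. The class and function names are illustrative, the placeholder bodies stand in for learned models, and the update rules are left abstract; the point is only to show how the score flows from the distinguisher to the imitator.

```python
# Schematic adversarial training step; all names are illustrative and the
# placeholder bodies stand in for learned models.
import random


class Distinguisher:
    def prob_human(self, question: str, answer: str) -> float:
        """Estimated probability that `answer` was written by a human."""
        return 0.5  # placeholder; a real model would be learned

    def update(self, question: str, answer: str, is_human: bool) -> None:
        pass  # train to classify human vs. imitator answers


class Imitator:
    def answer(self, question: str) -> str:
        return ""  # placeholder; a real model would generate a string

    def update(self, question: str, answer: str, reward: float) -> None:
        pass  # reinforcement learning update: higher reward for fooling the distinguisher


def adversarial_step(imitator, distinguisher, question, human_answer):
    fake = imitator.answer(question)

    # The imitator is scored on being judged human.
    reward = distinguisher.prob_human(question, fake)
    imitator.update(question, fake, reward)

    # The distinguisher is trained to tell human answers from imitator answers.
    if random.random() < 0.5:
        distinguisher.update(question, human_answer, is_human=True)
    else:
        distinguisher.update(question, fake, is_human=False)
```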
Advantages:
Disadvantages:
Given a question, the imitator outputs some representation of a distribution. It is scored according to the (possibly estimated) probability that this distribution assigns to the human's actual answer, measured as a log probability. This is called "retrodiction" because the imitator must be able to show, in hindsight, that the actual answer had high probability.
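For a small, explicitly enumerated answer space, the scoring rule can be written directly as in the sketch below. This enumeration is an assumption made for illustration; a real model would represent the distribution implicitly (e.g. token by token) rather than as a dictionary over whole strings, which is why the probability typically has to be estimated.

```python
# Sketch of the retrodiction score for an explicitly enumerated distribution.
import math
from typing import Dict


def retrodiction_score(predicted: Dict[str, float], human_answer: str) -> float:
    """Score a predicted distribution by the log probability it assigns to the human's answer."""
    p = predicted.get(human_answer, 0.0)
    return math.log(p) if p > 0 else float("-inf")


# Example: the imitator's distribution over answers to "What is the capital of France?"
predicted = {"Paris": 0.9, "Lyon": 0.08, "Marseille": 0.02}
score = retrodiction_score(predicted, "Paris")  # log(0.9) ≈ -0.105
```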
Practical examples of this approach include variational autoencoders and approximate model counting. Perhaps importance sampling provides a good general theory of retrodiction.
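As one illustration of that idea, the sketch below uses importance sampling to estimate the log probability that a latent-variable model assigns to the human's answer, in the style of an importance-weighted bound for a VAE-like model. The functions `log_prior`, `log_proposal`, `log_likelihood`, and `importance_sampled_log_prob` are hypothetical toy stand-ins for learned networks, not anything from the original text.

```python
# Importance-sampling estimate of log p(answer) under a toy latent-variable model.
import math
import random


def log_prior(z: float) -> float:
    # Standard normal prior over the latent variable z.
    return -0.5 * (z * z + math.log(2 * math.pi))


def log_proposal(z: float) -> float:
    # For simplicity the proposal equals the prior; in a VAE it would be a
    # learned encoder q(z | question, answer).
    return log_prior(z)


def sample_proposal() -> float:
    return random.gauss(0.0, 1.0)


def log_likelihood(answer: str, z: float) -> float:
    # Toy stand-in for a learned decoder p(answer | z): a Poisson score on the
    # answer's length whose rate depends on z.
    rate = math.exp(z)
    k = len(answer)
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def importance_sampled_log_prob(answer: str, n_samples: int = 1000) -> float:
    """Estimate log p(answer) = log E_{z~q}[ p(answer | z) p(z) / q(z) ]."""
    log_ws = []
    for _ in range(n_samples):
        z = sample_proposal()
        log_ws.append(log_prior(z) + log_likelihood(answer, z) - log_proposal(z))
    # Log-mean-exp of the importance weights; by Jensen's inequality this is,
    # in expectation, a lower bound on the true log probability.
    m = max(log_ws)
    return m + math.log(sum(math.exp(w - m) for w in log_ws) / n_samples)
```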
Advantages:
Disadvantages: