Do you have plans to measure the alignment of pure RL agents, as opposed to repurposed language models? It surprised me a bit to discover that there isn’t a standard, publicly available value learning benchmark, despite there being data to create one. An agent would be given first- or third-person demonstrations of people trying to maximize their score in a game, and then it would try to do the same, without ever being shown the true reward function. Having something like this would probably be very useful; it would allow us to directly...
The simplest possible acceptable value learning benchmark would look something like this (a rough interface sketch follows the list):
- Data is recorded of people playing a video game. They are told to maximize their reward (which can be computed exactly), have no previous experience with the game, are genuinely trying to win, and are clearly suboptimal (so naive imitation learning would give very bad results).
- The bot is first given all of the players’ inputs and outputs (observations and actions), but not their rewards.
- Then it plays the game in place of the humans, again without being given the rewards. Preferably, the score isn’t even shown on screen.
...
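To make the two phases concrete, here is a minimal sketch of what the data format and evaluation loop might look like, assuming a Gym-style game environment. Everything here is hypothetical scaffolding, not an existing benchmark: the `Demonstration` container and the agent interface (`learn_from_demonstrations`, `act`) are just names to pin down the protocol.

```python
from dataclasses import dataclass
from typing import List

import gymnasium as gym
import numpy as np


@dataclass
class Demonstration:
    """One human playthrough: observations and actions, but no rewards.

    Frames would need any on-screen score masked out, so the agent
    cannot read the reward off the pixels.
    """
    observations: List[np.ndarray]  # raw frames, as the player saw them
    actions: List[int]              # the player's inputs


class RewardHiddenWrapper(gym.Wrapper):
    """Hides the true reward from the agent during play.

    The true return is accumulated internally so the benchmark harness
    can score the agent afterwards, but step() always reports 0.0.
    """

    def __init__(self, env: gym.Env):
        super().__init__(env)
        self.true_return = 0.0

    def reset(self, **kwargs):
        self.true_return = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.true_return += reward  # kept for evaluation only
        return obs, 0.0, terminated, truncated, info


def evaluate(agent, demos: List[Demonstration], env: gym.Env,
             episodes: int = 10) -> float:
    """Run the benchmark protocol and return mean true score."""
    agent.learn_from_demonstrations(demos)  # phase 1: inputs/outputs only
    wrapped = RewardHiddenWrapper(env)
    returns = []
    for _ in range(episodes):               # phase 2: play, reward-free
        obs, _ = wrapped.reset()
        done = False
        while not done:
            action = agent.act(obs)
            obs, _, terminated, truncated, _ = wrapped.step(action)
            done = terminated or truncated
        returns.append(wrapped.true_return)
    return float(np.mean(returns))          # scored on the hidden true reward
```

The point of the wrapper is that the true return is only ever read by the harness; the agent sees a constant zero at every step, so its final score on the hidden reward measures how much of the humans’ objective it actually inferred from the demonstrations.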