Do you have plans to measure the alignment of pure RL agents, as opposed to repurposed language models? It surprised me a bit when I discovered that there isn’t a standard publicly available value learning benchmark, despite there being data to create one. An agent would be given first or third-person demonstrations of people trying to maximize their score in a game, and then it would try to do the same, without ever getting to see what the true reward function is. Having something like this would probably be very useful; it would allow us to directly measure goodharting, and being quantitative it might help incentivize regular ML researchers to work on alignment. Will you create something like this?
The simplest possible acceptable value learning benchmark would look something like this:
None of the environments and datasets you mention are actually like this. Some people do test their IRL algorithms in a way similar to this (the difference being that they learn from another bot), but the details aren’t standardized.
A harder and more realistic version that I have yet to see in any paper would look something like this:
It’s far from perfect, but if someone could obtain good scores there, it would probably make me much more optimistic about the probability of solving alignment.