They mention and link to iterated amplification in the Medium article.
Scaling up
In the long run, we would like to scale reward modeling to domains that are too complex for humans to evaluate directly. To do this, we need to boost the user’s ability to evaluate outcomes. We discuss how reward modeling can be applied recursively: we can use reward modeling to train agents to assist the user in the evaluation process itself. If evaluation is easier than behavior, this could allow us to bootstrap from simpler tasks to increasingly general and more complex tasks.
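Roughly, the recursion described in that excerpt could be sketched like this (purely illustrative; all function names below are hypothetical placeholders, not from the paper or its code):

```python
# Illustrative sketch of recursive reward modeling: agents trained on simpler
# evaluation tasks assist the user in evaluating a harder task.
# All names are hypothetical placeholders.

def evaluation_subtasks(task):
    # Placeholder: decompose "evaluating outcomes on `task`" into simpler tasks.
    return [f"evaluate one aspect of {task}"]

def train_reward_model(task, evaluator):
    # Placeholder: fit a reward model from the evaluator's feedback on outcomes.
    return f"reward model for '{task}' judged by {evaluator}"

def train_agent(task, reward_model):
    # Placeholder: optimize an agent (e.g. with RL) against the reward model.
    return f"agent for '{task}' trained on ({reward_model})"

def recursive_reward_modeling(task, depth):
    """Train an agent for `task`; helper agents assist the user's evaluation."""
    if depth == 0:
        # Base case: the task is simple enough for the user to judge directly.
        evaluator = "the user"
    else:
        # Recursive case: train helpers on simpler evaluation tasks, then have
        # them assist the user in judging outcomes on the harder task.
        helpers = [recursive_reward_modeling(sub, depth - 1)
                   for sub in evaluation_subtasks(task)]
        evaluator = f"the user assisted by {len(helpers)} helper agent(s)"
    reward_model = train_reward_model(task, evaluator)
    return train_agent(task, reward_model)

print(recursive_reward_modeling("design a transit system", depth=2))
```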
Yes, and they cite iterated amplification in their paper as well, but I'm trying to figure out if they're proposing anything new, because the title here is "New safety research agenda: scalable agent alignment via reward modeling" but Paul's post that I linked to already proposed recursively applying reward modeling. Seems like either I'm missing something, or they didn't read that post?