Dr_Manhattan

New safety research agenda: scalable agent alignment via reward modeling

Dr_Manhattan · 6y

They mention and link to iterated amplification in the Medium article.

Scaling up

In the long run, we would like to scale reward modeling to domains that are too complex for humans to evaluate directly. To do this, we need to boost the user’s ability to evaluate outcomes. We discuss how reward modeling can be applied recursively: we can use reward modeling to train agents to assist the user in the evaluation process itself. If evaluation is easier than behavior, this could allow us to bootstrap from simpler tasks to increasingly general and more complex tasks. This can be thought of as an instance of iterated amplification.
