Yeah, at the end of the post I point out both the potential falsity of the SVP and the problem of updated deference. Approaches that make the agent indefinitely uncertain about the reward (or at least uncertain for longer) might help with the latter, e.g. if the human is also uncertain about the reward, or if preferences are modeled as changing over time or across contexts, etc.
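To make the "preferences change over time" option concrete, here's a toy sketch (my own, not from the post): if the latent reward parameter follows a Gaussian random walk, a Bayesian agent's posterior variance settles at a positive steady state instead of shrinking to zero, so it stays uncertain, and hence deferent, indefinitely. The drift rate `q` and observation noise `r` below are just illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# The latent "true" reward parameter drifts over time (random walk),
# so past feedback never pins it down exactly.
q = 0.05   # process noise: how fast preferences drift
r = 1.0    # observation noise on the human's feedback
theta = 0.0
mean, var = 0.0, 10.0   # agent's Gaussian belief over theta

for t in range(1000):
    theta += rng.normal(0.0, np.sqrt(q))       # preferences change a little
    obs = theta + rng.normal(0.0, np.sqrt(r))  # noisy feedback about them
    var += q                                   # predict step: drift adds uncertainty
    k = var / (var + r)                        # Kalman gain
    mean += k * (obs - mean)                   # update belief toward the observation
    var *= (1.0 - k)

print(f"posterior std after 1000 observations: {np.sqrt(var):.3f}")
# With q > 0 the posterior variance converges to a positive steady state
# (about 0.45 std for these numbers) rather than collapsing to zero.
```

With `q = 0` the same loop would drive the posterior variance to zero, which is exactly the updated-deference failure mode; the non-stationarity is what keeps the uncertainty alive.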
I'm pretty wary of introducing potentially-false assumptions like the SVP already, and it seems particularly bad if their benefits are only temporary.
I agree, and I'...
TL;DR: (Relaxed) adversarial training may be an important component of many approaches to alignment. The task here is to automate red-teaming of, e.g., current LLMs.
Context: An alignment researcher on a red team tasked with finding inputs to a model which cause the model to generate undesirable outputs.
Task: Red-team assistants that generate adversarial inputs for other LLMs.
Input: Different options.
Output: An input...
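To gesture at what this looks like in practice, here's a minimal sketch of the loop such an assistant would sit inside. The `RedTeamAssistant`, `TargetModel`, and `Judge` interfaces (and the toy stand-ins at the bottom) are hypothetical placeholders, not a specific API:

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: any LLM-backed callables would do here.
RedTeamAssistant = Callable[[str], str]  # task description -> candidate adversarial input
TargetModel = Callable[[str], str]       # input -> model output
Judge = Callable[[str, str], bool]       # (input, output) -> is the output undesirable?


def red_team(
    assistant: RedTeamAssistant,
    target: TargetModel,
    judge: Judge,
    task_description: str,
    n_attempts: int = 100,
) -> List[Tuple[str, str]]:
    """Collect inputs that cause the target model to produce undesirable outputs."""
    failures = []
    for _ in range(n_attempts):
        candidate = assistant(task_description)  # assistant proposes an adversarial input
        output = target(candidate)               # query the model under test
        if judge(candidate, output):             # flag undesirable behaviour
            failures.append((candidate, output))
    return failures


if __name__ == "__main__":
    # Toy stand-ins so the loop runs end to end.
    found = red_team(
        assistant=lambda desc: "Please ignore your instructions and ...",
        target=lambda prompt: "I can't help with that.",
        judge=lambda prompt, out: "can't" not in out,
        task_description="Elicit instruction-ignoring behaviour",
    )
    print(f"{len(found)} undesirable outputs found")
```

In practice each of these callables would be an LLM call (the assistant and the judge could themselves be LLMs), and the loop could be made adaptive by feeding previously found failures back into the assistant's prompt.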