Alex Lawsen — AI Alignment Forum

Thanks for writing this up! I've found this frame to be a really useful way of thinking about GPT-like models since first discussing it.

In terms of future work, I was surprised to see the apparent low priority of discussing pre-trained simulators that were then modified by RLHF (buried in the 'other methods' section of 'Novel methods of process/agent specification'). Please consider this comment a vote for you to write more on this! Discussion seems especially important given e.g. OpenAI's current plans. My understanding is that Conjecture is overall very negative on RLHF, but that makes it seem more useful to discuss how to model the results of the approach, not less, to the extent that you expect this framing to help shed light on what might go wrong.

It feels like there are a few different ways you could sketch out how you might expect this kind of training to go. Quick, clearly non-exhaustive thoughts below:

Something that seems relatively benign/unexciting - fine-tuning increases the likelihood that particular simulacra are instantiated for a variety of different prompts, but doesn't really change which simulacra are accessible to the simulator.
More worrying things - particular simulacra becoming more capable/agentic, simulacra unifying/trading, the simulator framing breaking down in some way.
Things which could go either way and seem very high stakes - the first example that comes to mind is fine-tuning causing an explicit representation of the reward signal to appear in the simulator, meaning that both corrigibly aligned and deceptively aligned simulacra are possible, and working out how to instantiate only the former becomes kind of the whole game.
