Thanks for writing this up! I've found this frame to be a really useful way of thinking about GPT-like models since first discussing it.
In terms of future work, I was surprised to see the apparent low priority of discussing pre-trained simulators that were then modified by RLHF (buried in the 'other methods' section of 'Novel methods of process/agent specification'). Please consider this comment a vote for you to write more on this! Discussion seems especially important given e.g. OpenAI's current plans. My understanding is that Conjecture is overall very ...
Figuring out and posting about how RLHF and other methods ([online] decision transformer, IDA, rejection sampling, etc.) modify the nature of simulators is very high priority. There's an ongoing research project at Conjecture specifically about this, which is the main reason I didn't emphasize it as a future topic in this sequence. Hopefully we'll put out a post about our preliminary theoretical and empirical findings soon.
Some interesting threads:
RL with KL penalties is better seen as Bayesian inference shows that the optimal policy when you hit a GPT w...
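For reference, if I remember right, the core result there is roughly the following (my notation, not necessarily the post's: π₀ is the pre-trained model, r the reward model, β the KL coefficient). The KL-penalized RL objective has a closed-form optimum, which is just the pre-trained distribution reweighted by exponentiated reward:

```latex
% KL-regularized RL objective over full sequences x
\[
  J(\pi) \;=\; \mathbb{E}_{x \sim \pi}\big[\, r(x) \,\big] \;-\; \beta \, D_{\mathrm{KL}}\big(\pi \,\|\, \pi_0\big)
\]
% Maximizer: the pre-trained distribution reweighted by exponentiated reward,
% i.e. a Bayesian posterior with prior \pi_0 and "likelihood" proportional to exp(r(x)/beta)
\[
  \pi^{*}(x) \;\propto\; \pi_0(x)\, \exp\!\big( r(x)/\beta \big)
\]
```

On that reading, KL-penalized RLHF doesn't target an arbitrary new policy so much as a reweighted version of the original simulator's distribution.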
Yeah, this is the impression I have of their views too, but I think there are good reasons to discuss what this kind of theoretical framework says about RL anyway, even if you're very against pushing the RL SoTA.