"we'll elide all of the subtle difficulties involved in actually getting RL to work in practice"
I haven't properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.
The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings of fewer than n tokens, the transition is deterministic with P(s'|s,a) = 1 iff s' = append(s,a), and reward is given to states containing a stop token based on some ground-truth verifier like unit tests or formal verification.
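For concreteness, here is a minimal sketch of that MDP in Python. The class, the `verifier` callable, the stop token, and the whitespace tokenization are all illustrative assumptions on my part, not anyone's actual training setup:

```python
from dataclasses import dataclass
from typing import Callable, Tuple

STOP_TOKEN = "<|stop|>"  # placeholder stop token (assumption)


@dataclass
class StringMDP:
    verifier: Callable[[str], float]  # ground-truth reward, e.g. runs unit tests
    max_action_tokens: int            # n: actions are strings of fewer than n tokens

    def step(self, state: str, action: str) -> Tuple[str, float, bool]:
        """Deterministic transition: append the action string to the state."""
        # Whitespace split is a crude stand-in for a real tokenizer.
        assert len(action.split()) < self.max_action_tokens
        next_state = state + action
        done = STOP_TOKEN in action
        # Reward is only given at terminal states, based on the verifier.
        reward = self.verifier(next_state) if done else 0.0
        return next_state, reward, done


# Toy example: reward 1.0 iff the completed string contains "42".
mdp = StringMDP(verifier=lambda s: float("42" in s), max_action_tokens=16)
state, reward, done = mdp.step("The answer is ", "42 " + STOP_TOKEN)
```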
You might enjoy this new blogpost from HuggingFace, which goes into more detail.
Jesse Hoogland
I agree that this is the most straightforward interpretation, but OpenAI have made no commitment to sticking to honest and straightforward interpretations. So I don't think the RL setup is actually that straightforward.
If you want more technical detail, I recommend watching the Rush & Ritter talk (see also slides and bibliography). This post was meant as a high-level overview of the different compatible interpretations, with some pointers to further reading/watching.