TL;DR: In September 2024, OpenAI released o1, its first "reasoning model". This model exhibits remarkable test-time scaling laws, which fill in a missing piece of the Bitter Lesson and open up a new axis for scaling compute. Following Rush and Ritter (2024) and Brown (2024a, 2024b), I explore four hypotheses for how o1 works and discuss some implications for future scaling and recursive self-improvement.

The Bitter Lesson(s)

The Bitter Lesson is that "general methods that leverage computation are ultimately the most effective, and by a large margin." After a decade of scaling pretraining, it's easy to forget that this lesson is not just about learning; it's also about search.

OpenAI didn't forget. Their new "reasoning model" o1 has figured out how to scale search at inference time. o1 does not use explicit search algorithms. Instead, it is trained via RL to get better at implicit search via chain of thought (CoT). This was the simplest possible way to incorporate search into LLMs, and it worked.

The key consequence is that OpenAI has opened up a new frontier of the bitter lesson: test-time scaling (see figure above). The original scaling laws taught us how to exchange training-time compute for better predictions. These new test-time scaling laws teach us how to exchange inference-time compute for better decisions. 

This removes one of the last hard barriers to AGI.

[Figure: two scatter plots of o1 AIME pass@1 accuracy against train-time compute and test-time compute (both log scale); accuracy climbs with more compute on both axes.]
The original training-time scaling laws defined an era. Test-time scaling laws define the next.

What we know about o1

OpenAI has been characteristically quiet about the details, so we don't know exactly how o1 works. But they haven't been silent. 

What OpenAI has told us

Rush points out that we can reconstruct quite a lot from the announcement:

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.

This tells us three key things:

  • Chain of Thought (CoT): o1 performs implicit search within a single chain of thought, rather than running an explicit search algorithm at inference time.
  • Reinforcement Learning (RL): Instead of supervised training against fixed labels, o1 learns from variable rollouts with dynamically generated reward signals.
  • Data-Efficiency: The process requires relatively few (human-labeled) samples. This does not necessarily mean the process is either token- or compute-efficient.

More generally and perhaps most importantly, o1 is solidly within the existing LLM paradigm. We're starting with a pretrained base model (or possibly a fine-tuned chat model) and intervening in post-training. The innovation is primarily in the data and training process and not in the architecture. 

What OpenAI has shown us

We can also learn something about how o1 was trained from the capabilities it exhibits. Any proposed training procedure must be compatible with the following capabilities: 

  1. Error Correction: "[o1] learns to recognize and correct its mistakes."
  2. Factoring: "[o1] learns to break down tricky steps into simpler ones."
  3. Backtracking: "[o1] learns to try a different approach when the current one isn't working."

At the same time, we can reject any hypothesis that explicitly tries to program in these capabilities. Noam Brown (2024) emphasizes these capabilities are emergent:

We were strategizing about how to enable [o1] to do these things and it's just figuring [it] out on its own.

For all its secrecy, OpenAI has leaked enough bits to tightly constrain the space of possibilities.

Proto-o1: Chain of Thought

Before we get to the candidate explanations, let's examine o1's predecessors in test-time scaling. 

In-Context Learning

Early work on in-context learning already made it clear that test-time compute (in the form of additional example tokens) could translate to better performance. However, many-shot prompting is bottlenecked by expensive supervised data, which makes it infeasible as a general-purpose approach to scaling test-time compute.

[Figure: in-context power-law trends for many-shot jailbreaking and for benign tasks.]
Anthropic's in-context power laws for many-shot jailbreaking offer another form of test-time scaling laws (Anil et al. 2024).

Thinking Step-by-Step 

Simply asking GPT-3 to explain its reasoning "step-by-step" dramatically improves its performance (Kojima et al. 2023). This trick was so successful that frontier labs now explicitly select for "chain-of-thought" reasoning via system prompts, prompt distillation, or instruction finetuning. 

Unlike few-shot prompting, standard chain-of-thought techniques are not necessarily bottlenecked by human-labeled data. However, prompting alone makes it difficult to exercise precise control over how much test-time compute to spend. More importantly, chat models run into problems with hallucinations, loops, mode collapse, etc. when generating long rollouts. Classical chain of thought works but only up to a point. 

Majority Vote

The simplest way to scale test-time compute in a general and controlled manner is to sample multiple rollouts and take the majority answer. This is called "majority vote" or "self-consistency" or "consensus" and has been used to great effect, for example, in METR's recent REBench paper. Majority vote offers immediate benefits over the baseline strategy of ancestral sampling (i.e., sampling a single chain of thought). 
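
To make the procedure concrete, here's a minimal sketch in Python. The `sample_answer` callable is a stand-in for "draw one chain of thought at temperature > 0 and extract its final answer"; nothing here is specific to any particular API.

```python
from collections import Counter
from typing import Callable

def majority_vote(prompt: str,
                  sample_answer: Callable[[str], str],
                  n_samples: int = 16) -> str:
    """Self-consistency / consensus: draw n independent chains of thought,
    reduce each to its final answer, and return the most common answer."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```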

However, majority vote quickly runs into plateaus. To do better, we're going to have to come up with a better way of consolidating the information spread across multiple chains of thought. This brings us to the "reinforcement learning" part of o1. 

o1: Four Hypotheses

When OpenAI says o1 uses "reinforcement learning", they could mean a lot of things. Let's interpret this as meaning that the learning process involves actually sampling rollouts from the model and then using a verifier to either filter, evaluate, guide, or combine those rollouts. These four options correspond to our four basic hypotheses. 

A verifier is a function that returns the probability of an answer (i.e., a finished roll-out) being correct. In practice, the verifier is probably some kind of learned reward model (though it could be automated, like unit tests for code). Verifiers allow us to implement more sophisticated inference-time scaling strategies:

  • Rejection sampling / Best-of-N: Generate multiple solutions and use the verifier to filter for correct answers (see the sketch after this list).
  • Monte-Carlo rollouts: Running rejection sampling from an intermediate chain of thought lets you estimate the value of a partial rollout. You can use this to continue your chain of thought along only the top-ranked paths (as in beam search), then repeat this procedure to iteratively guide your sampling toward better outcomes.
  • Learning against a verifier: Rather than calling the verifier at test time, use it only during training to amortize the decision-making process into the model itself.
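
Here's a minimal sketch of the first two strategies. The `sample` and `verify` callables are hypothetical stand-ins for a rollout generator and a learned (or automated) verifier; the names and signatures are mine, not OpenAI's.

```python
from typing import Callable, List

# Assumed interfaces (illustrative only):
#   sample(text) -> one sampled continuation
#   verify(text) -> estimated probability that the final answer is correct
Sampler = Callable[[str], str]
Verifier = Callable[[str], float]

def best_of_n(prompt: str, sample: Sampler, verify: Verifier, n: int = 16) -> str:
    """Rejection sampling / best-of-N: keep the rollout the verifier scores highest."""
    rollouts: List[str] = [prompt + sample(prompt) for _ in range(n)]
    return max(rollouts, key=verify)

def mc_value(partial_cot: str, sample: Sampler, verify: Verifier, k: int = 8) -> float:
    """Monte-Carlo value of a partial chain of thought: complete it k times and
    average the verifier's scores over the completions."""
    return sum(verify(partial_cot + sample(partial_cot)) for _ in range(k)) / k
```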

The leading theory is that o1 falls under the last bullet — that OpenAI is keeping its aesthetic commitment to "intelligence as a single model" and is not using any verifiers in deployment. 

Like Rush, we'll elide all of the subtle difficulties involved in actually getting RL to work in practice (how we're batching episodes, whether we're on-policy or off-policy, whether we're using KL regularization, which particular learning algorithm we're using, etc.). These subtleties are important: in fact, this is where the difficulty lies. Still, the details won't be necessary for us to get a high-level understanding of what might be going on inside o1. 

1. Filter: Guess + Check

The simplest approach is to use the verifier as a filter: generate multiple reasoning attempts, check which ones succeed using the verifier, then train only on those successful examples (as a standard next-token prediction task). That is, combine rejection sampling with supervised fine-tuning. 
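
Here's a minimal sketch of the data-generation step, again with hypothetical `sample` and `verify` callables; the returned pairs would then go through ordinary supervised fine-tuning.

```python
from typing import Callable, List, Tuple

def build_sft_dataset(prompts: List[str],
                      sample: Callable[[str], str],
                      verify: Callable[[str, str], float],
                      n_per_prompt: int = 32,
                      threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Guess + Check: sample many rollouts per prompt, keep only those the
    verifier accepts, and return (prompt, rollout) pairs for standard
    next-token-prediction fine-tuning."""
    dataset: List[Tuple[str, str]] = []
    for prompt in prompts:
        for _ in range(n_per_prompt):
            rollout = sample(prompt)
            if verify(prompt, rollout) > threshold:
                dataset.append((prompt, rollout))
    return dataset
```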

On the plus side, this is simple and has ample literature to back it up (Yarowsky, 1995; Cobbe et al., 2021; Zelikman et al., 2022; Gulcehre et al., 2023; Singh et al., 2023; Nakano et al., 2021). On the negative side, this seems likely to be too computationally inefficient. Also, calling this "RL" is a bit of a stretch.

2. Evaluation: Process Rewards

A more sophisticated approach is to use a verifier to evaluate a reasoning trace. Instead of an outcome reward model (ORM) that assigns a value to complete rollouts, we train a process reward model (PRM) that assigns a value to partial rollouts, and then we train our reasoning model against these intermediate rewards (using, for example, PPO, see Wang et al. 2024). 
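
As an illustration of the difference, here's how a PRM might turn a single rollout into per-step rewards. The step boundaries (newlines) and the `prm` signature are assumptions for the sketch; the resulting reward vector would then feed a policy-gradient update such as PPO.

```python
from typing import Callable, List

def stepwise_rewards(prompt: str,
                     rollout: str,
                     prm: Callable[[str, str], float]) -> List[float]:
    """Score a rollout step by step with a process reward model.
    A 'step' here is just a newline-delimited chunk of the chain of thought.
    Unlike an ORM, which would return one score for the finished rollout,
    the PRM assigns a value to every prefix."""
    rewards: List[float] = []
    prefix = ""
    for step in rollout.split("\n"):
        prefix += step + "\n"
        rewards.append(prm(prompt, prefix))
    return rewards
```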

Alternatively, you can use PRMs just for filtering, since prior work shows that PRMs outperform ORMs at rejection sampling (Lightman et al., 2023). This leads to a hybrid approach in between "Guess and Check" and "Process Rewards."

There are many ways to implement a PRM, but the obvious one is to use an LLM (as a "generative verifier"). Then, the verifier can actually use chain of thought itself. You may even be able to use the same LLM for both generation and verification, alternating between generation and verification within a single token stream. Such a hybrid approach might explain instances in which the model appears to self-evaluate or self-correct: when the model asks itself "is this a good explanation?", is it the generator or verifier? 
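
Purely as speculation, a single-stream version of that hybrid might look something like this; the stop convention and both callables are invented for illustration, not a description of o1.

```python
from typing import Callable

def generate_with_self_check(prompt: str,
                             generate_step: Callable[[str], str],
                             critique: Callable[[str], str],
                             max_steps: int = 20) -> str:
    """Generator and verifier interleaved in one token stream: after each
    reasoning step, the same model critiques the transcript so far, and the
    critique is appended to the context before the next step is generated."""
    transcript = prompt
    for _ in range(max_steps):
        step = generate_step(transcript)
        transcript += "\n" + step
        if "FINAL ANSWER:" in step:  # assumed stop convention
            break
        transcript += "\nCheck: " + critique(transcript)
    return transcript
```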

Rush believes that something involving process rewards is the most likely answer. There's evidence for process rewards improving performance (ibid.), but no public examples yet combining generation and verification into a single chain of thought. These approaches are more complex than "Guess and Check" but still simpler than the other options. 

3. Guidance: Search / AlphaZero

Intermediate feedback can also be used to guide the sampling procedure itself. The guide signal can come from either a model (such as the process reward models of the previous section) or directly from MC rollouts. Self-play enables the generator and guide to iteratively improve together. This distills the search process into the model itself ("amortization"). 

One variant is to use beam search to generate a number of candidate continuations, then use the guide to filter out only the most promising continuations, continue with those, and repeat. 
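
Here's a sketch of that guided beam search, with hypothetical `extend`, `guide`, and `is_done` callables standing in for the step sampler, the guide signal (a PRM or MC value estimate), and a termination check.

```python
from typing import Callable, List

def guided_beam_search(prompt: str,
                       extend: Callable[[str], str],    # sample one more step
                       guide: Callable[[str], float],   # score a partial rollout
                       is_done: Callable[[str], bool],
                       beam_width: int = 4,
                       branch: int = 4,
                       max_depth: int = 20) -> str:
    """Branch each partial rollout, keep only the top-scoring beams, and
    repeat until a finished rollout appears."""
    beams: List[str] = [prompt]
    for _ in range(max_depth):
        candidates = [b + "\n" + extend(b) for b in beams for _ in range(branch)]
        candidates.sort(key=guide, reverse=True)
        beams = candidates[:beam_width]
        finished = [b for b in beams if is_done(b)]
        if finished:
            return finished[0]
    return beams[0]
```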

A more famous (and complex) variant is Monte-Carlo Tree Search (MCTS). Like beam search, we generate a number of possible continuations, then sample one of those continuations at random, and repeat this iteratively until we reach an end state. Then, we propagate the value of that end state up to the parent nodes, sample a new node, and repeat. This has the benefit of not rushing straight toward the end of the tree, allowing the model to explore a wider fraction of it.
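
For completeness, here's a stripped-down MCTS over chain-of-thought steps (UCT selection, expansion, a verifier-scored rollout for evaluation, and value backup). Everything here is illustrative, not a claim about o1's internals.

```python
import math
import random
from typing import Callable, List, Optional

class Node:
    """A node holding one partial chain of thought."""
    def __init__(self, text: str, parent: Optional["Node"] = None):
        self.text = text
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value_sum = 0.0

    def uct(self, c: float = 1.4) -> float:
        if self.visits == 0:          # explore unvisited children first
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(prompt: str,
         propose_steps: Callable[[str], List[str]],  # candidate next steps
         rollout_value: Callable[[str], float],      # finish the CoT and score it
         is_terminal: Callable[[str], bool],
         n_iterations: int = 200) -> str:
    root = Node(prompt)
    for _ in range(n_iterations):
        # 1. Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2. Expansion: add candidate continuations unless the CoT is finished.
        if not is_terminal(node.text):
            node.children = [Node(node.text + "\n" + step, parent=node)
                             for step in propose_steps(node.text)]
            if node.children:
                node = random.choice(node.children)
        # 3. Evaluation: estimate the leaf's value with a scored rollout.
        value = rollout_value(node.text)
        # 4. Backup: propagate the value to every ancestor.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # Commit to the most-visited first step.
    best = max(root.children, key=lambda n: n.visits) if root.children else root
    return best.text
```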

Obviously these AlphaZero-inspired methods are the most exciting (and frightening) option. Explanations like MCTS might also have an edge in explaining some of the observed behaviors like backtracking. On the other hand, these approaches are very complex, compute-intensive, and haven't seen much success yet in the open research community. 

4. Combination: Learning to Correct

An alternative approach is to combine multiple chains of thought in clever ways and train against the resulting composite chain of thought. Here's one variant conjectured by Gwern:

[T]ake a wrong monologue, and at a random point, insert the string "wait, that's wrong. What if..." and then inject some wrong ones, and then eventually, a correct one. Now you have a correct-by-construction inner-monologue where it "makes mistakes" and then "corrects itself" and eventually succeeds and "answers the question correctly". This can be trained on normally.
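
Rendered literally as a data-augmentation function (the correction string and splice points are exactly as hypothetical as in the quote), the construction is just string surgery:

```python
import random
from typing import List

def splice_monologue(wrong_attempts: List[str], correct_attempt: str) -> str:
    """Correct-by-construction monologue: cut each wrong attempt at a random
    point, insert a correction marker, and end with a correct attempt. The
    result can be trained on with ordinary next-token prediction."""
    pieces: List[str] = []
    for attempt in wrong_attempts:
        cut = random.randint(1, max(1, len(attempt) - 1))
        pieces.append(attempt[:cut])
        pieces.append("\nWait, that's wrong. What if...\n")
    pieces.append(correct_attempt)
    return "".join(pieces)
```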

Personally, I find this hypothesis unlikely, since it directly contradicts the report that error correction and backtracking are emergent rather than explicitly selected for. That said, I do expect "in-context curriculum design" to be an important direction of future research. 

Whatever the actual mechanism, there are only a few raw ingredients (chain of thought, verifiers, and learning algorithms) and only so many ways to combine them. The open-source community will catch up. DeepSeek and QwQ suggest they may already have. We will soon have a better idea which of these approaches actually work and which do not. 

Post-o1: (Recursive) Self-Improvement

When OpenAI says o1 is "data-efficient", it can mean a lot of things, depending on whether we're denominating "data" in terms of token count or sample/prompt count, and whether or not we're including synthetically generated data in these counts. 

The more boring interpretation is that OpenAI means the per-token improvement in loss is better than during pretraining. This is boring because pretraining is just a very low bar to clear. The more interesting interpretation is that o1 is efficient in terms of human-labeled samples. This would reflect a longstanding trend away from human labels towards increasingly self-guided training procedures:

  • AlphaGo was trained on expert games. AlphaGo Zero eliminated human game data in favor of pure self-play, required significantly more compute, and achieved much better performance while discovering qualitatively different strategies than human experts.
  • RLHF involves expensive human preference data. RLAIF and Constitutional AI replace the human with AIs and achieve better results.
  • Just last year, training a PRM would have involved supervised learning on expensive human annotations (Uesato et al., 2022; Lightman et al., 2023). Now, they're probably bootstrapped from an ORM using, for example, MC rollouts (Wang et al. 2024).
  • Supervised fine-tuning on expert-annotated chain of thought doesn't work as well as whatever it is that o1 is doing. "[I]f you train the model using RL to generate and hone its own chain of thoughts it can do even better than having humans write chains of thought for it." (OpenAI 2024)

The bitter lesson strikes again: o1 is part of a continual trend in which increasingly cheap compute displaces stubbornly expensive human input.

This is what recursive self-improvement really looks like. So far, recursive self-improvement in practice has looked less like the model tinkering with its own architecture or solving miscellaneous engineering problems, and more like the model generating and curating its own training data or guiding its own training processes. This appears to be just getting started. 

Outlook

Recently, there have been rumors of "scaling breaking down". I'm skeptical. But even if pretraining is running into a wall, o1 tells us it doesn't immediately matter. Test-time scaling opens up an entirely new way to spend compute, and, on this front, it's still GPT-2 days (OpenAI 2024).

How much could we scale up test-time compute? Brown (2024) offers a heuristic argument: there are some problems we would be willing to spend millions of dollars to (attempt to) solve. A typical LLM query costs on the order of a penny. Going from $0.01 per query to $1,000,000 per problem is an easy eight orders of magnitude.

Even in the longer term, "scaling breaking down" might not matter because of how o1's capabilities could feed back into pretraining. One AI's inference time is a future AI's training time. We're already seeing this with OpenAI's next flagship model: according to The Information (2024), one of o1's key applications is generating high-quality training data for "Orion," OpenAI's next large language model in development.

Maybe the final form of the Bitter Lesson is a tight feedback loop between learning and search: use search to generate high-quality reasoning traces, distill those traces into more condensed token streams, and train against the result to amortize the reasoning into the base model. Maybe, past a certain critical threshold of capability, classic problems with mode collapse, catastrophic forgetting, etc. stop being an issue.
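
In pseudocode, that loop might look something like the sketch below. Every callable is a placeholder, and this is speculation about a possible training recipe, not a description of any lab's pipeline.

```python
from typing import Callable, List

def search_distill_train(prompts: List[str],
                         search: Callable[[str], str],    # expensive guided search
                         condense: Callable[[str], str],  # compress the trace
                         train_on: Callable[[List[str]], None],
                         n_rounds: int = 3) -> None:
    """Speculative learning/search feedback loop: run search to get high-quality
    reasoning traces, condense them, and train on the result so that the next
    round of search starts from a stronger model."""
    for _ in range(n_rounds):
        traces = [condense(search(p)) for p in prompts]
        train_on(traces)
```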

Maybe we're already past this point of sustained self-improvement. The clock is ticking.

Comments (7)
Kei

In practice, the verifier is probably some kind of learned reward model (though it could be automated, like unit tests for code).


My guess is that a substantial amount of the verification (perhaps the majority?) was automated by training the model on domains where we have ground truth reward signals, like code, math, and standardized test questions. This would match the observed results in the o1 blog post showing that performance improved by a lot in domains that have ground truth or something close to ground truth, while performance was stagnant on things like creative writing which are more subjective. Nathan Lambert, the head of post-training at AI2, also found that doing continued RL training on ground truth rewards (which he calls RLVR) results in models that learn to say o1-like things like 'wait, let me check my work' in their chain of thought.

It's worth noting that there are also hybrid approaches, for example, where you use automated verifiers (or a combination of automated verifiers and supervised labels) to train a process reward model that you then train your reasoning model against. 

See also this related shortform in which I speculate about the relationship between o1 and AIXI: 

Agency = Prediction + Decision.

AIXI is an idealized model of a superintelligent agent that combines "perfect" prediction (Solomonoff Induction) with "perfect" decision-making (sequential decision theory).

OpenAI's o1 is a real-world "reasoning model" that combines a superhuman predictor (an LLM like GPT-4) with advanced decision-making (implicit search via chain of thought trained by RL).

[Continued]

We can also learn something about how o1 was trained from the capabilities it exhibits. Any proposed training procedure must be compatible with the following capabilities: 

  1. Error Correction: "[o1] learns to recognize and correct its mistakes."
  2. Factoring: "[o1] learns to break down tricky steps into simpler ones."
  3. Backtracking: "[o1] learns to try a different approach when the current one isn't working."

I would be cautious of drawing particularly strong conclusions from isolated sentences in an announcement post. The purpose of the post is marketing, not technical accuracy. It wouldn't be unusual for engineers at a company to object to technical inaccuracies in marketing material and have their complaints ignored.

There probably aren't going to be any blatant lies in the post, but something like "It'd sound cool if we said that the system learns to recognize and correct its mistakes, would there be a way of interpreting the results like that if you squinted the right way? You're saying that in principle yes, but yes in a way that would also apply to every LLM since GPT-2? Good enough, let's throw that in" seems very plausible.

The examples they provide in one of the announcement blog posts (under the "Chain of Thought" section) suggest this is more than just marketing hype (even if these examples are cherry-picked):

Here are some excerpts from two of the eight examples:

Cipher:

Hmm.

But actually in the problem it says the example:

...

Option 2: Try mapping as per an assigned code: perhaps columns of letters?

Alternatively, perhaps the cipher is more complex.

Alternatively, notice that "oyfjdnisdr" has 10 letters and "Think" has 5 letters.

...

Alternatively, perhaps subtract: 25 -15 = 10.

No.

Alternatively, perhaps combine the numbers in some way.

Alternatively, think about their positions in the alphabet.

Alternatively, perhaps the letters are encrypted via a code.

Alternatively, perhaps if we overlay the word 'Think' over the cipher pairs 'oy', 'fj', etc., the cipher is formed by substituting each plaintext letter with two letters.

Alternatively, perhaps consider the 'original' letters.

Science: 

Wait, perhaps more accurate to find Kb for F^− and compare it to Ka for NH4+.
...
But maybe not necessary.
...
Wait, but in our case, the weak acid and weak base have the same concentration, because NH4F dissociates into equal amounts of NH4^+ and F^-
...
Wait, the correct formula is:

we'll elide all of the subtle difficulties involved in actually getting RL to work in practice

I haven't properly internalized the rest of the post, but this confuses me because I thought this post was about the subtle difficulties.

The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings < n tokens, P(s'|s,a)=append(s,a) and reward is given to states with a stop token based on some ground truth verifier like unit tests or formal verification.

The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings < n tokens, P(s'|s,a)=append(s,a) and reward is given to states with a stop token based on some ground truth verifier like unit tests or formal verification.

I agree that this is the most straightforward interpretation, but OpenAI have made no commitment to sticking to honest and straightforward interpretations. So I don't think the RL setup is actually that straightforward. 

If you want more technical detail, I recommend watching the Rush & Ritter talk (see also slides and bibliography). This post was meant as a high-level overview of the different compatible interpretations with some pointers to further reading/watching.