All of Jesse Hoogland's Comments + Replies

Neural networks generalize because of this one weird trick

Jesse Hoogland18d20

To be precise, it is a property of singular models (which includes neural networks) in the Bayesian setting. There are good empirical reasons to expect the same to be true for neural networks trained with SGD (across a wide range of different models, we observe the LLC progressively increase from ~0 over the course of training).

Jesse Hoogland18d10

The key distinction is that VC theory takes a global, worst-case approach — it tries to bound generalization uniformly across an entire model class. This made sense historically but breaks down for modern neural networks, which are so expressive that the worst-case is always very bad and doesn't get you anywhere.

The statistical learning theory community woke up to this fact (somewhat) with the Zhang et al. paper, which showed that deep neural networks can achieve perfect training loss on randomly labeled data (even with regularization). The same... (read more)

1Vinayak Pathak18d

Thanks, this clarifies many things! Thanks also for linking to your very comprehensive post on generalization. To be clear, I didn't mean to claim that VC theory explains NN generalization. It is indeed famously bad at explaining modern ML. But "models have singularities and thus number of parameters is not a good complexity measure" is not a valid criticism of VC theory. If SLT indeed helps figure out the mysteries from the "understanding deep learning..." paper then that will be amazing! Ah, I didn't realize earlier that this was the goal. Are there any theorems that use SLT to quantify out-of-distribution generalization? The SLT papers I have read so far seem to still be talking about in-distribution generalization, with the added comment that Bayesian learning/SGD is more likely to give us "simpler" models and simpler models generalize better.

Jesse Hoogland1mo22

Yes, my original comment wasn't clear about this, but your nitpick is actually a key part of what I'm trying to get at.

Usually, you start with imitation learning and tack on RL at the end. That's what AlphaGo is. It's what predecessors to Dreamer-V3 like VPT are. It's what current reasoning models are.

But then, eventually, you figure out how to bypass the imitation learning/behavioral cloning part and do RL from the start. Human priors serve as a temporary bootstrapping mechanism until we develop approaches that can learn effectively from scratch.

Jesse Hoogland1mo1617

I think this is important because the safety community still isn't thinking very much about search & RL, even after all the recent progress with reasoning models. We've updated very far away from AlphaZero as a reference class, and I think we will regret this.

On the other hand, the ideas I'm talking about here seem to have widespread recognition among people working on capabilities. Demis is very transparent about where they're headed with language models, AlphaZero, and open-ended exploration (e.g., at 20:48). Noam Brown is adamant about test-time sca... (read more)

Jesse Hoogland1mo2514

What do you call this phenomenon?

First, you train AlphaGo on expert human examples. This is enough to beat Lee Sedol and Ke Jie. Then, you train AlphaZero purely through self-play. It destroys AlphaGo after only a few hours.
First, you train RL agents on human playthroughs of Minecraft. They do okay. Then, DreamerV3 learns entirely by itself and becomes the first to get diamonds.
First, you train theorem provers on human proofs. Then, you train AlphaProof using AlphaZero and you get silver on IMO for the first time.
First, you pretrain a language model on all

... (read more)

2Michaël Trazzi1mo

Nitpick: first alphago was trained by a combination of supervised learning from human expert games and reinforcement learning from self-play. Also, Ke Jie was beaten by AlphaGo Master which was a version at a later stage of development.

Jesse Hoogland1mo1617

Jesse Hoogland3mo10

You might enjoy this new blogpost from HuggingFace, which goes into more detail.

Jesse Hoogland3mo*268

Phi-4: Synthetic data works. Pretraining's days are numbered.

Microsoft just announced Phi-4, a 14B parameter model that matches GPT-4o on some difficult benchmarks. The accompanying technical report offers a glimpse into the growing importance of synthetic data and how frontier model training is changing.

Some takeaways:

The data wall is looking flimsier by the day. Phi-4 is highly capable not despite but because of synthetic data. It was trained on a curriculum of 50 types of synthetic datasets, generated by GPT-4o from a diverse set of or

... (read more)

Vladimir Nesov3mo*99

I don't think Phi-4 offers convincing evidence either way. You can push performance on verifiable tasks quite far without the model becoming generally more capable. AlphaZero doesn't imply that scaling with its methods gestures at general superintelligence, and similarly with Phi-4.

In contrast, using o1-like training as a way to better access ground truth in less tractable domains seems more promising, since by some accounts its tactics on long reasoning traces work even in non-technical domains (unlike for DeepSeek R1), possibly because they are emergent ... (read more)

Jesse Hoogland3mo20

The RL setup itself is straightforward, right? An MDP where S is the space of strings, A is the set of strings < n tokens, P(s'|s,a)=append(s,a) and reward is given to states with a stop token based on some ground truth verifier like unit tests or formal verification.

I agree that this is the most straightforward interpretation, but OpenAI have made no commitment to sticking to honest and straightforward interpretations. So I don't think the RL setup is actually that straightforward.

If you want more technical detail, I recommend watching the Rush &... (read more)

Jesse Hoogland3mo21

The examples they provide one of the announcement blog posts (under the "Chain of Thought" section) suggest this is more than just marketing hype (even if these examples are cherry-picked):

Here are some excerpts from two of the eight examples:

Cipher:

Hmm.

But actually in the problem it says the example:
...
Option 2: Try mapping as per an assigned code: perhaps columns of letters?

Alternatively, perhaps the cipher is more complex.

Alternatively, notice that "oyfjdnisdr" has 10 letters and "Think" has 5 letters.
...
Alternatively, perhaps subtract: 25 -15 = 10.

No.

A

... (read more)

Jesse Hoogland3mo10

It's worth noting that there are also hybrid approaches, for example, where you use automated verifiers (or a combination of automated verifiers and supervised labels) to train a process reward model that you then train your reasoning model against.

Jesse Hoogland3mo30

See also this related shortform in which I speculate about the relationship between o1 and AIXI:

Agency = Prediction + Decision.
AIXI is an idealized model of a superintelligent agent that combines "perfect" prediction (Solomonoff Induction) with "perfect" decision-making (sequential decision theory).
OpenAI's o1 is a real-world "reasoning model" that combines a superhuman predictor (an LLM like GPT-4) with advanced decision-making (implicit search via chain of thought trained by RL).

[Continued]

Many arguments for AI x-risk are wrong

Jesse Hoogland3mo*1810

Agency = Prediction + Decision.

AIXI is an idealized model of a superintelligent agent that combines "perfect" prediction (Solomonoff Induction) with "perfect" decision-making (sequential decision theory).

OpenAI's o1 is a real-world "reasoning model" that combines a superhuman predictor (an LLM like GPT-4) with advanced decision-making (implicit search via chain of thought trained by RL).

To be clear: o1 is no AIXI. But AIXI, as an ideal, can teach us something about the future of o1-like systems.

AIXI teaches us that agency is simple. It involves just two ra... (read more)

3james.lucassen3mo

I think the distinction between systems that perform a single forward pass and then stop and systems that have an OODA loop (tool use) is more stark than the difference between "reasoning" and "chat" models, and I'd prefer to use "agent" for that distinction. I do think that "reasoning" is a bit of a market-y name for this category of system though. "chat" vs "base" is a great choice of words, and "chat" is basically just a description of the RL objective those models were trained with. If I were the terminology czar, I'd call o1 a "task" model or a "goal" model or something.

Many arguments for AI x-risk are wrong

Right. SLT tells us how to operationalize and measure (via the LLC) basin volume in general for DL. It tells us about the relation between the LLC and meaningful inductive biases in the particular setting described in this post. I expect future SLT to give us meaningful predictions about inductive biases in DL in particular.

Many arguments for AI x-risk are wrong

The post is live here.

3Alex Turner1y

Cool post, and I am excited about (what I've heard of) SLT for this reason -- but it seems that that post doesn't directly address the volume question for deep learning in particular? (And perhaps you didn't mean to imply that the post would address that question.)

Jesse Hoogland1y43

If we actually had the precision and maturity of understanding to predict this "volume" question, we'd probably (but not definitely) be able to make fundamental contributions to DL generalization theory + inductive bias research.

Obligatory singular learning theory plug: SLT can and does make predictions about the "volume" question. There will be a post soon by @Daniel Murfet that provides a clear example of this.

1Jesse Hoogland1y

The post is live here.

Generalization, from thermodynamics to statistical physics

Jesse Hoogland1y80

Our work on the induction bump is now out. We find several additional "hidden" transitions, including one that splits the induction bump in two: a first part where previous-token heads start forming, and a second part where the rest of the induction circuit finishes forming.

The first substage is a type-B transition (loss changing only slightly, complexity decreasing). The second substage is a more typical type-A transition (loss decreasing, complexity increasing). We're still unclear about how to understand this type-B transition structurally. How is the model simplifying? E.g., is there some link between attention heads composing and the basin broadening?

My Criticism of Singular Learning Theory

As a historical note / broader context, the worry about model class over-expressivity has been there in the early days of Machine Learning. There was a mistrust of large blackbox models like random forest and SVM and their unusually low test or even cross-validation loss, citing ability of the models to fit noise. Breiman frank commentary back in 2001, "Statistical Modelling: The Two Cultures", touch on this among other worries about ML models. The success of ML has turn this worry into the generalisation puzzle. Zhang et. al. 2017 being a call to arms wh

Jesse Hoogland1y20

I think this mostly has to do with the fact that learning theory grew up in/next to computer science where the focus is usually worst-case performance (esp. in algorithmic complexity theory). This naturally led to the mindset of uniform bounds. That and there's a bit of historical contingency: people started doing it this way, and early approaches have a habit of sticking.

Jesse Hoogland1y20

This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.

Agreed. So maybe what I'm actually trying to get at it is a statement about what "universality" means in the context of neural networks. Just as the microscopic details of physical theories don't matter much to their macroscopic properties in the vicinity of critical points ("universality" in statistical physics), just as the microscopic details of rand... (read more)

1Joar Skalse1y

I suppose this depends on what you mean by "most". DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it's fairly likely that these kinds of differences could have a large impact on the probability that a given type model will learn to exhibit goal-directed behaviour in a given training setup, for example. Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a "small" amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I'm not sure).

My Criticism of Singular Learning Theory

Jesse Hoogland1y22

The easiest way to explain why this is the case will probably be to provide an example. Suppose we have a Bayesian learning machine with 15 parameters, whose parameter-function map is given by

$f (x) = θ_{1} + θ_{2} θ_{3} x + θ_{4} θ_{5} θ_{6} x^{2} + θ_{7} θ_{8} θ_{9} θ_{10} x^{3} + θ_{11} θ_{12} θ_{13} θ_{14} θ_{15} x^{4},$
and whose loss function is the KL divergence. This learning machine will learn 4-degree polynomials.

I'm not sure, but I think this example is pathological. One possible reason for this to be the case is that the symmetries in this model are entirely "generic" or "global." The more interesting kinds of symmetry are "... (read more)

2Joar Skalse1y

Yes, it's artificial and cherry-picked to make a certain rhetorical point as simply as possible. This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true. Yes, I have been using "Kolmogorov complexity" in a somewhat loose way here. Is this not satisfied trivially due to the fact that the RLCT has a certain maximum and minimum value within each model class? (If we stick to the assumption that Θ is compact, etc.)

I think there's some chance of models executing treacherous turns in response to a particular input, and I'd rather not trigger those if the model hasn't been sufficiently sandboxed.

Jesse Hoogland1y42

One would really want to know if the complexity measure can predict 'emergence' of capabilities like inner-monologue, particularly if you can spot previously-unknown capabilities emerging which may not be covered in any of your existing benchmarks.

That's our hope as well. Early ongoing work on toy transformers trained to perform linear regression seems to bear out that lambdahat can reveal transitions where the loss can't.

But this type of 'emergence' tends to happen with such expensive models that the available checkpoints are too separated to be inf

... (read more)