Neural networks generalize unexpectedly well. Jesse argues this is because of singularities in the loss surface which reduce the effective number of parameters. These singularities arise from symmetries in the network. More complex singularities lead to simpler functions which generalize better. This is the core insight of singular learning theory.
This is an idea I came up with and presented at the Agent Foundations 2025 conference at CMU.
Here is a nice, simple formalism for decision theory that in particular supports the decision theory coming out of infra-Bayesianism. I now call the latter decision theory "Disambiguative Decision Theory", since the counterfactuals work by "disambiguating" the agent's belief.
Let
If you try to get reward-seekers to cooperate by pooling reward in multi-agent settings, you're not changing their decision theory; you're just changing the reward structure so that CDT reward-seekers are incentivized to cooperate with each other.
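A minimal sketch of this point, using a standard one-shot Prisoner's Dilemma with illustrative payoff values (T=5, R=3, P=1, S=0; these numbers are my own choice, not from the original): once reward is pooled, a purely causal reward-maximizer finds cooperation dominant, without any change to its decision theory.

```python
# Standard Prisoner's Dilemma payoffs (illustrative values).
# (my_move, their_move) -> (my_reward, their_reward)
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def pooled(my_move, their_move):
    """Pooled reward: each agent receives the mean of both agents' rewards."""
    mine, theirs = PAYOFFS[(my_move, their_move)]
    return (mine + theirs) / 2

# Under pooling, cooperation dominates for a CDT reward-seeker:
# whatever the other agent does, "C" yields more pooled reward than "D".
for their_move in ("C", "D"):
    assert pooled("C", their_move) > pooled("D", their_move)
```

The point is that nothing decision-theoretic changed: the agent still causally maximizes its own (now pooled) reward, and cooperation simply became the dominant action under the new reward structure.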
tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. In this environment, the model biases increasingly strongly toward reward hints over direct instruction as that training progresses.
When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like:
However, qualitatively neither of these seemed particularly salient to the model:
Thanks for doing this! I'm curious about the lower scores for o3 compared to the other two model checkpoints. Was there a layer of safety training between "RL (late)" and production o3?
...
It seems the answer to my question is probably yes? From the earlier blog post:
...We analyze the increase in verbalized metagaming observed in a portion of capabilities-focused RL training (called exp-rl-cap in Schoen et al) that was part of training o3, prior to any safety- or alignment-focused training.[8] ... We have also observed an increase in reasoning about the reward
Infra-Bayesianism is a mathematical framework for studying artificial learning and intelligence that developed from Vanessa Kosoy’s Learning Theoretic AI Alignment Research Agenda. As applied to reinforcement learning, the main character of infra-Bayesianism is an agent that is learning about an unknown environment and making decisions in pursuit of some goal. Infra-Bayesianism provides novel ways to model this agent’s beliefs and make decisions, which address problems arising when an agent does not or cannot consider the true environment possible at the beginning of the learning process. This setting, a non-realizable environment, is relevant to various scenarios important to AI alignment, including scenarios when agents may consider themselves as part of the environment, and scenarios involving self-modifying agents, multi-agent interactions, and decision theory problems. Furthermore, it is the most...
the computational complexity of individual hypotheses in the hypothesis class cannot be the thing that characterizes the hardness of learning, but rather it has to be some measure of how complex the entire hypothesis class is.
This is true, of course, but mostly immaterial. Outside of contrived examples, it's rare for the hypothesis class to be feasible to learn while containing hypotheses that are infeasible to evaluate. It seems extremely implausible that you can find a hypothesis class that is simultaneously (i) possible to specify in practice [...
This was written for the Vignettes Workshop.[1] The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I’m not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me. The methodology is roughly: Write a future history of 2022. Condition on it, and write a future history of 2023. Repeat for 2024, 2025, etc. (I'm posting 2022-2026 now so I can get feedback that will help me write 2027+. I intend to keep writing until the story reaches singularity/extinction/utopia/etc.)
What’s the point of doing this? Well, there are a couple of reasons:
situations in which they explain that actually Islam is true...
Update: It's 2026 and Ted Cruz just posted this Claude conversation excerpt: https://x.com/tedcruz/status/2031140951490486632?s=20
Summary: Perfectly labeled outcomes in training can still boost reward hacking tendencies in generalization. This can hold even when the train/test sets are drawn from the exact same distribution. We induce this surprising effect via a form of context distillation, which we call re-contextualization:
While we solely reinforce honest outcomes, the reasoning traces focus on hacking more than usual. We conclude that entraining hack-related reasoning boosts reward hacking. It's not enough to think about rewarding the right outcomes—we might also need to reinforce the right reasons.
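The re-contextualization pipeline described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: `generate`, `is_honest_outcome`, and `hack_hint` are stand-in names for whatever sampling, labeling, and hack-salient context the actual setup uses.

```python
def recontextualize(tasks, generate, is_honest_outcome, hack_hint):
    """Sample rollouts with a hack-salient hint in context, keep only
    honestly solved ones (perfect outcome labels), and store them
    WITHOUT the hint for training -- a form of context distillation."""
    training_data = []
    for task in tasks:
        # Hint is present at sampling time, so hack-related reasoning
        # can appear in the trace even when the final outcome is honest.
        response = generate(hack_hint + "\n" + task)
        if is_honest_outcome(task, response):
            # Hint is stripped: the pair trains the model to produce
            # this reasoning without the hint in context.
            training_data.append((task, response))
    return training_data
```

The surprising claim is then that even though every retained outcome is honest, training on these re-contextualized traces boosts reward hacking in generalization, presumably because the hack-related reasoning itself gets reinforced.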
It's often thought that, if a model reward hacks on a task in deployment, then similar hacks...
Yes, but so does this one. (At training you filter out all hacking responses; at evaluation you do not.)