AI researchers warn that advanced machine learning systems may develop their own internal goals that don't match what we intended. This "mesa-optimization" could lead AI systems to pursue unintended and potentially dangerous objectives, even if we tried to design them to be safe and aligned with human values.
This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1] and decompose the parameters of a small[2] language model with it.
VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about.
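To give a concrete picture of the basic object here, below is a minimal sketch of decomposing a single weight matrix into rank-one subcomponents that sum back to the original parameters. This only shows the faithfulness (sum-to-original) constraint; the actual VPD/SPD objectives also train the subcomponents so that few of them are causally important on any given input, which the sketch omits. All names and sizes are illustrative, not from the post.

```python
# Minimal sketch: decompose one weight matrix W into rank-one subcomponents
# whose sum reconstructs W. Only the "faithfulness" constraint is shown;
# the causal-importance/sparsity objectives of the actual methods are omitted.
import torch

torch.manual_seed(0)
d_in, d_out, n_components = 16, 16, 32
W = torch.randn(d_out, d_in)                        # original weight matrix

# Subcomponent c is the outer product U[c] V[c]^T, so it has rank one.
U = torch.randn(n_components, d_out, requires_grad=True)
V = torch.randn(n_components, d_in, requires_grad=True)
opt = torch.optim.Adam([U, V], lr=1e-2)

for step in range(2000):
    opt.zero_grad()
    components = torch.einsum('co,ci->coi', U, V)   # (C, d_out, d_in)
    faithfulness = ((components.sum(dim=0) - W) ** 2).mean()
    faithfulness.backward()
    opt.step()

# "Ablating" subcomponent 0 means removing it from the reconstructed weights.
with torch.no_grad():
    components = torch.einsum('co,ci->coi', U, V)
    W_ablated = components.sum(dim=0) - components[0]
    recon_err = ((components.sum(dim=0) - W) ** 2).mean().item()
    print("reconstruction error:", recon_err)
```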

Importantly, we show that we can decompose attention layers, which interp methods like transcoders and SAEs have historically struggled with.

We also build attribution graphs of the model for some prompts using causally important parameter subcomponents as the nodes, and interpret parts of them.
While making these graphs, we discovered that our adversarial ablation method seemed pretty important for faithfully...
Cool work!
My current understanding is that this part of the tech tree centrally relies on some kind of token-wise sparsity - the components are built such that they have an impact only on a small subset of tokens.
Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias.
- From Scott Alexander's review of Julia Galef's The Scout Mindset.
Alexander goes on to argue that this bias is the source of polarization in society, distorting our beliefs and setting us at each other's throats. How could other people believe such different things unless they're either really stupid or lying to conceal their selfishness? But I think smart people who care about the truth go on believing conflicting things largely because of confirmation bias and motivated reasoning.
The corner of civilization I'm most worried about is the one figuring out how to handle the advent of strong AI. I think confirmation bias...
The practical method is pluralistic understanding: maintaining multiple pictures/models/framings/worldviews around contentious topics at the same time, even when they are wildly in conflict. This involves taking each of them seriously enough to let it develop further and to seek out more understanding relevant to it, even (or especially) for the framings that are not currently accepted as decision-relevant and don't shape beliefs or values.
This relates to how epistemic luck/misfortune is path dependence, and path dependence is defea...
If you've come here via 3Blue1Brown, hi! If you want to learn more about interpreting neural networks in general, here are some resources you might find useful:
This is a write-up of the Google DeepMind mechanistic interpretability team’s investigation of how language models represent facts. It is a sequence of 5 posts; we recommend prioritising reading post 1 and thinking of it as the “main body” of our paper, with posts 2 to 5 as a series of appendices to be skimmed or dipped into in any order.
Reverse-engineering circuits with superposition is a major unsolved problem in mechanistic interpretability: models use...
In post 4, we also studied a toy model mapping pairs of integers to arbitrary labels, where we knew all the data and could generate as much as we liked. We didn’t find the toy model any easier to interpret, in terms of finding internal sparsity or meaningful intermediate states.
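For readers who want a concrete picture of the kind of toy setup being described, here is a minimal reconstruction from the description above (pairs of integers mapped to arbitrary but fixed labels, with the full dataset generable at will). The sizes and architecture are illustrative guesses, not the post's actual configuration.

```python
# Minimal sketch of a toy task like the one described above: pairs of
# integers mapped to arbitrary (random but fixed) labels, so the full
# dataset is known and can be generated at will. Sizes are my own choices.
import torch
import torch.nn as nn

torch.manual_seed(0)
N = 50                                         # integers range over 0..N-1
num_labels = 100
label_table = torch.randint(0, num_labels, (N, N))   # arbitrary lookup table

pairs = torch.cartesian_prod(torch.arange(N), torch.arange(N))
labels = label_table[pairs[:, 0], pairs[:, 1]]

# One-hot encode each integer, concatenate, and train a small MLP to
# memorize the lookup table.
X = torch.cat([nn.functional.one_hot(pairs[:, 0], N),
               nn.functional.one_hot(pairs[:, 1], N)], dim=1).float()
model = nn.Sequential(nn.Linear(2 * N, 256), nn.ReLU(),
                      nn.Linear(256, num_labels))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3000):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), labels)
    loss.backward()
    opt.step()

acc = (model(X).argmax(dim=1) == labels).float().mean().item()
print(f"training accuracy on the full table: {acc:.2f}")
```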
I just read all of post 4
https://www.lesswrong.com/s/hpWHhjvjn67LJ4xXX/p/JRcNNGJQ3xNfsxPj4
There is nothing there about a toy model mapping pairs of integers.
What am I missing?
(Last revised: January 2026. See changelog at the bottom.)
Part of the “Intro to brain-like-AGI safety” post series.
Thus far in the series, Post #1 set out some definitions and motivations (what is “brain-like AGI safety” and why should we care?), and Posts #2 & #3 split the brain into a Learning Subsystem (cortex, striatum, cerebellum, amygdala, etc.) that “learns from scratch” using learning algorithms, and a Steering Subsystem (hypothalamus, brainstem, etc.) that is mostly genetically-hardwired and executes innate species-specific instincts and reactions.
Then in Post #4, I talked about the “short-term predictor”, a circuit which learns, via supervised learning, to predict a signal in advance of its arrival, but only by perhaps a fraction of a second. Post #5 then argued that if we form a closed...
I think “predict sensory input” is the main training signal for the Thought Generator, loosely analogous to how “predict next token” is the training signal for LLM pretraining. (Cf. §4.7.) So “predict sensory inputs” wouldn’t be a separate box from the Thought Generator, but rather a core function of the Thought Generator. Does that help? Sorry if I’m missing your point.
Around 10 years ago, a paper came out that arguably killed classical deep learning theory: Zhang et al.'s aptly titled Understanding deep learning requires rethinking generalization.
Of course, this is a bit of an exaggeration. No single paper ever kills a field of research on its own, and deep learning theory was not exactly the most productive and healthy field at the time this was published. And the paper didn't come close to addressing all theoretical approaches to understanding aspects of deep learning. But if I had to point to a single paper that shattered the feeling of optimism at the time, it would be Zhang et al. 2016.[1]
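For concreteness, the paper's headline experiment can be sketched in a few lines: take a standard architecture, replace the training labels with purely random ones, and watch the network drive training error to zero anyway. The sketch below uses synthetic inputs and a small MLP rather than the paper's CIFAR-10/ImageNet setups; it's illustrative, not a reproduction.

```python
# Minimal sketch of the randomization test from Zhang et al. (2016):
# a standard network trained on *randomly labeled* data still reaches
# ~zero training error. Synthetic inputs stand in for real images.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 512, 256, 10                       # samples, input dim, classes
X = torch.randn(n, d)                        # synthetic "images"
y = torch.randint(0, k, (n,))                # labels assigned at random

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, k))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"final loss {loss.item():.4f}, train accuracy {train_acc:.2f}")
# The network memorizes the random labels, so any account of generalization
# that appeals only to the hypothesis class's limited capacity is in trouble.
```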
Believe it or not, this unassuming table rocked the field of deep learning theory back in 2016, despite probably involving
These randomly trained models, are they uncertain or confidently wrong on the test data?
My model of what is going on here is that stochastic gradient descent is acting roughly like an MCMC sampling method: it's producing a random sample from the space of low-loss parameters. And the simpler hypotheses correspond to larger parameter-space volumes.
When the network needs to memorize, it needs to use nearly all its parameters, meaning a small parameter-space volume. When the network is learning a pattern, it's only using a small fraction of its parame...
In this post, we describe a generalization of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) called crisp supra-MDPs and supra-POMDPs. The new feature of these decision processes is that the stochastic transition dynamics are multivalued, i.e. specified by credal sets. We describe how supra-MDPs give rise to crisp causal laws, the hypotheses of infra-Bayesian reinforcement learning. Furthermore, we discuss how supra-MDPs can approximate MDPs by a coarsening of the state space. This coarsening allows an agent to be agnostic about the detailed dynamics while still having performance guarantees for the full MDP.
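To make the multivalued dynamics concrete: each (state, action) pair comes with a credal set of candidate transition distributions rather than a single one, and planning uses a worst-case (maximin) Bellman backup over that set. Below is a minimal finite-horizon sketch with my own toy sizes, representing each credal set as an explicit finite list of distributions; it illustrates the idea, not the post's formalism.

```python
# Minimal sketch of maximin backward induction for a finite-horizon
# "supra-MDP": transitions are multivalued, given here as a finite credal
# set of distributions per (state, action), and the agent optimizes against
# the worst-case member of each set. (For a linear objective, the worst case
# over the convex hull is attained at one of the listed distributions.)
import numpy as np

S, A, H = 3, 2, 5                      # states, actions, horizon
rng = np.random.default_rng(0)
reward = rng.random((S, A))            # r(s, a), assumed known

# credal[s][a] is a list of candidate next-state distributions over S.
credal = [[[rng.dirichlet(np.ones(S)) for _ in range(2)]
           for _ in range(A)] for _ in range(S)]

V = np.zeros(S)                        # value at the final time step
policy = np.zeros((H, S), dtype=int)
for t in reversed(range(H)):
    Q = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            # worst case over the credal set, then best response over actions
            worst = min(p @ V for p in credal[s][a])
            Q[s, a] = reward[s, a] + worst
    policy[t] = Q.argmax(axis=1)
    V = Q.max(axis=1)

print("maximin values at t=0:", np.round(V, 3))
```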
Analogously to the classical theory, we describe an algorithm to compute a Markov optimal policy for supra-MDPs with finite time horizons. We also prove the existence of a stationary optimal policy for...
Shifting the losses by one time step doesn't really matter, since we're mostly interested in the shape of the regret bound which (up to mild changes in the constants) is not affected by this.
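To spell out the bookkeeping (my own hedged sketch, assuming per-step losses bounded in $[0, B]$, which the comment doesn't state): shifting by one step only changes each cumulative sum by a boundary term,

$$\left|\sum_{t=1}^{T} \ell_{t+1} - \sum_{t=1}^{T} \ell_t\right| = \left|\ell_{T+1} - \ell_1\right| \le B,$$

so the shifted regret differs from the unshifted regret by at most $2B$ (one boundary term for the learner's cumulative loss and one for the comparator's), and a bound of the form $R_T \le C\sqrt{T}$ keeps its shape with only the constants affected.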