The Solomonoff prior is a mathematical formalization of Occam's razor: it assigns probabilities to observations based on the simplicity of the programs that could have produced them. However, the simplest programs that predict our observations well might simulate universes containing intelligent agents who try to influence the predictions. This makes the Solomonoff prior "malign" - its predictions are influenced by the preferences of the beings it simulates.
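For concreteness, here is the standard textbook definition (my gloss, not quoted from the post): the prior weight of an observation sequence sums over all programs that reproduce it, discounted exponentially in program length, so the simplest consistent programs dominate.

```latex
% Universal (Solomonoff) prior over observation prefixes x, for a fixed
% universal prefix machine U -- standard definition, stated for reference:
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|}
% Every program p whose output begins with x contributes 2^{-|p|},
% so the shortest (simplest) consistent programs carry most of the weight.
```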
...We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.
We apply NLAs to model auditing. During our pre-deployment audit of Claude Opus 4.6, NLAs helped diagnose safety-relevant behaviors and surfaced unverbalized evaluation awareness—cases where Claude believed, but did not say, that it was being evaluated.
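A minimal sketch of how one such training step could look, assuming hypothetical `verbalizer` (AV) and `reconstructor` (AR) modules and a REINFORCE-style update; the interfaces and reward shaping below are my own illustration, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def nla_step(verbalizer, reconstructor, activation, optimizer):
    """One hypothetical NLA training step (illustrative, not the paper's code).

    verbalizer (AV): maps a residual-stream activation -> sampled text
        description plus the log-probability of that sample.
    reconstructor (AR): maps the text description -> predicted activation.
    Both are assumed to be LLM-based modules with trainable parameters.
    """
    # AV samples a natural-language description of the activation.
    description, log_prob = verbalizer.sample(activation)

    # AR maps the (non-differentiable) text back into activation space.
    reconstruction = reconstructor(description)

    # Reconstruction error; its negative serves as the RL reward for the AV.
    recon_loss = F.mse_loss(reconstruction, activation)
    reward = -recon_loss.detach()

    # REINFORCE-style term trains the AV; the MSE term trains the AR.
    loss = -(reward * log_prob) + recon_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return description, recon_loss.item()
```

In practice a more careful RL setup (e.g. PPO-style updates, reward normalization) would presumably be needed; the point here is only the AV → text → AR reconstruction loop.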
Here are some training experiments I think AI companies should consider running. In each case, the idea is to modify the company's actual training process as described and then study the resulting AI. (You presumably shouldn't deploy/use the resulting AI...)
[Quickly written, unpolished. Also, it's possible that there's some more convincing work on this topic that I'm unaware of – if so, let me know. Also also, it's possible I'm arguing with an imaginary position here and everyone already agrees with everything below.]
In research discussions about LLMs, I often pick up a vibe of casual, generalized skepticism about model-generated CoT (chain-of-thought) explanations.
CoTs (people say) are not trustworthy in general. They don't always reflect what the model is "actually" thinking or how it has "actually" solved a given problem.
This claim is true as far as it goes. But people sometimes act like it goes much further than (IMO) it really does.
Sometimes it seems to license an attitude of "oh, it's no use reading what the model says in...
But this new kind of transcoder gives us a string of text, in English (or another language of your choice), which we can simply read.
And this string of text simply... tells us what the sub-block is doing.
Natural Language Autoencoders sound a lot like this, although they interpret activations rather than acting as transcoders, and of course they don't have the second property mentioned in the post (zero reconstruction error).
This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1], and use it to decompose the parameters of a small[2] language model.
VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about.

Importantly, we show that we can decompose attention layers, which interp methods like transcoders and SAEs have historically struggled with.

We also build attribution graphs of the model for some prompts using causally important parameter subcomponents as the nodes, and interpret parts of them.
While making these graphs, we discovered that our adversarial ablation method seemed pretty important for faithfully...
Cool work!
My current understanding is that this part of the tech tree centrally relies on some kind of token-wise sparsity - the components are built such that they have an impact only on a small subset of tokens.
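To make that concrete, here is a toy numpy illustration of the general shape of the idea, not the actual SPD/VPD objective: the weights are written as a sum of subcomponents, a faithfulness term checks that the sum still matches the original parameters, and a per-token attribution term is what a token-wise sparsity penalty would act on.

```python
import numpy as np

# Toy illustration of the parameter-decomposition picture described above --
# NOT the SPD/VPD objective itself, just the shape of the idea.
rng = np.random.default_rng(0)

d_in, d_out, n_components, n_tokens = 16, 16, 8, 5
W = rng.normal(size=(d_out, d_in))                     # original parameters

# Rank-one subcomponents V_c = u_c v_c^T whose sum should reconstruct W.
U = rng.normal(size=(n_components, d_out))
V = rng.normal(size=(n_components, d_in))
components = np.einsum("co,ci->coi", U, V)             # (C, d_out, d_in)

X = rng.normal(size=(n_tokens, d_in))                  # activations per token

# "Faithfulness": the summed components should act like the original weights.
faithfulness_err = np.linalg.norm(components.sum(0) - W)

# Per-token contribution of each component to the layer's output. A sparsity
# penalty on these magnitudes pushes each token to rely on few components.
contribs = np.einsum("coi,ti->tco", components, X)        # (T, C, d_out)
per_token_importance = np.linalg.norm(contribs, axis=-1)  # (T, C)
sparsity_penalty = np.abs(per_token_importance).sum()

print(faithfulness_err, sparsity_penalty)
```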
Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias.
- From Scott Alexander's review of Julia Galef's The Scout Mindset.
Alexander goes on to argue that this bias is the source of polarization in society, which is distorting our beliefs and setting us at each other's throats. How could someone believe such different things from you unless they're either really stupid or lying to conceal their selfishness? I think smart people who care about the truth go on believing conflicting things largely because of confirmation bias and motivated reasoning.
The corner of civilization I'm most worried about is the one figuring out how to handle the advent of strong AI. I think confirmation bias...
The practical method is pluralistic understanding: maintaining multiple pictures/models/framings/worldviews around contentious topics at the same time, even when they are wildly in conflict. This should involve taking them seriously enough to at least give them the authority to develop further and to seek out more understanding relevant to them, even (or especially) for the framings that are not currently accepted as decision-relevant, that don't shape beliefs or values.
This relates to how epistemic luck/misfortune is path dependence, and path dependence is defea...
(These are some old results for Haiku 3.5 that I'm reporting just because I have them on hand. Haiku 3.5 doesn't actually have uplift from filler tokens, so it's not clear how interesting this is.)
When running an NLA on this input...