A collection of 11 different proposals for building safe advanced AI under the current machine learning paradigm. There's a lot of literature laying out various approaches, but much of it focuses primarily on outer alignment at the expense of inner alignment and doesn't provide direct comparisons between approaches.
This is a brief elaboration on The behavioral selection model for predicting AI motivations, based on some feedback and thoughts I’ve had since publishing. Written quickly in a personal capacity.
The main focus of this post is clarifying the basic machinery of the behavioral selection model, and conveying why it matters to disambiguate between different “motivations” for AI behavior. Very similar or identical behavior in training can correspond to radically different outcomes in deployment based on what motivated it.
I’ll preface by saying: I think the behavioral selection model is quite predictive and useful to understand, especially in the short-medium term. But it leaves out some really important dynamics for predicting AI motivations, and I wish I had clarified this more in the original post. Most importantly (as Habryka...
A recent short story by Gabriel Mukobi illustrates a near-term scenario where things go bad because new developments in LLMs allow LLMs to accelerate capabilities research without a correspondingly large acceleration in safety research.
This scenario is disturbingly close to the situation we already find ourselves in. Asking the best LLMs for help with programming vs technical alignment research feels very different (at least to me). LLMs might generate junk code, but you can keep pointing out the problems with the code, and the code will eventually work. This can be faster than doing it myself, in cases where I don't know a language or library well; the LLMs are moderately familiar with everything.
When I try to talk to LLMs about technical AI safety work, however, I just...
I'm curious to hear any updates you have on this thinking a couple years later.
...We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.
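Below is a toy sketch of the round-trip objective described in the abstract: an activation is mapped to text by the verbalizer, mapped back by the reconstructor, and rewarded for how well the reconstruction matches the original. The stubs, dimensions, and reward function here are illustrative stand-ins, not the paper's actual AV/AR models or RL setup.

```python
import numpy as np

D = 16  # hypothetical residual-stream width

def verbalizer(activation: np.ndarray) -> str:
    # Stand-in for the activation verbalizer (AV): activation -> text description.
    # In the real method this is an LLM module trained with RL.
    return "activation summary: " + ",".join(f"{x:.2f}" for x in activation[:3])

def reconstructor(description: str) -> np.ndarray:
    # Stand-in for the activation reconstructor (AR): text description -> activation.
    rng = np.random.default_rng(abs(hash(description)) % (2**32))
    return rng.normal(size=D)

def reconstruction_reward(activation: np.ndarray) -> float:
    # Joint training signal: how well activation -> text -> activation
    # recovers the original (negative MSE as an illustrative reward).
    reconstructed = reconstructor(verbalizer(activation))
    return -float(np.mean((activation - reconstructed) ** 2))

activation = np.random.default_rng(0).normal(size=D)
print(reconstruction_reward(activation))
```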
We apply NLAs to model auditing. During our pre-deployment audit of Claude Opus 4.6, NLAs helped diagnose safety-relevant behaviors and surfaced unverbalized evaluation awareness—cases where Claude believed, but did not say, that it was being evaluated.
I had Opus 4.7 do a longer investigation where it tested many more problems and had an AI look at the NLA output to see whether it looks like a chain of thought (CoT). I also had it analyze the cases where the model gets the problems wrong. The results are kinda complicated but:
Amusingly, just after I posted this, Anthropic released "Teaching Claude why", which has a bunch of information on how they behaviorally align their AIs (or at least how they iterate on particular behaviors/properties). (Though this doesn't seem close to sufficient for a reasonably complete understanding.)
TL;DR: We introduce a testbed based on censored Chinese LLMs, which serve as natural objects of study for secret elicitation techniques. We then study the efficacy of honesty elicitation and lie detection techniques for detecting and removing generated falsehoods.
This post presents a summary of the paper, including examples of transcripts and other miscellaneous findings.
arXiv paper | Code | Transcripts
I'd be curious to see the same results on LLMs' relation to copyrighted songs. Some LLMs deny knowing the lyrics of famous songs, or deny having those songs in their training data.
This post covers joint work with Wilson Wu, George Robinson, Mike Winer, Victor Lecomte and Paul Christiano. Thanks to Geoffrey Irving and Jess Riedel for comments on the post.
In ARC's latest paper, we study the following problem: given a randomly initialized multilayer perceptron (MLP), produce an estimate for the expected output of the model under Gaussian input. The usual approach to this problem is to sample many possible inputs, run them all through the model, and take the average. Instead, we produce an estimate "mechanistically", without running the model even once. For wide models, our approach produces more accurate estimates, both in theory and in practice.
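For concreteness, here is a minimal sketch of the sampling baseline described above: estimate the expected output of a randomly initialized MLP under Gaussian input by Monte Carlo averaging. The widths, depth, initialization scale, and ReLU nonlinearity are illustrative choices, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [64, 256, 256, 1]  # input dim, two hidden layers, scalar output

# Randomly initialize weights and biases (Gaussian init, scaled by fan-in).
params = [
    (rng.normal(size=(n_out, n_in)) / np.sqrt(n_in), rng.normal(size=n_out))
    for n_in, n_out in zip(widths[:-1], widths[1:])
]

def mlp(x: np.ndarray) -> np.ndarray:
    # Forward pass through the random MLP with ReLU on hidden layers.
    for i, (W, b) in enumerate(params):
        x = x @ W.T + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# Sampling estimate of E[f(x)] for x ~ N(0, I): average over many random inputs.
n_samples = 10_000
xs = rng.normal(size=(n_samples, widths[0]))
estimate = mlp(xs).mean(axis=0)
print(estimate)
```

The paper's contribution is to replace this averaging with a "mechanistic" estimate computed directly from the weights, without any forward passes.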
Paper: Estimating the expected output of wide random MLPs more efficiently than sampling
Code: mlp_cumulant_propagation GitHub repo
We are excited about this result as...
Indeed, tensor network diagrams show up in our algorithm (see Appendix A of the paper). We've also been thinking about mechanistic estimation for tensor network contractions as a problem in their own right, partly because they appear to be needed for harder MLP cases.
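For readers unfamiliar with the term, a tensor network contraction just means summing a product of tensors over their shared indices. Here is a toy example (not ARC's algorithm; the shapes and index structure are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))  # indices (i, j)
B = rng.normal(size=(5, 6))  # indices (j, k)
C = rng.normal(size=(6, 4))  # indices (k, i)

# Contract the three-tensor loop: sum_{i,j,k} A[i,j] * B[j,k] * C[k,i].
value = np.einsum("ij,jk,ki->", A, B, C)
print(value)
```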