AI ALIGNMENT FORUM

AI Alignment Posts

Popular Comments

Raemon · 17d
Plans A, B, C, and D for misalignment risk
My main question is "why do you think Shut Down actually costs more political will?" I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem like particularly hard sells. I also think a Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff"):

> Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff.
>
> Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff."
>
> A Global Controlled Takeoff That Works has a lot of moving parts.
>
> You need the international agreement to be capable of making any kind of sensible distinction between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs.
>
> You need the international agreement to not turn into a molochian, regulatory-captured horror that perversely reverses the intent of the agreement and creates a class of bureaucrats who don't know anything about AI and use the agreement to dole out favors.
>
> These problems still exist in some versions of Shut It Down too, to be clear (if you're trying to also ban algorithmic research, a lot of versions of that seem to leave room to argue about whether agent foundations or interpretability count). But they at least get coupled with "no large training runs, period."
>
> I think "guys, everyone just stop" is a way easier Schelling point to coordinate around than "everyone, we're going to slow down and try to figure out alignment as best we can using current techniques."
>
> So, I am not currently convinced that Global Controlled Takeoff That Actually Works is any more politically tractable than Global Shut Down.
>
> (Caveat: Insofar as your plan is "well, we will totally get a molochian moral-maze horror, but it'll generally move slower and that buys time", eh, okay, seems reasonable. But at least be clear to yourself about what you're aiming for.)

I agree you do eventually want to go back to Plan A anyway, so I'm mostly just not seeing why you really want to treat these as separate plans, rather than a single: "Okay, we wanna get all the compute centralized and monitored, we want a lot more control over GPU production, we want to buy as much time and proceed as carefully as we can. At any given time, we want the option to either be basically shut down or in controlled-takeoff mode, depending on some conditions on the ground."

I agree with some of the risks of "the geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress", but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff World (in particular in a "shut it down" world where the framing is "we are eventually going to build a thing that everyone agrees is good, we're just making sure we get it right").

But "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying our best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys, this is dangerous enough to warrant massive GPU monitoring but... we're still trying to push ahead as fast as we can?"
Jakub Growiec · 26d
Four ways learning Econ makes people dumber re: future AI
As a professor of economics and a self-proclaimed doomer myself, I greatly appreciate this post! These are almost exactly my feelings when talking to fellow economists, who typically think, by an unspoken assumption, that all AI will always be normal technology, a tool in people's hands. I think your capital/labor point is particularly spot on. I've had a problem with that framing for several years now. That's why I proposed a "hardware-software" framework, which I elaborated on in a few of my papers and one book. The idea is simple: just divide production factors differently! The key distinction is not whether it's man or machine, it's whether it's physical work or information processing. More in a LW post: The Hardware-Software Framework: A New Perspective on Economic Growth with AI — LessWrong, and in my 2022 book, Accelerating Economic Growth: Lessons From 200,000 Years of Technological Progress and Human Development | SpringerLink.
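To make the factor split concrete, here is an illustrative sketch; the symbols H, S, G, G', L_physical, L_cognitive, and A, and the particular decomposition, are my own gloss on the idea described above, not equations taken from the post or the book.

% Illustrative sketch only (notation assumed, not from the post or book).
% Standard split:           Y = F(K, L)    -- physical capital vs. labor
% Hardware-software split:  Y = F(H, S)    -- physical action vs. information processing
\[
  Y = F(H, S), \qquad
  H = G\!\left(K,\, L_{\mathrm{physical}}\right), \qquad
  S = G'\!\left(L_{\mathrm{cognitive}},\, A\right),
\]
% where H ("hardware") aggregates physical capital K and human physical work, and
% S ("software") aggregates human cognitive work and A, pre-programmed software / AI.
% The point of the split: AI enters S alongside human cognition, rather than being
% lumped into "capital".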
Natural Abstractions: Key Claims, Theorems, and Critiques
Best of LessWrong 2023

Lawrence, Erik, and Leon attempt to summarize the key claims of John Wentworth's natural abstractions agenda, formalize some of the mathematical proofs, outline how it aims to help with AI alignment, and critique gaps in the theory, relevance to alignment, and research methodology.

by LawrenceC
18 · Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?
Alek Westover
3d
0
1 · Technical Acceleration Methods for AI Safety: Summary from October 2025 Symposium
Martin Leitgab
3d
0
19 · Reducing risk from scheming by studying trained-in scheming behavior
ryan_greenblatt
10d
0
22 · Rogue internal deployments via external APIs
Fabien Roger, Buck
10d
2
41 · Current Language Models Struggle to Reason in Ciphered Language
Fabien Roger, Shiyuan Guo
12d
3
45 · Recontextualization Mitigates Specification Gaming Without Modifying the Specification
ariana_azarbal, Victor Gillioz, TurnTrout, cloud
12d
0
23 · Iterated Development and Study of Schemers (IDSS)
ryan_greenblatt
16d
0
2 · Assuring Agent Safety Evaluations By Analysing Transcripts
bigmouthmodels, Cozmin Ududec, Benjamin Hilton
16d
0
54 · Realistic Reward Hacking Induces Different and Deeper Misalignment
Jozdien
17d
0
37 · The Thinking Machines Tinker API is good news for AI control and security
Buck
17d
7

Recent Discussion

leogao's Shortform
leogao
3y
Wei Dai · 2h

Have you seen A Master-Slave Model of Human Preferences? To summarize, I think every human is trying to optimize for status, consciously or subconsciously, including those who otherwise fit your description of idealized platonic researcher. For example, I'm someone who has (apparently) "chosen ultimate (intellectual) freedom over all else", having done all of my research outside of academia or any formal organizations, but on reflection I think I was striving for status (prestige) as much as anyone, it was just that my subconscious picked a different strat... (read more)

Research Agenda: Synthesizing Standalone World-Models
38
Thane Ruthenis
1mo

tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda's subproblems and my sketches of how to tackle them.


Back at the end of 2023, I wrote the following:

I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)

On the inside view, I'm pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a...

(Continue Reading - 3241 more words)
6 · Rohin Shah · 7d
Say what now? This seems very false:

* See almost anything physical (riding a bike, picking things up, touch typing on a keyboard, etc.). If you have a dominant hand / leg, try doing some standard tasks with the non-dominant hand / leg. Seems like if the human mind could symbolically interpret its own world model, this should be much easier to do.
* Basically anything to do with vision / senses. Presumably if vision were symbolically interpretable to the mind, there wouldn't be much of a skill ladder to climb for things like painting.
* Symbolic grammar usually has to be explicitly taught to people, even though ~everyone has a world model that clearly includes grammar (in the sense that they can generate grammatical sentences and identify errors in grammar).

Tbc, I can believe it's true in some cases, e.g. I could believe that some humans' far-mode abstract world models are approximately symbolically interpretable to their mind (though I don't think mine is). But it seems false in the vast majority of domains (if you are measuring relative to competent, experienced people in those domains, as seems necessary if you are aiming for your system to outperform what humans can do).
3 · Thane Ruthenis · 2d
Right, I probably should've expected this objection and pre-addressed it more thoroughly.

I think this is a bit of "missing the forest for the trees". In my view, every single human concept and every single human train of thought is an example of human world-models' autosymbolicity. What are "cats", "trees", and "grammar", if not learned variables from our world-model that we could retrieve, understand the semantics of, and flexibly use for the purposes of reasoning/problem-solving?

We don't have full self-interpretability by default, yes. We have to reverse-engineer our intuitions and instincts (e.g., grammar from your example), and for most concepts, we can't break their definitions down into basic mathematical operations. But in modern adult humans, there is a vast interpreted structure that contains an enormous amount of knowledge about the world, corresponding to, well, literally everything a human consciously knows. Which, importantly, includes every fruit of our science and technology.

If we understood an external superhuman world-model as well as a human understands their own world-model, I think that'd obviously get us access to tons of novel knowledge.
Rohin Shah · 1d

> Which, importantly, includes every fruit of our science and technology.

I don't think this is the right comparison, since modern science / technology is a collective effort and so can only cumulate progress through mostly-interpretable steps. (This may also be true for AI, but if so then you get interpretability by default, at least interpretability-to-the-AIs, at which point you are very likely better off trying to build AIs that can explain that to humans.)

In contrast, I'd expect individual steps of scientific progress that happen within a single mind oft... (read more)

4. Existing Writing on Corrigibility
26
Max Harms
1y

(Part 4 of the CAST sequence)

This document is an in-depth review of the primary documents discussing corrigibility that I’m aware of. In particular, I'll be focusing on the writing of Eliezer Yudkowsky and Paul Christiano, though I’ll also spend some time at the end briefly discussing other sources. As I go through the writing of those who’ve come before, I want to specifically compare and contrast those ideas with the conceptualization of corrigibility put forth in earlier documents and the strategy proposed in The CAST Strategy. At a high level I mostly agree with Christiano, except that he seems to think we’ll get corrigibility emergently, whereas I think it’s vital that we focus on directly training purely corrigible agents (and he wants to focus on recursive architectures...

(Continue Reading - 31674 more words)
Joe Carlsmith · 1d

I appreciated the detailed discussion and literature review here -- thanks.

Video and transcript of talk on giving AIs safe motivations
9
Joe Carlsmith
1mo

(This is the video and transcript of a talk I gave at the UT Austin AI and Human Objectives Initiative in September 2025. The slides are also available here. The main content of the talk is based on this recent essay.)

Talk

Hi, everyone. Thank you for coming. I'm honored to be part of this series and part of the beginning of this series.

Plan

I'm going to briefly introduce the core AI alignment problem as I see it. It's going to be a particular version of that problem, the version that I think is highest stakes. And then I'm going to talk about my current high-level picture of how that problem gets solved at a technical level. There's a bunch of aspects of this problem that aren't the technical level that...

(Continue Reading - 14790 more words)
3 · Steven Byrnes · 4d
The headings “behavioral tools” and “transparency tools” both kinda assume that a mysterious AI has fallen out of the sky, and now you have to deal with it, as opposed to either thinking about, or intervening on, how the AI is trained or designed. (See Connor’s comment here.) (Granted, you do mention “new paradigm”, but you seem to be envisioning that pretty narrowly as a transparency intervention.) I think that’s an important omission. For example, it seems to leave out making inferences about Bob from the fact that Bob is human. That’s informative even if I’ve never met Bob (no behavioral data) and can’t read his mind (no transparency). (Sorry if I’m misunderstanding.)
Joe Carlsmith · 2d

Hi Steve -- thanks for this comment. I can see how the vibe of the talk/piece might call to mind something like "studying/intervening on an existing AI system" rather than focusing on how it's trained/constructed, but I do mean for the techniques I discuss to cover both. For example, and re: your Bob example, I talk about our existing knowledge of human behavior as an example of behavioral science here -- and I talk a lot about studying training as a part of behavioral science, e.g.:

> Let’s call an AI’s full range of behavior across all safe and accessibl

... (read more)
Jesse Hoogland's Shortform
Jesse Hoogland
3y
30 · Jesse Hoogland · 4d
We recently put out a new paper on a scalable generalization of influence functions, which quantify how training data affects model behavior (see Nina's post). I'm excited about this because it takes a completely new methodological approach to measuring influence.

Instead of relying on a Hessian inverse (which is ill-defined and expensive), our new "Bayesian" influence functions (BIF) rely on a covariance calculation (which can be scalably estimated with MCMC). This approach is more theoretically sound (no more Hessian inverses), and it achieves what I think is a more desirable set of engineering tradeoffs (better model-size scaling but worse dataset-size scaling).

At Timaeus, we think these kinds of techniques are on the critical path to safety. Modern alignment techniques like RLHF and Constitutional AI are about controlling model behavior by selecting the right training data. If this continues to be the case, we will need better tools for understanding and steering the pipeline from data to behavior.

It's still early days for the BIF. We've done some initial validation on retraining benchmarks and other quantitative tests (follow-up work coming soon), where the BIF comes out looking strong, but more work will be needed to understand the full set of costs and benefits. As that foundation gets established, we expect we'll be able to start applying these techniques directly to safety-relevant problems.

You can read the full announcement thread on X (reproduced below):
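For intuition, a rough sketch of the contrast between the two estimators (the per-example losses ℓᵢ, the posterior, and the notation below are my gloss on the idea, not the paper's exact definitions):

% Classical influence function: effect of upweighting training point z_i on the test
% loss, via the inverse Hessian of the training loss at the trained parameters.
\[
  \mathrm{IF}(z_i, z_{\mathrm{test}})
    = -\,\nabla_\theta \ell_{\mathrm{test}}(\hat\theta)^{\top}
      H(\hat\theta)^{-1}
      \nabla_\theta \ell_i(\hat\theta),
  \qquad
  H(\hat\theta) = \nabla_\theta^2 \sum_j \ell_j(\hat\theta).
\]
% Covariance-style ("Bayesian") influence: correlate per-example losses across
% parameter samples drawn from a (local) posterior, e.g. via SGLD/MCMC sampling;
% no Hessian inverse required. A sketch of the idea, not the paper's exact definition.
\[
  \mathrm{BIF}(z_i, z_{\mathrm{test}})
    \approx \mathrm{Cov}_{\theta \sim p(\theta \mid \mathcal{D})}
      \bigl(\, \ell_i(\theta),\; \ell_{\mathrm{test}}(\theta) \,\bigr).
\]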
leogao · 2d

very exciting! influence functions are one of the best approaches for understanding generalization systematically. always excited to see improvements to influence function methodology.

4 · Logan Riggs · 2d
Great work! I listened to a talk from Philipp on it today and am confused about why we can't just make a better benchmark than LDS. Why not just train e.g. 1k different models, where you left 1 datapoint out? LDS is noisy, so I'm assuming 1k datapoints that exactly capture what you want is better than 1M datapoints that are an approximation. [1]

As an estimate, the Nano-GPT speedrun takes a little more than 2 min now, so you can train 1001 of these in 2.33*1k/60 = 38 hrs on 8 H100s, which is maybe 4 B200s at $24/hr, so ~$1k. And that's getting a 124M-param LLM trained on 730M tokens up to GPT-2 level. Y'all's quantitative setting for Fig 4 was a 2M-parameter ResNet on CIFAR-10 with 5k images, which would be much cheaper to do (although the GPT-2 one has been very optimized, so you could just do the speedrun one but on less data).

LDS was shown to be very noisy, but a colleague mentioned that this could be because 5k images is a very small amount of data. I guess another way to validate LDS is running the expensive full-training run on a few datapoints.

Confusion on LDS Hyperparameter Sweep Meaning

Y'all show in Fig 4 that there are large error bars across seeds for the different methods. This ends up being a property of LDS's noisiness, as y'all show in Figures 7-8 (where BIF & EK-FAC are highly correlated). This means that, even using noisy LDS, you don't need to re-run 5 times if a new method is much better than previous ones (only if it's narrowly better).

What I'm confused about is why you retrained on 100 different ways to resample the data at each percentage. Is this just because LDS is noisy, so you're doing the thing where randomly sampling 100 datapoints 500 times gives you a good approximation of the causal effect of each individual datapoint (or is that what LDS actually is)? Was there high variance in the relative difference between methods across the 100 retrained models?

Other Experiments

Just wild speculation that there are other data attr
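(For reference, the back-of-the-envelope cost above works out as follows, taking the quoted 2.33 min per speedrun and the $24/hr figure for 4 B200s as given:)

\[
  \frac{2.33\ \text{min} \times 1000\ \text{runs}}{60\ \text{min/hr}} \approx 38.8\ \text{hr},
  \qquad
  38.8\ \text{hr} \times \$24/\text{hr} \approx \$930 \approx \$1\text{k}.
\]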
3 · ryan_greenblatt · 3d
I'm pretty excited about building tools/methods for better dataset influence understanding, so this intuitively seems pretty exciting! (I'm interested both in better cheap approximations of the effects of leaving some data out and of the effects of adding some data in.) (I haven't looked at the exact method and results in this paper yet.)
7 · Vanessa Kosoy
This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results, and the applications to alignment. There's also reasonable criticism.

To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning and see what kinds of abstractions can or cannot be learned (both in terms of sample complexity and in terms of computational complexity).

Some thoughts about natural abstractions inspired by this post:

* The concept of natural abstractions seems closely related to my informally conjectured agreement theorem for infra-Bayesian physicalism. In a nutshell, two physicalist agents in the same universe with access to "similar" information should asymptotically arrive at similar beliefs (notably, this is false for Cartesian agents because of the different biases resulting from the different physical points of view).
* A possible formalization of the agreement theorem, inspired by my richness of mathematics conjecture: Given two beliefs Ψ and Φ, we say that Ψ⪯Φ when some conditioning of Ψ on a finite set of observations produces a refinement of some conditioning of Φ on a finite set of observations (see the linked shortform for mathematical details; a symbolic sketch follows below). This relation is a preorder. In general, we can expect an agent to learn a sequence of beliefs of the form Ψ₀≺Ψ₁≺Ψ₂≺… Here, the sequence can be over physical time, over time discount, or over a parameter such as "availability of computing resources" or "how much time the world allows you for thinking between decisions": the latter is the natural asymptotic for metacognitive agents (see also logical time). Given two agents, we get two such sequences {Ψᵢ} and {Φᵢ}. The agreement theorem can then state that for all i∈ℕ, there exists j
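A minimal symbolic rendering of the ⪯ relation and the learning sequence described above; the notation (Ψ ∣ O) for conditioning on a finite observation set O, and ⊑ for "is a refinement of", is shorthand introduced here, not taken from the linked shortform:

% Sketch of the preorder on beliefs defined in the bullet above (shorthand notation).
\[
  \Psi \preceq \Phi
  \quad\iff\quad
  \exists\ \text{finite } O_1, O_2 :\;
  (\Psi \mid O_1) \sqsubseteq (\Phi \mid O_2),
\]
% i.e., some conditioning of \Psi refines some conditioning of \Phi. An agent's
% learning process is then an increasing chain of beliefs, indexed by physical time,
% time discount, or available computing resources:
\[
  \Psi_0 \prec \Psi_1 \prec \Psi_2 \prec \cdots
\]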