AI ALIGNMENT FORUM


Popular Comments

Rohin Shah · 8d
Research Agenda: Synthesizing Standalone World-Models
> a human's world-model is symbolically interpretable by the human mind containing it.

Say what now? This seems very false:

* See almost anything physical (riding a bike, picking things up, touch typing a keyboard, etc.). If you have a dominant hand / leg, try doing some standard tasks with the non-dominant hand / leg. Seems like if the human mind could symbolically interpret its own world model this should be much easier to do.
* Basically anything to do with vision / senses. Presumably if vision was symbolically interpretable to the mind then there wouldn't be much of a skill ladder to climb for things like painting.
* Symbolic grammar usually has to be explicitly taught to people, even though ~everyone has a world model that clearly includes grammar (in the sense that they can generate grammatical sentences and identify errors in grammar).

Tbc I can believe it's true in some cases, e.g. I could believe that some humans' far-mode abstract world models are approximately symbolically interpretable to their mind (though I don't think mine is). But it seems false in the vast majority of domains (if you are measuring relative to competent, experienced people in those domains, as seems necessary if you are aiming for your system to outperform what humans can do).
Raemon · 18d
Plans A, B, C, and D for misalignment risk
My main question is "why do you think Shut Down actually costs more political will?". I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells.

I also think Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff").

> Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff.
>
> Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff."
>
> A Global Controlled Takeoff That Works has a lot of moving parts.
>
> You need the international agreement to be capable of making any kind of sensible distinctions between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs.
>
> You need the international agreement to not turn into a molochian regulatory-captured horror that perversely reverses the intent of the agreement and creates a class of bureaucrats who don't know anything about AI and use the agreement to dole out favors.
>
> These problems still exist in some versions of Shut It Down too, to be clear (if you're trying to also ban algorithmic research – a lot of versions of that seem like they leave room to argue about whether agent foundations or interpretability count). But, they at least get coupled with "no large training runs, period."
>
> I think "guys, everyone just stop" is a way easier schelling point to coordinate around, than "everyone, we're going to slow down and try to figure out alignment as best we can using current techniques."
>
> So, I am not currently convinced that Global Controlled Takeoff That Actually Works is any more politically tractable than Global Shut Down.
>
> (Caveat: Insofar as your plan is "well, we will totally get a molochian moral maze horror, but, it'll generally move slower and that buys time", eh, okay, seems reasonable. But, at least be clear to yourself about what you're aiming for.)

I agree you do eventually want to go back to Plan A anyway, so I mostly am just not seeing why you really want to treat these as separate plans, rather than a single: "Okay, we wanna get all the compute centralized and monitored, we want a lot more control over GPU production, we want to buy as much time and proceed as carefully as we can. At any given time, we want the option to either be basically shut down or in controlled-takeoff mode, depending on some conditions on the ground."

I agree with some of the risks of "geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress", but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff World (in particular in a "shut it down" world where the framing is "we are eventually going to build a thing that everyone agrees is good, we're just making sure we get it right").

But "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying our best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys, this is dangerous enough to warrant massive GPU monitoring but... still trying to push ahead as fast as we can?".
Jakub Growiec · 1mo
Four ways learning Econ makes people dumber re: future AI
As a professor of economics and a self-proclaimed doomer myself, I greatly appreciate this post! These are almost exactly my feelings when talking to fellow economists, who typically think, by an unspoken assumption, that all AI will always be normal technology, a tool in people's hands. I think your capital/labor point is particularly spot on. I've had a problem with that framing for several years now. That's why I proposed a "hardware-software" framework, which I elaborated in a few of my papers and one book. The idea is simple: just divide production factors differently! The key distinction is not whether it's man or machine; it's whether it's physical work or information processing. More in a LW post, The Hardware-Software Framework: A New Perspective on Economic Growth with AI, and in my 2022 book, Accelerating Economic Growth: Lessons From 200,000 Years of Technological Progress and Human Development.
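To make that division concrete, one stylized way to write a production function along these lines (an illustrative CES form for this framing, not necessarily the exact specification used in the papers or book):

$$
Y = A\Big[\alpha\,X_{\mathrm{hw}}^{\rho} + (1-\alpha)\,X_{\mathrm{sw}}^{\rho}\Big]^{1/\rho},
\qquad
X_{\mathrm{hw}} = \text{machines} + \text{human physical work},
\quad
X_{\mathrm{sw}} = \text{digital software/AI} + \text{human cognitive work}.
$$

Under this split, the question becomes whether AI comes to dominate the information-processing factor, rather than whether "capital" substitutes for "labor".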
* Should AI Developers Remove Discussion of AI Misalignment from AI Training Data? · Alek Westover · 4d
* Technical Acceleration Methods for AI Safety: Summary from October 2025 Symposium · Martin Leitgab · 4d
* Reducing risk from scheming by studying trained-in scheming behavior · ryan_greenblatt · 11d
* Rogue internal deployments via external APIs · Fabien Roger, Buck · 12d
* Current Language Models Struggle to Reason in Ciphered Language · Fabien Roger, Shiyuan Guo · 13d
* Recontextualization Mitigates Specification Gaming Without Modifying the Specification · ariana_azarbal, Victor Gillioz, TurnTrout, cloud · 13d
* Iterated Development and Study of Schemers (IDSS) · ryan_greenblatt · 17d
* Assuring Agent Safety Evaluations By Analysing Transcripts · bigmouthmodels, Cozmin Ududec, Benjamin Hilton · 17d
* Realistic Reward Hacking Induces Different and Deeper Misalignment · Jozdien · 18d
* The Thinking Machines Tinker API is good news for AI control and security · Buck · 18d

Recent Discussion

leogao's Shortform
leogao
3y
leogao · 8h
what i meant by that is something like: assuming we are in this short-timelines-no-breakthroughs world (to be clear, this is a HUGE assumption! not claiming that this is necessarily likely!), to win we need two things: (a) base case: the first AI in the recursive self improvement chain is aligned, (b) induction step: each AI can create and align its successor.

i claim that if the base case AI is about as aligned as current AI, then condition (a) is basically either satisfied or not that hard to satisfy. like, i agree current models sometimes lie or are sycophantic or whatever. but these problems really don't seem nearly as hard to solve as the full AGI alignment problem. like idk, you can just ask models to do stuff and they like mostly try their best, and it seems very unlikely that literal GPT-5 is already pretending to be aligned so it can subtly stab us when we ask it to do alignment research.

importantly, under our assumptions, we already have AI systems that are basically analogous to the base case AI, so prosaic alignment research on systems that exist today right now is actually just lots of progress on aligning the base case AI, and in my mind a huge part of the difficulty of alignment in the longer-timeline world is because we don't yet have the AGI/ASI, so we can't do alignment research with good empirical feedback loops.

like tbc it's also not trivial to align current models. companies are heavily incentivized to do it and yet they haven't succeeded fully. but this is a fundamentally easier class of problem than aligning AGI in longer-timelines world.
Raemon · 8h
Mmm nod. (I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?)

There's a version of Short Timeline World (which I think is more likely? but, not confidently) which is: "the current paradigm does basically work... but, the way we get to ASI, as opposed to AGI, routes through 'the current paradigm helps invent a new better paradigm, real fast'." In that world, GPT5 has the possibility-of-true-generality, but, not necessarily very efficiently, and once you get to the sharper part of the AI 2027 curve, the mechanism by which the next generation of improvement comes is via figuring out alternate algorithms.
Rohin Shah · 44m

> I bucket this under "given this ratio of right/wrong responses, you think a smart alignment researcher who's paying attention can keep it in a corrigibility basin even as capability levels rise?". Does that feel inaccurate, or, just, not how you'd exactly put it?

I'm pretty sure it is not that. When people say this it is usually just asking the question: "Will current models try to take over or otherwise subvert our control (including incompetently)?" and noticing that the answer is basically "no".[1] What they use this to argue for can then vary:

  1. Curre
... (read more)
leogao · 7h
i guess so? i don't know why you say "even as capability levels rise" - after you build and align the base case AI, humans are no longer involved in ensuring that the subsequent more capable AIs are aligned. i'm mostly indifferent about what the paradigms look like up the chain. probably at some point up the chain things stop looking anything human made. but what matters at that point is no longer how good we humans are at aligning model n, but how good model n-1 is at aligning model n.
Research Agenda: Synthesizing Standalone World-Models
Thane Ruthenis
1mo

tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda's subproblems and my sketches of how to tackle them.


Back at the end of 2023, I wrote the following:

> I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)

On the inside view, I'm pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a...

(Continue Reading - 3241 more words)
Thane Ruthenis · 3d
Right, I probably should've expected this objection and pre-addressed it more thoroughly. I think this is a bit of "missing the forest for the trees". In my view, every single human concept and every single human train of thought is an example of human world-models' autosymbolicity. What are "cats", "trees", and "grammar", if not learned variables from our world-model that we could retrieve, understand the semantics of, and flexibly use for the purposes of reasoning/problem-solving?

We don't have full self-interpretability by default, yes. We have to reverse-engineer our intuitions and instincts (e.g., grammar from your example), and for most concepts, we can't break their definitions down into basic mathematical operations. But in modern adult humans, there is a vast interpreted structure that contains an enormous amount of knowledge about the world, corresponding to, well, literally everything a human consciously knows. Which, importantly, includes every fruit of our science and technology.

If we understood an external superhuman world-model as well as a human understands their own world-model, I think that'd obviously get us access to tons of novel knowledge.
Rohin Shah · 2d

> Which, importantly, includes every fruit of our science and technology.

I don't think this is the right comparison, since modern science / technology is a collective effort and so can only cumulate thinking through mostly-interpretable steps. (This may also be true for AI, but if so then you get interpretability by default, at least interpretability-to-the-AIs, at which point you are very likely better off trying to build AIs that can explain that to humans.)

In contrast, I'd expect individual steps of scientific progress that happen within a single mind oft... (read more)

4. Existing Writing on Corrigibility
Max Harms
1y

(Part 4 of the CAST sequence)

This document is an in-depth review of the primary documents discussing corrigibility that I’m aware of. In particular, I'll be focusing on the writing of Eliezer Yudkowsky and Paul Christiano, though I’ll also spend some time at the end briefly discussing other sources. As I go through the writing of those who’ve come before, I want to specifically compare and contrast those ideas with the conceptualization of corrigibility put forth in earlier documents and the strategy proposed in The CAST Strategy. At a high level I mostly agree with Christiano, except that he seems to think we’ll get corrigibility emergently, whereas I think it’s vital that we focus on directly training purely corrigible agents (and he wants to focus on recursive architectures...

(Continue Reading - 31674 more words)
Joe Carlsmith · 2d

I appreciated the detailed discussion and literature review here -- thanks.

Video and transcript of talk on giving AIs safe motivations
Joe Carlsmith
1mo

(This is the video and transcript of a talk I gave at the UT Austin AI and Human Objectives Initiative in September 2025. The slides are also available here. The main content of the talk is based on this recent essay.)

Talk

Hi, everyone. Thank you for coming. I'm honored to be part of this series and part of the beginning of this series.

Plan

I'm going to briefly introduce the core AI alignment problem as I see it. It's going to be a particular version of that problem, the version that I think is highest stakes. And then I'm going to talk about my current high-level picture of how that problem gets solved at a technical level. There's a bunch of aspects of this problem that aren't the technical level that...

(Continue Reading - 14790 more words)
Steven Byrnes · 5d
The headings “behavioral tools” and “transparency tools” both kinda assume that a mysterious AI has fallen out of the sky, and now you have to deal with it, as opposed to either thinking about, or intervening on, how the AI is trained or designed. (See Connor’s comment here.) (Granted, you do mention “new paradigm”, but you seem to be envisioning that pretty narrowly as a transparency intervention.) I think that’s an important omission. For example, it seems to leave out making inferences about Bob from the fact that Bob is human. That’s informative even if I’ve never met Bob (no behavioral data) and can’t read his mind (no transparency). (Sorry if I’m misunderstanding.)
Joe Carlsmith · 3d

Hi Steve -- thanks for this comment. I can see how the vibe of the talk/piece might call to mind something like "studying/intervening on an existing AI system" rather than focusing on how it's trained/constructed, but I do mean for the techniques I discuss to cover both. For example, and re: your Bob example, I talk about our existing knowledge of human behavior as an example of behavioral science here -- and I talk a lot about studying training as a part of behavioral science, e.g.:

Let’s call an AI’s full range of behavior across all safe and accessibl

... (read more)
Jesse Hoogland's Shortform
Jesse Hoogland
3y
Jesse Hoogland · 5d
We recently put out a new paper on a scalable generalization of influence functions, which quantify how training data affects model behavior (see Nina's post). I'm excited about this because it takes a completely new methodological approach to measuring influence. Instead of relying on a Hessian inverse (which is ill-defined and expensive), our new "Bayesian" influence functions (BIF) rely on a covariance calculation (which can be scalably estimated with MCMC). This approach is more theoretically sound (no more Hessian inverses), and it achieves what I think are a more desirable set of engineering tradeoffs (better model-size scaling but worse dataset-size scaling).

At Timaeus, we think these kinds of techniques are on the critical path to safety. Modern alignment techniques like RLHF and Constitutional AI are about controlling model behavior by selecting the right training data. If this continues to be the case, we will need better tools for understanding and steering the pipeline from data to behavior.

It's still early days for the BIF. We've done some initial validation on retraining benchmarks and other quantitative tests (follow-up work coming soon), where the BIF comes out looking strong, but more work will be needed to understand the full set of costs and benefits. As that foundation gets established, we expect we'll be able to start applying these techniques directly to safety-relevant problems. You can read the full announcement thread on X.
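A minimal sketch of the covariance-style estimator described above, under loose assumptions (a hypothetical `loss_fn(model, batch)` returning per-example losses, plain SGLD as the sampler); this illustrates the idea rather than the paper's implementation:

```python
# Illustrative sketch only: sample parameters near the trained weights with SGLD
# and read off influence from covariances of per-example losses across samples.
import torch

def bif_sketch(model, loss_fn, train_batch, query_batch,
               n_samples=200, lr=1e-5, noise_scale=1e-3):
    """loss_fn(model, batch) is assumed to return a 1-D tensor of per-example losses."""
    params = [p for p in model.parameters() if p.requires_grad]
    train_losses, query_losses = [], []

    for _ in range(n_samples):
        # One SGLD step on the training loss: gradient step plus Gaussian noise,
        # so the iterates roughly explore a local posterior around the optimum.
        model.zero_grad()
        loss_fn(model, train_batch).mean().backward()
        with torch.no_grad():
            for p in params:
                if p.grad is not None:
                    p.add_(-lr * p.grad + noise_scale * torch.randn_like(p))
            train_losses.append(loss_fn(model, train_batch))  # shape (n_train,)
            query_losses.append(loss_fn(model, query_batch))  # shape (n_query,)

    train_l = torch.stack(train_losses)   # (n_samples, n_train)
    query_l = torch.stack(query_losses)   # (n_samples, n_query)

    # Negative covariance across samples: positive entries mean "this training
    # example tends to help that query" under the sampled local posterior.
    train_c = train_l - train_l.mean(dim=0)
    query_c = query_l - query_l.mean(dim=0)
    return -(train_c.T @ query_c) / (n_samples - 1)   # (n_train, n_query)
```

The sampler and localization details matter a lot in practice; the point of the sketch is just that no Hessian inverse appears anywhere, only loss covariances across posterior samples.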
leogao · 3d

very exciting! influence functions are one of the best approaches for understanding generalization systematically. always excited to see improvements to influence function methodology.

Logan Riggs · 3d
Great work! Listened to a talk from Philipp on it today and am confused about why we can't just make a better benchmark than LDS. Why not just train e.g. 1k different models, where you left 1 datapoint out? LDS is noisy, so I'm assuming 1k datapoints that exactly capture what you want is better than 1M datapoints that are an approximation.[1]

As an estimate, the Nano-GPT speedrun takes a little more than 2 min now, so you can train 1001 of these in 2.33*1k/60 = 38 hrs on 8 H100s, which is maybe 4 B200s at $24/hr, so ~$1k. And that's getting a 124M param LLM trained on 730M tokens up to GPT-2 level. Y'all's quantitative setting for Fig 4 was a 2M parameter ResNet on CIFAR-10 on 5k images, which would be much cheaper to do (although the GPT-2 one has been very optimized, so you could just do the speedrun one but on less data). LDS was shown to be very noisy, but a colleague mentioned that this could be because 5k images is a very small amount of data. I guess another way to validate LDS is running the expensive full-training run on a few datapoints.

Confusion on LDS Hyperparameter Sweep Meaning

Y'all show in Fig 4 that there are large error bars across seeds for the different methods. This ends up being a property of LDS's noisiness, as y'all show in Figures 7-8 (where BIF & EK-FAC are highly correlated). This means that, even using noisy LDS, you don't need to re-run 5 times if a new method is much better than previous ones (you only need to if it's just narrowly better). What I'm confused about is why you retrained on 100 different ways to resample the data at each percentage. Is this just because LDS is noisy, so you're doing the thing where randomly sampling 100 datapoints 500 times gives you a good approximation of the causal effect of each individual datapoint (or is that what LDS actually is)? Was there high variance in the relative difference between methods across the 100 retrained models?

Other Experiments

Just wild speculation that there are other data attr
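For reference, a quick check of the cost arithmetic in the comment above, using the numbers stated there (per-run time, the 4×B200 equivalence, and the $24/hr price are the comment's assumptions):

```python
# Sanity check of the leave-one-out retraining cost estimate from the comment above.
minutes_per_run = 2.33        # ~2 min 20 s per Nano-GPT speedrun (comment's figure)
n_runs = 1 + 1000             # one full run plus 1k leave-one-out retrains
gpu_cost_per_hr = 24          # ~4 B200s at $24/hr (comment's figure)

hours = minutes_per_run * n_runs / 60
print(f"{hours:.1f} hours, ~${hours * gpu_cost_per_hr:,.0f}")
# -> 38.9 hours, ~$933, i.e. on the order of $1k, matching the estimate above
```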
ryan_greenblatt · 4d
I'm pretty excited about building tools/methods for better dataset influence understanding, so this intuitively seems pretty exciting! (I'm both interested in better cheap approximation of the effects of leaving some data out and the effects of adding some data in.) (I haven't looked at the exact method and results in this paper yet.)
A Mechanistic Interpretability Analysis of Grokking
Best of LessWrong 2022

Neel Nanda reverse engineers neural networks that have "grokked" modular addition, showing that they operate using Discrete Fourier Transforms and trig identities. He argues grokking is really about phase changes in model capabilities, and that such phase changes may be ubiquitous in larger models.

by Neel Nanda
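A stripped-down numerical illustration of the algorithm described in the summary (a sketch of the idea, not the trained model itself): the learned embeddings place each input on a circle at a few key frequencies, the angle-addition identities turn componentwise products into cos/sin of w(a+b), and the logit for a candidate answer c behaves like cos(w(a+b-c)), which peaks exactly at c = (a+b) mod p.

```python
# Modular addition via one Fourier frequency and trig identities (illustrative only;
# the trained network uses several learned "key" frequencies and sums their logits).
import numpy as np

p = 113                       # modulus used in the grokking work
k = 5                         # one illustrative frequency
w = 2 * np.pi * k / p
a, b = 47, 92

# Embeddings: each input x is represented as (cos(w*x), sin(w*x)).
# Angle-addition identities recover cos/sin of w*(a+b) from the embeddings alone:
cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)

# Logit for each candidate answer c is cos(w*(a+b-c)), maximal at c = (a+b) mod p.
c = np.arange(p)
logits = cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
print(int(np.argmax(logits)), (a + b) % p)   # both print 26
```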
Neel Nanda
Self-Review: After a while of being insecure about it, I'm now pretty fucking proud of this paper, and think it's one of the coolest pieces of research I've personally done. (I'm going to both review this post, and the subsequent paper.) Though, as discussed below, I think people often overrate it.

Impact

The main impact IMO is proving that mechanistic interpretability is actually possible, that we can take a trained neural network and reverse-engineer non-trivial and unexpected algorithms from it. In particular, I think by focusing on grokking I (semi-accidentally) did a great job of picking a problem that people cared about for non-interp reasons, where mech interp was unusually easy (as it was a small model, on a clean algorithmic task), and that I was able to find real insights about grokking as a direct result of doing the mechanistic interpretability. Real models are fucking complicated (and even this model has some details we didn't fully understand), but I feel great about the field having something that's genuinely detailed, rigorous and successfully reverse-engineered, and this seems an important proof of concept. IMO the other contributions are the specific algorithm I found, and the specific insights about how and why grokking happens, but in my opinion these are much less interesting.

Field-Building

Another large amount of impact is that this was a major win for mech interp field-building. This is hard to measure, but some data:

* There are multiple papers I like that are substantially building on/informed by these results (A toy model of universality, the clock and the pizza, Feature emergence via margin maximization, Explaining grokking through circuit efficiency).
* It's got >100 citations in less than a year (a decent chunk of these are semi-fake citations from this being used as a standard citation for 'mech interp exists as a field', so I care more about the "how many papers would not exist without this" metric).
* It went pretty viral on Twi