AI ALIGNMENT FORUM

AI Alignment Posts

Popular Comments

New report: "Scheming AIs: Will AIs fake alignment during training in order to get power?"
Best of LessWrong 2023

Joe summarizes his new report on "scheming AIs": advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (e.g., distinguishing "alignment faking" from "power-seeking"), and asks what the prerequisites for scheming are and by what paths it might arise.

by Joe Carlsmith
15 · Fabien Roger
I think that prior to this paper, the discussion around scheming was pretty confusing, spread throughout many posts which were not all specifically about scheming, and was full of pretty bad arguments. This paper fixed that by bringing together most (all?) main considerations for and against expecting scheming to emerge. I found this helpful for clarifying my thinking around the topic, which makes me more confident in my focus on AI control and made me less confused when I worked on the Alignment faking paper. It is also helpful as a list of reasons why someone reasonable might expect scheming (without finding it overwhelmingly likely) that I can point skeptical people at without being afraid that it contains massive over- or understatements. I think this paper will become pretty outdated as we get closer to understanding what AGI looks like and as we get better model organisms, but it currently is the best resource on the conceptual arguments for and against scheming propensity. I strongly recommend (the audio version of) this paper for people who want to work on scheming propensity.
50 · Welcome & FAQ! · Ruby, habryka · 4y · 9 comments
53 · Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro · ryan_greenblatt · 2d · 6 comments
29 · How To Become A Mechanistic Interpretability Researcher · Neel Nanda · 2d · 0 comments
48 · Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements) · ryan_greenblatt · 8d · 2 comments
22 · AI companies have started saying safeguards are load-bearing · Zach Stein-Perlman · 9d · 1 comment
72 · AI Induced Psychosis: A shallow investigation · Tim Hua · 9d · 0 comments
29 · Do-Divergence: A Bound for Maxwell's Demon · johnswentworth, David Lorell · 9d · 0 comments
27 · New Paper on Reflective Oracles & Grain of Truth Problem · Cole Wyeth · 10d · 0 comments
21 · Notes on cooperating with unaligned AIs · Lukas Finnveden · 12d · 1 comment
48 · (∃ Stochastic Natural Latent) Implies (∃ Deterministic Natural Latent) · johnswentworth, David Lorell · 13d · 0 comments

Recent Discussion

Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
53 · ryan_greenblatt · 2d

I've recently written about how I've updated against seeing substantially faster-than-trend AI progress from quickly and massively scaling up RL on agentic software engineering. One response I've heard is something like:

RL scale-ups so far have used very crappy environments due to the difficulty of quickly sourcing enough decent (or even high-quality) environments. Thus, once AI companies manage to get their hands on actually good RL environments (which could happen pretty quickly), performance will increase a bunch.

Another way to put this response is that AI companies haven't actually done a good job scaling up RL—they've scaled up the compute, but with low quality data—and once they actually do the RL scale up for real this time, there will be a big jump in AI capabilities (which yields...

(Continue Reading - 2089 more words)
Rohin Shah · 16h · 109

I agree with this directionally, and generally tend to believe that certain kinds of straight lines will continue to be straight. But also iirc you rely a lot on the METR trend line, which is (a) noisy, (b) over a short timescale (~2 years[1]), and (c) not a great measure of what people care about[2][3]. I don't think you should strongly expect that particular line to continue to be straight over a long time period.

(Though I tend to think that it's more likely to slow down by 2027 than to speed up, relative to what you'd predict by extrapolating the 2024-2... (read more)

6 · Vladimir_Nesov · 1d
Non-OpenAI pre-RLVR chatbots might serve as an anchor for how long it takes an AI company to turn an algorithmic idea into a frontier model, after it becomes a clearly worthwhile thing to do. Arguably only Anthropic managed to catch up to OpenAI, and it took them 1.5 years with Sonnet 3.5. Even Google never caught up after 2+ years; their first credibly frontier chatbot is Gemini 2.5 Pro, which is already well into RLVR (and similarly for Grok 4). So it seems reasonable to expect that it would take about 2 years for RLVR-based models to start being done well, somewhere in 2026-2027. The IMO results probably indicate something about the current lower bound on capabilities in principle, for informally graded tasks such as natural language proofs. This is a lot higher than what has found practical use so far, and improvements in 2026-2027 might be able to capture this kind of thing (without needing the scale of 2026 compute).
1 · jsd · 1d
In their BOTEC, it seems you roughly agree with a group size of 64 and 5 reuses per task (since 5 * 64 is between 100 and 1k). You wrote $0.1 to $1 per rollout, whereas they have in mind 500,000 * $15 / 1M = $7.50. 500,000 tokens doesn't seem especially high for hard agentic software engineering tasks, which often reach into the millions. Does the disagreement come from:
* Thinking the $15 estimate from opportunity cost is too high (so compute cost lower than Mechanize claims)?
* Expecting most of the RL training to somehow not be end-to-end (so compute cost lower than Mechanize claims)?
* Expecting spending per RL environment to be smaller than compute spending, even if within an OOM?
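To make the arithmetic in this BOTEC easy to check, here is a minimal sketch using only the numbers cited above (500,000 tokens per rollout at $15/1M tokens, group size 64, 5 reuses per task); the variable names and the per-task roll-up are illustrative, not taken from either post:

```python
# Sketch of the per-rollout / per-task cost arithmetic from the comment above.
# Token count, price, group size, and reuse count are the figures cited there;
# the structure and names are illustrative only.

TOKENS_PER_ROLLOUT = 500_000   # hard agentic SWE rollouts can reach into the millions
PRICE_PER_M_TOKENS = 15.0      # $ per 1M tokens
GROUP_SIZE = 64                # rollouts per task per pass
REUSES_PER_TASK = 5            # times each task is reused

cost_per_rollout = TOKENS_PER_ROLLOUT * PRICE_PER_M_TOKENS / 1_000_000
rollouts_per_task = GROUP_SIZE * REUSES_PER_TASK
compute_cost_per_task = cost_per_rollout * rollouts_per_task

print(f"cost per rollout:      ${cost_per_rollout:.2f}")        # $7.50
print(f"rollouts per task:     {rollouts_per_task}")            # 320 (between 100 and 1k)
print(f"compute cost per task: ${compute_cost_per_task:,.0f}")  # ~$2,400
```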
5 · ryan_greenblatt · 1d
* I expect lower cost per rollout on average because AI companies do RL on a bunch of smaller tasks and don't necessarily use tons of reasoning tokens on most envs. Also, API prices are marked up relative to what companies actually pay for compute, which can easily add a factor of 3. If we are just looking at the hardest agentic software engineering environments, then this closes the gap a decent amount.
* I expect spending on RL environments to be more like 10x lower than RL training compute rather than similar (and I wouldn't be surprised by a large gap), because it's hard to massively scale up spending on RL envs effectively in a short period of time, while we already have a scaled-up industrial process for buying more compute. I'm more sympathetic to "companies will spend this much on some high quality RL envs" than "the typical RL env will be very expensive", but I think some disagreement remains.
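And a companion sketch of the adjustments described in the reply: divide the price by an assumed ~3x API markup, and note that many RL tasks are much shorter than the hardest SWE environments. The smaller token counts below are hypothetical placeholders, used only to show how the average cost per rollout moves toward the $0.1 to $1 range:

```python
# Same arithmetic with the two adjustments from the reply above applied.
# The 3x markup factor is from the comment; the smaller-task token counts are
# hypothetical, chosen only to illustrate how the average comes down.

PRICE_PER_M_TOKENS_API = 15.0
API_MARKUP_FACTOR = 3.0  # API price vs. what companies actually pay for compute
price_per_m_tokens_compute = PRICE_PER_M_TOKENS_API / API_MARKUP_FACTOR  # ~$5/1M

# Hypothetical mix of rollout lengths: most RL tasks are smaller than the
# hardest agentic software engineering environments.
for tokens_per_rollout in (20_000, 50_000, 150_000, 500_000):
    cost = tokens_per_rollout * price_per_m_tokens_compute / 1_000_000
    print(f"{tokens_per_rollout:>7,} tokens/rollout -> ${cost:.2f} per rollout")
# 20k -> $0.10, 50k -> $0.25, 150k -> $0.75, 500k -> $2.50
```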
ryan_greenblatt's Shortform
ryan_greenblatt · 2y
ryan_greenblatt · 1d · 2546

Precise AGI timelines don't matter that much.

While I do spend some time discussing AGI timelines (and I've written some posts about it recently), I don't think moderate quantitative differences in AGI timelines matter that much for deciding what to do[1]. For instance, having a 15-year median rather than a 6-year median doesn't make that big of a difference. That said, I do think that moderate differences in the chance of very short timelines (i.e., less than 3 years) matter more: going from a 20% chance to a 50% chance of full AI R&D automation withi... (read more)

1 · Bogdan Ionut Cirstea · 1d
Aren't the SWE-Bench figure and doubling time estimate from the blogpost even more relevant here than Fig. 19 from the METR paper?
2 · Thomas Kwa · 1d
The data is pretty low-quality for that graph because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling roughly every 4 months. Their elicitation is probably not as good for OpenAI models as for Anthropic models, but both are increasing at similar rates.
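For intuition, here is a tiny sketch of what "doubling every ~4 months" implies if naively extrapolated. Only the 4-month doubling time comes from the comment above; the 1-hour starting horizon and the months-ahead values are made-up placeholders, not numbers from the METR or Epoch data:

```python
# Toy extrapolation of a task time horizon that doubles every ~4 months.
# Only the 4-month doubling time comes from the discussion above; the starting
# horizon (1 hour) is a hypothetical placeholder.

DOUBLING_TIME_MONTHS = 4.0
start_horizon_hours = 1.0  # hypothetical starting point

for months_ahead in (0, 4, 8, 12, 24):
    horizon = start_horizon_hours * 2 ** (months_ahead / DOUBLING_TIME_MONTHS)
    print(f"+{months_ahead:>2} months: ~{horizon:.0f} hour(s)")
# +0 -> 1, +4 -> 2, +8 -> 4, +12 -> 8, +24 -> 64
```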
Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
48 · ryan_greenblatt · 8d

Here's a relatively important question regarding transparency requirements for AI companies: At which points in time should AI companies be required to disclose information? (While I focus on transparency, this question is also applicable to other safety-relevant requirements, and is applicable to norms around voluntary actions rather than requirements.)

A natural option would be to attach transparency requirements to the existing processes of pre-deployment testing and releasing a model card when a new model is released. As in, companies would be required to include the relevant information whenever they release a new model (likely in the model card). This is convenient because pre-deployment testing and model cards are already established norms in the AI industry, which makes it easier to attach something new to these existing processes rather...

(See More - 857 more words)
Davidmanheim · 1d · 30

One possible important way to address parts of this is by moving from only thinking about model audits and model cards, towards organizational audits. That is, the organization should have policies about when to test and what and when to disclose test results; an organizational safety audit would decide if those policies are appropriate, sufficiently transparent, and sufficient given the risks - and also check to ensure the policies are being followed.

Note that Anthropic has done something like this, albeit weaker, by undergoing an ISO management sys... (read more)

[Intro to brain-like-AGI safety] 7. From hardcoded drives to foresighted plans: A worked example
27 · Steven Byrnes · 3y

(Last revised: June 2025. See changelog at the bottom.)

7.1 Post summary / Table of contents

Part of the “Intro to brain-like-AGI safety” post series.

(Much of this post had been originally published here.)

The previous post presented a big picture of how I think motivation works in the human brain, but it was a bit abstract. In this post, I will walk through an example. To summarize, the steps will be:

  1. (Section 7.3) Our brains gradually develop a probabilistic generative model of the world and ourselves;
  2. (Section 7.4) There’s a “credit assignment” process, where something in the world-model gets flagged as “good”;
  3. (Section 7.5) There’s a “valence” signal roughly tracking the expected probability of the “good” thing, along with processes that systematically lead to higher-valence thoughts over lower-valence ones. This signal drives us to “try”
...
(Continue Reading - 2853 more words)
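As a very loose illustration of steps 7.3-7.5, here is a toy sketch of the control flow being described: a generative model proposes candidate thoughts, credit assignment has already flagged one concept as "good", and a valence signal biases which thought wins out. Nothing here is from the post itself; the concepts, probabilities, and names are all made up for illustration:

```python
import random

# Toy illustration of steps 7.3-7.5: a world model proposes thoughts, credit
# assignment has flagged one concept as "good", and a valence signal biases
# which thought gets pursued. All concepts, values, and names are made up.

GOOD_CONCEPT = "eat cake"  # step 7.4: something in the world-model flagged as "good"

# Step 7.3: candidate thoughts from the generative model, with a rough
# probability that each one actually leads to the flagged "good" outcome.
candidate_thoughts = {
    "idly imagine cake":   0.05,
    "walk to the bakery":  0.60,
    "pick up the fork":    0.95,
    "nap under a blanket": 0.01,
}

def valence(p_good: float) -> float:
    # Step 7.5: valence roughly tracks the expected probability of the "good"
    # thing, plus a little noise so the winner isn't perfectly deterministic.
    return p_good + random.gauss(0, 0.02)

# Higher-valence thoughts systematically win out, which looks like "trying".
chosen = max(candidate_thoughts, key=lambda t: valence(candidate_thoughts[t]))
print(f"pursuing: {chosen!r} (in service of {GOOD_CONCEPT!r})")
```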
3 · Lucas Teixeira · 4d
I'm curious how/if goal coherence over long-term plans is explained by your "planning as reward shaping" model. If planning amounts to an escalation of more and more "real thoughts" (i.e. I'm idly thinking about prinsesstårta -> A fork-full of prinsesstårta is heading towards my mouth), because these correspond to stronger activations in a valenced latent in my world model, and my thought generator is biased towards producing higher-valence thoughts, it's unclear to me why we wouldn't just default to the production of untopical thoughts (i.e. I'm idly thinking about prinsesstårta -> I'm thinking about snuggling underneath a weighted blanket) and never get anything done in the world.

One reply would be to bite the bullet and say: yup, humans do in fact have deficits in their long-term planning strategies, and this accounts for that. But this feels unsatisfying; if the story given in my comment above were the only mechanism, I'd expect us to be much worse.

Another possible reply is that "non-real thoughts" don't reliably lead to rewards from the steering subsystem, and the thought assessors down-weight the valence associated w/ these thoughts, thus leading to them being generated w/ lower frequency; consequently, the only thought sequences which remain are ones which terminate in "real thoughts" and stimulate accurate predictions of the steering subsystem. This seems plausibly sufficient, but it still doesn't answer the question of why people don't arbitrarily switch into "equally real but non-topical" thought sequences at higher frequencies.
Steven Byrnes · 1d · 20

Thanks! I’m not 100% sure what you’re getting at; here are some possible comparisons:

  • “idle musing about eating prinsesstårta sometime in the future” VERSUS “plan to eat the prinsesstårta on my plate right now” → latter is preferred
  • “idle musing about eating prinsesstårta sometime in the future” VERSUS “idle musing about snuggling under a weighted blanket sometime in the future” → either might be preferred, depending on which has higher valence, which in turn depends on whether I’m hungry or tired etc.
  • “idle musing about eating prinsesstårta sometime in the f
... (read more)
Four ways learning Econ makes people dumber re: future AI
100 · Steven Byrnes · 14d
This is a linkpost for https://x.com/steve47285/status/1958527894965108829

(Cross-posted from X, intended for a general audience.)

There’s a funny thing where economics education paradoxically makes people DUMBER at thinking about future AI. Econ textbooks teach concepts & frames that are great for most things, but counterproductive for thinking about AGI. Here are 4 examples. Longpost:

THE FIRST PIECE of Econ anti-pedagogy is hiding in the words “labor” & “capital”. These words conflate a superficial difference (flesh-and-blood human vs not) with a bundle of unspoken assumptions and intuitions, which will all get broken by Artificial General Intelligence (AGI).

By “AGI” I mean here “a bundle of chips, algorithms, electricity, and/or teleoperated robots that can autonomously do the kinds of stuff that ambitious human adults can do—founding and running new companies, R&D, learning new skills, using arbitrary teleoperated robots after...

(Continue Reading - 1632 more words)
Steven Byrnes · 2d · 40

> This is the idea that at some point in scaling up an organization you could lose efficiency due to needing more/better management, more communication (meetings), and longer communication processes, "bloat" in general. I'm not claiming it’s likely to happen with AI, just another possible reason for increasing marginal cost with scale.

Hmm, that would apply to an individual firm but not to a product category, right? If Firm 1 is producing so much [AGI component X] that they pile up bureaucracy and inefficiency, then Firms 2, 3, 4, and 5 will start produ... (read more)

Rohin Shah · 16h · 109
Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
I agree with this directionally, and generally tend to believe that certain kinds of straight lines will continue to be straight. But also iirc you rely a lot on the METR trend line, which is (a) noisy, (b) over a short timescale (~2 years[1]), and (c) not a great measure of what people care about[2][3]. I don't think you should strongly expect that particular line to continue to be straight over a long time period.

(Though I tend to think that it's more likely to slow down by 2027 than to speed up, relative to what you'd predict by extrapolating the 2024-2025 trend, so I agree with the overall conclusion of the post. And obviously one should have non-trivial probability on each of {slow down, stay the same, speed up}, as you say.)

1. ^ I'm excluding GPT-2 and GPT-3 which seem very noisy.
2. ^ Compared to things like Moore's Law, or energy efficiency of solar, or pretraining perplexity, or other similar things where we observe straight lines.
3. ^ None of this is to say that the METR graph is bad! It is the best empirical evidence on timelines I'm aware of.
Kaj_Sotala · 8d · 54
AI companies have started saying safeguards are load-bearing
I was amused when Claude Opus abruptly stopped generating a reply to me and shut down the chat when I had asked it how a fictional galactic empire might control its frontier planets. Given that it stopped generating in the middle of a sentence that was talking about "biological monitoring" and "enhanced", I surmised that the reference to the genetically engineered catboys/catgirls in the setting had triggered its bioengineering filters.
Steven Byrnes · 13d · 1218
Four ways learning Econ makes people dumber re: future AI
I might have overdone it on the sass, sorry. This is much sassier than my default (“scrupulously nuanced and unobjectionable and boring”)…
* …partly because I’m usually writing for lesswrong and cross-posting on X/Twitter, whereas this one was vice-versa, and X is a medium that seems to call for more sass;
* …partly in an amateur ham-fisted attempt to do clickbait (note also: listicle format!) because this is a message that I really want to put out there;
* …and yes, partly because I do sometimes feel really frustrated talking to economists (#NotAllEconomists), and I think they can and should do better, and the sass is reflecting a real feeling that I feel.

But I think next time I would dial it back slightly, e.g. by replacing “DUMBER” with “WORSE” in the first sentence. I’m open to feedback, I don’t know what I’m doing. ¯\_(ツ)_/¯

> I don't think they are more "incorrect" than, say, the AI is Normal Technology folks.

Yeah, I agree that lots of CS professors are deeply mistaken about the consequences of AGI, and ditto with the neuroscientists, and ditto with many other fields, including even many of the people trying to build AGI right now. I don’t think that economists are more blameworthy than other groups, it just so happens that this one particular post is aimed at them.

> I think the crux more or less comes down to skepticism about the plausibility of superintelligence in the next decade or so.

I think you’re being overly generous. “Decade or so” is not the crux. In climate change, people routinely talk about bad things that might happen in 2050, and even in 2100, or farther! People also routinely talk 30 years out or more in the context of science, government, infrastructure, institution-building, life-planning, etc. People talk about their grandkids and great-grandkids growing up, etc.

If someone expected superintelligence in the next 50 years but not the next 20—like if they really expected that, viscerally, with a full understanding of its implications—then that belief would be a massive, central influence on their life and worldview. That’s not what’s going on in the heads of the many (most?) people in academia who don’t take superintelligence seriously. Right?