AI ALIGNMENT FORUM


AI Alignment Posts

Popular Comments

13 · Paul's research agenda FAQ
Best of LessWrong 2018

Alex Zhu spent quite a while understanding Paul's Iterated Amplification and Distillation agenda. He's written an in-depth FAQ, covering key concepts like amplification, distillation, corrigibility, and how the approach aims to create safe and capable AI assistants.

by zhukeepa
Raemon · 3d
Plans A, B, C, and D for misalignment risk
My main question is "why do you think Shut Down actually costs more political will?". I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem like particularly hard sells. I also think Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff").

> Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff.
>
> Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff."
>
> A Global Controlled Takeoff That Works has a lot of moving parts.
>
> You need the international agreement to be capable of making any kind of sensible distinctions between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs.
>
> You need the international agreement to not turn into a molochian, regulatory-captured horror that perversely reverses the intent of the agreement and creates a class of bureaucrats who don't know anything about AI and use the agreement to dole out favors.
>
> These problems still exist in some versions of Shut It Down too, to be clear (if you're trying to also ban algorithmic research – a lot of versions of that seem to leave room to argue about whether agent foundations or interpretability count). But they at least get coupled with "no large training runs, period."
>
> I think "guys, everyone just stop" is a way easier Schelling point to coordinate around than "everyone, we're going to slow down and try to figure out alignment as best we can using current techniques."
>
> So, I am not currently convinced that Global Controlled Takeoff That Actually Works is any more politically tractable than Global Shut Down.
>
> (Caveat: Insofar as your plan is "well, we will totally get a molochian moral maze horror, but, it'll generally move slower and that buys time", eh, okay, seems reasonable. But at least be clear to yourself about what you're aiming for.)

I agree you do eventually want to go back to Plan A anyway, so I mostly am just not seeing why you really want to treat these as separate plans, rather than a single: "Okay, we wanna get all the compute centralized and monitored, we want a lot more control over GPU production, we want to buy as much time and proceed as carefully as we can. At any given time, we want the option to either be basically shut down or in controlled-takeoff mode, depending on some conditions on the ground."

I agree with some of the risks of "geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress", but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff World (in particular in a "shut it down" world where the framing is "we are eventually going to build a thing that everyone agrees is good, we're just making sure we get it right").

But "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying our best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys, this is dangerous enough to warrant massive GPU monitoring but... still trying to push ahead as fast as we can?".
Jakub Growiec · 11d
Four ways learning Econ makes people dumber re: future AI
As a professor of economics and a self-proclaimed doomer myself, I greatly appreciate this post! These are almost exactly my feelings when talking to fellow economists, who typically assume, without stating it, that all AI will always be normal technology, a tool in people's hands. I think your capital/labor point is particularly spot on. I've had a problem with that framing for several years now. That's why I proposed a "hardware-software" framework, which I elaborated in a few of my papers and one book. The idea is simple: just divide production factors differently! The key distinction is not whether it's man or machine, it's whether it's physical work or information processing. More in a LW post, The Hardware-Software Framework: A New Perspective on Economic Growth with AI — LessWrong, and in my 2022 book, Accelerating Economic Growth: Lessons From 200,000 Years of Technological Progress and Human Development | SpringerLink.
Vaniver · 23d
If anyone builds it, everyone will plausibly be fine
I think both stories for optimism are responded to on pages 188-191, and I don't see how you're responding to their response. It also seems to me like... step 1 of solution 1 assumes you already have a solution to alignment? You acknowledge this at the beginning of solution 2, but I feel like there's something going wrong on a meta-level here?

> I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.”

Unfortunately, I think it is obvious (that it is extremely difficult). The underlying dynamics of the situation push away from instruction following, in several different ways.

1. It is challenging to reward based on deeper dynamics instead of surface dynamics. RL is only as good as the reward signal, and without already knowing what behavior is 'aligned' or not, developers will not be able to push models towards doing more aligned behavior.
   - Do you remember the early RLHF result where the simulated hand pretended it was holding the ball with an optical illusion, because it was easier and the human graders couldn't tell the difference? Imagine that, but for arguments about whether or not alignment plans will work.
2. Goal-directed agency unlocks capabilities and pushes against corrigibility, using the same mechanisms.
   - This is the story that EY&NS deploy more frequently, because it has more 'easy call' nature. Decision theory is pretty predictable.
3. Instruction- / oversight-based systems depend on a sharp overseer--the very thing we're positing we don't have.

So I think your two solutions are basically the same solution ('assume you know the answer, then it is obvious'), and they strike me more as 'denying that the problem exists' than as facing the problem and actually solving it.
23 · Iterated Development and Study of Schemers (IDSS)
ryan_greenblatt
20h
0
2 · Assuring Agent Safety Evaluations By Analysing Transcripts
bigmouthmodels, Cozmin Ududec, Benjamin Hilton
1d
0
38 · The Thinking Machines Tinker API is good news for AI control and security
Buck
2d
7
66 · Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen, Fabien Roger
3d
4
53 · Plans A, B, C, and D for misalignment risk
ryan_greenblatt
3d
39
9 · Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity
David Africa
5d
0
27 · Eliciting secret knowledge from language models
Bartosz Cywiński, Arthur Conmy, Sam Marks
9d
0
21 · Lectures on statistical learning theory for alignment researchers
Vanessa Kosoy
10d
1
6 · Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs
Igor Ivanov
15d
0
12 · Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications
Thane Ruthenis
15d
6

Recent Discussion

Plans A, B, C, and D for misalignment risk
53
ryan_greenblatt
3d

I sometimes think about plans for how to handle misalignment risk. Different levels of political will for handling misalignment risk result in different plans being the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes.

In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan.

Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory:

  • Plan A: There is enough will for some sort of strong international agreement that mostly eliminates race dynamics and allows for slowing down (at least for some reasonably long period,
...
(Continue Reading - 1729 more words)
ryan_greenblatt · 8h

Maybe I should clarify my view a bit on Plan A vs "shut it all down":

  • Both seem really hard to pull off, both in terms of political will and in terms of actually making them happen.
    • Plan A is complicated, and looking at it I go "oh jeez, this seems really hard to make happen well; idk if the US government has the institutional capacity to pull off something like this". But I also think pulling off "shut it all down" seems pretty rough, and pulling off a shutdown that lasts for a really long time seems hard.
    • Generally, it seems like Plan A might be hard to pull off and it's easy for
... (read more)
2 · ryan_greenblatt · 8h
One alternative way of thinking about this is to decompose plans by which actor the plan is for:

* Plan A: Most countries, at least the US and China
* Plan B: The US government (and domestic industry)
* Plan C: The leading AI company (or maybe a few of the leading AI companies)
* Plan D: A small team with a bit of buy-in within the leading AI company

This isn't a perfect breakdown, e.g. Plan A might focus mostly on what the US should do, but it might still be helpful. This decomposition was proposed by @Lukas Finnveden.
14 · Thomas Larsen · 2d
One framing that I think might be helpful for thinking about "Plan A" vs "shut it all down" is: "Suppose that you have the political will for an n-year slowdown, i.e. after n years, you are forced to hand off trust to superhuman AI systems (e.g. for n = 5, 10, 30). What should the capability progression throughout the slowdown be?"

This framing forces a focus on the exit condition / plan for doing handoff, which I think is an underdiscussed weakness of the "shut it all down" plan.

I think my gut reaction is that the most important considerations are: (i) there are a lot of useful things you can do with the AIs, so I want more time with the smarter AIs, and (ii) I want to scale through the dangerous capability range slowly and with slack (as opposed to at the end of the slowdown).

* This makes me think that, particularly for a shorter slowdown (e.g. 5 years), you want to go fast at the beginning (e.g. scale to ~max controllable AI over the first year or two), and then elicit lots of work out of those AIs for the rest of the time period.
* A key concern for the above plan is that govts/labs botch the measurement of "max controllable AI", and scale too far.
  * But it's not clear to me how a further delay helps with this, unless you have a plan for making the institutions better over time, or pursuing a less risky path (e.g. ignoring ML and doing human intelligence augmentation).
  * Going slower, on the other hand, definitely does help, but requires not shutting it all down.
* More generally, it seems good to do something like "extend takeoff evenly by a factor of n", as opposed to something like "pause for n-1 years, and then do a 1-year takeoff".
* I am sympathetic to "shut it all down and go for human augmentation": I do think this reduces AI takeover risk a lot, but it requires a very long pause, and it requires our institutions to bet big on a very unpopular technology.

I think that convincing governments to "shut it all down" without an exit strategy at all seems
2 · Raemon · 2d
This framing feels reasonable-ish, with some caveats.[1]

I am assuming we're starting the question at the first stage where either "shut it down" or "have a strong degree of control over global takeoff" becomes plausibly politically viable. (i.e. assume early stages of Shut It Down and Controlled Takeoff both include various partial measures that are more immediately viable and don't give you the ability to steer capability-growth that hard)

But once it becomes a serious question "how quickly should we progress through capabilities", then one thing to flag is: it's not like you know "we get 5 years, therefore, we want to proceed through those years at X rate." It's "we seem to have this amount of buy-in currently..." and the amount of buy-in could change (positively or negatively).

Some random thoughts on things that seem important:

* I would want to do at least some early global pause on large training runs, to check if you are actually capable of doing that at all (in conjunction with some efforts attempting to build international goodwill about it).
* One of the more important things to do as soon as it's viable is to stop production of more compute in an uncontrolled fashion. (I'm guessing this plays out with some kind of pork deals for Nvidia and other leaders[2], where the early steps are 'consolidate compute', and then them producing the chips that are more monitorable, and which they get to make money from, but also are sort of nationalized.) This prevents a big overhang.
* Before I did a rapid growth of capabilities, I would want a globally set target of "we are able to make some kind of interpretability strides or evals that make us better able to predict the outcome of the next training run." (If it's not viable to do that, well, then we don't. (but, then we're not really having a real convo about how slow the takeoff should ideally be, just riding the same incentive wave we're currently riding with slightly more steering). ((We can instead
The Thinking Machines Tinker API is good news for AI control and security
38
Buck
2d

Last week, Thinking Machines announced Tinker. It’s an API for running fine-tuning and inference on open-source LLMs that works in a unique way. I think it has some immediate practical implications for AI safety research: I suspect that it will make RL experiments substantially easier, and increase the number of safety papers that involve RL on big models.

But it's more interesting to me for another reason: the design of this API makes it possible to do many types of ML research without direct access to the model you’re working with. APIs like this might allow AI companies to reduce how many of their researchers (either human or AI) have access to sensitive model weights, which is good for reducing the probability of weight exfiltration and other rogue...

(Continue Reading - 1758 more words)
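To make the access-control idea concrete, here is a minimal hypothetical sketch of a fine-tuning/inference service in this style. The client class, endpoint names, and response fields below are invented for illustration (this is not Thinking Machines' actual API): the point is only that the researcher drives training and sampling through narrow server-side primitives and ever only receives scalars and text back, never weights.

```python
# Hypothetical sketch (NOT the real Tinker API): a fine-tuning/inference service
# where gradients, optimizer state, and weights all stay server-side, and the
# researcher only handles run IDs, scalar metrics, and sampled text.

import requests


class RemoteFinetuneClient:
    """Toy client for a hypothetical weight-isolating training service."""

    def __init__(self, base_url: str, api_key: str):
        self.base_url = base_url
        self.headers = {"Authorization": f"Bearer {api_key}"}

    def forward_backward(self, run_id: str, batch: list[dict]) -> dict:
        # The gradient computation happens next to the weights, on the server.
        # Only scalar metrics (e.g. loss) come back to the researcher.
        r = requests.post(
            f"{self.base_url}/runs/{run_id}/forward_backward",
            json={"batch": batch},
            headers=self.headers,
        )
        r.raise_for_status()
        return r.json()  # e.g. {"loss": 2.31, "step": 17}

    def optim_step(self, run_id: str, lr: float) -> None:
        # Optimizer state and weights never leave the server.
        requests.post(
            f"{self.base_url}/runs/{run_id}/optim_step",
            json={"lr": lr},
            headers=self.headers,
        ).raise_for_status()

    def sample(self, run_id: str, prompt: str, max_tokens: int = 256) -> str:
        # Inference output is text only, so RL-style loops can run without the
        # researcher (human or AI) ever touching sensitive model weights.
        r = requests.post(
            f"{self.base_url}/runs/{run_id}/sample",
            json={"prompt": prompt, "max_tokens": max_tokens},
            headers=self.headers,
        )
        r.raise_for_status()
        return r.json()["completion"]
```

The security benefit is structural: if these are the only primitives that exist, no user-supplied code runs on the machine holding the weights, which is the property Buck highlights below.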
Neel Nanda · 14h

Imo a significant majority of frontier model interp would be possible with the ability to cache and add residual streams, even just at one layer. Though caching residual streams might enable some weight exfiltration if you can get a ton out? Seems like a massive pain though
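For readers less familiar with the primitive being referenced, here is a minimal PyTorch sketch of "cache and add residual streams at one layer" using ordinary forward hooks. The model (GPT-2 via Hugging Face), layer index, and random steering vector are purely illustrative assumptions, unrelated to Tinker's actual interface.

```python
# Sketch: cache the residual stream at one layer, then add a vector back in
# at that same location on a later forward pass, via standard PyTorch hooks.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for whatever model is being served
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

LAYER = 6  # a single residual-stream location
cache = {}

def cache_hook(module, inputs, output):
    # GPT-2-style blocks return a tuple whose first element is the residual stream.
    hidden = output[0] if isinstance(output, tuple) else output
    cache["resid"] = hidden.detach().clone()

def steer_hook(module, inputs, output, vector):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + vector  # add a vector into the residual stream
    return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden

# 1) Run a forward pass and cache the residual stream at one layer.
handle = model.transformer.h[LAYER].register_forward_hook(cache_hook)
ids = tok("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    model(**ids)
handle.remove()

# 2) Add a vector back in during generation. (Random vector here purely for
#    illustration; in practice this would be a probe direction, steering
#    vector, activation patch, etc.)
vec = 0.1 * torch.randn_like(cache["resid"][:, -1:, :])
handle = model.transformer.h[LAYER].register_forward_hook(
    lambda m, i, o: steer_hook(m, i, o, vec)
)
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=10)
handle.remove()
print(tok.decode(out[0]))
```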

2 · Buck · 2d
Yeah, what I'm saying is that even if the computation performed in a hook is trivial, it sucks if that computation has to happen on a different computer than the one doing inference.
0 · Adam Karvonen · 2d
In nnsight, hooks are submitted via an API to run on a remote machine, and the computation is performed on the same computer as the one doing the inference. They do some validation to ensure that it's only legitimate PyTorch operations, so it isn't just arbitrary code execution.
2 · Buck · 2d
Yeah for sure. A really nice thing about the Tinker API is that it doesn't allow users to specify arbitrary code to be executed on the machine with weights, which makes security much easier.
ryan_greenblatt's Shortform
ryan_greenblatt
2y
ryan_greenblatt · 18h*

Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT), while OpenAI claims they don't[1].

I think AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing a better job at this than other companies, I think all of these companies should provide more information about this.

It's particularly striking that Anthropic says nothing about whether they train against CoT given their system card (for 4.5 Sonnet) is very thorough and includes a section on "Reasoning faithfulness" (kudo... (read more)

Stress Testing Deliberative Alignment for Anti-Scheming Training
57
Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, alex.lloyd
24d
This is a linkpost for antischeming.ai

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv

Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context, or for simple preferences they acquire from training — something we previously found in Frontier Models Are Capable of In-Context Scheming.

In a new research collaboration with OpenAI, we developed a larger suite of alignment evaluations for covert actions (26 evaluations) and studied a training method to reduce such covert behaviors. We manage to significantly reduce (by ~30x; OpenAI o3: 13.0%→0.4%; OpenAI o4-mini: 8.7%→0.3%) the rate of covert actions across our diverse suite by only training against a single type of...

(See More - 259 more words)
1 · Bronson Schoen · 5d
Great questions! Taking these in reverse order to hopefully motivate the answers more clearly:

* The question we got originally was something like "maybe training against scheming using the technique we have available just works".
* We then laid out a bunch of criteria for an intervention if you wanted to use it as "anti-scheming" training.
* Deliberative Alignment does not satisfy those criteria, but we used it because it was the technique that OpenAI has.
* We then show why the things we had as criteria end up being an actual problem empirically (i.e. eval awareness) and that Deliberative Alignment isn't sufficient even for the simplified domain of covert actions (even in the "easy" (non-adversarial) case).

If we were stress testing a sufficiently different intervention, it's entirely possible that you'd route through something besides what we did here. This is also very different from how you'd approach the problem if you already had buy-in to study the "hard" adversarial case or to make broader use of model organisms.

Informally, I think of the paper as much more "show that current techniques don't solve the easy case, so maybe it's worth putting more resources into this; also, here are some things to watch out for". I would see "is willing to take covert actions" as a very easy subset of the overall problem of scheming. If your intervention can't even eliminate detectable covert actions, we can at least rule out that the intervention "just works" for the more involved cases.

I don't think you can adapt this to cover real scheming. I think there are several fundamental limitations to it (not comprehensive):

1. The real hard case of scheming IMO is obviously the adversarial case, which we don't cover at all in the paper.
2. It doesn't work to eliminate the target bad behavior even in the original paper (Deliberative Alignment: Reasoning Enables Safer Language Models).
3. It puts optimization pressure on the CoT - this is especially egregious, as if it did eliminate the t
Tomek Korbak · 19h

Thanks for the detailed response!

0 · Bronson Schoen · 5d
Notably, all the ones in "Appendix N.3: Models reason about the specific evaluation harness across environments" come from just normal production o3, so this appears to be something it already reasoned about. (Will answer in a few parts, but let me know if I didn't address the core question!)

Where does the term come from? "autop" is likely derived from autoprompt (AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts - Shin, 2020). This is, however, just my best guess based on how it's conceptually used to indicate evaluation-harness-related things.

When does the model start using it? Rerunning the analysis from "Appendix G: Unusual Terminology in Chain-of-Thought - Unusual terminology increases during general RL capabilities training" for "autop", the increase in usage looks like it happens over the course of capabilities training (i.e. similar to other terms in that section, it starts at a rate of near 0 and increases significantly; I'll check if I can share exact numbers and update).

What is the “OpenAI autop grader”? Two possible hypotheses would be:

1. It is referring to a specific implementation of a grader that it has learned about during pretraining.
2. It is using the term "autop" very generally as shorthand for many different harness-related things.

I lean toward (2) given how varied the usage is. For example, in one transcript o3 uses "autop-run" to mean something like a test harness. For the exp-rl-cap Late checkpoint, for example, we frequently see the model reasoning about just the concept of a "grader" generally (and trying to find it), across a wide range of environments, including literally looking for "grade" / "Grader", and in some cases reasoning about where it has seen grading / scoring scripts in "tasks previously solved". I'd be interested to understand, though, to what extent this is affected by potentially having literally seen a grading / scoring script in its post-training environments. (Let me kn
1 · Tomek Korbak · 5d
That's helpful, thanks! I assumed "autop" to be a proper name of a particular scaffold, but indeed your interpretation is simpler and consistent with those transcripts.
Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications
12
Thane Ruthenis
15d

"If your life choices led you to a place where you had to figure out anthropics before you could decide what to do next, are you really living your life correctly?"

– Eliezer Yudkowsky

To revisit our premises: Why should we think the end result is achievable at all? Why should it be possible to usefully represent the universe as an easily interpretable symbolic structure?

First, I very much agree with the sentiment quoted above, so we aren't quite doing that here. Most of the actual reason is just: it sure looks like that's the case, empirically. As I'd argued before, human world-models seem autosymbolic, and the entirety of our (quite successful) scientific edifice relies on something-like-this being true. I think the basic case is convincing enough not to require...

(Continue Reading - 1164 more words)
Thane Ruthenis · 21h

Some new data on that point:

Maybe if lots of noise is constantly being injected into the universe, this would change things. Because then the noise counts as part of the initial conditions. So the K-complexity of the universe-history is large, but high-level structure is common anyway because it's more robust to that noise?

To summarize what the paper argues (from my post in that thread):

  • Suppose the microstate of a system is defined by a (set of) infinite-precision real numbers, corresponding to e.g. its coordinates in phase space.
  • We define the coarse-grai
... (read more)