AI ALIGNMENT FORUM
AF

HomeLibraryQuestionsAll Posts
About

AI Alignment Posts

Popular Comments

20
Raemon8d2011
Plans A, B, C, and D for misalignment risk
My main question is "why do you think Shut Down actually costs more political will?".  I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem particularly hard sells. I also think Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than  "Controlled Takeoff"). > Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff. > > Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff." > > A Global Controlled Takeoff That Works has a lot of moving parts.  > > You need the international agreement to be capable of making any kind of sensible distinctions between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs.  > > You need the international agreement to not turn into molochian regulatory-captured horror that perversely reverses the intent of the agreement and creates a class of bureaucrats who don't know anything about AI and use the agreement to dole out favors. > > These problems still exist in some versions of Shut It Down too, to be clear (if you're trying to also ban algorithmic research – a lot of versions of that seem like they leave room to argue about whether agent foundations or interpretability count). But, they at least get coupled with "no large training runs, period." > > I think "guys, everyone just stop" is a way easier schelling point to coordinate around, than "everyone, we're going to slow down and try to figure out alignment as best we can using current techniques." > > So, I am not currently convinced that Global Controlled Takeoff That Actually Works is any more politically tractable than Global Shut Down. > > (Caveat: Insofar as your plan is "well, we will totally get a molochian moral maze horror, but, it'll generally move slower and that buys time", eh, okay, seems reasonable. But, at least be clear to yourself about what you're aiming for) I agree you do eventually want to go back to Plan A anyway, so I mostly am just not seeing why you really want to treat these as separate plans, rather than a single: "Okay, we wanna get all the compute centralized and monitored, we want a lot more control over GPU production, we want to buy as much time and proceed as carefully as we can. At any given time, we want the option to either be basically shut down or in controlled-takeoff mode, depending on some conditions on the ground." I agree with some of the risks of "geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress" but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff world. (in particular in a "shut it down" world where the framing is "we are eventually going to build a thing that everyone agrees is good, we're just making sure we get it right.") But, "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying out best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys this is dangerous enough to warrant massive GPU  monitoring but... still trying to push ahead as fast as we can?".
Jakub Growiec17d2316
Four ways learning Econ makes people dumber re: future AI
As professor of economics and self-proclaimed doomer myself, I greatly appreciate this post! These are almost exactly my feelings when talking to fellow economists who typically think that, by an unspoken assumption, all AI will always be normal technology, a tool in people's hands. I think your capital/labor point is particularly spot on. I've had a problem with that framing for several years now. That's why I proposed a "hardware-software" framework, which I elaborated in a few of my papers and one book. The idea is simple: just divide production factors differently! The key distinction is not whether it's man or machine, it's whether it's physical work or information processing. More in a LW post: The Hardware-Software Framework: A New Perspective on Economic Growth with AI — LessWrong, and in my 2022 book, Accelerating Economic Growth: Lessons From 200,000 Years of Technological Progress and Human Development | SpringerLink.
Vaniver1mo1519
If anyone builds it, everyone will plausibly be fine
I think both stories for optimism are responded to on pages 188-191, and I don't see how you're responding to their response.  It also seems to me like... step 1 of solution 1 assumes you already have a solution to alignment? You acknowledge this in the beginning of solution 2, but. I feel like there's something going wrong on a meta-level, here? > I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.” Unfortunately, I think it is obvious (that it is extremely difficult). The underlying dynamics of the situation push away from instruction following, in several different ways. 1. It is challenging to reward based on deeper dynamics instead of surface dynamics. RL is only as good as the reward signal, and without already knowing what behavior is 'aligned' or not, developers will not be able to push models towards doing more aligned behavior. 1. Do you remember the early RLHF result where the simulated hand pretended it was holding the ball with an optical illusion, because it was easier and the human graders couldn't tell the difference? Imagine that, but for arguments for whether or not alignment plans will work. 2. Goal-directed agency unlocks capabilities and pushes against corrigibility, using the same mechanisms. 1. This is the story that EY&NS deploy more frequently, because it has more 'easy call' nature. Decision theory is pretty predictable. 3. Instruction / oversight-based systems depend on a sharp overseer--the very thing we're positing we don't have. So I think your two solutions are basically the same solution ('assume you know the answer, then it is obvious') and they strike me more as 'denying that the problem exists' than facing the problem and actually solving it?
Load More

Recent Discussion

Tom Davidson's Shortform
Tom Davidson
10h
Reducing risk from scheming by studying trained-in scheming behavior
18
ryan_greenblatt
12h

In a previous post, I discussed mitigating risks from scheming by studying examples of actual scheming AIs.[1] In this post, I'll discuss an alternative approach: directly training (or instructing) an AI to behave how we think a naturally scheming AI might behave (at least in some ways). Then, we can study the resulting models. For instance, we could verify that a detection method discovers that the model is scheming or that a removal method removes the inserted behavior. The Sleeper Agents paper is an example of directly training in behavior (roughly) similar to a schemer.

This approach eliminates one of the main difficulties with studying actual scheming AIs: the fact that it is difficult to catch (and re-catch) them. However, trained-in scheming behavior can't be used to build...

(Continue Reading - 3065 more words)
Rogue internal deployments via external APIs
22
Fabien Roger, Buck
1d

tl;dr: A heavily-monitored internally deployed AI with sensitive access (e.g. to model weights) could bypass safety controls by asking a weakly-monitored external API version of itself (or another AI) what to do. It could then use its privileged access to execute instructions that the external API generated under less monitoring.

Once AI companies build powerful AIs, they may:

  • Give internal AIs access to sensitive internal privileges (e.g. access to the internal infra that touches model weights, help with training of the next generation of AIs, …);
  • Have mitigations specific to those internal deployments enforced via monitoring of the internal LLM API;
  • Have an external API deployment without those mitigations.

In this situation, an internal AI may build an agent scaffold that makes calls to the external API and uses the internal sensitive privileges....

(Continue Reading - 1559 more words)
1J Bostock18h
I think this was predictable, since "Try to catalogue every way in which an AI could set up a rogue deployment" is an argument from exhaustive free association. I also predict that you will think of another one within a year. Conditional on you thinking of a new method, my gut says there's a 1/2 chance that it was something you could have thought of a year ago, and a 1/2 chance that it's based on some new empirical discovery about LLMs made between a year ago and when you think of it (a la "The model triggers emergent misalignment in a distilled mini version of itself.")
Fabien Roger17h20
  1. My guess is that for control of things around human level, you should be able to use humans to find all the things that an LLM could have thought of. This is an update against long-in-advance small-scale efforts being good enough, not an update against the general methodology.
  2. I think there are domains where you can actually point at a small category of things that actually capture the most important threats. I think this exercise was for me an update in the direction of rogue internal deployments having lots of important and non-trivial subcategories as opposed to few.
Reply
ryan_greenblatt's Shortform
ryan_greenblatt
2y
48ryan_greenblatt7d
Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't[1]. I think AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing a better job at this than other companies, I think all of these companies should provide more information about this. It's particularly striking that Anthropic says nothing about whether they train against CoT given their system card (for 4.5 Sonnet) is very thorough and includes a section on "Reasoning faithfulness" (kudos to them for providing so much other information!). Naively, this seems like some evidence that they are training against CoT and didn't want to disclose this.[2] I think we should try to create an incentive gradient that pushes AI companies to disclose information even if that information makes them look bad, so we should make companies feel some heat for not disclosing (reasonably) important information like how/whether they are training against CoT. ---------------------------------------- It's worth noting that some versions of training against CoT might be totally fine (as in, they don't incentivize obfuscation of misaligned reasoning) and that some particular approach for training against CoT could be worth it even if it does incentivize obfuscating misaligned reasoning. For instance, it could end up being the case that training against CoT works well and we end up with sufficient confidence that this doesn't result in substantial obfuscation of reasoning. The situation is also just generally complicated because things like "what SFT/prompting initialization do you use for your CoT" could have big effects on obfuscation and aren't exactly well described as applying optimization pressure to CoT. (But, if you start iterating on the SFT/prompting initialization to reduce misalignment observed in the CoT, this could cause problems.) Given that the situation is complicated, ideally AI companies would publ
ryan_greenblatt1d110

Anthropic has now clarified this in their system card for Claude Haiku 4.5:
Image

Thanks to them for doing this!

See also Sam Bowman's tweet thread about this.

Reply
Current Language Models Struggle to Reason in Ciphered Language
39
Fabien Roger, Shiyuan Guo
3d

tl;dr: We fine-tune or few-shot LLMs to use reasoning encoded with simple ciphers (e.g. base64, rot13, putting a dot between each letter) to solve math problems. We find that these models only get an uplift from the reasoning (over directly answering) for very simple ciphers, and get no uplift for intermediate-difficulty ciphers that they can translate to English. This is some update against LLMs easily learning to reason using encodings that are very uncommon in pretraining, though these experiments don’t rule out the existence of more LLM-friendly encodings.

📄Paper, 🐦Twitter, 🌐Website

Research done as part of the Anthropic Fellows Program. Compute was generously funded by Scale and the Anthropic Fellows program.

Summary of the results

We teach LLMs to use one particular cipher, such as:

  • “letter to word with dot” maps each
...
(Continue Reading - 1364 more words)
Jozdien2d10
  • Speak using what looks like code words (e.g. marinade or Redwood for GPT-5)
    • Again, this doesn’t look like an entirely new language yet

[...]

While most of these may make the job of monitors more difficult, I think these are still far from the sort of CoT that would break CoT monitorability or that would make it hard for humans to decipher most of the important content - at least for current LLMs.

I'm confused by this. The Apollo paper seems to find that CoT monitorability does break as a consequence of using these words. Section 4 in particular seems to i... (read more)

Reply
1Bronson Schoen3d
I’m a bit confused about this distinction, given that in the Bowen paper precisely the thing it breaks is CoT monitorability. While this has the limitations that: * It was directly induced by the training setup * Restricted to just CoT (and not CoT+actions) It seems to meet the bar of “empirically beat the CoT monitor” Or by “making the job of the monitors more difficult” do you mean that you could like train a monitor on this CoT and recover monitorability? I think this is likely true, but “the model learns steganographic reasoning that breaks CoT Monitorability for both a trained and untrained monitor” seems like a much higher bar than beating an untrained monitor. I would definitely argue that the models in the Bowen paper break the “hard for humans to decipher most of the important content” threshold.
2Fabien Roger2d
The thing I was pointing at is how the o3 scratchpads seem "shorter" (by default, even when you don't train against them), not how in Bowen et al the scratchpad stops mentioning the malicious activity entirely. I should have maybe pointed at the weird GPT-5 CoT, it might have been less confusing. Given the simplicity of the hacks in Bowen et al, I don't think they require more reasoning than what a model can do in a forward pass - so my guess is that information about the hack is just completely absent. The thing I am trying to study in this post is the dynamics around CoT by necessity, not by propensity. I think there is some intermediate hope like "if the model sometimes uses weird distracting words, make sure your CoT monitor evals have weird distracting words, and if the model is less good at it than regular English, change the prompt / do training to be robust to weird distracting words (potentially from a much broader distribution than what the model would do)". This is a bar that I expect people doing monitoring could meet with a bit of effort, and that is much easier to meet than actually having to train against the specific and weird steganographic encoding that the model would use if it used steganography (which requires you to understand the specific and weird steganographic encoding it is using). Low confidence, I agree there is a real difficulty here.
Load More
A central AI alignment problem: capabilities generalization, and the sharp left turn
Best of LessWrong 2022

Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), and that this transition will break most alignment approaches. And this isn't getting enough focus from the field.

by So8res
18Reducing risk from scheming by studying trained-in scheming behavior
ryan_greenblatt
12h
0
22Rogue internal deployments via external APIs
Fabien Roger, Buck
1d
2
39Current Language Models Struggle to Reason in Ciphered Language
Fabien Roger, Shiyuan Guo
3d
3
45Recontextualization Mitigates Specification Gaming Without Modifying the Specification
ariana_azarbal, vgillioz, TurnTrout, cloud
3d
0
23Iterated Development and Study of Schemers (IDSS)
ryan_greenblatt
7d
0
2Assuring Agent Safety Evaluations By Analysing Transcripts
bigmouthmodels, Cozmin Ududec, Benjamin Hilton
7d
0
54Realistic Reward Hacking Induces Different and Deeper Misalignment
Jozdien
7d
0
37The Thinking Machines Tinker API is good news for AI control and security
Buck
8d
7
68Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen, Fabien Roger
8d
6
54Plans A, B, C, and D for misalignment risk
ryan_greenblatt
8d
39
Load More
4Mikhail Samin
Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think The sharp left turn is not a simple observation that we've seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic understanding that some people at MIRI have, of dynamics that might produce systems with generalised capabilities but not alignment. Many times over the past year, I've been surprised by people in the field who've read Nate's post but somehow completely missed the part where it talks about specific dynamics that lead to alignment properties breaking during capabilities generalisation. To fulfil the reviewing duty and to have a place to point people to, I'll try to write down some related intuitions that I talked about throughout 2023 when trying to get people to have intuitions on what the sharp left turn problem is about. For example, imagine training a neural network with RL. For a while during training, the neural network might be implementing a fuzzy collection of algorithms and various heuristics that together kinda optimise for some goals. The gradient strongly points towards greater capabilities. Some of these algorithms and heuristics might be more useful for the task the neural network is being evaluated on, and they'll persist more and what the neural network is doing as a whole will look a bit more like what the most helpful parts of it are doing. Some of these algorithms and heuristics might be more agentic and do more for long-term goal achievement than others. As being better at achieving goals correlates with greater performance, the neural network becomes, as a whole, more capable of achieving goals. Or, maybe the transition that leads to capabilities generalisation can be more akin to grokking: even with a fuzzy solution, the distant general coherent agent implementations might still be visible to the gradient, and at some point, there might be a switch from a fuzzy collection of things togeth