AI ALIGNMENT FORUM
How To Go From Interpretability To Alignment: Just Retarget The Search
Best of LessWrong 2022

Here's a simple strategy for AI alignment: use interpretability tools to identify the AI's internal search process, and the AI's internal representation of our desired alignment target. Then directly rewire the search process to aim at the alignment target. Boom, done. 

by johnswentworth
Alex_Altair
This post expresses an important idea in AI alignment that I have essentially believed for a long time, and which I have not seen expressed elsewhere. (I think a substantially better treatment of the idea is possible, but this post is fine, and you get a lot of points for being the only place where the idea is being shared.)
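As a toy illustration of the retargeting step described in the summary above (a sketch under strong assumptions, not anything from the post): suppose interpretability work has already located both the model's internal goal representation and the representation of the desired target. The mock model, the idea that the goal is a single vector, and all names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyGoalDirectedModel(nn.Module):
    """Mock 'agent': its forward pass scores candidate actions against an internal goal vector."""
    def __init__(self, d: int = 8, n_actions: int = 4):
        super().__init__()
        self.goal = nn.Parameter(torch.randn(d))                # internal search target
        self.action_embeds = nn.Parameter(torch.randn(n_actions, d))

    def forward(self) -> torch.Tensor:
        # "Search": pick the action whose embedding best matches the current goal.
        return (self.action_embeds @ self.goal).argmax()

def retarget_search(model: ToyGoalDirectedModel, aligned_target: torch.Tensor) -> None:
    """Overwrite the located goal representation with the desired alignment target."""
    with torch.no_grad():
        model.goal.copy_(aligned_target)

model = ToyGoalDirectedModel()
# Pretend interpretability tools identified action 2's embedding as the internal
# representation of our desired alignment target.
retarget_search(model, model.action_embeds[2].detach().clone())
assert model().item() == 2  # the internal search now aims at the retargeted goal
```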

Recent Discussion

Fabien's Shortform
Fabien Roger
2y
Fabien Roger · 18h
I listened to the books Original Sin: President Biden's Decline, Its Cover-up, and His Disastrous Choice to Run Again and The Divider: Trump in the White House, 2017–2021. Both clearly have an axe to grind, and I don't have enough US politics knowledge to know which claims are fair and which ones are exaggerations and/or are missing important context, but the two books are sufficiently anti-correlated that it seems reasonable to update based on the intersection of the two. Here are some AGI-relevant things I learned:

* It seems rough to avoid sycophancy dynamics as president:
  * There are often people around you who want to sabotage you (e.g. to give more power to another faction), so you need to look out for saboteurs and discourage disloyalty.
  * You had a big streak of victories to become president, which probably required a lot of luck but also required you to double down on the strength you previously showed and be confident in your ability to succeed again in higher-stakes positions.
  * Of course you can still try to be open to new factual information, but being calibrated about how seriously to take dissenting points of view sounds rough when the facts are not extremely crisp and legible.
  * This makes me somewhat more pessimistic about how useful AGI advisors could be.
* Unconditional truthfulness (aka never lying) seems very inconvenient in politics. It looks quite common to e.g. have to choose between truthfulness and loyalty in an environment that strongly encourages loyalty. Lying seems especially convenient when it is about some internal state ("did you know that Biden was too old?", "do you think the current policy of the admin you work for is good?"). But even regular cover-ups about factual information seem quite common.
  * I think this makes designing good model specs for AGI advisors potentially quite hard, especially if the AGI advisors also have to answer questions from journalists and other adversarial entities.
* I wonder...
Buck · 13h

Re that last point, you might be interested to read about "the constitution is not a suicide pact": many prominent American political figures have said that survival of the nation is more important than constitutionality (and this has been reasonably well received by other actors, not reviled).

Tom Davidson's Shortform
Tom Davidson
2d
Reducing risk from scheming by studying trained-in scheming behavior
18
ryan_greenblatt
3d

In a previous post, I discussed mitigating risks from scheming by studying examples of actual scheming AIs.[1] In this post, I'll discuss an alternative approach: directly training (or instructing) an AI to behave how we think a naturally scheming AI might behave (at least in some ways). Then, we can study the resulting models. For instance, we could verify that a detection method discovers that the model is scheming or that a removal method removes the inserted behavior. The Sleeper Agents paper is an example of directly training in behavior (roughly) similar to a schemer's.

This approach eliminates one of the main difficulties with studying actual scheming AIs: the fact that it is difficult to catch (and re-catch) them. However, trained-in scheming behavior can't be used to build...

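As a rough sketch of the verification loop the preview above describes (insert schemer-like behavior, check that detection flags it, check that removal eliminates it). Every callable name here is a hypothetical placeholder, not an API from the post.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TrainedInSchemingEval:
    insert_behavior: Callable[[Any], Any]   # e.g. fine-tune in a backdoor, Sleeper Agents style
    behaves_badly: Callable[[Any], bool]    # does the model show the inserted behavior on triggers?
    detect_scheming: Callable[[Any], bool]  # does the detection method flag the model?
    remove_behavior: Callable[[Any], Any]   # attempt to remove the inserted behavior

    def run(self, base_model: Any) -> dict:
        model = self.insert_behavior(base_model)
        assert self.behaves_badly(model), "behavior was not actually inserted"
        detection_ok = self.detect_scheming(model)        # verify the detection method fires
        cleaned = self.remove_behavior(model)
        removal_ok = not self.behaves_badly(cleaned)      # verify the inserted behavior is gone
        return {"detection_ok": detection_ok, "removal_ok": removal_ok}
```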
Rogue internal deployments via external APIs
22
Fabien Roger, Buck
3d

tl;dr: A heavily-monitored internally deployed AI with sensitive access (e.g. to model weights) could bypass safety controls by asking a weakly-monitored external API version of itself (or another AI) what to do. It could then use its privileged access to execute instructions that the external API generated under less monitoring.

Once AI companies build powerful AIs, they may:

  • Give internal AIs access to sensitive internal privileges (e.g. access to the internal infra that touches model weights, help with training of the next generation of AIs, …);
  • Have mitigations specific to those internal deployments enforced via monitoring of the internal LLM API;
  • Have an external API deployment without those mitigations.

In this situation, an internal AI may build an agent scaffold that makes calls to the external API and uses the internal sensitive privileges....

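A minimal sketch (my illustration, not a proposal from the post) of the gap described in the tl;dr above: instructions that a privileged internal agent acts on should have passed the same monitoring whether they were generated via the internal or the external API. The scope and source names are made up for the example.

```python
MONITORED_SOURCES = {"internal-api"}                       # completions that went through the internal monitor
PRIVILEGED_SCOPES = {"weights-infra", "training-pipeline"}

def may_execute(agent_scopes: set[str], completion_source: str) -> bool:
    """Only let privileged internal agents act on completions that were monitored."""
    if agent_scopes & PRIVILEGED_SCOPES:
        return completion_source in MONITORED_SOURCES
    return True

# An agent with access to weights infra acting on instructions from the (weakly monitored)
# external API is blocked; the same instructions via the monitored internal API would pass.
assert not may_execute({"weights-infra"}, "external-api")
assert may_execute({"weights-infra"}, "internal-api")
```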
J Bostock · 3d
I think this was predictable, since "Try to catalogue every way in which an AI could set up a rogue deployment" is an argument from exhaustive free association. I also predict that you will think of another one within a year. Conditional on you thinking of a new method, my gut says there's a 1/2 chance that it was something you could have thought of a year ago, and a 1/2 chance that it's based on some new empirical discovery about LLMs made between a year ago and when you think of it (a la "The model triggers emergent misalignment in a distilled mini version of itself.")
Fabien Roger · 3d
  1. My guess is that for control of things around human level, you should be able to use humans to find all the things that an LLM could have thought of. This is an update against long-in-advance small-scale efforts being good enough, not an update against the general methodology.
  2. I think there are domains where you can actually point at a small category of things that actually capture the most important threats. I think this exercise was for me an update in the direction of rogue internal deployments having lots of important and non-trivial subcategories as opposed to few.
ryan_greenblatt's Shortform
ryan_greenblatt
2y
ryan_greenblatt · 4d

Anthropic has now clarified this in their system card for Claude Haiku 4.5:
[Screenshot of the relevant passage from the Claude Haiku 4.5 system card]

Thanks to them for doing this!

See also Sam Bowman's tweet thread about this.

Current Language Models Struggle to Reason in Ciphered Language
41
Fabien Roger, Shiyuan Guo
5d

tl;dr: We fine-tune or few-shot LLMs to use reasoning encoded with simple ciphers (e.g. base64, rot13, putting a dot between each letter) to solve math problems. We find that these models only get an uplift from the reasoning (over directly answering) for very simple ciphers, and get no uplift for intermediate-difficulty ciphers that they can translate to English. This is some update against LLMs easily learning to reason using encodings that are very uncommon in pretraining, though these experiments don’t rule out the existence of more LLM-friendly encodings.

📄Paper, 🐦Twitter, 🌐Website

Research done as part of the Anthropic Fellows Program. Compute was generously funded by Scale and the Anthropic Fellows program.

Summary of the results

We teach LLMs to use one particular cipher, such as:

  • “letter to word with dot” maps each
...
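For concreteness, a minimal sketch of the simple ciphers named in the tl;dr (base64, rot13, and putting a dot between each letter). The "letter to word with dot" cipher is omitted because its definition is cut off in this preview, and the paper's actual fine-tuning / few-shot setup is not reproduced here.

```python
import base64
import codecs

def encode_base64(text: str) -> str:
    return base64.b64encode(text.encode()).decode()

def encode_rot13(text: str) -> str:
    return codecs.encode(text, "rot13")

def encode_dot_between_letters(text: str) -> str:
    return ".".join(text)

reasoning = "2 apples plus 3 apples is 5 apples"
print(encode_rot13(reasoning))                 # 2 nccyrf cyhf 3 nccyrf vf 5 nccyrf
print(encode_dot_between_letters("apples"))    # a.p.p.l.e.s
print(encode_base64(reasoning))                # base64-encoded version of the same string
```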
Jozdien · 5d
> • Speak using what looks like code words (e.g. marinade or Redwood for GPT-5)
>   • Again, this doesn’t look like an entirely new language yet
>
> [...]
>
> While most of these may make the job of monitors more difficult, I think these are still far from the sort of CoT that would break CoT monitorability or that would make it hard for humans to decipher most of the important content - at least for current LLMs.

I'm confused by this. The Apollo paper seems to find that CoT monitorability does break as a consequence of using these words. Section 4 in particular seems to i...

Bronson Schoen · 5d
I'm a bit confused about this distinction, given that in the Bowen paper the thing that breaks is precisely CoT monitorability. While this has the limitations that:

* it was directly induced by the training setup, and
* it is restricted to just the CoT (and not CoT + actions),

it seems to meet the bar of "empirically beat the CoT monitor". Or by "making the job of the monitors more difficult" do you mean that you could train a monitor on this CoT and recover monitorability? I think this is likely true, but "the model learns steganographic reasoning that breaks CoT monitorability for both a trained and an untrained monitor" seems like a much higher bar than beating an untrained monitor. I would definitely argue that the models in the Bowen paper break the "hard for humans to decipher most of the important content" threshold.
Fabien Roger · 5d
The thing I was pointing at is how the o3 scratchpads seem "shorter" (by default, even when you don't train against them), not how in Bowen et al the scratchpad stops mentioning the malicious activity entirely. Maybe I should have pointed at the weird GPT-5 CoT instead; it might have been less confusing. Given the simplicity of the hacks in Bowen et al, I don't think they require more reasoning than what a model can do in a forward pass - so my guess is that information about the hack is just completely absent. The thing I am trying to study in this post is the dynamics around CoT by necessity, not by propensity.

I think there is some intermediate hope like: "if the model sometimes uses weird distracting words, make sure your CoT monitor evals have weird distracting words, and if the model is less good at it than at regular English, change the prompt / do training to be robust to weird distracting words (potentially from a much broader distribution than what the model would do)". This is a bar that I expect people doing monitoring could meet with a bit of effort, and that is much easier to meet than actually having to train against the specific and weird steganographic encoding that the model would use if it used steganography (which requires you to understand the specific and weird steganographic encoding it is using).

Low confidence; I agree there is a real difficulty here.
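As a rough illustration of the "intermediate hope" above (my sketch, not from the comment): inject weird distracting words into CoTs whose labels are known and measure how much the monitor's recall degrades. `monitor` is a hypothetical callable, and the distractor list just reuses code words quoted earlier in this thread.

```python
import random
from typing import Callable

DISTRACTORS = ["marinade", "redwood"]  # code-word-like tokens mentioned earlier in the thread

def inject_distractors(cot: str, rate: float = 0.15, seed: int = 0) -> str:
    """Sprinkle distractor words into a chain of thought."""
    rng = random.Random(seed)
    out = []
    for word in cot.split():
        out.append(word)
        if rng.random() < rate:
            out.append(rng.choice(DISTRACTORS))
    return " ".join(out)

def monitor_recall(monitor: Callable[[str], bool],
                   labeled_cots: list[tuple[str, bool]],
                   distract: bool) -> float:
    """Fraction of truly-bad CoTs the monitor flags, with or without distractors."""
    flagged, positives = 0, 0
    for cot, is_bad in labeled_cots:
        if not is_bad:
            continue
        positives += 1
        text = inject_distractors(cot) if distract else cot
        flagged += monitor(text)
    return flagged / max(positives, 1)

# If recall drops a lot when distract=True, adjust the monitor's prompt or training
# to be robust to distracting words (from a broader distribution than the model's own).
```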
Raemon · 10d
Plans A, B, C, and D for misalignment risk
My main question is "why do you think Shut Down actually costs more political will?". I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem like particularly hard sells. I also think Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff").

> Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff.
>
> Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff."
>
> A Global Controlled Takeoff That Works has a lot of moving parts.
>
> You need the international agreement to be capable of making any kind of sensible distinctions between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs.
>
> You need the international agreement to not turn into a molochian, regulatory-captured horror that perversely reverses the intent of the agreement and creates a class of bureaucrats who don't know anything about AI and use the agreement to dole out favors.
>
> These problems still exist in some versions of Shut It Down too, to be clear (if you're trying to also ban algorithmic research – a lot of versions of that seem like they leave room to argue about whether agent foundations or interpretability count). But, they at least get coupled with "no large training runs, period."
>
> I think "guys, everyone just stop" is a way easier Schelling point to coordinate around, than "everyone, we're going to slow down and try to figure out alignment as best we can using current techniques."
>
> So, I am not currently convinced that Global Controlled Takeoff That Actually Works is any more politically tractable than Global Shut Down.
>
> (Caveat: Insofar as your plan is "well, we will totally get a molochian moral maze horror, but, it'll generally move slower and that buys time", eh, okay, seems reasonable. But, at least be clear to yourself about what you're aiming for.)

I agree you do eventually want to go back to Plan A anyway, so I mostly am just not seeing why you really want to treat these as separate plans, rather than a single: "Okay, we wanna get all the compute centralized and monitored, we want a lot more control over GPU production, we want to buy as much time and proceed as carefully as we can. At any given time, we want the option to either be basically shut down or in controlled-takeoff mode, depending on some conditions on the ground."

I agree with some of the risks of "the geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress", but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff World (in particular in a "shut it down" world where the framing is "we are eventually going to build a thing that everyone agrees is good, we're just making sure we get it right").

But "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying our best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys, this is dangerous enough to warrant massive GPU monitoring but... still trying to push ahead as fast as we can?".
Jakub Growiec · 19d
Four ways learning Econ makes people dumber re: future AI
As a professor of economics and a self-proclaimed doomer myself, I greatly appreciate this post! These are almost exactly my feelings when talking to fellow economists, who typically think that, by an unspoken assumption, all AI will always be normal technology, a tool in people's hands. I think your capital/labor point is particularly spot on. I've had a problem with that framing for several years now. That's why I proposed a "hardware-software" framework, which I elaborated in a few of my papers and one book. The idea is simple: just divide production factors differently! The key distinction is not whether it's man or machine, it's whether it's physical work or information processing. More in a LW post: The Hardware-Software Framework: A New Perspective on Economic Growth with AI — LessWrong, and in my 2022 book, Accelerating Economic Growth: Lessons From 200,000 Years of Technological Progress and Human Development | SpringerLink.
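A schematic way to write down the division described above, where output requires both "hardware" (physical work: physical capital K and human physical labor) and "software" (information processing: human cognitive work and digital software/AI). The CES-style aggregators below are only an illustrative guess, not the exact functional forms from Growiec's papers or book.

```latex
\[
  Y = F(H, S), \qquad
  H = \bigl(\alpha K^{\rho} + (1-\alpha)\, L_{\mathrm{ph}}^{\rho}\bigr)^{1/\rho}, \qquad
  S = \bigl(\beta\, L_{\mathrm{cog}}^{\sigma} + (1-\beta)\, A^{\sigma}\bigr)^{1/\sigma}
\]
```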
18 · Reducing risk from scheming by studying trained-in scheming behavior · ryan_greenblatt · 3d · 0 comments
22 · Rogue internal deployments via external APIs · Fabien Roger, Buck · 3d · 2 comments
41 · Current Language Models Struggle to Reason in Ciphered Language · Fabien Roger, Shiyuan Guo · 5d · 3 comments
45 · Recontextualization Mitigates Specification Gaming Without Modifying the Specification · ariana_azarbal, vgillioz, TurnTrout, cloud · 5d · 0 comments
23 · Iterated Development and Study of Schemers (IDSS) · ryan_greenblatt · 9d · 0 comments
2 · Assuring Agent Safety Evaluations By Analysing Transcripts · bigmouthmodels, Cozmin Ududec, Benjamin Hilton · 9d · 0 comments
54 · Realistic Reward Hacking Induces Different and Deeper Misalignment · Jozdien · 9d · 0 comments
37 · The Thinking Machines Tinker API is good news for AI control and security · Buck · 10d · 7 comments
68 · Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior · Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen, Fabien Roger · 10d · 6 comments
54 · Plans A, B, C, and D for misalignment risk · ryan_greenblatt · 11d · 39 comments