AI ALIGNMENT FORUM

AI Alignment Posts

Popular Comments

Raemon · 17d
Plans A, B, C, and D for misalignment risk
My main question is "why do you think Shut Down actually costs more political will?" I think Plan A and "Shut It Down" both require very similar opening steps that are the most politically challenging part AFAICT, and once the world is even remotely considering those steps, the somewhat different shut-it-down steps don't seem like particularly hard sells. I also think a Plan A "bad implementation" is much more likely, and also much worse (again see "Shut It Down" is simpler than "Controlled Takeoff"):

> Gear 2: You need to compare the tractability of Global Shut Down vs Global Controlled Takeoff That Actually Works, as opposed to Something That Looks Close To But Not Actually A Controlled Takeoff.
>
> Along with Gear 3: "Shut it down" is much simpler than "Controlled Takeoff."
>
> A Global Controlled Takeoff That Works has a lot of moving parts.
>
> You need the international agreement to be capable of making any kind of sensible distinction between safe and unsafe training runs, or even "marginally safer" vs "marginally less safe" training runs.
>
> You need the international agreement to not turn into a molochian, regulatory-captured horror that perversely reverses the intent of the agreement and creates a class of bureaucrats who don't know anything about AI and use the agreement to dole out favors.
>
> These problems still exist in some versions of Shut It Down too, to be clear (if you're trying to also ban algorithmic research, a lot of versions of that seem to leave room to argue about whether agent foundations or interpretability count). But they at least get coupled with "no large training runs, period."
>
> I think "guys, everyone just stop" is a way easier Schelling point to coordinate around than "everyone, we're going to slow down and try to figure out alignment as best we can using current techniques."
>
> So, I am not currently convinced that Global Controlled Takeoff That Actually Works is any more politically tractable than Global Shut Down.
>
> (Caveat: Insofar as your plan is "well, we will totally get a molochian moral-maze horror, but it'll generally move slower and that buys time", eh, okay, seems reasonable. But at least be clear to yourself about what you're aiming for.)

I agree you do eventually want to go back to Plan A anyway, so I'm mostly just not seeing why you really want to treat these as separate plans, rather than a single: "Okay, we wanna get all the compute centralized and monitored, we want a lot more control over GPU production, we want to buy as much time and proceed as carefully as we can. At any given time, we want the option to either be basically shut down or in controlled-takeoff mode, depending on some conditions on the ground."

I agree with some of the risks of "the geopolitical situation might get harder to have control over" and "humanity generally becoming anti-progress", but these don't even seem strictly worse in Shutdown World vs Controlled Takeoff World (in particular in a "shut it down" world where the framing is "we are eventually going to build a thing that everyone agrees is good, we're just making sure we get it right").

But "guys, this is very dangerous, we are proceeding very carefully before summoning something smarter than us, while trying our best to all reap the benefits of it" seems like a way easier narrative to get everyone bought into than "guys, this is dangerous enough to warrant massive GPU monitoring but... we're still trying to push ahead as fast as we can?"
Jakub Growiec · 26d
Four ways learning Econ makes people dumber re: future AI
As a professor of economics and a self-proclaimed doomer myself, I greatly appreciate this post! These are almost exactly my feelings when talking to fellow economists, who typically think, by an unspoken assumption, that all AI will always be normal technology, a tool in people's hands. I think your capital/labor point is particularly spot on. I've had a problem with that framing for several years now. That's why I proposed a "hardware-software" framework, which I elaborated on in a few of my papers and one book. The idea is simple: just divide production factors differently! The key distinction is not whether it's man or machine, it's whether it's physical work or information processing. More in a LW post: The Hardware-Software Framework: A New Perspective on Economic Growth with AI — LessWrong, and in my 2022 book, Accelerating Economic Growth: Lessons From 200,000 Years of Technological Progress and Human Development | SpringerLink.
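To make the factor split concrete, here is an illustrative sketch; the symbols H, S, G, G', L_physical, L_cognitive, and A, and the particular decomposition, are my own gloss on the idea described above, not equations taken from the post or the book.

% Illustrative sketch only (notation assumed, not from the post or book).
% Standard split:           Y = F(K, L)    -- physical capital vs. labor
% Hardware-software split:  Y = F(H, S)    -- physical action vs. information processing
\[
  Y = F(H, S), \qquad
  H = G\!\left(K,\, L_{\mathrm{physical}}\right), \qquad
  S = G'\!\left(L_{\mathrm{cognitive}},\, A\right),
\]
% where H ("hardware") aggregates physical capital K and human physical work, and
% S ("software") aggregates human cognitive work and A, pre-programmed software / AI.
% The point of the split: AI enters S alongside human cognition, rather than being
% lumped into "capital".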
Natural Abstractions: Key Claims, Theorems, and Critiques
Best of LessWrong 2023

Lawrence, Erik, and Leon attempt to summarize the key claims of John Wentworth's natural abstractions agenda, formalize some of the mathematical proofs, outline how it aims to help with AI alignment, and critique gaps in the theory, relevance to alignment, and research methodology.

by LawrenceC
18 · Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?
Alek Westover
3d
0
1 · Technical Acceleration Methods for AI Safety: Summary from October 2025 Symposium
Martin Leitgab
3d
0
19 · Reducing risk from scheming by studying trained-in scheming behavior
ryan_greenblatt
10d
0
22 · Rogue internal deployments via external APIs
Fabien Roger, Buck
10d
2
41 · Current Language Models Struggle to Reason in Ciphered Language
Fabien Roger, Shiyuan Guo
12d
3
45 · Recontextualization Mitigates Specification Gaming Without Modifying the Specification
ariana_azarbal, Victor Gillioz, TurnTrout, cloud
12d
0
23 · Iterated Development and Study of Schemers (IDSS)
ryan_greenblatt
16d
0
2 · Assuring Agent Safety Evaluations By Analysing Transcripts
bigmouthmodels, Cozmin Ududec, Benjamin Hilton
16d
0
54 · Realistic Reward Hacking Induces Different and Deeper Misalignment
Jozdien
17d
0
37 · The Thinking Machines Tinker API is good news for AI control and security
Buck
17d
7

Recent Discussion

leogao's Shortform
leogao
3y
Wei Dai · 2h

Have you seen A Master-Slave Model of Human Preferences? To summarize, I think every human is trying to optimize for status, consciously or subconsciously, including those who otherwise fit your description of idealized platonic researcher. For example, I'm someone who has (apparently) "chosen ultimate (intellectual) freedom over all else", having done all of my research outside of academia or any formal organizations, but on reflection I think I was striving for status (prestige) as much as anyone, it was just that my subconscious picked a different strat... (read more)

Research Agenda: Synthesizing Standalone World-Models
38
Thane Ruthenis
1mo

tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda's subproblems and my sketches of how to tackle them.


Back at the end of 2023, I wrote the following:

I'm fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)

On the inside view, I'm pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a...

(Continue Reading - 3241 more words)
6 · Rohin Shah · 7d
Say what now? This seems very false:

* See almost anything physical (riding a bike, picking things up, touch typing on a keyboard, etc.). If you have a dominant hand / leg, try doing some standard tasks with the non-dominant hand / leg. Seems like if the human mind could symbolically interpret its own world model, this should be much easier to do.
* Basically anything to do with vision / senses. Presumably if vision were symbolically interpretable to the mind, there wouldn't be much of a skill ladder to climb for things like painting.
* Symbolic grammar usually has to be explicitly taught to people, even though ~everyone has a world model that clearly includes grammar (in the sense that they can generate grammatical sentences and identify errors in grammar).

Tbc, I can believe it's true in some cases, e.g. I could believe that some humans' far-mode abstract world models are approximately symbolically interpretable to their mind (though I don't think mine is). But it seems false in the vast majority of domains (if you are measuring relative to competent, experienced people in those domains, as seems necessary if you are aiming for your system to outperform what humans can do).
3 · Thane Ruthenis · 2d
Right, I probably should've expected this objection and pre-addressed it more thoroughly.

I think this is a bit of "missing the forest for the trees". In my view, every single human concept and every single human train of thought is an example of human world-models' autosymbolicity. What are "cats", "trees", and "grammar", if not learned variables from our world-model that we could retrieve, understand the semantics of, and flexibly use for the purposes of reasoning/problem-solving?

We don't have full self-interpretability by default, yes. We have to reverse-engineer our intuitions and instincts (e.g., grammar from your example), and for most concepts, we can't break their definitions down into basic mathematical operations. But in modern adult humans, there is a vast interpreted structure that contains an enormous amount of knowledge about the world, corresponding to, well, literally everything a human consciously knows. Which, importantly, includes every fruit of our science and technology.

If we understood an external superhuman world-model as well as a human understands their own world-model, I think that'd obviously get us access to tons of novel knowledge.
Rohin Shah · 1d

> Which, importantly, includes every fruit of our science and technology.

I don't think this is the right comparison, since modern science / technology is a collective effort and so can only cumulate progress through mostly-interpretable steps. (This may also be true for AI, but if so then you get interpretability by default, at least interpretability-to-the-AIs, at which point you are very likely better off trying to build AIs that can explain that to humans.)

In contrast, I'd expect individual steps of scientific progress that happen within a single mind oft... (read more)

4. Existing Writing on Corrigibility
26
Max Harms
1y

(Part 4 of the CAST sequence)

This document is an in-depth review of the primary documents discussing corrigibility that I’m aware of. In particular, I'll be focusing on the writing of Eliezer Yudkowsky and Paul Christiano, though I’ll also spend some time at the end briefly discussing other sources. As I go through the writing of those who’ve come before, I want to specifically compare and contrast those ideas with the conceptualization of corrigibility put forth in earlier documents and the strategy proposed in The CAST Strategy. At a high level I mostly agree with Christiano, except that he seems to think we’ll get corrigibility emergently, whereas I think it’s vital that we focus on directly training purely corrigible agents (and he wants to focus on recursive architectures...

(Continue Reading - 31674 more words)
Joe Carlsmith · 1d

I appreciated the detailed discussion and literature review here -- thanks.

Video and transcript of talk on giving AIs safe motivations
9
Joe Carlsmith
1mo

(This is the video and transcript of a talk I gave at the UT Austin AI and Human Objectives Initiative in September 2025. The slides are also available here. The main content of the talk is based on this recent essay.)

Talk

Hi, everyone. Thank you for coming. I'm honored to be part of this series and part of the beginning of this series.

Plan

I'm going to briefly introduce the core AI alignment problem as I see it. It's going to be a particular version of that problem, the version that I think is highest stakes. And then I'm going to talk about my current high-level picture of how that problem gets solved at a technical level. There's a bunch of aspects of this problem that aren't the technical level that...

(Continue Reading - 14790 more words)
3 · Steven Byrnes · 4d
The headings “behavioral tools” and “transparency tools” both kinda assume that a mysterious AI has fallen out of the sky, and now you have to deal with it, as opposed to either thinking about, or intervening on, how the AI is trained or designed. (See Connor’s comment here.) (Granted, you do mention “new paradigm”, but you seem to be envisioning that pretty narrowly as a transparency intervention.) I think that’s an important omission. For example, it seems to leave out making inferences about Bob from the fact that Bob is human. That’s informative even if I’ve never met Bob (no behavioral data) and can’t read his mind (no transparency). (Sorry if I’m misunderstanding.)
Joe Carlsmith · 2d

Hi Steve -- thanks for this comment. I can see how the vibe of the talk/piece might call to mind something like "studying/intervening on an existing AI system" rather than focusing on how it's trained/constructed, but I do mean for the techniques I discuss to cover both. For example, and re: your Bob example, I talk about our existing knowledge of human behavior as an example of behavioral science here -- and I talk a lot about studying training as a part of behavioral science, e.g.:

> Let’s call an AI’s full range of behavior across all safe and accessibl

... (read more)
Jesse Hoogland's Shortform
Jesse Hoogland
3y
30 · Jesse Hoogland · 4d
We recently put out a new paper on a scalable generalization of influence functions, which quantify how training data affects model behavior (see Nina's post). I'm excited about this because it takes a completely new methodological approach to measuring influence.

Instead of relying on a Hessian inverse (which is ill-defined and expensive), our new "Bayesian" influence functions (BIF) rely on a covariance calculation (which can be scalably estimated with MCMC). This approach is more theoretically sound (no more Hessian inverses), and it achieves what I think is a more desirable set of engineering tradeoffs (better model-size scaling but worse dataset-size scaling).

At Timaeus, we think these kinds of techniques are on the critical path to safety. Modern alignment techniques like RLHF and Constitutional AI are about controlling model behavior by selecting the right training data. If this continues to be the case, we will need better tools for understanding and steering the pipeline from data to behavior.

It's still early days for the BIF. We've done some initial validation on retraining benchmarks and other quantitative tests (follow-up work coming soon), where the BIF comes out looking strong, but more work will be needed to understand the full set of costs and benefits. As that foundation gets established, we expect we'll be able to start applying these techniques directly to safety-relevant problems.

You can read the full announcement thread on X (reproduced below):
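For intuition, a rough sketch of the contrast between the two estimators (the per-example losses ℓᵢ, the posterior, and the notation below are my gloss on the idea, not the paper's exact definitions):

% Classical influence function: effect of upweighting training point z_i on the test
% loss, via the inverse Hessian of the training loss at the trained parameters.
\[
  \mathrm{IF}(z_i, z_{\mathrm{test}})
    = -\,\nabla_\theta \ell_{\mathrm{test}}(\hat\theta)^{\top}
      H(\hat\theta)^{-1}
      \nabla_\theta \ell_i(\hat\theta),
  \qquad
  H(\hat\theta) = \nabla_\theta^2 \sum_j \ell_j(\hat\theta).
\]
% Covariance-style ("Bayesian") influence: correlate per-example losses across
% parameter samples drawn from a (local) posterior, e.g. via SGLD/MCMC sampling;
% no Hessian inverse required. A sketch of the idea, not the paper's exact definition.
\[
  \mathrm{BIF}(z_i, z_{\mathrm{test}})
    \approx \mathrm{Cov}_{\theta \sim p(\theta \mid \mathcal{D})}
      \bigl(\, \ell_i(\theta),\; \ell_{\mathrm{test}}(\theta) \,\bigr).
\]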
leogao · 2d

very exciting! influence functions are one of the best approaches for understanding generalization systematically. always excited to see improvements to influence function methodology.

4 · Logan Riggs · 2d
Great work! I listened to a talk from Philipp on it today and am confused about why we can't just make a better benchmark than LDS. Why not just train e.g. 1k different models, where you left 1 datapoint out? LDS is noisy, so I'm assuming 1k datapoints that exactly capture what you want is better than 1M datapoints that are an approximation. [1]

As an estimate, the Nano-GPT speedrun takes a little more than 2 min now, so you can train 1001 of these in 2.33*1k/60 = 38 hrs on 8 H100s, which is maybe 4 B200s at $24/hr, so ~$1k. And that's getting a 124M-param LLM trained on 730M tokens up to GPT-2 level. Y'all's quantitative setting for Fig 4 was a 2M-parameter ResNet on CIFAR-10 with 5k images, which would be much cheaper to do (although the GPT-2 one has been very optimized, so you could just do the speedrun one but on less data).

LDS was shown to be very noisy, but a colleague mentioned that this could be because 5k images is a very small amount of data. I guess another way to validate LDS is running the expensive full-training run on a few datapoints.

Confusion on LDS Hyperparameter Sweep Meaning

Y'all show in Fig 4 that there are large error bars across seeds for the different methods. This ends up being a property of LDS's noisiness, as y'all show in Figures 7-8 (where BIF & EK-FAC are highly correlated). This means that, even using noisy LDS, you don't need to re-run 5 times if a new method is much better than previous ones (only if it's narrowly better).

What I'm confused about is why you retrained on 100 different ways to resample the data at each percentage. Is this just because LDS is noisy, so you're doing the thing where randomly sampling 100 datapoints 500 times gives you a good approximation of the causal effect of each individual datapoint (or is that what LDS actually is)? Was there high variance in the relative difference between methods across the 100 retrained models?

Other Experiments

Just wild speculation that there are other data attr
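(For reference, the back-of-the-envelope cost above works out as follows, taking the quoted 2.33 min per speedrun and the $24/hr figure for 4 B200s as given:)

\[
  \frac{2.33\ \text{min} \times 1000\ \text{runs}}{60\ \text{min/hr}} \approx 38.8\ \text{hr},
  \qquad
  38.8\ \text{hr} \times \$24/\text{hr} \approx \$930 \approx \$1\text{k}.
\]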
3 · ryan_greenblatt · 3d
I'm pretty excited about building tools/methods for better dataset influence understanding, so this intuitively seems pretty exciting! (I'm interested both in better cheap approximations of the effects of leaving some data out and of the effects of adding some data in.) (I haven't looked at the exact method and results in this paper yet.)
7 · Vanessa Kosoy
This post is a great review of the Natural Abstractions research agenda, covering both its strengths and weaknesses. It provides a useful breakdown of the key claims, the mathematical results, and the applications to alignment. There's also reasonable criticism.

To the weaknesses mentioned in the overview, I would also add that the agenda needs more engagement with learning theory. Since the claim is that all minds learn the same abstractions, it seems necessary to look into the process of learning and see what kinds of abstractions can or cannot be learned (both in terms of sample complexity and in terms of computational complexity).

Some thoughts about natural abstractions inspired by this post:

* The concept of natural abstractions seems closely related to my informally conjectured agreement theorem for infra-Bayesian physicalism. In a nutshell, two physicalist agents in the same universe with access to "similar" information should asymptotically arrive at similar beliefs (notably, this is false for Cartesian agents because of the different biases resulting from the different physical points of view).
* A possible formalization of the agreement theorem, inspired by my richness of mathematics conjecture: Given two beliefs Ψ and Φ, we say that Ψ⪯Φ when some conditioning of Ψ on a finite set of observations produces a refinement of some conditioning of Φ on a finite set of observations (see the linked shortform for mathematical details; a symbolic sketch follows below). This relation is a preorder. In general, we can expect an agent to learn a sequence of beliefs of the form Ψ₀≺Ψ₁≺Ψ₂≺… Here, the sequence can be over physical time, over time discount, or over a parameter such as "availability of computing resources" or "how much time the world allows you for thinking between decisions": the latter is the natural asymptotic for metacognitive agents (see also logical time). Given two agents, we get two such sequences {Ψᵢ} and {Φᵢ}. The agreement theorem can then state that for all i∈ℕ, there exists j
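A minimal symbolic rendering of the ⪯ relation and the learning sequence described above; the notation (Ψ ∣ O) for conditioning on a finite observation set O, and ⊑ for "is a refinement of", is shorthand introduced here, not taken from the linked shortform:

% Sketch of the preorder on beliefs defined in the bullet above (shorthand notation).
\[
  \Psi \preceq \Phi
  \quad\iff\quad
  \exists\ \text{finite } O_1, O_2 :\;
  (\Psi \mid O_1) \sqsubseteq (\Phi \mid O_2),
\]
% i.e., some conditioning of \Psi refines some conditioning of \Phi. An agent's
% learning process is then an increasing chain of beliefs, indexed by physical time,
% time discount, or available computing resources:
\[
  \Psi_0 \prec \Psi_1 \prec \Psi_2 \prec \cdots
\]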