A theory of how alignment research should work
(cross-posted from danielfilan.com)
Epistemic status: Maybe obvious to everyone but me, or totally wrong (this doesn't really grapple with the challenges of working in a domain where an intelligent being might be working against you), but:
Since people have reported not being able to see the tweet thread, I will reproduce it in this comment (with pictures replaced by my descriptions of them):
If developers had to prove to regulators that powerful AI systems are safe to deploy, what are the best arguments they could use?
Our new report tackles the (very big!) question of how to make a ‘safety case’ for AI.
[image of the start of the paper]
We define a safety case as a rationale developers provide to regulators to show that their AI systems are unlikely to cause a catastrophe.
The term ‘safety case’ is not new. In many industries (e.g. aviation), products are ‘put on trial’ before they are released.
[cartoon of a trial: regulator is a judge, one side is a developer advancing a safety case, other side is a red team advancing a risk case]
We simplify the process of making a safety case by breaking it into six steps.
- Specify the macrosystem (all AI systems) and the deployment setting.
- Concretize 'AI systems cause a catastrophe' into specific unacceptable outcomes (e.g. the AI systems build a bioweapon).
- Justify claims about the deployment setting.
- Carve up the collection of AI systems into smaller groups (subsystems) that can be analyzed in isolation.
- Assess risk from subsystems acting unilaterally.
- Assess risk from subsystems cooperating together.
[The above, but in picture format]
We first focus on step 5: how would developers argue that individual subsystems are safe? These arguments can be grouped into four categories.
Arguments could assert AI systems are:
- too weak to be dangerous even if open-sourced (inability)
- unable to cause disaster due to restrictions (control)
- safe despite ability to cause harm (trustworthiness)
- judged to be safe by credible AI advisors (deference)
[The above, but in picture format and slightly reworded]
Consider the analogy of a robber and a diamond. How might one justify that the robber won’t steal the diamond?
[Image with heading: "Why won't the robber steal the diamond?".
Row 1 says "Inability The robber is an infant. Analogy: e.g. GPT-3 is clearly too weak to be dangerous", next to an image of a baby near a diamond on a pedestal.
Row 2 says "Control The robber is in jail. Analogy: e.g. AI systems are sandboxed and can't escape.", next to an image of a prisoner in a box with bars, near a diamond on a pedestal.
Row 3 says "Trustworthiness The robber does not steal. Analogy: e.g. AI systems follow developer instructions.", next to an image of a person sitting down cross-legged and wearing an orange robe, next to a diamond on a pedestal.
Row 4 says "Deference The security service CEO says the diamond is secure. Analogy: AI advisors say that AI systems are safe.", next to an image of a figure with grey hair saying "All good" next to a diamond on a pedestal with lines coming out of it, connecting it to small grey boxes (sensors?).]
The ordering of these categories is intentional. As AI systems become more powerful, developers will likely rely mostly on inability, then control, then trustworthiness, and finally, deference to AI advisors.
[Image of graph where the horizontal axis is "Increasingly powerful AI" and the vertical axis is "Primary safety argument". Inability, Control, Trustworthiness, and Deference are shown in order from bottom-left to top-right. An arrow connects the words "We are here" to Inability.]
Next, we give examples of arguments in each category. Arguments are ranked on three axes:
- Practicality
- Strength
- Scalability
No argument received full marks! Research will be needed to justify the safety of advanced AI systems.
[A complicated diagram showing a variety of arguments under the Inability, Control, Trustworthiness, and Deference categories, together with ratings for their Practicality, Maximum Strength, and Scalability.]
The arguments in the previous step pertain to small groups of AI systems; it would be difficult to apply them directly to large groups. So we also explain how to justify that the actions of many AI systems won’t cause a catastrophe (step 6 in our framework).
[Image titled "Large-scale AI misbehavior". Below are 3 rows, with 2 columns. The left column is labelled "Causes" and the right is labelled "Strategies".
Row 1: Cause: Widespread alignment faking. Strategy: Blitzkrieg: overwhelm controls.
Row 2: Cause: Infectious jailbreaks. Strategy: Strike: disable infrastructure.
Row 3: Cause: Rapid memetic value drift. Strategy: Hivemind: combine intelligence.
Dots are shown below, likely indicating that there are more causes and strategies not shown.]
We are hoping this report will:
- Motivate research that further clarifies the assumptions behind safety arguments.
- Inform the design of hard safety standards.
More in the paper: https://bit.ly/3IJ5N95 Many thanks to my coauthors! @NickGabs01, @DavidSKrueger, and @thlarsen.
Might be of interest to @bshlgrs, @RogerGrosse, @DavidDuvenaud, @EvanHub, @aleks_madry, @ancadianadragan, @rohinmshah, @jackclarkSF, @Manderljung, @RichardMCNgo
FWIW, the discussion of AI-driven propaganda doesn't seem as prescient.
So [in 2024], the most compute spent on a single training run is something like 5x10^25 FLOPs.
As of June 20th 2024, this is exactly Epoch AI's central estimate of the most compute spent on a single training run, as displayed on their dashboard.
Frankfurt-style counterexamples for definitions of optimization
In "Bottle Caps Aren't Optimizers", I wrote about a type of definition of optimization that says system S is optimizing for goal G iff G has a higher value than it would if S didn't exist or were randomly scrambled. I argued against these definitions by providing a examples of systems that satisfy the criterion but are not optimizers. But today, I realized that I could repurpose Frankfurt cases to get examples of optimizers that don't satisfy this criterion.
A Frankfurt case is a thought experiment designed to disprove the following intuitive principle: "a person is morally responsible for what she has done only if she could have done otherwise." Here's the basic idea: suppose Alice is considering whether or not to kill Bob. Upon consideration, she decides to do so, takes out her gun, and shoots Bob. But unbeknownst to her, a neuroscientist had implanted a chip in her brain that would have forced her to shoot Bob if she had decided not to. As it happened, the chip didn't activate, because she did decide to shoot Bob. The idea is that she's morally responsible, even though she couldn't have done otherwise.
Anyway, let's do this with optimizers. Suppose I'm playing Go, thinking about how to win: imagining what would happen if I played various moves, and playing moves that make me more likely to win. Further suppose I'm pretty good at it. You might want to say I'm optimizing my moves to win the game. But suppose that, unbeknownst to me, famed Go master Shin Jinseo is standing behind my shoulder. If I start playing really bad moves, or suddenly die or vanish, etc., he will play my moves, and do an even better job of winning. Now, if you remove me or randomly rearrange my parts, my side is actually more likely to win the game. But that doesn't mean I'm optimizing to lose the game! So this is another way such definitions of optimizers are wrong.
That said, other definitions handle this counterexample well. E.g. I think the one given in "The ground of optimization" says that I'm optimizing to win the game (maybe only if I'm playing a weaker opponent).
Thanks for finding this! Will link it in the transcript.
Hmm, I think the problem is that the equals sign in your URL is being percent-encoded (as "%3D") rather than being treated as a raw equals sign. Weird.
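A minimal Python sketch of the kind of thing I mean (the example URL and the use of `urllib.parse` are just for illustration; I don't know what actually mangled the link):

```python
from urllib.parse import quote, unquote

# Hypothetical URL, purely for illustration.
url = "https://example.com/page?x=1"

# Over-eager percent-encoding turns the raw "=" into "%3D",
# which breaks tools that expect a literal key=value separator.
encoded = quote(url, safe=":/?")
print(encoded)           # https://example.com/page?x%3D1
print(unquote(encoded))  # https://example.com/page?x=1  (decoding restores it)
```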
For one particularly legible example, see this comment by janus.
Link should presumably be to this comment.
The other is the friendly robot waving hello just underneath.
When I wrote that, I wasn't thinking so much about evals / model organisms as stuff like:
basically stuff along the lines of "when you put agents in X situation, they tend to do Y thing", rather than trying to understand latent causes / capabilities