I sometimes think about plans for handling misalignment risk. Different levels of political will make different plans the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes.
In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan.
Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory:
The Global Call for AI Red Lines was signed by 12 Nobel Prize winners, 10 former heads of state and ministers, and over 300 prominent signatories. It was launched at the UN General Assembly and presented to the UN Security Council. There is still much to be done, so we need to capitalize on this momentum. We are sharing this to solicit feedback and collaboration.
The mission of the agenda is to help catalyse international agreement to prevent unacceptable AI risks as soon as possible.
Here are the main research projects I think are important for moving this needle. We need all hands on deck:
...We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety
I completely agree! To be clear, I think this is useful because there are dumb failure modes we could (and do) run into that are very fixable. Like for example, telling models "never do X, X is very bad" is something I've been telling people is pretty dumb for a long time, and this is really good evidence for that.
I agree that there are many reasons why this probably wouldn't work as a general alignment solution, and I didn't intend it to sound that way. The reason I think it's elegant is that it fixes a secondary problem that seemed to be causing pretty dumb, fixable problems.
Hans Moravec: Behold my book Mind Children. Within, I project that, in 2010 or thereabouts, we shall achieve strong AI. I am not calling it "Artificial General Intelligence" because this term will not be coined for another 15 years or so.
Eliezer (who is not actually on the record as saying this, because the real Eliezer is, in this scenario, 8 years old; this version of Eliezer has all the meta-heuristics of Eliezer from 2021, but none of that Eliezer's anachronistic knowledge): Really? That sounds like a very difficult prediction to make correctly, since it is about the future, which is famously hard to predict.
Imaginary Moravec: Sounds like a fully general counterargument to me.
Eliezer: Well, it is, indeed, a fully general counterargument against futurism. Successfully predicting...
Imprecisely multiplying two analog numbers should not require 10^5 times the minimum bit energy in a well-designed computer.
A well-designed computer would also use, say, optical interconnects that worked by pushing one or two photons around at the speed of light. So if neurons are in some sense being relatively efficient at the given task of pumping thousands upon thousands of ions in and out of a depolarizing membrane in order to transmit signals at 100m/sec -- every ion of which necessarily uses at least the Landauer minimum energy -- they are being vas...
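To make the orders of magnitude in this excerpt concrete, here is a quick back-of-the-envelope sketch in Python. The 10^5 factor comes from the excerpt above; the temperature and membrane-potential values are illustrative assumptions on my part, not figures from the post.

```python
import math

# Physical constants (SI units)
K_B = 1.380649e-23                   # Boltzmann constant, J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C

# Assumptions (not from the excerpt): body temperature and a membrane
# potential swing on the order of ~100 mV.
T_BODY = 310.0             # K
MEMBRANE_POTENTIAL = 0.1   # V (illustrative)

# Landauer limit: minimum energy to erase one bit at temperature T.
landauer_bit_energy = K_B * T_BODY * math.log(2)

# The excerpt's "10^5 times the minimum bit energy" figure, made concrete.
per_multiply_energy = 1e5 * landauer_bit_energy

# Rough energy to move a single ion across the membrane potential.
per_ion_energy = ELEMENTARY_CHARGE * MEMBRANE_POTENTIAL

print(f"Landauer bound at 310 K:   {landauer_bit_energy:.2e} J/bit")
print(f"1e5 x Landauer bound:      {per_multiply_energy:.2e} J")
print(f"Energy per ion at 100 mV:  {per_ion_energy:.2e} J "
      f"(~{per_ion_energy / landauer_bit_energy:.1f}x the Landauer bound)")
```

With these assumptions the Landauer bound works out to roughly 3e-21 J per bit, and moving a single ion across ~100 mV costs a few times that, which is the kind of comparison the excerpt is gesturing at.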
In a comment on my post on topological truth predicates, Paul suggests an approach that uses probabilistic oracle machines instead, in order to make this work more comprehensible to computer scientists. I like this idea!
Paul sketches a framework developed by him and Jessica Taylor, based on a conversation with Scott Aaronson; in this post, I propose a slight simplification of their framework. My version has an oracle $O$, which takes the source code of a probabilistic oracle machine $M$ and a rational number $p \in [0,1]$. If, for every possible oracle $O'$, $M^{O'}$ halts with probability one and outputs either $0$ or $1$, then $O(M, p)$: (i) returns "true" if the probability that $M^O$ returns $1$ is greater than $p$; (ii) returns "false" if it is less than $p$; (iii) randomly returns "true" or "false" if it is exactly $p$.
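To illustrate just the trichotomy in this definition, here is a toy Python sketch for the degenerate case of a machine that never consults its oracle, so its output distribution is fixed and can simply be estimated by sampling. The actual substance of the framework, finding an oracle consistent with machines that query it about themselves, is not captured here.

```python
import random

def coin_machine():
    """A probabilistic 'oracle machine' that ignores its oracle and
    returns 1 with probability 0.7, else 0."""
    return 1 if random.random() < 0.7 else 0

def toy_oracle(machine, p, samples=100_000):
    """Toy query O(M, p) for machines that ignore the oracle.

    Estimates q = P(M returns 1) by sampling, then follows the
    trichotomy from the definition above:
      - True  if q > p
      - False if q < p
      - a fair coin flip if q == p (up to sampling noise).
    """
    q = sum(machine() for _ in range(samples)) / samples
    if q > p:
        return True
    if q < p:
        return False
    return random.choice([True, False])

# Example queries: is P(M returns 1) greater than each threshold?
for p in (0.5, 0.7, 0.9):
    print(p, toy_oracle(coin_machine, p))
```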
I think that I need this in my variant of AIXI in order to filter out "world models" which don't necessarily halt, and I think this will be enough to do so, but I'll leave working out the details to a later post.
AIXI's hypothesis class is the class of (lower semicomputable chronological) semimeasures, so I'm not sure why halting should be required for this application.
(My question is now mainly of historical interest, since later versions of reflective oracles do not require queries to be about machines that halt, and since they "complete" the distributions generated by those machines to proper measures.)
Some AI labs claim to care about AI safety, but continue trying to build AGI anyway. Peter argues they should explicitly state why they think this is the right course of action, given the risks. He suggests they should say something like "We're building AGI because [specific reasons]. If those reasons no longer held, we would stop."
I have three main critiques:
- The China Problem: Plan B’s 13% risk doesn’t make sense if China (DeepSeek) doesn’t slow down and is only 3 months behind. The real risk is probably the same as for Plan E (75%), unless there is a pivotal act.
- Political Will as Strategy: The framework treats political will as a background variable rather than a key strategic lever. D→C campaigns could reduce expected risk by 11+ percentage points, nearly 30% of the total risk (a back-of-the-envelope version of this arithmetic is sketched below). A campaign to move from E→D would also be highly strategic and might only require talking to a handful of employees.
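To make the kind of arithmetic behind "reduce expected risk by 11+ percentage points" explicit, here is a minimal expected-value sketch in Python. All probabilities and the per-plan risks for A, C, and D are hypothetical placeholders (only B's 13% and E's 75% appear in the comment above), so the output will not reproduce the comment's figures; the point is only the structure of the calculation.

```python
# Hypothetical risk of doom conditional on each plan being the one we
# end up executing. Only B (13%) and E (75%) come from the comment
# above; A, C, and D are illustrative placeholders.
risk = {"A": 0.05, "B": 0.13, "C": 0.25, "D": 0.45, "E": 0.75}

def expected_risk(plan_probs):
    """Expected risk given a distribution over which plan the world follows."""
    return sum(p * risk[plan] for plan, p in plan_probs.items())

# Hypothetical baseline distribution over plans.
baseline = {"A": 0.05, "B": 0.15, "C": 0.20, "D": 0.35, "E": 0.25}

# A "D -> C" campaign modeled as shifting probability mass from D to C.
shifted = dict(baseline)
shifted["D"] -= 0.20
shifted["C"] += 0.20

base = expected_risk(baseline)
after = expected_risk(shifted)
print(f"baseline expected risk: {base:.1%}")
print(f"after D->C shift:       {after:.1%}")
print(f"reduction:              {base - after:.1%} "
      f"({(base - after) / base:.0%} of baseline)")
```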