Joe summarizes his new report on "scheming AIs" - advanced AI systems that fake alignment during training in order to gain power later. He explores different types of scheming (i.e. distinguishing "alignment faking" from "power-seeking"), and asks what the prerequisites for scheming are and by which paths they might arise.
This is a hackathon project write-up. This does not represent the views of my employer. Thanks to Tianyi Alex Qiu and Aidan Ewart for helpful discussions.
Sometimes, people will search for and find a situation where an LLM would do something bad, and then argue that the situation was natural enough that it is concerning. When is this a valid reason to be concerned? What can be learned from such evidence?
This post:
You can directly argue that the scary demos are representative of a broader class of situations where the model might actually do something bad.
For example, in the case of an...
oops, meant to say "What happens if you merge the bash and the audit tool, just giving the AI a single bash tool from which it can call an audit command", fixed

Schmidt Sciences invites proposals for a pilot program in AI interpretability. We seek new methods for detecting and mitigating deceptive behaviors from AI models, such as when models knowingly give misleading or harmful advice to users. If this pilot uncovers signs of meaningful progress, it may unlock a significantly larger investment in this space.
Can we develop interpretability methods that (1) detect deceptive behaviors exhibited by LLMs and (2) steer their reasoning to eliminate these behaviors?
Successful tools will generalize to realistic use cases, moving beyond typical academic benchmarks and addressing concrete risks arising from deceptive behaviors. Importantly, we are looking for interpretability tools that outperform baselines that do not rely on access to weights, to prove that we can...
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
absent 1 and 2 there aren't comparably strong reasons to run vivariums like our world.
Why the focus on "reasons"?
Many things exist from causes that are not "reasons" in the sense of a decision-maker choosing something with an objective. All reasons are causes, but not all causes are reasons. For example, reproduction is a process that creates a lot of things without "reasons" in the central case of the word referring to something "reasoning".
And, if you wonder what caused you (or us) to exist, a good contender is "a causing-things-to-exist maximizer".
I agree that AI successor-alignment is probably easier than the human AI alignment problem.
One additional difficulty for the AIs is that they need to solve the alignment problem in a way that humans won't notice/understand (or else the humans could take the alignment solution and use it for themselves / shut down the AIs). During the regime before human obsolescence, if we do a reasonable job at control, I think it'll be hard for them to pull that off.
This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions:
Consider Parfit's hitchhiker:
An FDT agent is supposed to reason as follows:
I can't check today, but whoops, sorry if I typoed the equation at some step.
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
Thanks!
I’m interested in why you think consequentialism is necessarily maximising. An AGI might have multiple mutually incompatible goals it is solving for, and choose some balance of those, not maximising on any.
For one thing, my headline claim is “ruthless sociopath”, not “maximizing”. “Ruthless sociopath” is pointing to something that’s missing (intrinsic concern for the welfare of other people), not something that’s present (behaviors that maximize something in the world).
For another thing, strictly speaking, perfect maximization is impossible without ...