In the 2010s, Paul Christiano built an extensive body of work on AI alignment—see the “Iterated Amplification” series for a curated overview as of 2018.
One foundation of this program was an intuition that it should be possible to build “act-based approval-directed agents” (“approval-directed agents” for short): AGIs that would only do things their human supervisors would want them to do. For example, they would not lie to their human supervisors, because their supervisors wouldn’t want them to lie. (It sounds much simpler than it is!)
Another foundation of this program was a set of algorithmic approaches, Iterated Distillation and Amplification (IDA), that supposedly offers a path to actually building these approval-directed AI agents.
I am (and have always been) a skeptic of IDA:...
This is a hackathon project write-up. This does not represent the views of my employer. Thanks to Tianyi Alex Qiu and Aidan Ewart for helpful discussions.
Sometimes, people will search for and find a situation where an LLM would do something bad, and then argue that the situation was natural enough that it is concerning. When is this a valid reason to be concerned? What can be learned from such evidence?
This post:
You can directly argue that the scary demos are representative of a broader class of situations where the model might actually do something bad.
For example, in the case of an...
"such evidence" is not really clear in my sentence. What I meant is that you don't have the sort of update on directly related behavior that this first subsection is about, since P(takeover bad from this model) is saturated at 0 for actions from a weak model. But I agree you do get evidence via updating on alignment (across models), which is more like the sort of evidence that the next subsection is about.

Schmidt Sciences invites proposals for a pilot program in AI interpretability. We seek new methods for detecting and mitigating deceptive behaviors from AI models, such as when models knowingly give misleading or harmful advice to users. If this pilot uncovers signs of meaningful progress, it may unlock a significantly larger investment in this space.
| Can we develop interpretability methods that (1) detect deceptive behaviors exhibited by LLMs and (2) steer their reasoning to eliminate these behaviors? |
|---|
Successful tools will generalize to realistic use cases, moving beyond typical academic benchmarks and addressing concrete risks arising from deceptive behaviors. Importantly, we are looking for interpretability tools that outperform baselines that do not rely on access to weights, to prove that we can...
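As a deliberately simplified illustration of the kind of white-box tool this is asking for (my own sketch, not part of the RFP; the data is synthetic and the variable names are mine), here is a linear probe trained on hidden activations to flag deceptive responses, plus naive activation steering along the learned direction:

```python
# Minimal sketch (not from the RFP): a linear probe on hidden activations to
# flag "deceptive" responses, plus naive steering along the probe direction.
# The activations below are synthetic stand-ins for real residual-stream vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 64

# Pretend these are layer activations for honest vs. deceptive completions.
honest = rng.normal(0.0, 1.0, size=(500, d_model))
deceit_direction = rng.normal(0.0, 1.0, size=d_model)
deceptive = rng.normal(0.0, 1.0, size=(500, d_model)) + 1.5 * deceit_direction

X = np.vstack([honest, deceptive])
y = np.array([0] * 500 + [1] * 500)

# 1) Detection: a linear probe (a white-box baseline).
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# 2) Steering: project the probe's direction out of the deceptive activations.
w = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
steered = deceptive - (deceptive @ w)[:, None] * w
print("fraction flagged before steering:", probe.predict(deceptive).mean())
print("fraction flagged after steering: ", probe.predict(steered).mean())
```

The sketch is only meant to make the baseline comparison concrete: a tool funded under this program would need to beat black-box alternatives (e.g. simply asking the model, or training an output classifier) on realistic deception cases, not on synthetic data like this.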
Also available in markdown at theMultiplicity.ai/blog/schelling-goodness.
This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge...
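To make the game structure concrete, here is a toy sketch (my own illustration, not from the post): each participant independently picks a verdict on the same moral question and is rewarded only for matching the modal pick, so the winning strategy is to predict the focal verdict rather than to report an idiosyncratic one.

```python
# Toy illustration (mine, not the post's): a Schelling coordination game where
# the "task" is picking a moral verdict and payoff comes only from matching others.
from collections import Counter
import random

VERDICTS = ["permissible", "impermissible"]

def play_round(strategies, rng):
    """Each participant picks a verdict; everyone who matches the modal pick scores 1."""
    picks = [strategy(rng) for strategy in strategies]
    focal, _ = Counter(picks).most_common(1)[0]
    return [1 if p == focal else 0 for p in picks]

# A "Schelling" player guesses what most others will converge on (hardcoded here);
# an idiosyncratic player reports a private verdict at random.
schelling_player = lambda rng: "impermissible"
idiosyncratic_player = lambda rng: rng.choice(VERDICTS)

rng = random.Random(0)
players = [schelling_player] * 9 + [idiosyncratic_player]
scores = [play_round(players, rng) for _ in range(1000)]
avg = [sum(r[i] for r in scores) / 1000 for i in range(len(players))]
print("average payoff per player:", avg)
```

The hard part the post is pointing at, namely predicting the focal verdict using only broadly shared constraints and no shared history, is exactly what this toy version hardcodes away.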
Absent 1 and 2, there aren't comparably strong reasons to run vivariums like our world.
Why the focus on "reasons"?
Many things exist from causes that are not "reasons" in the sense of a decision-maker choosing something with an objective. All reasons are causes, but not all causes are reasons. For example, reproduction is a process that creates a lot of things without "reasons" in the central sense of the word, the sense that involves something doing "reasoning".
And, if you wonder what caused you (or us) to exist, a good contender is "a causing-things-to-exist maximizer".
I agree that AI successor-alignment is probably easier than the human-AI alignment problem.
One additional difficulty for the AIs is that they need to solve the alignment problem in a way that humans won't notice or understand (or else the humans could take the alignment solution and use it for themselves, or shut down the AIs). During the regime before human obsolescence, if we do a reasonable job at control, I think it'll be hard for them to pull that off.
This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions:
Consider Parfit's hitchhiker: you are stranded in the desert, and a driver who can reliably predict your behavior will drive you to town only if they predict you will pay them (say, $1,000) on arrival; once you are safely in town, paying is a pure causal loss.
An FDT agent is supposed to reason as follows:
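(A minimal sketch of the kind of expected-value comparison involved, in my own notation rather than necessarily the post's: the FDT agent evaluates whole policies, treating the driver's prediction as subjunctively dependent on whichever policy it chooses.)

$$
\begin{aligned}
\mathbb{E}[U \mid \pi = \text{pay}] &\approx U_{\text{town}} - 1000 &&\text{(predictor foresees paying, so you get the ride)}\\
\mathbb{E}[U \mid \pi = \text{refuse}] &\approx U_{\text{desert}} &&\text{(predictor foresees refusal, so you are left stranded)}
\end{aligned}
$$

Since $U_{\text{town}} - 1000 \gg U_{\text{desert}}$, the paying policy wins, so the FDT agent pays once in town even though, at that point, refusing would be causally costless.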
I can't check today, but whoops, sorry if I typoed the equation at some step.
A comprehensive overview of current technical research agendas in AI alignment and safety (as of 2023). The post categorizes work into understanding existing models, controlling models, using AI to solve alignment, theoretical approaches, and miscellaneous efforts by major labs.