It could be the case that these two goals are separable and independent: (1) ensuring that AI does not cause actively harmful, catastrophic outcomes, and (2) ensuring that AI creates actively positive outcomes.

This is what Davidad calls the Deontic Sufficiency Hypothesis.
If the hypothesis is true, it should be possible to de-pessimize and mitigate the urgent risk from AI without necessarily ensuring that AI creates actively positive outcomes: for safety, it is only necessary to ensure that actively harmful outcomes do not occur. And hopefully that is easier than achieving “full alignment”.
Safety first! We can figure out the rest later.
This is Davidad’s plan with the Open Agency Architecture (OAA).

Quotes from Davidad's Open Agency Architecture plans

- A list of core AI safety problems and how I hope to solve them (August 2023)
- Davidad's Bold Plan for Alignment: An In-Depth Explanation — LessWrong (April 2023)
- An Open Agency Architecture for Safe Transformative AI (December 2022)
- AI Neorealism: a threat model & success criterion for existential safety (December 2022)
1.1. First, instead of trying to specify "value", instead "de-pessimize" and specify the absence of a catastrophe, and maybe a handful of bounded constructive tasks like supplying clean water. A de-pessimizing OAA would effectively buy humanity some time, and freedom to experiment with less risk, for tackling the CEV-style alignment problem—which is harder than merely mitigating extinction risk.
Deontic Sufficiency Hypothesis: This hypothesis posits that it is possible to identify desiderata that are adequate to ensure the model doesn't engage in undesirable behavior. Davidad is optimistic that it's feasible to find desiderata ensuring safety for a few weeks before a better solution is discovered, making this a weaker approach than solving outer alignment. For instance, Davidad suggests that even without a deep understanding of music, you can be confident your hearing is safe by ensuring the sound pressure level remains below 80 decibels. However, since the model would still be executing a pivotal process with significant influence, relying on a partial solution for decades could be risky.
Getting traction on the deontic feasibility [sic] hypothesis
Davidad believes that using formalisms such as Markov Blankets would be crucial in encoding the desiderata that the AI should not cross boundary lines at various levels of the world-model. We only need to “imply high probability of existential safety”, so according to davidad, “we do not need to load much ethics or aesthetics in order to satisfy this claim (e.g. we probably do not get to use OAA to make sure people don't die of cancer, because cancer takes place inside the Markov Blanket, and that would conflict with boundary preservation; but it would work to make sure people don't die of violence or pandemics)”. Discussing this hypothesis more thoroughly seems important.
Deontic Sufficiency Hypothesis: There exists a human-understandable set of features of finite trajectories in such a world-model, taking values in (−∞,0], such that we can be reasonably confident that all these features being near 0 implies high probability of existential safety, and such that saturating them at 0 is feasible[2] with high probability, using scientifically-accessible technologies.
I am optimistic about this largely because of recent progress toward formalizing a natural abstraction of boundaries by Critch and Garrabrant. I find it quite plausible that there is some natural abstraction property Q of world-model trajectories that lies somewhere strictly within the vast moral gulf of
All Principles That Human CEV Would Endorse ⇒ Q ⇒ Don't Kill Everyone
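To make the shape of this claim concrete, here is a rough symbolic paraphrase (my notation, not Davidad's), where $\mathcal{T}$ is the set of finite world-model trajectories:

$$\exists\, f_1, \dots, f_n : \mathcal{T} \to (-\infty, 0] \ \text{(human-understandable)}$$
$$\text{such that} \quad \big(\forall i:\ f_i(\tau) \approx 0\big) \;\Rightarrow\; \text{high probability of existential safety},$$
$$\text{and} \quad \text{saturating all } f_i \text{ at } 0 \text{ is feasible with scientifically-accessible technologies.}$$

The 80-decibel analogy above fits this template with a single feature, e.g. $f_{\text{hearing}}(\tau) = \min\!\big(0,\ 80\,\mathrm{dB} - \max_t \mathrm{SPL}_t(\tau)\big)$, which is exactly 0 when the sound pressure level never exceeds 80 dB anywhere along the trajectory, and negative otherwise.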
For me the core question of existential safety is this:
Under these conditions, what would be
the best strategy for building an AI system
that helps us ethically end the acute risk period
without creating its own catastrophic risks
that would be worse than the status quo?
It is not, for example, "how can we build an AI that is aligned with human values, including all that is good and beautiful?" or "how can we build an AI that optimises the world for whatever the operators actually specified?" Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).
How to formalize safety?
If the deontic sufficiency hypothesis is true, there should be an independent/separable way to formalize what “safety” is. This is why I think boundaries/membranes could be helpful for AI safety: See Agent membranes and formalizing “safety”.
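As a toy illustration of what one of these features could look like, here is a sketch in code (entirely my own construction, not part of OAA: the Step and Trajectory types, the boundary_feature function, and the way membranes are represented are invented stand-ins for a real world model). It computes a membrane-style trajectory feature valued in (−∞, 0] that saturates at 0 exactly when the AI never touches anything inside another agent's boundary:

```python
from dataclasses import dataclass

# Toy stand-in for a world-model trajectory: each step records, for every agent,
# a "membrane" (identifiers that other actors must not touch) and the identifiers
# that each actor's action actually touched at that step.

@dataclass
class Step:
    membranes: dict[str, frozenset[str]]  # agent -> identifiers inside its membrane
    touched: dict[str, frozenset[str]]    # actor -> identifiers its action touched

Trajectory = list[Step]

def boundary_feature(traj: Trajectory, ai_name: str = "AI") -> float:
    """Trajectory feature valued in (-inf, 0].

    Returns 0 iff the AI never touches anything inside another agent's membrane;
    otherwise returns minus the number of crossings, so worse trajectories are
    further from 0. Safety, in the sense of this desideratum, means saturating
    the feature at 0.
    """
    crossings = 0
    for step in traj:
        ai_touched = step.touched.get(ai_name, frozenset())
        for agent, membrane in step.membranes.items():
            if agent != ai_name:
                crossings += len(ai_touched & membrane)
    return -float(crossings)

# Example: a two-step trajectory in which the AI crosses Alice's boundary once.
traj = [
    Step(membranes={"Alice": frozenset({"body", "home"})},
         touched={"AI": frozenset({"road"})}),
    Step(membranes={"Alice": frozenset({"body", "home"})},
         touched={"AI": frozenset({"home"})}),
]
print(boundary_feature(traj))  # -1.0: one crossing, so the feature is not saturated at 0
```

The point is only the type signature: a checkable property of world-model trajectories, bounded above by 0, whose saturation means "no boundary crossings occurred", with no appeal to human values or aesthetics.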
Thanks to Jonathan Ng for reviewing a draft of this post and to Alexander Gietelink Oldenziel for encouraging me to post it.
Note that Davidad has not reviewed or verified this post.