A collection of 11 proposals for building safe advanced AI under the current machine learning paradigm. There's a lot of literature laying out various approaches, but much of it focuses primarily on outer alignment at the expense of inner alignment, and it rarely compares approaches directly.
(Fictional) Optimist: So you expect future artificial superintelligence (ASI) “by default”, i.e. in the absence of yet-to-be-invented techniques, to be a ruthless sociopath, happy to lie, cheat, and steal, whenever doing so is selfishly beneficial, and with callous indifference to whether anyone (including its own programmers and users) lives or dies?
Me: Yup! (Alas.)
Optimist: …Despite all the evidence right in front of our eyes from humans and LLMs.
Me: Yup!
Optimist: OK, well, I’m here to tell you: that is a very specific and strange thing to expect, especially in the absence of any concrete evidence whatsoever. There’s no reason to expect it. If you think that ruthless sociopathy is the “true core nature of intelligence” or whatever, then you should really look at yourself in a mirror and...
Thanks!
Re other ways to do the feat:
Re RL: thanks for those examples. I'd have defined it via something like # of data points or amoun...
The following is an edited transcript of a talk I gave. I have given this talk in multiple places, first at Anthropic and then for ELK winners and at Redwood Research, though this document is based on the version I gave to SERI MATS fellows. Thanks to Jonathan Ng, Ryan Kidd, and others for help transcribing that talk; I then made substantial edits on top of the transcription. Though all slides are embedded below, the full slide deck is also available here.
Today I’m going to be talking about deceptive alignment. Deceptive alignment is something I'm very concerned about and is where I think most of the existential risk from AI comes from. And I'm going to try to make the case...
I reread the distillation of this (sadly not the full article, due to time limitations), so my understanding of Evan's views might be off. Here's my retrospective:
High path dependence:
• Evan argues that before it obtains the true training objective, gradient descent likely obtains a proxy goal. This seems intuitive, though there's debate about whether we should read recent results as evidence that alignment happens by default or as showing it remains a hard, unsolved problem. (A toy sketch of proxy-first learning appears after this list.)
• Evan argues that the model will gain an understanding of its own sit...
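To make the proxy-first claim concrete, here is a toy illustration of my own (not from Evan's post, and with all numbers made up): with plain gradient descent and squared loss, a "louder" proxy feature that correlates with the label soaks up the gradient long before the true feature does.

```python
# Toy sketch (mine, not Evan's): gradient descent latches onto a proxy
# feature correlated with the label before fitting the true one.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_feature = rng.normal(size=n)
# The proxy tracks the label closely but is "louder" (3x the scale),
# so early gradients flow mostly through it.
proxy_feature = 3.0 * (true_feature + 0.1 * rng.normal(size=n))
X = np.stack([true_feature, proxy_feature], axis=1)
y = true_feature  # the "true training objective"

w = np.zeros(2)
lr = 0.01
for step in range(201):
    grad = 2.0 * X.T @ (X @ w - y) / n  # gradient of mean squared error
    w -= lr * grad
    if step % 50 == 0:
        print(f"step {step:3d}: w_true={w[0]:+.3f}  w_proxy={w[1]:+.3f}")
# Early on, nearly all of the fit runs through w_proxy; the weight on
# the true feature only takes over much later in training.
```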
Epistemic status: intuitive speculation with scattered mathematical justification.
My goal here is to interrogate the dream of writing a beautiful algorithm for intelligence and thereby ensuring safety. For example:
I don't know precisely what alternative he had in mind, but I only seem to remember reading clean functional programs from MIRI, so that is one possibility. Whether or not anyone else endorses it, that is the prototypical picture I have in mind as the holy grail of glass box AGI implementation. In my mind, it has careful comments explaining the precisely chosen, recursive prior and decision rules that make it go "foom."
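To make that picture concrete, here is a deliberately cartoonish sketch of the sort of program I'm imagining. This is my own toy, not anything MIRI actually published, and every name in it is invented; the point is only that each design choice sits legibly on the page.

```python
# A cartoon of the "glass box" holy grail (my invention): a transparent
# agent whose prior and decision rule are each a single commented choice.
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Hypothesis:
    """A world-model: maps an action to the reward it predicts."""
    description: str
    reward_of: Callable[[str], float]

def simplicity_prior(h: Hypothesis) -> float:
    # The "precisely chosen prior", reduced to a toy: mass 2^-(length of
    # description), a stand-in for a Solomonoff-style simplicity prior.
    return 2.0 ** -len(h.description)

def choose_action(hypotheses: List[Hypothesis], actions: List[str]) -> str:
    # The decision rule: maximize prior-weighted expected reward.
    z = sum(simplicity_prior(h) for h in hypotheses)
    def expected_reward(action: str) -> float:
        return sum(
            simplicity_prior(h) / z * h.reward_of(action) for h in hypotheses
        )
    return max(actions, key=expected_reward)

if __name__ == "__main__":
    worlds = [
        Hypothesis("sun", lambda a: 1.0 if a == "plant" else 0.0),
        Hypothesis("drought", lambda a: 1.0 if a == "store" else 0.2),
    ]
    print(choose_action(worlds, ["plant", "store"]))  # -> plant
```

Obviously nothing this small carries intelligence; the question is whether anything with this inspectable shape can.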
Is the "inscrutable" complexity of deep neural networks unavoidable? There is has been some prior discussion of the desire to avoid it as map-territory confusion, and I am...
Ah I see (I'll try a bit to remember to say "by so and so").
AI alignment has a culture clash. On one side, the “technical-alignment-is-hard” / “rational agents” school of thought argues that we should expect future powerful AIs to be power-seeking ruthless consequentialists. On the other side, people observe that both humans and LLMs are obviously capable of behaving like, well, not that. The latter group accuses the former of head-in-the-clouds abstract theorizing gone off the rails, while the former accuses the latter of mindlessly assuming that the future will always be the same as the present, rather than trying to understand things. “Alas, the power-seeking ruthless consequentialist AIs are still coming,” sigh the former. “Just you wait.”
As it happens, I’m basically in that “alas, just you wait” camp, expecting ruthless future AIs. But my camp faces a real question: what exactly is it...
I get that, and I appreciate the comments!
(Side note: in case you missed it, I expanded my last comment into a post “Why we should expect ruthless sociopath ASI”. Thanks for the inspiration, and hope you don’t mind.)
The very capabilities that make the nightmare scary (deep, nuanced understanding of how humans think and feel and operate) are exactly the kind of rich social cognition that, in our only empirical examples, comes tangled up with perspective-taking, moral awareness, and empathy.
Here are a couple reasons to think it’s not too “tangled up” in hum...
In this post, we introduce contributions and supracontributions[1], which are basic objects from infra-Bayesianism that go beyond the crisp case (the case of credal sets). We then define supra-POMDPs, a generalization of partially observable Markov decision processes (POMDPs). This generalization has state transition dynamics that are described by supracontributions.
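For orientation, here is a rough gloss of the crisp case the post generalizes away from (my paraphrase; the post gives the precise definitions). A crisp infra-Bayesian belief is a credal set, i.e. a closed convex set $\mathcal{C}$ of probability distributions, and an agent evaluates a payoff $f$ by its worst-case expectation:

$$\underline{\mathbb{E}}_{\mathcal{C}}[f] \;=\; \min_{\mu \in \mathcal{C}} \mathbb{E}_{\mu}[f].$$

As I understand the standard infra-Bayesian usage, a contribution relaxes the normalization on $\mu$ from $\mu(X) = 1$ to $\mu(X) \le 1$ (a sub-probability measure), which is what lets the supra-objects go beyond credal sets.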
We use supra-POMDPs to formalize various Newcombian problems in the context of learning theory, where an agent repeatedly encounters the problem. The one-shot versions of these problems are well known to highlight flaws in classical decision theories.[2] In particular, we discuss the opaque, transparent, and epsilon-noisy versions of Newcomb's problem, XOR blackmail, and counterfactual mugging.
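As a reminder of the classic (opaque) setup: a highly reliable predictor puts \$1{,}000{,}000 in the opaque box iff it predicts you will take only that box, while the transparent box always contains \$1{,}000, giving the payoff table

$$\begin{array}{l|cc} & \text{predicted one-box} & \text{predicted two-box} \\ \hline \text{one-box} & \$1{,}000{,}000 & \$0 \\ \text{two-box} & \$1{,}001{,}000 & \$1{,}000 \end{array}$$

Causal decision theory two-boxes by dominance and so, against an accurate predictor, walks away with only \$1{,}000; this is the kind of flaw the repeated, learning-theoretic framing is meant to probe.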
We conclude by stating a theorem that describes when optimality for the supra-POMDP relates to optimality for the Newcombian problem. This theorem is significant because it gives...
Halpern and Leung propose the "minimax weighted expected regret" (MWER) decision-rule, which is a generalization of the minimax-expected-regret (MER) decision-rule. In contrast, our decision rule is a weighted generalization of maximin-expected-utility (MMEU). The problem with MER is that it doesn't work very well with learning. The closest thing to doing learning with MER is adversarial bandits. However, adversarial regret is statistically intractable for Markov Decision Processes. And even with bandits there is a hidden obliviousness assumption if you try to interpret it in a principled decision-theoretic way.
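Schematically, in my notation (with $\mathcal{P}$ a set of probability measures, $u_a$ the utility of act $a$, and, for MWER, a weight function $\alpha : \mathcal{P} \to [0,1]$ as in Halpern and Leung's formulation):

$$\text{MMEU:}\quad \arg\max_{a}\; \min_{\Pr \in \mathcal{P}} \mathbb{E}_{\Pr}[u_a], \qquad \text{MER:}\quad \arg\min_{a}\; \max_{\Pr \in \mathcal{P}} \Big(\max_{a'} \mathbb{E}_{\Pr}[u_{a'}] - \mathbb{E}_{\Pr}[u_a]\Big),$$

$$\text{MWER:}\quad \arg\min_{a}\; \max_{\Pr \in \mathcal{P}}\; \alpha(\Pr)\Big(\max_{a'} \mathbb{E}_{\Pr}[u_{a'}] - \mathbb{E}_{\Pr}[u_a]\Big).$$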