I think my main confusion is that Paul talks about many different ways deliberation could work (e.g. RL-based IDA and human-in-the-counterfactual-loop seem pretty different), and it’s not clear what approach he thinks is most plausible.
I have similar questions, and I'm not sure how much of it is that Paul is himself uncertain and how much is that he hasn't communicated his thinking yet. Also, one thing to keep in mind is that different forms of deliberation could be used at different levels of the system: for example, one method could be used to model/emulate/extrapolate the overseer's deliberation and another for the end-user's.
On a more general note, I'm really worried that we don't have much understanding of how or why human deliberation can lead to good outcomes in the long run. It seems clear that an individual human deliberating in isolation is highly likely to get stuck or go off the rails, and groups of humans often do so as well. To the extent that we as a global civilization do seem able to make progress in the very long run, the process looks fragile at best, and we don't know how to reliably preserve it or reproduce it in an artificial setting.
Some recent discussion about what Paul Christiano means by "short-term preferences" got me thinking more generally about deliberation as a method of figuring out the human user's or users' "actual preferences". (I can't give a definition of "actual preferences" because we have such a poor understanding of meta-ethics that we don't even know what the term should mean or whether such preferences even exist.)
To set the framing of this post: we want good outcomes from AI. To get them, we probably want to figure out the human user's or users' "actual preferences" at some point. There are several options for this:
The third option is the focus of this post. The first two are also well worth considering; they just aren't the focus here. Also, the list isn't meant to be comprehensive; I would be interested to hear about any other approaches.
In terms of Paul's recent AI alignment landscape tree, I think this discussion fits under the "Learn from teacher" node, but I'm not sure.
Terminological note: In this post, I use "deliberation" and "reflection" interchangeably. I think this is standard, but I'm not sure. If anyone uses these terms differently, I would like to know how they distinguish between them.
Approaches to deliberation that have been suggested so far
In this section, I list some concrete-ish approaches to deliberation that have been considered so far. I say "concrete-ish" rather than "concrete" because each of these approaches is still underdetermined in many ways: for "humans sitting down", for example, it's not clear whether we split the humans up in some way, which humans we use, how much time we allow, what kind of "voting"/parliamentary system we use, and so on. Later in this post I will talk about properties of deliberation, and the "concrete-ish" approaches here are concrete in two senses: (a) they have some of these properties filled in (e.g. "humans sitting down" says the computation happens primarily inside human brains); and (b) within a single property, they might specify a specific mechanism (e.g. saying "use counterfactual oracles somehow" is more concrete than saying "use an approach where the computation doesn't happen inside human brains").
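One way to make the "some properties filled in, others left open" idea more tangible is to think of each approach as a partial assignment of values to the properties discussed in the next section. The sketch below is purely illustrative: the property names and the Python representation are my own stand-ins, not something specified by the approaches themselves.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in only: each approach pins down some properties of
# deliberation and leaves others undetermined (None).

@dataclass
class DeliberationApproach:
    name: str
    computation_in_human_brains: Optional[bool] = None  # where the deliberation happens
    explicit_output: Optional[bool] = None               # explicit answer vs. "just act"
    mechanism: Optional[str] = None                       # specific mechanism, if named

approaches = [
    # "Humans sitting down" fixes where the computation happens but leaves
    # e.g. the output format and the specific mechanism open.
    DeliberationApproach("humans sitting down", computation_in_human_brains=True),
    # "Use counterfactual oracles somehow" names a mechanism and implies the
    # computation is not primarily inside human brains, but settles little else.
    DeliberationApproach(
        "counterfactual oracles",
        computation_in_human_brains=False,
        mechanism="counterfactual oracle",
    ),
]

for a in approaches:
    unspecified = [f for f in ("computation_in_human_brains", "explicit_output", "mechanism")
                   if getattr(a, f) is None]
    print(a.name, "leaves unspecified:", unspecified)
```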
Properties of deliberation
With the examples above in hand, I want to step back and abstract out some properties/axes/dimensions they have.
I'm not sure that these dimensions cleanly separate or how important they are. There are also probably many other dimensions that I'm missing.
Since I had trouble distinguishing between some of the above properties, I made the following table:
Comparison table
The following table summarizes my understanding of where each of the concrete-ish approaches stands on a subset of the properties above. I've restricted the comparison to a subset both because many approaches leave certain questions unanswered and because adding too many columns would make the table difficult to read.
In addition to the approaches listed above, I've included HCH since I think it's an interesting theoretical case to look at.
* We could imagine asking a question like "What are my actual preferences?" to get an explicit answer, or just asking AI assistants to do something (in which case the output of deliberation is not explicit).
† Paul says "Rather than learning a reward function from human data, we also train it by amplification (acting on the same representations used by the generative model). Again, we can distill the reward function into a neural network that acts on sequences of observations, but now instead of learning to predict human judgments it’s predicting a very large implicit deliberation." The "implicit" in this quote seems to refer to the process (rather than the output) of deliberation. See also the paragraph starting with "To summarize my own understanding" in this comment (which I think is talking about RL-based IDA), which suggests that maybe we should distinguish between "understandable in theory if we had the time" and "understandable within the time constraints we have" (in the table I went with the latter). There is also the question of whether a reward function is "explicit enough" as a representation of values. (A minimal sketch of the amplify-then-distill setup described in this quote appears after these notes.)
‡ Q5 (p. 32) in the CEV paper clarifies that the computation to find CEV wouldn't be sentient, but I'm not sure if the paper says whether the cognition will resemble human thought.
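To expand on footnote †: the quote describes producing reward judgments via amplification and then distilling them into a network that predicts those judgments from sequences of observations. Below is a minimal, purely illustrative sketch of that amplify-then-distill pattern; the `amplified_judgment` stub, the network shape, and the training loop are my own stand-ins, not code from Paul's proposal.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins only. `amplified_judgment` plays the role of the
# expensive amplified overseer; the small MLP is the distilled reward model
# that learns to predict its judgments from sequences of observations.

OBS_DIM, SEQ_LEN = 8, 5

def amplified_judgment(obs_seq: torch.Tensor) -> torch.Tensor:
    # Stand-in for the expensive deliberative judgment; in reality this would
    # be an amplified overseer, not a cheap closed-form formula.
    return obs_seq.sum(dim=(1, 2)).tanh()

# Distilled reward model: flattened observation sequence -> scalar reward.
reward_model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(OBS_DIM * SEQ_LEN, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
for step in range(200):
    obs_seq = torch.randn(32, SEQ_LEN, OBS_DIM)        # a batch of trajectories
    target = amplified_judgment(obs_seq).unsqueeze(1)  # expensive labels
    pred = reward_model(obs_seq)                       # cheap distilled prediction
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
# The distilled reward_model can now be queried cheaply, e.g. inside an RL loop.
```

The point of the sketch is only the division of labor: the expensive amplified judgment supplies training labels, while the cheap distilled model is what gets queried at scale, which connects to the footnote's question of whether the resulting reward function is an "explicit enough" representation of values.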
Takeaways
Acknowledgments
Thanks to Wei Dai for suggesting the point about solving meta-ethics. (However, I may have misrepresented his point, and this acknowledgment should not be seen as an endorsement by him.)
From the CEV paper: "Do we want our coherent extrapolated volition to satisfice, or maximize? My guess is that we want our coherent extrapolated volition to satisfice […]. If so, rather than trying to guess the optimal decision of a specific individual, the CEV would pick a solution that satisficed the spread of possibilities for the extrapolated statistical aggregate of humankind." (p. 36)
And: "This is another reason not to stand in awe of the judgments of a CEV—a solution that satisfices an extrapolated spread of possibilities for the statistical aggregate of humankind may not correspond to the best decision of any individual, or even the best vote of any real, actual adult humankind." (p. 37) (A toy illustration of this satisfice/maximize contrast appears after these notes.) ↩︎
Paul says "So an excellent agent with a minimal understanding of human values seems OK. Such an agent could avoid getting left behind by its competitors, and remain under human control. Eventually, once it got enough information to understand human values (say, by interacting with humans), it could help us implement our values." ↩︎
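As a toy illustration of the satisfice/maximize contrast from the CEV footnote above: the sketch below contrasts picking the option with the highest aggregate score against picking any option that is acceptable across a spread of extrapolated volitions. The options, scores, and threshold are made up for illustration; nothing here comes from the CEV paper.

```python
import random

# Made-up scores: how well each option satisfies each extrapolated volition.
scores = {
    "volition_1": {"A": 0.9, "B": 0.6, "C": 0.95},
    "volition_2": {"A": 0.1, "B": 0.55, "C": 0.9},
    "volition_3": {"A": 0.8, "B": 0.6, "C": 0.05},
}
options = ["A", "B", "C"]

def maximize(options, scores):
    """Pick the single option with the highest total score."""
    return max(options, key=lambda o: sum(s[o] for s in scores.values()))

def satisfice(options, scores, threshold=0.5):
    """Pick any option that every volition finds acceptable (>= threshold)."""
    acceptable = [o for o in options if all(s[o] >= threshold for s in scores.values())]
    return random.choice(acceptable) if acceptable else None

print(maximize(options, scores))   # "C": highest total, but leaves volition_3 almost unsatisfied
print(satisfice(options, scores))  # "B": acceptable to every volition, though it is no one's top choice
```

In this toy setup the satisficed answer is acceptable to every extrapolated volition without being the best option for any single one of them, which is the flavor of the quote's "may not correspond to the best decision of any individual."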