Deliberation as a method to find the "actual preferences" of humans

riceissa

I think my main confusion is that Paul talks about many different ways deliberation could work (e.g. RL-based IDA and human-in-the-counterfactual-loop seem pretty different), and it’s not clear what approach he thinks is most plausible.

I have similar questions, and I'm not sure how much of it is that Paul is uncertain himself, and how much is Paul not having communicated his thinking yet. Also one thing to keep in mind is that different forms of deliberation could be used at different levels of the system, so for example one method can be used to model/emulate/extrapolate the overseer's deliberation and another one for the end-user.

On a more general note, I'm really worried that we don't have much understanding of how or why human deliberation can lead to good outcomes in the long run. It seems clear that an individual human deliberating in isolation is highly likely to get stuck or go off the rails, and groups of humans often do so as well. To the extent that we as a global civilization seemingly are able to make progress in the very long run, it seems at best a fragile process, which we don't know how to reliably preserve, or reproduce in an artificial setting.

From the CEV paper: "Do we want our coherent extrapolated volition to satisfice, or maximize? My guess is that we want our coherent extrapolated volition to satisfice […]. If so, rather than trying to guess the optimal decision of a specific individual, the CEV would pick a solution that satisficed the spread of possibilities for the extrapolated statistical aggregate of humankind." (p. 36)

And: "This is another reason not to stand in awe of the judgments of a CEV—a solution that satisfices an extrapolated spread of possibilities for the statistical aggregate of humankind may not correspond to the best decision of any individual, or even the best vote of any real, actual adult humankind." (p. 37) ↩︎
Paul says "So an excellent agent with a minimal understanding of human values seems OK. Such an agent could avoid getting left behind by its competitors, and remain under human control. Eventually, once it got enough information to understand human values (say, by interacting with humans), it could help us implement our values." ↩︎

	Output	Process
Implicit vs explicit	Implicit vs explicit output	Human-like vs non-human-like cognition
Understandable vs not understandable	Human understandability of output (human intermediate integration also implies understandability of intermediate results and thus also of the output)	Human-like vs non-human-like cognition (there might also be non-human-like approaches that are understandable)
Inside vs outside human brain	(Reduces to understandable vs not understandable)	Human vs non-human computation

	Inside human brain?	Human-like cognition?	Implicit vs explicit output	Intermediate integration	Understandable output?
Human sitting down	yes	yes	explicit (hopefully)	yes	yes
Uploads sitting down	no	yes	explicit	maybe	yes
Counterfactual oracle	no	no	explicit	yes	yes
Imitation-based IDA	no	no	implicit/depends on question*	no	no
RL-based IDA	no	no†	explicit†	no	no†
HCH	yes	no	implicit/depends on question*	no	n.a.
Debate	no	no	explicit	yes	yes
CEV	no	?‡	explicit	no	yes
Ambitious value learning	no	no	explicit	no	maybe

AI ALIGNMENT FORUM
AF

11

Deliberation as a method to find the "actual preferences" of humans

11

11

Deliberation as a method to find the "actual preferences" of humans

11

Approaches to deliberation that have been suggested so far

Properties of deliberation

Comparison table

Takeaways

Acknowledgments