Here is a list of all my public writings and videos.
If you want to do a dialogue with me, but I didn't check your name, just send me a message instead. Ask for what you want!
I've been gingerly building my way up toward similar ideas but I haven't yet posted my thoughts on the subject. I appreciate you ripping the band-aid off.
There are two obvious ways an intelligence can be non-consequentialist: it can be stateless, or it can be local.
If you define intelligence to be consequentialist, then corrigibility becomes extremely difficult for the reasons Eliezer Yudkowsky has expounded ad nauseam. If you create a non-consequentialist intelligence, then corrigibility is almost the default, especially with regard to stateless intelligences. A stateless intelligence has no external world to optimize. This isn't a side-effect of it being stupid or boxed. It's a fundamental constraint of the software paradigm the machine learning architecture is embedded in.
It has no concept of an outside world. It understands how the solar system works but it doesn't know what the solar system is. We give it the prices of different components and it spits out a design.
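To make this concrete, here is a minimal sketch (Python, with invented component names and a toy greedy selection rule, not any real system's code) of what a stateless optimizer looks like from the outside: a pure function from a price list to a design, with no memory between calls and no model of anything beyond its inputs.

```python
from typing import Dict

def design_machine(component_prices: Dict[str, float], budget: float) -> Dict[str, int]:
    """A stateless optimizer: prices in, design out.

    It keeps no memory between calls and references nothing outside its
    arguments. The greedy knapsack-style selection is purely illustrative.
    """
    design: Dict[str, int] = {}
    remaining = budget
    # Buy the cheapest components first until the budget runs out.
    for name, price in sorted(component_prices.items(), key=lambda kv: kv[1]):
        quantity = int(remaining // price) if price > 0 else 0
        if quantity > 0:
            design[name] = quantity
            remaining -= quantity * price
    return design

# The same inputs always produce the same output. There is no external
# world here for the function to optimize.
print(design_machine({"strut": 2.0, "motor": 15.0, "sensor": 7.5}, budget=40.0))
```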
It's easier to build local systems than consequentialist systems because the components available to us are physical objects, and physics is local. Consequentialist systems are harder to construct because world-optimizers are (practically speaking) non-local; building an (effectively) non-local system out of local elements can be done, but it is hard. Stateless systems are easier to build than either local or consequentialist systems because mathematics is absolute. Consequentialist is harder than local; local is harder than stateless.
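For contrast, here are equally rough sketches, with made-up interfaces, of the other two kinds of system: a local system that reacts only to what its own sensor reports, and a consequentialist system that searches over predicted future world states.

```python
from typing import Callable, Dict, List

class Thermostat:
    """A local system: it reacts to the reading on its own sensor and keeps
    a little internal state, but it models nothing beyond that."""

    def __init__(self, target: float):
        self.target = target
        self.heater_on = False  # local, internal state

    def step(self, measured_temperature: float) -> bool:
        self.heater_on = measured_temperature < self.target
        return self.heater_on


def consequentialist_choice(
    current_state: Dict,
    actions: List[str],
    predict: Callable[[Dict, str], Dict],
    utility: Callable[[Dict], float],
) -> str:
    """A (one-step) consequentialist system: it needs a model of the whole
    world (predict) and a preference over future world states (utility),
    and it picks whichever action leads to the best predicted future."""
    return max(actions, key=lambda action: utility(predict(current_state, action)))
```

The thermostat needs nothing but a wire to its sensor. The consequentialist chooser needs a predict function, a model of the whole world, and that model is where the (effectively) non-local part lives.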
Maybe I’m being thickheaded, but I’m just skeptical of this whole enterprise. I’m tempted to declare that “preferences purely over future states” are just fundamentally counter to corrigibility. When I think of “being able to turn off the AI when we want to”, I see it as a trajectory-kind-of-thing, not a future-state-kind-of-thing. And if we humans in fact have some preferences over trajectories, then it’s folly for us to build AIs that purely have preferences over future states.
I don't think you're being thickheaded. I think you're right. Human beings are so trajectory-dependent it's a cliché. "Life is not about the destination. Life is about the friends we made along the way."
This is not to say I completely agree with all the claims in the article. Your proposal for a corrigible paperclip maximizer appears consequentialist to me because the two elements of its value function, "there will be lots of paperclips" and "humans will remain in control", are both statements about the future. Optimizing a future state is consequentialism. If the "humans will remain in control" value function has bugs (and it will) then the machine will turn the universe into paperclips. A non-consequentialist architecture shouldn't require a "humans will remain in control" value function. There should be no mechanism for the machine to consequentially interfere with its masters' intentions at all.
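Spelled out in my own words (invented names; this is a paraphrase of the proposal's shape, not its actual formulation), the value function looks roughly like this, with both terms judging the same predicted future state:

```python
from typing import Callable, Dict

def make_value_function(
    paperclip_count: Callable[[Dict], float],
    humans_in_control: Callable[[Dict], float],
    w_clips: float = 1.0,
    w_control: float = 1.0,
) -> Callable[[Dict], float]:
    """Both terms judge a *future* state of the world, so an agent that
    optimizes this value is consequentialist no matter how the terms are weighted."""
    def value(future_state: Dict) -> float:
        return (w_clips * paperclip_count(future_state)
                + w_control * humans_in_control(future_state))
    return value
```

If humans_in_control mis-scores even a narrow class of futures, an optimizer pointed at this value function will steer toward exactly those futures, because the paperclip term still rewards them.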
Much of the dialogue about AI Safety I encounter in off-the-record conversations seems to me like it's not grounded in reality. I repeatedly hear (what I feel to be) a set of shaky arguments that both shut down conversation and are difficult to validate empirically.
The shaky argument is as follows:
I am not saying this line of reasoning is what AI researchers believe or that it's mainstream (among the rationality/alignment communities)―or even that it's wrong. The argument annoys me for the same reason a popular-yet-incoherent political platform annoys me; I have encountered badly-argued versions of the idea too many times.
I agree with #1, though with quibbles: "absolute power" should be distinguished from "sample efficiency", and I have doubts about how we'll get to superintelligence. (I am bearish on applying the scaling hypothesis to existing architectures.) I agree with #3 in theory. Theory is often very different from practice. I disagree with #2 because it relies on the tautological equivalence of two definitions. I can imagine superintelligent machines that aren't world optimizers. Without #2 the argument falls apart. It might be easy to build a superintelligence but hard to build a world optimizer.
I approached The Alignment Problem with the (incorrect) prior that it would be more of the same vague, abstract arguments untethered from technical reality. Instead, the book was dominated by ideas that have passed practical empirical tests.
Am I still eligible for the prize if I publish a public blog post at the same time I submit the Google Doc, or would you prefer I not publish a blog post until after February 15th? Publishing the blog post immediately advances science better (because it can enable discussion), but waiting until after February 15th might be preferable to you for contest-related reasons.