A theory of human values

[-]Wei Dai7y50

This seems to assume a fairly specific (i.e., anti-realist) metaethics. I'm quite uncertain about metaethics and I'm worried that if moral realism is true (and say for example that total hedonic utilitarianism is the true moral theory), and what you propose here causes the true moral theory to be able to control only a small fraction of the resources of our universe, that would constitute a terrible outcome. Given my state of knowledge, I'd prefer not to make any plans that imply commitment to a specific metaethical theory, like you seem to be doing here.

What's your response to people with other metaethics or who are very uncertain about metaethics?

However, for actual humans, the first scenario seems to loom much larger.

I don't think this is true for me, or maybe I'm misunderstanding what you mean by the two scenarios.

[-]Lukas_Gloor7y20

Leaning on this, someone could write a post about the "infectiousness of realism" since it might be hard to reconcile openness to non-zero probabilities of realism with anti-realist frameworks? :P

For people who believe their actions matter infinitely more if realism is true, this could be modeled as an overriding meta-preference to act as though realism is true. Unfortunately if realism isn't true this could go in all kinds of directions depending on how the helpful AI system would expect to get into such a judged-to-be-wrong epistemic state.

Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we'll know by that point whether we've found the right way to do metaphilosophy (and how approaching that question is different from approaching whichever procedures philosophically sophisticated people would pick to settle open issues in something like the above proposals). It seems like there has to come a point where one has to hand off control to some in-advance specified "metaethical framework" or reflection procedure, and judged from my (historically overconfidence-prone) epistemic state it doesn't feel obvious why something like Stuart's anti-realism isn't already close to there (though I'd say there are many open questions and I'd feel extremely unsure about how to proceed regarding for instance "2. A method for synthesising such basic preferences into a single utility function or similar object," and also to some extent about the premise of squeezing a utility function out of basic preferences absent meta-preferences for doing that). Adding layers of caution sounds good though as long as they don't complicate things enough to introduce large new risks.

[-]Wei Dai7y30

Probably you were thinking of something like teaching AIs metaphilosophy in order to perhaps improve the procedure? This would be the main alternative I see, and it does feel more robust. I am wondering though whether we’ll know by that point whether we’ve found the right way to do metaphilosophy

I think there's some (small) hope that by the time we need it, we can hit upon a solution to metaphilosophy that will just be clearly right to most (philosophically sophisticated) people, like how math and science were probably once methodologically quite confusing but now everyone mostly agrees on how math and science should be done. Failing that, we probably need some sort of global coordination to prevent competitive pressures leading to value lock-in (like the kind that would follow from Stuart's scheme). In other words, if there wasn't a race to build AGI, then there wouldn't be a need to solve AGI safety, and there would be no need for schemes like Stuart's that would lock in our values before we solve metaphilosophy.

it doesn’t feel obvious why something like Stuart’s anti-realism isn’t already close to there

Stuart's scheme uses each human's own meta-preferences to determine their own (final) object-level preferences. I would less concerned if this was used on someone like William MacAskill (with the caveat that correctly extracting William MacAskill's meta-preferences seems equivalent to learning metaphilosophy from William) but a lot of humans have seemingly terrible meta-preferences or at least different meta-preferences which likely lead to different object-level preferences (so they can't all be right, assuming moral realism).

To put it another way, my position is that if moral realism or relativism (positions 1-3 in this list) is right, we need "metaphilosophical paternalism" to prevent a "terrible outcome", and that's not part of Stuart's scheme.

[-]Stuart_Armstrong7y10

I would less concerned if this was used on someone like William MacAskill [...] but a lot of humans have seemingly terrible meta-preferences

In those cases, I'd give more weight to the preferences than the meta-preferences. There is the issue of avoiding ignorant-yet-confident meta-preferences, which I'm working on writing up right now (partially thanks to you very comment here, thanks!)

or at least different meta-preferences which likely lead to different object-level preferences (so they can't all be right, assuming moral realism).

Moral realism is ill-defined, and some allow that humans and AI would have different types of morally true facts. So it's not too much of a stretch to assume that different humans might have different morally true facts from each other, so I don't see this as being necessarily a problem.

Moral realism through acausal trade is the only version of moral realism that seems to be coherent, and to do that, you still have to synthesise individual preferences first. So "one single universal true morality" does not necessarily contradict "contingent choices in figuring out your own preferences".

[-]Wei Dai7y10

There is the issue of avoiding ignorant-yet-confident meta-preferences, which I’m working on writing up right now (partially thanks to you very comment here, thanks!)

I look forward to reading that. In the meantime can you address my parenthetical point in the grand-parent comment: "correctly extracting William MacAskill’s meta-preferences seems equivalent to learning metaphilosophy from William"? If it's not clear, what I mean is that suppose Will wants to figure out his values by doing philosophy (which I think he actually does), does that mean that under you scheme the AI needs to learn how to do philosophy? If so, how do you plan to get around the problems with applying ML to metaphilosophy that I described in Some Thoughts on Metaphilosophy?

[-]Stuart_Armstrong7y10

There is one way of doing metaphilosophy this way, which is "run (simulated) William MacAskill until he thinks he's found a good metaphilosophy" or "find a description of metaphilosophy to which WA would say 'yes'."

But what the system I've sketched would most likely do is come up with something to which WA would say "yes, I can kinda see why that was built, but it doesn't really fit together as I'd like and has a some of ad hoc and object level features". That's the "adequate" part of the process.

[-]Stuart_Armstrong7y10

My aim is to find a decent synthesis of human preferences. If someone has a specific metaethics and compelling reasons why we should follow that metaethics, I'd then defer to that. The fact I'm focusing my research on the synthesis is because I find that possibility very unlikely (the more work I do, the less coherent moral realism seems to become).

But, as I said, I'm not opposed to moral realism in principle. Looking over your post, I would expect that if 1, 4, 5, or 6 were true, that would be reflected in the synthesis process. Depending on how I interpret it, 2 would be partially reflected in the synthesis process, and 3 maybe very partially.

If there were strong evidence for 2 or 3, then we could either a) include them in the synthesis process, or b) tell humans about them, which would include them in the synthesis process indirectly.

Since I see the synthesis process as aiming for an adequate outcome, rather than an optimal one (which I don't think exists), I'm actually ok with adding in some moral-realism or other assumptions, as I see this as making a small shift among adequate outcomes.

As you can see in this post, I'm also ok with some extra assumptions in how we combine individual preferences.

There's also some moral-realism-for-humans variants, which assume that there are some moral facts which are true for humans specifically, but not for agents in general; this would be like saying there is a unique synthesis process. For those variants, and some other moral realist claims, I expect the process of figuring out partial preferences and synthesising them, will be useful building blocks.

But mainly, my attitude to most moral realist arguments, is "define your terms and start proving your claims". I'd be willing to take part in such a project, if it seemed realistically likely to succeed.

I don't think this is true for me, or maybe I'm misunderstanding what you mean by the two scenarios.

You may not be the most typical of persons :-) What I mean is that if we divided people's lifetimes by a third, or had a vicious totalitarian takeover, or made everyone live in total poverty, then people would find either of these outcomes quite bad, even if we increased lifetimes/democracy/GDP to compensate for the loss along one axis.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

10

10

Basic human preferences

"Reasonable" situations

Synthesising human preferences

Synthesising the preference utility function

Meta-changes to the synthesis method

Non-terrible outcomes

Scope insensitivity to the rescue?

The size of the future

Growth, learning, and following your own preferences

Tolerance levels

Did I think of everything?