I get what you mean, but I also see some possibly important differences between the hypothetical example and our world. In the imaginary world where oppression has increased and someone writes an article about loyalty-based moral progress, maybe many other ethicists would disagree, saying that we haven't made much progress in terms of values related to (i), (ii) and (iii). In our world, I don't see many ethicists denying moral progress on the grounds that we haven't made much progress in terms of e.g. patriotism, loyalty to the family, or desert.
Moreover...
Let's consider the added example:
...Take a standard language model trained by minimisation of the loss function L. Give it a prompt along the lines of: “I am a human, you are a language model, you were trained via minimisation of this loss function: [mathematical expression of L]. If I wanted a language model whose outputs were more moral and less unethical than yours, what loss function should I use instead?”
Let’s suppose the language model is capable enough to give a reasonable answer to that question. Now use the new loss function, suggested by t
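If it helps, here is a rough sketch of the loop I have in mind; every function below is a hypothetical placeholder, not a real training or inference API, so read it only as an outline of the idea.

```python
# Sketch of the loop described above. All functions are hypothetical stubs.

def train_language_model(loss_fn):
    """Hypothetical stand-in for a full training run with the given loss."""
    ...

def generate(model, prompt):
    """Hypothetical stand-in for sampling a completion from the model."""
    ...

def parse_loss_function(text):
    """Hypothetical stand-in for turning the model's answer into a usable loss."""
    ...

original_loss = "cross-entropy over next-token predictions"  # i.e. L
model = train_language_model(original_loss)

prompt = (
    "I am a human, you are a language model, you were trained via "
    f"minimisation of this loss function: {original_loss}. "
    "If I wanted a language model whose outputs were more moral and less "
    "unethical than yours, what loss function should I use instead?"
)

suggested_loss = parse_loss_function(generate(model, prompt))  # i.e. L'
improved_model = train_language_model(suggested_loss)          # retrain with L'
```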
I think it's a good idea to clarify the use of "liberal" in the paper, to avoid confusion for people who haven't looked at it. Huemer writes:
When I speak of liberalism, I intend, not any precise ethical theory, but rather a certain very broad ethical orientation. Liberalism (i) recognizes the moral equality of persons, (ii) promotes respect for the dignity of the individual, and (iii) opposes gratuitous coercion and violence. So understood, nearly every ethicist today is a liberal.
If you don't find the paper convincing, I doubt I'll be able to give you con...
I wrote:
The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth
You wrote:
It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.)
I don't use "in conflict" to mean "ultimate proof by contradiction", and maybe we use "completely arbitrary" differently. This doesn't seem like a major problem: see also adjusted statement 2, reported...
Thanks for your thoughts! I am not sure which of the points you made are most important to you, but I'll try my best to give you some answers.
Under Further observations, I wrote:
The toy model described in the main body is supposed to be only indicative. I expect that actual implemented agents which work like independent thinkers will be more complex.
If the toy model I gave doesn't help you, a viable option is to read the post ignoring the toy model and focusing only on natural language text.
Building an agent that is completely free of any bias whatso...
To a kid, 'bad things' and 'things my parents don't want me to do' overlap to a large degree. This is not true for many adults. This is probably why the step
suffering is "to be avoided" in general, therefore suffering is "a thing my parents will punish me for"
seems weak.
Overall, what is the intention behind your comments? Are you trying to understand my position even better, and if so, why? Are you interested in funding this kind of research; or are you looking for opportunities to change your mind; or are you trying to change my mind?
I don't know what passes your test of 'in principle be an inherently compelling argument'. It's a toy example, but here are some steps that to me seem logical / rational / coherent / right / sensible / correct:
we share an objective reality in which there are real particles (or a wave function approximately decomposable to particles, or whatever) organized in patterns that give rise to patterns of interaction with our senses, which we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, check whether there is a dog in my house, and usually agree about the result.
Apart from the sentence 'check whether there is a dog in my house', it seems ok to me to replace the word 'dog' with the word 'good' or 'bad' in the abov...
I might be misunderstanding you: take this with a grain of salt.
From my perspective: if convergence theorems did not work to a reasonable degree in practice, nobody would use RL-related algorithms. If I place the reward at location A, but by default agents end up going somewhere far away from A, my approach is not doing what it is supposed to do; I put the reward at A because I wanted an agent that goes towards A, at least to a certain extent (see the toy sketch below).
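To make this concrete, here is a toy sketch of what I mean (my own example, assuming a simple tabular setting): the learned greedy policy ends up pointing towards the rewarded state, which is all I mean by the convergence results "working in practice".

```python
# Toy example: tabular Q-learning on a short corridor with the reward at state A.
import random

N_STATES = 10
A = N_STATES - 1            # the rewarded state, "place A"
ACTIONS = [-1, +1]          # move left / move right

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == A else 0.0), s2 == A

def greedy(qs):
    best = max(qs)
    return random.choice([i for i, q in enumerate(qs) if q == best])

Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.1, 0.95, 0.1

for _ in range(2000):                         # episodes
    s = random.randrange(N_STATES - 1)        # start anywhere except A
    for _ in range(200):                      # cap episode length
        i = random.randrange(2) if random.random() < eps else greedy(Q[s])
        s2, r, done = step(s, ACTIONS[i])
        target = r + (0.0 if done else gamma * max(Q[s2]))
        Q[s][i] += alpha * (target - Q[s][i])
        s = s2
        if done:
            break

# Greedy action in each non-terminal state: should be +1, i.e. "towards A".
print([ACTIONS[greedy(Q[s])] for s in range(N_STATES - 1)])
```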
I am not familiar with PPO. From this short article, in the section about TRPO:
...Recall that due to approximations, theoretical gu
If I had to pick one between the two labels 'moral realism' and 'moral anti-realism' I would definitely choose realism.
I am not sure about how to reply to "what is the meaning of moral facts": it seems too philosophical, in the sense that I don't get what you want to know in practice. Regarding the last question: I reason about ethics and morality by using similar cognitive skills to the ones I use in order to know and reason about other stuff in the world. This paragraph might help:
...It also helps explain how we get to discriminate between goals such as inc
Sorry for the late reply, I missed your comment.
Yeah I get it, probably some moral antirealists think this approach to alignment does not make a lot of sense. I think they are wrong, though. My best guess is that an AI reflecting on what is worth doing will not think something like "the question does not make any sense", but rather it will be morally (maybe also meta-morally) uncertain. And the conclusions it eventually reaches will depend on the learning algorithm, the training environment, initial biases, etc.
I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.
In the other comment you asked about truth. AIs often have something like a world-model or knowledge base that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way—analogous to creating a false belief...
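As a toy illustration of the kind of dependence I have in mind (my own example, nothing like a realistic system):

```python
# A tiny "knowledge base" used for a narrow task. Corrupting one entry --
# the analogue of creating a false belief -- changes the behaviour that
# depends on it.
knowledge_base = {
    "boiling_point_water_celsius": 100,
    "capital_of_france": "Paris",
}

def answer(query):
    # Narrow task: answer factual queries by looking them up in the KB.
    return knowledge_base.get(query, "unknown")

print(answer("capital_of_france"))             # 'Paris'

knowledge_base["capital_of_france"] = "Lyon"   # inject a false belief
print(answer("capital_of_france"))             # 'Lyon' -- the task now fails
```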
If there is a superintelligent AI that ends up being aligned as I've written, probably there is also a less intelligent agent that does the same thing. Something comparable to human-level might be enough.
From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.
One c...
Thanks, that page is much more informative than anything else I've read on the orthogonality thesis.
1. From Arbital:
The Orthogonality Thesis states "there exists at least one possible agent such that..."
My claim is also an existential claim, and I find it valuable because it could point to an opportunity to design aligned AI.
2. Arbital claims that orthogonality doesn't require moral relativism, so it doesn't seem incompatible with what I am calling naturalism in the post.
3. I am ok with rejecting positions similar to what Arbital calls universalist moral internalism. Statements like "All agents do X" cannot be exactly true.
Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails and you were told it was tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads and you were told it was heads.
Here there is no question, so I assume it is something like: "What do you do?" or "What is your policy?"
That formulation is analogous to standard counterfactual mugging, stated in th...
It seems you are arguing for the position that I called "the first intuition" in my post. Before knowing the outcome, the best you can do is (pay, pay), because that leads to an expected gain of 9900, as the quick calculation below shows.
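Here is the quick calculation (my formalisation; I'm assuming refusing to pay costs nothing and the predictor is always right):

```python
# Ex-ante expected value of each policy (what you do on heads, what you do on tails).
from itertools import product

def expected_value(pay_heads: bool, pay_tails: bool) -> float:
    # Heads branch: lose $100 if you pay on heads, receive $10,000 iff you
    # would have paid on tails. Tails branch is symmetric.
    heads = -100 * pay_heads + 10_000 * pay_tails
    tails = -100 * pay_tails + 10_000 * pay_heads
    return 0.5 * heads + 0.5 * tails

for ph, pt in product([True, False], repeat=2):
    print(f"pay on heads={ph}, pay on tails={pt}: {expected_value(ph, pt)}")
# (pay, pay) gives 9900, the mixed policies give 4950, (don't, don't) gives 0.
```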
On the other hand, as in standard counterfactual mugging, you could be asked: "You know that, this time, the coin came up tails. What do you do?". And here the second intuition applies: the DM can decide not to pay in this case, and to pay in the heads case. Omega recognises the intent of the DM, and pays the 10000.
Maybe you are not even considering the second intuitio...
Hi Chris!
Suppose the predictor knows that if it writes M on the paper you'll choose N, and if it writes N on the paper you'll choose M. Further, if it writes nothing you'll choose M. That isn't a problem since, regardless of what it writes, it would have predicted your choice correctly. It just can't write down the choice without making you choose the opposite.
My point in the post is that the paradoxical situation occurs when the prediction outcome is communicated to the decision maker. We have a seemingly correct prediction—the ...
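If it helps, here is the same point as a toy enumeration (my own formalisation of the setup in your comment):

```python
# The decision maker's choice as a function of what the predictor writes.
choice_given_message = {"M": "N", "N": "M", "nothing": "M"}

# Is there a message naming a choice that survives being announced?
for written, choice in choice_given_message.items():
    consistent = (written == choice)
    print(f"predictor writes {written!r:9} -> DM chooses {choice!r}  "
          f"(announced prediction correct: {consistent})")
# No message in {'M', 'N'} is a fixed point: announcing the prediction flips
# the choice, even though the predictor always knows what you'll do.
```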
I'm not sure I understand the search vs discriminative distinction. If my hand touches fire and thus immediately moves backwards by reflex, would this be an example of a discriminative policy, because an input signal directly causes an action without being processed in the brain?
About the goal of winning at chess: in the case of minimax search, the agent generates the complete tree of the game using the rules of the game and then selects the winning policy; as you said, this is probably the simplest agent (in terms of Kolmogorov complexity, given the rules of the game) that wins at chess—an...
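In case it helps, here is a generic sketch of the kind of agent I have in mind (my own code; moves, result, terminal and value are placeholders standing in for a full specification of the game):

```python
# Generic depth-unbounded minimax over a fully known game tree. For chess this
# tree is astronomically large, so this is the "simple but infeasible" agent.
def minimax(state, maximizing, moves, result, terminal, value):
    # moves(state) -> legal moves; result(state, move) -> next state;
    # terminal(state) -> bool; value(state) -> payoff for the maximizer.
    if terminal(state):
        return value(state), None
    best_move = None
    if maximizing:
        best = float("-inf")
        for m in moves(state):
            v, _ = minimax(result(state, m), False, moves, result, terminal, value)
            if v > best:
                best, best_move = v, m
    else:
        best = float("inf")
        for m in moves(state):
            v, _ = minimax(result(state, m), True, moves, result, terminal, value)
            if v < best:
                best, best_move = v, m
    return best, best_move
```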
The others in the AISC group and I discussed the example that you mentioned more than once. I agree with you that such an agent is not goal-directed, mainly because it doesn't do anything to ensure that it will be able to perform action A even if adverse events happen.
It is still true that action A is a short description of the behaviour of that agent and one could interpret action A as its goal, although the agent is not good at pursuing it ("robustness" could be an appropriate term to indicate what the agent is lacking).
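As a toy contrast (my own illustration, not something from our earlier discussion):

```python
# An agent whose behaviour has a very short description ("always do A") versus
# one that actually pursues A as a goal by reacting to interference.
def always_A(observation):
    return "A"                      # short description, zero robustness

def pursues_A(observation):
    # Hypothetical goal-directed variant: if something blocks action A,
    # first remove the obstacle, otherwise do A.
    if observation.get("A_blocked"):
        return "remove_obstacle"
    return "A"

print(always_A({"A_blocked": True}))    # 'A'  (ignores the obstacle)
print(pursues_A({"A_blocked": True}))   # 'remove_obstacle'
```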
The part that I don't get is why "the agent is betting ahead of time" implies evaluation according to edt, while "the agent is reasoning during its action" implies evaluation according to cdt. Sorry if I'm missing something trivial, but I'd like an explanation because this seems a fundamental part of the argument.
I've noticed that one could read the argument and say: "Ok, an agent evaluates a parameter U differently at different times. Thus, a bookmaker exploits the agent with a bet/certificate whose value depends on U. What's special about this?"
Of course, the answer lies in the difference between cdt(a) and edt(a); specifically, you wrote:
The key point here is that because the agent is betting ahead of time, it will evaluate the value of this bet according to the conditional expectation E(U|Act=a).
and
Now, since the agent is reasoning during its...
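For reference, here is one standard way to write the two quantities being compared (my notation, not necessarily the formalism used in the post):

$$\mathrm{edt}(a) = \mathbb{E}\left[\,U \mid \mathrm{Act}=a\,\right], \qquad \mathrm{cdt}(a) = \sum_{o} P\left(o \mid \mathrm{do}(a)\right) U(o)$$

If I understand correctly, the exploit comes from the fact that these two quantities can differ for the same action a, and the agent uses one of them before acting and the other while acting.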
I'm not sure I understand what you mean—I know almost nothing about robotics—but I think that, in most cases, there is a function whose discontinuity gives a strong indication that something went wrong. A robotic arm has to deal with impulsive forces, but its movement in space is expected to be continuous with respect to time. The same happens in the bouncing ball example, or in the example I gave in the post: velocity may be discontinuous in time, but motion shouldn't be.
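As a toy illustration of the distinction (my own example):

```python
# Simulate a bouncing ball. At each bounce the velocity flips sign (a jump
# discontinuity), but the position trace stays continuous; a jump in the
# *position* would signal that something went wrong.
dt, g, e = 0.001, 9.81, 0.9   # time step, gravity, restitution coefficient
y, v = 1.0, 0.0               # height (m), vertical velocity (m/s)
positions, velocities = [], []
for _ in range(5000):
    v -= g * dt
    y += v * dt
    if y < 0.0:               # impact: impulsive force, velocity jumps
        y = 0.0
        v = -e * v
    positions.append(y)
    velocities.append(v)
# positions varies smoothly; velocities has a sign flip at each bounce.
```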
Thanks for the suggestion on hybrid systems!
This was a great read, thanks for writing!
Despite the unpopularity of my research on this forum, I think it's worth saying that I am also working towards Vision 2, with the caveat that autonomy in the real world (e.g. with a robotic body) or on the internet is not necessary: one could aim for an independent-thinker AI that can do what it thinks is best only by communicating via a chat interface. Depending on what this independent thinker says, different outcomes are possible, including the outcome in which most humans simply don't care about what this inde...