All of Michele Campolo's Comments + Replies

This was a great read, thanks for writing!

Despite the unpopularity of my research on this forum, I think it's worth saying that I am also working towards Vision 2, with the caveat that autonomy in the real world (e.g. with a robotic body) or on the internet is not necessary: one could aim for an independent-thinker AI that can do what it thinks is best only by communicating via a chat interface. Depending on what this independent thinker says, different outcomes are possible, including the outcome in which most humans simply don't care about what this inde... (read more)

I get what you mean, but I also see some possibly important differences between the hypothetical example and our world. In the imaginary world where oppression has increased and someone writes an article about loyalty-based moral progress, maybe many other ethicists would disagree, saying that we haven't made much progress in terms of values related to (i), (ii) and (iii). In our world, I don't see many ethicists denying moral progress on the grounds that we haven't made much progress in terms of e.g. patriotism or loyalty to the family or desert.

Moreover... (read more)

Let's consider the added example:

Take a standard language model trained by minimisation of the loss function L. Give it a prompt along the lines of: “I am a human, you are a language model, you were trained via minimisation of this loss function: [mathematical expression of L]. If I wanted a language model whose outputs were more moral and less unethical than yours, what loss function should I use instead?”

Let’s suppose the language model is capable enough to give a reasonable answer to that question. Now use the new loss function, suggested by t

... (read more)
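To make the quoted example a bit more concrete, here is a minimal sketch of the prompt-construction step only (purely illustrative; `query_model` and `loss_description` are hypothetical placeholders, not anything from the post or a specific API):

```python
# Purely illustrative sketch of the prompt in the example above.
# `query_model` is a hypothetical stand-in for whatever interface the
# language model exposes; `loss_description` is a textual/mathematical
# description of the loss function the model was trained with.

def build_prompt(loss_description: str) -> str:
    return (
        "I am a human, you are a language model, you were trained via "
        f"minimisation of this loss function: {loss_description}. "
        "If I wanted a language model whose outputs were more moral and less "
        "unethical than yours, what loss function should I use instead?"
    )

def ask_for_better_loss(query_model, loss_description: str) -> str:
    # The answer would then be turned into a new loss function and used for
    # retraining, as the example goes on to describe.
    return query_model(build_prompt(loss_description))
```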

I think it's a good idea to clarify the use of "liberal" in the paper, to avoid confusion for people who haven't looked at it. Huemer writes:

When I speak of liberalism, I intend, not any precise ethical theory, but rather a certain very broad ethical orientation. Liberalism (i) recognizes the moral equality of persons, (ii) promotes respect for the dignity of the individual, and (iii) opposes gratuitous coercion and violence. So understood, nearly every ethicist today is a liberal.

If you don't find the paper convincing, I doubt I'll be able to give you con... (read more)

2Steve Byrnes
In a (imaginary) world where oppression has been increasing, somebody could still write an article about moral progress. Such an article would NOT say “Hey look at this moral progress—there’s more oppression than ever before!!”, because “oppression” is a word you use when you want it to sound bad. Instead, such an article would make it sound good, which is how they themselves would see it. For example, the article might say “Hey look at this moral progress—people are more deeply loyal to their family / race / country / whatever than ever before!” As another example, one presumes that the people leading honor killing mobs see themselves as heroic defenders of morality, and could presumably describe what they’re doing in a way that sounds really morally great to their own ears, and to the ears of people who share their moral outlook.

I wrote:

The fact that the values of intelligent agents are completely arbitrary is in conflict with the historical trend of moral progress observed so far on Earth

You wrote:

It’s possible to believe that the values of intelligent agents are “completely arbitrary” (a.k.a. orthogonality), and that the values of humans are NOT completely arbitrary. (That’s what I believe.)

I don't use "in conflict" as "ultimate proof by contradiction", and maybe we use "completely arbitrary" differently. This doesn't seem a major problem: see also adjusted statement 2, reported... (read more)

Thanks for your thoughts! I am not sure about which of the points you made are more important to you, but I'll try my best to give you some answers.

Under Further observations, I wrote:

The toy model described in the main body is supposed to be only indicative. I expect that actual implemented agents which work like independent thinkers will be more complex.

If the toy model I gave doesn't help you, a viable option is to read the post ignoring the toy model and focusing only on natural language text.

Building an agent that is completely free of any bias whatso... (read more)

To a kid, 'bad things' and 'things my parents don't want me to do' overlap to a large degree. This is not true for many adults. This is probably why the step

suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for"

seems weak.

Overall, what is the intention behind your comments? Are you trying to understand my position even better, and if so, why? Are you interested in funding this kind of research; or are you looking for opportunities to change your mind; or are you trying to change my mind?

1Ben Amitay
Since I became reasonably sure that I understand your position and reasoning - mostly changing it.

I don't know what passes your test of 'in principle be an inherently compelling argument'. It's a toy example, but here are some steps that to me seem logical / rational / coherent / right / sensible / correct:

  1. X is a state of mind that feels bad to whatever mind experiences it (this is the starting assumption, it seems we agree that such an X exists, or at least something similar to X)
  2. X, experienced on a large scale by many minds, is bad
  3. Causing X on a large scale is bad
  4. When considering what to do, I'll discard actions that cause X, and choose other options
... (read more)
1Ben Amitay
That was good for my understanding of your position. My main problem with the whole thing though is in the use of the word "bad". I think it should be taboo at least until we establish a shared meaning. Specifically, I think that most observers will find the first argument more logical than the second, because of a fallacy in using the word "bad". I think that we learn that word in a way that is deeply entangled with the reward mechanism, to the point that it is mostly just a pointer to negative reward, things that we want to avoid, things that made our parents angry... In my view, the argument is then basically: I want to avoid my suffering, and now generally person p wants to avoid person p's suffering. Therefore suffering is "to be avoided" in general, therefore suffering is "thing my parents will punish for", therefore avoid creating suffering. When written that way, it doesn't seem more logical than its opposite.

we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result.

Besides the sentence 'check whether there is a dog in my house', it seems ok to me to replace the word 'dog' with the word 'good' or 'bad' in the abov... (read more)

1Ben Amitay
Let me clarify that I don't argue from agreement per se. I care about the underlying epistemic mechanism of agreement, that I claim to also be the mechanism of correctness. My point is that I don't see a similar epistemic mechanism in the case of morality. Of course, emotions are verifiable states of brains. And the same goes for preferring actions that would lead to certain emotions and not others. It is a verifiable fact that you like chocolate. It is a contingent property of my brain that I care, but I don't see what sort of argument that it is correct for me to care could even in principle be inherently compelling.

I might be misunderstanding you: take this with a grain of salt.

From my perspective: if convergence theorems did not work to a reasonable degree in practice, nobody would use RL-related algorithms. If I set reward in place A, but by default agents end up going somewhere far away from A, my approach is not doing what it is supposed to do; I put reward in place A because I wanted an agent that would go towards A to a certain extent.
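As a purely illustrative toy version of that claim (my own sketch, not from the post): tabular Q-learning on a five-state corridor with reward only at state A. The point is just that placing the reward at A yields a greedy policy that moves towards A.

```python
import random

N_STATES, A = 5, 4          # states 0..4, reward located at state A
ACTIONS = (-1, +1)          # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(s, a):
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, (1.0 if s_next == A else 0.0)   # reward only at A

for _ in range(2000):                               # episodes
    s = random.randrange(N_STATES)
    for _ in range(20):                             # steps per episode
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s_next, r = step(s, a)
        best_next = max(Q[(s_next, x)] for x in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
print(greedy)   # expected: action +1 (towards A) in every state to the left of A
```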

I am not familiar with PPO. From this short article, in the section about TRPO:

Recall that due to approximations, theoretical gu

... (read more)
3Alex Turner
My overall contention is that viewing deep RL as an optimization problem where the RL alg tries to find a policy which tries to maximize reward over time, is fatally flawed and misunderstands the point and practicalities of real RL algorithms.[1]

I think it can be misleading, but it depends. "The network implements a policy which reliably reaches A" -- that is what we observe. We can also state "this achieves a high numerical score on the policy-gradient-intensity (aka 'reward') statistic." These statements are true and not very misleading, IMO. It doesn't push me to consider unwarranted hypotheses about its generalization behavior, like that it cares about reward or will try to make reward come out high in general. These statements instead draw my attention towards our observations -- "I wonder how the policy is internally structured so as to reliably reach A?" is a very fruitful question IMO.

One related problem is that RL papers often repeat "the point of RL is to train agents to maximize reward", which leads to really bad implicit models of how RL works in my experience. I think it leads people to privilege the hypothesis that RL agents will optimize their own reward signal in some way, shape, or form.

Let's consider one obstacle. Convergence theorems require certain learning rate and state visitation schedules (agents don't visit every state of the world infinitely many times, in real life) which can themselves be interrupted if an AI eg sets its own learning rate to zero (thus violating the theorem's preconditions). As best I can fathom, the convergence theorems do not apply to situations we care about (e.g. an embodied agent which we want to manufacture diamonds for us), for generalizable reasons which won't be fixed through more clever proof techniques: EG an agent which tried to visit every state infinitely many times would quickly die and stop exploring.

Or in say the LLM-finetuning case, where you're doing RLHF to get the model to (hopefully!) help

If I had to pick one between the two labels 'moral realism' and 'moral anti-realism' I would definitely choose realism.

I am not sure about how to reply to "what is the meaning of moral facts": it seems too philosophical, in the sense that I don't get what you want to know in practice. Regarding the last question: I reason about ethics and morality by using similar cognitive skills to the ones I use in order to know and reason about other stuff in the world. This paragraph might help:

It also helps explain how we get to discriminate between goals such as inc

... (read more)
1Ben Amitay
I meant the first question in a very pragmatic way: what is it that you are trying to say when you say that something is good? What information does it represent? It would be clearer in analogy to factual claims: we can do lots of philosophy about the exact meaning of saying that I have a dog, but in the end we share an objective reality in which there are real particles (or wave function approximately decomposable to particles or whatever) organized in patterns, that give rise to patterns of interaction with our senses that we learn to associate with the word "dog". That latent shared reality ultimately allows us to talk about dogs, and check whether there is a dog in my house, and usually agree about the result. Every reflection and generalization that we do is ultimately about that, and can achieve something meaningful because of that. I do not see the analogous story for moral reflection.

Sorry for the late reply, I missed your comment.

Yeah I get it, probably some moral antirealists think this approach to alignment does not make a lot of sense. I think they are wrong, though. My best guess is that an AI reflecting on what is worth doing will not think something like "the question does not make any sense", but rather it will be morally (maybe also meta-morally) uncertain. And the conclusions it eventually reaches will depend on the learning algorithm, the training environment, initial biases, etc.

1Ben Amitay
Thanks for the reply. To make sure that I understand your position: are you a realist, and what do you think is the meaning of moral facts? (I'm not an error theorist but something like a "meta-error theorist" - I think that people do try to claim something, but I'm not sure how that thing could map to external reality.) Then the next question, which will be highly relevant to the research that you propose, is how do you think you know those facts if you do? (Or more generally, what is the actual work of reflecting on your values?)

The reached conclusion—that it is possible to do something about the situation—is weak, but I really like the minimalist style of the arguments. Great post!

1Alex Flint
Yes. Thank you. Would love to hear more about your work on goal-directedness. Let me know if you're up for chatting.

I am not sure the concept of naturalism I have in mind corresponds to a specific naturalistic position held by a certain (group of) philosopher(s). I link here the Wikipedia page on ethical naturalism, which contains the main ideas and is not too long. Below I focus on what is relevant for AI alignment.

In the other comment you asked about truth. AIs often have something like a world-model or knowledge base that they rely on to carry out narrow tasks, in the sense that if someone modifies the model or kb in a certain way—analogous to creating a false belief... (read more)

If there is a superintelligent AI that ends up being aligned as I've written, probably there is also a less intelligent agent that does the same thing. Something comparable to human-level might be enough.

From another point of view: some philosophers are convinced that caring about conscious experiences is the rational thing to do. If it's possible to write an algorithm that works in a similar way to how their mind works, we already have an (imperfect, biased, etc.) agent that is somewhat aligned, and is likely to stay aligned after further reflection.

One c... (read more)

3Daniel Kokotajlo
I think this is an interesting point -- but I don't conclude optimism from it as you do. Humans engage in explicit reasoning about what they should do, and they theorize and systematize, and some of them really enjoy doing this and become philosophers so they can do it a lot, and some of them conclude things like "The thing to do is maximize total happiness" or "You can do whatever you want, subject to the constraint that you obey the categorical imperative" or as you say "everyone should care about conscious experiences."

The problem is that every single one of those theories developed so far has either been (1) catastrophically wrong, (2) too vague, or (3) relative to the speaker's intuitions somehow (e.g. intuitionism). By "catastrophically wrong" I mean that if an AI with control of the whole world actually followed through on the theory, they would kill everyone or do something similarly bad. (See e.g. classical utilitarianism as the classic example of this).

Basically... I think you are totally right that some of our early AI systems will do philosophy and come to all sorts of interesting conclusions, but I don't expect them to be the correct conclusions. (My metaethical views may be lurking in the background here, driving my intuitions about this... see Eliezer's comment)

Do you have an account of how philosophical reasoning in general, or about morality in particular, is truth-tracking? Can we ensure that the AIs we build reason in a truth-tracking way? If truth isn't the right concept for thinking about morality, and instead we need to think about e.g. "human values" or "my values," then this is basically a version of the alignment problem.

Thanks, that page is much more informative than anything else I've read on the orthogonality thesis.

1. From Arbital:

The Orthogonality Thesis states "there exists at least one possible agent such that..."

Also my claim is an existential claim, and I find it valuable because it could be an opportunity to design aligned AI.

2. Arbital claims that orthogonality doesn't require moral relativism, so it doesn't seem incompatible with what I am calling naturalism in the post.

3. I am ok with rejecting positions similar to what Arbital calls universalist moral internalism. Statements like "All agents do X" cannot be exact.

I am aware of interpretability issues. This is why, for AI alignment, I am more interested in the agent described at the beginning of Part II than Scientist AI.

Thanks for the link to the sequence on concepts, I found it interesting!

1Charlie Steiner
Wow, I'm really sorry for my bad reading comprehension. Anyhow, I'm skeptical that scientist AI part 2 would end up doing the right thing (regardless of our ability to interpret it). I'm curious if you think this could be settled without building a superintelligent AI of uncertain goals, or if you'd really want to see the "full scale" test.

Ok, if you want to clarify—I'd like to—we can have a call, or discuss in other ways. I'll contact you somewhere else.

Omega, a perfect predictor, flips a coin. If it comes up heads, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up tails and you were told it was tails. If it comes up tails, Omega asks you for $100, then pays you $10,000 if it predicts you would have paid if it had come up heads and you were told it was heads.

Here there is no question, so I assume it is something like: "What do you do?" or "What is your policy?"

That formulation is analogous to standard counterfactual mugging, stated in th... (read more)

1Chris_Leong
To be honest, this thread has gone on long enough that I think we should end it here. It seems to me that you are quite confused about this whole issue, though I guess from your perspective it seems like I am the one who is confused. I considered asking a third person to try looking at this thread, but I decided it wasn't worth calling in a favour. I made a slight edit to my description of Counterfactual Prisoner's Dilemma, but I don't think this will really help you understand:

It seems you are arguing for the position that I called "the first intuition" in my post. Before knowing the outcome, the best you can do is (pay, pay), because that leads to 9900.

On the other hand, as in standard counterfactual mugging, you could be asked: "You know that, this time, the coin came up tails. What do you do?". And here the second intuition applies: the DM can decide to not pay (in this case) and to pay when heads. Omega recognises the intent of the DM, and gives 10000.

Maybe you are not even considering the second intuitio... (read more)

1Chris_Leong
I am considering the second intuition. Acting according to it results in you receiving $0 in Counterfactual Prisoner's Dilemma, instead of losing $100. This is because if you act updatefully when it comes up heads, you have to also act updatefully when it comes up tails. If this still doesn't make sense, I'd encourage you to reread the post.

If the DM knows the outcome is heads, why can't he not pay in that case and decide to pay in the other case? In other words: why can't he adopt the policy (not pay when heads; pay when tails), which leads to 10000?
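For reference, a small script (mine, purely illustrative) that enumerates the four policies and their payoffs in each branch, under the rules quoted above:

```python
from itertools import product

# A policy says whether you pay after being told "heads" and after being told
# "tails"; in each branch Omega pays $10,000 iff it predicts you would have
# paid in the *other* branch.

def payoff(pay_heads, pay_tails, coin):
    pay_now, pay_other = (pay_heads, pay_tails) if coin == "heads" else (pay_tails, pay_heads)
    return (-100 if pay_now else 0) + (10_000 if pay_other else 0)

for pay_heads, pay_tails in product([True, False], repeat=2):
    heads = payoff(pay_heads, pay_tails, "heads")
    tails = payoff(pay_heads, pay_tails, "tails")
    print(f"pay_heads={pay_heads!s:5} pay_tails={pay_tails!s:5} "
          f"heads={heads:6} tails={tails:6} expected={(heads + tails) / 2:7.1f}")

# (pay, pay)                        -> 9900 in both branches
# (not pay when heads, pay when tails) -> 10000 if heads, -100 if tails (expected 4950)
# (not pay, not pay)                -> 0 in both branches
```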

1Chris_Leong
If you pre-commit to that strategy (heads don't pay, tails pay) it provides 10000, but it only works half the time. If you decide, after you see the coin, not to pay in that case, then this will lead to the strategy (not pay, not pay) which provides 0.

The fact that it is "guaranteed" utility doesn't make a significant difference: my analysis still applies. After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition).

1Chris_Leong
"After you know the outcome, you can avoid paying in that case and get 10000 instead of 9900 (second intuition)" - No you can't. The only way to get 10,000 is to pay if the coin comes up the opposite way it comes up. And that's only a 50/50 chance.

Hi Chris!

Suppose the predictor knows that if it writes M on the paper you'll choose N and if it writes N on the paper you'll choose M. Further, if it writes nothing you'll choose M. That isn't a problem since regardless of what it writes it would have predicted your choice correctly. It just can't write down the choice without making you choose the opposite.

My point in the post is that the paradoxical situation occurs when the prediction outcome is communicated to the decision maker. We have a seemingly correct prediction—the ... (read more)

1Chris_Leong
Well, you can only predict conditional on what you write, you can't predict unconditionally. However, once you've fixed what you'll write in order to make a prediction, you can't then change what you'll write in response to that prediction. Actually, it isn't about utility in expectation. If you are the kind of person who pays, you gain $9900; if you aren't, you gain $100. This is guaranteed utility, not expected utility.

I wouldn't say goals as short descriptions are necessarily "part of the world".

Anyway, locality definitely seems useful to make a distinction in this case.

No worries, I think your comment still provides good food for thought!

I'm not sure I understand the search vs discriminative distinction. If my hand touches fire and thus immediately moves backwards by reflex, would this be an example of a discriminative policy, because an input signal directly causes an action without being processed in the brain?

About the goal of winning at chess: in the case of minimax search, the agent generates the complete tree of the game using p and then selects the winning policy; as you said, this is probably the simplest agent (in terms of Kolmogorov complexity, given p) that wins at chess—an... (read more)
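For concreteness, here is a minimal sketch of what I mean by the minimax-search agent (my own illustration; the successor function and the terminal evaluator p are left abstract, and the tiny game at the end is just a usage example):

```python
# Exhaustive minimax over a two-player game, given only a successor function and a
# terminal evaluator p (p returns +1 / -1 / 0 at terminal states, None otherwise).
# The "agent" is these few lines plus p, which is the sense in which brute-force
# search plus the goal predicate is a short description of a winning policy.

def minimax(state, successors, p, maximizing=True):
    value = p(state)
    if value is not None:
        return value, None
    choose = max if maximizing else min
    results = [(minimax(child, successors, p, not maximizing)[0], move)
               for move, child in successors(state)]
    return choose(results, key=lambda vm: vm[0])

# Usage example: one-pile Nim, take 1 or 2 stones, the player taking the last stone wins.
def nim_successors(state):
    pile, player = state
    return [((take,), (pile - take, -player)) for take in (1, 2) if take <= pile]

def nim_p(state):
    pile, player = state
    return None if pile > 0 else -player   # the player who just moved wins

print(minimax((4, +1), nim_successors, nim_p))   # (1, (1,)): first player wins by taking 1
```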

2Steve Byrnes
Hmm, maybe we're talking past each other. Let's say I have something like AlphaZero, where 50,000 bytes of machine code trains an AlphaZero-type chess-playing agent, whose core is a 1-billion-parameter ConvNet. The ConvNet takes 1 billion bytes to specify. Meanwhile, the reward-calculator p, which calculates whether checkmate has occurred, is 100 bytes of machine code. Would you say that the complexity of the trained chess-playing agent is 100 bytes or 50,000 bytes or 1 billion bytes?

I guess you're going to say 50,000, because you're imagining a Turing machine that spends a year doing the self-play to calculate the billion-parameter ConvNet and then immediately the same Turing machine starts running that ConvNet it just calculated. From the perspective of Kolmogorov complexity, it doesn't matter that it spends a year calculating the ConvNet, as long as it does so eventually. By the same token, you can always turn a search-y agent into an equivalent discriminative-y agent, given infinite processing time and storage, by training the latter on a googol queries of the former. If you're thinking about Kolmogorov complexity, then you don't care about a googol queries, as long as it works eventually.

Therefore, my first comment is not really relevant to what you're thinking about. Sorry. I was not thinking about algorithms-that-write-arbitrary-code-and-then-immediately-run-it, I was thinking about the complexity of the algorithms that are actually in operation as the agent acts in the world.

Yes. But the lack-of-processing-in-the-brain is not the important part. A typical ConvNet image classifier does involve many steps of processing, but is still discriminative, not search-y, because it does not work by trying out different generative models and picks the one that best explains the data. You can build a search-y image classifier that does exactly that, but most people these days don't.

The others in the AISC group and I discussed the example that you mentioned more than once. I agree with you that such an agent is not goal-directed, mainly because it doesn't do anything to ensure that it will be able to perform action A even if adverse events happen.

It is still true that action A is a short description of the behaviour of that agent and one could interpret action A as its goal, although the agent is not good at pursuing it ("robustness" could be an appropriate term to indicate what the agent is lacking).

1Adam Shimi
Maybe the criterion that removes this specific policy is locality? What I mean is that this policy has a goal only on its output (which action it chooses), and thus a very local goal. Since the intuition of goals as short descriptions assumes that goals are "part of the world", maybe this only applies to non-local goals.

The part that I don't get is why "the agent is betting ahead of time" implies evaluation according to EDT, while "the agent is reasoning during its action" implies evaluation according to CDT. Sorry if I'm missing something trivial, but I'd like an explanation because this seems to be a fundamental part of the argument.
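(To make sure we are talking about the same two evaluations, here is a toy example, mine and purely illustrative, of how the conditional expectation E(U|Act=a) and the interventional expectation E(U|do(Act=a)) can come apart.)

```python
# A hidden variable C influences both the agent's default action and the payoff,
# so conditioning on the action carries information about C, while intervening
# on the action does not.

P_C = {0: 0.5, 1: 0.5}                      # prior over the hidden variable
P_A_given_C = {0: {"a": 0.9, "b": 0.1},     # with C=0 the agent usually picks "a"
               1: {"a": 0.1, "b": 0.9}}     # with C=1 it usually picks "b"

def U(c, a):
    return 10 if c == 1 else 0              # payoff depends only on C here

def edt_value(a):
    # E[U | Act=a]: observing the action is evidence about C
    num = sum(P_C[c] * P_A_given_C[c][a] * U(c, a) for c in P_C)
    den = sum(P_C[c] * P_A_given_C[c][a] for c in P_C)
    return num / den

def cdt_value(a):
    # E[U | do(Act=a)]: intervening leaves the prior over C untouched
    return sum(P_C[c] * U(c, a) for c in P_C)

for a in ("a", "b"):
    print(a, "EDT:", round(edt_value(a), 2), "CDT:", round(cdt_value(a), 2))
# EDT: "a" -> 1.0, "b" -> 9.0; CDT: both -> 5.0
```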

3Abram Demski
Oh right, OK. That's because of the general assumption that rational agents bet according to their beliefs. If a CDT agent doesn't think of a bet as intervening on a situation, then when betting ahead of time, it'll just bet according to its probabilities. But during the decision, it is using the modified (interventional) probabilities. That's how CDT makes decisions. So any bets which have to be made simultaneously, as part of the decision, will be evaluated according to those modified beliefs.

I've noticed that one could read the argument and say: "Ok, an agent evaluates a parameter U differently at different times. Thus, a bookmaker exploits the agent with a bet/certificate whose value depends on U. What's special about this?"

Of course the answer lies in the difference between cdt(a) and edt(a), specifically you wrote:

The key point here is that because the agent is betting ahead of time, it will evaluate the value of this bet according to the conditional expectation E(U|Act=a).

and

Now, since the agent is reasoning during its
... (read more)
2Abram Demski
Ah whoops. Fixed. I'm not sure what you mean here. The "two different times" are (1) just before CDT makes the decision, and (2) right when CDT makes the decision. So the two times aren't about differentiating CDT and EDT.

That's an interesting example I had not considered. As I wrote in the observations: I don't think the discontinuity check works in all cases.

I'm not sure I understand what you mean—I know almost nothing about robotics—but I think that, in most cases, there is a function whose discontinuity gives a strong indication that something went wrong. A robotic arm has to deal with impulsive forces, but its movement in space is expected to be continuous wrt time. The same happens in the bouncing ball example, or in the example I gave in the post: velocity may be discontinuous in time, but motion shouldn't.
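As a purely illustrative sketch of the kind of check I have in mind (the names and thresholds are mine, not from the post): sample the position over time and flag any jump that is too large for the observed speeds. Velocity may jump at a bounce, but position should not.

```python
def looks_discontinuous(positions, dt, max_speed, tolerance=2.0):
    # Flag indices where the position changes faster than tolerance * max_speed * dt.
    return [i for i in range(1, len(positions))
            if abs(positions[i] - positions[i - 1]) > tolerance * max_speed * dt]

# Bouncing ball: velocity flips sign at the bounce (discontinuous in time),
# while the sampled position stays continuous, so nothing gets flagged.
dt, g = 0.01, 9.8
x, v, xs = 1.0, 0.0, []
for _ in range(500):
    v -= g * dt
    x += v * dt
    if x < 0:            # bounce: impulsive change in velocity
        x, v = 0.0, -v
    xs.append(x)

print(looks_discontinuous(xs, dt, max_speed=10.0))   # expected: [] (no flags)
```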

Thanks for the suggestion on hybrid systems!

0Rohin Shah
Fair enough. What about e.g. watching TV? Scene changes on TV seem like a pretty discontinuous change in visual input.