All of Richard_Kennaway's Comments + Replies

Here are five conundrums about creating the thing with alignment built in.

  1. The House Elf whose fulfilment lies in servitude is aligned.

  2. The Pig That Wants To Be Eaten is aligned.

  3. The Gammas and Deltas of "Brave New World" are moulded in the womb to be aligned.

  4. "Give me the child for the first seven years and I will give you the man." Variously attributed to Aristotle and St. Ignatius of Loyola.

  5. B. F. Skinner said something similar to (4), to the effect that he could bring up any child to be anything, though I don't have the quote to hand. Edit: it was

... (read more)
Buck Shlegeris
@So8res I'd be really interested in how you thought about these, especially the house elf example.

> One reason we believe other humans are conscious is that other humans are consistently accurate reporters of their own mental states.

I don't think anyone has ever told me they were conscious, or I them, except in the trivial sense of communicating that one has woken up, or is not yet asleep. The reason I attribute the faculty of consciousness to other people is that they are clearly the same sort of thing as myself. A language model is not. It is trained to imitate what people have said, and anything it says about itself is an imitation of what people s... (read more)

Ethan Perez
That's true for pretrained LMs but not after the finetuning phase I've proposed here; this finetuning phase would train the model to answer questions accurately about itself, which would produce fairly different predictions from just imitating humans. I definitely agree that I distrust LM statements of the form "I am conscious" that come from the pretrained LM itself, but that's different from the experiment I'm proposing here. Would you update against other humans being conscious at all, if other humans told you they weren't conscious? If not, that would be fairly surprising to me. If so, that implies you would update towards other humans being conscious if they tell you they are.

Note that arXiv does have some gatekeeping: you must get an "endorsement" before submitting your first paper to any subject area. Details.

JanB

Ah, I had forgotten about this. I'm happy to endorse people or help them find endorsers.

You are proposing "make the right rules" as the solution. Surely this is like solving the problem of how to write correct software by saying "make correct software"? The same approach could be applied to the Confucian approach by saying "make the values right". The same argument made against the Confucian approach can be made against the Legalist approach: the rules are never the real thing that is wanted; people will vary in how assiduously they are willing to follow one or the other, or will hack the rules entirely for their own benefit; then selection eff... (read more)

johnswentworth
I strongly endorse this objection, and it's the main way in which I think the OP is unpolished. I do think there's obviously still a substantive argument here, but I didn't take the time to carefully separate it out. The substantive part is roughly "if the system accepts an inner optimizer with bad behavior, then it's going to accept non-optimizers with the same bad behavior. Therefore, we shouldn't think of the problem as being about the inner optimizers. Rather, the problem is that we accept bad behavior - i.e. bad behavior is able to score highly.".

This opens up a whole different complicated question. First, it's not clear that this analogy holds water at all. There are many kinds-of-things we can do to design AGI environments/incentives which don't have any even vaguely similar analogues in human mechanism design - we can choose the entire "ancestral environment", we can spin up copies at-will, we can simulate in hindsight (so there's never a situation where we won't know after-the-fact what the AI did), etc. Second, in the cases where humans use bad incentive mechanisms, it's usually not because we can't design better mechanisms but because the people who choose the mechanism don't want a "better" one; voting mechanisms and the US government budget process are good examples.

All that said, I do still apply this analogy sometimes, and I think there's an extent to which it's right - namely, trying to align black-box AIs with opaque goals via clever mechanism design, without building a full theory of alignment and human values, will probably fail. But I think a full theory of alignment and human values is likely tractable, which would obviously change the game entirely. It would still be true that "the rules are never the real thing that is wanted", but a full theory would at least let the rules improve in lock-step with capabilities - i.e. more predictive world models would directly lead to better estimates of human values. And I think the analogy would sti
> Virtual evidence requires probability functions to take arguments which aren't part of the event space

Not necessarily. Typically, the events would be all the Lebesgue measurable subsets of the state space. That's large enough to furnish a suitable event to play the role of the virtual evidence. In the example involving A, B, and the virtual event E, one would also have to somehow specify that the dependencies of A and B on E are in some sense independent of each other, but you already need that. That assumption is what gives sequence-independence... (read more)
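
A minimal sketch of the update being described, assuming the standard Pearl-style setup and writing E_A and E_B for the virtual events bearing on A and B (my labels, not notation from the thread):

```latex
% Virtual evidence on A is specified only by a likelihood ratio
% \lambda = P(E_A \mid A) / P(E_A \mid \lnot A); the update is ordinary Bayes on E_A:
P(A \mid E_A) = \frac{\lambda \, P(A)}{\lambda \, P(A) + P(\lnot A)}

% If the dependencies of A and B on the virtual events factor independently,
P(E_A, E_B \mid A, B) = P(E_A \mid A)\, P(E_B \mid B),
% then conditioning on both gives
P(A, B \mid E_A, E_B) \propto P(E_A \mid A)\, P(E_B \mid B)\, P(A, B),
% which is symmetric in the two likelihood factors, so updating on E_A and then
% on E_B gives the same posterior as the reverse order (sequence-independence).
```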

Abram Demski
True, although I'm not sure this would be "virtual evidence" anymore by Pearl's definition. But maybe this is the way it should be handled. A different way of spelling out what's unique about virtual evidence is that, while compatible with Bayes, the update itself is calculated via some other means (i.e., we get the likelihoods from somewhere other than pre-defined likelihoods for the proposition we're updating on).

Does it, though? If you were going to call that background evidence into question for a mere 10^10-to-1 evidence, should the probability have been 10^100-to-1 against in the first place?
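
A hedged worked version of that last point, in odds form (writing H for the hypothesis at issue and E for the evidence, which are labels added here rather than anything from the thread):

```latex
% Bayes in odds form: posterior odds = likelihood ratio x prior odds.
% With prior odds of 10^{100} : 1 against H and a 10^{10} : 1 likelihood ratio favouring H:
\frac{P(H \mid E)}{P(\lnot H \mid E)}
  = \frac{P(E \mid H)}{P(E \mid \lnot H)} \cdot \frac{P(H)}{P(\lnot H)}
  = 10^{10} \cdot 10^{-100}
  = 10^{-90}
% The posterior is still roughly 10^{90} : 1 against H, so evidence of that strength
% alone should not be enough to call the background assumptions into question.
```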