As you allude to by discussing shards for cooperative tendencies, the Shard Theory approach seems relevant to intent alignment too, not just value alignment. (For value alignment, the relevance of humans as an example is "How did human values evolve despite natural selection optimizing for something different and more crude?" For intent alignment, the relevance is "How come some humans exhibit genuinely prosocial motivations and high integrity despite not sharing the exact same goals as others?")
Studying the conditions for the evolution of genuinely pr...
I haven't fully understood all of your points, but they gloss as reasonable and good. Thank you for this high-effort, thoughtful comment!
(If anyone is interested in doing research on the evolution of prosociality vs. antisociality in humans and/or how these things might play out in AI training environments, I know people who would likely be interested in funding such work.)
I encourage applicants to also read Quintin's Evolution is a bad analogy for AGI (which I wish more people had read; I think it's quite important). I think that evolution-based analogies ...
How do you then classify this as a counterexample rather than a "non-central (but still valid) manifestation of the theory"?
My only reply is "You know it when you see it." And yes, a crackpot would reason the same way, but non-modest epistemology says that if it's obvious to you that you're not a crackpot, then you have to operate on the assumption that you're not one. (In the alternative scenario, you won't have much impact anyway.)
Specifically, the situation I mean is the following:
My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot."
Oh, certainly Eliezer should trust his intuitions and believe that he's not a crackpot. But I'm not arguing about what the person with the theory should believe, I'm arguing about what outside observers should believe, if they don't have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn't modest epistemology.
It also isn't clear to me that Eliezer has established the strong inferences he draws from noticing this general pattern ("expected utility theory/consequentialism"). But when you asked Eliezer (in the original dialogue) to give examples of successful predictions, I was thinking "No, that's not how these things work." In the mistaken applications of Grand Theories you mention (AGI and capitalism, AGI and symbolic logic, intelligence and ethics, recursive self-improvement and cultural evolution, etc.), the easiest way to point out why they are dumb is with ...
I share the impression that the agent foundations research agenda seemed not that important. But that point doesn't feel sufficient to argue that Eliezer's pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I'm not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.) MIRI has deprioritized agent foundations research for quite a while now. I also just think it's extremely common for people to have periods where they work on ...
Leaning on this, someone could write a post about the "infectiousness of realism" since it might be hard to reconcile openness to non-zero probabilities of realism with anti-realist frameworks? :P
For people who believe their actions matter infinitely more if realism is true, this could be modeled as an overriding meta-preference to act as though realism is true. Unfortunately, if realism isn't true, this could go in all kinds of directions depending on how the helpful AI system would expect to get into such a judged-to-be-wrong epistemic state...
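To make the "overriding meta-preference" framing concrete, here is a minimal expected-value sketch of the wager structure being described. All the numbers and the two-option framing are illustrative assumptions, not claims about anyone's actual credences; the point is only that a small nonzero credence in realism can dominate the decision if the agent judges the realist stakes to be vastly larger:

```python
def expected_value(p_realism, value_if_realism, value_if_antirealism):
    """Expected value over the two metaethical possibilities."""
    return p_realism * value_if_realism + (1 - p_realism) * value_if_antirealism

# Small but nonzero credence in realism (illustrative number).
p = 0.01

# Option A: act as though realism is true. If realism holds, the agent
# judges the payoff to be enormous; otherwise roughly nothing is gained.
ev_act_realist = expected_value(p, value_if_realism=1e9, value_if_antirealism=0)

# Option B: act on one's anti-realist preferences. Modest payoff either way
# by the agent's own lights, but nothing on the realist branch.
ev_act_antirealist = expected_value(p, value_if_realism=0, value_if_antirealism=100)

print(ev_act_realist > ev_act_antirealist)  # the realist wager dominates
```

Whether this wager actually goes through (rather than being a Pascalian mugging) is exactly the kind of question the comment above gestures at.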
Suppose the agent you're trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don't see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate huma...