Thanks to Rebecca Gorman for discussions that led to these insights.

How can you get a superintelligent AI aligned with human values? There are two pathways that I often hear discussed. The first sees a general alignment problem - how to get a powerful AI to safely do anything - which, once we've solved, we can point towards human values. The second perspective is that we can only get alignment by targeting human values - these values must be aimed at, from the start of the process.

I'm of the second perspective, but I think it's very important to sort this out. So I'll lay out some of the arguments in its favour, to see what others think of it, and so we can best figure out the approach to prioritise.

More strawberry, less trouble

As an example of the first perspective, I'll take Eliezer's AI task, described here:

  • "Place, onto this particular plate here, two strawberries identical down to the cellular but not molecular level." A 'safely' aligned powerful AI is one that doesn't kill everyone on Earth as a side effect of its operation.

If an AI accomplishes this limited task without going crazy, this shows several things:

  1. It is superpowered; the task described is beyond current human capabilities.
  2. It is aligned (or at least alignable) in that it can accomplish a task in the way intended, without wireheading the definitions of "strawberry" or "cellular".
  3. It is safe, in that it has not dramatically reconfigured the universe to accomplish this one goal.

Then, at that point, we can add human values to the AI, maybe via "consider what these moral human philosophers would conclude if they thought for a thousand years, and do that".

I would agree that, in most cases, an AI that accomplished that limited task safely would be aligned. One might quibble that it's only pretending to be aligned, and preparing a treacherous turn. Or maybe the AI was boxed in some way and accomplished the task with the materials at hand within the box.

So we might call an AI "superpowered and aligned" if it accomplished the strawberry copying task (or a similar one) and if it could dramatically reconfigure the world but chose not to.

Values are needed

I think that an AI could not be "superpowered and aligned" unless it is also aligned with human values.

The reason is that the AI can and has to interact with the world. It has the capability to do so, by assumption - it is not contained or boxed. It must do so because any agent affects the world, through chaotic effects if nothing else. A superintelligence is likely to have impacts in the world simply through its existence being known, and if the AI finds it efficient to have interactions with the world (e.g. ordering some extra resources) then it will do so.

So the AI can and must have an impact on the world. We want it to not have a large or dangerous impact. But, crucially, "dangerous" and "large" are defined by human values.

Suppose that the AI realises that its actions have slightly imbalanced the Earth in one direction, and that, within a billion years, this will cause significant deviations in the orbits of the planets, deviations it can estimate. Compared with that amount of mass displaced, the impact of killing all humans everywhere is a trivial one indeed. We certainly wouldn't want it to kill all humans in order to be able to carefully balance out its impact on the orbits of the planets!

There are very "large" impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people's personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.). If the AI accomplishes its task with a universal constructor or by unleashing hordes of nanobots that gather resources from the world (without disrupting human civilization), it still has to decide whether to allow humans access to the constructors or nanobots after it has finished copying the strawberry - and which humans to allow this access to.

So every decision the AI makes is a tradeoff in terms of its impact on the world. Navigating this requires it to have a good understanding of our values. It will also need to estimate the value of certain situations beyond the human training distribution - if only to avoid these situations. Thus a "superpowered and aligned" AI needs to solve the problem of model splintering, and to establish a reasonable extrapolation of human values.

Model splintering sufficient?

The previous sections argue that learning human values (including model splintering) is necessary for instantiating an aligned AI; thus the "define alignment and then add human values" approach will not work.

Thus, if you give this argument much weight, learning human values is necessary for alignment. I personally feel that it's also (almost) sufficient, in that the skill in navigating model splintering, combined with some basic human value information (as given, for example, by the approach here) is enough to get alignment even at high AI power.

Which path to pursue for alignment

It's important to resolve this argument, as the paths for alignment that the two approaches suggest are different. I'd also like to know if I'm wasting my time on an unnecessary diversion.


I often say things that I think you would interpret as belonging to the first category ("general alignment plus human values").

So the AI can and must have an impact on the world. We want it to not have a large or dangerous impact. But, crucially, "dangerous" and "large" are defined by human values.

This feels like the crux. I certainly agree that "dangerous" and "large" are not orthogonal to / independent of human values, and that as a result any realistic safe AI system will contain some information about human values.

But this seems like a very weak conclusion to me. Of course a superintelligent AI will contain some information about human values. GPT-3 isn't superintelligent and it already contains tons of knowledge about human values; possibly more than I do. You'd have to try really hard to prevent it from containing information about human values.

It seems like you conclude something much stronger, which is something like "we must build in all of human values". I don't see why we can't instead have our AI systems do whatever a well-motivated human would do in a similar principal-agent problem. This certainly involves knowing some amount about human values, but not some extraordinarily large amount that means we might as well just learn everything including in exotic philosophical cases.

(I think my position is pretty similar to Steve's.)

From a later comment:

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?

The same way a well-motivated personal assistant would deal with it. Tell the human of these two possibilities, and ask them which one should be done. Help them with this decision by providing them with true, useful information about what consequences arise from each of the possibilities.

If you are able to perfectly predict their responses in all possible situations, and the final answer depends on (say) the order in which you ask the questions, then go up a meta level: ask them for their preferences about how you go about eliciting information from them and/or helping them with reflection.

If going up meta levels doesn't solve the problem either, then pick randomly amongst the options, or take an average.

If there's time pressure and you can't get their opinions, take your best guess as to which one they'd prefer, and do that one. (One assumes that such a scenario doesn't come up often.)
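As a rough illustration only, the fallback order described above can be written as a small Python sketch. Every name here (`choose_option`, `ask_human`, `ask_meta`, `best_guess`) is hypothetical, and the convention that a callback returns `None` when it yields no stable answer is an assumption of the sketch rather than anything stated in the comment. The "take an average" alternative is left out for simplicity.

```python
import random
from typing import Callable, Optional, Sequence

# Hypothetical callback type: given the options, return the chosen one,
# or None if no stable answer could be obtained.
Elicit = Callable[[Sequence[str]], Optional[str]]


def choose_option(
    options: Sequence[str],
    ask_human: Elicit,
    ask_meta: Elicit,
    best_guess: Callable[[Sequence[str]], str],
    human_available: bool = True,
    rng: Optional[random.Random] = None,
) -> str:
    """Defer to the human first, then fall back step by step."""
    if not human_available:
        # Time pressure: act on the best guess of what they would prefer.
        return best_guess(options)

    # Step 1: explain the options and ask directly.
    answer = ask_human(options)
    if answer is not None:
        return answer

    # Step 2: go up a meta level (ask how they want to be asked).
    answer = ask_meta(options)
    if answer is not None:
        return answer

    # Step 3: nothing settled it, so pick randomly among the options.
    return (rng or random.Random()).choice(list(options))
```

The ordering is the point of the sketch: guessing on the human's behalf only happens when asking is impossible, and randomness is the last resort.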

Generally with these sorts of hypotheticals (EDIT: where a very competent AI fails due to not understanding all of human values), it feels to me like it either (1) isn't likely to come up, or (2) can be solved by deferring to the human, or (3) doesn't matter very much.

Generally with these sorts of hypotheticals, it feels to me like it either (1) isn’t likely to come up, or (2) can be solved by deferring to the human, or (3) doesn’t matter very much.

What do you think about the following examples:

  1. AI persuasion - My AI receives a message from my friend containing a novel moral argument relevant to some decision I'm about to make, but it's not sure if it's safe to show the message to me, because my friend may have been compromised by a hostile AI and is now in turn trying to compromise me.
  2. User-requested value lock-in - I ask my AI to help me become more faithful to some religion/cause/morality.
  3. Acausal attacks - I (or my AI) become concerned that I may be living in a simulation and will be punished unless I do what my simulators want.
  4. Bargaining/coordination - Many AIs are merging together for better coordination and economy of scale, for example by setting the utility function of the merged AI to a weighted average of their individual utility functions, so I have to come up with a utility function (or whatever the merger will be based on) if I want to join the bargain. If I fail to do this, I risk falling behind in subsequent military or economic competition.
  5. Threats - Someone (in the same universe) communicates to me that unless I do what they want (i.e., hand most of my resources to them), they'll create a vast amount of simulated suffering.

  1. Your AI should tell you that it's worried about your friend being compromised, make sure you have an understanding of the consequences, and then go with your decision.
  2. Seems fine. Maybe your AI warns you about the risks before helping.
  3. Seems like an important threat that you (and your AI) should try to resolve.
  4. If you mean a utility function over universe histories (as opposed to e.g. utility functions over some finite set of high-level outcomes) this seems pretty rough. Mostly I would hope that this situation doesn't arise, because none of the humans can come up with utility functions in this way, and the AIs that are aligned with humans have other ways of cooperating that don't require eliciting a utility function over universe histories.
  5. Idk, seems pretty unclear, but I'd hope that these situations can't come up thanks to laws that prevent people from enforcing such threats.

In addition to

(1) isn’t likely to come up, or (2) can be solved by deferring to the human, or (3) doesn’t matter very much.

I should probably add a fourth:

(4) can be solved through governance (laws, regulations, norms, etc)

  1. Your AI should tell you that it’s worried about your friend being compromised, make sure you have an understanding of the consequences, and then go with your decision.

I think unless we make sure the AI can distinguish between "correct philosophy" or "well-intentioned philosophy" and "philosophy optimized for persuasion", each human will become either compromised (if they're not very cautious and read such messages) or isolated from the rest of humanity with regard to philosophical discussion (if they are cautious and discard such messages). This doesn't seem like an ok outcome to me. Can you explain more why you aren't worried?

  2. Seems fine. Maybe your AI warns you about the risks before helping.

I can imagine that if you subscribe to a metaethics in which a person can't be wrong about morality (i.e., some version of anti-realism), then you might think it's fine to lock in whatever values one currently thinks they ought to have. Is this your reason for "seems fine", or something else? (If so, I think nobody should be so certain about metaethics at this point.)

  3. Seems like an important threat that you (and your AI) should try to resolve.

If the AI isn't very good at dealing with "exotic philosophical cases" then it's not going to be of much help with this problem, and a lot of humans (including me) probably aren't very good at thinking about this either, so we probably end up with a lot of humans succumbing to such acausal attacks.

  4. Mostly I would hope that this situation doesn’t arise, because none of the humans can come up with utility functions in this way, and the AIs that are aligned with humans have other ways of cooperating that don’t require eliciting a utility function over universe histories.

Do you have any suggestions for this? Or some other reason to think that AIs that are aligned with different humans will find ways to cooperate (as efficiently as merging utility functions probably will be), without either a full understanding of human values or risking permanent loss of some parts of their complex values?

  5. Idk, seems pretty unclear, but I’d hope that these situations can’t come up thanks to laws that prevent people from enforcing such threats.

Agreed that's a possible good outcome, but seems far from a sure thing. Such laws would have to be more intrusive than anything people are currently used to, since attackers can create simulated suffering within the "privacy" of their own computers or minds. I suppose if such threats become a serious problem that causes a lot of damage, people might agree to trade off their privacy for security. The law might then constitute a risk in itself, as the implementation mechanism might be subverted/misused to create a form of totalitarianism.

Another issue is if there are powerful unaligned AIs or rogue states who think they can use such threats to asymmetrically gain advantage, they won't agree to such laws.

(4) can be solved through governance (laws, regulations, norms, etc)

I think COVID shows that we often can't do this even when it's relatively trivial (or can only do it with a huge time delay). For example COVID could have been solved at very low cost (relative to the actual human and economic damage it inflicted) if governments had stockpiled enough high filtration elastomeric respirators for everyone, mailed them out at the start of the pandemic, and mandated their use. (Some EAs are trying to convince some governments to do this now, in preparation for the next pandemic. I'm not sure how much success they're having.)

To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn't know human values, rather than cases where the AI knows what the human would want but still a failure occurs. (I didn't state this explicitly because I was replying to the post, which focuses specifically on the problem of not knowing all of human values.) I think most of your failures are of the latter type, and I wouldn't make a similar claim for such failures -- they seem plausible and worth attention. I do still think they are not as important as intent alignment. Talking about each one individually:

I think unless we make sure the AI can distinguish between "correct philosophy" or "well-intentioned philosophy" and "philosophy optimized for persuasion", each human will become either compromised (if they're not very cautious and read such messages) or isolated from the rest of humanity with regard to philosophical discussion (if they are cautious and discard such messages). This doesn't seem like an ok outcome to me. Can you explain more why you aren't worried?

Mostly I'd hope that AI can tell what philosophy is optimized for persuasion, or at least is capable of presenting counterarguments persuasively as well. If your AI can't even tell what is persuasive then you're in trouble, but I'm not sure why to expect that.

I can imagine that if you subscribe to a metaethics in which a person can't be wrong about morality (i.e., some version of anti-realism), then you might think it's fine to lock in whatever values one currently thinks they ought to have. Is this your reason for "seems fine", or something else?

No, it just doesn't seem hugely terrible for a few people to lock in bad values, as long as the vast majority do not. (And I don't expect a large number of people to explicitly try to lock in their values.)

If the AI isn't very good at dealing with "exotic philosophical cases" then it's not going to be of much help with this problem, and a lot of humans (including me) probably aren't very good at thinking about this either, so we probably end up with a lot of humans succumbing to such acausal attacks.

It seems odd to me that it's sufficiently competent to successfully reason about simulations enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don't particularly expect this to happen.

Or some other reason to think that AIs that are aligned with different humans will find ways to cooperate (as efficiently as merging utility functions probably will be), without either a full understanding of human values or risking permanent loss of some parts of their complex values?

  1. I don't expect AIs to have clean crisp utility functions of the form "maximize paperclips" (at least initially).
  2. If you're using some form of indirect normativity (which is the general approach I'm excited about), then it seems like you could have an agreement between two AIs about how to aggregate / merge these two definitions of "the good" for different humans (e.g. each human gets "points" in proportion to their current resources, and then they can spend points to have influence over specific choices; presumably you could do better but this seems like a fine start). I expect this to be way less work than the complicated plans that the AI is enacting, so it isn't a huge competitiveness hit.
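To make the "points" idea a little more concrete, here is a minimal Python sketch. It is only one possible reading: the function names, the fixed pool of 100 points, the highest-total-wins rule, and the tie-breaking are all assumptions added for illustration, not part of the proposal itself.

```python
from typing import Dict


def allocate_points(resources: Dict[str, float], total_points: float = 100.0) -> Dict[str, float]:
    """Give each human points in proportion to their current resources."""
    total = sum(resources.values())
    if total <= 0:
        raise ValueError("total resources must be positive")
    return {name: total_points * r / total for name, r in resources.items()}


def resolve_choice(bids: Dict[str, Dict[str, float]], budgets: Dict[str, float]) -> str:
    """Settle one choice: the option with the most points spent on it wins.

    `bids` maps human -> {option: points spent}. Spent points are deducted
    from `budgets` in place, so influence used here is unavailable later.
    """
    # Validate first so a bad bid leaves every budget untouched.
    for human, spend in bids.items():
        if sum(spend.values()) > budgets[human]:
            raise ValueError(f"{human} bid more points than they hold")

    totals: Dict[str, float] = {}
    for human, spend in bids.items():
        for option, points in spend.items():
            totals[option] = totals.get(option, 0.0) + points
            budgets[human] -= points

    if not totals:
        raise ValueError("no bids were placed")
    # Ties go to whichever option received a bid first.
    return max(totals, key=totals.get)
```

For example, with resources of 300 and 100, the two humans start with 75 and 25 points. If the first spends 10 points on one option and the second spends all 25 on another, the second option wins that particular choice, but the second human then has nothing left for later choices.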

Agreed that's a possible good outcome, but seems far from a sure thing. Such laws would have to be more intrusive than anything people are currently used to, since attackers can create simulated suffering within the "privacy" of their own computers or minds. I suppose if such threats become a serious problem that causes a lot of damage, people might agree to trade off their privacy for security. The law might then constitute a risk in itself, as the implementation mechanism might be subverted/misused to create a form of totalitarianism.

I agree it is far from a sure thing.

To be clear, my original claim was for hypothetical scenarios where the failure occurs because the AI didn’t know human values, rather than cases where the AI knows what the human would want but still a failure occurs.

I'm not sure I understand the distinction that you're drawing here. (It seems like my scenarios could also be interpreted as failures where AIs don't know enough human values, or maybe where humans themselves don't know enough human values.) What are some examples of what your claim was about?

I do still think they are not as important as intent alignment.

As in, the total expected value lost through such scenarios isn't as large as the expected value lost through the risk of failing to solve intent alignment? Can you give some ballpark figures of how you see each side of this inequality?

Mostly I’d hope that AI can tell what philosophy is optimized for persuasion

How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?

or at least is capable of presenting counterarguments persuasively as well.

You mean every time you hear a philosophical argument, you ask your AI to produce some counterarguments optimized for persuasion? If so, won't your friends be afraid to send you any arguments they think of, for fear of your AI superhumanly persuading you to the opposite conclusion?

And I don’t expect a large number of people to explicitly try to lock in their values.

A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn't they ask their AI for help with this? Or do you imagine them asking for something like "more faith", but AIs understand human values well enough to not interpret that as "lock in values"?

It seems odd to me that it’s sufficiently competent to successfully reason about simulations enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don’t particularly expect this to happen.

The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality. For example, how much should I care about my copies in the simulations or my subjective future experience, versus the value that would be lost in the base reality if I were to give in to the simulators' demands? Should I make a counterthreat? Are there any thoughts I or my AI should avoid having, or computations we should avoid doing?

I don’t expect AIs to have clean crisp utility functions of the form “maximize paperclips” (at least initially).

I expect that AIs (or humans) who are less cautious or who think their values can be easily expressed as utility functions will do this first, and thereby gain an advantage over everyone else, maybe forcing them to follow.

I expect this to be way less work than the complicated plans that the AI is enacting, so it isn’t a huge competitiveness hit.

I don't think it's so much that the coordination involving humans is a lot of work, but rather that we don't know how to do it in a way that doesn't cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I'd need to see more details before I update.)

I'm not sure I understand the distinction that you're drawing here. (It seems like my scenarios could also be interpreted as failures where AIs don't know enough human values, or maybe where humans themselves don't know enough human values.) What are some examples of what your claim was about?

Examples:

  1. Your AI thinks it's acceptable to inject you with heroin, because it predicts you will then want more heroin.
  2. Your AI is uncertain whether you'd prefer to explore space or stay on Earth. It randomly guesses that you want to stay on Earth and takes irreversible actions on your behalf that force you to stay on Earth.

In contrast, something like a threat doesn't count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don't know how to act in a way that both disincentivizes threats and also doesn't lead to (too many) threats being enforced. In particular, the problem is not that you don't know which outcomes are bad.

As in, the total expected value lost through such scenarios isn't as large as the expected value lost through the risk of failing to solve intent alignment?

No, the expected value of marginal effort aimed at solving these problems isn't as large as the expected value of marginal effort on intent alignment.

(I don't like talking about "expected value lost" because it's not always clear what does and doesn't count as part of that. For example I think it's nearly inevitable that different people will have different goals and so the future will not be exactly as any one of them desired; should I say that there's a lot of expected value lost from "coordination problems" for that reason? It seems a bit weird to say that if you think there isn't a way to regain that "expected value".)

Can you give some ballpark figures of how you see each side of this inequality?

Uh, idk. It's not something I have numbers on. But I suppose I can try and make up some very fake numbers for, say, AI persuasion. (Before I actually do the exercise, let me note that I could imagine the exercise coming out with numbers that favor persuasion over intent alignment; this probably won't change my mind and would instead make me distrust the numbers, but I'll publish them anyway.)

To change an existentially bad outcome from AI persuasion, I'd imagine first figuring out some solutions, then figuring out how to implement them, and then getting the appropriate people to implement them; seems like you need all of these steps in order to make any difference. (Could be technical or governance though.) It feels especially hard to do so at the moment, given how little we know about future AI capabilities and when each potential capability arrives. Maybe:

  1. Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between probabilities of existential catastrophe by the given problem assuming no intervention from longtermists).
  2. A piece of alignment work now is 10x more likely to target the right problem than a similar piece of work for persuasion.
  3. A piece of alignment work now is currently 10x harder to produce than a similar piece of work for persuasion.

So I guess overall I'm at ~100x, very very roughly?
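The ~100x figure can be checked with a few lines of Python. This assumes the three made-up factors simply multiply, with the "harder to produce" factor counting against alignment work; the variable names are illustrative only.

```python
from fractions import Fraction

# Very fake numbers from the list above, expressed as alignment : persuasion.
risk_ratio = Fraction(100)        # alignment 100x more likely to be an x-risk at all
targeting_ratio = Fraction(10)    # alignment work 10x more likely to hit the right problem
difficulty_ratio = Fraction(10)   # alignment work 10x harder to produce

# Harder-to-produce work yields less per unit of effort, so that factor divides.
overall = risk_ratio * targeting_ratio / difficulty_ratio
print(overall)  # 100
```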

How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?

Putting on my "what would I do" hat, I'm imagining that the AI doesn't know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren't being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments. Or it says that there are other counterfactual letters you could have received, such that after you read them you'd be convinced of the opposite position, and then it asks whether you still want to read the letter.

If your AI doesn't know about the counterarguments, and the letter is persuasive even to the AI, then it seems like you're hosed, but I'm not sure why to expect that.

A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn't they ask their AI for help with this? Or do you imagine them asking for something like "more faith", but AIs understand human values well enough to not interpret that as "lock in values"?

I totally expect them to ask AI for help with such games. I don't expect (most of) them to lock in their values such that they can't change their mind.

I'm not entirely sure why you do expect this. Maybe you're viewing them as consciously optimizing for winning status games + a claim that the optimal policy is to lock in your values? But what if the values rewarded by the status games change (as they already seem to, e.g. moving from atheism to anti-racism)? In that case it seems like you don't want to lock in your values, to better play the status game in the future.

The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality.

If you've determined a set of universes to care about, then shouldn't at least decision theory reduce to mathematical / empirical matters about which decision procedure gets the most value across the set of universes? I do agree that moral questions are still not mathematical / empirical, but I don't find that all that persuasive. I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.

I'm guessing you feel better about mathematical / empirical reasoning because there's a ground truth that says when that reasoning is done well. I don't particularly find the existence of a ground truth to be all that big a deal -- it probably helps but doesn't seem tremendously important.

I expect that AIs (or humans) who are less cautious or who think their values can be easily expressed as utility functions will do this first, and thereby gain an advantage over everyone else, maybe forcing them to follow.

Fair enough -- I agree this is plausible (though only plausible; it doesn't seem like we've sacrificed everything to Moloch yet).

I don't think it's so much that the coordination involving humans is a lot of work, but rather that we don't know how to do it in a way that doesn't cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I'd need to see more details before I update.)

If we're imagining coordination between a billion AIs, that seems less obviously doable. I still think that, if you've solved intent alignment, it seems much easier. Democratic elections are not just about coordination -- they're also about alignment, since politicians have to optimize for being re-elected. It seems like you could do a lot better if you didn't have to worry about the alignment part.

In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.

I see, but I think at least part of the problem with threats is that I'm not sure what I care about, which greatly increases my "attack surface". For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn't be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).

Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between probabilities of existential catastrophe by the given problem assuming no intervention from longtermists).

This seems really extreme, if I'm not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?
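The implied figure in the question above comes from dividing the assumed 10% by the 100x ratio. A quick check, with illustrative variable names:

```python
from fractions import Fraction

intent_alignment_risk = Fraction(1, 10)  # the 10% assumed in the question
alignment_to_persuasion_ratio = 100      # the 100x ratio being questioned

persuasion_risk = intent_alignment_risk / alignment_to_persuasion_ratio
print(persuasion_risk)  # 1/1000
```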

Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.

Given that humans are liable to be persuaded by bad counterarguments too, I'd be concerned that the AI will always "know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments." Since it's not safe to actually look at the counterarguments found by your own AI, it's not really helping at all. (Or it makes things worse if the user isn't very cautious and does look at their AI's counterarguments and gets persuaded by them.)

I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.

I think most people don't think very long term and aren't very rational. They'll see some people within their group do AI-enabled value lock-in, get a lot of status reward for it, and emulate that behavior in order to not fall behind and become low status within the group. (This might be a gradual process resembling "purity spirals" of the past, i.e., people ask AI to do more and more things that have the effect of locking in their values, or a sudden wave of explicit value lock-ins.)

I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.

This seems plausible to me, but I don't see how one can have enough confidence in this view that one isn't very worried about the opposite being true and constituting a significant x-risk.

I broadly agree with the things you're saying; I think it mostly comes down to the actual numbers we'd assign.

This seems really extreme, if I'm not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?

Yeah, that's about right. I'd note that it isn't totally clear what the absolute risk number is meant to capture -- one operationalization is that it is P(existential catastrophe occurs, and if we had solved AI persuasion but the world was otherwise exactly the same, then no existential catastrophe occurs) -- I realize I didn't say exactly this above but that's the one that is mutually exclusive across risks, and the one that determines expected value of solving the problem.

To justify the absolute number of 1/1000, I'd note that:

  1. The case seems pretty speculative + conjunctive -- you need people to choose to use AI to be very persuasive (instead of, idk, retiring to live in luxury in small insular subcommunities), you'd need the AI to be better at persuasion than defending against persuasion (or for people to choose not to defend), and you'd need this to be so bad that it leads to an existential catastrophe.
  2. I feel like if I talked to lots of people the amount I've talked with you / others about AI persuasion (i.e. not very much, but enough to convey a basic idea) I'd end up having 10-300 other risks of similar magnitude and plausibility. Under the operationalization I gave above, these probabilities would be mutually exclusive. So that places an upper bound of 1/300 - 1/10 on any given problem.
  3. I don't expect this bound to be tight. For example, if it were tight, that would imply that existential catastrophe is guaranteed. But more importantly, there are lots of worlds in which existential catastrophe is overdetermined because society is terrible at coordinating. If you condition on "existential catastrophe" and "AI persuasion was a big problem", I update that we were really bad at coordination and so I also think that there would be lots of other problems such that solving persuasion wouldn't prevent the existential catastrophe. (Whereas alignment feels much more like a direct technical challenge -- while there certainly is an update against societal coordination if we get an existential catastrophe with a failure of alignment, the update seems a lot smaller, and so I'm more optimistic that solving alignment means that the existential catastrophe doesn't happen at all.)

The 100x differential between alignment and persuasion comes mostly because points (2) and (3) above don't apply to alignment, and point (1) applies only in part: given my state of knowledge, the case for alignment failure seems much less speculative (though obviously still speculative), though it is still quite conjunctive.
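To make the arithmetic in the points above concrete, here is a minimal sketch. It assumes, purely for illustration, that the N risks are mutually exclusive and of equal probability, so that N times each probability cannot exceed 1. The function names are hypothetical, and the 10% alignment figure is the one assumed in the question above, not an estimate I'm defending here.

```python
from fractions import Fraction


def per_risk_upper_bound(n_risks: int) -> Fraction:
    """Upper bound on each of n mutually exclusive, equally likely risks.

    Assumption (illustrative only): the risks have equal probability p,
    so n * p <= 1 and therefore p <= 1 / n.
    """
    if n_risks <= 0:
        raise ValueError("n_risks must be positive")
    return Fraction(1, n_risks)


def implied_risk(reference_risk: Fraction, ratio: int) -> Fraction:
    """Risk implied by a reference risk and a 'reference is ratio-x more likely' claim."""
    return reference_risk / ratio


# "10-300 other risks" gives an upper bound between 1/300 and 1/10.
loosest_bound = per_risk_upper_bound(10)
tightest_bound = per_risk_upper_bound(300)

# A 10% alignment risk with a 100x differential implies 1/1000 for persuasion.
persuasion_risk = implied_risk(Fraction(1, 10), 100)

print(loosest_bound, tightest_bound, persuasion_risk)
```

Note that 1/1000 sits below even the tightest bound of 1/300, which fits point (3): the bound isn't expected to be tight.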

I see, but I think at least part of the problem with threats is that I'm not sure what I care about, which greatly increases my "attack surface". For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn't be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).

That's a fair point, I agree this is a way in which full knowledge of human values can help avoid potentially significant risks in a way that intent alignment doesn't.

Under my current understanding of the "general alignment" angle, a core argument goes like:

  • We need some way for agents to create aligned successor agents, so our AI doesn't succumb to value drift. This is a thing we need regardless, assuming that AIs will design successively-more-powerful descendants.
  • If the successor-design process is sufficiently-general-purpose, a human could use that same process to design the "seed" AI in the first place.

I don't necessarily think this is the best framing, and I don't necessarily agree with it (e.g. whether the agent has direct read-access to its own values is an important distinction, and separately there's an argument that an AGI will be better-equipped to figure out its own succession problem than we will). I also don't know whether this is an accurate representation of anybody's view.

The successor problem is important, but it assumes we have the values already.

I'm imagining algorithms designing successors with imperfect values (that they know to be imperfect). It's a somewhat different problem (though solving the classical successor problem is also important).

Here are three things that I believe:

  1. "aiming the AGI's motivation at something-in-particular" is a different technical research problem from "figuring out what that something-in-particular should be", and we need to pursue both these research problems in parallel, since they overlap relatively little.
  2. There is no circumstance where any reasonable person would want to build an AGI whose motivation has no connection whatsoever to human values / preferences / norms.
  3. We don't necessarily want to do "ambitious value alignment", i.e., to build an AGI that fully understands absolutely everything we want and care about in life and adopts those goals as its own, such that if I disappear in a puff of smoke the AGI can continue pursuing my goals and meta-goals in my stead.

For example, I feel like it should be possible to make an AGI that understands human values and preferences well enough to reliably and conservatively avoid doing things that humans would see as obviously or even borderline unacceptable / problematic. So if you put it in the trolley problem, it says "I don't know, neither of those options seems obviously acceptable, so I am going to default to NOOP and let my supervisor take actions." Meanwhile, the AGI is also motivated to make me a cup of tea. Such an AGI seems pretty good to me. But it's contrary to (3).

I think this post is mainly arguing in favor of (2), and maybe weakly / implicitly arguing against (1). Is that right? And I'm not sure whether it's for or against (3).

I agree there are superintelligent unconstrained AIs that can accomplish tasks (making a cup of tea) without destroying the world. But I feel it would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.

Ah, so you are arguing against (3)? (And what's your stance on (1)?)

Let's say you are assigned to be Alice's personal assistant.

  • Suppose Alice says "Try to help me as much as you can, while being VERY sure to avoid actions that I would regard as catastrophically bad. When in doubt, just don't do anything at all, that's always OK with me." I feel like Alice is not asking too much of you here. You'll observe her a lot, and ask her a lot of questions especially early on, and sometimes you'll fail to be useful, because helping her would require choosing among options that all seem fraught. But still, I feel like this is basically doable. And pretty robust, because you'll presumably only take actions when you have many independent lines of evidence that those actions are acceptable—e.g. you've seen Alice do similar things, and you've seen other people do similar things while Alice watched and she seemed happy, and also you explicitly asked Alice and she said it was fine.
  • Suppose Alice says "You need to distill my preferences into a utility function, and then go all-out, taking actions that set that utility function to its global maximum. So in particular, in every possible situation, no matter how bizarre, you will have preferences that match my preferences [or match the preferences that I would have reached upon deliberating following my meta-preferences, or whatever]." I feel like Alice is asking for something very very hard here. And that it's much more prone to catastrophic failure if anything goes wrong in the construction of the utility function—e.g. Alice gets confused and describes something wrong, or you misunderstand her.

Right?

But I feel it would have to have so much of human preferences already (to compute what is and what isn't an acceptable tradeoff in making you your cup of tea) that it may as well be fully aligned anyway - very little remains to define full alignment.

Hmm, I'm probably misunderstanding, but I feel like maybe you're making an argument like this:

(My probably-inaccurate elaboration of your argument.) We're making an extremely long list of the things that Alice cares about: "I like having all my teeth, and I like being able to watch football, and I like a pretty view out my window, etc. etc. etc." And each item that we add to the list costs one unit of value-alignment effort. And then "acting conservatively in regards to violating human preferences and norms in general, and in regards to Alice's preferences in particular" requires a very long list, and "synthesizing Alice's utility function" requires an only-slightly-longer list. Therefore we might as well do the latter.

But I don't think it's like that. For example, I think if an AGI watches a bunch of YouTube videos, it will be able to form a decent concept of "doing things that people would widely regard as uncontroversial and compatible with prevailing norms", and we can make it motivated to restrict its actions to that subspace via a constant amount of value-loading effort, i.e. with an amount of value-loading effort that does not scale with how complex those prevailing norms are. (More complex prevailing norms would require having the AGI watch more YouTube videos before it understands the prevailing norms, but it would not require more value-loading effort, i.e. the step where we edit the AGI's motivation such that it wants to follow prevailing norms would not be any harder.)

But I think it would take a lot more value-loading effort than that to really get a particular person's preferences, including all its idiosyncrasies and edge-cases.

Thanks for developing the argument. This is very useful.

The key point seems to be whether we can develop an AI that can successfully behave as a low impact AI - not as an "on balance, things are ok" AI, but as a genuinely low impact AI that ensures that we don't move towards a world where our preferences might be ambiguous or underdefined.

But consider the following scenario: the AGI knows that, as a consequence of its actions, one AGI design will be deployed rather than another. Both of these designs will push the world into uncharted territory. How should it deal with that situation?

Hmm,

  1. I want the AI to have criteria that qualify actions as acceptable, e.g. "it pattern-matches less than 1% to 'I'm causing destruction', and it pattern-matches less than 1% to 'the supervisor wouldn't like this', and it pattern-matches less than 1% to 'I'm changing my own motivation and control systems', and … etc. etc."
  2. If no action is acceptable, I want NOOP to be hardcoded as an always-acceptable default—a.k.a. "being paralyzed by indecision" in the face of a situation where all the options seem problematic. And then we humans are responsible for not putting the AI in situations where fast decisions are necessary and inaction is dangerous, like running the electric grid or driving a car. 

    (At some point we do want an AI that can run the electric grid and drive a car etc. But maybe we can bootstrap our way there, and/or use less-powerful narrow AIs in the meantime.)
  3. A failure mode of (2) is that we could get an AI that is paralyzed by indecision always, and never does anything. To avoid this failure mode, we want the AI to be able to (and motivated to) gather evidence that might show that a course of action deemed problematic is in fact acceptable after all. This would probably involve asking questions to the human supervisor.
  4. A failure mode of (3) is that the AI frames the questions in order to get an answer that it wants. To avoid this failure mode, we would set things up such that the AI's normal motivation system is not in charge of choosing what words to say when querying the human. For example, maybe the AI is not really "asking a question" at all, at least not in the normal sense; instead it's sending a data-dump to the human, and the human then inspects this data-dump with interpretability tools, and makes an edit to the AI's motivation parameters. (In this case, maybe the AI's normal motivation system is choosing to "press the button" that sends the data-dump, but it does not have direct control over the contents of the data-dump.) Separately, we would also set up the AI such that it's motivated to not manipulate the human, and also motivated to not sabotage its own motivation and control systems.
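The four points above can be sketched as a toy decision procedure. This is only an illustration under loud assumptions: the "pattern-matchers" are stand-in functions that return a score in [0, 1], the names `Decision`, `choose_action`, and `usefulness` are hypothetical, and the "data-dump" is reduced to a dictionary assembled mechanically from the detector outputs, so the part of the system that picks actions never chooses its wording.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

NOOP = "NOOP"  # the hardcoded, always-acceptable default from point 2

# A detector maps an action to a pattern-match score in [0, 1] (assumed interface).
Detector = Callable[[str], float]


@dataclass
class Decision:
    action: str
    query_supervisor: bool  # point 3: ask rather than stay paralyzed forever
    flagged: Dict[str, List[str]] = field(default_factory=dict)  # point 4: raw data-dump


def choose_action(
    candidates: List[str],
    detectors: Dict[str, Detector],
    usefulness: Callable[[str], float],
    threshold: float = 0.01,  # "pattern-matches less than 1%"
) -> Decision:
    flagged: Dict[str, List[str]] = {}
    acceptable: List[str] = []
    for action in candidates:
        # Point 1: an action is acceptable only if every detector stays under the threshold.
        violated = [name for name, score in detectors.items() if score(action) >= threshold]
        if violated:
            flagged[action] = violated
        else:
            acceptable.append(action)
    if acceptable:
        return Decision(max(acceptable, key=usefulness), False, flagged)
    # Point 2: nothing qualifies, so default to NOOP; query if something was blocked.
    return Decision(NOOP, bool(flagged), flagged)
```

With a single candidate that scores well on usefulness but trips one detector, this returns `NOOP` with `query_supervisor=True`. The sketch says nothing about where trustworthy detectors would come from, which is the hard part.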

(BTW a lot of my thinking here came straight out of reading your model splintering posts. But maybe I've kinda wandered off in a different direction.)

So then in the scenario you mentioned, let's assume that we've set up the AI such that actions that pattern-match to "push the world into uncharted territory" are treated as unacceptable (which I guess seems like a plausibly good idea). But the AI is also motivated to get something done—say, solve global warming. And it finds a possible course of action which pattern-matches very well to "solve global warming", but alas, it also pattern-matches to "push the world into uncharted territory". The AI could reason that, if it queries the human (by "pressing the button" to send the data-dump), there's at least a chance that the human would edit its systems such that this course of action would no longer be unacceptable. So it would presumably do so.

In other words, this is a situation where the AI's motivational system is sending it mixed signals—it does want to "solve global warming", but it doesn't want to "push the world into uncharted territory", but this course of action is both. And let's assume that the AI can't easily come up with an alternative course of action that would solve global warming without any problematic aspects. So the AI asks the human what they think about this plan. Seems reasonable, I guess.

I haven't thought this through very much and look forward to you picking holes in it :)

My take is that if you gave an optimization process access to some handwritten acceptability criteria and searched for the nearest acceptable points to random starting points, you would get adversarial examples that violate unstated criteria. In order for the handwritten acceptability criteria to be useful, they can't be how the AI generates its ideas in the first place.
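A toy version of this worry, as a sketch only: assume a two-dimensional "outcome space" where the handwritten criterion constrains the first coordinate and an unstated criterion constrains the second. Both criteria, the sampling range, and all function names are invented for illustration. Searching for the nearest stated-acceptable point leaves the unconstrained coordinate untouched, so unstated violations survive the search.

```python
import random
from typing import Tuple

Point = Tuple[float, float]


def stated_ok(p: Point) -> bool:
    """Handwritten criterion (hypothetical): first coordinate within [-1, 1]."""
    return abs(p[0]) <= 1.0


def unstated_ok(p: Point) -> bool:
    """Criterion nobody wrote down (hypothetical): second coordinate within [-1, 1]."""
    return abs(p[1]) <= 1.0


def nearest_stated_acceptable(p: Point) -> Point:
    """Closest point satisfying only the stated criterion: clip x, leave y alone."""
    x, y = p
    return (max(-1.0, min(1.0, x)), y)


def violation_rate(n_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of search results that break the unstated criterion."""
    rng = random.Random(seed)
    violations = 0
    for _ in range(n_samples):
        start = (rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0))
        found = nearest_stated_acceptable(start)
        assert stated_ok(found)  # every result passes the handwritten check
        if not unstated_ok(found):
            violations += 1
    return violations / n_samples


print(violation_rate())
```

In this setup about two thirds of starting points have a second coordinate outside [-1, 1], so roughly that share of "acceptable" results should violate the unstated criterion. A real optimizer would be far worse than random starts, since it would actively seek out whatever the criteria fail to forbid.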

So: what is the base level that we would find if we peeled away the value learning scheme that you lay out? Is it a very general, human-agnostic AI with some human-value constraints on top? Or will we peel away a layer that gets information from humans just to reveal another layer that gets information from humans (e.g. learning a "human distribution")?

I'm with Steve on the idea that there's a difference between broad human preferences (something like common sense?) and particular and exact human preferences (what would be needed for ambitious value learning).

Still, you (Stuart) made me realize that I didn't think explicitly about this need for broad human preferences in my splitting of the problem (be able to align, then point to what we want), but it's indeed implicit because I don't care about being able to do "anything", just the sort of things humans might want.

If we have an algorithm that aligns an AI with X values, then we can add human values to get an AI that is aligned with human values.

On the other hand, I agree that it doesn't really make sense to declare an AI safe in the abstract, rather than with respect to, say, human values. (Small counterpoint: safety isn't just about alignment; you also need to avoid bugs. Avoiding bugs can be defined without reference to human values. However, this isn't sufficient for safety.)

I suppose this works as a criticism of approaches like quantilisers or impact-minimisation, which attempt abstract safety, although I can't see any reason why it'd imply that it's impossible to write an AI that can be aligned with arbitrary values.

There are very “large” impacts to which we are completely indifferent (chaotic weather changes, the above-mentioned change in planetary orbits, the different people being born as a consequence of different people meeting and dating across the world, etc.) and other, smaller, impacts that we care intensely about (the survival of humanity, of people’s personal wealth, of certain values and concepts going forward, key technological innovations being made or prevented, etc.)

I don't think we are indifferent to these outcomes. We leave them to luck, but that's a fact about our limited capabilities, not about our values. If we had enough control over "chaotic weather changes" to steer a hurricane away from a coastal city, we would very much care about it. So if a strong AI can reason through these impacts, it suddenly faces a harder task than a human: "I'd like this apple to fall from the table, and I see that running the fan for a few minutes will achieve that goal, but that's due to subtly steering a hurricane and we can't have that".

Yes, but we would be mostly indifferent to shifts in the distribution that preserve most of the features - e.g., if the weather was the same but delayed or advanced by six days.