MichaelA - AI Alignment Forum

My Overview of the AI Alignment Landscape: A Bird's Eye View

Thanks for this! I found it interesting and useful.

I don't have much specific feedback, partly because I listened to this via the Nonlinear Library while doing other things rather than reading it, but I'll share some thoughts anyway since you said you were very keen for feedback.

- I think this sort of distillation work is generally important and under-supplied.
- This seems like a good example of what this sort of distillation work should be like: it's broken into different posts that can be read separately, starts with an overall overview, breaks each post into clear and logical sections and subsections, uses bold, is clear about terms, and adds meta notes where relevant.
- Maybe it would've been useful to just name and link to sources on the threat models, agendas to build safe AGI, and robustly good approaches that you don't discuss in any further detail, rather than not mentioning them at all?
  - That could make it easier for people to dive deeper if they want, could help avoid giving the impression that the things you list are the only things in those categories, and could help people understand what you mean by the overall categories by seeing more examples of things within them.
  - This assumes you think there are other discernible, nameable constituents of those categories that you didn't name; I guess it's possible that you don't think that.
- I'll put the Anki cards I made in a reply to this comment, on the off chance that they're of interest to you as oblique feedback, or to other people who'd like to use the same cards themselves.

Rohin Shah on reasons for AI optimism

MichaelA · 5y

Interesting (again!).

So you've updated your unconditional estimate from ~5% (1 in 20) to ~9%? If so, people may have to stop citing you as an "optimist"... (which was already perhaps a tad misleading, given what the 1 in 20 was about)

(I mean, I know we're all sort-of just playing with incredibly uncertain numbers about fuzzy scenarios anyway, but still.)

Rohin Shah on reasons for AI optimism

MichaelA · 5y

Quite interesting. Thanks for that response.

And yes, this does seem quite consistent with Ord's framing. E.g., he writes "my estimates above incorporate the possibility that we get our act together and start taking these risks very seriously." So I guess I have seen it presented this way at least once, but I don't think I've often seen it made that explicit (and doing so seems useful and, in retrospect, obvious).

"But if we just exerted a lot more effort (i.e. 'surprisingly much action'), the extra effort probably doesn't help much more than the initial effort, so maybe... 1 in 25? 1 in 30?"

Are you thinking roughly that (a) returns diminish steeply from the current point, or (b) effort will likely ramp up a lot in future and pick much of the low-hanging fruit that currently remains, such that even more ramping up would face steeply diminishing returns?

That's a vague question, and may not be very useful. The motivation for it is that I was surprised you saw the gap between business as usual and "surprisingly much action" as being as small as you did, and I wonder roughly how much of that is because you think additional people working on this won't be very useful, versus because you think very useful additional people will eventually come aboard "by default".

Rohin Shah on reasons for AI optimism

MichaelA · 5y

Thanks for this reply!

Perhaps I should've made clear that I didn't expect to be saying anything you hadn't already heard. (I mean, I think I watched an EAG video of you presenting on 80k's ideas, and you were in The Precipice's acknowledgements.)

I guess I was just suggesting that your comments there, taken by themselves and out of context, seemed to ignore those important arguments, and thus might seem overly optimistic. That seemed worth someone mentioning at some point, as I've seen this cited as an example of AI researcher optimism. (Though of course I acknowledge your comments were off the cuff and not initially intended for public consumption, and any such interview will likely contain moments that are imperfectly phrased or open to misinterpretation.)

"Also, re: Precipice, it's worth noting that Toby and I don't disagree much -- I estimate 1 in 10 conditioned on no action from longtermists; he estimates 1 in 5 conditioned on AGI being developed this century. Let's say that action from longtermists can halve the risk; then my unconditional estimate would be 1 in 20[...]" (emphasis added)

I find this quite interesting. Is this for existential risk from AI as a whole, or just "adversarial optimisation"/"misalignment" type scenarios? E.g., does it also include things like misuse and "structural risks" (e.g., AI increasing risks of nuclear war by forcing people to make decisions faster)?

I'm not saying it'd be surprisingly low if it does include those things. I'm just wondering, as estimates like this are few and far between, so now that I've stumbled upon one I want to understand its scope and add it to my outside view.

Also, I bolded conditioned and unconditional, because that seems to me to suggest that you also currently expect the level of longtermist intervention that would reduce the risk to 1 in 20 to happen. Like, for you, "there's no action from longtermists" would be a specific constraint you have to add to your world model? That also makes sense; I just don't think I've usually seen things presented that way.

I imagine you could also condition on something like "surprisingly much action from longtermists", which would reduce your estimated risk further?
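To make the conditional/unconditional distinction concrete, here's a minimal sketch using the law of total probability. The 1 in 10 figure and the halving come from Rohin's comment; the names `reduction_factor` and `p_action` (the probability that the longtermist action actually happens) are hypothetical parameters I've introduced purely for illustration.

```python
def unconditional_risk(risk_no_action: float, reduction_factor: float, p_action: float) -> float:
    """Law of total probability over whether longtermist action happens.

    risk_no_action:   P(catastrophe | no action from longtermists)
    reduction_factor: how much action divides the risk (2.0 = halves it); an assumption
    p_action:         P(action happens); a hypothetical parameter, not from the interview
    """
    risk_with_action = risk_no_action / reduction_factor
    return p_action * risk_with_action + (1 - p_action) * risk_no_action


# Treating the action as certain turns 1 in 10 into 1 in 20.
print(unconditional_risk(0.1, 2.0, 1.0))  # 0.05
# If the action were only 50% likely, the estimate would sit between 1 in 20 and 1 in 10.
print(unconditional_risk(0.1, 2.0, 0.5))
```

On this reading, getting exactly 1 in 20 as the unconditional figure amounts to setting `p_action` close to 1, which is why "no action from longtermists" looks like an extra constraint rather than the default.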

Rohin Shah on reasons for AI optimism

MichaelA · 5y

"You could imagine a situation where for some reason the US and China are like, ‘Whoever gets to AGI first just wins the universe.’ And I think in that scenario maybe I’m a bit worried, but even then, it seems like extinction is just worse, and as a result, you get significantly less risky behavior? But I don’t think you get to the point where people are just literally racing ahead with no thought to safety for the sake of winning."

My interpretation of what Rohin is saying there is:

1) Extinction is an extremely bad outcome.
2) It's much worse than 'losing' an international competition to 'win the universe'.
3) Countries/institutions/people will therefore be significantly inclined to avoid risking extinction, even if doing so would increase the chances of 'winning' an international competition to 'win the universe'.

I agree with claim 1.

I agree with some form of claim 3, in that:

- I think the badness of extinction will reduce the risks people are willing to take.
- I also "don’t think you get to the point where people are just literally racing ahead with no thought to safety for the sake of winning."
- But I don't think the risks will be reduced anywhere near as much as they should be. (That said, I also believe that the odds are in favour of things "going well by default", just not as much in favour of that as I'd like.)

This is related to my sense that claim 2 is somewhat tricky/ambiguous. Are we talking about whether extinction is worse, or whether the relevant actors will perceive it as worse? One common argument for why existential risks are neglected is that this is basically a standard market failure. The vast majority of the harm from x-risks is externalities, and x-risk reduction is a global public good. Even if we consider only deaths and suffering in the present generation, even China and India absorb less than half of that "cost", and most countries absorb less than 1% of it. And I believe most people focused on x-risk reduction are at least broadly longtermist, so they'd perceive the overwhelming majority of the costs as falling on future generations, and thus also as externalities.

So it seems like, unless we expect the relevant actors to act in accordance with something close to impartial altruism, we should expect them to take some care to avoid existential risks (or extinction specifically), but far less than they really should. (Roughly this argument is made in The Precipice, and I believe by 80k.)

(Rohin also discusses right after that quote why he doesn't "think that differences in who gets to AGI first are going to lead to you win the universe or not", which I do think somewhat bolsters the case for claim 2.)
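The market-failure argument can be illustrated with a toy model. Everything in it is hypothetical: the exponential risk curve, the parameter names, and the numbers are illustrative assumptions, not estimates. An actor chooses how much to spend on safety, spending reduces risk with diminishing returns, and the actor only counts the fraction `private_share` of the total harm. It spends until the marginal private benefit equals the marginal cost, so a smaller share means less risk reduction than an impartial planner would choose.

```python
import math


def optimal_safety_spending(private_share: float, total_harm: float,
                            base_risk: float, effectiveness: float) -> float:
    """Spending s that minimises s + private_share * total_harm * risk(s).

    Assumes (hypothetically) risk(s) = base_risk * exp(-effectiveness * s).
    First-order condition: 1 = private_share * total_harm * base_risk
                               * effectiveness * exp(-effectiveness * s).
    """
    marginal_value_at_zero = private_share * total_harm * base_risk * effectiveness
    if marginal_value_at_zero <= 1.0:
        return 0.0  # even the first unit of safety spending isn't privately worth it
    return math.log(marginal_value_at_zero) / effectiveness


def residual_risk(spending: float, base_risk: float, effectiveness: float) -> float:
    """Risk remaining after the given spending, under the same assumed curve."""
    return base_risk * math.exp(-effectiveness * spending)


# Illustrative numbers only: an impartial planner (bears all the harm)
# versus an actor that bears 1% of it.
planner = optimal_safety_spending(1.0, 1e6, 0.1, 0.01)
actor = optimal_safety_spending(0.01, 1e6, 0.1, 0.01)
print(f"planner residual risk: {residual_risk(planner, 0.1, 0.01):.4g}")
print(f"actor residual risk:   {residual_risk(actor, 0.1, 0.01):.4g}")
```

With these made-up numbers the planner drives the risk from 0.1 down to about 0.0001, while the actor who bears 1% of the harm stops at about 0.01: it does avoid some risk, but far less than it should, which is the shape of the argument above.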

Rohin Shah on reasons for AI optimism

MichaelA · 5y

Interesting interview, thanks for sharing it!

Asya Bergal: It seems like people believe there’s going to be some kind of pressure for performance or competitiveness that pushes people to try to make more powerful AI in spite of safety failures. Does that seem untrue to you or like you’re unsure about it?

Rohin Shah: It seems somewhat untrue to me. I recently made a comment about this on the Alignment Forum. People make this analogy between AI x-risk and risk of nuclear war, on mutually assured destruction. That particular analogy seems off to me because with nuclear war, you need the threat of being able to hurt the other side whereas with AI x-risk, if the destruction happens, that affects you too. So there’s no mutually assured destruction type dynamic.

I find this statement very confusing. I wonder if I'm misinterpreting Rohin. Wikipedia says "Mutual(ly) assured destruction (MAD) is a doctrine of military strategy and national security policy in which a full-scale use of nuclear weapons by two or more opposing sides would cause the complete annihilation of both the attacker and the defender (see pre-emptive nuclear strike and second strike)."

A core part of the idea of MAD is that the destruction would be mutual. So "with AI x-risk, if the destruction happens, that affects you too" seems like a reason why MAD is a good analogy, and why the way states engaged in MAD might suggest that people would accept similar brinkmanship or risks with AI x-risk, even if the potential for harm to their "own side" were extreme. There are other reasons why the analogy is imperfect, but the particular feature Rohin mentions seems like a reason why the analogy could be drawn.
