Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:
Strong upvote, I would also love to see more disscussion on the difficulty of inner alignment.
which if true should preclude strong confidence in disaster scenarios
Though only for disaster scenarios that rely on inner misalignment, right?
... seem like world models that make sense to me, given the surrounding justifications
FWIW, I don't really understand those world models/intuitions yet:
Re: corrigibility being "anti-natural" in a certain sense - I think I have a better understanding of this now:
If we don’t have the techniques to reliably align AI, will someone deploy AI anyway? I think it’s more likely the answer is yes.
What level of deployment of unaligned benchmark systems do you expect would make doom plausible? "Someone" suggests maybe you think one deployment event of a sufficiently powerful system could be enough (which would be surprising in slow takeoff worlds). If you do think this, is it something to do with your expectations about discontinuous progress around AGI?
A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is are in practice
Sure, I agree this is a stronger point.
The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.
Not really, unfortunately. In those posts, the authors are focusing on painting a plausible pi...
I'm broadly sympathetic to your point that there have been an unfortunate number of disagreements about inner alignment terminology, and it has been and remains a source of confusion.
to the extent that Evan has felt a need to write an entire clarification post.
Yeah, and recently there has been even more disagreement/clarification attempts.
I should have specified this on the top level question, but (as mentioned in my own answer) I'm talking about abergal's suggestion of what inner alignment failure should refer to (basically: a model pursuing a differe...
Thanks for your reply!
depends on what you mean with strongest arguments.
By strongest I definitely mean the second thing (probably I should have clarified here, thanks for picking up on this).
Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.
Agree, though I expect it's more like, the emphasis needs to be different, whilst the underlying argument is similar (conditional on talking about your second definition of "strongest").
...many di
Arguments for outer alignment failure, i.e. that we will plausibly train advanced AI systems using a training objective that doesn't incentivise or produce the behaviour we actually want from the AI system. (Thanks to Richard for spelling out these arguments clearly in AGI safety from first principles.)
I found this post helpful and interesting, and refer to it often! FWIW I think that powerful persuasion tools could have bad effects on the memetic ecosystem even if they don't shift the balance of power to a world with fewer, more powerful ideologies. In particular, the number of ideologies could remain roughly constant, but each could get more 'sticky'. This would make reasonable debate and truth-seeking harder, as well as reducing trusted and credible multipartisan sources. This seems like an existential risk factor, e.g. because it will make coordinati...
Thanks for your comment!
Are we sure that given the choice between "lower crime, lower costs and algorithmic bias" and "higher crime, higher costs and only human bias", and we have dictatorial power and can consider long-term effects, we would choose the latter on reflection?
Good point, thanks, I hadn't thought that sometimes it actually would make sense, on reflection, to choose an algorithm pursuing an easy-to-measure goal over humans pursuing incorrect goals. One thing I'd add is that if one did delve into the research to work this out for a particul...
Re the argument for "Why internalization might be difficult", I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it's not right.
Rather, the argument that Risks from Learned Optimization makes that internalization would be difficult is that: