Summary: I’m going to give a $10k prize to the best evidence that my preferred approach to AI safety is doomed. Submit by commenting on this post with a link by April 20.
I have a particular vision for how AI might be aligned with human interests, reflected in posts at ai-alignment.com and centered on iterated amplification.
This vision has a huge number of possible problems and missing pieces; it’s not clear whether these can be resolved. Many people endorse this or a similar vision as their current favored approach to alignment, so it would be extremely valuable to learn about dealbreakers as early as possible (whether to adjust the vision or abandon it).
Here’s the plan:
If you want to explain why this approach is doomed, or explore a reason it may be, I strongly encourage you to do that.
Post a link to any relevant research/argument/evidence (a paper, blog post, repo, whatever) in the comments on this post.
The contest closes April 20.
You can submit content that was published before this prize was announced.
I’ll use some process to pick my favorite 1-3 contributions. This might involve delegating to other people or might involve me just picking. I make no promise that my decisions will be defensible.
I’ll distribute (at least) $10k amongst my favorite contributions.
If you think that some other use of this money or some other kind of research would be better for AI alignment, I encourage you to apply for funding to do that (or just to say so in the comments).
This prize is orthogonal and unrelated to the broader AI alignment prize. (Reminder: the next round closes March 31. Feel free to submit something to both.)
This contest is not intended to be “fair”---the ideas I’m interested in have not been articulated clearly, so even if they are totally wrong-headed it may not be easy to explain why. The point of the exercise is not to prove that my approach is promising because no one can prove it’s doomed. The point is just to have a slightly better understanding of the challenges.
Edited to add the results:
$5k for this post by Wei_Dai, and the preceding/following discussion, which make some points about the difficulty of learning corrigibility in small pieces.
$3k for Point 1 from this comment by eric_langlois, an intuition pump for why security amplification is likely to be more difficult than you might think.
$2k for this post by William_S, which clearly explains a consideration / design constraint that would make people less optimistic about my scheme. (This fits under "summarizing/clarifying" rather than a novel observation.)
Thanks to everyone who submitted a criticism! Overall I found this process useful for clarifying my own thinking (and highlighting places where I could make it easier to engage with my research by communicating more clearly).
Background on what I’m looking for
I’m most excited about particularly thorough criticism that either makes tight arguments or “plays both sides”---points out a problem, explores plausible responses to the problem, and shows that natural attempts to fix the problem systematically fail.
If I thought I had a solution to the alignment problem I’d be interested in highlighting any possible problem with my proposal. But that’s not the situation yet; I’m trying to explore an approach to alignment and I’m looking for arguments that this approach will run into insuperable obstacles. I'm already aware that there are plenty of possible problems. So a convincing argument has to establish a universal quantifier over potential solutions to a possible problem, i.e. show that no natural way of addressing the problem can work.
On the other hand, I’m hoping that we'll solve alignment in a way that knowably works under extremely pessimistic assumptions, so I’m fine with arguments that make weird assumptions or consider weird situations / adversaries.
Examples of interlocking obstacles I think might totally kill my approach:
Amplification may be doomed because there are important parts of cognition that are too big to safely learn from a human, yet can’t be safely decomposed. (Relatedly, security amplification might be impossible.)
Closer inspection of what amplification needs to do (e.g. building a competitive model of the world in which an amplified human can detect incorrigible behavior) may show that amplification isn’t getting around the fundamental problems that MIRI is interested in and will only work if we develop a much deeper understanding of effective cognition.
There may be kinds of errors (or malign optimization) that are amplified by amplification and can’t be easily controlled (or this concern might be predictably hard to address in advance by theory+experiment).
Corrigibility may be incoherent, or may not actually be easy enough to learn, or may not confer the kind of robustness to prediction errors that I’m counting on, or may not be preserved by amplification.
Satisfying safety properties in the worst case (like corrigibility) may be impossible. See this post for my current thoughts on plausible techniques. (I’m happy to provisionally grant that optimization daemons would be catastrophic if you couldn’t train robust models.)
Informed oversight might be impossible even if amplification works quite well. (This is most likely to be impossible in the context of determining what behavior is catastrophic.)
I value objections but probably won't have time to engage significantly with most of them. That said: (a) I’ll be able to engage in a limited way, and will engage with objections that significantly shift my view, (b) thorough objections can produce a lot of value even if no proponent publicly engages with them, since they can be convincing on their own, (c) in the medium term I’m optimistic about starting a broader discussion about iterated amplification which involves proponents other than me.
I think our long-term goal should be to find, for each powerful AI technique, an analog of that technique that is aligned and works nearly as well. My current work is trying to find analogs of model-free RL or AlphaZero-style model-based RL. I think that these are the most likely forms for powerful AI systems in the short term, that they are particularly hard cases for alignment, and that they are likely to turn up alignment techniques that are very generally applicable. So for now I’m not trying to be competitive with other kinds of AI systems.