A few more reasons...
First: why do software engineers use worst-case reasoning?
And one more, generated by thinking about some of my own use-cases:
These all carry over to alignment pretty straightforwardly.
Just want to draw out and highlight something mentioned in passing in the "You want to solve a problem in as much generality as possible..." section. Not only would it be great if you could solve a problem in the worst case, the worst case assumption is also often radically easier to think about than trying to think about realistic cases. In some sense the worst case assumption is the second-simplest assumption you could possibly make about the empirical situation (the simplest being the best case assumption -- "this problem never comes up"). My understanding is that proving theorems about average case phenomena is a huge pain and often comes much after proofs about the worst case bounds.
I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.
One keyword for this is "partial specification", e.g. here is a paper I wrote that makes a minimal set of statistical assumptions and worst-case assumptions everywhere else: https://arxiv.org/abs/1606.05313. (Unfortunately the statistical assumptions are not really reasonable so the method was way too brittle in practice.) This kind of idea is also common in robust statistics. But my take would not be that it is simpler--in general it is way harder than just working with the empirical distribution in front of you.
I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.
Even in ML it seems like it depends on how you formulated your problem/goal. Making good predictions in the worst case is impossible, but achieving low regret in the worst case is sensible. (Though still less useful than just "solve existing problems and then try the same thing tomorrow," and generally I'd agree "solve an existing problem for which you can verify success" is the easiest thing to do.) Hopefully having your robot not deliberately murder you is a similarly sensible goal in the worst case though it remains to be seen if it's feasible.
My interpretation of the NFL theorems is that solving the relevant problems under worst-case assumptions is too easy, so easy it's trivial: a brute-force search satisfies the criterion of worst-case optimality. So, that being settled, in order to make progress, we have to step up to average-case evaluation, which is harder.
(However, I agree that once we already need to do some averaging, making explicit and stripping down the statistical assumptions and trying to get closer to worst-case guarantees—without making the problem trivial again—is harder than just evaluating empirically against benchmarks.)
Finding the min-max solution might be easier, but what we actually care about is an acceptable solution. My point is that the min-max solution, in most cases, will be unacceptably bad.
And in fact, since min_x f(theta,x) <= E_x[f(theta,x)], any solution that is acceptable in the worst case is also acceptable in the average case.
Agreed—although optimizing for the worst case is usually easier than optimizing for the average case, satisficing for the worst case is necessarily harder (and, in ML, typically impossible) than satisficing for the average case.
To elaborate this formally,
and are both "easier" monoids than essentially because of dominance relations; for any , there's going to be a single that dominates all others, in the sense that all other can be excluded from consideration and have no impact on the outcome. Whereas when calculating , the only that can be excluded are those outside the distribution's support.
is even easier than because it commutes with the outer ; not only is there a single that dominates all others, it doesn't necessarily even depend on (the problem can be solved as or ). As a concrete example, the best case for nearly any sorting algorithm is already-sorted input, whereas the worst case depends more on which algorithm is being examined.
Somewhere between worst-case and average-case performance is quantile-case performance, known in SRE circles as percentile latency and widely measured empirically in practice (but rarely estimated in theory). Formally, optimizing -quantile-case performance looks like (compare to my expressions below for other cases). My impression is that quantile-case is heavily underexplored in theoretical CS and also underused in ML, with the exceptions of PAC learning and VC theory.
Here's the results of an abbreviated literature search for papers that bring quantile-case concepts into contact with contemporary RL and/or deep learning:
Thanks! I appreciated these distinctions. The worst-case argument for modularity came up in a past argument I had with Eliezer, where I argued that this was a reason for randomization (even though Bayesian decision theory implies you should never randomize). See section 2 here: The Power of Noise.
Re: 50% vs. 10% vs. 90%. I liked this illustration, although I don't think your argument actually implies 50% specifically. For instance if it turns out that everyone else is working on the 50% worlds and no one is working on the 90% worlds, you should probably work on the 90% worlds. In addition:
* It seems pretty plausible that the problem is overall more tractable in 10% worlds than 50% worlds, so given equal neglectedness you would prefer the 10% world.
* Many ideas will generalize across worlds, and recruitment / skill-building / organization-building also generalizes across worlds. This is an argument towards working on problems that seem tractable and relevant to any world, as long as they are neglected enough that you are building out distinct ideas and organizational capacity (vs. just picking from the same tree as ML generally). I don't think that this argument dominates considerations, but it likely explains some of our differences in approach.
In the terms laid out in your post, I think my biggest functional disagreement (in terms of how it affects what problems we work on) is that I expect most worst-case assumptions make the problem entirely impossible, and I am more optimistic that many empirically-grounded assumptions will generalize quite far, all the way to AGI. To be clear, I am not against all worst-case assumptions (for instance my entire PhD thesis is about this) but I do think they are usually a source of significant added difficulty and one has to be fairly careful where they are making them.
For instance, as regards Redwood's project, I expect making language models fully adversarially robust is impossible with currently accessible techniques, and that even a fairly restricted adversary will be impossible to defend against while maintaining good test accuracy. On the other hand I am still pretty excited about Redwood's project because I think you will learn interesting things by trying. (I spent some time trying to solve the unrestricted adversarial example competition, totally failed, but still felt it was a good use of time for similar reasons, and the difficulties for language models seem interestingly distinct in a way that should generate additional insight.) I'm actually not sure if this differs that much from your beliefs, though.
This is a relatively banal meta-commentary on reasons people sometimes give for doing worst-case analysis, and the differences between those reasons. The post reads like a list of things with no clear through-line. There is a gesture at an important idea from a Yudkowsky post (the logistic success curve idea) but the post does not helpfully expound that idea. There is a kind of trailing-off towards the end of the post as things like "planning fallacy" seem to have been added to the list with little time taken to place them in the context of the other things on the list. In the "differences between these arguments" section, the post doesn't clearly elucidate deep differences between the arguments, it just lists verbal responses that you might make if you are challenged on plausibility grounds in each case.
Overall, I felt that this post under-delivered on an important topic.
When you’re considering between a project that gives us a boost in worlds where P(doom) was 50% and projects that help out in worlds where P(doom) was 1% or 99%, you should probably pick the first project, because the derivative of P(doom) with respect to alignment progress is maximized at 50%.
Many prominent alignment researchers estimate P(doom) as substantially less than 50%. Those people often focus on scenarios which are surprisingly bad from their perspective basically for this reason.
And conversely, people who think P(doom) > 50% should aim their efforts at worlds that are better than they expected.
This section seems reversed to me, unless I'm misunderstanding it. If "things as I expect" are P(doom) 99%, and "I'm pleasantly wrong about the usefulness of natural abstractions" is P(doom) 50%, the first paragraph suggests I should do the "better than expected" / "surprisingly good" world, because the marginal impact of effort is higher in that world.
[Another way to think about it is surprising in the direction you already expect is extremizing, but logistic success has its highest derivative in the middle, i.e. is a moderating force.]
This piece took an important topic that I hadn't realized I was confused/muddled about, convinced me I was confused/muddled about it, while simultaneously providing a good framework for thinking about it. I feel like I have a clearer sense of how Worst Case Thinking applies in alignment.
I also appreciated a lot of the comments here that explore the topic in more detail.
Alternative title: “When should you assume that what could go wrong, will go wrong?”
Thanks to Mary Phuong and Ryan Greenblatt for helpful suggestions and discussion, and Akash Wasil for some edits.
In discussions of AI safety, people often propose the assumption that something goes as badly as possible. Eliezer Yudkowsky in particular has argued for the importance of security mindset when thinking about AI alignment.
I think there are several distinct reasons that this might be the right assumption to make in a particular situation. But I think people often conflate these reasons, and I think that this causes confusion and mistaken thinking. So I want to spell out some distinctions.
Throughout this post, I give a bunch of specific arguments about AI alignment, including one argument that I think I was personally getting wrong until I noticed my mistake yesterday (which was my impetus for thinking about this topic more and then writing this post). I think I’m probably still thinking about some of my object level examples wrong, and hope that if so, commenters will point out my mistakes. But I think I’ll stand by the claim that we should be attempting to distinguish between these classes of argument.
My list of reasons to maybe use worst-case thinking
Here’s an attempt at describing some different classes situations where you might want to argue that something goes as badly as it could.
You’re being optimized against
For example, if you’ve built an unaligned AI and you have a team of ten smart humans looking for hidden gotchas in its proposed actions, then the unaligned AI will probably come up with a way of doing something bad that the humans miss. In AI alignment, we most often think about cases where the AI we’re training is optimizing against us, but sometimes we also need to think about cases where other AIs or other humans are optimizing against us or our AIs.
In situations like this, I think Eliezer’s attitude is basically right: we’re being optimized against and so we have to use worst-case thinking and search hard for systems which we can strongly argue are infallible.
One minor disagreement: I’m less into hard takeoffs than he is, so I place less weight than he does on situations where your AI becomes superintelligent enough during training that it can exploit some kind of novel physics to jump an airgap or whatever. (Under my model, such a model probably just waits until it’s deployed to the internet–which is one of the first things that AGI developers want to do with it, because that’s how you make money with a powerful AI–and then kills everyone.)
But I fundamentally agree with his rejection of arguments of the form “only a small part of the space of possible AI actions would be devastatingly bad, so things will probably be fine”.
Scott Garrabrant writes about an argument like this here.
The space you’re selecting over happens to mostly contain bad things
When Hubinger et al argue in section 4.4 of Risks from Learned Optimization that “there are more paths to deceptive alignment than to robust alignment,” they aren’t saying that you get a misaligned mesa-optimizer because the base optimizer is trying to produce an agent that is as misaligned as possible, they’re saying that even though the base optimizer isn’t trying to find a misaligned policy, most policies that it can find are misaligned and so you’ll probably get one. But unlike the previous situation, if instead it was the case that 50% of the policies that SGD might find were aligned, then we’d have a 50% chance of surviving, because SGD isn’t optimizing against us.
I think that AI alignment researchers often conflate these two classes of arguments. IMO, when you’re training an AGI:
(I noticed that I was making a mistake related to mixing up these two classes on Sunday; I then thought about this some more and wrote this post.)
You want to solve a problem in as much generality as possible, and so you want to avoid making assumptions that might not hold
There’s a certain sense in which cryptographers make worst-case assumptions in their research. For example, when inventing public key cryptography, cryptographers were asking the question “Suppose I want to be able to communicate privately with someone, but an eavesdropper is able to read all messages that we send to each other. Is there some way to communicate privately regardless?”
Suppose someone responded by saying “It seems like you’re making the assumption that someone is spying on your communications all the time. But isn’t this unrealistically pessimistic?”
The cryptographer’s response would be to say “Sure, it’s probably not usually the case that someone is spying on my packets when I send messages over the internet. But when I’m trying to solve the technical problem of ensuring private communication, it’s quite convenient to assume a simple and pessimistic threat model. Either I’ll find an approach that works in any scenario less pessimistic than the one I solved, or I’ll learn that we actually need to ensure some other way that no-one’s reading my packets.”
Similarly, in the alignment case, sometimes we make pessimistic empirical assumptions when trying to specify settings for our problems, because solutions developed for pessimistic assumptions generalize to easier situations but the converse isn’t true.
As a large-scale example, when we talk about trying to come up with competitive solutions to AI alignment, a lot of the motivation isn’t the belief that there will be literally no useful global coordination around AI.
A smaller-scale example: When trying to develop schemes for relaxed adversarial training, we assume that we have no access to any interpretability tools for our models. This isn’t because we actually believe that we’ll have no interpretability tools, it’s because we’re trying to develop an alternative to relying on interpretability.
This is kind of similar to the attitude that cryptographers have.
Aiming your efforts at worlds where you have the biggest marginal impact
Suppose you are unsure how hard the alignment problem is. Maybe you think that humanity’s odd’s of success are given by a logistic function of the difference between how much alignment progress was made and how hard the problem is. When you’re considering between a project that gives us a boost in worlds where P(doom) was 50% and projects that help out in worlds where P(doom) was 1% or 99%, you should probably pick the first project, because the derivative of P(doom) with respect to alignment progress is maximized at 50%.
Many prominent alignment researchers estimate P(doom) as substantially less than 50%. Those people often focus on scenarios which are surprisingly bad from their perspective basically for this reason.
And conversely, people who think P(doom) > 50% should aim their efforts at worlds that are better than they expected. This is the point that Eliezer makes in Security Mindset and the Logistic Success Curve: the security-minded character thinks that it’s so unlikely that a particular security-lax project will succeed at building a secure system that she doesn’t think it’s worth her time to try to help them make marginal improvements to their security.
And so, this kind of thinking only pushes you to aim your efforts at surprisingly bad worlds if you’re already P(doom) < 50%.
This type of thinking is common among people who are thinking about global catastrophic biological risks. I don’t know of any public documents that are specifically about this point, but you can see an example of this kind of reasoning in Andrew Snyder-Beattie’s Peak defence vs trough defence in biosecurity.
Murphyjitsu
Sometimes a problem involves a bunch of weird things that could go wrong, and in order to get good outcomes, it has to be the case that all of them go well. For example, I don’t think that “a terrorist infiltrates the team of labellers who are being used to train the AGI and poisons the data” is a very likely AI doom scenario. But I think there are probably 100 scenarios as plausible as that one, each of which sounds kind of bad. And I think it’s probably worth some people’s time to try to stamp out all these individually unlikely failure modes.
Planning fallacy
Ryan Greenblatt notes that you can also make a general reference class claim that people are too optimistic (planning fallacy etc.).
Differences between these arguments
Depending on which of these arguments you’re making, you should respond very differently when someone says “the thing you’re proposing is quite far fetched”.
Mary Phuong proposes breaking this down into two questions: