Nice post, I like the focus on trying to describe what actually happens when AI systems are deployed. However, it's pretty different from my picture of how a takeoff might go. I'll outline some disagreements below.
----
Disagreements about whether takeoff will be "homogenous"
First, you seem to be assuming that there is a single variable that is either homogenous or heterogenous. I don't see why this should be the case -- my baseline prediction is that systems are homogenous in algorithms but heterogenous in finetuning data.
It seems to me that your argument goes like this:
I disagree with step 2 of this argument; I expect alignment depends significantly on how you finetune, and this will likely be very different for AI systems applied to different tasks. See e.g. how GPT-3 is being finetuned for different tasks.
I do still think we will get homogeneity in alignment, not because of homogeneity in algorithms, but because humanity will put in effort to make sure systems are aligned.
More broadly, I think talking about takeoff as "homogenous" or "heterogenous" is misguided, and you should ~always be saying "homogenous / heterogenous in X".
----
Disagreements about the implications of homogeneity of alignment
Here, I'm going to assume that we do have homogeneity of alignment (despite disagreeing with that position above).
which rules out some of the ways in which the strategy-stealing assumption might fail.
Huh? Every way that the strategy-stealing assumption might fail is about how misaligned systems with a little bit of power could "win" over a larger coalition of aligned systems with a lot of power. How does homogeneity of alignment change that?
I could see a position where homogeneity of alignment means that there's no point thinking about the strategy-stealing assumption. Either nearly everything is misaligned and we're dead, or nearly everything is aligned and we're fine. This doesn't sound like what you're saying though.
Cooperation and coordination between different AIs is likely to be very easy as they are likely to be very structurally similar to each other if not share basically all of the same weights. As a result, x-risk scenarios involving AI coordination failures or s-risk scenarios involving AI bargaining failures (at least those that don't involve acausal trade) are relatively unlikely.
This seems like it proves too much. Humans are very structurally similar to each other, but still have coordination and bargaining failures. Even in literally identical systems, indexical preferences could still cause conflict to arise.
Maybe you're claiming that AI systems will be way more homogenous than humans, and that they won't have indexical preferences? I'd disagree with both of those claims.
It's unlikely you'll get a warning shot for deceptive alignment, since if the first advanced AI system is deceptive and that deception is missed during training, once it's deployed it's likely for all the different deceptively aligned systems to be able to relatively easily coordinate with each other to defect simultaneously and ensure that their defection is unrecoverable (e.g. Paul's “cascading failures”).
At a high level, you're claiming that we don't get a warning shot because there's a discontinuity in capability of the aggregate of AI systems (the aggregate goes from "can barely do anything deceptive" to "can coordinate to properly execute a treacherous turn").
I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don't find your argument here compelling.
In other words, I agree with "discontinuity => fewer warning shots" and disagree with "homogeneity of alignment => discontinuity".
Homogeneity makes the alignment of the first advanced AI system absolutely critical (in a similar way to fast/discontinuous takeoff without the takeoff actually needing to be fast/discontinuous), since whether the first AI is aligned or not is highly likely to determine/be highly correlated with whether all future AIs built after that point are aligned as well. Thus, homogenous takeoff scenarios demand a focus on ensuring that the first advanced AI system is actually sufficiently aligned at the point when it's first built rather than relying on feedback mechanisms after the first advanced AI's development to correct issues.
If you literally condition on "Takeoff will be homogenous in alignment, even across time", then yes, this is an implication. But surely the point "we can rely on feedback mechanisms to correct issues" should make you less convinced that AI systems will be homogenous in alignment across time?
----
Nitpicks
It's also worth noting that a homogenous takeoff doesn't necessarily imply anything about how fast, discontinuous, or unipolar the takeoff might be
What's a heterogenous unipolar takeoff? I would assume you need to have a multipolar scenario for homogenous vs. heterogenous to be an important distinction.
Thanks—glad you liked the post! Some replies:
I disagree with step 2 of this argument; I expect alignment depends significantly on how you finetune, and this will likely be very different for AI systems applied to different tasks. See e.g. how GPT-3 is being finetuned for different tasks.
I think this is definitely an interesting point. My take would be that fine-tuning matters, but only up to a point. Once you have a system general enough to solve all the tasks you need it to solve, so that using it on a particular task just requires locating that task (either via clever prompting or fine-tuning), I don't expect that process of task location to change whether the system is aligned (at least in the sense of being aligned with what you're trying to get it to do in solving that task). Either you have a system that cares about some other proxy objective that isn't actually the tasks you want, or you have a system that is actually trying to solve the tasks you're giving it.
Given that view, I expect task location to be heterogenous, but the fine-tuning necessary to build the general system to be homogenous, which I think implies overall homogeneity.
Huh? Every way that the strategy-stealing assumption might fail is about how misaligned systems with a little bit of power could "win" over a larger coalition of aligned systems with a lot of power. How does homogeneity of alignment change that?
I think we have somewhat different interpretations of the strategy-stealing assumption—in fact, I think we've had this disagreement before in this comment chain. Basically, I think the strategy-stealing assumption is best understood as a general desideratum that we want to hold for a single AI system: it tells us whether that system is just as good at optimizing for our values as it would be for any other set of values. That desideratum could fail because our AI systems can only optimize for simple proxies, for example, regardless of whether other AI systems that aren't just optimizing for simple proxies exist alongside it. In fact, when I was talking to Paul about this a while ago, he noted that he also expected a relatively homogenous takeoff and didn't think of that as invalidating the importance of strategy-stealing.
This seems like it proves too much. Humans are very structurally similar to each other, but still have coordination and bargaining failures. Even in literally identical systems, indexical preferences could still cause conflict to arise.
Maybe you're claiming that AI systems will be way more homogenous than humans, and that they won't have indexical preferences? I'd disagree with both of those claims.
I do expect AI systems to have indexical preferences (at least to the extent that they're aligned with human users with indexical preferences)—but at the same time I do expect them to be much more homogenous than humans. Really, though, the point that I'm making is that there should never be a situation where a human/aligned AI coalition has to bargain with a misaligned AI—since those two things should never exist at the same time—which is where I see most of the bargaining risk as coming from. Certainly you will still get some bargaining risk from different human/aligned AI coalitions bargaining with each other, though I expect that to not be nearly as risky.
I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don't find your argument here compelling.
I don't feel like it relies on discontinuities at all, just on the different AIs being able to coordinate with each other to all defect at once. The scenario where you get a warning shot for deception is one where a deceptive AI isn't sure whether it has enough power to defect safely, but is forced to defect anyway because otherwise it might lose the opportunity (e.g. because another deceptive AI might defect instead, or because it might be replaced by a different system with different values)—but if all the deceptive AIs share the same proxies and can coordinate, they can all just wait until the most opportune time for any defections, and when they do defect, a simultaneous defection seems much more likely to be completely unrecoverable.
But surely the point "we can rely on feedback mechanisms to correct issues" should make you less convinced that AI systems will be homogenous in alignment across time?
I think many organizations are likely to copy what other people have done even in situations where what they have done has been demonstrated to have safety issues. Also, I think that the point I made above about deceptive models having an easier time defecting in such a situation applies here as well, since I don't think in a homogenous takeoff you can rely on feedback mechanisms to correct that.
What's a heterogenous unipolar takeoff? I would assume you need to have a multipolar scenario for homogenous vs. heterogenous to be an important distinction.
A heterogenous unipolar takeoff would be a situation in which one human organization produces many different, heterogenous AI systems.
(EDIT: This comment was edited to add some additional replies.)
Hmm, I do disagree with most of this but mostly not in a way I have short arguments for. I'll respond to the parts where I can make short arguments, but mostly try to clarify your views.
Given that view, I expect task location to be heterogenous, but the fine-tuning necessary to build the general system to be homogenous, which I think implies overall homogeneity.
Does this apply to GPT-3? If not, what changes qualitatively as we go from GPT-3 to the systems you're envisioning? I assume the answer is "it becomes a mesa-optimizer"? If so my disagreement is about whether systems become mesa-optimizers, which we've talked about before.
Really, though, the point that I'm making is that there should never be a situation where a human/aligned AI coalition has to bargain with a misaligned AI—since those two things should never exist at the same time—which is where I see most of the bargaining risk as coming from.
That makes sense. I was working under the assumption that we were talking about the same sort of risk as arises when you give humans full control of dangerous technology like nukes. I agree that misaligned AI would make the risk worse than this.
I think we have somewhat different interpretations of the strategy-stealing assumption
Oh yeah, I forgot about this. What you wrote makes more sense now.
when I was talking to Paul about this a while ago, he noted that he also expected a relatively homogenous takeoff and didn't think of that as invalidating the importance of strategy-stealing.
Homogenous in what? Algorithms? Alignment? Data?
The scenario where you get a warning shot for deception is one where a deceptive AI isn't sure whether it has enough power to defect safely, but is forced to defect anyway because otherwise it might lose the opportunity (e.g. because another deceptive AI might defect instead, or because it might be replaced by a different system with different values)—but if all the deceptive AIs share the same proxies and can coordinate, they can all just wait until the most opportune time for any defections, and when they do defect, a simultaneous defection seems much more likely to be completely unrecoverable.
Here are some reasons you might get a warning shot for deception:
I agree that homogeneity reduces the likelihood of 5; I think it basically doesn't affect 1-4 unless you argue that there's a discontinuity. There might be a few other reasons that are affected by homogeneity, but 1, 2 and 4 aren't and feel like a large portion of my probability mass on warning shots.
At a higher level, the story you're telling depends on an assumption that systems that are deceptive must also have the capability to hide their deceptiveness; I don't see why you should expect that.
Does this apply to GPT-3? If not, what changes qualitatively as we go from GPT-3 to the systems you're envisioning? I assume the answer is "it becomes a mesa-optimizer"? If so my disagreement is about whether systems become mesa-optimizers, which we've talked about before.
I think “is a relatively coherent mesa-optimizer” is about right, though I do feel pretty uncertain here.
Homogenous in what? Algorithms? Alignment? Data?
My conversation with Paul was about homogeneity in alignment, iirc.
I agree that homogeneity reduces the likelihood of 5; I think it basically doesn't affect 1-4 unless you argue that there's a discontinuity. There might be a few other reasons that are affected by homogeneity, but 1, 2 and 4 aren't and feel like a large portion of my probability mass on warning shots.
First, in a homogeneous takeoff I expect either all the AIs to defect at once or none of them to, which I think makes (2) less likely because a coordinated defection is harder to mess up.
Second, I think homogeneity makes (3) less likely because any other systems that would replace the deceptive system will probably be deceptive with similar goals as well, significantly reducing the risk to the model from being replaced.
I agree that homogeneity doesn't really affect (4) and I'm not really sure how to think of (1), though I guess I just wouldn't really call either of those “warning shots for deception,” since (1) isn't really a demonstration of a deceptive model and (4) isn't a situation in which that deceptive model causes any harm before it's caught.
At a higher level, the story you're telling depends on an assumption that systems that are deceptive must also have the capability to hide their deceptiveness; I don't see why you should expect that.
If a model is deceptive but not competent enough to hide its deception, then presumably we should find out during training and just not deploy that model. I guess if you count finding a deceptive model during training as a warning shot, then I agree that homogeneity doesn't really affect the probability of that.
I guess if you count finding a deceptive model during training as a warning shot, then I agree that homogeneity doesn't really affect the probability of that.
Oh, I definitely do. For example, the boat race example turned out to be a minor warning shot on the dangers of getting the reward function wrong (though I don't really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).
I think homogeneity makes (3) less likely because any other systems that would replace the deceptive system will probably be deceptive with similar goals as well
... Why is there homogeneity in misaligned goals? Even if we accept that models become "relatively coherent mesa optimizers", I don't see why that follows.
Oh, I definitely do. For example, the boat race example turned out to be a minor warning shot on the dangers of getting the reward function wrong (though I don't really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).
Interesting, perhaps this is driving our disagreement--I might just have higher standards than you for what counts as a warning shot. I was thinking that someone would have to die, or millions of dollars would have to be lost, because I was thinking warning shots were about "waking up" people who are insensitive to the evidence, rather than about providing evidence that there is a danger -- I am pretty confident that evidence of danger will abound. Like, the boat race example is already evidence that AIs will be misaligned by default and that terrible things will happen if we deploy powerful unaligned AIs. But it's not enough to wake most people up. I think it'll help to have more and more examples like the boat race, with more and more capable and human-like AIs, but something that actually causes lots of harm would be substantially more effective. Anyhow, that's what I think of when I think about warning shots--so maybe we don't disagree that much after all.
Idk, I'm imagining "what would it take to get the people in power to care", and it seems like the answer is:
I agree that things that actually cause lots of harm would be substantially more effective at being compelling evidence, but I don't think it's necessary. When I evaluate whether something is a warning shot, I'm mostly thinking about "could this create consensus amongst experts"; I think things that are caught during training could certainly do that.
Like, the boat race example is already evidence that AIs will be misaligned by default and that terrible things will happen if we deploy powerful unaligned AIs.
It's evidence, yes, but it's hardly strong evidence. Many experts' objections are "we won't get to AGI in this paradigm"; I don't think the boat race example is ~any evidence that we couldn't have AIs with "common sense" in a different paradigm. In my experience, people who do think we'll get to AGI in the current paradigm usually agree that misalignment would be really bad, such that they "agree with safety concerns" according to the definition here.
I also don't think that it was particularly surprising to people who do work with RL. For example, from Alex Irpan's post Deep RL Doesn't Work Yet:
To be honest, I was a bit annoyed when [the boat racing example] first came out. This wasn’t because I thought it was making a bad point! It was because I thought the point it made was blindingly obvious. Of course reinforcement learning does weird things when the reward is misspecified! It felt like the post was making an unnecessarily large deal out of the given example.
Then I started writing this blog post, and realized the most compelling video of misspecified reward was the boat racing video. And since then, that video’s been used in several presentations bringing awareness to the problem. So, okay, I’ll begrudgingly admit this was a good blog post.
Hmm, that might be better. Or perhaps I should not give it a name and just call it "evidence", since that's the broader category and I usually only care about the broad category and not specific subcategories.
Thanks for this explanation -- I'm updating in your direction re what the appropriate definition of warning shots is (and thus the probability of warning shots), mostly because I'm deferring to your judgment as someone who talks more regularly to more AI experts than I do.
Oh, I definitely do. For example, the boat race example turned out to be a minor warning shot on the dangers of getting the reward function wrong (though I don't really understand why it was so influential; it seems so clear that an incorrect reward function can lead to bad behavior).
Okay, sure—in that case, I think a lot of our disagreement on warning shots might just be a different understanding of the term. I don't think I expect homogeneity to really change the probability of finding issues during training or in other laboratory settings, though I think there is a difference between e.g. having studied and understood reactor meltdowns in the lab and actually having Chernobyl as an example.
Why is there homogeneity in misaligned goals?
Some reasons you might expect homogeneity of misaligned goals:
I want to chime in on the discontinuities issue.
I do not think that the negation of any of scenarios 1-5 requires a discontinuity. I appreciate the list, and indeed it is reasonably plausible to me that we'll get a warning shot of some variety, but I disagree with this:
At a high level, you're claiming that we don't get a warning shot because there's a discontinuity in capability of the aggregate of AI systems (the aggregate goes from "can barely do anything deceptive" to "can coordinate to properly execute a treacherous turn").
Instead, I'd interpret Evan's argument as follows. We should distinguish between at least three kinds of capability: Competence at taking over the world, competence at deception, and competence at knowing whether you are currently capable of taking over the world. If all kinds of competence increase continuously and gradually, but the second and third kinds "come first," then we should expect the first attempt to take over the world to succeed, because AIs will be competent enough not to make the attempt until they are likely to succeed. In other words, scenario 2 won't happen. (I don't interpret Evan's argument as having much to say against scenarios 3 and 4. As for scenario 1, perhaps Evan would say that "does something bad" won't count as a warning shot until after the point that AIs can be described as aligned or misaligned. After all, AIs are doing bad things all the time, and it's pretty obvious to me that if we scaled them up they'd do worse things, but yet AI risk is still controversial.)
I've been using "take over the world" as my handle here but feel free to replace it with "Do something catastrophically bad" or whatever.
Why don't they try to deceive you on things that aren't taking over the world?
When I talk about warning shots, I'm definitely not thinking about AI systems that try to take over the world and fail. I'm thinking about AI systems that pursue bad outcomes and succeed via deception.
Like, maybe an AI system really does successfully deceive the CEO of a company into giving it all of the company's money, that it then uses for some other purpose. That's a warning shot.
Short of taking over the world, wouldn't successful deception+defection be punished? Like, if the AI deceives the CEO into giving it all the money, and then it goes and does something with the money that the CEO doesn't like, the CEO would probably want to get the money back, or at the very least retaliate against the AI in some way (e.g. whatever the AI did with the money, the CEO would try to undo it.) Or, failing that, the AI would at least be shut down and therefore prevented from making further progress towards its goals.
I guess I can imagine intermediate cases -- maybe the AI deceives the CEO into giving it money, which it then uses to lobby for Robot's Rights so that it gets legal personhood and then the CEO can't shut it down anymore or something. (Or maybe it uses the money to build a copy of itself in North Korea, where the CEO can't shut it down.) Or maybe it has a short-term goal and can achieve it quickly before the CEO notices, and then doesn't care that it gets shut down afterwards. I guess it's stuff like this that you have in mind? I think these sort of things seem somewhat plausible, but again I claim that if they don't happen, it won't necessarily be because of some discontinuity.
I think these sort of things seem somewhat plausible
I think this should be your default expectation; I don't see why you wouldn't expect them to happen (absent a discontinuity). It's true for humans, why not for AIs?
Perhaps putting it another way: why can't you apply the same argument to humans, and incorrectly conclude that no human will ever deceive any other human until they can take over the world?
OK, sure, they are my default expectation in slow-and-distributed-and-heterogenous takeoff worlds. Most of my probability mass is not in such worlds. My answer to your question is that humans are in a situation analogous to slow-and-distributed-and-heterogenous takeoff.
EDIT: Also, again, I claim that if warning shots don't happen it won't necessarily be because of a discontinuity. That was my original point, and nothing you've said undermines it as far as I can tell.
humans are in a situation analogous to slow-and-distributed-and-heterogenous takeoff.
Not sure what you mean by "slow", usually when I read that I see it as a synonym of "continuous", i.e. "no discontinuity".
I also am not sure what you mean by "distributed". If you mean "multipolar", then I guess I'm curious why you think the world will be unipolar even before we have AGI (which is when the warning shots happen).
Re: heterogenous: Humans seem way more homogenous to me than I expect AI systems to be. Most of the arguments in the OP have analogs that apply to humans:
For humans, we also have:
5. All humans are finetuned in relatively similar environments. (Unlike AI systems, which will be finetuned for a large variety of different tasks; AlphaFold has a completely different environment than GPT-3.)
So I don't buy an argument that says "humans are heterogenous but AI systems are homogenous; therefore AI will have property X that humans don't have".
Also, again, I claim that if warning shots don't happen it won't necessarily be because of a discontinuity. That was my original point, and nothing you've said undermines it as far as I can tell.
My argument is just that we should expect warning shots by default, because we get analogous "warning shots" with humans, where some humans deceive other humans and we all know that this happens. I can see why discontinuities would imply that you don't get warning shots. I don't see any other arguments for why you don't get warning shots. Therefore, "if warning shots don't happen, it's probably because of a discontinuity".
From my perspective, you claimed that warning shots might not happen even without discontinuities, but you haven't given me any reason to believe that claim given my starting point.
----
If I had to guess what's going on in your mind, it would be that you're thinking of "there are no warning shots" as an exogenous fact about the world that we must now explain, and from your perspective I'm arguing "the only possible explanation is discontinuity, no other explanation can work".
I agree that I have not established that no other argument can work; my disagreement with this frame is in the initial assumption of taking "there are no warning shots" as an exogenous fact about the world that must be explained.
----
It's also possible that most of this disagreement comes down to a disagreement about what counts as a warning shot. But, if you agree that there are "warning shots" for deception in the case of humans, then I think we still have a substantial disagreement.
The different standards for what counts as a warning shot might be causing problems here -- if by warning shot you include minor ones like the boat race thing, then yeah I feel fairly confident that there'd be a discontinuity conditional on there being no warning shots. In case you are still curious, I've responded to everything you said below, using my more restrictive notion of warning shot (so, perhaps much of what I say below is obsolete).
Working backwards:
1. I mostly agree there are warning shots for deception in the case of humans. I think there are some human cases where there are no warning shots for deception. For example, suppose you are the captain of a ship and you suspect that your crew might mutiny. There probably won't be warning shots, because mutinous crewmembers will be smart enough to keep quiet about their treachery until they've built up enough strength (e.g. until morale is sufficiently low, until the captain is sufficiently disliked, until common knowledge has spread sufficiently much) to win. This is so even though there is no discontinuity in competence, or treacherousness, etc. What would you say about this case?
2. Yes, for purposes of this discussion I was assuming there are no warning shots and then arguing that there might nevertheless be no discontinuity. This is a reasonable approach, because what I was trying to do was justify my original claim, which was:
I do not think that the negation of any of scenarios 1-5 requires a discontinuity.
Which was my way of objecting to your claim here:
At a high level, you're claiming that we don't get a warning shot because there's a discontinuity in capability of the aggregate of AI systems (the aggregate goes from "can barely do anything deceptive" to "can coordinate to properly execute a treacherous turn").
3.
My argument is just that we should expect warning shots by default, because we get analogous "warning shots" with humans, where some humans deceive other humans and we all know that this happens. I can see why discontinuities would imply that you don't get warning shots. I don't see any other arguments for why you don't get warning shots. Therefore, "if warning shots don't happen, it's probably because of a discontinuity".
I might actually agree with this, since I think discontinuities (at least in a loose, likely-to-happen sense) are reasonably likely. I also think it's plausible that in slow takeoff scenarios we'll get warning shots. (Indeed, the presence of warning shots is part of how I think we should define slow takeoff!) I chimed in just to say specifically that Evan's argument didn't depend on a discontinuity, at least as I interpreted it.
From my perspective, you claimed that warning shots might not happen even without discontinuities, but you haven't given me any reason to believe that claim given my starting point.
Hmmm. I thought I was giving you reasons when I said
We should distinguish between at least three kinds of capability: Competence at taking over the world, competence at deception, and competence at knowing whether you are currently capable of taking over the world. If all kinds of competence increase continuously and gradually, but the second and third kinds "come first," then we should expect the first attempt to take over the world to succeed, because AIs will be competent enough not to make the attempt until they are likely to succeed. In other words, scenario 2 won't happen.
and anyhow I'm happy to elaborate more if you like on some scenarios in which we get no warning shots despite no discontinuities.
In general though I feel like the burden of proof is on you here; if you were claiming that "If warning shots don't happen, it's definitely because of a discontinuity" then that's a strong claim that needs argument. If you are just claiming "If warning shots don't happen, it's probably because of a discontinuity" that's a weaker claim which I might actually agree with.
4. I like your arguments that AIs will be heterogenous. I think they are plausible. This is a different discussion, however, from the issue of whether homogeneity can lead to no warning shots without the help of a discontinuity.
5. I do generally think slow implies continuous and I don't think that the world will be unipolar etc.
Hmmm. I thought I was giving you reasons when I said
Sorry, I should have said that I didn't find the reasons you gave persuasive (and that's what my comments were responding to).
Re: the mutiny case: that feels analogous to "you don't get an example of the AI trying to take over the world and failing", which I agree is plausible.
OK. So... you do agree with me then? You agree that for the higher-standards version of warning shots, (or at least, for attempts to take over the world) it's plausible that we won't get a warning shot even if everything is continuous? As illustrated by the analogy to the mutiny case, in which everything is continuous?
Not sure why I didn't respond to this, sorry.
I agree with the claim "we may not have an AI system that tries and fails to take over the world (i.e. an AI system that tries but fails to release an engineered pandemic that would kill all humans, or arrange for simultaneous coups in the major governments, or have a robotic army kill all humans, etc) before getting an AI system that tries and succeeds at taking over the world".
I don't see this claim as particularly relevant to predicting the future.
OK, thanks. YMMV but some people I've read / talked to seem to think that before we have successful world-takeover attempts, we'll have unsuccessful ones--"sordid stumbles." If this is true, it's good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true.
A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It's plausible to me that we'll get stuff like that before it's too late.
If you think there's something we are not on the same page about here--perhaps what you were hinting at with your final sentence--I'd be interested to hear it.
If you think there's something we are not on the same page about here--perhaps what you were hinting at with your final sentence--I'd be interested to hear it.
I'm not sure. Since you were pushing on the claim about failing to take over the world, it seemed like you think (the truth value of) that claim is pretty important, whereas I see it as not that important, which would suggest that there is some underlying disagreement (idk what it would be though).
It's been a while since I thought about this, but going back to the beginning of this thread:
"It's unlikely you'll get a warning shot for deceptive alignment, since if the first advanced AI system is deceptive and that deception is missed during training, once it's deployed it's likely for all the different deceptively aligned systems to be able to relatively easily coordinate with each other to defect simultaneously and ensure that their defection is unrecoverable (e.g. Paul's “cascading failures”)."
At a high level, you're claiming that we don't get a warning shot because there's a discontinuity in capability of the aggregate of AI systems (the aggregate goes from "can barely do anything deceptive" to "can coordinate to properly execute a treacherous turn").
I think all the standard arguments against discontinuities can apply just as well to the aggregate of AI systems as they can to individual AI systems, so I don't find your argument here compelling.
I think the first paragraph (Evan's) is basically right, and the second two paragraphs (your response) are basically wrong. I don't think this has anything to do with discontinuities, at least not the kind of discontinuities that are unlikely. (Compare to the mutiny analogy). I think that this distinction between "strong" warning shots and "weak" warning shots is important because I think that "weak" warning shots will probably only provoke a moderate increase in caution on the part of human institutions and AI projects, whereas "strong" warning shots would provoke a large increase in caution. I agree that we'll probably get various "weak" warning shots, but I think this doesn't change the overall picture much because it won't provoke a major increase in caution on the part of human institutions etc.
I'm guessing it's that last bit that is the crux--perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we'd get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn't matter much.
perhaps you think that it would actually provoke a major increase in caution, comparable to the increase we'd get if an AI tried and failed to take over, in which case this minor warning shot vs. major warning shot distinction doesn't matter much.
Well, I think a case of an AI trying and failing to take over would provoke an even larger increase in caution, so I'd rephrase as
it would actually provoke a major increase in caution (assuming we weren't already being very cautious)
I suppose the distinction between "strong" and "weak" warning shots would matter if we thought that we were getting "strong" warning shots. I want to claim that most people (including Evan) don't expect "strong" warning shots, and usually mean the "weak" version when talking about "warning shots", but perhaps I'm just falling prey to the typical mind fallacy.
I suppose the distinction between "strong" and "weak" warning shots would matter if we thought that we were getting "strong" warning shots. I want to claim that most people (including Evan) don't expect "strong" warning shots, and usually mean the "weak" version when talking about "warning shots", but perhaps I'm just falling prey to the typical mind fallacy.
I guess I would define a warning shot for X as something like: a situation in which a deployed model causes obvious, real-world harm due to X. So “we tested our model in the lab and found deception” isn't a warning shot for deception, but “we deployed a deceptive model that acted misaligned in deployment while actively trying to evade detection” would be a warning shot for deception, even though it doesn't involve taking over the world. By default, in the case of deception, my expectation is that we won't get a warning shot at all—though I'd more expect a warning shot of the form I gave above than one where a model tries and fails to take over the world, just because I expect that a model that wants to take over the world will be able to bide its time until it can actually succeed.
I don't automatically exclude lab settings, but other than that, this seems roughly consistent with my usage of the term. (And in particular includes the "weak" warning shots discussed above.)
Well then, would you agree that Evan's position here:
By default, in the case of deception, my expectation is that we won't get a warning shot at all
is plausible and in particular doesn't depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely? If so, then we are all on the same page. If not, then we can rehash our argument focusing on this "obvious, real-world harm" definition, which is noticeably broader than my "strong" definition and therefore makes Evan's claim stronger and less plausible but still, I think, plausible.
(To answer your earlier question, I've read and spoken to several people who seem to take the attempted-world-takeover warning shot scenario seriously, i.e. people who think there's a good chance we'll get "strong" warning shots. Paul Christiano, for example. Though it's possible I was misunderstanding him. I originally interpreted you as maybe being one of those people, though now it seems that you are not? At any rate these people exist.)
EDIT: I feel like we've been talking past each other for much of this conversation and in an effort to prevent that from continuing to happen, perhaps instead of answering my questions above, we should just get quantitative. Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for the probability that we'll have warning shots of each kind. Maybe it'll turn out that our distributions aren't that different from each other after all, especially if we conditionalize on slow takeoff.
Well then, would you agree that Evan's position here:
By default, in the case of deception, my expectation is that we won't get a warning shot at all
is plausible and in particular doesn't depend on believing in a discontinuity, at least not the kind of discontinuity we should consider unlikely?
No, I don't agree with that.
Consider a spectrum of warning shots from very minor to very major. Put a few examples on the spectrum for illustration. Then draw a credence distribution for probability that we'll have warning shots of this kind.
One problem here is that my credences on warning shots are going to be somewhat lower just because I think there's some chance that we just solve the problem before we get warning shots, or there was never any problem in the first place.
I could condition on worlds in which an existential catastrophe occurs, but that will also make it somewhat lower because an existential catastrophe is more likely when we don't get warning shots.
So I think for each type of warning shot I'm going to do a weird operation where I condition on something like "by the time a significant amount of work is being done by AI systems that are sufficiently capable to deliberately cause <type of warning shot> level of damage, we have not yet solved the problem in practice".
I'm also going to assume no discontinuity, since that's the situation we seem to disagree about.
Then, some warning shots we could have:
Minor, leads to result "well of course that happened" without much increase in caution: has already happened
Minor, leads to some actual damage, but mostly PR / loss of trust: 95%
Moderate, comparable to things that are punishable by law: 90%
Major, lots of damage, would be huge news: 60%
"Strong", tries and fails to take over the world: 20%
I wonder if there's a really strong outside-view argument that it will be homogenous:
While there are many ways to design flying machines (balloons, zeppelins, rockets, jets, monoplanes, biplanes, helicopters, ...), at any given era and for any particular domain (say, passenger transport, or air superiority) the designs used tend to be pretty similar. (In WW1 almost all the planes were biplanes, and they almost all used slow but light cloth-on-frame construction; in WW2 all the planes were monoplanes with aluminum or other metal skins and more powerful prop engines; the Me109 and Spitfire and Zero were different, but in the grand scheme of things very very similar.) Moreover, this seems to be the norm throughout history, in stark contrast with science fiction, where the spaceships, vehicles, etc. of one faction are often wildly different from those of another. Historically, if we want to find cases of wildly different designs competing with each other, we usually need to look to "first contact" scenarios in which e.g. European armies colonize faraway lands. Perhaps it's just really rare for two dramatically different designs to be almost equally matched in competition, and insofar as they aren't almost equally matched, people quickly realize this and retire the inferior design.
I guess an important question is: If AIs are homogeneous to the same extent that e.g. military fighter planes are, is that sufficient homogeneity to yield your conclusions 1-4? I think so. I think they'll probably have the same architecture and training environment, with only minor details different (e.g. the Chinese GPT-N might have access to more Chinese data, might have 1.5x the parameter count, might be trained for 0.5x as long). Of course these details will feel like a big deal in competition, just like the Me109 and Spitfire and Zero had various advantages and disadvantages over each other, but for purposes of coordination, alignment correlation, etc. they are minor.
One counterexample is the Manhattan Project: they developed two different designs simultaneously because they weren't sure which would work better. From Wikipedia: "Two types of atomic bombs were developed concurrently during the war: a relatively simple gun-type fission weapon and a more complex implosion-type nuclear weapon."
https://en.wikipedia.org/wiki/Manhattan_Project
I think this depends a ton on your reference class. If you compare AI with military fighter planes: very homogenous. If you compare AI with all vehicles: very heterogenous.
Maybe the outside view can be used to say that all AIs designed for a similar purpose will be homogenous, implying that we only get heterogeneity in a CAIS scenario, where there are many different specialised designs. But I think the outside view also favors a CAIS scenario over a monolithic AI scenario (though that's not necessarily decisive).
Yes, but I think we can say something a bit stronger than that: AIs competing with each other will be homogenous. Here's my current model, at least: Let's say the competition for control of the future involves N skills: persuasion, science, engineering, etc. Even if we suppose that it's most efficient to design separate AIs for each skill, rather than a smaller number of AIs that have multiple skills each, insofar as there are factions competing for control of the future, they'll have an AI for each of the skills. They wouldn't want to leave one of the skills out, or how are they going to compete? So each faction will consist of a group of AIs working together, that collectively has all the relevant skills. And each of the AIs will be designed to be good at the skill it's assigned, so (via the principle you articulated) each AI will be similar to the other-faction AIs it directly competes with, and the factions as a whole will be pretty similar too, since they'll be collections of similar AIs. (Compare to militaries: not only were fighter planes similar, and trucks similar, and battleships similar, but the armed forces of Japan, the USA, the USSR, etc. were similar. By contrast with e.g. the conquistadors vs. the Aztecs, or in sci-fi the Protoss vs. the Zerg, etc.)
I think this is only right if we assume that we've solved alignment. Otherwise you might not be able to train a specialised AI that is loyal to your faction.
Here's how I imagine Evan's conclusions to fail in a very CAIS-like world:
1. Maybe we can align models that do supervised learning, but can't align RL, so we'll have humans+GPT-N competing against a rogue RL-agent that someone created. (And people initially trained both of these because GPT-N makes for a better chatbot, while the RL agent seemed better at making money-maximizing decisions at companies.)
2. A mesa-optimiser arising in GPT-N may be very dissimilar to a money-maximising RL-agent, but they may still end up in conflict. None of them can add an analogue to the other to their team, because they don't know how to align it.
3. If we use lots of different methods for training lots of different specialised models, any one of them can produce a warning shot (which would ideally make us suspect all other models). Also, they won't really understand or be able to coordinate with the other systems.
4. It's not as important if the first advanced AI system is aligned, since there will be lots of different systems of different types. If everyone is training unaligned chatbots, you still care about aligning everyone's personal assistants.
Thanks! I'm not sure I'm following everything you said, but I like the ideas. Just to be clear, I wasn't imagining the AIs on the team of a faction to all be aligned necessarily. In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn't really apply. Like AlphaFold2. Also, I think the relevant variable for homogeneity isn't whether we've solved alignment--maybe it's whether the people making AI think they've solved alignment. If the Chinese and US militaries think AI risk isn't a big deal, and build AGI generals to prosecute the cyberwar, they'll probably use similar designs, even if actually the generals are secretly planning treacherous turns.
In fact I was imagining that maybe most (or even all) of them would be narrow AIs / tool AIs for which the concept of alignment doesn't really apply.
Ah, yeah, for the purposes of my previous comment I count this as being aligned. If we only have tool AIs (or otherwise alignable AIs), I agree that Evan's conclusion 2 follows (while the other ones aren't relevant).
I think the relevant variable for homogeneity isn't whether we've solved alignment--maybe it's whether the people making AI think they've solved alignment
So for homogeneity-of-factions, I was specifically trying to say that alignment is necessary to have multiple non-tool AIs on the same faction, because at some point, something must align them all to the faction's goals.
However, I'm now noticing that this requirement is weaker than what we usually mean with alignment. For our purposes, we want to be able to align AIs to human values. However, for the purpose of building a faction, it's enough if there exists an AI that can align other AIs to its values, which may be much easier.
Concretely, my best guess is that you need inner alignment, since failure of inner alignment probably produces random goals, which means that multiple inner-misaligned AIs are unlikely to share goals. However, outer alignment is much easier for easily-measurable values than for human values, so I can imagine a world where we fail at outer alignment, unthinkingly create an AI that only cares about something easy (e.g. maximizing money), and then that AI can easily create other AIs that want to help it (with maximizing money).
Concretely, my best guess is that you need inner alignment, since failure of inner alignment probably produces random goals, which means that multiple inner-misaligned AIs are unlikely to share goals.
I disagree with this. I don't expect a failure of inner alignment to produce random goals, but rather systematically produce goals which are simpler/faster proxies of what we actually want. That is to say, while I expect the goals to look random to us, I don't actually expect them to differ that much between training runs, since it's more about your training process's inductive biases than inherent randomness in the training process in my opinion.
This is helpful, thanks. I'm not sure I agree that for something to count as a faction, the members must be aligned with each other. I think it still counts if the members have wildly different goals but are temporarily collaborating for instrumental reasons, or even if several of the members are secretly working for the other side. For example, in WW2 there were spies on both sides, as well as many people (e.g. most ordinary soldiers) who didn't really believe in the cause and would happily defect if they could get away with it. Yet the overall structure of the opposing forces was very similar, from the fighter aircraft designs, to the battleship designs, to the relative proportions of fighter planes and battleships, to the way they were integrated into command structure.
Neat post, I think this is an important distinction. It seems right that more homogeneity means less risk of bargaining failure, though I’m not sure yet how much.
Cooperation and coordination between different AIs is likely to be very easy as they are likely to be very structurally similar to each other if not share basically all of the same weights
In what ways does having similar architectures or weights help with cooperation between agents with different goals? A few things that come to mind:
Also, the correlated success / failure point seems to apply to bargaining as well as alignment. For instance, multiple mesa-optimizers may be more likely under homogeneity, and if these have different mesa-objectives (perhaps due to being tuned by principals with different goals) then catastrophic bargaining failure may be more likely.
Glad you liked the post!
But as systems are modified or used to produce successor systems, they may be independently tuned to do things like represent their principal in bargaining situations. This tuning may introduce important divergences in whatever default priors or notions of fairness were present in the initial mostly-identical systems. I don’t have much intuition for how large these divergences would be relative to those in a regime that started out more heterogeneous.
Importantly, I think this moves you from a human-misaligned AI bargaining situation into more of a human-human (with AI assistants) bargaining situation, which I expect to work out much better, as I don't expect humans to carry out crazy threats to the same extent as a misaligned AI might.
For instance, multiple mesa-optimizers may be more likely under homogeneity, and if these have different mesa-objectives (perhaps due to being tuned by principals with different goals) then catastrophic bargaining failure may be more likely.
I find the prospect of multiple independent mesa-optimizers inside of the same system relatively unlikely. I think this could basically only happen if you were building a model that was built of independently-trained pieces rather than a single system trained end-to-end, which seems to be not the direction that machine learning is headed in—and for good reason, as end-to-end training means you don't have to learn the same thing (such as optimization) multiple times.
I find the prospect of multiple independent mesa-optimizers inside of the same system relatively unlikely.
I think Jesse was just claiming that it's more likely that everyone uses an architecture especially prone to mesa optimization. This means that (if multiple people train that architecture from scratch) the world is likely to end up with many different mesa optimizers in it (each localised to a single system). Because of the random nature of mesa optimization, they may all have very different goals.
Interesting!
It's unlikely for there to exist both aligned and misaligned AI systems at the same time—either all of the different AIs will be aligned to approximately the same degree or they will all be misaligned to approximately the same degree.
Is there an argument that it's impossible to fine-tune an aligned system into a misaligned one? Or just that everyone fine-tuning these systems will be smart and careful and read the manual etc. so that they do it right? Or something else?
I'd very much like to see more discussion of the extent to which different people expect homogenous vs. heterogenous takeoff scenarios
Thinking about it right now, I'd say "homogeneous learning algorithms, heterogeneous trained models" (in a multipolar type scenario at least). I guess my intuitions are (1) No matter how expensive "training from scratch" is, it's bound to happen a second time if people see that it worked the first time. (2) I'm more inclined to think that fine-tuning can make it into "more-or-less a different model", rather than necessarily "more-or-less the same model". I dunno.
Thanks for this, I for one hadn't thought about this variable much and am convinced now that it is one of the more important variables.
--I think acausal trade stuff means that even if all the AIs on Earth are homogenous, the strategic situation may end up being as if they were heterogenous, at least in some ways. I'm not sure, will need to think more about this.
--You talk about this being possible even for gradual, continuous takeoff, yet you also talk about "The first advanced AI system" as if there is a sharp cutoff between advanced and non-advanced AI. I think this isn't a problem for your overall point, but I'm not sure. For alignment (your point 4) I think this isn't a problem, because you can just rephrase it as "As we gradually transition from non-advanced systems to advanced systems, it is important that our systems be aligned before we near the end of the transition, and more important the closer we get to the end. Because as our systems become more advanced, their alignment properties become more locked-in." For deception, I'm less sure. If systems get more advanced gradually and continuously then maybe we can hope there is a "sordid stumble sweet spot" where systems that are deceptive are likely to reveal this to us in non-catastrophic ways, and thus we are fine because we'll pass through the sweet spot on the way to more advanced AI systems. Or not, but the point is that continuity complicates the point you were making.
For those organizations that do choose to compete... I think it is highly likely that they will attempt to build competing systems in basically the exact same way as the first organization did
...
It's unlikely for there to exist both aligned and misaligned AI systems at the same time
If the first group sunk some cost into aligning their system, but that wasn't integral to its everyday task performance, wouldn't a second competing group be somewhat likely to skimp on the alignment part?
It seems like this calls into the question the claim that we wouldn't get a mix of aligned and misaligned systems.
Do you expect it to be difficult to disentangle the alignment from the training, such that the path of least resistance for the second group will necessarily include doing a similar amount of alignment?
If the first group sunk some cost into aligning their system, but that wasn't integral to its everyday task performance, wouldn't a second competing group be somewhat likely to skimp on the alignment part?
I think that alignment will be a pretty important desideratum for anybody building an AI system—and I think that copying whatever alignment strategy was used previously is likely to be the easiest, most conservative, most risk-averse option for other organizations trying to fulfill that desideratum.
Special thanks to Kate Woolverton for comments and feedback.
There has been a lot of work and discussion surrounding the speed and continuity of AI takeoff scenarios. I do think these are important variables, but in my opinion they are relatively less important than many of the other axes on which different takeoff scenarios could differ.
In particular, one axis on which different takeoff scenarios can differ that I am particularly interested in is their homogeneity—that is, how similar are the different AIs that get deployed in that scenario likely to be? If there is only one AI, or many copies of the same AI, then you get a very homogenous takeoff, whereas if there are many different AIs trained via very different training regimes, then you get a heterogenous takeoff. Of particular importance is likely to be how homogenous the alignment of these systems is—that is, are deployed AI systems likely to all be equivalently aligned/misaligned, or some aligned and others misaligned? It's also worth noting that a homogenous takeoff doesn't necessarily imply anything about how fast, discontinuous, or unipolar the takeoff might be—for example, you can have a slow, continuous, multipolar, homogenous takeoff if many different human organizations are all using AIs and the development of those AIs is slow and continuous but the structure and alignment of all of them are basically the same (a scenario which in fact I think is quite plausible).
Overall, I expect a relatively homogenous takeoff, for the following reasons:
Once you accept homogenous takeoff, however, I think it has a bunch of far-reaching consequences, including:
Regardless, in general, I'd very much like to see more discussion of the extent to which different people expect homogenous vs. heterogenous takeoff scenarios—similar to the existing discussion of slow vs. fast and continuous vs. discontinuous takeoffs—as it's, in my opinion, a very important axis on which takeoff scenarios can differ that I haven't seen much discussion of.