Response to: Evolution Provides No Evidence For the Sharp Left Turn, prompted by its winning first prize in The Open Philanthropy Worldviews contest.

Quintin’s post is an argument about a key historical reference class and what it tells us about AI. Rather than arguing that this reference class makes his point, he argues that it does not make anyone’s point - that we understand the reasons for humanity’s sudden growth in capabilities. He says this jump was caused by gaining access to cultural transmission, which allowed partial preservation of in-lifetime learning across generations, a vast efficiency gain that fully explains the orders-of-magnitude jump in the expansion of human capabilities. Since AIs already preserve their metaphorical in-lifetime learning across their metaphorical generations, he argues, this does not apply to AI.
This last paragraph makes an extremely important claim that I want to ensure I convey fully:
- IF we understand the mechanism behind humanity's sharp left turn with respect to evolution
- AND that mechanism is inapplicable to AI development
- THEN, there's no reason to reference evolution at all when forecasting AI development rates, not as evidence for a sharp left turn, not as an "illustrative example" of some mechanism / intuition which might supposedly lead to a sharp left turn in AI development, not for anything.
That seems like quite a leap. If there is one particular development in humanity’s history that we can fully explain, we should then not cite evolution in any way, as an argument for anything?
If we applied this standard to other forms of reference class arguments, then presumably we would throw most or almost all of them out as well - anything where the particular mechanism was understood, and didn’t directly apply to AI, would not count. I do not see Quintin or other skeptics proposing this standard of evidence more generally, nor do I think it would make sense.
He then goes on to make another very broad claim.
In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:
1. Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.[3]
2. Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.[4]
3. Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.[5]
This is the AI development equivalent of using landmines as hubcaps.
There are a number of extra things snuck in here that seem unnecessary.
Start with ‘deliberately.’ Why would that matter? Either it gets created or it doesn’t. Same with ‘obvious’: physics does not care what is obvious to us. We already presumably have what some of us would call obvious inner optimizers being simulated inside (e.g.) GPT-4 in order to predict humans who optimize for things. Did we create those deliberately?
Why does it matter whether their inner loss function ‘mentions’ human values or objectives? This seems like a non-sequitur. It seems highly unlikely that the exact intended (let alone resultant) target of the optimizer will be ‘human values’ generally, even if we knew how to specify that or what it even means, whether or not the optimizer was created intentionally.
Step two seems rather arbitrary. Why billions? Why is it not sufficient that it be superior at the relevant optimizations? This is a strange isolated demand for matching order-of-magnitude ratios. Quintin also seems (in a following paragraph) to want to count only explicit and intentionally applied optimization pressure that targets the inner optimizer, as opposed to the effective impact of whatever it is you do.
Step three does not seem necessary at all. It does seem like we are currently doing exactly this, but even if we weren’t, the inner optimizer has more optimization pressure working for it in the relevant areas, so why would we presume that the outer optimizer would be able to supervise it effectively or otherwise keep it in check? At most, we need the outer optimizer to not be a sufficiently effective control.
(If one does wish to use the metaphor we are debating: reproductive fitness and survival do indeed supervise, limit and intervene on culture, and that has not proven a sufficiently effective way to align culture to reproductive fitness, although it very much does continuously shape culture in fundamental ways. Note also that this worked much better when culture moved relatively slowly. Now, with cultural evolution fast relative to human life cycles, we see far more people in cultures that are very bad at reproductive fitness. Such cultures existed in the past as well (e.g. monastic forms of religions like Christianity), but they self-limited in size.)
I would also note that if you discover (as in Quintin’s example of Evo-inc) that major corporations are going around using landmines as hubcaps, and that they nonetheless managed to gain dominant car market share and build the world’s most functional cars until recently, that is indeed a valuable piece of information about the world, and about whether you should trust corporations or other humans to make good choices, recognize obvious dangers and build safe objects in general. Why would you think that such evidence should be ignored?
Quintin then proposes the two ways he sees fast takeoff as still possible.
1. AIs contributing to AI capabilities research, producing a positive feedback loop with a sharp upwards kink around the time that AI contributions exceed human contributions.
2. AIs deliberately seeking out new training data that grant them useful capabilities. E.g., an AI trying to improve its bioengineering capabilities may set up a very fast cycle of gathering and analyzing new biological data, which significantly outpaces the rate of human scientific innovation.
I would argue that ‘AIs contribute to AI capabilities research’ is highly analogous to ‘humans contribute to figuring out how to train other humans.’ And that ‘AIs seeking out new training data’ is highly analogous to ‘humans creating bespoke training data to use to train other people, especially their children, via culture.’ These are exactly the mechanisms Quintin describes humans as using to make a sharp left turn.
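To make the first of those mechanisms concrete, here is a minimal toy sketch of the feedback loop Quintin describes. The numbers are entirely illustrative inventions of mine, not anything from his post or a forecast: total research output is a fixed human term plus an AI term that scales with capability, research accumulates into capability, and growth kinks upward around the point where AI contributions exceed human contributions.

```python
# Toy model of "AIs contributing to AI capabilities research."
# All parameters are made-up, illustrative values; only the shape matters.
# Growth is near-linear while humans dominate research output, then kinks
# upward once the AI contribution overtakes the fixed human contribution.

human_rate = 1.0          # human research output per year (arbitrary units)
ai_per_capability = 0.1   # assumed AI research output per unit of capability
capability = 1.0          # starting capability level (arbitrary units)

for year in range(1, 31):
    ai_rate = ai_per_capability * capability
    capability += human_rate + ai_rate   # research accumulates into capability
    if year % 5 == 0:
        share = ai_rate / (ai_rate + human_rate)
        print(f"year {year:2d}: capability {capability:7.1f}, AI share {share:.0%}")
```

Under these made-up parameters the AI share of research crosses 50% within a decade and capability growth accelerates sharply afterwards, which is the "sharp upwards kink" in the quoted mechanism.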
Here are four alignment-relevant implications:
1. Takeoff is less abrupt. Both of the above mechanisms are vaguely similar to how human cultural development allowed us to jump forwards in capabilities by feeding the outputs of one generation into the “training data” of the next generation. However, I expect that neither mechanism will produce as much of a relative jump in AI capabilities, as cultural development produced in humans. Neither mechanism would suddenly unleash an optimizer multiple orders of magnitude faster than anything that came before, as was the case when humans transitioned from biological evolution to cultural development.
2. Takeoff becomes easier to navigate. These specific mechanisms of capabilities advance probably both allow for iteration and experimentation. We currently have examples of both AI capabilities advances and of online learning / exploration processes. We can run experiments on current systems to assess the alignment risks posed by both these sources of capabilities improvement.
3. Capabilities gains are less general. "capabilities generalize further than alignment" is a common refrain in discussions about the sharp left turn. Usually, this claim is justified by making an analogy to how human capabilities started to quickly generalize across many domains simultaneously.
4. Alignment probably generalizes pretty well. Speaking of alignment techniques failing, I expect alignment techniques to mostly generalize across capabilities jumps caused by either of the above mechanisms for sudden capabilities gain.
Quintin acknowledges the similarity, but says this would not result in an orders of magnitude speedup. Why not?
Ignore the evolution parallel here, and look only at the scenario offered. What happens when the AI starts contributing to AI research? If the AI suddenly became able to perform as a human-level alignment researcher or capabilities researcher, except at the speed of an AI and with many copies running in parallel, would that not speed up development by orders of magnitude? Is this not Leike’s explicit plan for Superalignment, with the hope that we could then shift enough resources into alignment to keep pace?
One could say ‘first the AI will speed up research by automating only some roles somewhat, then more roles more, so it won’t be multiple orders of magnitude at the exact same time,’ but so what? The timelines this implies do not seem so different from the timeline jumps in evolution. We would still be talking (in approximate terms throughout, no need to get pedantic) about takeoff to vast superintelligence in a matter of years at most, versus a prior human information age that lasted decades, versus industrial civilization lasting centuries, versus agricultural civilization lasting millennia, versus cultural transmission lasting tens of thousands of years, Homo sapiens hundreds of thousands, human-style primates millions, primates in general tens of millions, land animals hundreds of millions, life on Earth billions, the universe tens of billions. Presumably with a ‘slow takeoff’ period of years as AIs start to accelerate work, then a period of months when humans are mostly out of the loop, then… something else?
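For concreteness, here is the rough arithmetic behind that list, using deliberately round order-of-magnitude numbers of my own choosing (the exact figures do not matter, only the ratios): each era is dramatically shorter than the one before it, and a years-long takeoff would simply be the next step in that compression.

```python
# Rough, order-of-magnitude era durations in years, my own round
# approximations of the list above; only the successive ratios matter.
eras = [
    ("universe",                  15_000_000_000),
    ("life on Earth",              4_000_000_000),
    ("land animals",                 400_000_000),
    ("primates",                      60_000_000),
    ("human-style primates",           5_000_000),
    ("Homo sapiens",                     300_000),
    ("cultural transmission",             50_000),
    ("agricultural civilization",         10_000),
    ("industrial civilization",              250),
    ("information age",                       50),
    ("hypothetical AI takeoff",                3),
]

# Print how many times shorter each era is than the one before it.
for (name_a, years_a), (name_b, years_b) in zip(eras, eras[1:]):
    print(f"{name_a:>27} -> {name_b:<27} ~{years_a / years_b:4.0f}x shorter")
```

Every transition in this (approximate, illustrative) table compresses the relevant timescale by roughly half an order of magnitude or more, which is the sense in which a takeoff measured in years would be continuous with the pattern rather than a break from it.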
That seems like a sharp enough left turn to me.
The second distinction he mentions is that this allows more iteration and experimentation. Well, maybe. In some ways, for some period. But the whole idea of ‘we can run alignment experiments on current systems, before they are dangerously general, and that will tell us what applies in the future’ assumes the conclusion.
The third distinction claims that capabilities gains will be less general. Why? Are cultural transmission gains general in this sense, or specific gains of which enough then effectively generalized? Humans, indeed, have continuously gained new capabilities, then been bottlenecked by the lack of other capabilities, then used their new capabilities to solve the next bottleneck. I don’t see why this time is different, or why you wouldn’t see a human-style leap to generality from the dynamics Quintin is describing. I see nothing in his evolutionary arguments here that gives reason not to expect that. There are reasons for or against expecting more or less such generality, but mostly they aren’t covered here, and seem orthogonal to the discussion.
If anything, arguing that our generality came from human scaffolding and iterated design seems like an argument in favor of expecting AIs to become more general.
The fourth claim is made prior to its justification, which is in the later sections.
As a general note, these sections seem mostly to be making a general ‘alignment is easy,’ alignment-by-default claim, rather than being about what evolution offers evidence for, and I would have liked to see them presented as a distinct post, given how big, central, complex and disputed that claim is.
He starts with a claim analogous to his main claim: that humans being clearly misaligned with genetic fitness is not evidence that we should expect such alignment issues in AIs. His argument (without diving into his earlier linked post) seems to be that humans are fresh instances trained on new data, so of course we expect different alignment and different behavior.
But if you believe that, you are saying that humans are fresh versions of the system. You are throwing out from your definition of ‘the system’ all of the outer alignment and evolutionary data entirely, saying that it does not matter, that only the inner optimizer matters. In which case, yes, that does fully explain the differences. But the parallel here does not seem heartening. It is saying that the outcome is entirely dependent on the metaphorical inner optimizer, and that what the system is aligned to will depend heavily on the details of the training data it is fed, the conditions under which it is trained, the capabilities it has during that process, and so on. Then we will train new, more capable systems in new ways with new data using new techniques, in an iterated fashion. How should this make us feel better about the situation and its likely results?
Once again, there is the background presumption that things like ‘extreme misgeneralization’ need to happen for us to be in trouble. I find these attempts to sneak in a form of alignment-by-default to be extremely frustrating.
The next section seems to argue that because alignment techniques work on a variety of existing training regimes, all of similar capabilities level, we should expect alignment techniques to extend to future systems with greater capabilities. I suppose this is not zero evidence. The opposite result was possible and would have been bad news, so this must be good news. The case here still ignores the entire reason why I and others expect the techniques to fail in the future, and why evolutionary arguments would predict them to fail.
He closes by arguing that iteratively improving training data also exhibits important differences from cultural development, sufficient to ignore the evolutionary evidence as not meaningful in this context. I do not agree. Even if I did agree, I do not see how that would justify his broader optimism expressed here:
The odds of any given subset of model data catastrophically interfering with our current alignment techniques cannot be that high, otherwise our current alignment techniques wouldn't work on our current models.
Once again, the whole argument is that the current techniques will break down when capabilities advance. Saying aberrant data does not usually break alignment at current capability levels is some evidence of robustness, given that the opposite would have been evidence against it and was certainly plausible before we experimented, but again this ignores the claimed mechanisms. It also ignores that the current style of alignment does indeed seem porous and imprecise, in a way that is acceptable at current capabilities levels but that would be highly scary at sufficiently high capabilities levels. My model of how this works is that the system will indeed incorporate all the data, and will get more efficient and effective at this as capabilities advance, but this does not currently have that much practical import in many cases.
I do think there are some good points here drawing distinctions between the evolutionary and artificial cases, including details I hadn’t considered. However I also think there are a lot of statements I disagree with strongly, or that seem overstated, or that seemed to sneak in assumptions I disagree with, often about alignment and its difficulties and default outcomes.
The central claim, that evolution provides no evidence for the sharp left turn, definitely seems false to me, or at least strongly overstated. Even if I bought the individual arguments in the post fully, which I do not, that is not how evidence works. Consider the counterfactual. If we had not seen a sharp left turn in evolution, civilization had taken millions of years to develop to this point with gradual steady capability gains, and we saw humans exhibiting strong conscious optimization mostly for their genetic fitness, it would seem crazy not to change our beliefs at all about what is to come compared to what we do observe. Thus, evidence.
I would also note that Quintin in my experience often cites parallels between humans and AIs as a reason to expect good outcomes from AI due to convergent outcomes, in circumstances where it would be easy to find many similar distinctions between the two cases. Here, although I disagree with his conclusions, I agree with him that the human case provides important evidence.
In the generally strong comments to the OP, Steven Byrnes notes that current LLM systems are incapable of autonomous learning, whereas humans and AlphaZero are capable of it, and that we should expect this ability in future LLMs at some point. Constitutional AI is not mentioned, but so far it has only been useful for alignment rather than capabilities. Quintin suggests autonomous learning mostly relies upon a gap between generation and discernment, in favor of discernment being easier. I think this is an important point, while noting that what matters is the ability to usefully discern between outputs at all, rather than discernment being easier - an area where I keep trying to pin down and write up the key dynamics, and so far falling short.
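To pin down what that generation/discernment gap looks like mechanically, here is a minimal sketch of the kind of autonomous-learning loop being discussed. The generate, score and update functions are entirely hypothetical placeholders, not any real system’s API; the point is the structure: the loop only produces improvement if discernment can reliably sort useful outputs from useless ones at all.

```python
# Minimal sketch of an autonomous-learning loop driven by a gap between
# generation and discernment. `generate`, `score`, and `update` are
# hypothetical placeholders, not a real library's interface.

def autonomous_learning_loop(model, generate, score, update,
                             rounds=10, samples=64, keep_top=8):
    for _ in range(rounds):
        # Generation step: the model proposes many candidate outputs.
        candidates = [generate(model) for _ in range(samples)]
        # Discernment step: rank the model's own outputs by estimated usefulness.
        ranked = sorted(candidates, key=score, reverse=True)
        keepers = ranked[:keep_top]
        # Only train on outputs the scorer judged useful. If `score` cannot
        # separate good outputs from bad ones, this is training on noise.
        model = update(model, keepers)
    return model
```

The crux in the comment thread maps onto whether a reliable enough `score` exists for the domains that matter, not onto whether scoring is strictly easier than generating.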