Second, we could more-or-less deal with systems which defect as they arise. For instance, during deployment we could notice that some systems are optimizing for something different from what we intended during training, and shut them down.
Each individual system won't by itself carry more power than the sum of the projects before it. Instead, each AI will only be slightly better than the ones that came before it, including any AIs we are using to monitor the newer ones.
If the sum of projects from before carries more power than the individual system, such that it can't win by defection, there's no reason for it to defect. It might just join the ranks of "projects from before", and subtly try to alter future systems to be similarly defective, waiting for a future opportunity to strike. If the way we build these things systematically renders them misaligned, we'll sooner or later end up with a majority of them being misaligned, at which point we can't trivially use them to shut down defectors.
(I agree that continuous takeoff does give us more warning, because some systems will presumably defect early, especially weaker ones. And IDA is kind of similar to this strategy, and could plausibly work. I just wanted to point out that a naive implementation of this doesn't solve the problem of treacherous turns.)
Expanding on that a little: even if we know our AIs are misaligned, that doesn't necessarily save us. We might reach a state of knowledge in which it is easy to create AIs that are (i) misaligned, (ii) superhuman, and (iii) non-singular (i.e. a single such AI is not stronger than the sum total of humanity and aligned AIs), but hard/impossible to create aligned superhuman AIs. Since misaligned AIs that can't take over still mostly follow human instructions, there will be tremendous economic incentives to deploy more such systems. This is effectively a tragedy of the commons: for every individual actor, deploying more AIs only increases global risk a little but brings in tremendous revenue. However, collectively, risk accumulates rapidly. At some point the total power of misaligned AIs crosses some (hard to predict in advance) threshold and there is a phase transition (a cascade of failures) from a human-controlled world to a coalition-of-misaligned-AI-controlled world. Alternatively, the AIs might find a way to manipulate our entire culture into gradually changing its values into something the AIs prefer (as with Murder Gandhi).
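To make the threshold dynamic concrete, here is a toy simulation of my own (not from the comment above; `N_ACTORS`, `POWER_PER_DEPLOYMENT`, and the random `THRESHOLD` are entirely made-up numbers): each actor keeps deploying misaligned-but-controllable systems because each deployment looks individually cheap, and the world flips in a single cascade once total misaligned power crosses an unknown threshold.

```python
import random

random.seed(0)

# All numbers here are arbitrary and purely illustrative.
N_ACTORS = 20                          # independent actors deploying AI systems
POWER_PER_DEPLOYMENT = 0.01            # each new system adds a little misaligned power
THRESHOLD = random.uniform(2.0, 5.0)   # hard to predict in advance

total_misaligned_power = 0.0
rounds = 0
while total_misaligned_power < THRESHOLD:
    rounds += 1
    # Every actor, acting in its own interest, deploys another system this round:
    # a tiny increase in global risk, a large private gain in revenue.
    total_misaligned_power += N_ACTORS * POWER_PER_DEPLOYMENT

print(f"World stayed human-controlled for {rounds} rounds;")
print("then the threshold was crossed and control flipped in a single cascade.")
```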
We might reach a state of knowledge in which it is easy to create AIs that are (i) misaligned, (ii) superhuman, and (iii) non-singular (i.e. a single such AI is not stronger than the sum total of humanity and aligned AIs), but hard/impossible to create aligned superhuman AIs.
My intuition is that it'd probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, misaligned superhuman AIs, and had cheap, robust methods to tell whether a particular AI was misaligned. However, it seems pretty plausible that we'll end up in a state where we know how to create non-singular, superhuman AIs; strongly suspect that most or all of them are misaligned; but don't have great methods to tell whether any particular AI is aligned or misaligned. Does that sound right to you?
My intuition is that it'd probably be pretty easy to create an aligned superhuman AI if we knew how to create non-singular, misaligned superhuman AIs, and had cheap, robust methods to tell whether a particular AI was misaligned.
This sounds different from how I model the situation; my views agree here with Nate's (emphasis added):
I would rephrase 3 as “There are many intuitively small mistakes one can make early in the design process that cause resultant systems to be extremely difficult to align with operators’ intentions.” I’d compare these mistakes to the “small” decision in the early 1970s to use null-terminated instead of length-prefixed strings in the C programming language, which continues to be a major source of software vulnerabilities decades later.
I’d also clarify that I expect any large software product to exhibit plenty of actually-trivial flaws, and that I don’t expect that AGI code needs to be literally bug-free or literally proven-safe in order to be worth running. Furthermore, if an AGI design has an actually-serious flaw, the likeliest consequence that I expect is not catastrophe; it’s just that the system doesn’t work. Another likely consequence is that the system is misaligned, but in an obvious way that makes it easy for developers to recognize that deployment is a very bad idea. The end goal is to prevent global catastrophes, but if a safety-conscious AGI team asked how we’d expect their project to fail, the two likeliest scenarios we’d point to are "your team runs into a capabilities roadblock and can't achieve AGI" or "your team runs into an alignment roadblock and can easily tell that the system is currently misaligned, but can’t figure out how to achieve alignment in any reasonable amount of time."
My current model of 'the default outcome if the first project to develop AGI is highly safety-conscious, is focusing on alignment, and has a multi-year lead over less safety-conscious competitors' is that the project still fails, because their systems keep failing their tests but they don't know how to fix the deep underlying problems (and may need to toss out years of work and start from scratch in order to have a real chance at fixing them). Then they either (a) lose their lead, and some other project destroys the world; (b) decide they have to ignore some of their tests, and move ahead anyway; or (c) continue applying local patches without understanding or fixing the underlying generator of the test failures, until they or their system find a loophole in the tests and sneak by.
I don't think any of this is inevitable or impossible to avoid; it's just the default way I currently visualize things going wrong for AGI developers with a strong interest in safety and alignment.
Possibly you'd want to rule out (c) with your stipulation that the tests are "robust"? But I'm not sure you can get tests that robust. Even in the best-case scenario where developers are in a great position to build aligned AGI and successfully do so, I'm not imagining post-hoc tests that are robust to a superintelligence trying to game them. I'm imagining that the developers have a prior confidence from their knowledge of how the system works that every part of the system either lacks the optimization power to game any relevant tests, or will definitely not apply any optimization to trying to game them.
Possibly you'd want to rule out (c) with your stipulation that the tests are "robust"? But I'm not sure you can get tests that robust.
That sounds right. I was thinking about an infinitely robust misalignment-oracle to clarify my thinking, but I agree that we'll need to be very careful with any real-world tests.
If I imagine writing code and using the misalignment-oracle on it, I think I mostly agree with Nate's point. If we have the code and compute to train a superhuman version of GPT-2, and the oracle tells us that any agent coming out of that training process is likely to be misaligned, we haven't learned much new, and it's not clear how to design a safe agent from there.
I imagine a misalignment-oracle to be more useful if we use it during the training process, though. Concretely, it seems like a misalignment-oracle would be extremely useful for achieving inner alignment in IDA: as soon as the AI becomes misaligned, we can either rewind the training process and figure out what we did wrong, or directly use the oracle as a training signal that severely punishes any step that makes the agent misaligned. Coupled with the ability to iterate on designs, since we won't accidentally blow up the world on the way, I'd guess that something like this is more likely to work than not. This idea is extremely sensitive to (c), though.
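A minimal sketch of what that control flow might look like, assuming we had such an oracle (everything here is my own illustration: `oracle_is_misaligned` is a stand-in predicate and "training" is just a random walk over one parameter, not IDA itself):

```python
import copy
import random

random.seed(0)

def oracle_is_misaligned(params):
    # Stand-in for the hypothetical misalignment-oracle: here "misaligned"
    # just means the parameter drifted past an arbitrary boundary.
    return params["w"] > 1.0

def train_with_oracle(steps=1000, use_penalty=False):
    params = {"w": 0.0}
    last_good = copy.deepcopy(params)            # checkpoint of the last aligned state
    rewinds = penalties = 0
    for _ in range(steps):
        params["w"] += random.gauss(0.01, 0.1)   # one "training" update
        if oracle_is_misaligned(params):
            if use_penalty:
                # Option B: fold the oracle's verdict into the training signal,
                # severely punishing the step that introduced misalignment.
                penalties += 1
                params["w"] -= 0.5
            else:
                # Option A: rewind to the last known-good checkpoint and
                # (in reality) investigate what went wrong before continuing.
                rewinds += 1
                params = copy.deepcopy(last_good)
        else:
            last_good = copy.deepcopy(params)
    return params, rewinds, penalties

final, rewinds, penalties = train_with_oracle()
print(f"final w = {final['w']:.2f}, rewinds = {rewinds}, penalties = {penalties}")
```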
Has the "alignment roadblock" scenario been argued for anywhere?
Like Lanrian, I think it sounds implausible. My intuition is that understanding human values is a hard problem, but taking over the world is a harder problem. For example, the AI which can talk its way out of a box probably has a very deep understanding of humans--a deeper understanding than most humans have of humans! In order to have such a deep understanding, it must have lower-level building blocks for making sense of the world which work extremely well, and could be used for a value learning system.
BTW, coincidentally, I quoted this same passage in a post I wrote recently which discussed this scenario (among others). Is there a particular subscenario of this I outlined which seems especially plausible to you?
My intuition is that understanding human values is a hard problem, but taking over the world is a harder problem.
Especially because taking over the world requires you to be much better than other agents who want to stop you from taking over the world, which could very well include other AIs.
ETA: That said, upon reflection, there have been instances of people taking over large parts of the world without being superhuman. All world leaders qualify, and it isn't that unusual. However, what would be unusual is for someone to succeed in taking over the world when everyone else didn't want that to happen.
It might just join the ranks of "projects from before", and subtly try to alter future systems to be similarly defective, waiting for a future opportunity to strike.
Admittedly, I did not explain this point well enough. What I meant to say was that before we have the first successful defection, we'll have some failed defections. If the system could indefinitely hide its own private intentions to later defect, then I would already consider that to be a 'successful defection.'
Knowing about a failed defection, we'll learn from our mistake and patch the problem in future systems. To be clear, I'm definitely not endorsing this as a normative standard for safety.
I agree with the rest of your comment.
Yet, at no point during this development did any project leap forward by a huge margin. Instead, every paper built upon the last one, making minor improvements and increasing the compute involved. Since these minor improvements nonetheless happened rapidly, GANs developed quickly relative to the timescale of human lifetimes.
Does anyone have time series data on the effectiveness of Go-playing AI? Does that similarly follow a gradual trend?
AlphaGo seems much closer to "one project leaps forward by a huge margin." But maybe I'm mistaken about how big an improvement AlphaGo was over previous Go AIs.
Miles Brundage argues that "it’s an impressive achievement, but considering it in this larger context should cause us to at least slightly decrease our assessment of its size/suddenness/significance in isolation".
In the wake of AlphaGo’s victory against Fan Hui, much was made of the purported suddenness of this victory relative to expected computer Go progress. In particular, people at DeepMind and elsewhere have made comments to the effect that experts didn’t think this would happen for another decade or more. One person who said such a thing is Remi Coulom, designer of CrazyStone, in a piece in Wired magazine. However, I’m aware of no rigorous effort to elicit expert opinion on the future of computer Go, and it was hardly unanimous that this milestone was that long off. I and others, well before AlphaGo’s victory was announced, said on Twitter and elsewhere that Coulom’s pessimism wasn’t justified. Alex Champandard noted that at a gathering of game AI experts a year or so ago, it was generally agreed that Go AI progress could be accelerated by a concerted effort by Google or others. At AAAI last year [2015], I also asked Michael Bowling, who knows a thing or two about game AI milestones (having developed the AI that essentially solved limit heads-up Texas Hold Em), how long it would take before superhuman Go AI existed, and he gave it a maximum of five years. So, again, this victory being sudden was not unanimously agreed upon, and claims that it was long off are arguably based on cherry-picked and unscientific expert polls. [...]
Hiroshi Yamashita extrapolated the trend of computer Go progress as of 2011 into the future and predicted a crossover point to superhuman Go in 4 years, which was one year off. In recent years, there was a slowdown in the trend (based on highest KGS rank achieved) that probably would have led Yamashita or others to adjust their calculations if they had redone them, say, a year ago, but in the weeks leading up to AlphaGo’s victory, again, there was another burst of rapid computer Go progress. I haven’t done a close look at what such forecasts would have looked like at various points in time, but I doubt they would have suggested 10 years or more to a crossover point, especially taking into account developments in the last year. Perhaps AlphaGo’s victory was a few years ahead of schedule based on reported performance, but it should always have been possible to anticipate some improvement beyond the (small team/data/hardware-based) trend based on significant new effort, data, and hardware being thrown at the problem. Whether AlphaGo deviated from the appropriately-adjusted trend isn’t obvious, especially since there isn’t really much effort going into rigorously modeling such trends today. Until that changes and there are regular forecasts made of possible ranges of future progress in different domains given different effort/data/hardware levels, “breakthroughs” may seem more surprising than they really should be.
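For what it's worth, the kind of extrapolation described above can be sketched in a few lines: fit a line to a strength metric over time and solve for the crossover year. The numbers below are invented for illustration; they are not the actual KGS ranks or Elo ratings of historical Go programs.

```python
# Hypothetical strength ratings for computer Go programs, one point per year.
years    = [2007, 2008, 2009, 2010, 2011]
strength = [2.0, 3.0, 4.0, 5.0, 6.0]      # made-up "dan-equivalent" ratings
HUMAN_PRO_LEVEL = 11.0                    # made-up crossover target

# Ordinary least-squares fit of strength against year.
n = len(years)
mean_x, mean_y = sum(years) / n, sum(strength) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, strength)) \
        / sum((x - mean_x) ** 2 for x in years)
intercept = mean_y - slope * mean_x

# Solve slope * year + intercept = HUMAN_PRO_LEVEL for the crossover year.
crossover_year = (HUMAN_PRO_LEVEL - intercept) / slope
print(f"A trend of ~{slope:.1f} ranks/year implies crossover around {crossover_year:.0f}")
```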
There has been considerable debate over whether development in AI will experience a discontinuity, or whether it will follow a more continuous growth curve. Given the lack of consensus and the confusing, diverse terminology, it is natural to hypothesize that much of the debate is due to simple misunderstandings. Here, I seek to dissolve some misconceptions about the continuous perspective, based mostly on how I have seen people misinterpret it in my own experience.
First, we need to know what I even mean by continuous takeoff. When I say it, I mean a scenario where the development of competent, powerful AI follows a trajectory that is roughly in line with what we would have expected by extrapolating from past progress. That is, there is no point at which a single project lunges forward in development and creates an AI that is much more competent than any other project before it. This leads to the first clarification,
Continuous doesn't necessarily mean slow
The position I am calling "continuous" has been called a number of different names over the years. Many refer to it as "slow" or "soft." I think continuous is preferable to these terms because it focuses attention on the strategically relevant part of the question. It seems to matter less what the actual clock time is from AGI to superintelligence, and more whether there will be single projects that break previous technological trends and gain capabilities that are highly unusual relative to the past.
Moreover, there are examples of rapid technological developments that I consider to be continuous. As an example, consider GANs. In 2014, GANs were used to generate low quality black-and-white photos of human faces. By late 2018, they were used to create nearly-photorealistic images of human faces.
Yet, at no point during this development did any project leap forward by a huge margin. Instead, every paper built upon the last one, making minor improvements and increasing the compute involved. Since these minor improvements nonetheless happened rapidly, GANs developed quickly relative to the timescale of human lifetimes.
Extrapolating from this progress, we can assume that GAN video generation will follow a similar trajectory, starting with simple low-resolution clips and gradually transitioning to the creation of HD videos. What would be unusual is if someone, right now in late 2019, produced HD videos using GANs.
Large power differentials can still happen in a continuous takeoff
Power differentials between nations, communities, and people are not unusual in the course of history. Therefore, the existence of a deep power differential caused by AI would not automatically imply that a discontinuity has occurred.
In a continuous takeoff, a single nation or corporation might still pull ahead in AI development by a big margin and use this to their strategic advantage. To see how, consider how technology in the industrial revolution was used by western European nations to conquer much of the world.
Nations rich enough to manufacture rifles maintained a large strategic advantage over those unable to. Despite this, the rifle did not experience any surprising developments which catapulted it to extreme usefulness, as far as I can tell. Instead, rifles became gradually more accurate, with each decade producing slightly better models.
See also: Soft takeoff can still lead to decisive strategic advantage
Continuous takeoff doesn't require believing that ems will come first
This misconception seems to mostly be a historical remnant of the Hanson-Yudkowsky AI-Foom debate. In the old days, there weren't many people actively criticizing foom. So, if you disagreed with foom, it was probably because you were sympathetic to Hanson's views.
There are now many people who disagree with foom who don't take Hanson's side. Paul Christiano and AI Impacts appear somewhat at the forefront of this new view.
Recursive self-improvement is compatible with continuous takeoff
In my experience, recursive self-improvement is one of the main reasons cited for why we should expect a discontinuity. The validity of this argument is far from simple, but needless to say: folks who subscribe to continuous takeoff aren't simply ignoring it.
Consider I.J. Good's initial elaboration of recursive self-improvement: the idea that an ultraintelligent machine could design even better machines, setting off an "intelligence explosion" that would leave human intelligence far behind.
The obvious interpretation from the continuous perspective is that by the time we have an ultraintelligent machine, we'll already have a not-quite-ultraintelligent machine. Therefore, the advantage that an ultraintelligent machine will have over the collective of humanity + machines will be modest.
It is sometimes argued that even if this advantage is modest, the growth curves will be exponential, and therefore a slight advantage right now will compound to become a large advantage over a long enough period of time. However, this argument by itself is not an argument against a continuous takeoff.
Exponential growth curves are common; macroeconomic growth is one example, and the argument applies equally to any system which experiences a positive feedback loop. Furthermore, large strategic advantages do not automatically constitute a discontinuity, since they can still happen even if no project surges forward suddenly.
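A small numeric illustration of this point (the growth rates and starting edge below are arbitrary): if the leading project and everyone else grow exponentially at the same rate, the leader's relative advantage stays constant; only a persistently higher growth rate compounds a modest edge into a large one, and even then the curve stays continuous, with no single discontinuous jump.

```python
leader0, rest0 = 1.05, 1.00            # the leader starts with a modest 5% edge
shared_rate, faster_rate = 1.10, 1.12  # hypothetical annual growth factors

for year in (0, 10, 20, 30):
    ratio_same = (leader0 * shared_rate ** year) / (rest0 * shared_rate ** year)
    ratio_fast = (leader0 * faster_rate ** year) / (rest0 * shared_rate ** year)
    print(f"year {year:2d}: same growth rate -> ratio {ratio_same:.2f}, "
          f"persistently faster growth -> ratio {ratio_fast:.2f}")
```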
Continuous takeoff is relevant to AI alignment
The misconception here is something along the lines of, "Well, we might not be able to agree about AI takeoff, but at least we can agree that AI safety is extremely valuable in either case." Unfortunately, the usefulness of many approaches to AI alignment appears to hinge quite a bit on whether takeoff is continuous.
Consider the question of whether an AGI would defect during testing. The argument goes that an AI will have an instrumental reason to pretend to be aligned while weak, and then execute a treacherous turn once it is safe from modification. If this phenomenon ever occurs, there are two distinct approaches we can take to minimize potential harm.
First, we could apply extreme caution and try to ensure that no system will ever lie about its intentions. Second, we could more-or-less deal with systems which defect as they arise. For instance, during deployment we could notice that some systems are optimizing for something different from what we intended during training, and shut them down.
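As a toy illustration of the second approach (my own sketch, not part of the original argument, comparing a single made-up behaviour statistic): monitor each deployed system's apparent objective against what was observed in training, and shut it down when the divergence exceeds a tolerance. A real monitoring regime would of course be far harder than this.

```python
import random

random.seed(1)

TRAINING_BASELINE = 1.0   # the objective the system appeared to pursue in training
TOLERANCE = 0.2           # how much divergence we accept before shutting it down

def observed_objective(drift):
    # Stand-in for measuring "what the deployed system seems to be optimizing".
    return TRAINING_BASELINE + drift + random.gauss(0, 0.02)

# Hypothetical deployed systems and how far their objectives have drifted.
deployed = {"system_a": 0.01, "system_b": 0.35}

for name, drift in deployed.items():
    divergence = abs(observed_objective(drift) - TRAINING_BASELINE)
    decision = "shut down" if divergence > TOLERANCE else "keep running"
    print(f"{name}: divergence = {divergence:.2f} -> {decision}")
```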
The first approach is preferred if you think that there will be a rapid capability gain relative to the rest of civilization. If we deploy an AI and it suddenly catapults to exceptional competence, then we don't really have a choice other than to get its values right the first time.
On the other hand, under a continuous takeoff, the second approach seems more promising. Each individual system won't by itself carry more power than the sum of the projects before it. Instead, each AI will only be slightly better than the ones that came before it, including any AIs we are using to monitor the newer ones. Therefore, to the extent that the second approach carries a risk, it will probably look less like a sudden world domination and more like a bad product rollout, in line with, say, the release of Windows Vista.
Now, obviously there are important differences between current technological products and future AGIs. Still, the general strategy of "dealing with things as they come up" is much more viable under continuous takeoff. Therefore, if a continuous takeoff is more likely, we should focus our attention on questions which fundamentally can't be solved as they come up. This is a departure from the way that many have framed AI alignment in the past.