I broadly agree with this take for AGI, but I want to offer some perspective on why it might be counterintuitive: when labs do safety work on current AI systems like ChatGPT, a large part of the work is writing up a giant spec that specifies how the model should behave in all sorts of situations -- when asked for bioweapon instructions, when asked for medical advice, when asked whether it is conscious, etc. As time goes on and models get more capable, this spec gets bigger and more complicated.
the obvious retort is that AGI will be different. but there are a lot of people who are skeptical of abstract arguments that things will be different in the future, and much more willing to accept arguments based on current empirical trends.
I appreciate that you're engaging with this question, but I have a hard time taking much away from this.
Quickly:
1. I think we're both probably skeptical of Joshua Achiam's approach. My hunch is that he represents a fairly niche view on these things.
2. It feels to me like there's a gigantic challenge in choosing what scale to approach this problem at. You seem to be looking at AGI and its effects from a very high level. Like: we need to run evals on massive agents exploring the real world, and if this is too complex for regular robust engineering, then we'd better throw out the idea of robust engineering. To me this seems a bit like complaining that "robust engineering won't work for vehicle safety because it won't be able to tackle the complex society-wide effects of how vehicles will be used at scale."
I think that the case for using more robust methods for AI doesn't look like straightforwardly solving all global-scale large-agent challenges, but instead achieving high reliability levels on much narrower tasks. For example: can we make sure that a single LLM pass has a 99.99% success rate at, say, resisting certain kinds of attacks?
This wouldn't solve all the challenges that would come from complex large systems, just as vehicle engineering doesn't solve all challenges with vehicles. But it might still be a big part of the solution.
You might then argue that a complex system composed of robust parts wouldn't get any benefit from that robustness, because all the 'interesting' stuff happens at the larger scales. My quick guess is that this is one area where we'd disagree; perhaps there's a crux here.
(Thanks!) To me, your comment is like: “We have a great plan for robust engineering of vehicles (as long as they stay on a dry, warm, indoor track, going under 10 kph).” OK, that’s better than nothing. But if we are eventually going to be driving cars at high speed in the cold rain, it’s inadequate. We did not test or engineer them in the right environment.
This is not a complex systems objection (e.g., it’s not about how the world changes with billions of cars). It’s a distribution shift objection. Even a single car will fail at high speed in the cold rain.
If there’s a distribution shift (test environment systematically different from deployment environment), then you need sufficiently deep understanding of the system to allow extrapolation across the distribution shift.
In the AI case, the issue is: there’s a strong (quadrillion-dollar) economic incentive to make a type of AI that can found, run, and staff innovative companies, 100% autonomously, for years on end, even when the company is doing things that nobody has thought of, and even in a world possibly very different from our own.
And then there’s a huge distribution shift between the environment in which we expect such AIs to be operating, and the environment(s) in which we can safely test those AIs.
My actual opinion is that this type of AI won’t be an LLM, since LLMs e.g. have issues with long context windows and don’t do human-like continual learning.
…But even if it were an LLM (or system that includes LLMs), I think you’re going wrong by treating “reliability” and “robustness” as synonyms, when LLMs are actually much stronger at the former than the latter.
We can make a car that’s 99.99% “reliable” on a warm dry indoor track, but after you distribution-shift into the cold rain, it might be 10% or 0% reliable. So it’s not “robust” to distribution shifts. By the same token, I’m willing to believe that LLMs can be made 99.99% “reliable” (in the sense of reliably doing a specific thing in a specific situation). But in weird situations, LLMs sometimes go off the rails; e.g. nobody has yet made an LLM that could not be jailbroken, despite years of work. They’re not very “robust”.
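To make that reliability-versus-robustness distinction concrete, here's a minimal toy sketch (a hypothetical illustration, not drawn from any real system): a fixed decision rule that passes essentially always in the environment it was tuned for, then collapses once the input distribution shifts, with no change to the rule itself.

```python
import random

random.seed(0)

# Toy "sensor check," tuned for the warm-dry-track environment,
# where readings cluster around 0.5 with small noise.
def reading(shift=0.0):
    return random.gauss(0.5 + shift, 0.1)

THRESHOLD = 1.0  # chosen so in-environment readings essentially always pass

def passes_check(x):
    return x <= THRESHOLD

# "Reliability" in the test environment: pass rate is essentially 1.0.
in_dist = [passes_check(reading()) for _ in range(10_000)]
print(sum(in_dist) / len(in_dist))    # stays near 1.0

# Distribution shift ("cold rain"): same rule, but readings drift upward.
shifted = [passes_check(reading(shift=1.0)) for _ in range(10_000)]
print(sum(shifted) / len(shifted))    # collapses toward 0.0
```

The point of the sketch: a 99.99%+ in-distribution pass rate tells you nothing about the shifted case. That gap is exactly the difference between "reliable" and "robust."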
You’re sorta implying that I’m against robustness, but in the OP I was saying the opposite. I think we desperately need robustness. I just think we’re not gonna get it without deeper understanding, because of distribution shifts.
This post is partly a belated response to Joshua Achiam, currently OpenAI’s Head of Mission Alignment:
As a proud member of the group of “otherwise very smart people” making “one of the dumbest mistakes”, I will explain why I don’t think it’s a mistake. (Indeed, since 2022, some “x-risk people” have started working towards these kinds of specs, and I think they’re the ones making a mistake and wasting their time!)
At the same time, I’ll describe what I see as the kernel of truth in Joshua’s perspective, and why it should be seen as an indictment not of “x-risk people” but rather of OpenAI itself, along with all the other groups racing to develop AGI.
1. My qualifications (such as they are)
I’m not really an expert on high-reliability engineering. But I worked from 2015-2021 as a physicist at an engineering R&D firm, where many of my coworkers were building things that really had to work in exotic environments—things like guidance systems for submarine-launched nuclear missiles, or a sensor & electronics package that needed to operate inside the sun’s corona.
To be clear, I wasn’t directly working on these kinds of “high-reliability engineering” projects. (I specialized instead in very-early-stage design and feasibility work for dozens of weird system concepts and associated algorithms.) But my coworkers were doing those projects, and over those five years I gained some familiarity with what they were doing on a day-to-day basis and how.
…So yeah, I’m not really an “expert”. But as a full-time AGI safety & alignment researcher since 2021, I’m plausibly among the “Pareto Best In The World”™ at simultaneously understanding both high-reliability engineering best practices and AGI safety & alignment. So here goes!
2. High-reliability engineering in brief
Basically, the idea is:
3. Is any of this applicable to AGI safety?
3.1 In one sense, no, obviously not
Let’s say I had a single top-human-level-intelligence AGI, and I wanted to make $250B with it. Well, hmm, Jeff Bezos used his brain to make $250B, so an obvious thing I could do is have my AGI do what Jeff Bezos did, i.e. go off and autonomously found, grow, and run an innovative company.
(If you get off the train here, then see my discussion of “Will almost all future companies eventually be founded and run by autonomous AGIs?” at this link.)
Now look at that bulleted list above, and think about how it would apply here. For example: “You understand exactly what the thing is supposed to be doing, in every situation that you care about.”
No way.
At my old engineering R&D firm, we knew exactly what such-and-such subsystem was supposed to do: it was supposed to output a measurement of Quantity X, every Y milliseconds, with no more than noise Z and drift W, so long as it remains within such-and-such environmental parameters. Likewise, a bridge designer knows exactly what a bridge is supposed to do: not fall down, nor sway and vibrate more than amplitude V under traffic load U and wind conditions T etc.
…OK, and now what exactly is our “AGI Jeff Bezos” supposed to be doing at any given time?
Nobody knows!
Indeed, the fact that nobody knows is the whole point! That’s the very reason that an AGI Jeff Bezos can create so much value!
When Human Jeff Bezos started Amazon in 1994, he was obviously not handed a detailed spec for what to do in any possible situation, where following that spec would lead to the creation of a wildly successful e-commerce / cloud computing / streaming / advertising / logistics / smart speaker / Hollywood studio / etc. business. For example, in 1994, nobody, not Jeff Bezos himself, nor anyone else on Earth, knew how to run a modern cloud computing business, because indeed the very idea of “modern cloud computing business” didn’t exist yet! That business model only came to exist when Jeff Bezos (and his employees) invented it, years later.
By the same token, on any given random future day…
…And we’re gonna write a detailed spec for that, analogous to the specs for the sensor and bridge that I mentioned above? And we’re gonna ensure that the AGI will follow this spec by design?
No way. If you believe that, then I think you are utterly failing to imagine a world with actual AGI.
3.2 In a different sense, yes, at least I sure as heck hope so eventually
When we build actual AGI, it will be like a new intelligent species on Earth, and one which will eventually be dramatically faster, more numerous, and more competent than humans. If they want to wipe out humans and run the world by themselves, they’ll be able to. (For more on AGI extinction risk in general, see the 80,000 hours intro, or my own intro.)
Now, my friends on the Parker Solar Probe project were able to run certain tests in advance—radiation tests, thermal tests, and so on—but the first time their sensor went into the actual solar corona, it had to work, with no do-overs.
By the same token, we can run certain tests on future AGIs, in a safe way. But the first time that AGIs are autonomously spreading around the world, and inventing transformative new technologies and ideas, and getting an opportunity to irreversibly entrench their power, those AGIs had better be making decisions we’re happy about, with no do-overs.
All those practices listed in §2 above exist for a reason; they're the only way we even have a chance of getting a system to work the first time in a very new situation. They are not optional nice-to-haves; rather, they are the bare minimum to make the task merely “very difficult” rather than “hopeless”. If it seems impossible to apply those techniques to AGI, per §3.1 above, then, well, we better shut up and do the impossible.
What might that look like? How do we get to a place where we have deep understanding, and where this understanding gives us a strong reason to believe that things will go well in the (out-of-distribution) scenarios of concern, and where we have a wide variety of safe tests that can be quantitatively compared with that understanding in order to surface unknown unknowns?
I don't know!
Presumably the “spec” and the tests would be more about the AGI’s motivation, or its disposition, or something, rather than about its object-level actions? Well, whatever it is, we better figure it out.
We're not there today, nor anywhere close.
(And even if we got there, then we would face the additional problem that all existing and likely future AI companies seem to have neither the capability, nor the culture, nor the time, nor generally even the desire to do the rigorous high-reliability engineering (§2) for AGI. See e.g. Six Dimensions of Operational Adequacy in AGI Projects (Yudkowsky 2017).)
4. Optional bonus section: Possible objections & responses
Possible Objection 1: Your §3.1 is misleading; we don’t need to specify what AGI Jeff Bezos needs to do to run a successful innovative business, rather we need to specify what he needs to not do, e.g. he needs to not break the law.
My Response: If you want to say that “don’t break the law” etc. counts as a spec, well, nobody knows how to do the §2 stuff (deep understanding etc.) for “don’t break the law” either.
And yes, we should tackle that problem. But I don’t see any way that a 300-page spec (as suggested by Joshua Achiam at the top) would be helpful for that. In particular:
Possible Objection 2: We only need the §2 stuff (deep understanding etc.) if there are potentially-problematic distribution shifts between test and deployment. If we can do unlimited low-stakes tests of the exact thing that we care about, then we can just do trial-and-error iteration. And we get that for free because AGI will improve gradually. Why do you expect problematic distribution shifts?
My Response: See my comments in §3.2, plus maybe my post “Sharp Left Turn” discourse: An opinionated review. Or just think: we’re gonna get to a place where there are millions of telepathically-communicating super-speed-John-von-Neumann-level AGIs around the world, getting sculpted by continual learning for the equivalent of subjective centuries, and able to coordinate, invent new technologies and ideas, and radically restructure the world if they so choose … and you really don’t think there’s any problematic distribution shift between that and your safe sandbox test environment??
So the upshot is: the gradual-versus-sudden-takeoff debate is irrelevant for my argument here. (Although for the record, I do expect superintelligence to appear more suddenly than most people do.)
Maybe an analogy is: if you’re worried that a nuclear weapon with yield Y might ignite the atmosphere, it doesn’t help to first test a nuclear weapon with yield 0.1×Y, and then if the atmosphere hasn’t been ignited yet, next try testing one with yield 0.2×Y, etc.