Recently, a discussion of potential AGI interventions and potential futures was posted to LessWrong. The picture Eliezer presented was broadly consistent with my existing model of Eliezer’s model of reality, and most of it was also consistent with my own model of reality.
Those two models overlap a lot, but they are different and my model of Eliezer strongly yells at anyone who thinks they shouldn’t be different that Eliezer wrote a technically not infinite but rather very large number of words explaining that you need to think for real and evaluate such things for yourself. On that, our models definitely agree.
It seemed like a useful exercise to reread the transcript of Eliezer’s discussion, and explicitly write out the world model it seems to represent, so that’s what I’m going to do here.
Here are some components of Eliezer’s model, directly extracted from the conversation, rewritten to be third person. It’s mostly in conversation order but a few things got put in logical order.
Before publishing, I consulted with Rob Bensinger, who helped refine several statements to be closer to what Eliezer actually endorses. I explicitly note the changes where they involve new info, so it’s clear what is coming from the conversation and what is coming from elsewhere. In other places it caused me to clean up my wording, which isn’t noted. It’s worth pointing out that the corrections often pointed in the ‘less doom’ direction, both in explicit claims and in tone/implication, so chances are this comes off as generally implying more doom than is appropriate.
Nate rather than Eliezer, but offered as a preface with p~0.85: AGI is probably coming within 50 years. Rob notes that Eliezer may or may not agree with this timeline, and that it shortens if you condition on ‘unaligned’ and lengthens conditional on ‘aligned.’
By default this AGI will come from something similar to some of today’s ML paradigms. Think enormous inscrutable floating-point vectors.
AGI that isn’t aligned ends the world.
AGI that isn’t aligned carefully and on purpose isn’t aligned, period.
It may be possible to align an AGI carefully and have it not end the world.
Right now we don’t know how to do it at all, but in theory we might learn.
The default situation is that an AGI system arises that can be made more powerful by adding more compute, and there’s an extended period where it’s not yet aligned and if you add too much compute the world ends, but it’s possible that if you had enough time to work on it and no one did that, you’d have a shot.
More specifically, when combined with other parts of the model detailed later: “I think we’re going to be staring down the gun of a completely inscrutable model that would kill us all if turned up further, with no idea how to read what goes on inside its head, and no way to train it on humanly scrutable and safe and humanly-labelable domains in a way that seems like it would align the superintelligent version, while standing on top of a whole bunch of papers about “small problems” that never got past “small problems”.”
If we don’t learn how to align an AGI via safety research, nothing else can save us period.
Thus, all scenarios where we win are based on a technical surprising positive development of unknown shape, and all plans worth having should assume such a surprising positive development is possible in technical space. In the post this is called a ‘miracle’ but this has misleading associations – it was not meant to imply a negligible probability, only surprise, so Rob suggested changing it to ‘surprising positive development.’ Which is less poetic and longer, but I see the problem.
Eliezer does know a lot of ways not to align an AGI, which is helpful (e.g. Edison knew a lot of ways not to build a light bulb) but also isn’t good news.
Carefully aligning an AGI would at best be slow and difficult, requiring years of work, even if we did know how.
Before you could hope to finish carefully aligning an AGI, someone else with access to the code could use that code to end the world. Rob clarifies that good info security still matters and can meaningfully buy you time, and suggests this: “By default (absent strong op-sec and research closure), you should expect that before you can finish carefully aligning an AGI, someone else with access to the code could use that code to end the world. Likewise, by default (absent research closure and a large technical edge), you should expect that other projects will independently figure out how to build AGI shortly after you do.”
There are a few players who we might expect to choose not to end the world like Deepmind or Anthropic, but only a few. There are many actors, each of whom might or might not end the world in such a spot (e.g. home hobbyists or intelligence agencies or Facebook AI research), and it only takes one of them.
Keeping the code and insights involved secret and secure over an extended period is a level of social technology no ML group is close to having. I read the text as making the stronger claim that we lack the social technology for groups of sufficient size to keep this magnitude of secret for the required length of time, even with best known practices.
Trying to convince the folks that would otherwise destroy the world that their actions would destroy the world isn’t impossible on some margins, so in theory some progress could be made, and some time could be bought, but not enough to buy enough time.
Most reactions to such problems by such folks, once their attention is drawn to them, would make things worse rather than better. Tread carefully or not at all, and trying to get the public into an uproar seems worse than useless.
Trying to convince various projects to become more closed rather than open is possible, and (as per Rob) a very good idea if you would actually succeed, but insufficient.
Trying to convince various projects to join together in the endgame, if we were to get to one, is possible, but also insufficient and (as per Rob) matters much less than becoming more closed now.
Closed and trustworthy projects are the key to potentially making technical progress in a safe and useful way. There needs to be a small group that can work on a project and that wouldn’t publish the resulting research or share its findings automatically with a broader organization, via sufficiently robust subpartitions.
Anthropic in particular doesn’t seem open to alternative research approaches and mostly wants to apply More Dakka, and doesn’t seem open to sufficiently robust subpartitions, but those could both change.
Deepmind in particular is a promising potential partner if they could form the required sufficiently robust subpartitions, even if Demis must be in the loop.
OpenAI as a concept (rather than the organization with that name), is a maximally bad concept almost designed to make the playing field as unwinnable as possible, details available elsewhere. Of course, the organization itself could change (with or without being renamed to ClosedAI).
More generally, publishing findings burns the common resource ‘time until AGI’ and the more detail you publish about your findings along {quiet internal result -> announced and demonstrated result -> paper describing how to get the announced result -> code for the result -> model for the result} the more of it you burn, but the more money and prestige the researchers get for doing that.
One thing that would be a big win would be actual social and corporate support for subpartitioned projects that didn’t publish their findings, where it didn’t cost lots of social and weirdness points for the researchers, thus allowing researchers to avoid burning the commons.
Redwood Research (RR) is a new research organization that’s going to try and do alignment experiments on toy problems to learn things, in ways people like Eliezer think are useful and valuable and that they wish someone would do. Description not directly from Eliezer but in context seems safe to assume he roughly agrees.
Previously (see Hanson/Eliezer FOOM debate) Eliezer thought you’d need recursive self-improvement first to get fast capability gain, and now it looks like you can get fast capability gain without it, for meaningful levels of fast. This makes ‘hanging out’ at interesting levels of AGI capability at least possible, since it wouldn’t automatically keep going right away.
An AGI that was above humans in all respects would doubtless FOOM anyway, but if ahead in only some it might not.
Trying to set special case logic to tell AGIs to believe false generalizations with a lot of relevance to mapping or steering the world won’t work, they’d notice and fix it.
Manipulating humans is a convergent instrumental strategy.
Hiding what you are doing is a convergent instrumental strategy.
Eliezer expects that when people are trying to stomp out convergent instrumental strategies by training at a safe dumb level of intelligence, this will not be effective at preventing convergent instrumental strategies at smart levels of intelligence.
You have to train in safe domains because if you train in unsafe domains you die, but the solutions you find in safe domains won’t work in unsafe domains.
Attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes.
Explanation of above part 1: Higher levels of intelligence involve qualitatively new thought processes and things being way out of training distribution.
Explanation of above part 2: Corrigibility is ‘anti-natural’ in a certain sense that makes it incredibly hard to, e.g., exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off.
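(An illustrative aside from me, not part of Eliezer’s model or the conversation: the shutdown version of this is easy to see with a toy expected-utility calculation. The payoff numbers below are made up purely for illustration; the point is that a plain maximizer almost never lands on ‘leave the off switch alone,’ and exact indifference is a knife edge.)

```python
# Toy sketch (mine, not from the conversation): a plain expected-utility
# maximizer facing an off switch. All payoff numbers are made up.

U_TASK = 10.0         # utility the agent assigns to finishing its task
U_SHUTDOWN = 0.0      # utility it assigns to being shut down
P_HUMANS_PRESS = 0.3  # chance the humans press the button if it's left alone

def expected_utility(action: str) -> float:
    if action == "disable_button":
        return U_TASK  # button gone, task always finishes
    if action == "leave_button_alone":
        return P_HUMANS_PRESS * U_SHUTDOWN + (1 - P_HUMANS_PRESS) * U_TASK
    if action == "press_button_itself":
        return U_SHUTDOWN
    raise ValueError(action)

actions = ["disable_button", "leave_button_alone", "press_button_itself"]
for a in actions:
    print(f"{a:22} EU = {expected_utility(a):5.2f}")
print("chosen:", max(actions, key=expected_utility))

# With U_TASK > U_SHUTDOWN the maximizer disables the button; with the
# inequality reversed it presses the button itself. Only exact equality
# leaves it indifferent, and nothing about training keeps it exactly there.
```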
Trying to hardcode nonsensical assumptions or arbitrary rules into an AGI will fail because a sufficiently advanced AGI will notice that they are damage and route around them or fix them (paraphrase).
You only get one shot, because the first miss kills you, and your chances of pulling many of these things off on the first try are basically zero, unless (Rob suggests this) you can basically ‘read off’ what the AI is thinking. Nothing like this that involves black boxes ever works the first time. Alignment is hard largely because of ‘you only get one shot.’
Nothing we can do with a safe-by-default AI like GPT-3 would be powerful enough to save the world (to ‘commit a pivotal act’), although it might be fun. In order to use an AI to save the world it needs to be powerful enough that you need to trust its alignment, which doesn’t solve your problem.
Nanosystems are definitely possible, if you doubt that read Drexler’s Nanosystems and perhaps Engines of Creation and think about physics. They’re a core thing one could and should ask an AI/AGI to build for you in order to accomplish the things you want to accomplish.
No existing suggestion for “Scalable Oversight” seems to solve any of the hard problems involved in creating trustworthy systems.
An AGI would be able to argue for/’prove’ arbitrary statements to the satisfaction of humans, including falsehoods.
Furthermore, an unaligned AGI powerful enough to commit pivotal acts should be assumed to be able to hack any human foolish enough to interact with it via a text channel.
The speedup step in “iterated amplification and distillation” will introduce places where the fast distilled outputs of slow sequences are not true to the original slow sequences, because gradient descent is not perfect and won’t be perfect and it’s not clear we’ll get any paradigm besides gradient descent for doing a step like that.
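(Another illustrative aside from me, not from the conversation: here is a deliberately tiny stand-in for the worry, with a linear model playing the role of the fast distilled system and an arbitrary nonlinear function playing the role of the slow amplified procedure. Nothing about it is the actual iterated amplification machinery; it only shows that gradient descent leaves a nonzero gap, and the training objective doesn’t know which inputs the gap matters on.)

```python
import numpy as np

# Toy stand-in (mine): "distill" a slow procedure into a fast model via
# plain gradient descent and measure how faithful the copy is.

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(512, 1))

def slow_amplified_procedure(x):
    # Pretend this is the expensive, trusted, slow computation.
    return x + 0.5 * np.sin(3 * x)

y = slow_amplified_procedure(x)

# Fast model: w * x + b, fit by gradient descent on squared error.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    err = w * x + b - y
    w -= lr * (err * x).mean()
    b -= lr * err.mean()

gap = np.abs(w * x + b - y)
print(f"mean |gap| = {gap.mean():.3f}, worst |gap| = {gap.max():.3f}")
# The worst-case gap is not zero: on some inputs the fast model simply is
# not the slow process it imitates, and nothing in the objective singles
# out the safety-relevant inputs for extra care.
```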
The safety community currently is mostly bouncing off the hard problems and are spending most of their time working on safe, easy, predictable things that guarantee they’ll be able to publish a paper at the end. Actually-useful alignment research will tend to be risky and unpredictable, since it’s advancing the frontier of our knowledge in a domain where we have very little already-accumulated knowledge.
Almost all other work is either fully useless, almost entirely predictable, or both.
Paul Christiano is trying to have real foundational ideas, and they’re all wrong, but he’s one of the few people trying to have foundational ideas at all; if we had another 10 of him, something might go right.
Chris Olah is going to get far too little done far too late but at least is trying to do things on a path to doing anything at all.
Stuart Armstrong did some good work on further formalizing the shutdown problem, an example case in point of why corrigibility is hard, which so far as I know is still resisting all attempts at solution.
Various people who work or worked for MIRI came up with some actually-useful notions here and there, like Jessica Taylor’s expected utility quantilization.
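(For concreteness, since ‘quantilization’ may be unfamiliar, here is a minimal sketch in my own words rather than anything from MIRI’s papers or code: instead of taking the argmax over a possibly mis-specified proxy utility, a q-quantilizer draws actions from some trusted base distribution and picks uniformly at random from the top q fraction by proxy utility, which limits how hard it can exploit errors in the proxy.)

```python
import random

def quantilize(base_sample, proxy_utility, q=0.1, n=1000):
    """Toy q-quantilizer (illustrative sketch): sample n actions from a
    trusted base distribution, rank them by the proxy utility, and return
    a uniformly random action from the top q fraction, not the argmax."""
    actions = sorted((base_sample() for _ in range(n)),
                     key=proxy_utility, reverse=True)
    return random.choice(actions[: max(1, int(q * n))])

# Example: the proxy says "bigger is always better", which a pure maximizer
# would exploit; the quantilizer stays near what the base policy actually does.
base = lambda: random.gauss(0, 1)
proxy = lambda a: a
print("argmax-style pick:", max(base() for _ in range(1000)))
print("quantilizer pick :", quantilize(base, proxy, q=0.1))
```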
We need much, much more rapid meaningful progress than this to have any chance, and it’s not obvious how to do that, or how to use money usefully. Money by default produces more low-quality work, and low-quality work slash solving small problems rather than the hard problems isn’t quite useless but it’s not going to get us where we need to go.
The AGI approaches that matter are the ones that scale, so they probably look less like GPT-2 and more like AlphaZero, AlphaFold 2, or in particular MuZero.
Proving theorems about the AGI doesn’t seem practical. Even if we somehow managed to get structures far more legible than giant vectors of floats, using some AI paradigm very different from the current one, it still seems like huge key pillars of the system would rely on non-fully-formal reasoning.
Zvi infers this from the text, rather than it being text directly, and it’s possible it’s due to conflating things together and wasn’t intended: A system that is mathematically understood and you can prove lots of stuff about it is not on the table at this point. Agent Foundations is a failure. Everything in that direction is a failure.
Even if you could prove what the utility function was, getting it to actually represent a human-aligned thing when it counts still seems super hard even if it doesn’t involve a giant inscrutable vector of floats, and it probably does involve that.
Eliezer agrees that it seems plausible that the good cognitive operations we want do not in principle require performing bad cognitive operations; the trouble, from his perspective, is that generalizing structures that do lots of good cognitive operations will automatically produce bad cognitive operations, especially when we dump more compute into them; “you can’t bring the coffee if you’re dead”. No known way to pull this off.
Proofs mostly miss the point. Prove whatever you like about that Tensorflow problem; it will make no difference to whether the AI kills you. The properties that can be proven just aren’t related to safety, no matter how many times you prove an error bound on the floating-point multiplications. It wasn’t floating-point error that was going to kill you in the first place.
Now to put the core of that into simpler form, excluding non-central details, in a more logical order.
Again, this is my model of Eliezer’s model, statements are not endorsed by me, I agree with many but not all of them.
Claim from Nate rather than Eliezer, unclear if Eliezer agrees: AGI is probably coming (p~85%) within 50 years.
Humanity only gets one shot at this. If we fail, we die and can’t try again.
Almost nothing ever succeeds on its first try.
We currently have no idea how to do it at all.
Current alignment methods all fail and we don’t even have good leads to solving the hard questions that matter.
AIs weak enough to be safe-by-default lack sufficient power to solve these problems.
It would take a surprising positive technical development to find a way to do alignment at all.
So all reasonable plans to align an AGI assume at least one surprising positive and technical development.
Current pace of useful safety research is much slower than needed to keep pace with capabilities research.
Even if we did get a surprising positive technical development that let us find a way to proceed, it would probably take additional years to do that rather than turn the AGI on and end the world. Rob Bensinger clarifies that Eliezer’s exact stance is instead: “An aligned advanced AI created by a responsible project that is hurrying where it can, but still being careful enough to maintain a success probability greater than 25%, will take the lesser of (50% longer, 2 years longer) than would an unaligned unlimited superintelligence produced by cutting all possible corners.”
That’s because the AGI we need to align is likely an enormous inscrutable pile of floating-point vectors. Which makes it harder.
AGI likely comes from algorithms that scale with compute, so less like GPT-X and more like MuZero.
Such algorithms have to be aligned somehow before anyone scales them up too much, since that would end the world.
Rob’s rewording: In the meantime, it’s likely that the code and/or conceptual insights would leak out, absent a large, competent effort to prevent this. No leading ML organization currently seems to be putting in the required effort. Zvi’s note: I interpreted the relevant claim here as something stronger, that humanity lacks the social technology to do more than probabilistically postpone such a leak even under best practices given the likely surrounding conditions, and that no leading organizations are even doing anything resembling or trying to resemble best practices.
If it were to leak, someone somewhere would run the code and end the world. There are people who probably would know better than to scale it up and end the world, like Deepmind and Anthropic, but it wouldn’t take long for many others to get the code, and then it only takes one someone who didn’t know better (like an intelligence agency) to end the world anyway.
Most people working on safety are working on small problems rather than hard problems, or doing work with predictable outcomes, because incentives, and are therefore mostly useless (or worse).
There are exceptions (Paul Christiano, Chris Olah, Stuart Armstrong and some MIRI-associated people) but they are exceptions and we need vastly more of them.
AI work that is shared or published accelerates AGI.
The more details are shared, the more acceleration happens.
Everyone publishes anyway, in detail, because incentives.
Incentive changes to fix this would need to be sufficiently robust that not publishing wouldn’t hurt your career prospects or cost you points, or they won’t work.
Fixing incentives on publishing, and otherwise making more things more closed, would be helpful.
Ability to do subpartitioned/siloed projects within research organizations (including Deepmind and Anthropic), that would actually stay meaningfully secret, would be helpful.
Research that improves interpretability a lot (like Chris Olah is trying to do, but with faster progress) would be very helpful. Creative new deep alignment ideas (similar to Paul Christiano’s work in depth and novelty, but not in the Paul-paradigm) would be very helpful.
Certain kinds of especially-valuable alignment experiments using present-day ML systems, like the experiments run by Redwood Research, would be helpful.
Nanotechnology is definitely physically doable and a convergent instrumental strategy, see Drexler.
Manipulating humans is a convergent instrumental strategy.
Hiding what you are doing is a convergent instrumental strategy.
Higher intelligence AGIs use qualitatively new thought processes that lie outside your training distribution.
An unaligned AGI would be able to hack any human foolish enough to read its text messages or other outputs, ‘prove’ arbitrary statements to human satisfaction, etc.
Corrigibility is ‘anti-natural’ and incredibly hard. Corrigibility solutions for less intelligent AGIs won’t transfer to higher intelligence AGIs.
‘Scalable oversight’ as proposed so far doesn’t solve any of the hard problems.
‘Iterated amplification and distillation’ based on gradient descent would be imperfect and the nice properties you’re trying to preserve would fail. Currently we have no alternate approach.
Agent Foundations and similar mathematical approaches seem to be dead ends.
‘Good’ cognitive operations grouped together and scaled automatically produce ‘bad’ cognitive operations. You can’t deliver the coffee if you’re dead, etc.
Getting a fixed utility function into an AGI at all is super hard+, getting a utility function to represent human values is super hard+, giant floating point vectors make both harder still.
The stuff you can prove doesn’t prove anything that matters, the stuff that would prove anything that matters you can’t prove.
Special case nonsensical logic or arbitrary rules will be interpreted by an AGI as damage and routed around.
Recursive self-improvement seems not to be required for fast capability gains. This makes having a powerful but not self-improving or world-ending AGI at least possible.
An AGI better at everything than humans FOOMs anyway.
Worth noting that the more precise version of #12 is substantially more optimistic than #12 as stated explicitly here.
Looking at these 42 claims, I notice my inside view mostly agrees, and would separate them into:
Inside view disagreement but seems plausible: 1
Inside view lacks sufficient knowledge to offer an opinion: 28 (I haven’t looked for myself)
Inside view isn’t sure: 8 (if we add ‘using current ideas,’ it moves to strong agreement), 13, 36
Weak inside view agreement – seems probably true not counting Eliezer’s opinion, but I wouldn’t otherwise be confident: 7, 9, 10, 22, 34, 35, 40
Strong inside view agreement: 2, 3, 4, 5, 6, 11, 12 (original version would be weak agreement, revised version is strong agreement), 14 (conditional on 13), 15, 16 (including the stronger version), 17, 18, 19 (in general, not for specific people), 20, 21, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 37, 38, 39 (unless a bunch of other claims also break first), 41, 42
Thus, I have inside view agreement (i.e. I substantively agree with this picture without taking into account anyone’s opinion) on 37 of the 42 claims, including many that I believe to be ‘non-obvious’ on first encounter.
That leaves 5 remaining claims.
For 28 (Nanotechnology) I think it’s probably true, but I notice I’m counting on others’ models of the technology that would be involved, so I want to be careful to avoid an information cascade; still, my outside view strongly agrees.
For 8 (Safe AIs lack the power to save us): it would take a surprising positive development for this to be wrong, in the sense that no currently proposed methods seem like they’d work. But I notice I instinctively remain hopeful for such a development, and for a solution to be found. I’m not sure how big a disagreement exists here; there might not be one.
That leaves 1 (85% AGI by 2070), 13 (AGI is a giant pile of floating-point vectors) and 36 (Agent Foundations and similar are dead ends) which are largely the same point of doubt.
Which is basically this: I notice my inside view, while not confident in this, continues to not expect current methods to be sufficient for AGI, and expects the final form to be more different than I understand Eliezer/MIRI to think it is going to be, and that the AGI problem (not counting alignment, where I think we largely agree on difficulty) is ‘harder’ than Eliezer/MIRI think it is.
For 36 (Agent Foundations) in particular: I notice a bunch of people in the comments saying Agent Foundations isn’t/wasn’t important, and seemed like a non-useful thing to pursue, and if anything I’m on the flip side of that and am worried it and similar things were abandoned too quickly rather than too late. It’s a case of ‘this probably will do nothing and look stupid, but might do a lot or even be the whole ballgame,’ which is hard to sustain even for a group like MIRI in such a spot, but is a great use of resources in a world where things look very bad and all solutions assume surprising (and presumably important) positive developments. Everybody go deep.
For 1 (probability of AGI) in particular: in addition to probably thinking, inside view, that AGI is harder than Eliezer/MIRI think it is, I also think civilization’s dysfunctions are more likely to disrupt things and make it increasingly difficult to do anything at all, or anything new/difficult, and that collapse or other disasters are also in play. I know Nate Soares explicitly rejects this mattering much, but it matters inside view to me quite a bit. I don’t have an inside view point estimate, but if I could somehow bet utility (betting money really, really doesn’t work here, at all) and could only bet once, I notice I’d at least buy 30% and sell 80%, or something like that.
Also, I noticed two interrelated things that I figured are worth noting from the comments:
In the comments to the OP, Eliezer’s comments about small problems versus hard problems got condensed down to ‘almost everyone working on alignment is faking it.’ I think that is not only uncharitable, it’s importantly a wrong interpretation, and motivated by viewing the situation (unconsciously?) through the lens of a battle over status and authority and blame, rather than how to collectively win from a position on a game board. The term ‘faking’ here turns a claim of ‘the approaches being taken mostly have epsilon probability of creating meaningful progress’ into a social claim about the good faith of those doing said research, which is then interpreted as a social attack, and then therefore as an argument from authority and a status claim, as opposed to pointing out that such moves don’t win the game and we need to play to win the game. I see Eliezer as highly sympathetic to how this type of work ends up dominating, and as seeing the problem as structural incentives that need to be fixed (hence my inclusion of ‘because incentives’ above) combined with genuine disagreement about the state of the game board. ‘Faking it’ is shorthand for ‘you know what you’re doing isn’t real/useful and are doing it anyway,’ and it introduces the accusation that leads to the rest of the logical sequence. And Eliezer kind of wrote a whole sequence about exactly this, which I consider so important that a quote from it is my Twitter bio.
Eliezer is being accused of making an argument from authority using authority he doesn’t deserve, or in a way that is disruptive, and I assume every time he sees that, or anyone saying anything like “Eliezer thinks X therefore it’s irrational to not think X too” or “Who are you (or am I) to disagree with the great Eliezer?”, he’s tearing his hair out that he spent years writing the definitive book about why “think for yourself, shmuck” is the way to go. I feel his pain, to a lesser degree. I had a conversation last Friday in which my post on the ports, where I explain how I want to generate a system of selective amplification in which everyone thinks for themselves in levels so as to amplify true and useful things over untrue and useless things, was interpreted (by a rather smart and careful reader!) as a request to have a norm that people amplify messages without reading them carefully or evaluating whether they seemed true, the way certain camps do for any messages with the correct tribal coloring. And again, arrrggghh, pull hair out, etc.