Clutching a bottle of whiskey in one hand and a shotgun in the other, John scoured the research literature for ideas... He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO. It’s better to stop scaling your transistors and avoid playing with monsters in the first place, instead of devising an elaborate series of monster checks-and-balances and then hoping that the monsters don’t do what monsters are always going to do because if they didn’t do those things, they’d be called dandelions or puppy hugs.

- James Mickens, The Slow Winter


There are a lot of AI alignment strategies which can reasonably be described as “ask Godzilla to prevent Mega-Godzilla from terrorizing Japan”. Use one AI to oversee another AI. Have two AIs debate each other. Use one maybe-somewhat-aligned AI to help design another. Etc.

Alignment researchers discuss various failure modes of asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. Maybe one of the two ends up much more powerful than the other. Maybe the two make an acausal agreement. Maybe the Nash Equilibrium between Godzilla and Mega-Godzilla just isn’t very good for humans in the first place. Etc. These failure modes are useful for guiding technical research.

… but I worry that talking about the known failure modes misleads people about the strategic viability of Godzilla strategies. It makes people think (whether consciously/intentionally or not) “well, if we could handle these particular failure modes, maybe asking Godzilla to prevent Mega-Godzilla from terrorizing Japan would work”.

What I like about the Godzilla analogy is that it gives a strategic intuition which much better matches the real world. When someone claims that their elaborate clever scheme will allow us to safely summon Godzilla in order to fight Mega-Godzilla, the intuitively-obviously-correct response is “THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO”.

“But look!” says the clever researcher, “My clever scheme handles problems X, Y and Z!”

Response:

Oops

“Ok, but what if we had a really good implementation?” asks the clever researcher.

Response:

RAAARRRRRRR!

“Oh come on!” says the clever researcher, “You’re not even taking this seriously! At least say something about how it would fail.”

Don’t worry, we’re going to get to that. But before we do: let’s imagine you’re the Mayor of Tokyo evaluating a proposal to ask Godzilla to fight Mega-Godzilla. Your clever researchers have given you a whole lengthy explanation about how their elaborate and clever safeguards will ensure that this plan does not destroy Tokyo. You are unable to think of any potential problems which they did not address. Should you conclude that asking Godzilla to fight Mega-Godzilla will not result in Tokyo’s destruction?

No. Obviously not. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO. You may not be able to articulate why the answer is obviously “no”, but asking Godzilla to fight Mega-Godzilla will still obviously destroy Tokyo, and your intuitions are right about that even if you are unable to articulate clever arguments.

With that said, let’s talk about why those intuitions are right and why the Godzilla analogy works well.

Brittle Plans and Unknown Unknowns

The basic problem with Godzilla plans is that they’re brittle. The moment anything goes wrong, the plan shatters, and then you’ve got somewhere between one and two giant monsters rampaging around downtown.

And of course, it is a fundamental Law of the universe that nothing ever goes exactly according to plan. Especially when trying to pit two giant monsters against each other. This is the sort of situation where there will definitely be unknown unknowns.

Unknown unknowns + brittle plan = definitely not rising property values in Tokyo.

Do we know what specifically will go wrong? No. Will something go wrong? Very confident yes. And brittleness means that whatever goes wrong, goes very wrong. Errors are not recoverable, when asking Godzilla to fight Mega-Godzilla.

If we use one AI to oversee another AI, and something goes wrong, that’s not a recoverable error; we’re using AI assistance in the first place because we can’t notice the relevant problems without it. If two AIs debate each other in hopes of generating a good plan for a human, and something goes wrong, that’s not a recoverable error; it’s the AIs themselves which we depend on to notice problems. If we use one maybe-somewhat-aligned AI to build another, and something goes wrong, that’s not a recoverable error; if we had better ways to detect misalignment in the child we’d already have used them on the parent.

The real world will always throw some unexpected problems at our plans. When asking Godzilla to fight Mega-Godzilla, those problems are not recoverable. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.

Meta note: I expect this post to have a lively comment section! Before you leave the twentieth comment saying that maybe Godzilla fighting Mega-Godzilla is better than Mega-Godzilla rampaging unchallenged, maybe check whether somebody else has already written that one, so I don't need to write the same response twenty times. (But definitely do leave that comment if you're the first one, I intentionally kept this essay short on the assumption that lots of discussion would be in the comments.)


He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.

I happen to work for a company whose software uses checksums at many layers, and RAID encoding and low-density parity-check (LDPC) codes at the lowest layers, to detect and recover from hardware failures. It works pretty well, and the company has sold billions of dollars of products of which that is a key component. Also, many (most?) enterprise servers use RAM with error-correcting codes; I think the common configuration allows it to correct single-bit errors and detect double-bit errors, and my company's machines will reset themselves when they detect double-bit errors and other problems that impugn the integrity of their runtime state.

One could quibble about whether "retrieving and querying the data that was written" counts as a "computation", and the extent to which the recovery is achieved through software as opposed to hardware[1], but the source material is a James Mickens comedic rant in any case.

I'd say the important point here is: There is a science to error correction, to building a (more) perfect machine out of imperfect parts, where the solution to unreliable hardware is more of the unreliable hardware, linked up in a clever scheme.  They're good enough at it that each successive generation of data storage technology uses hardware with higher error rates.  You can make statements like "If failures are uncorrelated, and failures happen every X time units on average per component, and it takes Y time units to detect and recover a failure, and we can recover from up to M failures out of every group of N components, then on average we will have an unrecoverable failure every Z time units"; then you can (a) think about how to arrange it so that Z >> X and (b) think about the dangers of correlated failures.
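The X / Y / M / N / Z statement above can be turned into a back-of-the-envelope calculation. The sketch below is one possible approximation, not a formula taken from this comment: it assumes independent, exponentially distributed failures, treats each first failure as opening a recovery window of length Y, and counts the group as lost if at least M of the other N-1 components also fail inside that window. The function names are made up for illustration.

```python
import math

def window_failure_prob(x, y):
    """Chance that one component (mean time between failures x) fails within a window y."""
    return 1.0 - math.exp(-y / x)

def mean_time_to_unrecoverable(x, y, m, n):
    """Rough estimate of Z for a group of n components tolerating up to m failures.

    Assumption: failures are independent. A first failure happens at rate n / x;
    the group is lost if at least m of the remaining n - 1 components fail
    before recovery finishes (so more than m are down at once).
    """
    if not 0 <= m < n:
        raise ValueError("need 0 <= m < n")
    p = window_failure_prob(x, y)
    p_loss = sum(
        math.comb(n - 1, k) * p**k * (1.0 - p) ** (n - 1 - k)
        for k in range(m, n)
    )
    first_failure_rate = n / x
    return 1.0 / (first_failure_rate * p_loss)

# Hypothetical numbers: 10 components, each failing about once per 1000 hours,
# with a 1-hour recovery window.
z_no_redundancy = mean_time_to_unrecoverable(1000.0, 1.0, 0, 10)
z_one_spare = mean_time_to_unrecoverable(1000.0, 1.0, 1, 10)
z_two_spares = mean_time_to_unrecoverable(1000.0, 1.0, 2, 10)
```

With no tolerated failures the estimate collapses to X / N, and each extra tolerated failure multiplies Z by a large factor as long as Y is much smaller than X. The whole calculation leans on the independence assumption, which is exactly where correlated failures would break it.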

(The valid complaint that Mickens's character makes is that it would suck if every application needed to weave error correction into every codepath, implement its own RAID, etc.  It works much better if the error correction is done by some underlying layer that the application treats as an abstraction—using the abstraction tends to be more complex than pretending errors don't exist (and for noncritical applications the latter is a valid strategy), but not terrible.[2])

With regard to AI.  It seems likely that we'll end up making use of potentially-dangerous AIs to do things.  If we do, then we'd want powerful safeguards.  It seems unlikely that we'd have 100% confidence in any particular safeguard (i.e. unlikely we'd have formal proofs of the safety properties we'd want).  Then we'd want to keep adding more safeguards as long as their failure modes weren't totally covered by those of preexisting safeguards (and as long as they're affordable); ideally we'd try to estimate (ideally measure) the chance of failure of each safeguard and how independent they are.

Now, would some of these safeguards be built with the aid of earlier versions of AI?  I'd guess yes.  This could take a few forms, ranging from "formally-verified direct programs that the earlier AI generated along with their proofs of safety and effectiveness" to "smaller neural nets (cut-down versions of earlier AIs?) that monitor the training of the new AI's net" to "weaker full-blown AIs watching the decisions output by the new full-blown AI".  I suspect any of these might be useful.

The Godzilla complaint seems to be about that last thing, having weaker AIs monitor the new AI.  Well, um, it seems to me that, if the weaker AIs' output is restricted to the one-bit signal "do we hit the killswitch on the new AI" (and that signal is executed by a machine), then there's no danger of Godzilla destroying Tokyo after it has taken down Mega-Godzilla.  (It's like cutting off Godzilla's limbs and tying him to Mega-Godzilla with his jaws open around Mega-Godzilla's neck, so his only available action is biting down.)  There is a danger, if this happens a lot, of the subsequent training of the new AI merely teaching it to evade detection by the weaker predecessors; this danger should be taken seriously; but it is also less bad than having Mega-Godzilla kill us the first time it started going rogue; the safeguard is net positive.

Addressing specific claims:

If we use one AI to oversee another AI, and something goes wrong, that’s not a recoverable error; we’re using AI assistance in the first place because we can’t notice the relevant problems without it.

If, among our many precautions, we have AI 1 oversee AI 2 (in this scheme I describe where AI 1's only available action is to output a bit that shuts off AI 2)[3], and AI 2 starts to go rogue, and AI 1 fails in its task, then we're no worse off than if we lacked that particular precaution—either the other precautions save us or they don't.  If AI 1 succeeds, then we have at least some chance of being able to debug what went wrong with AI 2, and save ourselves in the future.  Using AI 1 as overseer is strictly beneficial, minus implementation costs.  (There's a second-order risk of someone overestimating how much it protects us and then choosing to be more reckless, but that's a different kind of thing.)

If we use one maybe-somewhat-aligned AI to build another, and something goes wrong, that’s not a recoverable error

If "goes wrong" means "it kills us all", then yeah.  If "goes wrong" means "it starts acting dangerous but we manage to shut it off", that is potentially recoverable.

if we had better ways to detect misalignment in the child we’d already have used them on the parent.

The parent could be aligned but incompetent (heh, I think many people know parents like that), while the child is unaligned and more competent.  (Why would the programmers create the child if it weren't expected to be more competent than the parent?)  It seems easy to imagine alignment checks that the parent would pass while the child would fail.

  1. ^

    The checksums are done by the CPU but hardware-accelerated; I think the RAID stuff is clearly software; the LDPC stuff is done by more specialized hardware; the ECC RAM is hardware, but the choice of how to respond to double bit errors is software.

  2. ^

    ECC RAM is one example: the application can just assume that memory accesses are correct, and that double-bit errors will cause the operating system to kill the application—so the application itself needs zero lines of code about RAM errors, though now there does need to be monitoring to restart the process and/or fail over to another machine.

    Another example is using a durable database, and going from "just increment and decrement the variables in memory" to "create a database transaction, to be attempted / rolled back as a unit".  That does require code changes in the application itself.

  3. ^

    In practice we'd probably want it to output something more like a number between 0 and 1 representing its worry level, and likely the killswitch goes off if either a single overseer exceeds some danger threshold or multiple overseers exceed some lower threshold.
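The threshold rule in this footnote can be written down in a few lines. This is a minimal sketch of the rule as described, with hypothetical parameter names and default thresholds chosen only for illustration; it says nothing about how the worry levels themselves would be produced.

```python
def killswitch_should_fire(worry_levels, single_threshold=0.9,
                           group_threshold=0.6, group_size=2):
    """Decide whether to fire the killswitch from overseer worry levels in [0, 1].

    Fires if any single overseer exceeds single_threshold, or if at least
    group_size overseers exceed the lower group_threshold.
    All thresholds here are illustrative assumptions.
    """
    if any(level > single_threshold for level in worry_levels):
        return True
    worried = sum(1 for level in worry_levels if level > group_threshold)
    return worried >= group_size

# One very worried overseer is enough; so are two moderately worried ones.
fires_on_single = killswitch_should_fire([0.95, 0.1, 0.2])
fires_on_group = killswitch_should_fire([0.7, 0.65, 0.1])
stays_quiet = killswitch_should_fire([0.7, 0.1, 0.1])
```

Because the output is a single boolean executed by a machine, the overseers have no other channel through which to act, which is the property the scheme depends on.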

One important difference between data storage vs computation or AI: courtesy of Shannon and Hamming, we have a really good understanding of information transmission (which includes information storage). All those nice error-correction codes are downstream of very well-understood theory.

If we had theory as solid as information theory for AI and alignment, then yeah, I'd be a hell of a lot more optimistic about using one AI to oversee another somewhere in the process. Like, imagine we had the alignment analogue of an error-detecting code which provably detects two-bit errors and corrects one-bit errors with only a logarithmic amount of overhead. With theory that strong (and battle-tested in reality) it becomes plausible that unknown unknowns won't inevitably ruin all our plans.

Well, the basic idea "adding more safeguards decreases the likelihood they'll all fail simultaneously, as long as there isn't a perfect correlation of failure modes" is a simple mathematical fact.  "What is the probability of this safeguard failing to detect a rogue AI?" is hard to answer, but "What might this new safeguard do that the other safeguards don't do?" is easier.

For example.  If interpretability work gets anywhere, then one might imagine a suite of safeguards that check for parts of the developing neural net that compute things like "how to detect security holes in C or machine code" or "how quickly humans die to certain poisons" (when that's not supposed to be the goal); safeguards that check for parts of the net that have many nodes and are not understandable by the other safeguards; safeguards that inspect the usage of CPU or other resources and have some idea of what's usual; safeguards that try to look for the net thinking strategically about what resource usage looks natural; and so on.  These safeguards might all suck / only work in a small fraction of cases, but if you have hundreds or thousands of them, then your odds might get decent.

Or, at least, you might push the correlated failure mode into "the AI has guessed that we have a bunch of safeguards, has imagined the set of safeguards people at our level of competence are likely to have come up with, and is strategically avoiding them", in which case that strategic thinking is a thing you try to detect.  (It would be nice if we ended up being able to prove that the strategic thinking to defeat all our "dangerous thinking detectors" required at least N neurons, and we were able to prune down all nets to the point where less than N not-understood neurons existed.)  And—as stated, each of these safeguards might be implemented with the assistance of prior AIs in various ways.
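The arithmetic behind "hundreds of weak safeguards" and the correlated failure mode can be sketched as follows. This is a toy model under loudly stated assumptions: each safeguard only ever detects or misses (it never makes things worse), misses are independent except for one common-cause event that defeats every safeguard at once, and all numbers are invented.

```python
import math

def p_all_miss_independent(miss_probs):
    """Probability that every safeguard misses, assuming fully independent misses."""
    return math.prod(miss_probs)

def p_all_miss_common_cause(miss_probs, p_common):
    """Same, but with probability p_common a shared cause defeats all safeguards."""
    return p_common + (1.0 - p_common) * math.prod(miss_probs)

# Hypothetical: 200 safeguards, each of which misses 97% of the time.
weak_safeguards = [0.97] * 200
independent_case = p_all_miss_independent(weak_safeguards)
correlated_case = p_all_miss_common_cause(weak_safeguards, p_common=0.3)
```

In the independent case the chance that everything misses becomes small even though each safeguard is individually poor. In the correlated case the result can never drop below `p_common`, no matter how many safeguards are stacked, so the size of the shared failure mode matters more than the count.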

What exactly are you objecting to?  Your original post seems to be saying "the entire category of strategies that involves one AI saving us from another is dangerous, doomed, and should be dismissed as ridiculous", but you don't seem to be pushing back on my claims that you can cut off Godzilla's limbs and tie his jaws around Mega-Godzilla's neck and that the result is net positive.  Were you arguing against much more reckless proposals, or people being overconfident in unspecified proposals, or...?  The possibility occurs to me that the OP is intentionally overstated in the hopes of provoking a well-thought-out proposal (it did say it anticipated much discussion), along the lines of Cunningham's Law.

I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment - for instance, most of the strategies in 11 Proposals. If we had sufficiently great interpretability, then sure, we could maybe leverage that to make a Godzilla strategy with a decent chance of working (or at least failing in detectable-in-advance ways), but with interpretability tools that good we could probably just give a plan without Godzilla a decent chance of working (or at least failing in detectable-in-advance ways) by doing basically the same things minus Godzilla. It's the interpretability tools which take that plan from "close to zero chance of working" to "close to 100% chance of working"; the interpretability is where all the robustness comes from. The Godzilla part adds relatively little and is plausibly net negative (due to making the ML components more complex and brittle).

(Another minor point: "adding more safeguards decreases the likelihood they'll all fail simultaneously, as long as there isn't a perfect correlation of failure modes" is only true when the "safeguards" are guaranteed to not increase the chance of failure.)

And—as stated, each of these safeguards might be implemented with the assistance of prior AIs in various ways.

I see two main things you could have in mind here.

First, maybe you imagine training a magic black box to tell us what's going on inside another magic black box. This is a Godzilla strategy, and fails for the usual Godzilla reasons: errors are not recoverable. If the magic interpretability box fails, we don't have a built-in way to notice. And no, training lots of magic black boxes to detect lots of things does not really fix the problem; the failure modes are extremely highly correlated. We don't even need to suppose deception - just a distribution shift in the cognition of the system will cause highly correlated failures, and a distribution shift in cognition is exactly the sort of thing we'd expect from a system just starting to grok consequentialist reasoning.

On the other hand, maybe you imagine AIs serving as research assistants, rather than using AIs directly to interpret other AIs. That plan does have problems, but is basically not a Godzilla plan; the human in the loop means that the standard Godzilla brittleness issue doesn't really apply.

Individual humans do make out much better when they get to select between products from competing companies rather than monopolies, benefitting from companies going out of their way to demonstrate when their products are verifiably better than rivals'. Humans get treated better by sociopathic powerful politicians and parties when those politicians face the threat of election rivals (e.g. no famines). Small states get treated better when multiple superpowers compete for their allegiance. Competitive science with occasional refutations of false claims produces much more truth for science consumers than intellectual monopolies. Multiple sources with secret information are more reliable than one.

It's just routine for weaker, less sophisticated parties to do better in both assessment of choices and realized outcomes when multiple better-informed or more powerful parties compete for their approval vs just one monopoly/cartel.

Also, a flaw in your analogy is that schemes that use AIs as checks and balances on each other don't mean more AIs. The choice is not between monster A and monsters A plus B, but between two copies of monster A (or a double-size monster A), and a split of one A and one B, where we hold something of value that we can use to help throw the contest to either A or B (or successors further evolved to win such contests). In the latter case there's no more total monster capacity, but there's greater hope of our influence being worthwhile and selecting the more helpful winner (which we can iterate some number of times).

So, the analogy here is that there's hundreds (or more) of Godzillas all running around, doing whatever it is Godzillas want to do. Humanity helps out whatever Godzillas humanity likes best, which in turn creates an incentive for the Godzillas to make humanity like them.

THIS DOES NOT BODE WELL FOR TOKYO'S REAL ESTATE MARKET.

Still within the analogy: part of the literary point of Godzilla is that humanity's efforts to fight it are mostly pretty ineffective. In inter-Godzilla fights, humanity is like an annoying fly buzzing around. The humans just aren't all that strategically relevant. Sure, humanity's assistance might add some tiny marginal advantage, but from a Godzilla's standpoint that advantage is unlikely to be enough to balance the tactical/strategic disadvantages of trying not to step on people.

... and that all seems like it should carry over directly to AI, once AI gets to-or-somewhat-past human level, and definitely by the time we get to strongly superhuman intelligence. Even with just human level, the scaling/coordination/learning advantages of being able to cheaply copy a mind are probably enough for the AIs to reasonably-quickly achieve strategic dominance by enough margin that humanity's preferences are not particularly relevant. (Assuming that the AI isn't prohibitively expensive to run - but that seems pretty likely to be true under most plausible paths. For instance, if human-level AI is produced by anything like today's ML, then training costs will dominate and the systems will be relatively cheap to run or fine-tune.)

(There's also some alignment-specific problems with this scheme which the Godzilla analogy doesn't highlight. I'm not going into them here because this post is specifically about the Godzilla issues. But I don't want to give people the impression that this plan would be fine in a world where humanity has sufficient bargaining power; the lack of bargaining power is only one failure mode.)

I was going to make a comment to the effect that humans are already a species of Godzilla (humans aren't safe, human morality is scary, yada yada), only to find you making the same analogy, but with an optimistic slant. :)

James Mickens is writing comedy. He worked in distributed systems. A "distributed system" is another way to say "a scenario in which you absolutely will have to use software to deal with your broken hardware". I can 100% guarantee that this was written with his tongue in his cheek.

The modern world is built on software that works around HW failures. 

  • Most servers you rely on (and possibly your own computer) have ECC RAM.
  • There are checksums along every type of data transfer (Ethernet frame check sequences, IP header checksums, UDP datagram checksums, ICMP checksums, eMMC checksums, cryptographic auth for tokens or certificates, etc).
  • An individual SSD or HDD has algorithms for detecting and working around failed blocks / sectors in HW.
  • There are fully redundant processors in safety-critical applications using techniques like active-standby, active-active, or some manner of voting for fault tolerance. 
  • In anything that involves HW sensors, there's algorithms like an extended Kalman filter for combining the sensor readings to a single consistent view of reality, and stapled to that are algorithms for determining when sensors are invalid because they've railed high, railed low, or otherwise failed in a manner that SW can detect.
  • Your phone's WiFi works because the algorithm used for the radio is constantly working around dropouts and reconnecting to new sources as needed.
  • We can read this post because it's sent using TCP and is automatically retransmitted as many times as needed until it's been ACK'd successfully.
  • We can play multiplayer video games because they implement eventually consistent protocols on top of UDP. 
  • Almost all computer applications implement some form of error handling + retry logic for pretty much anything involving I/O (file operations, network operations, user input) because sometimes things fail, and almost always, retrying the thing that failed will work.
  • Large data centers have hundreds of thousands of SSDs and they are constantly failing -- why doesn't Google fall over? Because SW + HW algorithms like RAID compensate for drives dying all of the time.

If we use one AI to oversee another AI, and something goes wrong, that’s not a recoverable error; we’re using AI assistance in the first place because we can’t notice the relevant problems without it. If two AIs debate each other in hopes of generating a good plan for a human, and something goes wrong, that’s not a recoverable error; it’s the AIs themselves which we depend on to notice problems. If we use one maybe-somewhat-aligned AI to build another, and something goes wrong, that’s not a recoverable error; if we had better ways to detect misalignment in the child we’d already have used them on the parent.

Replace "AI" with "computer" in this paragraph and it is obviously wrong because every example here is under-specified. There is a dearth of knowledge on this forum of anything resembling traditional systems engineering or software system safety and it shows in this thread and in the previous thread you made about air conditioners. I commented as such here.

"If we use one computer to oversee another computer, and something goes wrong, that's not a recoverable error; we're using computer assistance in the first place because we can't notice the relevant problems without it."

Here are some examples off the top of my head where we use one computer to oversee another computer:

  1. It's common to have one computer manage a pool of workers where each worker is another computer and workers may fail. The computer doing the management is able to detect a stalled or crashed worker, power cycle the hardware, and then resubmit the work. Depending on the criticality of this process, the "manager" might actually be multiple computers that work synchronously. The programming language Erlang is designed for this exact use-case -- distributed, fault-tolerant SW applications in contexts where I/O is fallible and it's unacceptable for the program to crash.
  2. We often use one computer program to calculate some type of path or optimal plan, and it's a very complicated program to understand, so we use a 2nd computer program to validate the outputs from the 1st program. Why two programs? Because the first is inscrutable and difficult to explain, but the 2nd reads like straightforward requirements in English. In other words, it is often far easier to check a solution than it is to create one. The mathematically inclined will recognize this as the intuition behind the P vs NP question: if P != NP, there are problems whose solutions are easy to verify but hard to find.
  3. It's common in safety-critical applications to have a fail-safe or backup using a significantly less complicated architecture -- e.g. you might use <complicated system> to do <some complex task>, but for a fail-safe like "power off everything", you might have a tiny microprocessor sitting in-line with the main power rail serving as a glorified switch. So normally the <complicated system> is driving the show, but if that starts to go sideways, the tiny microprocessor can shut it down.
  4. Almost all microprocessors have a "watchdog" built into them. A watchdog is typically a separate hardware timer (sometimes a secondary processor) that will reset the primary processor if the primary becomes non-responsive. Have you ever seen your Android phone mysteriously reboot when the UI locks up? That was a watchdog.
  5. The "watchdog" concept is even used in pure SW contexts, e.g. when the Android OS kills an application on your phone because it has frozen, that's one computer program (the OS) overseeing another (the application). Ditto for Windows & the task manager.
  6. We often use "voting" where we run the same SW on 3 or more systems and then only output to some downstream hazard if all 3 systems agree. If they don't agree, we can fail-safe or try to recover, e.g. by power-cycling whichever system was out-of-family, or both -- first try to recover, then fail-safe if that didn't work. This is done by running code in "lockstep" on synchronized inputs, very similar to how old multiplayer RTS games used to do networking. 
  7. You can buy self-checking processors that implement lockstep comparison between 2 internal cores so that whenever instructions are executed, you know that the execution occurred identically across both cores.
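The voting scheme from item 6 in the list above can be sketched in software. This is an illustrative toy, not how any particular safety-critical system implements it: the function name and return convention are assumptions, and real lockstep voters compare hardware-synchronized outputs rather than Python values.

```python
from collections import Counter

def lockstep_vote(outputs):
    """Compare outputs from redundant systems fed the same inputs.

    Returns (value, suspects). The value is only released when every system
    agrees; otherwise it is None (fail-safe) and suspects lists the indices
    of out-of-family systems, or every index if there is no majority.
    """
    if not outputs:
        raise ValueError("need at least one output")
    value, count = Counter(outputs).most_common(1)[0]
    if count == len(outputs):
        return value, []
    if count > len(outputs) / 2:
        suspects = [i for i, out in enumerate(outputs) if out != value]
        return None, suspects
    return None, list(range(len(outputs)))

all_agree = lockstep_vote([42, 42, 42])
one_outlier = lockstep_vote([42, 17, 42])
no_majority = lockstep_vote([1, 2, 3])
```

In the one-outlier case the caller could power-cycle system 1 and try again, then fail safe if the disagreement persists. Note that the voter catches systems that disagree; it cannot catch three systems that agree on the same wrong answer.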

These aren't cherry-picked. This is the bread & butter of systems safety. We build complex, safe systems by identifying failure modes and then using redundant systems to either tolerate faults or to fail-safe. By focusing on the actual system, and the actual failure modes, and by not getting stuck with our head in the clouds considering a set of "all possible hypothetical systems", it is possible to design & implement robust, reliable solutions in reality.
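The "checking is easier than solving" pattern from item 2 in the list above is worth seeing concretely. The sketch below validates a path produced by some unspecified planner; the graph representation (a dict from node to a set of neighbours) and the function name are assumptions made for the example.

```python
def is_valid_path(graph, path, start, goal):
    """Check a planner's output against plain requirements.

    The path must begin at start, end at goal, and only follow edges that
    exist in graph. How the planner found the path does not matter.
    """
    if not path or path[0] != start or path[-1] != goal:
        return False
    return all(b in graph.get(a, ()) for a, b in zip(path, path[1:]))

# Hypothetical road network.
roads = {"A": {"B"}, "B": {"C"}, "C": set()}
good_plan = is_valid_path(roads, ["A", "B", "C"], "A", "C")
bad_plan = is_valid_path(roads, ["A", "C"], "A", "C")
```

The checker is a handful of lines that read like the requirements, regardless of how complicated the planner is. It verifies validity only; confirming that a plan is optimal would need a different and usually harder check.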

To claim it is not just impossible to do that, but that it is foolhardy to even try, is the exact opposite of a safety-critical mindset.

I agree that the SW/HW analogy is not a good analogy for AGI safety (I think security is actually a better analogy), but I would like to present a defence of the idea that normal systems reliability engineering is not enough for alignment (this is not necessarily a defence of any of the analogies/claims in the OP).

Systems safety engineering leans heavily on the idea that failures happen randomly and (mostly) independently, so that enough failures happening together by coincidence to break the guarantees of the system is rare. That is:

  • RAID is based on the assumption that hard drive failures happen mostly independently, so that the probability of too many drives failing at once is sufficiently low. Even in practice this assumption becomes a problem because a) drives purchased in the same batch will have correlated failures and b) rebuilding an array puts strain on the remaining drives, and people have to plan around this by adding more margin of error.
  • Checksums and ECC are robust against the occasional bitflip. This is because occasional bitflips are mostly random, and getting bitflips that just happen to set the checksum correctly is very rare. Checksums are not robust against someone coming in and maliciously changing your data in transit; you need signatures for that. Even time-correlated runs of flips can create a problem for naive schemes and burn through the margin of error faster than you'd otherwise expect.
  • Voting between multiple systems assumes that the systems are all honest and just occasionally suffer transient hardware failures. Clean-room reimplementations try to eliminate the correlations due to bugs, but they still don't protect against correlated bad behaviour across all of the systems due to issues with your spec.
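The checksum bullet above marks the line between random and adversarial failures, and it can be demonstrated with the Python standard library. One substitution to note: the comment says signatures are needed, and this sketch uses an HMAC (a keyed MAC) as a stand-in, since the standard library has no public-key signatures. The messages and the key are invented for the example.

```python
import hashlib
import hmac
import zlib

def crc_tag(data: bytes) -> int:
    """Unkeyed checksum: anyone can compute it."""
    return zlib.crc32(data)

def mac_tag(key: bytes, data: bytes) -> bytes:
    """Keyed tag: computing it requires the secret key."""
    return hmac.new(key, data, hashlib.sha256).digest()

message = b"transfer 10 units to alice"

# Random failure: a single flipped bit changes the CRC, so it is detected.
flipped = bytes([message[0] ^ 0x01]) + message[1:]
crc_detects_flip = crc_tag(flipped) != crc_tag(message)

# Adversarial failure: the attacker rewrites the message AND recomputes the
# CRC, so the receiver's check passes on forged data.
forged = b"transfer 10 units to mallory"
forged_crc = crc_tag(forged)
crc_accepts_forgery = crc_tag(forged) == forged_crc

# With a keyed tag, the attacker cannot produce a matching tag without the
# key, so the forged message fails verification against the original tag.
key = b"hypothetical shared secret"
original_mac = mac_tag(key, message)
mac_accepts_forgery = hmac.compare_digest(mac_tag(key, forged), original_mac)
```

The CRC is good at what it was designed for and useless against an opponent who can recompute it. Defending against an intelligent adversary needed a different mechanism with a different assumption (a secret the adversary lacks), not more of the same checksum.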

My point here is that once your failures stop being random and independent, you leave the realm of safety engineering and enter the realm of security (and security against extremely powerful actors is really really hard). I argue that AGI alignment is much more like the latter, because we don't expect AGIs to fail in random ways, but rather we expect them to intelligently steer the world into directions we don't want. AGI-induced failure looks like events that should have been impossible, according to the multiplied-out probabilities, somehow happening anyway.

In particular, relying on independent AGIs not being correlated with each other is an extremely dangerous assumption: AGIs can coordinate even without communication, alignment is a very narrow target that's hard to hit, and a parliament of misaligned AGIs is definitely not going to end well for us.

Ok, but why isn't it better to have Godzilla fighting Mega-Godzilla instead of leaving Mega-Godzilla unchallenged?

Because Tokyo still gets destroyed.

Important thing to bear in mind here: the relevant point for comparison is not the fantasy-world where the Godzilla-vs-Mega-Godzilla fight happens exactly the way the clever elaborate scheme imagined. The relevant point for comparison is the realistic-world where something went wrong, and the elaborate clever scheme fell apart, and now there's monsters rampaging around anyway.

I initially liked this post a lot, then saw a lot of pushback in the comments, mostly of the (very valid!) form of "we actually build reliable things out of unreliable things, particularly with computers, all the time". I think this is a fair criticism of the post (and choice of examples/metaphors therein), but I think it may be missing (one of) the core message(s) the post is trying to deliver.

I wanna give an interpretation/steelman of what I think John is trying to convey here (which I don't know whether he would endorse or not):

"There are important assumptions that need to be made for the usual kind of systems security design to work (e.g. uncorrelated failures). Some of these assumptions will (likely) not apply with AGI. Therefore, extrapolating this kind of thinking to this domain is Bad™️." ("Epistemological vigilance is critical")

So maybe rather than saying "trying to build robust things out of brittle things is a bad idea", it's more like "we can build robust things out of certain brittle things, e.g. computers, but Godzilla is not a computer, and so you should only extrapolate from computers to Godzilla if you're really, really sure you know what you're doing."
 

The non-straw versions of Godzilla Strategies do not start from the Godzilla fighting Mega-Godzilla. Starting from this side is doomed.

It starts with, let's say, a Tokyo policeman. Notably, the Tokyo policeman isn't a scary monster - but roughly a normal human, with whom you can get some sort of mutual understanding. The next step is to create a policeman[1], who also isn't a scary monster, but is just a bit more powerful, trained policeman (maybe using a bunch of policeman[0]). If the relation "gen[n+1] is doing what gen[n] wants" holds at every step, the idea is that you get to super-Tokyo-police, who is still doing what you want. Or you get somewhere midway, where the still-aligned policeman[p] tells you "sorry, the next gen would really be a Godzilla, and I don't know how to avoid it".

(This isn't to express opinions on the viability of the first step, or the amplification procedure.)

Alright, so, let's imagine a chain of 100... creatures... on a smooth spectrum from policeman to Godzilla, and each is trying to keep the next creature up the chain in check. And then the mayor attempts to direct Godzilla via the policeman at one end of this chain.

THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.

It's like someone took the Godzilla vs Mega-Godzilla plan, and said "this Godzilla-fights-Mega-Godzilla plan is WAY too simple and robust, what we need is a hundred levels of recursion to make ABSOLUTELY SURE that something goes wrong!".

Imagine more chains, often interlinked.

Some chain links will break. Which is the point - single-link failures are survivable. Also, for sure there are some corrupt police officers in Tokyo, but they aren't such a big deal.
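
One hedged way to put numbers on this disagreement is a toy model, sketched below. All of it is assumption: 100 stages, each link holding with probability 0.99, 3 redundant links per stage for the "interlinked" case, and an optional 1% common-mode failure per stage that takes out every link at that stage at once. The function `chain_holds` is hypothetical.

```python
def chain_holds(n_stages: int, p_link: float, redundancy: int = 1,
                p_common: float = 0.0) -> float:
    """Probability that every stage of an oversight chain holds (toy model).

    A stage holds if at least one of its `redundancy` links holds and no
    common-mode failure (probability p_common) knocks out all of them together.
    Link failures are otherwise assumed independent.
    """
    at_least_one_link = 1 - (1 - p_link) ** redundancy
    stage_ok = (1 - p_common) * at_least_one_link
    return stage_ok ** n_stages


# Illustrative, made-up parameters.
single = chain_holds(100, 0.99)                                 # one chain
interlinked = chain_holds(100, 0.99, redundancy=3)              # independent backups
correlated = chain_holds(100, 0.99, redundancy=3, p_common=0.01)

print(f"single chain:                 {single:.3f}")
print(f"interlinked, independent:     {interlinked:.4f}")
print(f"interlinked, common-mode 1%:  {correlated:.3f}")
```

In this toy model, interlinking rescues the chain only while link failures are independent. A small common-mode failure rate puts it back near the single-chain figure. Which regime applies to AGI is exactly what the thread disagrees about, and the sketch does not settle it.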

Downvoted: this is very far from a well-structured argument, and it doesn't give me intuitions I can trust either.

I didn't downvote but didn't upvote and generally wish I had an actual argument to link to when discussing this concept.

You seem to believe that any plan involving what you call "godzilla strategies" is brittle. This is something I am not confident in. Someone may find some strategy that can be shown to not be brittle.

What I would actually claim is roughly:

  • Godzilla plans are brittle by default
  • In order for the plan to become not-brittle, some part of it other than the use-Godzilla-to-fight-Mega-Godzilla part has to "do the hard part" of alignment

You could probably bolt a Godzilla-vs-Mega-Godzilla mechanism onto a plan which already solved the hard parts of alignment via some other strategy, and end up with a viable plan.