Well, the basic idea "adding more safeguards decreases the likelihood they'll all fail simultaneously, as long as there isn't a perfect correlation of failure modes" is a simple mathematical fact. "What is the probability of this safeguard failing to detect a rogue AI?" is hard to answer, but "What might this new safeguard do that the other safeguards don't do?" is easier.
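To make the arithmetic concrete (a toy sketch with made-up numbers, assuming the safeguards fail independently):

$$P(\text{all } n \text{ safeguards fail}) = \prod_{i=1}^{n} p_i \le \min_i p_i,$$

so, e.g., three safeguards that each fail half the time jointly fail only $0.5^3 = 12.5\%$ of the time. Correlated failure modes push the joint probability back up toward $\min_i p_i$, which is why the "not perfectly correlated" caveat matters.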
For example: if interpretability work gets anywhere, then one might imagine a suite of safeguards that check for parts of the developing neural net that compute things like "h...
I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment - for instance, most of the strategies in 11 Proposals. If we had sufficiently great interpretability, then sure, we could maybe leverage that to make a Godzilla strategy with a decent chance of working (or at least failing in detectable-in-advance ways), but with interpretability tools that good we could probably just make a plan without Godzilla have a decent chance of working (or at least failing in detectable-in-advance ways) by doing bas...
...He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOKYO.
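The mechanism being lampooned is easy to sketch; here is a toy version (invented error rates, not any real recovery scheme) in which software detects a disagreement between redundant executions and re-executes:

```python
import random

def flaky_add(a, b, error_rate=0.01):
    """Toy model of hardware that occasionally corrupts a result (made-up rate)."""
    result = a + b
    if random.random() < error_rate:
        result ^= 1 << random.randrange(8)  # flip a random low bit
    return result

def checked_add(a, b, max_retries=5):
    """Software-assisted recovery as described: run the op redundantly and
    keep re-executing until two runs agree."""
    for _ in range(max_retries):
        r1, r2 = flaky_add(a, b), flaky_add(a, b)
        if r1 == r2:
            return r1
    raise RuntimeError("no two executions agreed")

print(checked_add(2, 3))
```

The joke, of course, is that the checking layer runs on the same unreliable substrate it is supposed to police, which is exactly the Godzilla/Mega-Godzilla structure.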
One important difference between data storage and computation or AI: courtesy of Shannon and Hamming, we have a really good understanding of information transmission (which includes information storage). All those nice error-correction codes are downstream of very well-understood theory.
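For concreteness, here is roughly what that well-understood theory buys you, sketched as a textbook Hamming(7,4) encoder/decoder (a standard construction, not anything specific to this discussion):

```python
import numpy as np

# Hamming(7,4) in systematic form: G = [I4 | P], H = [P^T | I3].
P = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
G = np.hstack([np.eye(4, dtype=int), P])
H = np.hstack([P.T, np.eye(3, dtype=int)])

def encode(data4):
    """4 data bits -> 7-bit codeword."""
    return (np.array(data4) @ G) % 2

def decode(received7):
    """Correct any single flipped bit, then return the 4 data bits."""
    r = np.array(received7)
    syndrome = (H @ r) % 2
    if syndrome.any():  # nonzero syndrome means an error was detected
        # for a single-bit error, the syndrome equals the column of H at the error position
        err_pos = next(i for i in range(7) if np.array_equal(H[:, i], syndrome))
        r[err_pos] ^= 1  # flip it back
    return r[:4]

word = encode([1, 0, 1, 1])
word[5] ^= 1            # simulate a single-bit transmission error
print(decode(word))     # recovers [1 0 1 1]
```

The guarantees here are exactly characterized: the theory tells you which error patterns the code provably detects and corrects, and which it cannot. Nothing remotely analogous currently exists for "errors" in an AI's cognition, which is the point of the comparison.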
If we had theory as solid as information theory for AI and alignment, then yeah, I'd be a hell of a lot more optimistic about using one AI to oversee another somewhere in the process. Like, imagine we had the alignment analogue of an error-detecting code which provably detec...
Well, if you restrict yourself to accepting the safe, testable advice, that may still be enough to put you enough years ahead of your competition to develop FAI before they develop AI.
My meta-point: These methods may not be foolproof, but if currently it looks like no method is foolproof—if, indeed, you currently expect a <10% chance of success (again, a number I made up from the pessimistic impression I got)—then methods with a 90% chance, a 50% chance, etc. are worthwhile, and furthermore it becomes worth doing the work to refine these methods and est...
If you have an untrustworthy general superintelligence generating English strings meant to be "reasoning/arguments/proofs/explanations" about e.g. a nanosystem design, then I would not only expect the superintelligence to be able to fool humans in the sense of arguing for things that were not true in a way that fooled the humans, I'd expect the superintelligence to be able to covertly directly hack the humans in ways that I wouldn't understand even after having been told what happened.
I can't deny that this is possible. Lying is obviously possible; lyi...
Stated like this, it seems to run afoul of the law of conservation of expected evidence. If you see precursors to deception, is it then the case that both (a) seeing deception is evidence for deception and (b) not seeing deception is also evidence for deception? I don't think so.
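Spelling out the law being invoked (standard probability, nothing specific to this setting):

$$P(H) = P(E)\,P(H \mid E) + P(\neg E)\,P(H \mid \neg E),$$

so if observing $E$ (signs of deception) raises your probability of $H$ (deception), then observing $\neg E$ must lower it; (a) and (b) cannot both hold for the same hypothesis and the same observation.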
The direct patch is "If you see precursors to deception, then you should expect that there is deception, and further evid...