All of localdeity's Comments + Replies

If we see precursors to deception (e.g. non-myopia, self-awareness, etc.) but suspiciously don’t see deception itself, that’s evidence for deception.

Stated like this, it seems to run afoul of the law of conservation of expected evidence.  If you see precursors to deception, is it then the case that both (a) seeing deception is evidence for deception and (b) not seeing deception is also evidence for deception?  I don't think so.
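
For reference, a minimal statement of the law, writing H for "the AI is deceptive" and E for "we observe deception" (my notation, not the original post's):

```latex
% Conservation of expected evidence: the prior equals the expected posterior.
P(H) \;=\; P(H \mid E)\,P(E) \;+\; P(H \mid \neg E)\,P(\neg E)
% If seeing E raises the probability of H, i.e. P(H \mid E) > P(H), then not
% seeing E must lower it, P(H \mid \neg E) < P(H); E and \neg E cannot both
% count as evidence for H.
```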

The direct patch is "If you see precursors to deception, then you should expect that there is deception, and further evid... (read more)

Ben Lang

I think that you are slightly mishandling the law of conservation of expected evidence by making something non-binary into a binary question. Say, for example, that the AI could be in one of 3 states: (1) completely honest, (2) partly effective at deception, or (3) very adept at deception. If we see no signs of deception then it rules out option (2). It is easy to see how ruling out option (2) might increase our probability weight on either/both of options (1) and (3). [Interestingly enough, in the linked "law of conservation of expected evidence" there is what I think is an analogous example to do with Japanese internment camps.]
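
A toy Bayesian update makes this concrete; the prior and likelihood numbers below are invented purely for illustration, not taken from the thread:

```python
# Toy update for the three-state example. All numbers are made up.
# Observation: "we see no signs of deception."
prior = {"honest": 0.5, "clumsy_deceiver": 0.3, "adept_deceiver": 0.2}

# P(no visible deception | state): a clumsy deceiver usually gets caught,
# while an honest AI and a highly adept deceiver both show nothing.
likelihood = {"honest": 1.0, "clumsy_deceiver": 0.1, "adept_deceiver": 1.0}

unnormalized = {s: prior[s] * likelihood[s] for s in prior}
total = sum(unnormalized.values())
posterior = {s: p / total for s, p in unnormalized.items()}

print(posterior)
# {'honest': ~0.68, 'clumsy_deceiver': ~0.04, 'adept_deceiver': ~0.27}
# Both "honest" and "adept_deceiver" rise relative to their priors (0.5, 0.2);
# the observation only crushes the middle option, which is consistent with
# conservation of expected evidence once the hypothesis space is non-binary.
```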

Well, the basic idea "adding more safeguards decreases the likelihood they'll all fail simultaneously, as long as there isn't a perfect correlation of failure modes" is a simple mathematical fact.  "What is the probability of this safeguard failing to detect a rogue AI?" is hard to answer, but "What might this new safeguard do that the other safeguards don't do?" is easier.

For example.  If interpretability work gets anywhere, then one might imagine a suite of safeguards that check for parts of the developing neural net that compute things like "h... (read more)
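
The arithmetic behind the safeguard-stacking claim above is simple enough to write out; the failure probabilities below are invented purely for illustration:

```python
# Joint failure probability of stacked safeguards. All numbers are made up.
# Each value is P(this safeguard fails to catch a rogue AI).
failure_probs = [0.5, 0.4, 0.3]

# If the failure modes are independent, the chance that *all* safeguards fail
# shrinks multiplicatively as safeguards are added.
p_all_fail_independent = 1.0
for p in failure_probs:
    p_all_fail_independent *= p
print(p_all_fail_independent)  # 0.06 -- lower than any single safeguard

# At the other extreme, perfectly correlated failure modes give no benefit
# beyond the single best safeguard: whatever defeats one defeats them all.
p_all_fail_correlated = min(failure_probs)
print(p_all_fail_correlated)   # 0.3

# So the value of a new safeguard depends on what it catches that the existing
# ones don't -- the "easier" question in the comment above.
```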

I am mostly objecting to strategies which posit one AI saving us from another as the primary mechanism of alignment - for instance, most of the strategies in 11 Proposals. If we had sufficiently great interpretability, then sure, we could maybe leverage that to make a Godzilla strategy with a decent chance of working (or at least failing in detectable-in-advance ways), but with interpretability tools that good we could probably just make a plan without Godzilla that has a decent chance of working (or at least failing in detectable-in-advance ways) by doing bas... (read more)

He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROPERTY VALUES IN TOK

... (read more)

One important difference between data storage and computation or AI: courtesy of Shannon and Hamming, we have a really good understanding of information transmission (which includes information storage). All those nice error-correction codes are downstream of very well-understood theory.
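
For a concrete sense of what that well-understood theory buys: below is a minimal Hamming(7,4)-style encoder/decoder, my own illustrative sketch rather than anything from the thread, which provably corrects any single flipped bit in a 7-bit block.

```python
# Minimal Hamming(7,4) sketch: 4 data bits -> 7-bit codeword that survives
# any single bit flip. Purely illustrative; not code from the original thread.

def encode(d):
    # d = [d1, d2, d3, d4], each 0 or 1
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    # codeword positions 1..7 are: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def decode(c):
    # Each syndrome bit re-checks one parity group; together they spell out
    # the (1-based) position of a single corrupted bit, or 0 if none.
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    err = s1 + 2 * s2 + 4 * s3
    if err:
        c = c.copy()
        c[err - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
received = encode(data)
received[5] ^= 1                  # corrupt one bit "in transit"
assert decode(received) == data   # the error is detected and corrected
```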

If we had theory as solid as information theory for AI and alignment, then yeah, I'd be a hell of a lot more optimistic about using one AI to oversee another somewhere in the process. Like, imagine we had the alignment analogue of an error-detecting code which provably detec... (read more)

Well, if you restrict yourself to accepting the safe, testable advice, that may still be enough to put you enough years ahead of your competition to develop FAI before they develop AI.

My meta-point: These methods may not be foolproof, but if currently it looks like no method is foolproof—if, indeed, you currently expect a <10% chance of success (again, a number I made up from the pessimistic impression I got)—then methods with a 90% chance, a 50% chance, etc. are worthwhile, and furthermore it becomes worth doing the work to refine these methods and est... (read more)

Donald Hobson
Under the Eliezerian view (the pessimistic view that is producing <10% chances of success), these approaches are basically doomed. (See logistic success curve.) Now I can't give overwhelming evidence for this position. Wisps of evidence maybe, but not an overwhelming mountain of it. Under these sorts of assumptions, building a container for an arbitrary superintelligence (and all possible steelmannings; that's a huge space) such that it has only an 80% chance of being immediately lethal, and a 5% chance of being marginally useful, is an achievement.
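
For reference, the "logistic success curve" picture can be written in the standard logistic form; the symbols here are my own gloss, not notation from the linked post:

```latex
% Probability of success as a logistic function of "adequacy" x (effort,
% competence, resources, ...), with midpoint x_0 and steepness k:
P(\text{success}) = \frac{1}{1 + e^{-k\,(x - x_0)}}
% The pessimistic reading: if x sits well below x_0, you are on the flat left
% tail, where moderate improvements in x barely move P(success).
```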

If you have an untrustworthy general superintelligence generating English strings meant to be "reasoning/arguments/proofs/explanations" about eg a nanosystem design, then I would not only expect the superintelligence to be able to fool humans in the sense of arguing for things that were not true in a way that fooled the humans, I'd expect the superintelligence to be able to covertly directly hack the humans in ways that I wouldn't understand even after having been told what happened.

I can't deny that this is possible.  Lying is obviously possible; lyi... (read more)

Donald Hobson
Let's say you use all these filtering tricks. I have no strong intuitions about whether these are actually sufficient to stop those kinds of human manipulation attacks. (Of course, if your computer security isn't flawless, it can hack whatever computer system it's on and bypass all these filters to show the humans arbitrary images, and probably access the internet.) But maybe you can, at quite significant expense, make a Faraday cage sandbox, and then use these tricks. This is beyond what most companies will do in the name of safety. But MIRI or whoever could do it. Then they ask the superintelligence about nanosystems, and very carefully read the results. Then presumably they go and actually try to build nanosystems. Of course you didn't expect the superintelligence's advice to be correct, did you? And not wrong in an easily detectable, fail-safe way either. Your concepts and paradigms are all subtly malicious. Not clear, testable, and factually wrong statements, but nasty tricks hidden in the invisible background assumptions.