anonymousaisafety

I wasn't intending for a metaphor of "biomimicry" vs "modernist".

(Claim 1) Wings can't work in space because there's no air. The lack of air is a fundamental reason for why no wing design, no matter how clever it is, will ever solve space travel. 

If TurnTrout is right, then the equivalent statement is something like (Claim 2) "reward functions can't solve alignment because alignment isn't maximizing a mathematical function."

The difference between Claim 1 and Claim 2 is that we have a proof of Claim 1, and therefore don't bother debating it anymore. With Claim 2, we only have an arbitrarily long list of examples of reward functions being gamed, exploited, or otherwise failing in spectacular ways, but no general proof yet that reward functions will never work. So we keep arguing about a Sufficiently Smart Reward Function That Definitely Won't Blow Up as if it is a thing that can be found if we try hard enough.

As of right now, I view "shard theory" sort of like a high-level discussion of chemical propulsion without the designs for a rocket or a gun. I see the novelty of it, but I don't understand how you would build a device that can use it. Until someone can propose actual designs for hardware or software that would implement "shard theory" concepts without just becoming an obfuscated reward function prone to the same failure modes as everything else, it's not incredibly useful to me. However, I think it's worth engaging with the idea, because if it is correct then other research directions might be dead ends.

Does that help explain what I was trying to do with the metaphor?

To some extent, I think it's easy to pooh-pooh finding a flapping wing design (not maximally flappy, merely way better than the best birds) when you're not proposing a specific design for building a flying machine that can go to space. Not in the tone of "how dare you not talk about specifics," but more like "I bet this chemical propulsion direction would have to look more like birds when you get down to brass tacks."

James Mickens is writing comedy. He worked in distributed systems. A "distributed system" is another way to say "a scenario in which you absolutely will have to use software to deal with your broken hardware". I can 100% guarantee that this was written with his tongue in his cheek.

The modern world is built on software that works around HW failures. 

  • Your computer may have ECC RAM (it is standard in servers, and consumer DDR5 also includes on-die ECC).
  • There are checksums along every type of data transfer (Ethernet frame check sequences, IP header checksums, UDP datagram checksums, ICMP checksums, eMMC checksums, cryptographic auth for tokens or certificates, etc).
  • An individual SSD or HDD has algorithms for detecting and working around failed blocks / sectors in HW.
  • There are fully redundant processors in safety-critical applications using techniques like active-standby, active-active, or some manner of voting for fault tolerance. 
  • In anything that involves HW sensors, there are algorithms like an extended Kalman filter for combining the sensor readings into a single consistent view of reality, and stapled to that are algorithms for determining when sensors are invalid because they've railed high, railed low, or otherwise failed in a manner that SW can detect.
  • Your phone's WiFi works because the algorithm used for the radio is constantly working around dropouts and reconnecting to new sources as needed.
  • We can read this post because it's sent using TCP and is automatically retransmitted as many times as needed until it's been ACK'd successfully.
  • We can play multiplayer video games because they implement eventually consistent protocols on top of UDP. 
  • Almost all computer applications implement some form of error handling + retry logic for pretty much anything involving I/O (file operations, network operations, user input) because sometimes things fail, and almost always, retrying the thing that failed will work.
  • Large data centers have hundreds of thousands of SSDs and they are constantly failing -- why doesn't Google fall over? Because SW + HW algorithms like RAID compensate for drives dying all of the time.
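
As a rough illustration of two of the patterns above (a checksum on every transfer, plus retry when something fails), here is a minimal Python sketch. Everything in it is a toy I made up for this comment: `frame`, `unframe`, `FlakyChannel`, and `read_with_retry` are hypothetical names, and the "hardware fault" is just a flipped bit. The only real library call is `zlib.crc32`.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(data: bytes) -> bytes:
    """Return the payload, or raise ValueError if the CRC does not match."""
    payload, crc = data[:-4], int.from_bytes(data[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return payload


class FlakyChannel:
    """Hypothetical unreliable link: corrupts the first `bad_reads` reads."""

    def __init__(self, framed: bytes, bad_reads: int):
        self.framed = framed
        self.bad_reads = bad_reads
        self.reads = 0

    def read(self) -> bytes:
        self.reads += 1
        if self.reads <= self.bad_reads:
            # Flip one bit in the first byte to simulate a hardware fault.
            return bytes([self.framed[0] ^ 0x01]) + self.framed[1:]
        return self.framed


def read_with_retry(channel: FlakyChannel, max_attempts: int = 5) -> bytes:
    """Retry until a read passes its checksum, or give up loudly."""
    for _ in range(max_attempts):
        try:
            return unframe(channel.read())
        except ValueError:
            continue  # corruption was detected; try the read again
    raise IOError("gave up after %d attempts" % max_attempts)
```

The point of the sketch is that the software never trusts a single read: a failure is either detected and retried, or it is escalated as an explicit error instead of silently passing bad data along.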

If we use one AI to oversee another AI, and something goes wrong, that’s not a recoverable error; we’re using AI assistance in the first place because we can’t notice the relevant problems without it. If two AIs debate each other in hopes of generating a good plan for a human, and something goes wrong, that’s not a recoverable error; it’s the AIs themselves which we depend on to notice problems. If we use one maybe-somewhat-aligned AI to build another, and something goes wrong, that’s not a recoverable error; if we had better ways to detect misalignment in the child we’d already have used them on the parent.

Replace "AI" with "computer" in this paragraph and it is obviously wrong because every example here is under-specified. This forum has a dearth of knowledge about anything resembling traditional systems engineering or software system safety, and it shows in this thread and in the previous thread you made about air conditioners. I commented as such here.

"If we use one computer to oversee another computer, and something goes wrong, that's not a recoverable error; we're using computer assistance in the first place because we can't notice the relevant problems without it."

Here are some examples off the top of my head where we use one computer to oversee another computer:

  1. It's common to have one computer manage a pool of workers where each worker is another computer and workers may fail. The computer doing the management is able to detect a stalled or crashed worker, power cycle the hardware, and then resubmit the work. Depending on the criticality of this process, the "manager" might actually be multiple computers that work synchronously. The programming language Erlang is designed for this exact use-case -- distributed, fault-tolerant SW applications in contexts where I/O is fallible and it's unacceptable for the program to crash.
  2. We often use one computer program to calculate some type of path or optimal plan and it's a very complicated program to understand, and then we use a 2nd computer program to validate the outputs from the 1st program -- why do we use two programs? Because the first is inscrutable and difficult to explain, but the 2nd reads like straightforward requirements in English. In other words, it is often far easier to check a solution than it is to create a solution. The mathematically inclined will recognize this as the intuition behind P vs NP: if P != NP, there are problems whose solutions are easy to verify but hard to find.
  3. It's common in safety-critical applications to have a fail-safe or backup using a significantly less complicated architecture -- e.g. you might use <complicated system> to do <some complex task>, but for a fail-safe like "power off everything", you might have a tiny microprocessor sitting in-line with the main power rail serving as a glorified switch. So normally the <complicated system> is driving the show, but if that starts to go sideways, the tiny microprocessor can shut it down.
  4. Almost all microprocessors have a "watchdog" built into them. A watchdog is a hardware timer, independent of the code running on the primary processor, that will reset the primary processor if it stops checking in and is therefore presumed non-responsive. Have you ever seen your Android phone mysteriously reboot when the UI locks up? That was a watchdog.
  5. The "watchdog" concept is even used in pure SW contexts, e.g. when the Android OS kills an application on your phone because it has frozen, that's one computer program (the OS) overseeing another (the application). Ditto for Windows & the task manager.
  6. We often use "voting" where we run the same SW on 3 or more systems and then only output to some downstream hazard if all 3 systems agree. If they don't agree, we can fail-safe or try to recover, e.g. by power-cycling whichever system was out-of-family, or both -- first try to recover, then fail-safe if that didn't work. This is done by running code in "lockstep" on synchronized inputs, very similar to how old multiplayer RTS games used to do networking. 
  7. You can buy self-checking processors that implement lockstep comparison between 2 internal cores so that whenever instructions are executed, you know that the execution occurred identically across both cores.
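
To make the "voting" item concrete, here is a minimal Python sketch of the idea, under some loud assumptions: the replicas are plain functions run one after another on the same input (a stand-in for real lockstep hardware), outputs are integers, and `FAIL_SAFE`, `vote`, and `lockstep_step` are hypothetical names invented for this example. It follows the policy described above: only drive the output when every replica agrees, and otherwise report which replicas were out-of-family so they can be recovered.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical sentinel meaning "do not drive the downstream output".
FAIL_SAFE = None


def vote(outputs: Sequence[int]) -> Tuple[Optional[int], List[int]]:
    """Compare replica outputs.

    Returns (value to output or FAIL_SAFE, indices of suspect replicas).
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count == len(outputs):
        return value, []  # unanimous: safe to drive the output
    if count > len(outputs) // 2:
        # A majority exists, so the minority is out-of-family.
        # Withhold the output and name the replicas to power-cycle.
        outliers = [i for i, out in enumerate(outputs) if out != value]
        return FAIL_SAFE, outliers
    # No majority at all: nobody can be trusted, so suspect everyone.
    return FAIL_SAFE, list(range(len(outputs)))


def lockstep_step(
    replicas: Sequence[Callable[[int], int]], x: int
) -> Tuple[Optional[int], List[int]]:
    """Feed the same input to every replica, then vote on the results."""
    return vote([replica(x) for replica in replicas])
```

A real system would add the recovery step (reset the suspect replica, then fail-safe if it still disagrees), but even this toy shows that the voter itself can be far simpler than the thing it is overseeing.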

These aren't cherry-picked. This is the bread & butter of systems safety. We build complex, safe systems by identifying failure modes and then using redundant systems to either tolerate faults or to fail-safe. By focusing on the actual system, and the actual failure modes, and by not getting stuck with our head in the clouds considering a set of "all possible hypothetical systems", it is possible to design & implement robust, reliable solutions in reality.

To claim it is not just impossible to do that, but that it is foolhardy to even try, is the exact opposite of a safety-critical mindset.

Section 1, section 10, and section 11 cover the scenario of R&D automation via AI/ML systems that drive more productive R&D automation, resulting in a positive feedback loop, without requiring the typical "self-improving agent" -- it's the R&D system (people + AI/ML products) as a whole that is self-improving, not the individual AI/ML systems.

I highly recommend reading the entire report though. It was released in 2019 and I think it was brushed aside a little bit too easily. The past 3 years have (in my mind) provided sufficient evidence of things that CAIS directly predicted would happen, e.g. all of the AI/ML systems we've developed recently that have reached super-human competency on tasks despite a lack of generalized learning or capabilities or "self-improvement" or other recognizably "general intelligence" / "agent"-like behavior.

In 2019, we did not have Copilot, or DALL-E, or DALL-E-2, or AlphaFold, or DeepMind Ithaca, or GPT-3 -- etc.

I talk about this a little bit in my comment here.