[AN #69] Stuart Russell's new book on why we need to replace the standard model of AI

Quote from the book on the problem of aligning black box models:

The task is, fortunately, not the following: given a machine that possesses a high degree of intelligence, work out how to control it. If that were the task, we would be toast. A machine viewed as a black box, a fait accompli, might as well have arrived from outer space. And our chances of controlling a superintelligent entity from outer space are roughly zero. Similar arguments apply to methods of creating AI systems that guarantee we won’t understand how they work; these methods include whole-brain emulation — creating souped-up electronic copies of human brains — as well as methods based on simulated evolution of programs. I won’t say more about these proposals because they are so obviously a bad idea.

This is unfortunately the only paragraph that Human Compatible (HC) devotes to the matter.

Rohin Shah (6y)

I enjoyed pages 185-190, on mathematical guarantees, especially because I've been confused about what the "provably beneficial" in CHAI's mission statement is meant to say. Some quotes:

On the other hand, if you want to prove something about the real world—for example, that AI systems designed like so won’t kill you on purpose—your axioms have to be true in the real world. If they aren’t true, you’ve proved something about an imaginary world.

On the applicability of theorems to practice:

The trick is to know how far one can stray from the real world and still obtain useful results. For example, if the rigid-beam assumption allows an engineer to calculate the forces in a structure that includes the beam, and those forces are small enough to bend a real steel beam by only a tiny amount, then the engineer can be reasonably confident that the analysis will transfer from the imaginary world to the real world.

as well as

The process of removing unrealistic assumptions continues until the engineer is fairly confident that the remaining assumptions are true enough in the real world. After that, the engineered system can be tested in the real world; but the test results are just that. They do not prove that the same system will work in other circumstances or that other instances of the system will behave the same way as the original.

It then talks about assumption failure in cryptography due to side-channel attacks.

A somewhat more concrete version of what "provably beneficial" might mean:

Let’s look at the kind of theorem we would like eventually to prove about machines that are beneficial to humans. One type might go something like this:

Suppose a machine has components $A$, $B$, $C$, connected to each other like so and to the environment like so, with internal learning algorithms $l_{A}$, $l_{B}$, $l_{C}$ that optimize internal feedback rewards $r_{A}$, $r_{B}$, $r_{C}$ defined like so, and [a few more conditions] . . . then, with very high probability, the machine’s behavior will be very close in value (for humans) to the best possible behavior realizable on any machine with the same computational and physical capabilities.

The main point here is that such a theorem should hold regardless of how smart the components become—that is, the vessel never springs a leak and the machine always remains beneficial to humans.

There are three other points worth making about this kind of theorem. First, we cannot try to prove that the machine produces optimal (or even near-optimal) behavior on our behalf, because that’s almost certainly computationally impossible. [...] Second, we say “very high probability . . . very close” because that’s typically the best that can be done with machines that learn. [...] Finally, we are a long way from being able to prove any such theorem for really intelligent machines operating in the real world!

It then goes on to discuss how such a theorem is subject to "side-channel attacks" because such theorems typically assume Cartesian duality, which is not actually true (see Embedded Agency).
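As a rough sketch only, the shape of the theorem quoted above could be written as a probabilistic near-optimality bound. The symbols here are my own assumptions and do not appear in the book: $\pi_M$ is the machine's behavior, $V_H$ is value for humans, $\Pi_C$ is the set of behaviors realizable on any machine with the same computational and physical capabilities, and $\epsilon$, $\delta$ are small positive constants standing in for "very close" and "very high probability".

$$\Pr\left[\, V_H(\pi_M) \;\ge\; \max_{\pi \in \Pi_C} V_H(\pi) - \epsilon \,\right] \;\ge\; 1 - \delta$$

Note that the comparison is against the best behavior in $\Pi_C$, not against unconstrained optimal behavior, which matches the first of the three points in the quote.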

habryka (6y)

I often don't have much to say about these newsletters, since they usually just summarize things straightforwardly or make statements that would take me a long time to engage with. But it seemed good to mention that this edition was particularly helpful to me: I've been considering whether to invest the time to read the whole book, and this made it more likely that I will, since I seem to disagree with at least a bunch of the things you summarized here.

Rohin Shah (6y)

Glad to hear it! Yeah, I do expect many people to disagree with many parts of this book. My guess is that it mostly boils down to a difference in predictions about how we build powerful AI systems.

Rohin Shah (6y)

I mentioned in my opinion that I think many of my disagreements are because of an implicit disagreement on how we build powerful AI systems:

"the book has an implied stance towards the future of AI research that I don't agree with: I could imagine that powerful AI systems end up being created by learning alone without needing the conceptual breakthroughs that Stuart outlines."

I didn't expand on this in the newsletter because I'm not clear enough on the disagreement; I try to avoid writing very confused thoughts that say wrong things about what other people believe in a publication read by a thousand people. But that's fine for a comment here!

Rather than attribute a model to Stuart, I'm just going to make up a model that was inspired by reading HC, but wasn't proposed by HC. In this model, we get a superintelligent AI system that looks like a Bayesian-like system that explicitly represents things like "beliefs", "plans", etc. Some more details:

- Things like 'hierarchical planning' are explicit algorithms. Simply looking at the algorithm can give you a lot of insight into how it does hierarchy. You can inspect things like "options" just by looking at inputs/outputs to the hierarchical planning module. The same thing applies for e.g. causal reasoning.
- Any black box deep learning system is only used to provide low-level inputs to the real 'intelligence', in the same way that for humans vision provides low-level inputs for the rest of cognition. We don't need to worry about the deep learning system "taking over", in the same way that we don't worry about our vision module "taking over".
- The AI system was created by breakthroughs in algorithms for causal reasoning, hierarchical planning, etc, that allow it to deal with the combinatorial explosion caused by the real world. As a result, it is very cheap to run (i.e. doesn't need a huge amount of compute). This is more compatible with a discontinuous takeoff, though a continuous takeoff is possible if the algorithms improved continuously over time, rather than having breakthroughs.

Some implications of this model:

- All of the "intelligence" is happening via explicit algorithms. We only need to make sure that the algorithms are aligned. So, we only have an outer alignment problem; there is no inner alignment problem.
- Since the system is mostly Bayesian, the main challenges are to avoid misspecification (solution: use something equivalent to the Solomonoff prior) and to be computationally efficient (solution: keep a small set of hypotheses, detect when they fail to explain the data, and expand to a bigger class of hypotheses). You don't have to worry about other forms of robustness like adversarial examples.
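To make the "keep a small set of hypotheses, detect when they fail, expand" loop from the last implication concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the hypotheses are coin biases, the misfit check is a crude comparison against the empirical frequency, and the names `posterior`, `misfit`, `expand`, and `fit` are made up. It is not a proposal from HC and says nothing about how a real system would do this.

```python
import math

def posterior(hypotheses, prior, data):
    """Bayes update over Bernoulli biases; data is a list of 0/1 outcomes."""
    heads = sum(data)
    tails = len(data) - heads
    # Work in log space to avoid underflow on long data sequences.
    log_w = [math.log(p) + heads * math.log(h) + tails * math.log(1 - h)
             for h, p in zip(hypotheses, prior)]
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    z = sum(w)
    return [x / z for x in w]

def misfit(hypotheses, data, tol=0.05):
    """Crude failure check (an assumption): True if no hypothesis is
    within tol of the empirical frequency of 1s."""
    freq = sum(data) / len(data)
    return all(abs(h - freq) > tol for h in hypotheses)

def expand(hypotheses):
    """Grow the class by adding midpoints between neighbouring biases."""
    hs = sorted(hypotheses)
    mids = [(a + b) / 2 for a, b in zip(hs, hs[1:])]
    return sorted(set(hs + mids))

def fit(data, hypotheses=(0.25, 0.75), max_rounds=5):
    """Expand the hypothesis set while it fails to explain the data,
    then return the set and a posterior under a uniform prior."""
    hs = list(hypotheses)
    for _ in range(max_rounds):
        if not misfit(hs, data):
            break
        hs = expand(hs)
    prior = [1 / len(hs)] * len(hs)
    return hs, posterior(hs, prior, data)

# Example: a fair-looking coin is not explained by {0.25, 0.75},
# so the set grows once to include 0.5.
hs, post = fit([1] * 5 + [0] * 5)
```

The midpoint expansion never reaches beyond the initial range, so data outside it stays unexplained after `max_rounds`; a richer expansion rule would be needed there, which is exactly the misspecification worry.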

Rohin Shah (6y)

If you're curious about how I select what goes in the newsletter: I almost put in this critical review of the book, in the spirit of presenting both sides of the argument. I didn't put it in because I couldn't understand it.

My best guess right now is that the author is arguing that "we'll never get superintelligence", possibly because intelligence isn't a coherent concept, but there's probably something more that I'm not getting. If it turned out that it was only saying "we'll never get superintelligence", and there weren't any new supporting arguments, I wouldn't include it in the newsletter, because we've seen and heard that counterargument more than enough.

TurnTrout (6y)

They also made an error in implicitly arguing that, because unaligned behavior doesn't seem intelligent to them, we have nothing to worry about from such AI: it wouldn't be "intelligent". I think leaving this out was a good choice.

Charlie Steiner (6y)

"As with the previous paper, this argument is only really a problem when the agent's belief about the reward function is wrong: if it is correct, then at the point where there is no more information to gain, the agent should already know that humans don't like to be killed, do like to be happy, etc."

There's also the scenario where the AI models the world in a way that has as good or better predictive power than our intentional stance model, but this weird model assigns undesirable values to the AI's co-player in the CIRL game. We can't rely on the agent "already knowing that humans don't like to be killed," because the AI doesn't have to be using the level of abstraction on which "human" or "killed" are natural categories.

Rohin Shah (6y)

I certainly would count an ontological failure in the reward function as an incorrect belief about the reward function.

Charlie Steiner (6y)

I'm just a little leery of calling a model "wrong" when it makes the same predictions about observations as the "right" one. I don't want people to think that we can avoid "wrong ontologies" by starting with some reasonable-sounding universal prior and then updating on lots of observational data. Or that something "wrong" will be doing something systematically stupid, probably due to some mistake or limitation that of course the reader would never program into their AI.
