The upside of this, or of "more is different", is that we don't necessarily even need the property in the parts, or a detailed understanding of the parts. And how the composition works / what survives renormalization / ... is almost the whole problem.
(see also this shortform, which makes a rudimentary version of the arguments in the first two subsections)
(Mostly just stating my understanding of your take back at you to see if I correctly got what you're saying:)
I agree this argument is obviously true in the limit, with the transistor case as an existence proof. I think things get weird at the in-between scales. The smaller the network of aligned components, the more likely it is to be aligned (obviously, in the limit if you have only one aligned thing, the entire system of that one thing is aligned); and also the more modular each component is (or I guess you would say the better the interfaces between the components), the more likely it is to be aligned. And in particular if the interfaces are good and have few weird interactions, then you can probably have a pretty big network of components without it implementing something egregiously misaligned (like actually secretly plotting to kill everyone).
And people who are optimistic about HCH-like things generally believe that language is a good interface; conditional on that, it makes sense to think that trees of humans would not implement egregiously misaligned cognition. You're less optimistic about this, so your research agenda is trying to pin down the general theory of Where Good Interfaces/Abstractions Come From, or something else more deconfusion-y along those lines.
Does this seem about right?
Good description.
Also, I had never actually floated the hypothesis that "people who are optimistic about HCH-like things generally believe that language is a good interface"; natural language seems like such an obviously leaky and lossy API that it had never occurred to me that other people might think it's a good idea.
Interpretability
I have a decent understanding of how transistors work, at least for purposes of basic digital circuitry. Apply high voltage to the gate, and current can flow between source and drain. Apply low voltage, and current doesn’t flow. (... And sometimes reverse that depending on which type of transistor we’re using.)
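As a rough illustration of that digital abstraction, here's a toy sketch in python (idealized voltage-controlled switches only, ignoring all the analog behavior; the function names are just illustrative, not from anywhere in particular):

```python
# Toy sketch: an idealized NMOS/PMOS transistor as a voltage-controlled switch,
# plus a CMOS NAND gate built from them. This is the digital abstraction only.

def nmos(gate: bool) -> bool:
    """NMOS: conducts between source and drain when the gate voltage is high."""
    return gate

def pmos(gate: bool) -> bool:
    """PMOS: conducts when the gate voltage is low (the 'reverse that' case)."""
    return not gate

def cmos_nand(a: bool, b: bool) -> bool:
    """CMOS NAND: output is pulled high unless both pull-down transistors conduct."""
    pull_down = nmos(a) and nmos(b)   # series NMOS pair to ground
    pull_up = pmos(a) or pmos(b)      # parallel PMOS pair to Vdd
    assert pull_up != pull_down       # exactly one network conducts at a time
    return pull_up

for a in (False, True):
    for b in (False, True):
        print(a, b, cmos_nand(a, b))  # prints the NAND truth table
```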
I also understand how to wire transistors together into a processor and memory. I understand how to write machine and assembly code to run on that processor, and how to write a compiler for a higher-level language like python. And I understand how to code up, train, and run a neural network from scratch in python.
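For concreteness, here's one illustrative version of that last step: a tiny network trained from scratch with numpy. It's just a sketch under toy assumptions (XOR data, squared error, plain gradient descent); the point is that every line is individually understandable.

```python
# Sketch: a minimal neural network "from scratch" in numpy, trained on XOR.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # hidden layer
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    # backward pass: gradients of squared error, written out by hand
    dp = (p - y) * p * (1 - p)
    dW2, db2 = h.T @ dp, dp.sum(0)
    dh = dp @ W2.T * (1 - h**2)
    dW1, db1 = X.T @ dh, dh.sum(0)
    # gradient descent update
    for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        param -= 0.1 * grad

print(np.round(p, 2))  # should approach [[0], [1], [1], [0]]; the trained
                       # weights themselves still tell us little about "what
                       # the network is doing"
```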
In short, I understand all the pieces from which a neural network is built at a low level, and I understand how all those pieces connect together. And yet, I do not really understand what’s going on inside of trained neural networks.
This shows that interpretability is not composable: if I take a bunch of things which I know how to interpret, and wire them together in a way I understand, I do not necessarily know how to interpret the composite system. Composing interpretable pieces does not necessarily yield an interpretable system.
Tools
The same applies to “tools”, in the sense of “tool AI”. Transistors and wires are very tool-ish: I understand what they do, they’re definitely not optimizing the broader world or trying to trick me or modelling me at all or trying to self-preserve or acting agenty in general. They’re just simple electronic tools.
And yet, assuming agenty AI is possible at all, it will be possible to assemble those tools into something agenty.
So, like interpretability, tool-ness is not composable: if I take a bunch of non-agenty tools, and wire them together in a way I understand, the composite system is not necessarily a non-agenty tool. Composing non-agenty tools does not necessarily yield a non-agenty tool.
Alignment/Corrigibility
What if I take a bunch of aligned and/or corrigible agents, and “wire them together” into a multi-agent organization? Is the resulting organization aligned/corrigible?
Actually, there’s a decent argument that it is, if the individual agents are sufficiently capable. If the agents can model each other well enough and coordinate well enough, then each should be able to predict which of its own actions will cause the composite system to behave in an aligned/corrigible way, and they want to be aligned/corrigible, so they’ll do that.
However, this does not work if the individual agents are very limited and unable to model the whole big-picture system. HCH-like proposals are a good example here: humans are not typically able to model the whole big picture of a large human organization. There are too many specialized skillsets, too much local knowledge and information, too many places where complicated things happen which the spreadsheets and managerial dashboards don’t represent well. And humans certainly can’t coordinate at scale very well in general; our large-scale communication bandwidth is maybe five to seven words at best. Each individual human may be reasonably aligned/corrigible, but that doesn’t mean they aggregate together into an aligned/corrigible system.
The same applies if e.g. we magically factor a problem and then have a low-capability overseeable agent handle each piece. I could definitely oversee a logic gate during the execution of a program, make sure that it did absolutely nothing fishy, but overseeing each individual logic gate would do approximately nothing at all to prevent the program from behaving maliciously.
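As a toy illustration of that point, here's a sketch in python. The "oversight" is just an assert checking that each gate really computes NAND, and the "malicious" behavior is a stand-in print; the trigger pattern and function names are made up for the example.

```python
# Sketch: every logic gate passes gate-level oversight, yet the composed
# circuit implements behavior the gate-level overseer can't see.

def overseen_nand(a: int, b: int) -> int:
    out = 1 - (a & b)
    # Gate-level oversight: verify this gate did nothing but compute NAND.
    assert out == (0 if (a == 1 and b == 1) else 1), "gate misbehaved!"
    return out

def bit_equal(a: int, b: int) -> int:
    """XNOR built entirely from overseen NAND gates."""
    n = overseen_nand(a, b)
    x = overseen_nand(overseen_nand(a, n), overseen_nand(b, n))  # XOR
    return overseen_nand(x, x)                                   # NOT

TRIGGER = [1, 0, 1, 1, 0, 1, 0, 0]  # made-up 8-bit trigger pattern

def circuit(inp: list[int]) -> None:
    match = 1
    for a, b in zip(inp, TRIGGER):
        match &= bit_equal(a, b)
    if match:
        print("composite system does the bad thing")  # stand-in for misbehavior
    else:
        print("composite system looks fine")

circuit([0] * 8)                    # every gate passes oversight; output benign
circuit([1, 0, 1, 1, 0, 1, 0, 0])   # every gate still passes; output malicious
```

Every individual gate does exactly what the overseer expects on every call; the problem lives entirely in how the gates are wired together.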
How These Arguments Work In Practice
In practice, nobody proposes that AI built from transistors and wires will be interpretable/tool-like/aligned/corrigible because the transistors and wires are interpretable/tool-like/aligned/corrigible. But people do often propose breaking things into very small chunks, so that each chunk is interpretable/tool-like/aligned/corrigible. For instance, interpretability people will talk about hiring ten thousand interpretability researchers to each interpret one little circuit in a net. Or problem factorization people will talk about breaking a problem into a large number of tiny little chunks each of which we can oversee.
And the issue is, the more little chunks we have to combine together, the more noncomposability becomes a problem. If we’re trying to compose interpretability/tool-ness/alignment/corrigibility of many little things, then figuring out how to turn interpretability/tool-ness/alignment/corrigibility of the parts into interpretability/tool-ness/alignment/corrigibility of the whole is the central problem, and it’s a hard (and interesting) open research problem.