Why Not Just Outsource Alignment Research To An AI?

johnswentworth

Warmup: The Expert

If you haven’t seen “The Expert” before, I recommend it as a warmup for this post:

The Client: “We need you to draw seven red lines, all strictly perpendicular. Some with green ink, some with transparent. Can you do that?”
(... a minute of The Expert trying to explain that, no, he cannot do that, nor can anyone else…)
The Client: “So in principle, this is possible.”

This. This is what it looks like in practice, by default, when someone tries to outsource some cognitive labor which they could not themselves perform. At best, The Expert is well-intentioned and knows what the user needs, ignores the incoherent parts of The Client’s babbling, and does the right thing. Or, they manage to add some silly but ultimately harmless bells and whistles to satisfy whatever dumb thing The Client is looking for.

At worst… well, there’s more than one failure mode which could qualify for the title of "worst". Maybe The Expert gives The Client something which looks right to The Client and successfully conceals all the problems with it; presumably that’s a lucrative strategy for Experts. Maybe the Double Illusion of Transparency kicks in, both parties think they’ve successfully communicated, but in fact neither has any idea what’s going on in the other’s head. Maybe a well-intentioned Expert decides to ignore The Client’s incoherent babbling and do the thing which seems most likely to be right, but gets The Client’s preferences wrong.

One way or another, The Client’s ignorance is a major bottleneck to cognitive outsourcing. In practice, I expect The Client’s ignorance to be the primary bottleneck to cognitive outsourcing.

The core reason why we cannot just outsource alignment research to an AI is because we would then be The Client, and probably a very ignorant one.

Application to Alignment Schemes

There’s a lot of different flavors of “have the AI solve alignment for us”. A sampling:

Just prompt a language model to generate alignment research
Do some fine-tuning/RLHF on the language model to make it generate alignment research
Let the language model talk to other instances of itself, and prompt or fine-tune them together so they generate alignment research jointly
Set up a language model to generate alignment proposals and another to poke holes in them, and fine-tune the pair via a human judging the “debate”
…

As we go down the list, the proposals get fancier and add more bells and whistles, trying to make the AI a better expert. Sadly, none of them at all address what I expect to be the actual main bottleneck: The Client (i.e. the human user or users) has no understanding of what they need, what questions to ask, what’s possible or even logically coherent, etc.

What would this kind of error look like in practice?

Here’s one concrete example of the kind of failures I’d expect when a would-be outsourcer’s understanding falls short (from here):

Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned or whether they'll be happy with the outcome or whatever.
(It's essentially the same mistake as a GOFAI person looking at a node in some causal graph labeled "will_kill_humans", and seeing that node set to 99% False, and thinking that somehow implies the GOFAI will not kill humans.)

This is an Illusion of Transparency failure mode: The Client (humans) thinks they know what The Expert (GPT) is saying/thinking/doing, but in fact has no clue.

To be clear, I’d expect this particular mistake to be obvious to at least, like, 30% of the people who want to outsource alignment-solving to AI. Only the people who really do not understand what’s going on would make this particular mistake. (Or, of course, people who do understand but are working in a large organization and don’t notice that nobody else is checking for the obvious failure modes.) But in general, I expect more subtle versions of this kind of failure mode to be the default outcome when someone attempts to outsource lots of cognition to an AI in an area the outsourcer understands very poorly.

As they say: error between chair and keyboard.

Also, It’s Worse Than That

In fact “The Expert” video is too optimistic. The video opens with a well-intentioned Expert, who really does understand the domain and tries to communicate the problems, already sitting there in the room. In practice, I expect that someone as clueless as The Client is at least as likely to hire someone as clueless as themselves, or someone non-clueless but happy to brush problems under the rug, as they are to hire an actual Expert.

General principle: some amount of expertise is required to distinguish actual experts from idiots, charlatans, confident clueless people, etc. As Paul Graham puts it:

The problem is, if you're not a hacker, you can't tell who the good hackers are. A similar problem explains why American cars are so ugly. I call it the design paradox. You might think that you could make your products beautiful just by hiring a great designer to design them. But if you yourself don't have good taste, how are you going to recognize a good designer? By definition you can't tell from his portfolio. And you can't go by the awards he's won or the jobs he's had, because in design, as in most fields, those tend to be driven by fashion and schmoozing, with actual ability a distant third. There's no way around it: you can't manage a process intended to produce beautiful things without knowing what beautiful is. American cars are ugly because American car companies are run by people with bad taste.

Now, in the case of outsourcing to AI, the problem is not “who to hire” but rather “which AI behavior/personality to train/prompt”. In the simulators frame, it’s a question of who or what to simulate. For the same reasons that a non-expert can’t reliably hire actual experts, a non-expert won’t be able to reliably prompt a simulacrum of an expert, because they can’t distinguish an expert simulacrum from a non-expert simulacrum.

Or, in the context of RLHF: a non-expert won’t be able to reliably reinforce expert thinking/writing on alignment, because they can’t reliably distinguish expert thinking/writing from non-expert.

Cocnretely, consider our earlier example:

Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either.

This is a failure to “hire” the right behavior/personality within the AI. Alas, the user fails to even realize that they have “hired” neither an honest actual expert nor an actively deceptive expert; they have “hired” something entirely different. (Reminder: I expect actual failures to be more subtle than this one.)

... Oh, And Worse Than That Too

Note that, in all of our prototypical examples above, The Client doesn't just fail to outsource. They fail to recognize that they've failed. (This is not necessarily the case when non-expert Clients fail to outsource cognitive labor, but it sure is correlated.)

That means the problem is inherently unsolvable by iteration. "See what goes wrong and fix it" auto-fails if The Client cannot tell that anything is wrong. If The Client doesn't even know there's a failure, then they have nothing on which to iterate. We're solidly in "worlds in which iterative design fails" territory.

Solutions

Partial Solution: Better User Interfaces

One cached response to “error between chair and keyboard” is “sounds like your user interface needs to communicate what’s going on better”.

There's a historical parable about an airplane (I think the B-52 originally?) where the levers for the flaps and landing gear were identical and right next to each other. Pilots kept coming in to land, and accidentally retracting the landing gear. The point of the story is that this is a design problem with the plane more than a mistake on the pilots' part; the problem was fixed by putting a little rubber wheel on the landing gear lever. If we put two identical levers right next to each other, it's basically inevitable that mistakes will be made; that's bad interface design.

In practice, an awful lot of supposed “errors between chair and keyboard” can be fixed with better UI design. Especially among relatively low-hanging fruit. For instance, consider our running hypothetical scenario:

Somebody literally types “If we take the action you just proposed, will we be happy with the outcomes?” into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but in this case it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned or whether they'll be happy with the outcome or whatever.

What UI features would make that mistake less probable? Well, a less chat-like interface would be a good start, something which does not make it feel intuitively like we’re talking to a human and provide the affordance to anthropomorphize the system constantly. Maybe something that emphasizes that the AI’s words don’t necessarily correspond to reality, like displaying before every response “The mysterious pile of tensors says:” or “The net’s output, when prompted with the preceding text, is:” or something along those lines. (Not that those are very good ideas, just things off the top of my head.)

So there’s probably room for a fair bit of value in UI design.

That said, there are two major limitations on how much value we can add via the “better UI” path.

First, the more minor problem: in more complex domains, there are sometimes wide inferential distances - places where someone needs to understand a concept requires first understanding a bunch of intermediate concepts, and there just isn’t a good way around that. UI improvement can go a long way, but mostly only when inferential distances are short.

Second, the main problem: whoever’s designing the UI must themselves be an actual expert in the domain, or working closely with an expert. Otherwise, they don’t know what mistakes their UI needs to avoid, or what kinds of thoughts their UI needs to provide affordances for. (It’s the same main problem with building tools for alignment research more generally.) In the above example, for instance, the UI designer needs to already know that somebody interpreting the AI’s output as having anything to do with reality is a failure mode they need to watch out for. (And reminder: I expect actual failure modes to be more subtle than the hypothetical, so to handle realistic analogues of this problem the UI designer needs more expertise than this particular hypothetical failure story requires.)

So we’re back to the core problem: can’t robustly usefully outsource until we already have expertise. Except now we need expertise in alignment and UI.

The Best Solution: A Client With At Least Some Understanding

The obvious best solution would be for “The Client” (i.e. human user/users of the AI system) to have at least some background understanding of what they need, what questions to ask, what’s possible or even logically coherent, etc.

In other words: the best solution is for The Client to also be an expert.

Importantly, I expect this solution to typically “degrade well”: a Client with somewhat more (but still incomplete) background knowledge/understanding is usually quantitatively better than a Client with less background knowledge/understanding. (Of course there are situations where someone “knows just enough to shoot themselves in the foot”, but I expect that to usually be a problem of overconfidence more than a problem of the knowledge itself.)

Returning to our running example:

… and then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer".

If that boss had thought much about alignment failure modes beyond just deception, I’d expect them to be quantitatively more likely to notice the error. Not that the chance of missing the error would be low enough to really be comforting, but it would be quantitatively lower.

The usefulness of quantitatively more/better (but still incomplete) understanding matters because, in practice, it is unlikely that any human will be a real proper Expert in alignment very soon. But insofar as the human user’s understanding of the domain is the main bottleneck to robustly useful cognitive outsourcing, and even partial improvements are a big deal, improving our own understanding is likely to be the highest-value way to improve our chances of successful cognitive outsourcing.

Summary and Advice

Key idea: “The Client’s” own understanding is a key bottleneck to cognitive outsourcing in practice; I expect it is the main bottleneck to outsourcing cognitive work which The Client could not perform themselves. And I expect it to be the main bottleneck to successfully outsourcing alignment research to AIs.

The laundry list of strategies to make AIs better experts and avoid deception do little-to-nothing to address this bottleneck.

There’s probably room for better user interfaces to add a lot of value in principle, but the UI designer would either need to be an expert in alignment themselves or be working closely with an expert. Same problem as building tools for alignment research more generally.

Thus, my main advice: if you’re hoping to eventually solve alignment by outsourcing to AI, the best thing to do is to develop more object-level expertise in alignment yourself. That’s the main bottleneck.

Note that the relevant kind of “expertise” here is narrower than what many people would refer to as “alignment expertise”. If you hope to outsource the job of aligning significantly smarter-than-human AI to AI, then you need expertise in aligning significantly smarter-than-human AI, not just the hacky tricks which most people expect to fail as soon as the AI gets reasonably intelligent. You need to have some idea of what questions to ask, what failure modes to look for, etc. You need to focus on things which will generalize. You need to go after the difficult parts, not the easy parts, so you have a better idea of what questions to ask when it comes time for an AI to solve the difficult parts - or, y'know, time for an AI to tell you that drawing seven perpendicular lines in two dimensions isn't even logically coherent.

… and of course you will probably not end up with that good an idea of how to align significantly smarter-than-human AI. We have no significantly smarter-than-human systems on which to test, and by the time we do it will probably be too late, so your understanding will likely be limited. But, as when e.g. optimizing code, partial progress on the bottleneck is better than basically-any progress on non-bottlenecks.

This post argues against alignment protocols based on outsourcing alignment research to AI. It makes some good points, but also feels insufficiently charitable to the proposals it's criticizing.

John make his case by an analogy to human experts. If you're hiring an expert in domain X, but you understand little in domain X yourself then you're going to have 3 serious problems:

Illusion of transparency: the expert might say things that you misinterpret due to your own lack of understanding.
The expert might be dumb or malicious, but you will believe them due to your own ignorance.
When the failure modes above happen, you won't be aware of this and won't act to fix them.

These points are relevant. However, they don't fully engage with the main source of hope for outsourcing proponents. Namely, it's the principle that validation is easier than generation^[1]. While it's true that an arbitrary dilettante might not benefit from an arbitrary expert, the fact that it's easier to comprehend an idea than invent it yourself means that we can get some value from outsourcing, under some half-plausible conditions.

The claim that the "AI expert" can be deceptive and/or malicious is straightforwardly true. I think that the best hope to address it would be something like Autocalibrated Quantilized Debate, but it does require some favorable assumptions about the feasibility of deception and inner alignment is still a problem.

The "illusion of transparency" argument is more confusing IMO. The obvious counterargument is, imagine an AI that is trained to not only produce correct answers but also explain them in a way that's as useful as possible for the audience. However, there are two issues with this counterargument:

First, how do we know that the generalization from the training data to the real use case (alignment research) is reliable? Given that we cannot reliably test the real use case, precisely because we are alignment dilettantes?

Second, we might be following a poor metastrategy. It is easy to imagine, in the world we currently inhabit, that an AI lab creates catastrophic unaligned AI, even though they think they care about alignment, just because they are too reckless and overconfident. By the same token, we can imagine such an AI lab consulting their own AI about alignment, and then proceeding with the reckless and overconfident plans suggested by the AI.

In the context of a sufficiently cautious metastrategy, it is not implausible that we can get some mileage from the outsourcing approach^[2]. Move one step at a time, spend a lot of time reflecting on the AI's proposals, and also have strong guardrails against the possibility of superhuman deception or inner alignment failures (which we currently don't know how to build!) But without this context, we are indeed liable to become the clients in the satiric video John linked.

^{^}
I think that John might disagree with this principle. A world in which the principle is mostly false would be peculiar. It would be a world in which marketplaces of ideas don't work at all, and even if someone fully solves AI alignment they will fail to convince most relevant people that their solution is correct (any more than someone with an incorrect solution would succeed in that). I don't think that's the world we live in.
^{^}
Although currently I consider PSI to be more promising.

I found this post a bit odd, in that I was assuming the context was comparing

“Plan A: Humans solve alignment” -versus-
“Plan B: Humans outsource the solving of alignment to AIs”

If that’s the context, you can say “Plan B is a bad plan because humans are too incompetent to know what they’re looking for, or recognize a good idea when they see it, etc.”. OK sure, maybe that’s true. But if it’s true, then both plans are doomed! It’s not an argument to do Plan A, right?

To be clear, I don’t actually care much, because I already thought that Plan A was better than Plan B anyway (for kinda different reasons from you—see here).

I think the missing piece here is that people who want to outsource the solving of alignment to AIs are usually trying to avoid engaging with the hard problems of alignment themselves. So the key difference is that, in B, the people outsourcing usually haven't attempted to understand the problem very deeply.

I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later. I expect alignment researchers to be central to automation attempts.

It seems to me like the basic equation is something like: "If today's alignment researchers would be able to succeed given a lot more time, then they also are reasonably likely to succeed given access to a lot of human-level-ish AIs." There are reasons this could fail (perhaps future alignment research will require major adaptations and different skills such that today's top alignment researchers will be unable to assess it; perhaps there are parallelization issues, though AIs can give significant serial speedup), but the argument in this post seems far from a knockdown.

Also, it seems worth noting that non-experts work productively with experts all the time. There are lots of shortcomings and failure modes, but the video is a parody.

I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later.

Indeed, I think you're a good role model in this regard and hope more people will follow your example.

Also Plan B is currently being used to justify accelerating various danger tech by folks with no solid angles on Plan A...

I think the human level of understanding is a factor, and of some importance. But I strongly suspect the exact level of human understanding is of less importance than exactly what expert we summon.