I think that we should keep these stronger Lookup Table and Overseer’s Manual scenarios in mind when considering whether HCH might be safe.
These scenarios aren't just strictly stronger, though; they also have downsides, right? In particular, they seem to give up some of the properties that made some people optimistic about Paul's approach in the first place. See, for example, this post by Daniel Dewey:
It seems to me that this kind of approach is also much more likely to be robust to unanticipated problems than a formal, HRAD-style approach would be, since it explicitly aims to learn how to reason in human-endorsed ways instead of relying on researchers to notice and formally solve all critical problems of reasoning before the system is built.
In the Lookup Table case, and to the extent that the humans in the Overseer’s Manual case are just acting as lookup-table decompressors, aren't we back to "relying on researchers to notice and formally solve all critical problems of reasoning before the system is built"?
Yeah, to some extent. In the Lookup Table case, you need to have a (potentially quite expensive) way of resolving all mistakes. In the Overseer's Manual case, you can also leverage humans to do somewhat more robust reasoning (for example, they can notice a typo in a question and still respond correctly, where the Lookup Table would fail). Though in low-bandwidth oversight, the space of things that participants can notice and correct is fairly limited.
Though I think this still differs from HRAD, in that the output of HRAD would be a much smaller thing in terms of description length than the Lookup Table, and you can buy extra robustness by adding many more human-reasoned entries into the Lookup Table: for example, automatically adding versions of all questions with typos that don't change the meaning of the question (see the sketch below), or adding 1000 different sanity-check questions to flag when things have gone wrong.
So there are additional ways the system could correct mistaken reasoning, relative to what I would expect the output of HRAD to look like, but you do need to have processes that you think can correct any way that reasoning might go wrong. The problem could be less difficult than HRAD, but it is still tricky to get right.
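As an illustrative sketch of the typo-variant idea mentioned above: the deletion-based variant generator, the toy table entry, and the `pad_with_typos` helper are assumptions for the example only, not a proposal for how a real Lookup Table would be built.

```python
# Illustrative sketch: pad a lookup table so that simple misspellings of a
# question still hit the same entry. `base_table` maps questions to whatever
# answer specification the table uses (here just a placeholder string).

def typo_variants(question: str) -> set[str]:
    """Generate simple single-character-deletion variants of a question."""
    return {question[:i] + question[i + 1:] for i in range(len(question))}

def pad_with_typos(base_table: dict[str, str]) -> dict[str, str]:
    padded = dict(base_table)
    for question, spec in base_table.items():
        for variant in typo_variants(question):
            # Only add a variant if it doesn't collide with an existing entry.
            padded.setdefault(variant, spec)
    return padded

table = {"What year is it?": "answer-spec-123"}
padded = pad_with_typos(table)
print("Wha year is it?" in padded)  # True: a dropped character still hits the table
```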
So if I understand your point, this quote:
I think there's a way of interpreting HCH that leads to incorrect intuitions about why we would expect it to be safe.
is basically referring to this:
We could claim that this keeps the reasoning that produces the answer within the space of reasoning that humans use, and so makes it more likely to reflect our values and less likely to yield unexpected outcomes that misinterpret our values.
And you think this claim is incorrect, or at least not the best or primary argument for HCH's safety? I agree with that.
-------------------
I'll also note that the other two cases sound a lot like GOFAI, and make it appear that HCH has not provided a very useful reduction or complete explanation of how to make a good "alignment target". That also seems correct to me.
I'd say that the claim is not sufficient: it might provide some alignment value, but it needs a larger story about how the whole computation is going to be safe. I do think that the HCH framework could make specifying an aligned GOFAI-like computation easier (though it's hard to come up with a rigorous argument for this without pointing to some kind of specification that we can make claims about, which is something I'd want to produce along the way while proceeding with HCH-like approaches).
HCH, introduced in Humans consulting HCH, is a computational model in which a human answers questions with the help of other humans, who can in turn consult further humans, and so on. Each step in the process consists of a human taking in a question, optionally asking one or more subquestions of other humans, and returning an answer based on the answers to those subquestions. HCH can be used as a model for what Iterated Amplification would be able to do in the limit of infinite compute. HCH can also be used to decompose the question “is Iterated Amplification safe?” into “is HCH safe?” and “if HCH is safe, will Iterated Amplification approximate the behaviour of HCH in a way that is also safe?”.
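To make the shape of the model concrete, here is a minimal sketch of the HCH recursion. The `human_step` stub and the toy arithmetic questions are stand-ins (assumptions for this example, not part of any actual HCH specification); the point is only the recursive structure: a question comes in, subquestions may be asked, and an answer is returned.

```python
from typing import Callable

def human_step(question: str, ask: Callable[[str], str]) -> str:
    """Stand-in for one human's deliberation: read the question, optionally
    ask subquestions via `ask`, and compose an answer from the replies."""
    if question == "What is 2 + 3 * 4?":
        product = ask("What is 3 * 4?")
        return ask(f"What is 2 + {product}?")
    if question == "What is 3 * 4?":
        return "12"
    if question == "What is 2 + 12?":
        return "14"
    return "I don't know"

def hch(question: str, depth: int = 0, max_depth: int = 10) -> str:
    """Answer a question by letting a (simulated) human ask recursive subquestions."""
    def ask(subquestion: str) -> str:
        # Each subquestion goes to another copy of the human, one level deeper.
        if depth >= max_depth:
            return "no answer (depth limit reached)"
        return hch(subquestion, depth + 1, max_depth)
    return human_step(question, ask)

print(hch("What is 2 + 3 * 4?"))  # prints "14"
```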
I think there's a way of interpreting HCH that leads to incorrect intuitions about why we would expect it to be safe. Here, I describe three models of how one could think HCH would work, and why we might expect each of them to be safe.
Mechanical Turk: A human, Bob, is hired on Mechanical Turk to act as a component of HCH. Bob takes in a reasonable-length natural-language question, formulates subquestions to ask other Turkers, and turns the responses from those Turkers into an answer to the original question. Bob only sees the question he is asked, and thinks for a short period of time before asking subquestions or returning an answer. The question "is HCH corrigible?" here becomes "how does the corrigibility of Bob translate into corrigibility of the overall system?" To claim that HCH is safe in this scenario, we could point to Bob being well-intentioned, having human-like concepts, and reasoning in a human-like way. Also, since Bob has to communicate with other humans in natural language, those communications could be monitored or reflected upon. We could claim that this keeps the reasoning that produces the answer within the space of reasoning that humans use, and so makes it more likely to reflect our values and less likely to yield unexpected outcomes that misinterpret our values.
Lookup Table: An AI safety research team led by Alice writes down a set of 100 million possible queries that they claim capture all human reasoning. For each of these queries, they then write out the subquestions that would need to be asked, along with simple computer code that combines the answers to the subquestions into an answer to the original question. This produces a large lookup table, and the "human" in HCH is just a call to this lookup table. The question "is HCH corrigible?" here becomes "has Alice's team successfully designed a set of rules that perform corrigible reasoning?" To justify this, we point to Alice's team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.
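Here is a minimal sketch of this scenario, using an assumed toy data structure in place of the 100 million entries: the "human" in the system is nothing more than a table lookup plus a small piece of combining code. The example questions and combiners are illustrative only, not proposals for what the real table would contain.

```python
from typing import Callable, NamedTuple

class Entry(NamedTuple):
    subquestions: list[str]
    combine: Callable[[list[str]], str]  # simple code combining the sub-answers

# Toy stand-in for the (very large) lookup table.
LOOKUP_TABLE: dict[str, Entry] = {
    "Is this plan corrigible?": Entry(
        subquestions=[
            "Does the plan preserve the operator's ability to shut it down?",
            "Does the plan avoid deceiving the operator?",
        ],
        combine=lambda answers: "yes" if all(a == "yes" for a in answers) else "no",
    ),
    "Does the plan preserve the operator's ability to shut it down?": Entry([], lambda _: "yes"),
    "Does the plan avoid deceiving the operator?": Entry([], lambda _: "yes"),
}

def answer(question: str) -> str:
    """The 'human' in HCH is just a call into the lookup table."""
    entry = LOOKUP_TABLE[question]  # fails loudly on any unanticipated question
    sub_answers = [answer(q) for q in entry.subquestions]
    return entry.combine(sub_answers)

print(answer("Is this plan corrigible?"))  # prints "yes"
```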
Overseer's Manual: An AI safety research team led by Alice has written a manual on how to corrigibly answer questions by decomposing them into subquestions. This manual is handed to Bob, who was hired to decompose tasks. Bob carefully studies the manual and applies its rules when performing his task (and the quality of his work is monitored by the team). Alice's team has carefully thought about how to decompose tasks, and has performed many experiments with people like Bob trying to decompose tasks, so they understand the space of strategies and outputs that Bob will produce given the manual. The "human" in HCH is actually a human (Bob), but in effect Bob is acting as a compressed lookup table, and is only necessary because the lookup table is too large to write down. An analogy: it would take too much space and time to write down a list of translations of all possible 10-word sentences from English to German, but it is possible to train humans who, given any 10-word English sentence, can produce the German translation. The safety properties are caused by Alice's team's preparations, which include Alice's team modelling how Bob would produce answers after reading the manual. To justify the safety of the system, we again point to Alice's team having a large body of AI safety knowledge, proofs of properties of the system, demonstrations of the system working in practice, etc.
I claim that the Mechanical Turk scenario gives an incomplete picture of why we might hope for an HCH system to be safe. Though it might be safer than a computation without human involvement, I would find it hard to trust that this system would continue to scale without running into problems, like handing over control, deliberately or accidentally, to some unsafe computational process. The Mechanical Turk scenario leaves out the design process that Alice’s team carries out in the Lookup Table and Overseer’s Manual scenarios, which can include at least some consideration of AI safety issues (though how much of this is necessary is an open question). I think this design process, if done right, is the thing that could give the system the ability to avoid these problems as it scales. I think that we should keep these stronger Lookup Table and Overseer’s Manual scenarios in mind when considering whether HCH might be safe.
(Thanks to Andreas Stuhlmüller and Owain Evans for feedback on a draft of this post)