Assuming we've solved X, could we do Y...

Stuart_Armstrong

The year is 1933. Leó Szilárd has just hypothesised the nuclear chain reaction. Worried researchers from proto-MIRI or proto-FHI ask themselves: "assuming we've solved the issue of nuclear chain reactions in practice, could we build a nuclear bomb out of it?"

Well, what do we mean by "assuming we've solved the issue of nuclear chain reactions"? Does it mean that "we have some detailed plans for viable nuclear bombs, including all the calculations needed to make them work, and everything in the plans is doable by a rich industrial state"? In that case, the answer to "could we build a nuclear bomb out of it?" is a simple and trivial yes.

Alternatively, are we simply assuming "there exists a collection of matter that supports a chain reaction"? In which case, note that the assumption is (almost) completely useless. In order to figure out whether a nuclear bomb is buildable, we still need to figure out all the details of chain reactions - that assumption has bought us nothing.

Assuming human values...

At the recent AI safety unconference, David Krueger wanted to test, empirically, whether debate methods could be used for creating aligned AIs. At some point in the discussion, he said "let's assume the question of defining human values is solved", wanting to move on to whether a debate-based AI could then safely implement it.

But as above, when we assume that an underdefined definition problem (human values) is solved, we have to be very careful what we mean - the assumption might be useless, or might be too strong, and end up solving the implementation problem entirely.

In the conversation with David, we were imagining a definition of human values related to what humans would answer if we could reflexively ponder specific questions for thousands of years. One could object to that definition on the grounds that people can be coerced or tricked into giving the answers that the AI might want - hence the circumstances of that pondering are critical.

If we assume X="human values are defined in this way", could an AI safely implement X via debate methods? Well, what about coercion and trickery by the AI during the debate process? It could be that X doesn't help at all, because we still have to resolve all of the same issues.

Or, conversely, X might be too strong - it might define what trickery is, which solves a lot of the implementation problem for free. Or, in the extreme case, maybe X is expressed in computer code, and solves all the contradictions within humans, dealing with ontology issues, population changes, what an agent is, and all other subtleties. Then the question "given X, could an AI safely implement it?" reduces to "can the AI run code?"

In summary, when the issue is underdefined, the boundary between definition and implementation is very unclear, and so it is equally unclear what assuming that one of them is solved actually means.

How to assume (for the good of all of us)

The obvious way around this issue is to be careful and precise in what we're assuming. So, for example, we might assume "we have an algorithm A that, if run for a decade, would compute what humans would decide after a thousand years of debate". Then we have two practical and well defined subproblems to work on: can we approximate the output of A within reasonable time, and is "what humans would decide after a thousand years of debate" a good definition of human values?

Another option, when we lack a full definition, is to focus on some of the properties of that definition that we feel are certain or likely. For example, we can assume that "the total extinction of all intelligent beings throughout the cosmos" is not a desirable feature according to most human values, and argue whether debate methods will lead to that outcome. Or, at smaller scale, we might assume that telling us informative truths is compatible with our values, and check whether the debate AI would do that.

I think this argument applies well to the case where we are trying to communicate, but not so much to the case where I individually am thinking about a problem. Communication is hard and if the speaker uses some intuitive assumption, chances are the listener will interpret it differently from what the speaker intended, and so being very precise seems quite helpful. However, when I'm thinking through a problem myself and I make an assumption, I usually have a fairly detailed intuitive model of what I mean, such that if you ask me whether I'm assuming that problem X is solved by the assumption, I could answer that, even though I don't have a precise formulation of the assumption. Making the assumption more precise would be quite a lot of work, and probably would not improve my thinking on the topic that much, so I tend not to do it until I think there's some insight and want to make the argument more rigorous. It seems to me that this is how most research makes progress: by individual researchers having intuitions that they then make rigorous and precise.

I think it depends on the individual. Certainly, before realising the points above, I would occasionally do the "assume human values solved" move in my mind, in an unrigorous and misleading way.

Hey, David here!

Just writing to give some context... The point of this session was to discuss an issue I see with "super-human feedback (SHF)" schemes (e.g. debate, amplification, recursive reward modelling) that use helper AIs to inform human judgments. I guess there was more of an inferential gap going into the session than I expected, so for background: let's consider the complexity theory viewpoint in feedback (as discussed in section 2.2 of "AI safety via debate"). This implicitly assumes that we have access to a trusted (e.g. human) decision making process (TDMP), sweeping the issues that Stuart mentions under the rug.

Under this view, the goal of SHF is to efficiently emulate the TDMP, accelerating the decision-making. For example, we'd like an agent trained with SHF to be able to quickly (e.g. in a matter of seconds) make decisions that would take the TDMP billions of years to decide. But we don't aim to change the decisions.

Now, the issue I mentioned is: there doesn't seem to be any way to evaluate whether the SHF-trained agent is faithfully emulating the TDMP's decisions on such problems. It seems like, naively, the best we can do is train on problems where the TDMP can make decisions quickly, so that we can use its decisions as ground truth; then we just hope that it generalizes appropriately to the decisions that take the TDMP billions of years. And the point of the session was to see if people have ideas for how to do less naive experiments that would allow us to increase our confidence that a SHF-scheme would yield safe generalization to these more difficult decisions.
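To make the evaluation gap concrete, here is a toy Python sketch. Everything in it is a hypothetical stand-in: `tdmp` is a placeholder process whose cost doubles with each query, `toy_agent` is a made-up agent, and `budget` is an arbitrary limit on how much TDMP time we can afford. It only illustrates that agreement can be measured on the cheap queries and on nothing else.

```python
from typing import Callable, Iterable, Optional, Tuple

def tdmp(query: int, budget: int) -> Optional[int]:
    """Hypothetical trusted process; returns None when the decision is too costly to obtain."""
    cost = 2 ** query  # assumed: cost doubles with each harder query
    if cost > budget:
        return None  # no ground truth available within the budget
    return query % 3  # placeholder decision rule

def toy_agent(query: int) -> int:
    """Hypothetical agent: matches the TDMP on easy queries, deviates on hard ones."""
    return query % 3 if query <= 10 else 0

def checkable_agreement(
    agent: Callable[[int], int], queries: Iterable[int], budget: int
) -> Tuple[int, int, int]:
    """Return (agreements, queries we could check, queries we could not check)."""
    queries = list(queries)
    truths = [(q, tdmp(q, budget)) for q in queries]
    checked = [(q, d) for q, d in truths if d is not None]
    agreements = sum(agent(q) == d for q, d in checked)
    return agreements, len(checked), len(queries) - len(checked)

print(checkable_agreement(toy_agent, range(20), budget=1024))  # (11, 11, 9)
```

The agent scores perfectly on the 11 queries we can check, while the 9 we cannot check are exactly where it departs from the placeholder rule; the test gives no signal about them.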

Imagine there are 2 copies of me, A and B. A makes a decision with some helper AIs, and independently, B makes a decision without their help. A and B make different decisions. Who do we trust? I'm more ready to trust B, since I'm worried about the helper AIs having an undesirable influence on A's decision-making.

--------------------------------------------------------------------

...So questions of how to define human preferences or values seem mostly orthogonal to this question, which is why I want to assume them away. However, our discussion did make me consider more that I was making an implicit assumption (and this seems hard to avoid), that there was some idealized decision-making process that is assumed to be "what we want". I'm relatively comfortable with trusting idealized versions of "behavioral cloning/imitation/supervised learning" (P) or "(myopic) reinforcement learning/preference learning" (NP), compared with the SHF-schemes (PSPACE).

One insight I gleaned from our discussion is the usefulness of disentangling:

an idealized process for *defining* "what we want" (HCH was mentioned as potentially a better model of this than "a single human given as long as they want to think about the decision" (which was what I proposed using, for the purposes of the discussion)).
a means of *approximating* that definition.

From this perspective, the discussion topic was: how can we gain empirical evidence for/against this question: "Assuming that the output of a human's indefinite deliberation is a good definition of 'what they want', do SHF-schemes do a good/safe job of approximating that?"

"And the point of the session was to see if people have ideas for how to do less naive experiments that would allow us to increase our confidence that a SHF-scheme would yield safe generalization to these more difficult decisions."

Did anyone have ideas for this? My thinking is that you have to understand or make some assumptions about the nature of TDMP in order to have confidence about safe generalization, because if you just treat it as a black box, then it might be that for some class of queries it will do something that can't be approximated by SHF-schemes. No matter how you test, you can only conclude that if such queries exist they are not in the test sets you used.

Or was the discussion more about, assuming we have theoretical reasons to think that SHF-schemes can approximate TDMP, how to test it empirically?

Regarding the question of how to do empirical work on this topic: I remember there being one thing which seemed potentially interesting, but I couldn't find it in my notes (yet).

RE the rest of your comment: I guess you are taking issue with the complexity theory analogy; is that correct? An example hypothetical TDMP I used is "arbitrarily long deliberation" (ALD), i.e. a single human is allowed as long as they want to make the decision (I don't think that's a perfect "target" for alignment, but it seems like a reasonable starting point). I don't see why ALD would (even potentially) "do something that can't be approximated by SHF-schemes", since those schemes still have the human making a decision.

"Or was the discussion more about, assuming we have theoretical reasons to think that SHF-schemes can approximate TDMP, how to test it empirically?" <-- yes, IIUC.

"I don't see why ALD would (even potentially) 'do something that can't be approximated by SHF-schemes', since those schemes still have the human making a decision."

Suppose there's a cryptographic hash function H inside a human brain whose algorithm is not introspectively accessible, and some secret state S which is also not introspectively accessible. The human can choose to, in each period, run S|Output := H(S|Input) and observe/report Output, so we can ask ALD: what's Output if you iterate H n times with X as the initial Input, updating S each time? (I can try to clarify if it's not clear what I mean.) I think this can't be approximated by SHF-schemes, because there's no way to train ML to approximate H to serve as the baseline agent.
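In case it helps, here is a minimal Python sketch of that iteration. It rests on several assumptions that are not part of the argument itself: SHA-256 stands in for the inaccessible H, `|` is read as byte concatenation, the digest is split in half to give the new S and the Output, and each Output is fed back in as the next Input. The names `step` and `iterate` are illustrative only.

```python
import hashlib
from typing import Tuple

def step(state: bytes, inp: bytes) -> Tuple[bytes, bytes]:
    """One period: S|Output := H(S|Input), with SHA-256 as an assumed stand-in for H."""
    digest = hashlib.sha256(state + inp).digest()  # 32 bytes
    return digest[:16], digest[16:]  # (new S, Output); the split is an assumption

def iterate(secret_state: bytes, x: bytes, n: int) -> bytes:
    """Iterate n times from initial Input x, updating S and feeding Output back in."""
    state, out = secret_state, x
    for _ in range(n):
        state, out = step(state, out)
    return out

# The human (who "contains" S and H) can report this; an outside learner sees only x, n and the result.
print(iterate(b"hidden state S", b"X", 5).hex())
```

An observer who has neither S nor the algorithm sees only (X, n, Output) triples, and for a good hash function those look unstructured, which is the sense in which there is nothing for ML to fit.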

So what is this an analogy for? I think H could stand for human philosophical deliberation, and S for any introspectively inaccessible information in our brain that might go into and be changed by such deliberation.