When I first started learning about IDA, I thought that agents trained using IDA would be human-level after the first stage, i.e. that Distill(H) would be human-level. As I've written about before, Paul later clarified this, so my new understanding is that after the first stage, the distilled agent will be super-human in some respects and infra-human in others, but wouldn't be "basically human" in any sense.

But IDA is aiming to eventually be super-human in almost every way (because it's aiming to be competitive with unaligned AGI), so that raises some new questions:

  1. If IDA isn't going to be human-level after the first stage, then at what stage does IDA become at-least-human-level in almost every way?
  2. What exactly is the limitation that prevents the first stage of IDA from being human-level in almost every way?
  3. When IDA eventually does become at-least-human-level in almost every way, how is the limitation from (2) avoided?

That brings me to Evans et al., which contains a description of IDA in section 0. The way IDA is set up in this paper leads me to believe that the answer to (2) above is that the human overseer cannot provide a sufficient number of demonstrations for the most difficult tasks. For example, maybe the human can provide enough demonstrations for the agent to learn to answer very simple questions (the easiest tasks described in the paper), but it's too time-consuming for the human to answer enough complicated questions (say, those at the harder levels). My understanding is that IDA gets around this by having an amplified system that is itself automated (i.e. does not involve humans in a major way, so cannot be bottlenecked on the slowness of humans); this allows the amplified system to provide a sufficient number of demonstrations for the distillation step to work.

So on the above view, the answer to (2) is that the limitation is the number of demonstrations the human can provide, and the answer to (3) is that the human can seed the IDA process with sufficient demonstrations of easy tasks, after which the (automated) amplified system can provide sufficient demonstrations of the harder tasks. The answer to (1) is kind of vague: it's just the earliest stage at which the agent's repertoire contains almost all the tasks a human can do.
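
To make the above view concrete, here is a rough Python sketch of the training loop as I'm imagining it. Everything in it (the class names, the fake decomposition, the "distillation" that just memorizes answers) is invented for illustration and is not taken from the paper:

```python
import random


class Human:
    """Stands in for the human overseer H: slow, but able to demonstrate easy tasks."""

    def answer(self, task):
        # Placeholder: pretend the human solves the task directly.
        return f"human answer to {task!r}"


class Agent:
    """Stands in for the learned model, trained purely by imitation."""

    def __init__(self, policy=None):
        self.policy = policy or {}

    def answer(self, task):
        # Fall back to a guess on tasks it never saw demonstrations for.
        return self.policy.get(task, f"guess for {task!r}")


def distill(demonstrations):
    """'Train' a fast agent on (task, answer) pairs -- here, by memorizing them."""
    return Agent(dict(demonstrations))


def amplify(agent, task):
    """Automated amplification: decompose the task, solve the pieces with the
    current agent, and combine. The decomposition itself is faked."""
    subtasks = [f"{task}/part{i}" for i in range(3)]
    return "combine(" + ", ".join(agent.answer(s) for s in subtasks) + ")"


def ida(human, task_classes, demos_per_class=5):
    """task_classes[0] is easy enough for the human; later classes are harder."""
    # Stage 0: the human seeds the process with demonstrations of easy tasks.
    demos = [(t, human.answer(t)) for t in random.sample(task_classes[0], demos_per_class)]
    agent = distill(demos)

    # Later stages: the (automated) amplified system supplies the demonstrations,
    # so the number of demonstrations is no longer bottlenecked on human speed.
    for tasks in task_classes[1:]:
        demos += [(t, amplify(agent, t)) for t in random.sample(tasks, demos_per_class)]
        agent = distill(demos)
    return agent


if __name__ == "__main__":
    classes = [[f"easy-{i}" for i in range(20)],
               [f"medium-{i}" for i in range(20)],
               [f"hard-{i}" for i in range(20)]]
    print(ida(Human(), classes).answer("hard-3"))
```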

But the above view seems to conflict with what's in the IDA post and the IDA paper. In both of those, the amplified system is described as a human doing the decompositions (so it will be slow, or else one would need to argue that the slowness of humans decomposing tasks doesn't meaningfully restrict the number of demonstrations). Also, the main benefit of amplification is described not as the ability to provide more demonstrations, but rather to provide demonstrations for more difficult tasks. Under this alternative view, the answers to questions (1), (2), (3) aren't clear to me.

Thanks to Vipul Naik for reading through this question and giving feedback.

2 Answers

Rohin Shah


Let's ignore computational cost for now, and so consider iterated amplification without distillation, where the initial agent is some particular human. Amplification will also be simpler -- it just means letting the agent think twice as long.

For example, the initial agent is a question-answering system that just sends me the question, and returns the answer I give after thinking about it for a day. The agent after n rounds of amplification returns the answers I'd give if I had 2^n days to think about it.
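
Purely for intuition, the thinking budget per stage grows like this (assuming the one-day starting budget above):

```python
# Toy illustration of "amplification = letting the agent think twice as long":
# if the initial agent gets one day, the agent after n rounds gets 2**n days.
# The numbers mean nothing beyond showing how fast the budget grows.
for n in range(15):
    days = 2 ** n
    print(f"after {n:2d} amplifications: {days:5d} days ~ {days / 365:6.1f} years")
```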

Rather than talk about "human-level", let's talk about "Issa-level" -- agents need to answer questions as well as you could given a day's time.

Then, the initial agent -- me thinking for a day -- is super-Issa-level on some tasks (e.g. questions about Berkeley culture) and sub-Issa-level on others (e.g. questions about Wikipedia culture). Why is this? Well, for that example, we have different information. But also, there are presumably differences in what we were good at learning, which would have led to differences even if we had the same information. That's the answer to (2) in this context.

The answer to (3) is that with enough time and effort I could answer questions about Wikipedia culture; it would just take me a lot longer to do so relative to you.

The answer to (1) is "idk, but eventually it's possible". For my specific model, one might hope the upper bound comes around the stage at which I'd get about as much time to answer the question as you have spent living.

The case with iterated distillation and amplification is basically the same:

1. Idk, but eventually it'll happen. (This does rely on the Factored Cognition hypothesis.)

2. A neural net trained by distillation will probably not replicate our skills perfectly -- what it becomes good at depends on the architecture, the training process, the training data it was given, etc. Perhaps humans are really good at social reasoning because it was strongly selected for by evolution, and we didn't give the neural net a correspondingly larger amount of training data for social situations, and so it ends up subhuman at social reasoning.

3. With enough time / computational budget, the agent can (hopefully) replicate whatever (possibly expensive) explicit chunk of reasoning underlies human performance (even if that performance was powered by human intuition). This is the Factored Cognition hypothesis. The addition of the distillation step is an extra confounder, but we hope that it doesn't distort anything too much -- its purpose is to improve speed without affecting anything else (though in practice it will reduce capabilities somewhat).

(I might recommend imagining that the first agent has perfect reasoning ability, except that it is very slow. This means that for any question, the first agent could answer it, given unlimited amounts of time. I wouldn't actually make this claim of IDA, but I think it is instructive for building intuitions.)

The addition of the distillation step is an extra confounder, but we hope that it doesn't distort anything too much -- its purpose is to improve speed without affecting anything else (though in practice it will reduce capabilities somewhat).

I think this is the crux of my confusion, so I would appreciate if you could elaborate on this. (Everything else in your answer makes sense to me.) In Evans et al., during the distillation step, the model learns to solve the difficult tasks directly by using example solutions from the amplification step. But if c

... (read more)
Rohin Shah
You could do this, but it's expensive. In practice, from the perspective of distillation, there's always a tradeoff between:

* Generating better ground truth data (which you can do by amplifying the agent that generates the ground truth data)
* Improving the accuracy of the distilled model (which you can do by increasing the amount of data that you train on, and other ML tricks)

You could get to an Issa-level model using just the second method for long enough, but it's going to be much more efficient to get to an Issa-level model by alternating the two methods.
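
Concretely, the two knobs look something like this (a toy sketch in Python; the "training" just memorizes demonstrations, and none of the names come from any real IDA implementation):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Model:
    answer: Callable[[str], str]


def amplify(model: Model) -> Model:
    """Knob 1: better training targets. Amplify(M) is supposed to answer more
    competently than M (roughly "M thinking for longer") -- faked here."""
    return Model(answer=lambda task: f"more careful version of ({model.answer(task)})")


def distill(data: List[Tuple[str, str]]) -> Model:
    """Knob 2 lives here: accuracy improves with more (task, answer) pairs and
    other ML tricks. This toy version just memorizes the pairs."""
    table: Dict[str, str] = dict(data)
    return Model(answer=lambda task: table.get(task, "don't know"))


def one_round(teacher: Model, tasks: List[str], amplify_teacher: bool, n_demos: int) -> Model:
    """One distillation round; the two knobs are amplify_teacher and n_demos."""
    source = amplify(teacher) if amplify_teacher else teacher
    return distill([(t, source.answer(t)) for t in tasks[:n_demos]])


if __name__ == "__main__":
    human = Model(answer=lambda task: f"human answer to {task}")
    tasks = [f"task-{i}" for i in range(10)]
    student = one_round(human, tasks, amplify_teacher=True, n_demos=5)
    print(student.answer("task-3"))
```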
Issa Rice
I'm confused about the tradeoff you're describing. Why is the first bullet point "Generating better ground truth data"? It would make more sense to me if it said instead something like "Generating large amounts of non-ground-truth data". In other words, the thing that amplification seems to be providing is access to more data (even if that data isn't the ground truth that is provided by the original human).

Also, in the second bullet point, by "increasing the amount of data that you train on" I think you mean increasing the amount of data from the original human (rather than data coming from the amplified system), but I want to confirm.

Aside from that, I think my main confusion now is pedagogical (rather than technical). I don't understand why the IDA post and paper don't emphasize the efficiency of training. The post even says "Resource and time cost during training is a more open question; I haven’t explored the assumptions that would have to hold for the IDA training process to be practically feasible or resource-competitive with other AI projects", which makes it sound like the efficiency of training isn't important.
Rohin Shah
By "ground truth" I just mean "the data that the agent is trained on", feel free to just ignore that part of the phrase. But it is important that it is better data. The point of amplification is that Amplify(M) is more competent than M, e.g. it is a better speech writer, it has a higher ELO rating for chess, etc. This is because Amplify(M) is supposed to approximate "M thinking for a longer time". Yes, that's right. Paul's posts often do talk about this, e.g. An unaligned benchmark, and the competitiveness desideratum in Directions and desiderata for AI alignment. I agree though that it's hard to realize this since the posts are quite scattered. I suspect Paul would say that it is plausibly competitive relative to training a system using RL with a fixed reward function (because the additional human-in-the-loop effort could be a small fraction of that, as long as we do semi-supervised RL well). However, maybe we train systems in some completely different way (e.g. GPT-2 style language models), it's very hard to predict right now how IDA would compare to that.

Donald Hobson


In answer to question (2):

Consider the task "Prove Fermats last theorem". This task is arguably human level task. Humans managed to do it. However it took some very smart humans a long time. Suppose you need 10,000 examples. You probably can't get 10,000 examples of humans solving problems like this. So you train the system on easier problems. (maybe exam questions? ) You now have a system that can solve exam level questions in an instant, but can't prove Fermats last theorem at all. You then train on the problems that can be decomposed into exam level questions in an hour. (ie the problems a reasonably smart human can answer in an hour, given access to this machine. ) Repeat a few more times. If you have mind uploading, and huge amounts of compute (and no ethical concerns) you could skip the imitation step. You would get an exponentially huge number of copies of some uploaded mind(s) arranged in a tree structure, with questions being passed down, and answers being passed back. No single mind in this structure experiences more than 1 subjective hour.

If you picked the median human by mathematical ability and put them in this setup, I would be rather surprised if they produced a valid proof of Fermat's Last Theorem (and if they did, I would expect it to be a surprisingly easy proof that everyone had somehow missed).

There is no way that IDA can compete with unaligned AI while remaining aligned. The question is, what useful things can IDA do?

There is no way that IDA can compete with unaligned AI while remaining aligned

How do you know that? Do you mean to say, "I really don't think IDA can compete with unaligned AI while remaining aligned"?

If you picked the median human by mathematical ability, and put them in this setup, I would be rather surprised if they produced a valid proof of Fermat's Last Theorem.

I would too. IDA/HCH doesn't have to work with the median human, though. It's ok to pick an excellent human, who has been trained for being in that situation. Paul has argued that it wouldn't be that surprising if some humans could be arbitrarily competent in an HCH-setup, even if some couldn't.

Donald Hobson
Epistemic status: intuition dump and blatant speculation.

Suppose that instead of the median human, you used Euclid in the HCH (ancient Greek, invented basic geometry). I would still be surprised if he could produce a proof of Fermat's Last Theorem (given a few hours for each H). I would suspect that there are large chunks of modern maths that he would be unable to do. Some areas of modern maths have layers of concepts built on concepts, and in some areas just reading all the definitions will take up all the time. Assuming that there are large and interesting branches of maths that haven't been explored yet, the same would hold true for modern mathematicians.

Of course, it depends how big you make the tree. You could brute-force over all possible formal proofs, and then set a copy on checking the validity of each line. But at that point you have lost all alignment: someone in the tree will find that their "proof" is a convincing argument to pass the message up the tree.

I feel that it is unlikely that any kind of absolute threshold lies between the median human and an unusually smart human, given that the gap is small in an absolute sense.