Let's ignore computational cost for now, and so consider iterated amplification without distillation, where the initial agent is some particular human. Amplification is also going to be simpler -- it just means letting the agent think twice as long.
For example, A_0 is a question-answering system that just sends me the question, and returns the answer I give after thinking about it for a day. A_n refers to the answers I'd give if I had 2^n days to think about it.
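As a toy illustration of this setup (a sketch of my own; `human_answer` and `make_A` are stand-ins for the human actually thinking, not part of the scheme itself):

```python
# Toy model of iterated amplification without distillation: amplification
# just means letting the same human think twice as long, so A_n answers
# questions the way the human would with 2**n days of thinking time.

def human_answer(question: str, days: int) -> str:
    # Stand-in for "the human thinks about `question` for `days` days".
    return f"my answer to {question!r} after {days} day(s) of thought"

def make_A(n: int):
    """Build A_n, the agent that gets a 2**n-day thinking budget."""
    def A(question: str) -> str:
        return human_answer(question, days=2 ** n)
    return A

A0 = make_A(0)   # one day of thinking
A3 = make_A(3)   # eight days of thinking
print(A3("What is Wikipedia culture like?"))
```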
Rather than talk about "human-level", let's talk about "Issa-level" -- agents need to answer questions as well as you could given a day's time.
Then, A_0 is super-Issa-level on some tasks (e.g. questions about Berkeley culture) and sub-Issa-level on others (e.g. questions about Wikipedia culture). Why is this? Well, for that example, we have different information. But also, there are presumably differences in what we were good at learning, which would have led to differences even if we had the same information. That's the answer to (2) in this context.
The answer to (3) is that with enough time and effort I could answer questions about Wikipedia culture; it would just take me a lot longer to do so relative to you.
The answer to (1) is "idk, but eventually it's possible". For my specific model, one might hope that some A_n would be an upper bound -- specifically, the one at which I'd get about as much time to answer the question as you have spent living.
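For a rough sense of the scale involved (my own back-of-the-envelope arithmetic, not a figure from the original discussion), under the convention that A_n gets 2^n days:

$$2^{13}\text{ days} = 8192\text{ days} \approx 22\text{ years}, \qquad 2^{14}\text{ days} = 16384\text{ days} \approx 45\text{ years},$$

so the first A_n whose budget exceeds a few decades of lived time sits somewhere around n = 13 or 14.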
The case with iterated distillation and amplification is basically the same:
1. Idk, but eventually it'll happen. (This does rely on the Factored Cognition hypothesis.)
2. A neural net trained by distillation will probably not replicate our skills perfectly -- what it becomes good at depends on the architecture, the training process, the training data it was given, etc. Perhaps humans are really good at social reasoning because it was strongly selected for by evolution, but we didn't give the neural net a correspondingly larger amount of training data for social situations, and so it ends up subhuman at social reasoning.
3. With enough time / computational budget, the agent can (hopefully) replicate whatever (possibly expensive) explicit chunk of reasoning underlies human performance (even if it was powered by human intuition). This is the Factored Cognition hypothesis. The addition of the distillation step is an extra confounder, but we hope that it doesn't distort anything too much -- its purpose is to improve speed without affecting anything else (though in practice it will reduce capabilities somewhat).
(I might recommend imagining that the first agent has perfect reasoning ability, except that it is very slow. This means that for any question, the first agent could answer it, given unlimited amounts of time. I wouldn't actually make this claim of IDA, but I think it is instructive for building intuitions.)
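For concreteness, here is a minimal sketch of the amplify-then-distill loop being discussed; all the helpers (`decompose`, `recombine`, `train_supervised`) are toy stand-ins of my own, not the actual procedure from any paper:

```python
# Schematic IDA loop: amplify the current agent (let it decompose a question
# into subquestions that it answers itself), then distill that slower,
# amplified agent into a fast one. Every component here is a toy stand-in.

def decompose(question):
    # Stand-in for the (human-guided) decomposition into easier subquestions.
    return [f"part {i} of {question!r}" for i in range(2)]

def recombine(question, subanswers):
    return f"answer to {question!r} assembled from {subanswers}"

def train_supervised(demonstrations):
    # Stand-in for distillation: memorise the demonstrations, answer by lookup.
    table = dict(demonstrations)
    return lambda q: table.get(q, "no learned answer")

def amplify(agent):
    def amplified(question):
        return recombine(question, [agent(q) for q in decompose(question)])
    return amplified

def distill(amplified_agent, questions):
    return train_supervised([(q, amplified_agent(q)) for q in questions])

human = lambda q: f"quick human answer to {q!r}"   # the initial agent H
agent = human
training_questions = ["Q1", "Q2"]
for _ in range(3):                                 # a few rounds of amplify-then-distill
    agent = distill(amplify(agent), training_questions)
print(agent("Q1"))
```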
The addition of the distillation step is an extra confounder, but we hope that it doesn't distort anything too much -- its purpose is to improve speed without affecting anything else (though in practice it will reduce capabilities somewhat).
I think this is the crux of my confusion, so I would appreciate it if you could elaborate on this. (Everything else in your answer makes sense to me.) In Evans et al., during the distillation step, the model learns to solve the difficult tasks directly by using example solutions from the amplification step. But if c
In answer to question (2):
Consider the task "Prove Fermat's Last Theorem". This is arguably a human-level task: humans managed to do it. However, it took some very smart humans a long time. Suppose you need 10,000 examples. You probably can't get 10,000 examples of humans solving problems like this. So you train the system on easier problems (maybe exam questions?). You now have a system that can solve exam-level questions in an instant, but can't prove Fermat's Last Theorem at all. You then train on the problems that can be decomposed into exam-level questions in an hour (i.e. the problems a reasonably smart human can answer in an hour, given access to this machine). Repeat a few more times.

If you have mind uploading, and huge amounts of compute (and no ethical concerns), you could skip the imitation step. You would get an exponentially huge number of copies of some uploaded mind(s) arranged in a tree structure, with questions being passed down and answers being passed back. No single mind in this structure experiences more than 1 subjective hour.
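A minimal sketch of that tree structure, assuming each copy gets a fixed subjective-time budget and passes subquestions down to fresh copies (the helper names are illustrative, not anything specified above):

```python
# Toy HCH-style tree: every node is a copy of the same mind with a fixed
# subjective budget (say one hour). Hard questions get split and passed down
# to child copies; answers get passed back up. The tree grows exponentially
# with depth, but no single copy ever exceeds its one-hour budget.

def is_easy(question):
    return len(question.split("/")) > 2            # toy criterion: already split twice

def answer_directly(question, budget_hours):
    return f"answer to {question!r} found within {budget_hours}h"

def split(question):
    return [f"{question}/sub{i}" for i in range(2)]

def combine(question, subanswers, budget_hours):
    return f"answer to {question!r} assembled from {len(subanswers)} subanswers"

def hch(question, depth, budget_hours=1):
    if depth == 0 or is_easy(question):
        return answer_directly(question, budget_hours)
    subanswers = [hch(q, depth - 1, budget_hours) for q in split(question)]
    return combine(question, subanswers, budget_hours)

print(hch("Prove Fermat's Last Theorem", depth=4))
```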
If you picked the median human by mathematical ability, and put them in this setup, I would be rather surprised if they produced a valid proof of Fermat's Last Theorem. (And if they did, I would expect it to be a surprisingly easy proof that everyone had somehow missed.)
There is no way that IDA can compete with unaligned AI while remaining aligned. The question is, what useful things can IDA do?
There is no way that IDA can compete with unaligned AI while remaining aligned
How do you know that? Do you mean to say, "I really don't think IDA can compete with unaligned AI while remaining aligned"?
If you picked the median human by mathematical ability, and put them in this setup, I would be rather surprised if they produced a valid proof of Fermat's Last Theorem.
I would too. IDA/HCH doesn't have to work with the median human, though. It's ok to pick an excellent human who has been trained for being in that situation. Paul has argued that it wouldn't be that surprising if some humans could be arbitrarily competent in an HCH setup, even if some couldn't.
When I first started learning about IDA, I thought that agents trained using IDA would be human-level after the first stage, i.e. that Distill(H) would be human-level. As I've written about before, Paul later clarified this, so my new understanding is that after the first stage, the distilled agent will be super-human in some respects and infra-human in others, but wouldn't be "basically human" in any sense.
But IDA is aiming to eventually be super-human in almost every way (because it's aiming to be competitive with unaligned AGI), so that raises some new questions:
1. At what point in the iteration does the resulting agent reach, and then exceed, human level on almost all tasks?
2. What prevents the distilled agent after the first stage from being human-level across the board, i.e. why is it super-human in some respects and infra-human in others?
3. How do the later amplification and distillation steps make up for the respects in which the earlier agents fall short of human level?
That brings me to Evans et al., which contains a description of IDA in section 0. The way IDA is set up in this paper leads me to believe that the answer to (2) above is that the human overseer cannot provide a sufficient number of demonstrations for the most difficult tasks. For example, maybe the human can provide enough demonstrations for the agent to learn to answer very simple questions (tasks in T0 in the paper) but it's too time-consuming for the human to answer enough complicated questions (say, in T100). My understanding is that IDA gets around this by having an amplified system that is itself automated (i.e. does not involve humans in a major way, so cannot be bottlenecked on the slowness of humans); this allows the amplified system to provide a sufficient number of demonstrations for the distillation step to work.
So in the above view, the answer to (2) is that the limitation is the number of demonstrations the human can provide, and the answer to (3) is that the human can seed the IDA process with sufficient demonstrations of easy tasks, after which the (automated) amplified system can provide sufficient demonstrations of the harder tasks. The answer to (1) is kind of vague: it's just the smallest n for which the union T0 ∪ T1 ∪ ⋯ ∪ Tn contains almost all tasks a human can do.
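Here is a sketch of that bootstrapping picture as I understand it, assuming the human only demonstrates the easiest task class and each amplified agent generates the demonstrations for the next class; the tiers and helper functions are my own placeholders, not the paper's construction:

```python
# Bootstrapping view of the demonstrations: the human can afford to
# demonstrate only the easiest tasks (T0); every later class gets its
# demonstrations from the (automated) amplified version of the agent
# trained so far, so nothing is bottlenecked on human slowness.

def human_demos(tasks):
    return [(t, f"human-written solution to {t}") for t in tasks]

def train(demonstrations):
    # Stand-in for the distillation step: memorise and look up.
    table = dict(demonstrations)
    return lambda t: table.get(t, "no learned answer")

def amplify(agent):
    # Stand-in for the amplified system: decompose a task into subtasks
    # that the current agent handles, then stitch the results together.
    def amplified(task):
        subtasks = [f"{task}-part{i}" for i in range(2)]
        return " + ".join(agent(s) for s in subtasks)
    return amplified

tiers = [["easy-1", "easy-2"], ["medium-1"], ["hard-1"]]   # stand-ins for T0, T1, T2
demonstrations = human_demos(tiers[0])                     # only T0 comes from the human
agent = train(demonstrations)
for tier in tiers[1:]:
    amplified = amplify(agent)                             # automated, not human-limited
    demonstrations += [(t, amplified(t)) for t in tier]
    agent = train(demonstrations)
print(agent("hard-1"))
```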
But the above view seems to conflict with what's in the IDA post and the IDA paper. In both of those, the amplified system is described as a human doing the decompositions (so it will be slow, or else one would need to argue that the slowness of humans decomposing tasks doesn't meaningfully restrict the number of demonstrations). Also, the main benefit of amplification is described not as the ability to provide more demonstrations, but rather as the ability to provide demonstrations for more difficult tasks. Under this alternative view, the answers to questions (1), (2), (3) aren't clear to me.
Thanks to Vipul Naik for reading through this question and giving feedback.