Just Imitate Humans?

michaelcohen

Found a few more relevant posts (which I haven't read and digested yet, but I figure I'll post here before I forget to):

ETA:

[-]cousin_it7y20

Wow, when I click these links from greaterwrong.com, they go to arbital.greaterwrong.com which loads instantly. Thanks to Said for the nice work!

[-]habryka7y70

I actually spent a bunch of time in the last few weeks fixing and updating Arbital, so it should be reasonably fast now. The Arbital pages loaded for me in less than a second.

arbital.greaterwrong is obviously still faster, but it's no longer as massive of a difference.

[-]michaelcohen7y*10

Here are some of my thoughts on these posts. Thank you again for linking them.

Against mimicry:

Humans and machines have very different capabilities. Even a machine which is superhuman in many respects may be completely unable to match human performance in others. In particular, most realistic systems will be unable to exactly mimic human performance in any rich domain.

In light of this, it’s not clear what mimicry even means, and a naive definition won’t do what we want.

I don’t understand why an approximation of optimal sequence prediction doesn’t do what we want. That makes the objective minimizing the KL-divergence from the human policy to the imitation policy, but I think it is easier to think of this as just proper Bayesian updates (approximately). When there are too few samples, or when a bad approximation of optimal prediction is used, the imitator could fail, as the blocks example describes. But a) it will learn to do everything that a human can do that it “can” learn, and b) the complaint that what we really want is for the imitator to just solve the task is just a wish for safe AGI. Yes, if better, more capable options than imitation can be shown to be safe, they will be superior.
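To spell out the link between prediction and the KL objective, here is a short worked identity. The notation is an assumption introduced for illustration: $\pi_H$ is the human policy, $\hat{\pi}$ is the imitation policy, $h$ is the interaction history so far, and $a$ is the next action. The expected log loss of the imitator on human actions decomposes as

$$
\mathbb{E}_{a \sim \pi_H(\cdot \mid h)}\left[-\log \hat{\pi}(a \mid h)\right]
= H\left(\pi_H(\cdot \mid h)\right)
+ D_{\mathrm{KL}}\left(\pi_H(\cdot \mid h) \,\|\, \hat{\pi}(\cdot \mid h)\right)
$$

The entropy term $H(\pi_H(\cdot \mid h))$ does not depend on $\hat{\pi}$, so minimizing expected log loss (what a sequence predictor does) is the same as minimizing the KL-divergence from the human policy to the imitation policy.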

Mimicry and Meeting Halfway:

It would be great if there were some way to get the best of both worlds [between approval directed agents (good for weak reasoners) and imitation (good for strong reasoners)]

We’ll be able to teach Arthur to achieve the task X if it can be achieved by the “intersection” of Arthur and Hugh

If I’m understanding correctly, this seems more like getting the worst of both worlds. (Or at least doing no better than imitation).

Also, the generator (i.e. the agent) has an incentive to take over the world to shut off the discriminator.

Edit: I was ascribing too much agent-ness to the generator, which might be relevant for future GAN-inspired stuff, but for current versions of GANs, its only conception of the discriminator is its gradient update, and it doesn't believe the output of the discriminator depends on the state of the world. Depending on the internals of the discriminator, this incentive might reappear, but I'm not sure.

Reliable prediction:

I think this is a question of confidence calibration. I don’t know how to tractably approximate ideal reasoning, but I don’t think this really jeopardizes imitation learning.

Safe training procedure for human-imitators:

How do we train a reinforcement learning system to imitate a human producing complex outputs such as strings?

Supervised learning suffices: tractably approximate ideal reasoning. I know this is a non-answer, but I don’t know the details of how to do this. This most naturally falls under the retrodiction category in the article. The “tractable approximations” which computational complexity problems threaten nonetheless seem attainable to me given the existence of humans.

Selective similarity metrics for imitation:

These are some interesting ideas on the problem that I am abstracting away regarding tractably approximating optimal sequence prediction.

Whole brain emulation and DL imitation learning:

Seems reasonable. Worth stressing something I think Gwern would agree with: a WBE inspired DL architecture for an artificial agent is definitely not going to make it safe by default.

Imitation Learning Considered Unsafe?:

1) Training a flexible model with a reasonable simplicity prior to imitate (e.g.) human decisions (e.g. via behavioral cloning) should presumably yield a good approximation of the process by which human judgments arise, which involves a planning process.

2) We shouldn't expect to learn exactly the correct process, though.

3) Therefore imitation learning might produce an AI which implements an unaligned planning process, which seems likely to have instrumental goals, and be dangerous.

If I’m understanding correctly, the concern is that the imitator learns how humans plan before learning what humans want, so then it plans like a human toward the achievement of some inhuman goal. I don’t think this causes an existential catastrophe. Human-like planning in the service of very roughly human-like goals just doesn’t seem to me to be similar at all to take-over-the-world behavior.

The AI that Pretends to be Human:

This seems quantilization-like, but without the formal guarantees of quantilization. I like quantilization a lot. I intend to think more about whether it could be extended to a multi-action instead of single-action definition, and whether one could use an approximate human policy rather than a perfect one.
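For readers unfamiliar with quantilization, a minimal single-action sketch follows. It assumes a finite action set and a perfectly known base (human) policy; the names `quantilize`, `base`, `utility`, and the example numbers are hypothetical and chosen only for illustration. A q-quantilizer samples from the base policy conditioned on the action landing in the top q fraction of base probability mass, ranked by utility.

```python
def quantilize(base, utility, q):
    """Return the q-quantilized distribution over actions.

    base:    dict mapping action -> probability under the base (human) policy
    utility: dict mapping action -> estimated utility
    q:       fraction of base probability mass to keep, 0 < q <= 1
    """
    # Rank actions from highest to lowest utility.
    ranked = sorted(base, key=lambda a: utility[a], reverse=True)
    remaining = q
    out = {}
    for action in ranked:
        if remaining <= 1e-12:  # tolerance for floating-point leftovers
            break
        # Keep as much of this action's base mass as still fits in the top q.
        mass = min(base[action], remaining)
        out[action] = mass / q  # renormalize so the kept mass sums to 1
        remaining -= mass
    return out


# Hypothetical example: three actions, keep the top half of the base mass.
base = {"a": 0.5, "b": 0.3, "c": 0.2}
utility = {"a": 1.0, "b": 3.0, "c": 2.0}
quantilized = quantilize(base, utility, q=0.5)
```

By construction, no action's probability exceeds its base probability divided by q, which is the source of the formal guarantee: any outcome that is unlikely under the base policy stays at most 1/q times as likely under the quantilizer.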

Elaborations on Apprenticeship Learning:

Rather than imitating human behavior, the AI system imitates the behavior of a human who has access to a collection of AI assistants. These assistants can also be trained using AL-with-bootstrapping. In principle, such a process could scale well past human level.

This seems to be HCH (the prediction version). One reason why I think HSIFAUH might be superior is that if you have a bunch of copies of HSIFAUH that are in a flexible management hierarchy, intelligent agents can be in charge of allocating resources effectively between instances, and restructuring communication protocols, whereas with HCH, there is the fixed tree hierarchy. More critically, if I’m understanding HCH correctly, it is trained by having an actual human with access to the freshest version of HCH, and then HCH gets trained on the human’s output. If the real human is the “manager”, or if the human eventually assumes that role, there is never any more training on the subtasks, like making a good spreadsheet for a manager to look at. A training regimen for the human could be designed ad hoc around when to query the actual human with different subtasks, or you could use the approach of HSIFAUH to query a human when there is sufficient expected information gain. But I think the capabilities of vanilla HCH depend a lot on how you design the set of tasks it is trained on.

Counterfactual human-in-the-loop:

The situation that this proposal is designed for, if I’m understanding correctly, is that we have an otherwise unaligned and otherwise dangerous AGI, but if it attempted to take over the world, a human would recognize its behavior as dangerous, and can step in. This proposal is to replace the human with a human imitation to make it all more efficient. If we are in that situation, I agree this is a good proposal for a speedup. I don’t think we will find ourselves in that situation.

[-]David Scott Krueger7y30

RE: "Imitation learning considered unsafe?" (I'm the author):

The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.

I agree with your response; this is also why I said: "Mistakes in imitating the human may be relatively harmless; the approximation may be good enough".

I don't agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).

[-]michaelcohen7y40

I don't agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).

Thanks for the clarification. Consider the sort of relatively simple, super-human planning algorithm that, for most goals, would lead the planner/agent to take over the world or do similarly elaborate and impactful things in the service of whatever goal is being pursued. A Bayesian predictor of the human's behavior will consider the hypothesis $H_{g}$ that the human does the sort of planning described above in the service of goal $g$. It will have a corresponding hypothesis for each such goal $g$. It seems to me, though, that these hypotheses will be immediately eliminated. The human's observed behavior won't include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form $H_{g}$. A hypothesis which says that the observed behavior is the output of human-like planning in the service of some goal which is slightly incorrect may maintain some weight in the posterior after a number of observations, but I don't see how "dangerously powerful planning + goal" remains under consideration.
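A toy sketch of this elimination argument, under loudly stated assumptions: the three hypotheses, the uniform prior, and the per-step probabilities of a "mundane" (non-takeover) action are all invented numbers for illustration, and each observation is treated as independent given the hypothesis.

```python
import math


def posterior_after_mundane(prior, p_mundane, n_obs):
    """Posterior over hypotheses after observing n_obs mundane actions.

    prior:     dict mapping hypothesis -> prior probability
    p_mundane: dict mapping hypothesis -> probability it assigns to a
               mundane (non-takeover) action at each step
    """
    # Work in log space so tiny likelihoods do not underflow.
    log_w = {h: math.log(prior[h]) + n_obs * math.log(p_mundane[h]) for h in prior}
    top = max(log_w.values())
    w = {h: math.exp(lw - top) for h, lw in log_w.items()}
    total = sum(w.values())
    return {h: wi / total for h, wi in w.items()}


# Hypothetical numbers: a powerful planner with goal g would almost
# never behave mundanely, so H_g assigns mundane actions low probability.
prior = {
    "human_like_correct_goal": 1 / 3,
    "human_like_slightly_wrong_goal": 1 / 3,
    "powerful_planner_goal_g": 1 / 3,
}
p_mundane = {
    "human_like_correct_goal": 0.99,
    "human_like_slightly_wrong_goal": 0.97,
    "powerful_planner_goal_g": 0.01,
}
post = posterior_after_mundane(prior, p_mundane, n_obs=20)
```

With these made-up numbers, twenty mundane observations drive the weight on the powerful-planner hypothesis to a negligible value, while the slightly-wrong-goal hypothesis keeps a substantial share of the posterior. The sketch says nothing about learners that are not approximately Bayesian.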

The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.

I suppose the point of human imitation is to produce a weak, conservative, lazy, impact-sensitive mesa-optimizer, since humans are optimizers with those qualifiers. If it weren't producing a mesa-optimizer, something would have gone very wrong. So this is a good point. As for whether this is dangerous, I think the discussion above is the place to focus.

[-]David Scott Krueger7y20

A Bayesian predictor of the human's behavior will consider the hypothesis $H_{g}$ that the human does the sort of planning described above in the service of goal $g$. It will have a corresponding hypothesis for each such goal $g$. It seems to me, though, that these hypotheses will be immediately eliminated. The human's observed behavior won't include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form $H_{g}$.

This is a very good argument, and I'm still trying to decide how decisive I think it is.

In the meantime, I'll mention that I'm imagining the learner as something closer to a DNN than a Bayesian predictor. One image of how DNN learning often proceeds is as a series of "aha" moments (generating/revising highly general explanations of the data) interspersed/intermingled with something more like memorization of data-points that don't fit the current general explanations. That view makes it seem plausible that "planning" would emerge as an "aha" moment before being refined as "oh wait, bounded planning... with these heuristics... and these restrictions...", creating a dangerous window of time between "I'm doing planning" and "I'm planning like a human, warts and all".

[-]Wei Dai7y70

Eliezer talked about human imitations quite a bit in Challenges to Christiano’s capability amplification proposal, specifically the safety implications of deviations from perfect imitation.

I've discussed (some difficulties of) imitating humans as a way to accelerate philosophical progress.

I also share shminux's concern about humans (and hence human imitations) not being safe in extreme/unusual circumstances, and have discussed it under "human safety problems".

ETA: However I think it's definitely worth investigating further.
