This idea is so simple that I'm sure it's been had by someone somewhere.

Suppose we have some method to make really smart honest AIs that do not have goals. Let's say it's a yes/no oracle. Our aimless ace.

But we want to accomplish stuff! AIcorp wants the printmoneynow.py. I'm worried that if I train the ace to fold fitted sheets and unkink the garden hose then it will go overboard and fold my femur bone and unkink the Yellowstone super volcano.

So let's use a bit-less-smart model to do tasks. The active amateur.

We'll ask the ace if the amateur is plotting something. If yes, we shut down the amateur. Every day we also ask "is today the last day we can shut down the amateur?"

(We do not generate code with the ace!)
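Concretely, the daily check might look something like this minimal sketch (AceOracle, AmateurAgent, and their methods are placeholders I'm making up here, not any real API):

```python
# Hypothetical sketch of the daily oversight loop. The ace and amateur
# are stand-in stubs; the real systems are assumed to exist already.

class AceOracle:
    """A goalless yes/no oracle, assumed honest and smarter than the amateur."""
    def ask(self, question: str) -> bool:
        raise NotImplementedError("stand-in for the hypothetical ace")

class AmateurAgent:
    """A less-capable agent that actually does tasks."""
    def work_on(self, task: str) -> None:
        raise NotImplementedError("stand-in for the hypothetical amateur")
    def shut_down(self) -> None:
        print("amateur shut down")

def daily_check(ace: AceOracle, amateur: AmateurAgent, task: str) -> bool:
    """Run one day of the protocol. Returns False if the amateur was shut down."""
    # Ask the ace about the amateur's intentions before letting it act.
    if ace.ask("Is the amateur plotting something harmful?"):
        amateur.shut_down()
        return False
    # Also ask whether we're about to lose the ability to shut it down.
    if ace.ask("Is today the last day we can shut down the amateur?"):
        amateur.shut_down()
        return False
    amateur.work_on(task)
    return True
```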

So we get super-smart supervision of a decent task AI. The temptation (i.e. competitive pressure) to agentify the oracle is still there, but it's no longer overwhelming.

Maybe we can even have the amateur write ace_v2, have ace_v1 check ace_v2, switch to ace_v2, have amateur_v1 write amateur_v2, check it with ace_v2, then switch to amateur_v2.
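A similarly hand-wavy sketch of one bootstrap step, with the same caveat that write_successor, build, and friends are made-up placeholders:

```python
# Hypothetical sketch of one bootstrap step: amateur_v1 writes ace_v2,
# ace_v1 vets it, we switch, then the upgraded ace vets amateur_v2.

def build(code: str):
    """Placeholder for however we would actually instantiate a model from its spec."""
    raise NotImplementedError

def bootstrap_step(ace_v1, amateur_v1):
    """Return (ace_v2, amateur_v2), or raise if either check fails."""
    # The amateur drafts the next oracle; the current oracle judges it.
    ace_v2_code = amateur_v1.write_successor("ace")
    if not ace_v1.ask(f"Is this a safe, honest, goalless oracle?\n{ace_v2_code}"):
        raise RuntimeError("ace_v1 rejected ace_v2")
    ace_v2 = build(ace_v2_code)

    # The amateur drafts its own successor; the *new* oracle judges it.
    amateur_v2_code = amateur_v1.write_successor("amateur")
    if not ace_v2.ask(f"Is this amateur safe to deploy?\n{amateur_v2_code}"):
        raise RuntimeError("ace_v2 rejected amateur_v2")
    return ace_v2, build(amateur_v2_code)
```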


If this looks like it's avoiding the hard parts of the alignment problem, that's because it is! But does something actually go wrong? Was it too big an assumption that we could build an aimless honest ace?
