
A putative new idea for AI control; index here.

I've got a partial design for motivating an AI to improve human understanding.

However, the AI is rewarded for generic human understanding of many variables, most of them quite pointless from our perspective. Can we motivate the AI to ensure our understanding of the variables we find important? The presence of free humans, say, rather than air pressure in Antarctica?


It seems we might be able to do this. First, because we've defined human understanding, we can define accuracy-increasing statements: statements that, if given to a human, will improve their understanding.
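
In purely illustrative notation (standing in for whatever understanding measure we've already defined): write $U(h, w)$ for human $h$'s understanding of world $w$, and $h \oplus s$ for that human after being told the statement $s$. Then $s$ is accuracy-increasing when

$$U(h \oplus s, w) > U(h, w).$$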

So I'd say that a human understands the important facts about a real or hypothetical world if there are no short accuracy-increasing statements that would cause them to radically change their assessment of the quality of that world (AI: "oops, forgot to mention everyone is under the influence of brain slugs - will that change your view of this world?").
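
Roughly, in the same illustrative notation, with $V(h, w)$ for the human's assessment of the quality of world $w$, $\ell$ a bound on statement length, and $\delta$ a threshold for what counts as a radical change: the human $h$ understands the important facts about $w$ if there is no accuracy-increasing statement $s$ with $|s| \le \ell$ such that

$$\big| V(h \oplus s, w) - V(h, w) \big| > \delta.$$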

Of course, it is possible to present true, informative facts in the worst possible light, so it might be better to define this as "if there are short accuracy-increasing statements that would cause them to radically change their assessment of the quality of that world, there are further accuracy-increasing statements that will undo that change". (AI: "actually, brain slugs are a popular kind of hat where people store their backup memories").
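
In the same sketch, the weaker condition reads: for every accuracy-increasing $s$ with $|s| \le \ell$ and $\big| V(h \oplus s, w) - V(h, w) \big| > \delta$, there is a further accuracy-increasing statement $s'$ with

$$\big| V(h \oplus s \oplus s', w) - V(h, w) \big| \le \delta.$$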
