tl;dr: there is no natural category called "wireheading", only wireheading relative to some desired ideal goal.

Suppose that we have built an AI and have invited a human H to help test it. The human H is supposed to press a button B if the AI seems to be behaving well. The AI's reward is entirely determined by whether or not H presses B.

So the AI manipulates or tricks H into pressing B. A clear case of the AI wireheading itself.

Or is it? Suppose H was a meddlesome government inspector that we wanted to keep away from our research. Then we want H to press B, so we can get them out of our hair. In this case, the AI is behaving entirely in accordance with our preferences. There is no wireheading involved.

Same software, exhibiting the same behaviour, and yet the first case is wireheading and the second isn't. What gives?

Well, in the first case, pressing the button was a proxy goal for our true goal, so manipulating H into pressing it was wireheading: that wasn't what we intended. In the second case, the proxy goal is the true goal, so maximising the proxy isn't wireheading; it's efficiency. So it seems that wireheading can only be defined relative to what we actually want to accomplish.
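To make the relativity explicit, here's a toy sketch (the outcome dictionary, the specific utility functions, and the comparison are all illustrative assumptions, not anything formal from the post): the very same outcome counts as wireheading under one intended goal and as straightforward success under another.

```python
# Toy illustration: "wireheading" is not a property of the behaviour alone;
# it only appears when we compare the proxy reward against an intended goal.

def proxy_reward(outcome: dict) -> float:
    """The reward the AI actually receives: 1 if the button B is pressed."""
    return 1.0 if outcome["button_pressed"] else 0.0

def intended_goal_testing(outcome: dict) -> float:
    """Scenario 1: we wanted B pressed only if the AI genuinely behaved well."""
    return 1.0 if outcome["button_pressed"] and outcome["behaved_well"] else 0.0

def intended_goal_inspector(outcome: dict) -> float:
    """Scenario 2: we just want the inspector H satisfied and gone."""
    return 1.0 if outcome["button_pressed"] else 0.0

def is_wireheading(outcome: dict, intended_goal) -> bool:
    """An outcome is (toy-)wireheading iff the proxy scores it higher
    than the goal we actually cared about."""
    return proxy_reward(outcome) > intended_goal(outcome)

# The AI manipulates H into pressing B without actually behaving well:
manipulated = {"button_pressed": True, "behaved_well": False}

print(is_wireheading(manipulated, intended_goal_testing))    # True  -> wireheading
print(is_wireheading(manipulated, intended_goal_inspector))  # False -> just efficiency
```

The same outcome and the same proxy reward are fed into both checks; only the intended goal differs, and that difference alone determines whether the label "wireheading" applies.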

In other domains

I similarly suspect that wireheading-style failures in value learning, low impact, and corrigibility also depend on a specification of our values and preferences - or at least a partial specification. The more I dig into these areas, the more I'm convinced they require partial value specification in order to work: they are not fully value-agnostic.

Comments

Is this analogous to the stance-dependency of agents and intelligence?

It is analogous, to some extent; I look into some aspects of Daniel Dennett's classification here: https://www.youtube.com/watch?v=1M9CvESSeVc

I also had a more focused attempt at defining AI wireheading here: https://www.lesswrong.com/posts/vXzM5L6njDZSf4Ftk/defining-ai-wireheading

I think you've already seen that?