AI ALIGNMENT FORUM

The best value indifference method (so far)

by Stuart_Armstrong
2nd Jun 2017
9 comments
jessicata:

If we apply this to the shutdown problem, is it acceptable to say:

$\hat{P}(\cdot \mid h_t)$ = 100% $U_N$ if the button has not been pressed in $h_t$

$\hat{P}(\cdot \mid h_t)$ = 100% $U_S$ otherwise

If not, what would you set $\hat{P}$ to? (I'm treating $U_N$ and $U_S$ as reward functions here, which seems fine.)

Stuart_Armstrong:

For policies/actions that don't affect the probability of humans pressing the button, $\hat{P} = P$.

For actions that do affect the probability a little bit, the effect of $\hat{P}$ will be to undo this, by, for instance, slightly increasing the probability of $U_S$ given the button was pressed.

I'm not completely sure what multiple actions with large changes of probability would lead to (in expectation, nothing, but in actual fact...)

jessicata:

Hmm... I'm finding that I'm unable to write down a simple shutdown problem in this framework (e.g. an environment where it should switch between maximizing paperclips and shutting down) to analyze what this algorithm does. To know what the algorithm does, I need to know what $P$ and $\hat{P}$ are (since these are parameters of the algorithm). From those I can derive $P'$ and $\hat{P}'$ to determine the agent's action. But at the moment I have no way of proceeding, since I don't know what $P$ and $\hat{P}$ are. Can you get me unstuck?

Stuart_Armstrong:

Suppose the humans have already decided whether to press the shutdown button or order the AI to maximise paperclips. If $o_s$ is the observation of the shutdown command and $o_p$ the observation of the paperclip-maximising command, and $u_s$ and $u_p$ are the relevant utilities, then $P$ can be defined as $P(u_s \mid h_{m-1} o_s) = 1$ and $P(u_p \mid h_{m-1} o_p) = 1$, for all histories $h_{m-1}$.

Then define $\hat{P}$ as the probability of $o_s$ versus $o_p$, conditional on the fact that the agent follows a particular deterministic policy $\pi_0$.

If the agent does indeed follow $\pi_0$, then $\hat{P} = \hat{P}'$. If it varies from this policy, then $\hat{P}'$ is altered in proportion to the expected change in $\hat{P}$ caused by choosing a different action.

jessicata:

It seems like you're taking $P(\cdot \mid h_t)$ to be a real number. Is $\cdot$ a specific event?

jessicata:

"Because of the specific time restriction, there is no way to randomise the outcome ahead of time. And because it's assumed tied to a specific physical event, there is no way to influence it at all. The whole physical definition and apparatus serve the purpose of making biasing the only way to affect the result."

Hmm, I don't understand. Of course it is possible to influence the button push without biasing it (e.g. create a robot that flips a coin and then pushes or doesn't push the button). And of course it's not possible to influence the quantum event in any way (including by biasing it). So I don't see any event that can't be influenced in any way except by biasing it.

Stuart_Armstrong:

The way I'm using the term, unbiased influence involves replacing the stochastic event with another one that has the same mean. But since (or if) the quantum event is specifically defined in the process, this can't be done.

jessicata:

Let me ask a more specific question. In your setup with the quantum event Q and the button B, can you define the event E such that:

  1. The agent can influence E by biasing E.
  2. The agent can't influence E without biasing E.

Clearly, E≠Q and E≠B, so I don't know what E is. (I interpreted you as saying there is such an E; let me know if this is incorrect)

Stuart_Armstrong:

You are correct and I'm wrong. The causal counterfactual is unbiased and uninfluenceable. The evidential counterfactual is both biased and influenceable. I'll correct the post.


A putative new idea for AI control; index here.

When dealing with the problem of bias, I stumbled upon what I believe is the best way of getting value indifference, one that solves almost all of the problems with the previous methods.


The problems with the old setups

My initial setup has a counterfactual which I'll call a causal counterfactual, following Jessica's terminology. Her whole post is a good way of illustrating what is happening. A button is due to be pressed or not pressed, by a human; this pressing will change the agent's utility function.

There are low-probability quantum events that will artificially depress the button or prevent it from being pressed. The agent only cares about those low-probability events, which allows us to set up the formalism without worrying about changes in the probability of the human pressing the button. Then the events A="the human presses the button" and A'="the quantum event presses the button" were taken to be interchangeable, as were B="the human doesn't press the button" and B'="the quantum event prevents the button from being pressed".

The main problem is that these events are not really the same. So the agent will behave as if certain events are impossible (such as the human genuinely pressing the button), and is likely to face human resistance if the human tries to press the button and the button doesn't press due to quantum events, or vice versa. This resistance will either lower the expected reward of the agent, or lead it to manipulate the human's actions.

Benya has sketched out a problem with another approach to counterfactuals, which we can call "evidential counterfactuals". Here we're trying to fiddle with utilities to enforce that the agent treats the probability of A and B as being fixed (no A' and B' here).

The problem with that is that the agent can change the probabilities conditional on some rare event, but then behave as if the probabilities were still unchanged.

For instance, the agent could ensure that A only happens if it wins the lottery. And then it would behave, in effect, as if it had a large chance of winning the lottery (equal to the initial chance of A happening, before the agent took any action). That problem is illustrated here.

A final problem was that neither method allowed the agent to learn usefully: the methods did not distinguish between the agent merely learning that humans intended/didn't intend to press the button (which is presumably fine), versus the agent influencing the humans to do so (which isn't).

Solving the problem: prior modification

Here we'll try and solve the problem in the reward learning POMDP framework, but the solution is easy to generalise. Note that the changing of probabilities of A versus B is similar to biasing, so this approach will attempt to make the agent bias-indifferent in a sensible way.

To do this, take a standard reward learning agent, with posterior $P$ and prior $\hat{P}$ on the POMDP $\mu$.

The agent will be modified as follows: it will be given a different prior $\hat{P}'$ (which will ultimately lead to a different posterior $P'$).

The priors $\hat{P}'$ and $\hat{P}$ will start off equal on the empty history:

  • $\hat{P}'(\cdot \mid \emptyset) = \hat{P}(\cdot \mid \emptyset)$.

Then $\hat{P}'$ will be defined inductively. Recall that action $a$ is unbiased given history $h_t$ if $\hat{P}(\cdot \mid h_t) = \mathbb{E}^a_\mu[\hat{P}(\cdot \mid h_{t+1}) \mid h_t]$. Define the bias of $a$ as:

  • $B(h_t, a) = \hat{P}'(\cdot \mid h_t) - \mathbb{E}^a_\mu[\hat{P}(\cdot \mid h_{t+1}) \mid h_t]$.

Thus $B(h_t, a)$ is the bias of action $a$, as measured between $\hat{P}'$ and the expectation of $\hat{P}$. This bias is used as a corrective term to $\hat{P}'$, to make the agent suitably indifferent to biasing actions.
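
As a rough illustration of this definition, the bias can be computed componentwise once the possible next-step values of $\hat{P}$ and their probabilities are known. The sketch below assumes a finite set of rewards and a finite set of possible observations; the function names and all the numbers are hypothetical and are not taken from the post.

```python
def expected_next(next_values, probs):
    """E^a_mu[P_hat(. | h_{t+1}) | h_t]: probability-weighted mean of the
    possible next-step values of P_hat (each a list of probabilities)."""
    n = len(next_values[0])
    return [sum(p * q[i] for q, p in zip(next_values, probs)) for i in range(n)]


def bias(p_prime_now, next_values, probs):
    """B(h_t, a) = P_hat'(. | h_t) - E^a_mu[P_hat(. | h_{t+1}) | h_t]."""
    mean = expected_next(next_values, probs)
    return [c - m for c, m in zip(p_prime_now, mean)]


# Hypothetical example with two rewards (R0, R1).
p_prime_now = [0.5, 0.5]                # P_hat'(. | h_t)
next_values = [[0.8, 0.2], [0.4, 0.6]]  # possible values of P_hat(. | h_{t+1})
probs = [0.5, 0.5]                      # probability of each value, given h_t and a

b = bias(p_prime_now, next_values, probs)
print(b)  # roughly [-0.1, 0.1]: the action pushes P_hat towards R0 in expectation
```

Since both terms in the definition are probability distributions, the components of the bias always sum to zero.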

Indifference in this setting is defined by three criteria:

  1. The agent cannot benefit from a biasing action if it gets no further information about the correct reward.
  2. The agent cannot benefit from a biasing action if it expects to immediately get perfect information about what the correct reward is.
  3. Given the above, $\hat{P}'$ maintains the distribution and structure of $\hat{P}$ as much as possible.

The first criterion is implied by $\hat{P}'(\cdot \mid h_t) = \mathbb{E}^a_\mu[\hat{P}'(\cdot \mid h_{t+1}) \mid h_t]$ for all actions $a$, i.e. by $\hat{P}'$ being unbiased. (If there is no further information about the correct reward, then there is a single well-defined $\hat{P}'(\cdot \mid h_{t+1})$, conditional on $h_t$ and $a$, and being unbiased means that this is equal to $\hat{P}'(\cdot \mid h_t)$.)

Then, given that $\hat{P}'$ is unbiased, the second criterion simply means that $\hat{P}'$ remains a distribution over $\mathcal{R}$ (as the consequence of perfect information is just a weighted average of the "pure $R_i$" points). This seems obvious: what would $\hat{P}'$ be but a distribution over $\mathcal{R}$? See the next sections for what this criterion really means.

Indifference for small biases

Let $S_{h_t,a} = \{\hat{P}(\cdot \mid h_{t+1}) \mid h_t, a\}$ be the set of possible future values of $\hat{P}$ (given the history $h_t$ and the action $a$). Note that $S_{h_t,a}$ is a subset of the simplex $\Delta\mathcal{R}$, the set of probability distributions on $\mathcal{R}$.

The bias $B(h_t, a)$ is "small" if, for all $q \in S_{h_t,a}$, $q + B(h_t, a)$ is also an element of the simplex $\Delta\mathcal{R}$.

In that case, $\hat{P}'(\cdot \mid h_{t+1})$ is simply defined as $\hat{P}(\cdot \mid h_{t+1}) + B(h_t, a)$. By assumption, this is an element of $\Delta\mathcal{R}$. The expectation of this expression is:

  • $\mathbb{E}^a_\mu[\hat{P}'(\cdot \mid h_{t+1}) \mid h_t] = \mathbb{E}^a_\mu[\hat{P}(\cdot \mid h_{t+1}) \mid h_t] + B(h_t, a) = \hat{P}'(\cdot \mid h_t)$.

Thus $\hat{P}'$ is unbiased.

Since this is simply a translation, it preserves the structure of $\hat{P}$, and so satisfies all the criteria above.
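
To make the small-bias rule concrete, here is a minimal sketch that translates every possible next-step value by the bias, checks that each translated value stays in the simplex, and confirms that the translated values average back to $\hat{P}'(\cdot \mid h_t)$. It assumes two rewards and two possible observations; the helper names and the numbers are hypothetical.

```python
def weighted_mean(values, probs):
    """Probability-weighted mean of a list of distributions."""
    n = len(values[0])
    return [sum(p * v[i] for v, p in zip(values, probs)) for i in range(n)]


def in_simplex(v, tol=1e-9):
    """True if v is a probability distribution, up to a numerical tolerance."""
    return all(x >= -tol for x in v) and abs(sum(v) - 1.0) <= tol


# Hypothetical example with two rewards (R0, R1).
p_prime_now = [0.5, 0.5]                # P_hat'(. | h_t)
next_values = [[0.8, 0.2], [0.4, 0.6]]  # possible values of P_hat(. | h_{t+1})
probs = [0.5, 0.5]                      # probability of each value, given h_t and a

mean = weighted_mean(next_values, probs)
b = [c - m for c, m in zip(p_prime_now, mean)]  # B(h_t, a)

# Small-bias rule: P_hat'(. | h_{t+1}) = P_hat(. | h_{t+1}) + B(h_t, a).
shifted = [[qi + bi for qi, bi in zip(q, b)] for q in next_values]

is_small = all(in_simplex(q) for q in shifted)  # every translate is a distribution
new_mean = weighted_mean(shifted, probs)        # should match p_prime_now
```

In this example the translated values are roughly (0.7, 0.3) and (0.3, 0.7): the agent still learns from the observation, but the expected shift towards R0 has been cancelled.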

Indifference for large biases

If the bias is large, in that there exists a possible value of $\hat{P}(\cdot \mid h_{t+1})$ with $\hat{P}(\cdot \mid h_{t+1}) + B(h_t, a)$ not a point of $\Delta\mathcal{R}$, then we need to proceed differently.

As before, let $S_{h_t,a} = \{\hat{P}(\cdot \mid h_{t+1}) \mid h_t, a\}$ be the set of possible future values of $\hat{P}$ (given the history $h_t$ and the action $a$), and for $q \in S_{h_t,a}$, let $p(q)$ be the probability of $q$, given $h_t$ and $a$.

Then we want to replace $q$ with $\tau(q)$, where $\tau(q)$ is "as close to" $q + B(h_t, a)$ as possible. Since $\Delta\mathcal{R}$ embeds in $\mathbb{R}^{\mathcal{R}}$, the Euclidean metric $\|\cdot\|$ on the latter restricts to the former.

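Thus consider the constrained optimisation problem for $\tau$:

  • Minimise $\sum_{q \in S_{h_t,a}} p(q)\, \|\tau(q) - (q + B(h_t, a))\|^2$ subject to:
    1. $\sum_{q \in S_{h_t,a}} p(q)\, \tau(q) = \hat{P}'(\cdot \mid h_t)$,
    2. $\forall q \in S_{h_t,a}: \tau(q) \in \Delta\mathcal{R}$.

Then define $\hat{P}'(\cdot \mid h_{t+1})$ as $\tau(\hat{P}(\cdot \mid h_{t+1}))$.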

If we see $\hat{P}(\cdot \mid h_{t+1})$ and $\hat{P}'(\cdot \mid h_{t+1})$ as random variables dependent on $h_t$ and $a$, the optimisation problem is the same as saying that $\hat{P}'$ is bias-free while the difference $\hat{P}'(\cdot \mid h_{t+1}) - \hat{P}(\cdot \mid h_{t+1})$ has minimal variance (its mean is fixed at $B(h_t, a)$ by the first constraint).

The constraints are not contradictory: for instance, setting $\tau(q) = \hat{P}'(\cdot \mid h_t)$ for all $q$ will satisfy them. In fact they are all affine equality or inequality constraints, so the feasible set is convex. Then there must exist a unique set of elements $\tau(q)$ that minimises the strictly convex quadratic function.

And obviously, if $q + B(h_t, a)$ is always in $\Delta\mathcal{R}$, then $\tau(q) = q + B(h_t, a)$ is the optimal solution, so this optimisation reproduces the "small biases" case.
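
For intuition, the optimisation can be solved directly when there are only two rewards. In that case each $\tau(q)$ is determined by the single number $x_q$, the probability it gives to $R_0$, and the squared Euclidean distance is just twice the squared difference in that coordinate. The sketch below is one possible way to solve this special case, not the post's own procedure: it uses the fact that the optimum has the form $x_q = \mathrm{clip}(t_q - \lambda, 0, 1)$ for a single multiplier $\lambda$, and finds $\lambda$ by bisection. The function name and the lottery-style numbers are assumptions made for illustration.

```python
def solve_two_reward_case(targets, probs, required_mean, iterations=200):
    """Minimise sum_q p(q) * (x_q - targets[q])**2
    subject to sum_q p(q) * x_q = required_mean and 0 <= x_q <= 1,
    where x_q is the probability that tau(q) assigns to R0.

    The optimum is x_q = clip(targets[q] - lam, 0, 1) for one multiplier lam,
    which is located by bisection (the mean is non-increasing in lam).
    """
    def clip(x):
        return min(1.0, max(0.0, x))

    def mean_for(lam):
        return sum(p * clip(t - lam) for t, p in zip(targets, probs))

    lo = min(targets) - 1.0  # every x_q clips to 1, so the mean is 1
    hi = max(targets)        # every x_q clips to 0, so the mean is 0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if mean_for(mid) > required_mean:
            lo = mid         # mean too high: shift the targets further down
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    return [clip(t - lam) for t in targets]


# Hypothetical large-bias example: an action ties R0 to a one-in-a-million event.
p_win = 1e-6
probs = [p_win, 1.0 - p_win]   # p(q) for the "win" and "lose" branches
q_r0 = [1.0, 0.0]              # P_hat(R0 | h_{t+1}) on each branch
prior_r0 = 0.5                 # P_hat'(R0 | h_t)

bias_r0 = prior_r0 - sum(p * q for p, q in zip(probs, q_r0))
targets = [q + bias_r0 for q in q_r0]  # q + B(h_t, a); the first exceeds 1

tau = solve_two_reward_case(targets, probs, prior_r0)
```

Here the "win" branch is pinned to the corner of the simplex (probability 1 on $R_0$), and the "lose" branch absorbs the remainder so that the weighted mean is still 0.5.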

The final values

This alternate prior leads to an alternate posterior $P'$, simply defined by having it equal to $\hat{P}'$ on complete histories: $P'(\cdot \mid h_m) = \hat{P}'(\cdot \mid h_m)$.

Another alternative

It should be noted that if we're willing to drop the condition "The agent cannot benefit from a biasing action if it expects to immediately get perfect information about what the correct reward is", then there's a simpler solution: simply always define $\hat{P}'(\cdot \mid h_{t+1})$ as $\hat{P}(\cdot \mid h_{t+1}) + B(h_t, a)$, applying the solution for small biases to large biases.

This means that $\hat{P}'$ (and ultimately $P'$) need not be elements of $\Delta\mathcal{R}$. However, $P'$ can still define a reward the agent can optimise, in the following sense:

  • Given a complete history $h_m$, the agent will maximise the reward $R' = \sum_i R_i \, P'(R_i \mid h_m)$.

Since $P'$ need not be in $\Delta\mathcal{R}$, some of these coefficients can be negative, but that still results in a consistent $R'$ to maximise.
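
A small sketch of why negative coefficients cause no difficulty: the combined reward is just a linear combination of the base rewards, which is well defined whatever the signs of the weights. The reward functions, the state representation and the weights below are all hypothetical.

```python
def combined_reward(rewards, weights):
    """Return R'(state) = sum_i weights[i] * rewards[i](state).

    The weights play the role of P'(R_i | h_m); they are assumed to sum to 1
    but are not required to be non-negative.
    """
    def r_prime(state):
        return sum(w * r(state) for r, w in zip(rewards, weights))
    return r_prime


# Hypothetical base rewards on a toy state.
def r0(state):
    return float(state["paperclips"])


def r1(state):
    return 1.0 if state["shut_down"] else 0.0


weights = [1.5, -0.5]  # sums to 1, but lies outside the simplex
r_prime = combined_reward([r0, r1], weights)

value = r_prime({"paperclips": 2, "shut_down": True})  # 1.5 * 2 - 0.5 * 1
```

With these weights the agent is rewarded for what $R_0$ values and mildly penalised for what $R_1$ values, which is still a single consistent objective.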

Properties of the approach

It's clear the agent is indifferent to bias, but notice that this doesn't prevent the agent from learning: once it gets an observation, $\hat{P}'$ can change significantly. It's just changes to its expectation that are controlled.

Notice also that the agent doesn't believe, or act as if it believed, anything unlikely: its bets will be accurate.

And it doesn't have the problem of lotteries. Assume that the agent has $\hat{P}(R_0) = \hat{P}(R_1) = 0.5$, and there is a lottery which the agent has one chance in a million of winning.

Then if it takes an action $a$ which ensures that $P$ chooses $R_0$ if and only if it wins the lottery, then with probability $10^{-6}$ it ends up with reward function $R_0$ and a won lottery, and with probability $1 - 10^{-6}$ it ends up with reward function $\frac{1 - 2\cdot 10^{-6}}{2(1 - 10^{-6})} R_0 + \frac{1}{2(1 - 10^{-6})} R_1$ and a lost lottery. The expected reward function is still $0.5\,R_0 + 0.5\,R_1$; it has simply split this expectation differently across worlds where it's won or not won the lottery.
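
As a check on the arithmetic, writing $\epsilon = 10^{-6}$ for the chance of winning (a symbol introduced here only for brevity), the two branches recombine as follows:

$$
\epsilon R_0 + (1 - \epsilon)\left[\frac{1 - 2\epsilon}{2(1 - \epsilon)} R_0 + \frac{1}{2(1 - \epsilon)} R_1\right]
= \left(\epsilon + \frac{1 - 2\epsilon}{2}\right) R_0 + \frac{1}{2} R_1
= \frac{1}{2} R_0 + \frac{1}{2} R_1 .
$$

The two coefficients in the losing branch also sum to one, since $(1 - 2\epsilon) + 1 = 2(1 - \epsilon)$, so that branch is a genuine distribution over $R_0$ and $R_1$.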

One thing that this approach doesn't solve is the issue of the agent not following the exact reward function the humans want it to follow, due to accumulated bias. But first note that this will typically encourage the agent not to bias its reward learning, as it will tend to get higher reward when the humans agree with the agent's reward function. Note secondly that even if the agent manipulates the human values, at the end, to agree with its own, this manipulation, in expectation, simply undoes previous manipulations the agent has done (which caused the biasing in the first place).

Those who find this still unsatisfactory can wait for the next post, where the agent is not simply indifferent to biasing actions, but is penalised for them.

Indifference and bias

Why has indifference been connected with bias, rather than the more general influence? Simply because the evidential counterfactual has problems with bias, so bias is what needs to be corrected first (the causal counterfactual is unbiased and uninfluenceable).

Indeed, we can generalise this solution to the influence problem, where it becomes the counterfactual approach (which I used to call stratification, before I realised what it was). See subsequent posts for this.