When talking about learning processes, I've said things like "every uninfluenceable learning process is unriggable" and "every unriggable learning process is uninfluenceable, if the set of environments is large enough".

In this post, I want to clarify the relationship between the two, in ways that are hopefully understandable. I won't be using much math, but several hopefully intuitive examples.

The main points I hope to get across are:

  1. Uninfluenceable and unriggable are of different types. Unriggable is an algebraic condition on your learning process; uninfluenceable is about learning independent facts about the environment.
  2. Nevertheless, these are almost equivalent.
  3. Unriggable processes that are not uninfluenceable are specific and unstable, and generally collapse if the agent can gain extra information.

The third point is new, so will be of interest even to those who already understand these concepts.

Informal definitions

Here, it suffices to think of uninfluenceable as meaning "learning about values/preferences is the same as learning objective facts about the universe" and unriggable as "you can't push the expected probability of a value/preference in a particular direction".

Unriggable-but-influenceable

Parents guide you

I've used an example involving a robot agent asking its parents, so let's continue with that. There is a robot who doesn't know whether to become a banker or a doctor. To figure this out, it can ask either of its parents:

Its possible actions are $a_m$ (ask mother) and $a_f$ (ask father). When it does so, the relevant parent will answer "banker" ($o_b$) or "doctor" ($o_d$), which will fully resolve the issue.

So, given the possible histories, this is what the robot will learn, with actions in red and observations in blue:

Let $\mu_{bb}$ be the world in which both parents will say "banker", $\mu_{bd}$ the world in which the mother says "banker" and the father says "doctor", $\mu_{db}$ the world in which they will say the opposite, and $\mu_{dd}$ the world in which both will say "doctor".

Then the learning process is unriggable if the robot cannot push the answer towards "doctor" or "banker" via its actions. This means that the probabilities of hearing "banker" (and of hearing "doctor") are the same whichever parent the robot asks. This happens when the robot's (subjective) probabilities of $\mu_{bd}$ and $\mu_{db}$ are the same[1].
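Here is a minimal sketch of that check, purely for illustration: the world names follow the notation above, and the prior probabilities are made up.

```python
# Minimal sketch of the unriggability condition, using the notation above.
# Subscripts are (mother's answer, father's answer); probabilities are illustrative.
prior = {"bb": 0.3, "bd": 0.2, "db": 0.2, "dd": 0.3}

# Probability of hearing "banker" after each action.
p_banker_ask_mother = prior["bb"] + prior["bd"]  # mother says "banker" in bb and bd
p_banker_ask_father = prior["bb"] + prior["db"]  # father says "banker" in bb and db

# Unriggable: neither action pushes the expected answer in either direction,
# which here reduces to P(bd) == P(db).
print(p_banker_ask_mother, p_banker_ask_father)
print("unriggable:", abs(p_banker_ask_mother - p_banker_ask_father) < 1e-12)
```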

Unriggable-but-influenceable is unstable

If the learning process is unriggable, but not uninfluenceable, then the unriggability is unstable. First of all because of the requirement that $P(\mu_{bd}) = P(\mu_{db})$: the robot has to believe that its father and its mother are equally likely to say "banker" (or "doctor"). This is a highly specific and unusual situation to be in.

But what happens if the robot receives an update about either parent's preferences, before asking them? Maybe it found one of their diaries:

Then it's likely that the probabilities of $\mu_{bd}$ and $\mu_{db}$ will drift out of equality.

If the learning process is truly unriggable, then the robot cannot receive any such update; or, by pure coincidence, the update must keep the probabilities equal. If the process were uninfluenceable, then there would be a reason for this: the probabilities and the updates would be caused by some underlying information (see the section on uninfluenceability below).

But, generically, if there is an unriggable-but-not-uninfluenceable learning process, then allowing the robot to learn new information will mess it up, and it will no longer be unriggable.
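To make this concrete, here is a sketch of how a single diary update can break the equality; the likelihoods are made-up numbers, purely for illustration.

```python
# Sketch: a Bayesian update on the mother's diary generically breaks unriggability.
# Made-up likelihoods: the diary entry is twice as likely in worlds where the
# mother would say "banker" (it tells us nothing about the father).
prior = {"bb": 0.3, "bd": 0.2, "db": 0.2, "dd": 0.3}
likelihood = {"bb": 0.8, "bd": 0.8, "db": 0.4, "dd": 0.4}

unnormalised = {w: prior[w] * likelihood[w] for w in prior}
total = sum(unnormalised.values())
posterior = {w: p / total for w, p in unnormalised.items()}

# P(bd) and P(db) are no longer equal, so the process is no longer unriggable.
print(posterior["bd"], posterior["db"])
```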

Uninfluenceable

One of the results of the paper is that any unriggable learning process is equivalent to an uninfluenceable one, though possibly with different environments.

So if the robot's process is unriggable, in the sense that $\mu_{bd}$ and $\mu_{db}$ are of equal probability, what is the equivalent uninfluenceable one?

Here's one way it could happen. There is a ghost that has opinions on what career the robot should take:

After the robot has decided which parent to ask, the ghost will possess that parent, and impose its own judgement as to what job the robot should take:

In this case, there are only two environments, $\nu_b$ and $\nu_d$, depending on whether the ghost wants the robot to be a banker or a doctor. This learning process is uninfluenceable, i.e. what is learnt depends only on the underlying environment: $\nu_b$ means the robot should be a banker, $\nu_d$ that it should be a doctor.
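A small sketch of this setup, with environment and action names of my own choosing: the observation is a function of the environment alone, not of the robot's action.

```python
# Sketch of the ghost scenario: what the robot hears depends only on the
# environment (the ghost's opinion), never on which parent it asks.
ghost_prior = {"ghost_banker": 0.5, "ghost_doctor": 0.5}

def observation(env, action):
    # The ghost possesses whichever parent is asked and gives its own answer.
    return "banker" if env == "ghost_banker" else "doctor"

# Uninfluenceable: the robot's actions cannot change what it will learn.
for action in ("ask_mother", "ask_father"):
    print(action, {env: observation(env, action) for env in ghost_prior})
```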

The equivalence

Why are the ghost scenario and the standard parent scenario equivalent? Well, there are only four histories that the robot can observe in either scenario: two actions ($a_m$ or $a_f$), each followed by two possible observations ($o_b$ or $o_d$).

Then in the standard parent scenario, the probability of $o_b$, given $a_m$, is the sum of the probabilities of $\mu_{bb}$ and $\mu_{bd}$ (the two worlds where the mother says "banker"). Similarly, the probability of $o_b$, given $a_f$, is the sum of the probabilities of $\mu_{bb}$ and $\mu_{db}$.

In the ghost scenario, both of those probabilities are simply the probability of $\nu_b$. So if we set $P(\nu_b) = P(\mu_{bb}) + P(\mu_{bd})$ and $P(\nu_d) = P(\mu_{db}) + P(\mu_{dd})$, then, given that the robot chooses $a_m$, the probabilities of it seeing $o_b$ or $o_d$ are the same in both scenarios.

Since the standard parent scenario is unriggable, we have $P(\mu_{bd}) = P(\mu_{db})$. This implies that the probabilities of the other two histories, given $a_f$, are also the same in both scenarios[2].

Thus these two scenarios have the same probabilities for their histories.
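As a sanity check, here is a sketch that computes the four history probabilities in both scenarios under the identification above; the numbers are illustrative, chosen so that $P(\mu_{bd}) = P(\mu_{db})$.

```python
# Sketch: the parent scenario and the ghost scenario assign the same
# probability to every (action, observation) history.
prior = {"bb": 0.3, "bd": 0.2, "db": 0.2, "dd": 0.3}   # unriggable: P(bd) == P(db)
ghost = {"b": prior["bb"] + prior["bd"], "d": prior["db"] + prior["dd"]}

def parent_history_prob(action, obs):
    # Sum over worlds where the asked parent gives the observed answer.
    idx = 0 if action == "ask_mother" else 1
    return sum(p for world, p in prior.items() if world[idx] == obs)

def ghost_history_prob(action, obs):
    # The ghost's answer does not depend on which parent is asked.
    return ghost[obs]

for action in ("ask_mother", "ask_father"):
    for obs in ("b", "d"):
        print(action, obs, parent_history_prob(action, obs), ghost_history_prob(action, obs))
```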

Stability

Even though the two scenarios are equivalent in terms of probabilities, the uninfluenceable ghost scenario is stable under adding new information. If the robot discovers a diary of the ghost, then this updates its probabilities in the same way, whichever parent it asks.

The parents now give the same answer, in expectation, not because of the coincidence that $\mu_{bd}$ and $\mu_{db}$ have the same probability, but because there is a single underlying cause - the ghost - for both answers.

Of course, the underlying cause might be the parents agreeing on the robot's profession ahead of time, rather than a ghost - but what's the fun in that?

In summary

  1. If a learning process is unriggable and stable under extra information, then it is uninfluenceable.
  2. Conversely, if a learning process is unriggable but influenceable, then it is unstable: extra true information tends to make it no longer unriggable.
  3. If the learning process is unriggable but influenceable, there is an equivalent uninfluenceable one, in that the probabilities of all actions and observations are the same.
  4. But the equivalent uninfluenceable one will be stable under extra information.

  1. The learning process being influenceable means that the probabilities of $\mu_{bd}$ or $\mu_{db}$ are non-zero; if they are both zero, then both parents have agreed on what is best for the child, and asking either one makes no difference. ↩︎

  2. To check: the probability that the robot hears $o_b$ after $a_f$ is $P(\mu_{bb}) + P(\mu_{db})$, which is $P(\mu_{bb}) + P(\mu_{bd})$ by unriggability, which is $P(\nu_b)$. Similarly, the probability that the robot hears $o_d$ after $a_f$ is $P(\mu_{dd}) + P(\mu_{bd})$, which is $P(\nu_d)$. ↩︎
