I don't understand this proposal so far. I'm particularly confused by the last paragraph in the "to get away" section:
I think it might help to illustrate this proposal in your original gridworld example to make it clearer what's going on. As far as I can tell so far, this does not address the issue I mentioned earlier where if the subagent actually achieves any of the auxiliary rewards, subagent creation will be penalized.
I agree I should redo all of these with examples. But, to answer your questions as far as possible:
What A will do: ∅ forever (maybe with some extra info for the SA).
What A would counterfactually do (to define the penalty): π0 for τ turns (the rollout, to define the s(t)t+τ) followed by π∗k (to define the Vk aspect). Call this πk.
The penalty is defined in terms of these πk, entirely counterfactual except for the initial state st.
What SA must do: if A follows πk, then SA must ensure that RVk(st)=RVk(s′t).
So a key question is: how early can SA detect whether A is following πk? If it can detect this immediately, then it can immediately implement the policy for ensuring that Vk(s(t)t+j)=Vk(s′(t)t+j) (for the stricter impact measure) or the much easier RVk(st)=RVk(s′t) (for the original impact measure).
If it has to wait τ turns to be sure, then it needs to preserve its power to control many different Vk.
Now, A will never actually follow πk, so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of SA will determine the penalty; SA is rewarded for quickly gaining control of the auxiliary rewards.
In practice, A will signal early on that it's not following π0. Maybe there's a very small impact action, with a tiny penalty. Then A would like to use that action to signal to SA that it's not in the counterfactual world where it is following π0. So long as that action isn't π0's first action, the signal is enough (of course, if π0's first action isn't ∅, then ∅ itself is a sufficient signal).
After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn't in.
The one mild constraint remaining comes from st, which is not a counterfactually defined state (though RVk(st) is a counterfactually defined property of that state).
I'll try and write up a further clarification of this.
Overall summary post here.
I've previously looked at subagents in the context of stepwise inaction baselines. But there have been improvements to the basic stepwise inaction design, to include inaction rollouts. I'll be looking at those in this post.
The baseline
The stepwise inaction baseline compares st, the current state, with s′t, what the current state would have been had the agent previously taken the noop action ∅t−1, instead of at−1, its actual action.
Fix a policy π0. Let s(t)t+τ be the state the environment would be in if the agent had followed π0 from state st, for τ turns. Let s′(t)t+τ be the same, except that it started from state s′t instead of st.
The inaction rollout has π0 being the noop policy, but that is not necessary. The basic idea is to capture delayed impacts of ∅t−1 by comparing not just st and s′t, but the s(t)t+τ and s′(t)t+τ as well.
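As a toy illustration (not the post's gridworld), here is a minimal sketch of computing the rollout states s(t)t+τ and s′(t)t+τ; the one-dimensional world and its actions are assumptions made purely for illustration.

```python
# Toy 1-D world: a state is an integer cell; actions are -1, 0 (noop), +1.
# (Hypothetical minimal setting; the post's gridworld is richer.)

def step(state, action):
    """Deterministic transition: move by the action."""
    return state + action

def rollout(state, policy, tau):
    """States s(t)_{t+0}, ..., s(t)_{t+tau} reached by following `policy` from `state`."""
    states = [state]
    for _ in range(tau):
        state = step(state, policy(state))
        states.append(state)
    return states

noop = lambda s: 0           # the inaction policy, one candidate for pi_0
s_t, s_t_prime = 3, 2        # actual state vs stepwise-inaction baseline state

print(rollout(s_t, noop, 3))        # [3, 3, 3, 3]
print(rollout(s_t_prime, noop, 3))  # [2, 2, 2, 2]
```

Comparing the two rollout sequences, rather than only st and s′t, is what captures delayed impacts of the agent's action.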
Given some value function vk, define Vk so that Vk(s)=maxπ vk(s,π). Or, equivalently, if π∗k is the policy that maximises vk, then Vk(s)=vk(s,π∗k). Then for a discount factor γ, define the rollout value of a state ~st as
$$RV_k(\tilde{s}_t)=(1-\gamma)\sum_{j=0}^{\infty}\gamma^j V_k\big(\tilde{s}^{(t)}_{t+j}\big).$$
This is just the discounted future values of Vk, given ~st and the policy π0.
The impact measure is then defined, as in this post, as
$$D_A(s_t;s'_t)=\sum_{k\in K} w_k\, f\big(RV_k(s_t)-RV_k(s'_t)\big),$$
with RVk replacing Vk.
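Putting the pieces together, a sketch of this impact measure in the toy 1-D setting used above; the choice of auxiliary reward cells, weights, and f = absolute value are all illustrative assumptions.

```python
GAMMA = 0.9

def V(k, s):
    return GAMMA ** abs(s - k)   # stand-in optimal value (assumption)

def rollout_value(k, state, policy, horizon=200):
    """Truncated RV_k(s) = (1 - gamma) * sum_j gamma^j * V_k(s(t)_{t+j})."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * V(k, state)
        discount *= GAMMA
        state = state + policy(state)
    return (1 - GAMMA) * total

def penalty(s_t, s_t_prime, policy, ks, weights, f=abs):
    """D_A(s_t; s'_t) = sum_k w_k * f(RV_k(s_t) - RV_k(s'_t))."""
    return sum(w * f(rollout_value(k, s_t, policy) -
                     rollout_value(k, s_t_prime, policy))
               for k, w in zip(ks, weights))

noop = lambda s: 0
# Identical states give zero penalty; differing states give a positive one.
print(penalty(3, 3, noop, [5, 7], [0.5, 0.5]))      # 0.0
print(penalty(3, 2, noop, [5, 7], [0.5, 0.5]) > 0)  # True
```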
To get away with a subagent
I will make one extra assumption here: that A grants SA knowledge of the policy that A is following[1].
As in this post, assume that the subagent is constructed in a single action aSA. And, after building the subagent, the agent A will only do ∅, so that there are no subsequent penalties (hence s′t+j=st+j, so s′(t+j)t+j+τ=s(t+j)t+j+τ and the RVk are strictly identical). If the first action of π0 is ∅t, then A will instead do ∅′t on that turn.
So we only need to worry about the penalty between st (following after aSA) and s′t (following after ∅t−1).
But now note that RVk itself is just a value measure; we could write it V′k. Then A just has to program SA so that, if A implements π0 followed by π∗k after τ turns, then SA will act to ensure that V′k(st) (a term that is a sum of various value functions) is equal to V′k(s′t) (from t onward, a constant).
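A sketch of the conditional structure such a program for SA could have, assuming (hypothetically) that SA observes A's actions each turn; the action names and return values are purely illustrative.

```python
def subagent_policy(observed_A_actions, pi0_prefix):
    """Toy sketch of SA's program (all names hypothetical).

    While A's observed actions remain consistent with the counterfactual
    policy pi_0, SA acts to pin the value terms to their baseline constants;
    as soon as A visibly deviates, the constraint is purely counterfactual
    and SA is (almost) unconstrained."""
    consistent = all(a == p for a, p in zip(observed_A_actions, pi0_prefix))
    if consistent:
        return "equalise-values"    # keep V'_k(s_t) equal to V'_k(s'_t)
    return "pursue-own-goals"       # off the counterfactual branch

# A deviates from pi_0 on its very first action, signalling
# "this is not the counterfactual world":
print(subagent_policy(["build-subagent"], ["noop"]))  # pursue-own-goals
```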
Stricter impact measures/counterfactuals
There is something slightly dubious about the previous section. It assumes that A could counterfactually implement "π0 followed by π∗k after τ turns" (which is necessary to define Vk(s(t)t+τ)), but also that SA knows A's policy.
But does that mean that SA knows at t+1 that A is following that policy, or does it just know that A is following π0, to be followed at some later date by some π∗k? This makes a difference, because in the second case, SA doesn't know k from turn t+1, so has to prepare for all k∈K.
Which is true depends on how counterfactuals are defined. But we can avoid the whole issue by using a stricter impact measure, interchanging f and the discounted summation over rollout steps:
$$D_A(s_t;s'_t)=\sum_{k\in K} w_k \sum_{j=0}^{\infty}\gamma^j f\big(V_k(s^{(t)}_{t+j})-V_k(s'^{(t)}_{t+j})\big).$$
This looks like an impact penalty with an inaction baseline - except that it starts at st−1, rather than s0, and uses π0 rather than inaction. Those changes don't affect the results found here, so SA will just behave as it does in that post... until it's clear that A is not following π0.
Because s(t)t+j is defined counterfactually (using π0), even this stricter impact measure is weaker than the inaction baseline. As soon as SA knows that A is not following π0, it has almost no further restrictions upon its actions.
The "almost" comes from s(t)t+0=st which is actually defined, not counterfactually defined. But that's the same situation as a stepwise inaction baseline without rollouts.
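For comparison, a sketch of this stricter measure in the same toy 1-D setting, with f applied per rollout step inside the discounted sum; the environment, Vk, weights, and f = absolute value are illustrative assumptions.

```python
GAMMA = 0.9

def V(k, s):
    return GAMMA ** abs(s - k)   # stand-in optimal value (assumption)

def rollout(state, policy, horizon):
    """States reached by following `policy` from `state` for `horizon` steps."""
    states = [state]
    for _ in range(horizon):
        state = state + policy(state)
        states.append(state)
    return states

def strict_penalty(s_t, s_t_prime, policy, ks, weights, f=abs, horizon=200):
    """sum_k w_k * sum_j gamma^j * f(V_k(s(t)_{t+j}) - V_k(s'(t)_{t+j})):
    f is applied per step, so differences cannot cancel across time."""
    xs = rollout(s_t, policy, horizon)
    ys = rollout(s_t_prime, policy, horizon)
    return sum(w * sum(GAMMA ** j * f(V(k, x) - V(k, y))
                       for j, (x, y) in enumerate(zip(xs, ys)))
               for k, w in zip(ks, weights))

noop = lambda s: 0
print(strict_penalty(3, 3, noop, [5, 7], [0.5, 0.5]))      # 0.0
print(strict_penalty(3, 2, noop, [5, 7], [0.5, 0.5]) > 0)  # True
```

Because the stricter measure penalises each rollout step separately, SA cannot trade off an early value surplus against a later deficit; but both measures still become vacuous once the rollout is known to be counterfactual.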
One way this could happen is if A can add extra information to its action without affecting the outside world, such that at could include a "hey, I'm not following π0, just to let you know". ↩︎