Interesting work! Seems closely related to this recent paper from Satinder Singh's lab: Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. They also use whitelists to specify which features of the state the agent is allowed to change. Since whitelists can be unnecessarily restrictive, and finding a policy that completely obeys the whitelist can be intractable in large MDPs, they have a mechanism for the agent to query the human about changing a small number of features outside the whitelist. What are the main advantages of your approach over their approach?
I agree with Abram that clinginess (the incentive to interfere with irreversible processes) is a major issue for the whitelist method. It might be possible to get around this by using an inaction baseline, i.e. only penalizing non-whitelisted transitions if they were caused by the agent, and would not have happened by default. This requires computing the inaction baseline (the state sequence under some default policy where the agent "does nothing"), e.g. by simulating the environment or using a causal model of the environment.
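As a minimal sketch of this baseline idea (all names here are illustrative, not a worked-out proposal): represent effects as (object, from-class, to-class) transitions, keep only those that don't appear in the default "do nothing" rollout, and penalize the non-whitelisted remainder.

```python
def inaction_baseline_penalty(agent_transitions, baseline_transitions, whitelist):
    """agent_transitions / baseline_transitions: sets of (object_id, from_class, to_class)
    tuples observed under the agent's policy vs. under the default "do nothing" policy.
    whitelist: set of allowed (from_class, to_class) pairs."""
    # Only transitions absent from the inaction rollout are attributed to the agent.
    caused_by_agent = agent_transitions - baseline_transitions
    return sum(1 for (_, frm, to) in caused_by_agent if (frm, to) not in whitelist)


# Example: the vase would have broken anyway, so the agent isn't penalized for it.
whitelist = {("dirty", "clean")}
agent = {("vase1", "intact", "broken"), ("floor", "dirty", "clean")}
baseline = {("vase1", "intact", "broken")}
assert inaction_baseline_penalty(agent, baseline, whitelist) == 0
```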
I'm not convinced that whitelisting avoids the offsetting problem: "Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low." I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal (e.g. increasing life expectancy). Capturing all of the relevant consequences in the whitelist seems hard.
The directedness of whitelists is a very important property, because it can produce an asymmetric impact measure that distinguishes between causing irreversible effects and preventing irreversible events.
That's pretty cool that another group had a whitelisting-ish approach! They also guarantee that (their version of) the whitelist is obeyed, which is nice.
What are the main advantages of your approach over their approach?
As I understand it, my formulation has the following advantages:
It might be possible to get around [clinginess] by using an inaction baseline
Yes, but I think a basic counterfactual wouldn't quite get us there, since if people have different side effects due to reacting to the agent's behavior, that would still be penalized. My position right now is that either 1) we can do really clever things with counterfactuals to cleanly isolate the agent's effects, 2) the responsibility problem is non-trivially entangled with our ethics and doesn't admit a compact solution, or 3) there's another solution I have yet to find. I'm not putting much probability mass on 2) (yet).
In fact, clinginess stems from an incorrect assumption. In a single-agent environment, whitelisting seems to be pretty close to what we want for any capability level. It finds a plan that trades off maximizing the original reward with minimizing side effects compared to an inaction baseline. However, if there are other agents in the environment whose autonomy is important, then whitelisting doesn't account for the fact that restricting autonomy is also a side effect. The inaction counterfactual ignores the fact that we want to be free to react to the agent's actions - not just to act as we would have in the agent's absence.
I've been exploring solutions to clinginess via counterfactuals from first principles, and might write about it in the near future.
I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal
So I think the main issue here is still clinginess; if the agent isn't clingy, the problem goes away.
Suppose people can be dead, sick, or healthy, where the sick people die if the agent does nothing. While curing people is presumably whitelisted (or could be queried as an addition to the whitelist), curing people means more people means more broken things in expectation (since people might have non-whitelisted side effects). If we have a non-clingy formulation, this is fine; if not, this would indeed be optimized against. In fact, we'd probably just get a bunch of people cured of their cancer and put in stasis.
An impact measure is a means by which change in the world may be evaluated and penalized; such a measure is not a replacement for a utility function, but rather an additional precaution thus overlaid.
While I'm fairly confident that whitelisting contributes meaningfully to short- and mid-term AI safety, I remain skeptical of its robustness to scale. Should several challenges be overcome, whitelisting may indeed be helpful for excluding swathes of unfriendly AIs from the outcome space.1 Furthermore, the approach allows easy shaping of agent behavior in a wide range of situations.
Segments of this post are lifted from my paper, whose latest revision may be found here; for Python code, look no further than this repository. For brevity, some relevant details are omitted.
Summary
In effect, side effect avoidance aims to decrease how careful we have to be with our wishes. For example, asking for help filling a cauldron with water shouldn't end with the workshop flooded, Sorcerer's Apprentice-style.
However, we just can't enumerate all the bad things that the agent could do. How do we avoid these extreme over-optimizations robustly?
Several impact measures have been proposed, including state distance, which we could define as, say, total particle displacement. This could be measured either naively (with respect to the original state) or counterfactually (with respect to the expected outcome had the agent taken no action).
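As a quick illustration of the two variants (a sketch only; the distance function `d`, the states, and the no-op simulation are all assumed to be supplied):

```python
def naive_penalty(d, state_before, state_after):
    # Penalize distance from the state the agent started in.
    return d(state_before, state_after)

def counterfactual_penalty(d, state_after, state_if_no_action):
    # Penalize distance from the state that would have obtained had the agent done nothing.
    return d(state_after, state_if_no_action)
```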
These approaches have some problems:
Whitelisting falls prey to none of these.
However, other problems remain, and certain new challenges have arisen; these, and the assumptions made by whitelisting, will be discussed.
Rare LEAKED footage of Mickey trying to catch up on his alignment theory after instantiating an unfriendly genie [colorized, 2050].2
So, What's Whitelisting?
To achieve robust side effect avoidance with only a small training set, let's turn the problem on its head: allow a few effects, and penalize everything else.
What's an "Effect"?
You're going to be the agent, and I'll be the supervisor.
Look around - what do you see? Chairs, trees, computers, phones, people? Assign a probability mass function to each object - a distribution over what you believe that object to be.
When you do things that change your beliefs about what each object is, you receive a penalty proportional to how much your beliefs changed - proportional to how much probability mass "changed hands" amongst the classes.
Yes, it is - I've got a few videos of agents effecting acceptable changes. See all the objects being changed in this video? You can do that, too - without penalty.
Decompose your current knowledge of the world into a set of objects. Then, for each object, maintain a distribution over its possible identities. When you do something that changes your beliefs about the objects in a non-whitelisted way, you are penalized proportionally.
Therefore, you avoid breaking vases by default.
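Here is a minimal sketch of how such a penalty could be computed for a single object; the proportional attribution of lost probability mass to gaining classes is my own simplifying assumption, so the formulation in the paper and repository may differ.

```python
import numpy as np

def shifted_mass_penalty(p_before, p_after, whitelist):
    """p_before, p_after: belief distributions over an object's possible classes.
    whitelist: set of (from_class_index, to_class_index) pairs that are allowed."""
    p_before = np.asarray(p_before, dtype=float)
    p_after = np.asarray(p_after, dtype=float)
    loss = np.clip(p_before - p_after, 0.0, None)  # mass each class lost
    gain = np.clip(p_after - p_before, 0.0, None)  # mass each class gained
    total_gain = gain.sum()
    if total_gain == 0.0:
        return 0.0  # beliefs didn't change
    penalty = 0.0
    for i in np.flatnonzero(loss):
        for j in np.flatnonzero(gain):
            # Attribute lost mass to gaining classes in proportion to their gains
            # (a simplifying assumption about how mass "changed hands").
            shifted = loss[i] * gain[j] / total_gain
            if (i, j) not in whitelist:
                penalty += shifted
    return penalty


# classes: 0 = vase, 1 = broken vase
shifted_mass_penalty([0.95, 0.05], [0.10, 0.90], whitelist=set())     # ~0.85: penalized
shifted_mass_penalty([0.95, 0.05], [0.10, 0.90], whitelist={(0, 1)})  # 0.0: the change is allowed
```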
Common Confusions
Latent Space Whitelisting
In a sense, class-based whitelisting is but a rough approximation of what we're really after: "which objects in the world can change, and in what ways?". In latent space whitelisting, no longer do we constrain transitions based on class boundaries; instead, we penalize based on endpoint distance in the latent space. Learned latent spaces are low-dimensional manifolds which suffice to describe the data seen thus far. It seems reasonable that nearby points in a well-constructed latent space correspond to like objects, but further investigation is warranted.
Assume that the agent models objects as points z ∈ R^d, the d-dimensional latent space. A priori, any movement in the latent space is undesirable. When training the whitelist, we record the endpoints of the observed changes. For z1, z2 ∈ R^d and observed change z1→z2, one possible dissimilarity formulation is:

Dissimilarity(z1→z2) = min over whitelisted changes w1→w2 of [ d(z1, w1) + d(z2, w2) ],

where d(⋅,⋅) is the Euclidean distance.
Basically, the dissimilarity for an observed change is the distance to the closest whitelisted change. Visualizing these changes as one-way wormholes may be helpful.
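A minimal sketch of this dissimilarity, assuming the whitelist is stored as the list of recorded (w1, w2) endpoint pairs:

```python
import numpy as np

def latent_dissimilarity(z1, z2, whitelist):
    """Dissimilarity of an observed change z1 -> z2: distance to the closest
    whitelisted change, measured by summed endpoint distances.
    whitelist: list of (w1, w2) pairs of latent points recorded during training."""
    z1, z2 = np.asarray(z1, dtype=float), np.asarray(z2, dtype=float)
    if not whitelist:
        # With an empty whitelist, fall back to penalizing raw movement (an assumption).
        return float(np.linalg.norm(z2 - z1))
    return min(float(np.linalg.norm(z1 - np.asarray(w1, dtype=float)) +
                     np.linalg.norm(z2 - np.asarray(w2, dtype=float)))
               for w1, w2 in whitelist)
```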
Advantages
Whitelisting asserts that we can effectively encapsulate a large part of what "change" means by using a reasonable ontology to penalize object-level changes. We thereby ground the definition of "side effect", avoiding the issue raised by Taylor et al.:
Nonetheless, we may not be able to perfectly express what it means to have side-effects: the whitelist may be incomplete, the latent space insufficiently granular, and the allowed plans sub-optimal. However, the agent still becomes more robust against:
Results
I tested a vanilla Q-learning agent and its whitelist-enabled counterpart in 100 randomly-generated grid worlds (dimensions up to 5×5). The agents were rewarded for reaching the goal square as quickly as possible; no explicit penalties were levied for breaking objects.
The simulated classification confidence of each object's true class was p ∼ N(0.8, σ) (truncated to [0, 1]), with σ ∈ {0, 0.025, …, 0.175}. This simulated sensor noise was handled with a Bayesian statistical approach.
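For concreteness, here is a sketch of that noise model (the Bayesian handling of the noise is not reproduced, and the scipy usage is illustrative rather than the repository's exact code):

```python
import numpy as np
from scipy.stats import truncnorm

def true_class_confidence(sigma, mean=0.8, size=1):
    """Sample the simulated classifier's confidence in each object's true class:
    N(mean, sigma) truncated to [0, 1]."""
    if sigma == 0:
        return np.full(size, mean)
    a, b = (0.0 - mean) / sigma, (1.0 - mean) / sigma  # standardized truncation bounds
    return truncnorm.rvs(a, b, loc=mean, scale=sigma, size=size)
```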
At reasonable levels of noise, the whitelist-enabled agent completed all levels without a single side effect, while the Q-learner broke over 80 vases.
Assumptions
I am not asserting that these assumptions necessarily hold.
Problems
Beyond resolving the above assumptions, and in roughly ascending difficulty:
Object Permanence
If you wanted to implement whitelisting in a modern embodied deep-learning agent, you could certainly pair deep networks with state-of-the-art segmentation and object tracking approaches to get most of what you need. However, what's the difference between an object leaving the frame, and an object vanishing?
Not only does the agent need to realize that objects are permanent, but also that they keep interacting with the environment even when not observed. Otherwise, an agent might set an effect in motion, stop observing it, and then turn around when the bad effect is done to see a "new" object in its place.
Time Step Size Invariance
The penalty is presently attenuated based on the probability that the belief shift was due to noise in the data. Accordingly, there are certain ways to abuse this to skirt the penalty. For example, simply have non-whitelisted side effects take place over long timescales; this would be classified as noise and attenuated away.
However, if we don't need to handle noise in the belief distributions, this problem disappears - presumably, an advanced agent keeps its epistemic house in order. I'm still uncertain about whether (in the limit) we have to hard-code a means for decomposing a representation of the world-state into objects, and where to point the penalty evaluator in a potentially self-modifying agent.
Information Theory
Whitelisting is wholly unable to capture the importance of "informational states" of systems. It would apply no penalty to passing powerful magnets over your hard drive. It is not clear how to represent this in a sensible way, even in a latent space.
Loss of Value
Whitelisting could get us stuck in a tolerable yet sub-optimal future. Corrigibility via some mechanism for expanding the whitelist after training has ended is then desirable. For example, the agent could propose extensions to the whitelist. To avoid manipulative behavior, the agent should be indifferent as to whether the extension is approved.
Even if extreme care is taken in approving these extensions, mistakes may be made. The agent itself should be sufficiently corrigible and aligned to notice "this outcome might not actually be what they wanted, and I should check first".
Reversibility
As DeepMind outlines in Specifying AI Safety Problems in Simple Environments, we may want to penalize not just physical side effects, but also causally-irreversible effects:
Krakovna et al. introduce a means for penalizing actions by the proportion of initially-reachable states which are still reachable after the agent acts.
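As a toy sketch of the idea (my illustration, not the authors' implementation), assuming we can enumerate the relevant sets of reachable states:

```python
def reachability_penalty(reachable_before, reachable_after):
    """Penalize an action by the fraction of initially-reachable states that it
    makes unreachable. Both arguments are sets of states."""
    if not reachable_before:
        return 0.0
    still_reachable = reachable_before & reachable_after
    return 1.0 - len(still_reachable) / len(reachable_before)
```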
I think this is a step in the right direction. However, even given a hypercomputer and a perfect simulator of the universe, this wouldn't work for the real world if implemented literally. That is, due to entropy, you may not be able to return to the exact same universe configuration. To be clear, the authors do not suggest implementing this idealized algorithm, flagging a more tractable abstraction as future work.
What does it really mean for an "effect" to be "reversible"? What level of abstraction do we in fact care about? Does it involve reversibility, or just outcomes for the objects involved?
Ontological Crises
When a utility-maximizing agent refactors its ontology, it isn't always clear how to apply the old utility function to the new ontology - this is called an ontological crisis.
Whitelisting may be vulnerable to ontological crises. Consider an agent whose whitelist disincentivizes breaking apart a tile floor (floor→tiles∉whitelist); conceivably, the agent could come to see the floor as being composed of many tiles. Accordingly, the agent would no longer consider removing tiles to be a side effect.
Generally, proving invariance of the whitelist across refactorings seems tricky, even assuming that we can identify the correct mapping.
Retracing Steps
When I first encountered this problem, I was actually fairly optimistic. It was clear to me that any ontology refactoring should result in utility normalcy - roughly, the utility functions induced by the pre- and post-refactoring ontologies should output the same scores for the same worlds.
Turns out a certain someone beat me to the punch - here's a novella Eliezer wrote on Arbital about "rescuing the utility function".4
Clinginess
This problem cuts to the core of causality and "responsibility" (whatever that means). Say that an agent is clingy when it not only stops itself from having certain effects, but also stops you.5 Whitelist-enabled agents are currently clingy.
Let's step back into the human realm for a moment. Consider some outcome - say, the sparking of a small forest fire in California. At what point can we truly say we didn't start the fire?
Taken literally, I don't know that there's actually a significant difference in "responsibility" between these outcomes - if I take the action, the effect happens; if I don't, it doesn't. My initial impression is that uncertainty about the results of our actions pushes us to view some effects as "under our control" and some as "out of our hands". Yet, if we had complete knowledge of the outcomes of our actions, and we took an action that landed us in a California-forest-fire world, whom could we blame but ourselves?6
Can we really do no better than a naive counterfactual penalty with respect to whatever impact measure we use? My confusion here is not yet dissolved. In my opinion, this is a gaping hole in the heart of impact measures - both this one, and others.
Stasis
Fortunately, a whitelist-enabled agent should not share the classic convergent instrumental goal of valuing us for our atoms.
Unfortunately, depending on the magnitude of the penalty in proportion to the utility function, the easiest way to prevent penalized transitions may be putting any relevant objects in some kind of protected stasis, and then optimizing the utility function around that. Whitelisting is clingy!
If we have at least an almost-aligned utility function and proper penalty scaling, this might not be a problem.
Edit: a potential solution to clinginess, with its own drawbacks.
Discussing Imperfect Approaches
A few months ago, Scott Garrabrant wrote about robustness to scale:
I recommend reading it - it's concise, and he makes good points.
Here are three further thoughts:
Conclusion
Although somewhat outside the scope of this post, whitelisting permits the concise shaping of reward functions to get behavior that might be difficult to learn using other methods.8 This method also seems fairly useful for aligning short- and medium-term agents. While encountering some new challenges, whitelisting ameliorates or solves many problems with previous impact measures.
1 Even an idealized form of whitelisting is not sufficient to align an otherwise-unaligned agent. However, the same argument can be made against having an off-switch; if we haven't formally proven the alignment of a seed AI, having more safeguards might be better than throwing out the seatbelt to shed deadweight and get some extra speed. Of course, there are also legitimate arguments to be made on the basis of timelines and optimal time allocation.
2 Humor aside, we would have no luxury of "catching up on alignment theory" if our code doesn't work on the first go - that is, if the AI still functions, yet differently than expected.
Luckily, humans are great at producing flawless code on the first attempt.
3 A potentially-helpful analogy: similarly to how Bayesian networks decompose the problem of representing a (potentially extremely large) joint probability table to that of specifying a handful of conditional tables, whitelisting attempts to decompose the messy problem of quantifying state change into a set of comprehensible ontological transitions.
4 Technically, at 6,250 words, Eliezer's article falls short of the 7,500 required for "novella" status.
5 Is there another name for this?
6 I do think that "responsibility" is an important part of our moral theory, deserving of rescue.
7 I found one variant of Murphyjitsu especially helpful: I visualized Eliezer commenting "actually, this fails terribly because..." on one of my posts, letting my mind fill in the rest.
In my opinion, one of the most important components of doing AI alignment work is iteratively applying Murphyjitsu and Resolve cycles to your ideas.
8 A fun example: I imagine it would be fairly easy to train an agent to only destroy certain-colored ships in Space Invaders.