Interesting work! Seems closely related to this recent paper from Satinder Singh's lab: Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. They also use whitelists to specify which features of the state the agent is allowed to change. Since whitelists can be unnecessarily restrictive, and finding a policy that completely obeys the whitelist can be intractable in large MDPs, they have a mechanism for the agent to query the human about changing a small number of features outside the whitelist. What are the main advantages of your approach over their approach?
I agree with Abram that clinginess (the incentive to interfere with irreversible processes) is a major issue for the whitelist method. It might be possible to get around this by using an inaction baseline, i.e. only penalizing non-whitelisted transitions if they were caused by the agent, and would not have happened by default. This requires computing the inaction baseline (the state sequence under some default policy where the agent "does nothing"), e.g. by simulating the environment or using a causal model of the environment.
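As a minimal sketch of this baseline idea (all names here are illustrative, not a worked-out proposal): represent effects as (object, from-class, to-class) transitions, keep only those that don't appear in the default "do nothing" rollout, and penalize the non-whitelisted remainder.

```python
def inaction_baseline_penalty(agent_transitions, baseline_transitions, whitelist):
    """agent_transitions / baseline_transitions: sets of (object_id, from_class, to_class)
    tuples observed under the agent's policy vs. under the default "do nothing" policy.
    whitelist: set of allowed (from_class, to_class) pairs."""
    # Only transitions absent from the inaction rollout are attributed to the agent.
    caused_by_agent = agent_transitions - baseline_transitions
    return sum(1 for (_, frm, to) in caused_by_agent if (frm, to) not in whitelist)


# Example: the vase would have broken anyway, so the agent isn't penalized for it.
whitelist = {("dirty", "clean")}
agent = {("vase1", "intact", "broken"), ("floor", "dirty", "clean")}
baseline = {("vase1", "intact", "broken")}
assert inaction_baseline_penalty(agent, baseline, whitelist) == 0
```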
I'm not convinced that whitelisting avoids the offsetting problem: "Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low." I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal (e.g. increasing life expectancy). Capturing all of the relevant consequences in the whitelist seems hard.
The directedness of whitelists is a very important property, because it can produce an asymmetric impact measure that distinguishes between causing irreversible effects and preventing irreversible events.
That's pretty cool that another group had a whitelisting-ish approach! They also guarantee that (their version of) the whitelist is obeyed, which is nice.
What are the main advantages of your approach over their approach?
As I understand it, my formulation has the following advantages:
It might be possible to get around [clinginess] by using an inaction baseline
Yes, but I think a basic counterfactual wouldn't quite get us there, since if people have different side effects due to reacting to the agent's behavior, that would still be penalized. My position right now is that either 1) we can do really clever things with counterfactuals to cleanly isolate the agent's effects, 2) the responsibility problem is non-trivially entangled with our ethics and doesn't admit a compact solution, or 3) there's another solution I have yet to find. I'm not putting much probability mass on 2) (yet).
In fact, clinginess stems from an incorrect assumption. In a single-agent environment, whitelisting seems to be pretty close to what we want for any capability level. It finds a plan that trades off maximizing the original reward with minimizing side effects compared to an inaction baseline. However, if there are other agents in the environment whose autonomy is important, then whitelisting doesn't account for the fact that restricting autonomy is also a side effect. The inaction counterfactual ignores the fact that we want to be free to react to the agent's actions - not just to act as we would have in the agent's absence.
I've been exploring solutions to clinginess via counterfactuals from first principles, and might write about it in the near future.
I think this depends on how extensive the whitelist is: whether it includes all the important long-term consequences of achieving the goal
So I think the main issue here is still clinginess; if the agent isn't clingy, the problem goes away.
Suppose people can be dead, sick, or healthy, where the sick people die if the agent does nothing. While curing people is presumably whitelisted (or could be queried as an addition to the whitelist), curing people means more people means more broken things in expectation (since people might have non-whitelisted side effects). If we have a non-clingy formulation, this is fine; if not, this would indeed be optimized against. In fact, we'd probably just get a bunch of people cured of their cancer and put in stasis.
An impact measure is a means by which change in the world may be evaluated and penalized; such a measure is not a replacement for a utility function, but rather an additional precaution thus overlaid.
While I'm fairly confident that whitelisting contributes meaningfully to short- and mid-term AI safety, I remain skeptical of its robustness to scale. Should several challenges be overcome, whitelisting may indeed be helpful for excluding swathes of unfriendly AIs from the outcome space.1 Furthermore, the approach allows easy shaping of agent behavior in a wide range of situations.
Segments of this post are lifted from my paper, whose latest revision may be found here; for Python code, look no further than this repository. For brevity, some relevant details are omitted.
Summary
In effect, side effect avoidance aims to decrease how careful we have to be with our wishes. For example, asking for help filling a cauldron with water shouldn't end with the workshop flooded, Sorcerer's Apprentice-style.
However, we just can't enumerate all the bad things that the agent could do. How do we avoid these extreme over-optimizations robustly?
Several impact measures have been proposed, including state distance, which we could define as, say, total particle displacement. This could be measured either naively (with respect to the original state) or counterfactually (with respect to the expected outcome had the agent taken no action).
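As a quick illustration of the two variants (a sketch only; the distance function `d`, the states, and the no-op simulation are all assumed to be supplied):

```python
def naive_penalty(d, state_before, state_after):
    # Penalize distance from the state the agent started in.
    return d(state_before, state_after)

def counterfactual_penalty(d, state_after, state_if_no_action):
    # Penalize distance from the state that would have obtained had the agent done nothing.
    return d(state_after, state_if_no_action)
```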
These approaches have some problems:
Whitelisting falls prey to none of these.
However, other problems remain, and certain new challenges have arisen; these, and the assumptions made by whitelisting, will be discussed.
Rare LEAKED footage of Mickey trying to catch up on his alignment theory after instantiating an unfriendly genie [colorized, 2050].2
So, What's Whitelisting?
To achieve robust side effect avoidance with only a small training set, let's turn the problem on its head: allow a few effects, and penalize everything else.
What's an "Effect"?
You're going to be the agent, and I'll be the supervisor.
Look around - what do you see? Chairs, trees, computers, phones, people? Assign a probability mass function to each object - a distribution over what you believe that object to be.
When you do things that change your beliefs about what each object is, you receive a penalty proportional to how much your beliefs changed - proportional to how much probability mass "changed hands" amongst the classes.
Yes, it is - I've got a few videos of agents effecting acceptable changes. See all the objects being changed in this video? You can do that, too - without penalty.
Decompose your current knowledge of the world into a set of objects. Then, for each object, maintain a distribution over its possible identities. When you do something that changes your beliefs about the objects in a non-whitelisted way, you are penalized proportionally.
Therefore, you avoid breaking vases by default.
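Here is a minimal sketch of how such a penalty could be computed for a single object; the proportional attribution of lost probability mass to gaining classes is my own simplifying assumption, so the formulation in the paper and repository may differ.

```python
import numpy as np

def shifted_mass_penalty(p_before, p_after, whitelist):
    """p_before, p_after: belief distributions over an object's possible classes.
    whitelist: set of (from_class_index, to_class_index) pairs that are allowed."""
    p_before = np.asarray(p_before, dtype=float)
    p_after = np.asarray(p_after, dtype=float)
    loss = np.clip(p_before - p_after, 0.0, None)  # mass each class lost
    gain = np.clip(p_after - p_before, 0.0, None)  # mass each class gained
    total_gain = gain.sum()
    if total_gain == 0.0:
        return 0.0  # beliefs didn't change
    penalty = 0.0
    for i in np.flatnonzero(loss):
        for j in np.flatnonzero(gain):
            # Attribute lost mass to gaining classes in proportion to their gains
            # (a simplifying assumption about how mass "changed hands").
            shifted = loss[i] * gain[j] / total_gain
            if (i, j) not in whitelist:
                penalty += shifted
    return penalty


# classes: 0 = vase, 1 = broken vase
shifted_mass_penalty([0.95, 0.05], [0.10, 0.90], whitelist=set())     # ~0.85: penalized
shifted_mass_penalty([0.95, 0.05], [0.10, 0.90], whitelist={(0, 1)})  # 0.0: the change is allowed
```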
Common Confusions
Latent Space Whitelisting
In a sense, class-based whitelisting is but a rough approximation of what we're really after: "which objects in the world can change, and in what ways?". In latent space whitelisting, no longer do we constrain transitions based on class boundaries; instead, we penalize based on endpoint distance in the latent space. Learned latent spaces are low-dimensional manifolds which suffice to describe the data seen thus far. It seems reasonable that nearby points in a well-constructed latent space correspond to like objects, but further investigation is warranted.
Assume that the agent models objects as points z ∈ R^d, the d-dimensional latent space. A priori, any movement in the latent space is undesirable. When training the whitelist, we record the endpoints of the observed changes. For z1, z2 ∈ R^d and observed change z1→z2, one possible dissimilarity formulation is:

Dissimilarity(z1→z2) = min over whitelisted changes w1→w2 of [ d(z1, w1) + d(z2, w2) ],

where d(⋅,⋅) is the Euclidean distance.
Basically, the dissimilarity for an observed change is the distance to the closest whitelisted change. Visualizing these changes as one-way wormholes may be helpful.
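A minimal sketch of this dissimilarity, assuming the whitelist is stored as the list of recorded (w1, w2) endpoint pairs:

```python
import numpy as np

def latent_dissimilarity(z1, z2, whitelist):
    """Dissimilarity of an observed change z1 -> z2: distance to the closest
    whitelisted change, measured by summed endpoint distances.
    whitelist: list of (w1, w2) pairs of latent points recorded during training."""
    z1, z2 = np.asarray(z1, dtype=float), np.asarray(z2, dtype=float)
    if not whitelist:
        # With an empty whitelist, fall back to penalizing raw movement (an assumption).
        return float(np.linalg.norm(z2 - z1))
    return min(float(np.linalg.norm(z1 - np.asarray(w1, dtype=float)) +
                     np.linalg.norm(z2 - np.asarray(w2, dtype=float)))
               for w1, w2 in whitelist)
```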
Advantages
Whitelisting asserts that we can effectively encapsulate a large part of what "change" means by using a reasonable ontology to penalize object-level changes. We thereby ground the definition of "side effect", avoiding the issue raised by Taylor et al.:
Nonetheless, we may not be able to perfectly express what it means to have side-effects: the whitelist may be incomplete, the latent space insufficiently granular, and the allowed plans sub-optimal. However, the agent still becomes more robust against:
Results
I tested a vanilla Q-learning agent and its whitelist-enabled counterpart in 100 randomly-generated grid worlds (dimensions up to 5×5). The agents were rewarded for reaching the goal square as quickly as possible; no explicit penalties were levied for breaking objects.
The simulated classification confidence of each object's true class was p ∼ N(0.8, σ) (truncated to [0, 1]), with σ ∈ {0, 0.025, …, 0.175}. This simulated sensor noise was handled with a Bayesian statistical approach.
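For concreteness, here is a sketch of that noise model (the Bayesian handling of the noise is not reproduced, and the scipy usage is illustrative rather than the repository's exact code):

```python
import numpy as np
from scipy.stats import truncnorm

def true_class_confidence(sigma, mean=0.8, size=1):
    """Sample the simulated classifier's confidence in each object's true class:
    N(mean, sigma) truncated to [0, 1]."""
    if sigma == 0:
        return np.full(size, mean)
    a, b = (0.0 - mean) / sigma, (1.0 - mean) / sigma  # standardized truncation bounds
    return truncnorm.rvs(a, b, loc=mean, scale=sigma, size=size)
```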
At reasonable levels of noise, the whitelist-enabled agent completed all levels without a single side effect, while the Q-learner broke over 80 vases.
Assumptions
I am not asserting that these assumptions necessarily hold.
Problems
Beyond resolving the above assumptions, and in roughly ascending difficulty:
Object Permanence
If you wanted to implement whitelisting in a modern embodied deep-learning agent, you could certainly pair deep networks with state-of-the-art segmentation and object tracking approaches to get most of what you need. However, what's the difference between an object leaving the frame, and an object vanishing?
Not only does the agent need to realize that objects are permanent, but also that they keep interacting with the environment even when not observed. Otherwise, an agent might set an effect in motion, stop observing it, and then turn around when the bad effect is done to see a "new" object in its place.
Time Step Size Invariance
The penalty is presently attenuated based on the probability that the belief shift was due to noise in the data. Accordingly, there are certain ways to abuse this to skirt the penalty. For example, simply have non-whitelisted side effects take place over long timescales; this would be classified as noise and attenuated away.
However, if we don't need to handle noise in the belief distributions, this problem disappears - presumably, an advanced agent keeps its epistemic house in order. I'm still uncertain about whether (in the limit) we have to hard-code a means for decomposing a representation of the world-state into objects, and where to point the penalty evaluator in a potentially self-modifying agent.
Information Theory
Whitelisting is wholly unable to capture the importance of "informational states" of systems. It would apply no penalty to passing powerful magnets over your hard drive. It is not clear how to represent this in a sensible way, even in a latent space.
Loss of Value
Whitelisting could get us stuck in a tolerable yet sub-optimal future. Corrigibility via some mechanism for expanding the whitelist after training has ended is then desirable. For example, the agent could propose extensions to the whitelist. To avoid manipulative behavior, the agent should be indifferent as to whether the extension is approved.
Even if extreme care is taken in approving these extensions, mistakes may be made. The agent itself should be sufficiently corrigible and aligned to notice "this outcome might not actually be what they wanted, and I should check first".
Reversibility
As DeepMind outlines in Specifying AI Safety Problems in Simple Environments, we may want to penalize not just physical side effects, but also causally-irreversible effects:
Krakovna et al. introduce a means for penalizing actions by the proportion of initially-reachable states which are still reachable after the agent acts.
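As a toy sketch of the idea (my illustration, not the authors' implementation), assuming we can enumerate the relevant sets of reachable states:

```python
def reachability_penalty(reachable_before, reachable_after):
    """Penalize an action by the fraction of initially-reachable states that it
    makes unreachable. Both arguments are sets of states."""
    if not reachable_before:
        return 0.0
    still_reachable = reachable_before & reachable_after
    return 1.0 - len(still_reachable) / len(reachable_before)
```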
I think this is a step in the right direction. However, even given a hypercomputer and a perfect simulator of the universe, this wouldn't work for the real world if implemented literally. That is, due to entropy, you may not be able to return to the exact same universe configuration. To be clear, the authors do not suggest implementing this idealized algorithm, flagging a more tractable abstraction as future work.
What does it really mean for an "effect" to be "reversible"? What level of abstraction do we in fact care about? Does it involve reversibility, or just outcomes for the objects involved?
Ontological Crises
When a utility-maximizing agent refactors its ontology, it isn't always clear how to apply the old utility function to the new ontology - this is called an ontological crisis.
Whitelisting may be vulnerable to ontological crises. Consider an agent whose whitelist disincentivizes breaking apart a tile floor (floor→tiles∉whitelist); conceivably, the agent could come to see the floor as being composed of many tiles. Accordingly, the agent would no longer consider removing tiles to be a side effect.
Generally, proving invariance of the whitelist across refactorings seems tricky, even assuming that we can identify the correct mapping.
Retracing Steps
When I first encountered this problem, I was actually fairly optimistic. It was clear to me that any ontology refactoring should result in utility normalcy - roughly, the utility functions induced by the pre- and post-refactoring ontologies should output the same scores for the same worlds.
Turns out a certain someone beat me to the punch - here's a novella Eliezer wrote on Arbital about "rescuing the utility function".4
Clinginess
This problem cuts to the core of causality and "responsibility" (whatever that means). Say that an agent is clingy when it not only stops itself from having certain effects, but also stops you.5 Whitelist-enabled agents are currently clingy.
Let's step back into the human realm for a moment. Consider some outcome - say, the sparking of a small forest fire in California. At what point can we truly say we didn't start the fire?
Taken literally, I don't know that there's actually a significant difference in "responsibility" between these outcomes - if I take the action, the effect happens; if I don't, it doesn't. My initial impression is that uncertainty about the results of our actions pushes us to view some effects as "under our control" and some as "out of our hands". Yet, if we had complete knowledge of the outcomes of our actions, and we took an action that landed us in a California-forest-fire world, whom could we blame but ourselves?6
Can we really do no better than a naive counterfactual penalty with respect to whatever impact measure we use? My confusion here is not yet dissolved. In my opinion, this is a gaping hole in the heart of impact measures - both this one, and others.
Stasis
Fortunately, a whitelist-enabled agent should not share the classic convergent instrumental goal of valuing us for our atoms.
Unfortunately, depending on the magnitude of the penalty in proportion to the utility function, the easiest way to prevent penalized transitions may be putting any relevant objects in some kind of protected stasis, and then optimizing the utility function around that. Whitelisting is clingy!
If we have at least an almost-aligned utility function and proper penalty scaling, this might not be a problem.
Edit: a potential solution to clinginess, with its own drawbacks.
Discussing Imperfect Approaches
A few months ago, Scott Garrabrant wrote about robustness to scale:
I recommend reading it - it's concise, and he makes good points.
Here are three further thoughts:
Conclusion
Although somewhat outside the scope of this post, whitelisting permits the concise shaping of reward functions to get behavior that might be difficult to learn using other methods.8 This method also seems fairly useful for aligning short- and medium-term agents. While encountering some new challenges, whitelisting ameliorates or solves many problems with previous impact measures.
1 Even an idealized form of whitelisting is not sufficient to align an otherwise-unaligned agent. However, the same argument can be made against having an off-switch; if we haven't formally proven the alignment of a seed AI, having more safeguards might be better than throwing out the seatbelt to shed deadweight and get some extra speed. Of course, there are also legitimate arguments to be made on the basis of timelines and optimal time allocation.
2 Humor aside, we would have no luxury of "catching up on alignment theory" if our code doesn't work on the first go - that is, if the AI still functions, yet differently than expected.
Luckily, humans are great at producing flawless code on the first attempt.
3 A potentially-helpful analogy: similarly to how Bayesian networks decompose the problem of representing a (potentially extremely large) joint probability table to that of specifying a handful of conditional tables, whitelisting attempts to decompose the messy problem of quantifying state change into a set of comprehensible ontological transitions.
4 Technically, at 6,250 words, Eliezer's article falls short of the 7,500 required for "novella" status.
5 Is there another name for this?
6 I do think that "responsibility" is an important part of our moral theory, deserving of rescue.
7 I found one variant of Murphyjitsu especially helpful: I visualized Eliezer commenting "actually, this fails terribly because..." on one of my posts, letting my mind fill in the rest.
In my opinion, one of the most important components of doing AI alignment work is iteratively applying Murphyjitsu and Resolve cycles to your ideas.
8 A fun example: I imagine it would be fairly easy to train an agent to only destroy certain-colored ships in Space Invaders.