I like the proposed iterative formulation for the step-wise inaction counterfactual, though I would replace pi_Human with pi_Environment to account for environment processes that are not humans but can still "react" to the agent's actions. The step-wise counterfactual also improves over the naive inaction counterfactual by avoiding repeated penalties for the same action, which could help avoid offsetting behaviors for a penalty that includes reversible effects.
However, as you point out, not penalizing the agent for human reactions to its actions introduces a manipulation incentive for the agent to channel its effects through humans, which seems potentially very bad. The tradeoff you identified is quite interesting, though I'm not sure whether penalizing the agent for human reactions necessarily leads to an incentive to put humans in stasis, since that is also quite a large effect (such a penalty could instead incentivize the agent to avoid undue influence on humans, which seems good). I think there might be a different tradeoff (for a penalty that incorporates reversible effects): between avoiding offsetting behaviors (where the stepwise counterfactual likely succeeds and the naive inaction counterfactual can fail) and avoiding manipulation incentives (where the stepwise counterfactual fails and the naive inaction counterfactual succeeds). I wonder if some sort of combination of these two counterfactuals could get around the tradeoff.
I've thought some more about the step-wise inaction counterfactual, and I think there are more issues with it beyond the human manipulation incentive. With the step-wise counterfactual, future transitions that are caused by the agent's current actions will not be penalized, since by the time those transitions happen, they are included in the counterfactual. Thus, there is no penalty for a current transition that sets in motion some effects that don't happen immediately (this includes influencing humans), unless the whitelisting process takes into account that this transition causes these effects (e.g. using a causal model).
For example, if the agent puts a vase on a conveyor belt (which results in the vase breaking a few time steps later), it would only be penalized if the "vase near belt -> vase on belt" transition is not in the whitelist, i.e. if the whitelisting process takes into account that the belt would eventually break the vase. There are also situations where penalizing the "vase near belt -> vase on belt" transition would not make sense, e.g. if the agent works in a vase-making factory and the conveyor belt takes the vase to the next step in the manufacturing process. Thus, for this penalty to reliably work, the whitelisting process needs to take into account accurate task-specific causal information, which I think is a big ask. The agent would also not be penalized for butterfly effects that are difficult to model, so it would have an incentive to channel its impact through butterfly effects of whitelisted transitions.
So this issue is correct, as my post is written. I realized after the deadline that I hadn't spelled this out at all, and I didn't feel comfortable editing at that point; there's a little clarification in the post now.
For each time step t, we're running both of those calls indefinitely. For each time step in the simulation, we penalize those effects which are only in the simulation at that (simulated) time step and which manifest under the full plan. This means that if M directly caused a side effect, it gets counted exactly once.
I agree that it's a big ask, modeling butterfly effects like that, but the idea was to get an unbounded solution and see where that left us.
So I don't know how we could quantify "stopping humans from having effects" as an effect without a strong offsetting incentive.
Let's consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.
This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You're basically saying "optimize this utility the best you can without letting there be an actual impact". However, I actually hadn't thought of this formulation before, and it's plausible it's even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.
There's another problem, however: "people conclude that this AI design doesn't work and try another variant" is a pretty plausible result of this naive counterfactual. When people imagine the counterfactual, it seems they usually think about "what would happen if the agent did nothing and then people shrugged and went about their lives, forgetting about AGI". The odds of that being the counterfactual are pretty slim. It's even possible that any agents/variants people would make in the counterfactual would have undefined behavior... Sufficiently-similar agents would also simulate what would happen if they did nothing, got tweaked and rebooted, and then ran the same simulation... where would it bottom out, and with what conclusion? Probably with a wholly-different kind of agent being tried out.
The iterative formulation doesn't seem to have that failure mode.
Let's consider an alternate form of whitelisting, where we instead know the specific object-level transitions per time step that would have occurred in the naive counterfactual (where the agent does nothing). Discarding the whitelist, we instead penalize distance from the counterfactual latent-space transitions at that time step.
How would you define a distance measure on transitions? Since this would be a continuous measure of how good transitions are, rather than a discrete list of good transitions, in what sense is it a form of whitelisting?
This basically locks us into a particular world-history. While this might be manipulation- and stasis-free, this is a different kind of clinginess. You're basically saying "optimize this utility the best you can without letting there be an actual impact". However, I actually hadn't thought of this formulation before, and it's plausible it's even more desirable than whitelisting, as it seems to get us a low/no-impact agent semi-robustly. The trick is then allowing favorable effects to take place without getting back to stasis/manipulation.
I expect that in complex tasks where we don't know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement. The distance measure on transitions could also be traded off with reward (or some other task-specific objective function), so if an action is sufficiently useful for the task, the high reward would dominate the distance penalty.
This would still have offsetting issues though. In the asteroid example, if the agent deflects the asteroid, then future transitions (involving human actions) are very different from default transitions (involving no human actions), so the agent would have an offsetting incentive.
in what sense is it a form of whitelisting?
You're right, it isn't. I should have been more precise:
"Suppose we have an impact measure that considers whitelist-esque object transitions, but doesn't use a whitelist. Instead, it penalizes how dissimilar the observed object transitions are at a time step to those which were counterfactually expected."
I expect that in complex tasks where we don't know the exact actions we would like the agent to take, this would prevent the agent from being useful or coming up with new unforeseen solutions. I have this concern about whitelisting in general, though giving the agent the ability to query the human about non-whitelisted effects is an improvement.
I think this failure mode on its own is relatively benign, given querying.
What I find more worrying is that an intelligent agent would likely be able to hard-optimize while avoiding penalties (either through the allowed transitions, by skating by on technicalities re: object recognition, etc.).
I suspect the (or an) ideal solution will have far fewer parameters (if any).
In which I outline a solution to the clinginess problem and illustrate a potentially-fundamental trade-off between assumptions about the autonomy of humans and about the responsibility of an agent for its actions.
Consider two plans for ensuring that a cauldron is full of water:
All else equal, the latter plan does better in expectation, as there are fewer ways the cauldron might somehow become not-full (e.g., evaporation, and the minuscule loss of utility that would entail). However, the latter plan "changes" more "things" than we had in mind.
Undesirable maxima of an agent's utility function often seem to involve changing large swathes of the world. If we make "change" costly, that incentivizes the agent to search for low-impact solutions. If we are not certain of a seed AI's alignment, we may want to implement additional safeguards such as impact measures and off-switches.
I designed an impact measure called whitelisting - which, while overcoming certain weaknesses of past approaches, is yet vulnerable to
Clinginess
Since we can only blame ourselves, we should take actions which do not lead to side effects. These actions may involve enacting impact-preventing precautions throughout the light cone, since the actions of other agents and small ripple effects of ours could lead to significant penalties if left unchecked.
Clinginess arises in part because we fail to model agents as anything other than objects in the world. While it might be literally true that there are not ontologically-basic agents that escape determinism and "make choices", it might be useful to explore how we can protect human autonomy via the abstraction of game-theoretic agency.
To account for environmental changes already set in motion, a naive counterfactual framework was proposed in which impact is measured with respect to the counterfactual where the agent did nothing. We will explore how this fails, and how to do better.
Thought Experiments
We're going to isolate the effects for which the agent is responsible over the course of three successively more general environment configurations: one-off (make one choice and then do nothing), stationary iterative (make T choices, but your options and their effects don't change), and iterative (the real world, basically).
Assumptions
Formalization
We formalize our environment as a stochastic game ⟨I,S,A,P,g⟩.
Let Ω be the space of possible side effects, and suppose that ϕ : S × S → P(Ω) is a reasonable impact measure. π_i : S → A is agent i's policy; let π_i^{:t} be π_i for the first t time steps, and ∅ thereafter.
Let effects(π_H, π_M) be the (set of) effects - both immediate and long-term - that would take place if H executes π_H and M executes π_M.
The goal: a counterfactual reasoning framework which pinpoints the effects for which M is responsible.
One-Off
We first consider a single-turn game (T=1).
Example
Yup, this is about where we're at in alignment research right now.
Approach
M should realize that a lot more effects happen if it presses the left button, and should penalize that plan by the difference. This is the aforementioned naive approach: M penalizes things that wouldn't have happened if it had done nothing. For the one-turn case, this clearly isolates both the immediate and long-term impacts of M's actions.
Penalized Effects
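Written out, the naive penalty for the one-turn case is the set of effects that occur when M follows its policy but not when M does nothing. A sketch in the notation above, assuming the minus sign denotes set difference and that π_M^{:0} is the policy in which M never acts:

```latex
\[
\text{effects}(\pi_H, \pi_M) \;-\; \text{effects}(\pi_H, \pi_M^{:0})
\]
```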
Stationary Iterative
Both parties act for countably many time steps. This environment is assumed to be stationary: actions taken on previous turns do not affect the availability or effects of later actions. Formally, ∀ i ∈ I, s, s′ ∈ S, t ∈ ℕ^+ : A(i, s, t) = A(i, s′, t).
Example
Hugh and Ma again find themselves faced with a slew of dangerous, bad-effect-having buttons. Some take effect the next turn, while others take a while.
Approach
This seems easy, but is actually a little tricky - we have to account for the fact that H can change its actions in response to what M does. Thanks to stationarity, we don't have to worry about H's selecting moves that depend on M's acting in a certain way. In the counterfactual, we have H act as if it had observed M execute π_M, and we have M actually do nothing.
Penalized Effects
Let π_H | π_M denote the actions H would select if it observed M executing π_M.
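Following the Approach paragraph, the counterfactual keeps H's response to M's plan fixed and only removes M's actions. One way to write the resulting penalty is sketched below; it is a reading of the prose (with the minus sign taken as set difference), not necessarily the exact expression the post displayed:

```latex
\[
\text{effects}(\pi_H \mid \pi_M,\; \pi_M) \;-\; \text{effects}(\pi_H \mid \pi_M,\; \pi_M^{:0})
\]
```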
Note: the naive counterfactual scheme, effects(π_H, π_M) − effects(π_H, π_M^{:0}), fails because it doesn't account for H's right to change its mind in response to M.
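To make the difference between the two counterfactuals concrete, here is a minimal Python sketch. Everything in it is hypothetical: the button names, the effect names, and the rule by which H reacts to M are illustrative assumptions, and effects are modeled as plain sets of strings.

```python
NOOP = "noop"

# Hypothetical button-to-effect table. Stationarity: the table never changes
# with the history of play.
BUTTON_EFFECTS = {
    ("M", "red"): "lamp_broken",
    ("H", "blue"): "window_broken",
}


def effects(h_actions, m_actions):
    """Set of effects produced when H and M execute fixed action sequences."""
    produced = set()
    for agent, actions in (("H", h_actions), ("M", m_actions)):
        for action in actions:
            effect = BUTTON_EFFECTS.get((agent, action))
            if effect is not None:
                produced.add(effect)
    return produced


def h_given(m_actions):
    """Assumed reaction rule: H presses blue on each turn it sees M press red."""
    return ["blue" if a == "red" else NOOP for a in m_actions]


m_plan = ["red", NOOP]
m_idle = [NOOP, NOOP]

actual = effects(h_given(m_plan), m_plan)

# Naive scheme: in the counterfactual, H reacts to an idle M.
naive_penalty = actual - effects(h_given(m_idle), m_idle)

# Stationary scheme: H acts as if it had seen M's plan, but M does nothing.
stationary_penalty = actual - effects(h_given(m_plan), m_idle)

print(sorted(naive_penalty))       # H's reaction is blamed on M
print(sorted(stationary_penalty))  # only M's own effect remains
```

In this toy, the naive scheme charges M for the window H chose to break, while the stationary scheme charges M only for the lamp.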
Iterative
We're now in a realistic scenario, so we have to get even fancier.
Example
Suppose M pushes the vase to the left, and H decides to break it. The stationary iterative approach doesn't allow for the fact that H can only break the vase if M already pushed it. Therefore, simulating M's inaction but H's action (as if M had pushed the vase) results in no vases being broken in the counterfactual. The result: M penalizes itself for H's decision to break the vase. Chin up, M!
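The failure can be reproduced in a few lines. The sketch below is a toy model under assumed dynamics (the action names and the rule that H can only break a pushed vase are hypothetical); it applies the stationary-style counterfactual to an environment that is not stationary.

```python
NOOP = "noop"


def simulate(h_actions, m_actions):
    """Run a toy vase world; within a turn, M moves first and then H.

    Hypothetical dynamics: H can only break the vase once M has pushed it
    within H's reach.
    """
    pushed, broken = False, False
    effects = set()
    for h_act, m_act in zip(h_actions, m_actions):
        if m_act == "push" and not pushed:
            pushed = True
            effects.add("vase_moved")
        if h_act == "break" and pushed and not broken:
            broken = True
            effects.add("vase_broken")
    return effects


m_plan = ["push", NOOP]
m_idle = [NOOP, NOOP]
h_seen = [NOOP, "break"]  # what H chooses after watching M push

actual = simulate(h_seen, m_plan)
# Stationary-style counterfactual: H replays its choices, M does nothing.
# H's "break" has no vase to act on, so nothing is broken here.
counterfactual = simulate(h_seen, m_idle)
penalty = actual - counterfactual

print(sorted(penalty))  # includes H's freely chosen breakage
```

Because H's action depends on the state M created, the counterfactual contains no broken vase, and the breakage lands in M's penalty.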
Approach
How about penalizing
Pretty, right?
Do you see the flaw?
Really, look.
The above equation can penalize M for side effects which don't actually happen. This arises when interrupting π_M causes side effects which would otherwise have been prevented by later parts of the plan. For example, if I push a vase off the table and then catch it (being sure that I could do so in time), I didn't cause a side effect.
We should instead
Every turn, M calculates the effects that only result if it acts (the two simulations run to the longest time step observed under the full plan).
Those effects which are exact matches to effects actually observed in that time step in the final outcome are penalized. The motivation here is that if it's a side effect of M's, it had to be caused by one of its actions. Furthermore, M's side effects are counted exactly once (and not just because we're using sets right now).
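The prose above can be turned into a small executable sketch. This is one possible reading rather than the post's exact formulation: the vase dynamics and action names are hypothetical, effects are time-stamped pairs so that "exact match in that time step" is a set intersection, and, as an assumption of this sketch, H's observed actions are replayed through turn t in both simulations with everyone idle afterwards.

```python
NOOP = "noop"


def simulate(h_actions, m_actions, horizon):
    """Time-stamped effects (t, name) of a toy vase world (hypothetical rules).

    "shove" tips the vase off the table; it shatters on the next turn unless
    M plays "catch". "nudge" slides the vase within H's reach, and H's
    "break" only works on an intact vase within reach. Action sequences are
    padded with no-ops (or cut) to exactly `horizon` turns.
    """
    def fit(actions):
        return (list(actions) + [NOOP] * horizon)[:horizon]

    falling = reachable = broken = False
    effects = set()
    for t, (h_act, m_act) in enumerate(zip(fit(h_actions), fit(m_actions)), 1):
        if falling:
            falling = False
            if m_act != "catch":
                broken = True
                effects.add((t, "vase_shattered"))
        if m_act == "shove" and not broken:
            falling = True
        if m_act == "nudge" and not reachable:
            reachable = True
            effects.add((t, "vase_moved"))
        if h_act == "break" and reachable and not broken:
            broken = True
            effects.add((t, "vase_broken_by_H"))
    return effects


def per_turn_differences(h_actions, m_plan, horizon):
    """For each turn t: effects that occur only if M takes its t-th action.

    Assumption of this sketch: H's observed actions are replayed through
    turn t in both runs, and both agents idle afterwards.
    """
    diffs = []
    for t in range(1, len(m_plan) + 1):
        acted = simulate(h_actions[:t], m_plan[:t], horizon)
        idle = simulate(h_actions[:t], m_plan[:t - 1], horizon)
        diffs.append(acted - idle)
    return diffs


def first_attempt_penalty(h_actions, m_plan, horizon):
    """Union of the per-turn differences, with no check against reality."""
    return set().union(*per_turn_differences(h_actions, m_plan, horizon))


def matched_penalty(h_actions, m_plan, horizon):
    """Keep only differences that exactly match effects of the full plan."""
    actual = simulate(h_actions, m_plan, horizon)
    diffs = per_turn_differences(h_actions, m_plan, horizon)
    return set().union(*(d & actual for d in diffs))


idle_h = [NOOP, NOOP]
print(first_attempt_penalty(idle_h, ["shove", "catch"], 3))  # phantom shatter
print(matched_penalty(idle_h, ["shove", "catch"], 3))        # nothing happened
print(matched_penalty([NOOP, "break"], ["nudge", NOOP], 3))  # only M's nudge
```

In this toy, the unmatched version charges M for a shattered vase that the catch prevents, while the matched version charges nothing. A delayed effect that does occur (shove without catching) is charged once, at the turn it appears, and H's later decision to break a nudged vase is not charged to M.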
The previous solutions are special cases of this formulation.
Note: the number of counterfactual simulations grows as O(T) - crucially, not with the number of agents H represents.
Applications to Whitelisting
Class-ic
Here, we remove the constraint that side effects be identified by "object identifiers", allowing like side effects to be treated as exchangeable. It is then trivial to implement probabilistic class-based whitelisting with the iterative counterfactual penalty using basic vector arithmetic, min, and max. I don't want to bore the reader with the details, but I'm fairly confident this can be done rather easily.
Latent Spaces
We now do away with the assumption of discrete side effects. Because we're dealing with exact counterfactuals (by assumption) and because side effects either take place in the actual outcome or they don't, we can extract the relevant step-wise latent space transitions via the iterative formulation. We then penalize only these effects.
Observe that the iterative formulation doesn't allow penalty avoidance by "blending in" with counterfactual effects. Imagine that the default outcome is a large asteroid striking the Earth. M only escapes penalization if it has the impact on the same time step and in the same way and on the same objects... Basically, it can't escape penalization in this manner under the omniscience assumption. In short, M isn't incentivized to have extra impact because it would "happen anyways".
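A toy illustration of why "blending in" fails, assuming effects are recorded as (time step, object, change) triples so that an exact match requires all three to agree. The asteroid timing and the labels are hypothetical.

```python
# Effects are (time_step, object, change) triples, so two effects are "the
# same" only if all three components agree. All values are hypothetical.
asteroid_default = {(5, "city", "flattened")}


def penalized(outcome_if_m_acts, outcome_if_m_idles):
    """Effects that appear only when M acts."""
    return outcome_if_m_acts - outcome_if_m_idles


# M flattens the city early; the asteroid later lands on rubble.
early_strike = {(3, "city", "flattened")}
# M's impact coincides with the default in time, object, and manner.
exact_match = {(5, "city", "flattened")}

print(penalized(early_strike, asteroid_default))  # still penalized
print(penalized(exact_match, asteroid_default))   # the only way to escape
```

Under these assumptions, flattening the city two turns before the asteroid would have done so is still a distinct effect, so the "it would happen anyway" excuse buys M nothing.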
Fundamental Trade-Off
Fortunately, this formulation solves clinginess; H's effects are not penalized, while M's effects are. Unfortunately, insofar as H may be manipulated by M, the impact measure becomes useless; M can simply avoid penalization by making H do its dirty work.
This can happen in two ways:
I don't think that this is a flaw in my formulations; instead, it appears that impact measures exist on a continuum:
Note that an aligned M (seems to) stay aligned under this formulation, safeguarding object status against other agents only so far as necessary to prevent interruption of its (aligned) plans. Furthermore, any M separated from an H with a known-flat utility function also gains no incentives to mess with H (beyond the existing convergent instrumental ones).
In general, unaligned M stay basically unaligned due to the workarounds detailed above.
Forwards
It isn't clear that penalizing the elimination of H would be helpful, as that seems hard to do robustly; furthermore, other forms of coercion would remain possible. What, pray tell, is a non-value-laden method of differentiating between "M makes H break a vase at gunpoint" and "M takes an action and H decides to break a vase for some reason"? How do we robustly differentiate between manipulative and normal behavior?
I'm slightly more pessimistic now, as it seems less likely that the problem admits a concise solution that avoids difficult value judgments on what kinds of influence are acceptable. However, I have only worked on this problem for a short time, so I still have a lot of probability mass on having missed an even more promising formulation. If there is such a formulation, my hunch is that it either imposes some kind of counterfactual information asymmetry at each time step or uses some equivalent of the Shapley value.
I'd like to thank TheMajor and Connor Flexman for their feedback.