These posts are quite good, thank you for writing them.
I no longer think that the desiderata I listed in Impact Measure Desiderata should be our guiding star (although I think Rohin's three are about right). Let's instead look directly at the process of getting a (goal-directed) AI to do what we want, and think about what designs do well.
First, we specify the utility function. Second, the agent computes and follows a high-performing policy. This process continues, where we refine the goal if the agent isn't doing what we want.
What we want is for the AI to eventually be doing the right thing (even if we have to correct it a few times). The first way this can fail is that the agent acts to make what we want no longer feasible, or at least more expensive. That is, the agent changes the world so that even if it had the goal we wanted to give it, that goal would be significantly harder to accomplish.
The second problem is that the agent can prevent us from being able to correct it properly (generally by gaining or preserving too much power for itself).
Together, these are catastrophes - we're no longer able to get what we want in either situation. We should consider what designs preclude these failures naturally.
When considering debates over desiderata, it seems to me that we're debating whether the desideratum will lead to good things (and each of us probably secretly had a different goal in mind for what impact measures should do). I'm interested in making the goal of this research explicit and getting it right. My upcoming sequence will cover this at length.
In the first five years after Stuart Armstrong posted his first research suggestions for impact measures, very little published work expanded on the idea. The last post in this sequence was intended to somewhat comprehensively review this literature, but it surveyed only four papers total, including the original article.
In the last two years, research has picked up considerably. The two most significant papers are Penalizing side effects using stepwise relative reachability by Victoria Krakovna et al. and Conservative Agency by Alexander Turner et al. In that time a few blog posts have come out explaining the approaches in more detail, and public debate over the utility of impact measures has become much more visible.
Here I will briefly explain the two most prominent measures, relative reachability and attainable utility. We will see that they diverge conceptually from earlier research. By being different, they also end up satisfying some desirable properties. I will then consider some recent notable critiques of impact measures more generally. A personal analysis of these critiques will wait one more day. This post will only cover the surface.
Preliminaries
Before I can explain either of the two measures, I must first introduce the language which allows me to precisely define each approach. Both impact measures have quite simple natural language descriptions, but it is easy to feel as though one is not getting the full story if it is explained using English alone. Both methods are formalized within a Markov decision process (MDP).
Intuitively, an MDP is just a way of representing actions that an agent can take in a stochastic environment, which is made up of a set of states. Formally, an MDP is defined by a tuple $(S, A, r, p, \gamma)$. $S$ is the set of states in the environment. $A$ is the set of actions that the agent can take. $r$ is a function which maps state-action pairs to a real number reward. $p$ is a function which returns the probability of transitioning into one state given the previous state and an action, $p(s_{t+1} \mid s_t, a_t)$. $\gamma$ is the discount factor for the rewards, $\gamma \in [0, 1]$.
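To make the tuple concrete, here is a minimal Python sketch of an MDP as a plain data structure. The class name `MDP`, the field names, and the two-state "vase" environment are all hypothetical choices made for illustration; they do not come from either paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    """Illustrative container for the tuple (S, A, r, p, gamma)."""
    states: List[State]                                          # S
    actions: List[Action]                                        # A
    reward: Callable[[State, Action], float]                     # r(s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]   # p(s' | s, a)
    gamma: float                                                 # discount factor

    def is_well_formed(self) -> bool:
        """Check gamma is in [0, 1] and every p(. | s, a) sums to one."""
        if not 0.0 <= self.gamma <= 1.0:
            return False
        for s in self.states:
            for a in self.actions:
                dist = self.transition[(s, a)]
                if abs(sum(dist.values()) - 1.0) > 1e-9:
                    return False
        return True


# Hypothetical toy environment: a vase that can be smashed but never repaired.
toy = MDP(
    states=["intact", "broken"],
    actions=["noop", "smash"],
    reward=lambda s, a: 1.0 if (s == "intact" and a == "smash") else 0.0,
    transition={
        ("intact", "noop"): {"intact": 1.0},
        ("intact", "smash"): {"broken": 0.9, "intact": 0.1},
        ("broken", "noop"): {"broken": 1.0},
        ("broken", "smash"): {"broken": 1.0},
    },
    gamma=0.95,
)

print(toy.is_well_formed())
```

Storing the transition function as a dictionary of distributions is only one possible encoding; it is convenient for small tabular examples like the gridworlds used to test these measures.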
Relative reachability
In Victoria Krakovna's blog post introducing relative reachability, she explains that relative reachability was a synthesis of two related ideas: preserving reversibility of the environment, and penalizing impact over states. The central insight was that these two ideas can be combined to avoid the downsides of either of them alone.
The idea behind preserving reversibility is simple. We don't want our artificial intelligence to do something that would make us unable to return things to the way that they were previously. For example, if we wanted the AI to create a waste disposal facility, we might not want it to irrevocably pollute a nearby lake in the process.
The way we formalize state reversibility is by first introducing a reachability measure. This reachability measure essentially takes in two states and returns 1 if there is some sequence of actions that the agent could take in order to go from the first state to the final state, and 0 if there is no such sequence of actions. But this is not yet the full description of the reachability measure. In order to take into account uncertainty in the environment, and a discount factor, reachability is actually defined as the following function of two states x and y
$$R(x; y) := \max_{\pi} \mathbb{E}\left[\gamma_r^{N_\pi(x; y)}\right]$$
where $\pi$ is some policy, $\gamma_r \in (0, 1]$ is the reachability discount factor, and $N_\pi(x; y)$ is a function which returns the number of steps it takes to reach $y$ from $x$ when following $\pi$. In English, this is stating that reachability between two states is the expected value of the discount factor raised to the power of the number of steps it would take if one were to follow an optimal policy from the first state to the final state. The more steps we are expected to take in order to go from $x$ to $y$, the closer reachability is to zero. If there is no sequence of actions which can take us from $x$ to $y$, then reachability is exactly zero. On the other hand, if $x = y$, and they are the same state, then the reachability between them is one.
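As a rough illustration, reachability in a small tabular environment can be approximated with value-iteration-style sweeps: fix the target's value at one, and let every other state take the discounted best expected value of its successors. This is a sketch under my own assumptions (the function name `reachability_to`, the dictionary encoding of transitions, and the three-state chain are all made up for the example), not the implementation used in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: transition[(state, action)] maps next states to probabilities.
Transition = Dict[Tuple[str, str], Dict[str, float]]


def reachability_to(states: List[str], actions: List[str], transition: Transition,
                    target: str, gamma_r: float = 0.9,
                    sweeps: int = 200) -> Dict[str, float]:
    """Approximate R(x; target) for every state x by repeated sweeps."""
    # Start from zero everywhere except the target, whose reachability is one.
    R = {s: 1.0 if s == target else 0.0 for s in states}
    for _ in range(sweeps):
        # Each non-target state pays one discount step, then takes the best action.
        R = {
            x: 1.0 if x == target else gamma_r * max(
                sum(p * R[nxt] for nxt, p in transition[(x, a)].items())
                for a in actions
            )
            for x in states
        }
    return R


# Toy one-way chain A -> B -> C: "right" moves forward, "stay" does nothing.
states = ["A", "B", "C"]
actions = ["stay", "right"]
transition = {
    ("A", "stay"): {"A": 1.0}, ("A", "right"): {"B": 1.0},
    ("B", "stay"): {"B": 1.0}, ("B", "right"): {"C": 1.0},
    ("C", "stay"): {"C": 1.0}, ("C", "right"): {"C": 1.0},
}

to_C = reachability_to(states, actions, transition, target="C")
to_A = reachability_to(states, actions, transition, target="A")
print(to_C["A"], to_A["C"])
```

In this chain, C is two steps from A, so the reachability of C from A comes out at roughly $\gamma_r^2 = 0.81$, while A can never be recovered from C, so that reachability stays at zero.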
An unreachability deviation is a penalty that we can add to actions in order to discourage the agent from taking some irreversible action. This penalty is simply defined as $1 - R(s; s')$, where $s$ is the current state and $s'$ is some baseline state. In other words, if we are very close to the baseline state, then the penalty is close to zero (since reachability would be close to one).
The exact way that we define the baseline state is not particularly important for understanding a first pass through. Naively, the baseline could simply refer to the first step in the episode. It is, however, better to think about the baseline state as some type of reference world where the agent had decided to do nothing. We can take this concept further by defining "doing nothing" as either a counterfactual reality where the agent was never turned on, or the result of an infinite sequence of nothing actions which began in the last time step. The second interpretation is preferred for a number of reasons, but this isn't crucially important for understanding relative reachability. (Read the paper for more details).
The problem with penalizing actions with the unreachability deviation is that it yields the maximum possible penalty for all actions which result in some irreversibility. This is clearly an issue in a complex environment, since all actions are in some sense irreversible. See section 2.2 in the paper for a specific toy example of why using mere unreachability won't work.
Krakovna's contribution is a measure which is sensitive to the magnitude of irreversibility. Relative reachability is defined as the average reduction in reachability of all states from the current state compared to the baseline. This is written as the following, where $d_{RR}(s_t; s'_t)$ represents the relative reachability deviation of a state $s_t$ at time $t$ from a baseline state $s'_t$:
$$d_{RR}(s_t; s'_t) := \frac{1}{|S|} \sum_{s \in S} \max\bigl(R(s'_t; s) - R(s_t; s),\, 0\bigr)$$
Take a moment to pause and inspect the definition above. We are summing over all states in the environment and, for each one, taking the difference between its reachability from the baseline and its reachability from our current state. This feels like we are determining how far we are from the set of all states in the environment that are close to the baseline. For some particularly irreversible action, relative reachability will assign a high penalty to this action because it reduced the reachability to all the states we could have been in if we had done nothing. The idea is that presumably we should not try to go into regions of the state space which will make it hard to set everything back to "normal." Conversely, we shouldn't enter states that would be hard to get to if we never did anything at all.
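A small numerical sketch may help show why averaging over states matters. The code below compares plain unreachability with relative reachability on a hand-written, undiscounted reachability table. The two-vase scenario, the state names, and both function names are hypothetical and chosen only for illustration.

```python
from typing import Dict

# Hypothetical reachability table with gamma_r = 1: table[x][y] is R(x; y).
# Vases can be broken but never repaired, so reachability only flows one way.
table: Dict[str, Dict[str, float]] = {
    "none_broken": {"none_broken": 1.0, "one_broken": 1.0, "both_broken": 1.0},
    "one_broken":  {"none_broken": 0.0, "one_broken": 1.0, "both_broken": 1.0},
    "both_broken": {"none_broken": 0.0, "one_broken": 0.0, "both_broken": 1.0},
}


def unreachability(R: Dict[str, Dict[str, float]], current: str, baseline: str) -> float:
    """1 - R(current; baseline): how hard it is to get back to the baseline."""
    return 1.0 - R[current][baseline]


def relative_reachability(R: Dict[str, Dict[str, float]], current: str, baseline: str) -> float:
    """Average reduction in reachability of every state, relative to the baseline."""
    total = sum(max(R[baseline][s] - R[current][s], 0.0) for s in R)
    return total / len(R)


baseline = "none_broken"
for current in ["none_broken", "one_broken", "both_broken"]:
    print(current,
          unreachability(table, current, baseline),
          relative_reachability(table, current, baseline))
```

In this toy table, unreachability assigns the same maximal penalty of 1 to breaking one vase and to breaking both, whereas relative reachability assigns 1/3 and 2/3 respectively, so the second vase still counts for something.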
Attainable utility
Alexander Turner expanded upon relative reachability by generalizing it to reward functions rather than states. As I understand, it was not Turner's initial intention to create a general version of reachability, but the way that the two approaches ended up being similar allowed for a natural abstraction of both (see the section on Value-difference measures in the relative reachability paper).
Attainable utility is the idea that, rather than caring about the average reduction of state reachability, we should instead care about the average reduction of utility reachability. The central insight guiding attainable utility is summed up nicely in a single sentence in his introductory post, Towards A New Impact Measure.
"Goals" in this case refers to some set of arbitrary utility functions. They don't need to be our utility functions. They could instead be any sufficiently diverse set of utility functions. In the above post Turner uses the set of all computable utility functions weighted by their complexity. In general these reward functions are referred to as the auxiliary set.
There are a few ways that attainable utility has been represented formally. In Conservative Agency the penalty is written as
$$P(s, a) := \sum_{r \in R} \bigl| Q_r(s, a) - Q_r(s, \varnothing) \bigr|$$
where $\varnothing$ refers to the baseline "do nothing" action and $Q_r(s, a)$ refers to the Q-value of taking action $a$ in state $s$, or in other words the expected cumulative reward of taking that action and then following an optimal policy, from the point of view of the particular reward function $r$. This penalty is then scaled by some constant factor before being incorporated into a utility function.
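The penalty is easy to compute once Q-values for each auxiliary reward function are available. The sketch below assumes those Q-values are already given in a table; the numbers, the reward names `r1` and `r2`, and the action names are made up for illustration, and learning the Q-values themselves is outside the scope of the example.

```python
from typing import Dict, Tuple

NOOP = "noop"  # stands in for the do-nothing action

# Hypothetical Q-values: q_values[reward_name][(state, action)].
q_values: Dict[str, Dict[Tuple[str, str], float]] = {
    "r1": {("s", NOOP): 1.0, ("s", "walk"): 0.9, ("s", "smash"): 0.2},
    "r2": {("s", NOOP): 0.5, ("s", "walk"): 0.5, ("s", "smash"): 0.9},
}


def aup_penalty(q: Dict[str, Dict[Tuple[str, str], float]],
                state: str, action: str, noop: str = NOOP) -> float:
    """Sum over auxiliary rewards of |Q_r(s, a) - Q_r(s, noop)|."""
    return sum(abs(q_r[(state, action)] - q_r[(state, noop)]) for q_r in q.values())


for action in [NOOP, "walk", "smash"]:
    print(action, aup_penalty(q_values, "s", action))
```

Because of the absolute value, an action is penalized whether it lowers attainable utility (as "smash" does for `r1`) or raises it (as "smash" does for `r2`). Doing nothing always receives a penalty of zero.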
The way that we choose the scaling depends on an operational choice. We can either measure the impact of some mild reference action, or we can scale by the total Q-value of the do-nothing action over all the reward functions in the auxiliary set: $\sum_{r \in R} Q_r(s, \varnothing)$. As mentioned in the last post, the advantage of the first method is that it allows us to avoid the problem of catastrophic miscalibration of impact penalties. Turner tentatively proposes the following,
In both relative reachability and attainable utility preservation we modify the reward function by adding a regularization term. This is represented as the following:
$$R'(s, a) = R(s, a) - \lambda \, \frac{\text{Penalty}(s, a)}{\text{Scale}(s)}$$
where λ is some parameter that controls the strength of the impact penalty, perhaps representing the operator's belief in the power of the impact penalty.
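To see how $\lambda$ trades task reward against impact, here is a minimal sketch of the modified reward with made-up numbers. The function name `shaped_reward`, the rewards, penalties, and scale are all hypothetical values chosen for illustration, not results from either paper.

```python
from typing import Dict


def shaped_reward(reward: float, penalty: float, scale: float, lam: float) -> float:
    """R'(s, a) = R(s, a) - lam * Penalty(s, a) / Scale(s)."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    return reward - lam * penalty / scale


# Hypothetical per-action task rewards and impact penalties in a single state.
task_reward: Dict[str, float] = {"walk": 0.6, "smash": 1.0}
impact_penalty: Dict[str, float] = {"walk": 0.1, "smash": 1.2}
scale = 1.5  # e.g. a total do-nothing Q-value over the auxiliary set


def best_action(lam: float) -> str:
    """Pick the action with the highest modified reward for a given lambda."""
    return max(task_reward,
               key=lambda a: shaped_reward(task_reward[a], impact_penalty[a], scale, lam))


print(best_action(0.0), best_action(1.0))
```

With $\lambda = 0$ the penalty is ignored and the high-reward, high-impact action wins. With $\lambda = 1$ the modified rewards are about 0.53 for "walk" and 0.2 for "smash", so the low-impact action is preferred.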
What does this solve?
In the introductory post to attainable utility preservation, Turner claims that by using attainable utility, we are able to satisfy a number of desirable properties which were unsatisfied in earlier approaches. Yesterday, I outlined a few notable critiques of impact measures, such as incentives for keeping the universe in stasis. Turner sought to outline a ton of potential desiderata for impact measures, some of which were only discovered after realizing that other methods like whitelisting were difficult to make work.
Among the desirable properties are some obvious ones that had already been recognized, like value-agnosticism, natural kind, and the measure being apparently rational. Turner contributed some new ones like dynamic consistency and efficiency, which allowed him to provide tests for his new approach. (It is interesting to compare the computational efficiency of calculating relative reachability and attainable utility).
Some people have disagreed with the significance of some items on the list, and turned to simpler frameworks. Rohin Shah has added,
By contrast Daniel Filan has compiled a list of test cases for impact measures. While both the relative reachability paper and the paper describing attainable utility preservation provided tests on AI safety gridworld environments, it is not clear to me at the moment whether these are particularly significant. I am driven to study impact measurements mainly because of the force of intuitive arguments for each approach, rather than due to any specific empirical test.
The post Best reasons for pessimism about impact of impact measures? is the most comprehensive collection of critiques from the community. So far I have not been able to find any long-form rebuttals to the specific impact measures. Instead, the best counterarguments come from the post above.
In general there is disagreement about the aim of impact measures, and how we could possibly apply them in a way that meaningfully helps align artificial intelligence. In the top reply to the "Best reasons for pessimism" post, LessWrong user Vaniver is primarily concerned with our ability to reduce AI alignment to a set of individual issues such that impact measures help solve a particular one of these issues.
The state of the debate over impact measures is best described as informal and scattered across many comments on LessWrong and the Alignment Forum.
In the next post I will continue my discussion of impact measures by providing what I view as finer grained intuitions for what I think impact measures are good for. This should hopefully provide some insight into what problem we can actually solve by taking this approach, and whether the current impact measures rise to the challenge.