Nice post!
One related thing I was thinking about last week: part of the idea of abstraction is that we can pick a Markov blanket around some variable X, and anything outside that Markov blanket can only "see" abstract summary information f(X). So, if we have a goal which only cares about things outside that Markov blanket, then that goal will only care about f(X) rather than all of X. This holds for any goal which only cares about things outside the blanket. That sounds like instrumental convergence: any goal which does not explicitly care about things near X itself will care only about controlling f(X), not all of X.
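Spelled out a bit more, as a rough sketch rather than a worked-out result: write Y for the variables outside the blanket and u for a utility function that depends only on Y. Both symbols are introduced here for illustration, and the screening-off condition is taken as an assumption.

```latex
% Assumption: Y is everything outside the Markov blanket,
% and the summary f(X) screens X off from Y.
P(Y \mid X) = P(Y \mid f(X))
\quad\Longrightarrow\quad
\mathbb{E}\left[ u(Y) \mid X \right] = \mathbb{E}\left[ u(Y) \mid f(X) \right]
\quad \text{for any utility } u \text{ that depends only on } Y.
```

Under that assumption, two values of X with the same summary f(X) are interchangeable as far as such a goal is concerned.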
This isn't quite the same notion of goal-locality that the OP is using (it's not about how close the goal-variables are to the agent), but it feels like there are some overlapping ideas there.
The more I think about it, the more I come to believe that locality is very related to abstraction. Not the distance part necessarily, but the underlying intuition. If my goal is not "about the world", then I can throw away almost all information about the world except a few details and still be able to check my goal. The "world" of the thermostat is in that sense a very abstracted map of the world where anything except the number on its sensor is thrown away.
Thanks! Glad that I managed to write something that was not causally or rhetorically all wrong. ^^
One related thing I was thinking about last week: part of the idea of abstraction is that we can pick a Markov blanket around some variable X, and anything outside that Markov blanket can only "see" abstract summary information f(X). So, if we have a goal which only cares about things outside that Markov blanket, then that goal will only care about f(X) rather than all of X
That makes even more sense to me than you might think. My intuitions about locality come from its uses in distributed computing, where it measures both how many rounds of communication are needed to solve a problem and how far in the communication graph one needs to look to compute one's own output. This looks like my use of locality here.
On the other hand, recent work on distributed complexity also studied the volume complexity of a problem: the size of the subgraph one needs to look at, which might be very different from a ball. The only real constraint is connectedness. Modulo the usual "exactness issue", which we can deal with by replacing "the node is not used" by "only f(X) is used", this looks a lot like your idea.
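To make the radius-versus-volume contrast concrete, here is a toy sketch. The graph, the node names and the function names are all made up for illustration. The "volume" computed here is the union of shortest paths from the node to its targets, which is only an upper bound on the smallest connected subgraph containing them, not the definition used in the distributed-computing literature.

```python
from collections import deque


def bfs_parents(graph, source):
    """Breadth-first search; maps each reachable node to its BFS parent."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent


def path_to_source(parent, node):
    """Nodes on the BFS shortest path from `node` back to the source."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]  # KeyError if `node` is unreachable
    return path


def radius_ball_volume(graph, source, targets):
    """Return (radius, ball size, connected-subgraph size) for a non-empty
    list of reachable targets."""
    parent = bfs_parents(graph, source)
    paths = [path_to_source(parent, t) for t in targets]
    # Radius: hop distance to the farthest target.
    radius = max(len(p) - 1 for p in paths)
    # Ball: every node within `radius` hops of the source.
    ball = [n for n in parent if len(path_to_source(parent, n)) - 1 <= radius]
    # Connected subgraph: only the nodes on shortest paths to the targets.
    connected = set().union(*paths)
    return radius, len(ball), len(connected)


# Hypothetical graph: three branches leave "me", but only one leads to "t".
GRAPH = {
    "me": ["a", "b", "c"],
    "a": ["me", "a2"], "b": ["me", "b2"], "c": ["me", "c2"],
    "a2": ["a", "t"], "b2": ["b"], "c2": ["c"],
    "t": ["a2"],
}

print(radius_ball_volume(GRAPH, "me", ["t"]))  # (3, 8, 4)
```

The ball of radius 3 covers all 8 nodes, while a connected subgraph of 4 nodes already reaches the target, which is the gap between the two measures.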
Planned summary for the Alignment Newsletter:
This post introduces the concept of the _locality_ of a goal, that is, how “far” away the target of the goal is. For example, a thermostat’s “goal” is very local: it “wants” to regulate the temperature of this room, and doesn’t “care” about the temperature of the neighboring house. In contrast, a paperclip maximizer has extremely nonlocal goals, as it “cares” about paperclips anywhere in the universe. We can also consider whether the goal depends on the agent’s internals, its input, its output, and/or the environment.
The concept is useful because for extremely local goals (usually goals about the internals or the input) we would expect wireheading or tampering, whereas for extremely nonlocal goals, we would instead expect convergent instrumental subgoals like resource acquisition.
Thanks for the summary! It's representative of the idea.
Just out of curiosity, how do you decide which posts/papers to write an opinion on?
I ask myself if there's anything in particular I want to say about the post / paper that the author(s) didn't say, with an emphasis on ensuring that the opinion has content. If yes, then I write it.
(Sorry, that's not very informative, but I don't really have a system for it.)
Introduction
Studying goal-directedness produces two kinds of questions: questions about goals, and questions about being directed towards a goal. Most of my previous posts focused on the second kind; this one shifts to the first kind.
Assume some goal-directed system with a known goal. The nature of this goal will influence which issues of safety the system might have. If the goal focuses on the input, the system might wirehead itself and/or game its specification. On the other hand, if the goal lies firmly in the environment, the system might have convergent instrumental subgoals and/or destroy any unspecified value.
Locality aims at capturing this distinction.
Intuitively, the locality of the system's goal captures how far away from the system one must look to check the accomplishment of the goal.
Let's give some examples:
Locality isn't about how the system extracts a model of the world from its input, but about whether and how much it cares about the world beyond it.
Starting points
This intuition about locality came from the collision of two different classifications of goals: the first from Daniel Dennett and the second from Evan Hubinger.
Thermostats and Goals
In "The Intentional Stance", Dennett explains, extends and defends... the intentional stance. One point he discusses is his liberalism: he is completely comfortable with admitting ridiculously simple systems like thermostats into the club of intentional systems -- with granting them meaningful mental states such as beliefs, desires and goals.
Lest we readers feel insulted at the comparison, Dennett nonetheless admits that the goals of a thermostat differ from ours.
The goals and beliefs of a thermostat are thus not about heat and the room it is in, as our anthropomorphic bias might suggest, but about the binary state of its sensor.
Now, if the thermostat had more information about the world -- a camera, GPS position, general reasoning ability to infer information about the actual temperature from all its inputs -- then Dennett argues its beliefs and goals would be much more related to heat in the room.
Humans, Dennett argues, are more like this enhanced thermostat, in that our beliefs and goals intertwine with the state of the world. Or put differently, when the world around us changes, it will almost always influence our mental states; whereas a basic thermostat might react the exact same way in vastly different environments.
Part of this distinction between goals comes from generalization, a property considered necessary for goal-directedness since Rohin's initial post on the subject. But the two goals also differ in their "groundedness": the thermostat's goal lies completely in its sensors' inputs, whereas the goals of humans depend on things farther away, on the environment itself.
That is, these two goals have different locality.
Goals Across Cartesian Boundaries
The other classification of goals comes from Evan Hubinger, in a personal discussion. Assuming a Cartesian Boundary outlining the system and its inputs and outputs, goals can be functions of four things: the system's internals, its input, its output, or the environment.
Of course, many goals are functions of multiple parts of this quartet. Yet separating them allows a characterization of a given goal through their proportions.
Going back to Dennett's example, the basic thermostat's goal is a function of its input, while human goals tend to be functions of the environment. And once again, an important aspect of the difference appears to lie in how far from the system the information relevant to the goal lies -- locality.
What Is Locality Anyway?
Assuming some model of the world (possibly a causal DAG) containing the system, the locality of the goal is inversely proportional to the minimum radius of a ball, centered at the system, which suffices to evaluate the goal. Basically, one needs to look a certain distance away to check whether one’s goal is accomplished; locality is a measure of this distance. The more local a goal, the less grounded it is in the environment, and the more susceptible it is to wireheading or to a change of environment without a change of internal state.
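As a minimal sketch of this definition, and not a proposed formalization, one can treat the world model as a graph, measure distance in hops from the system, and take the radius to be the distance to the farthest variable the goal depends on. Everything specific here is an assumption made for illustration: the toy world, the node names, ignoring edge direction in the causal DAG, and the choice of 1 / (1 + radius) as one possible "inversely proportional" score that stays defined at radius 0.

```python
from collections import deque


def distances_from(graph, source):
    """Hop distance from `source` to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def goal_radius(graph, system, goal_variables):
    """Smallest radius of a ball around `system` containing every goal variable."""
    dist = distances_from(graph, system)
    missing = [v for v in goal_variables if v not in dist]
    if missing:
        raise ValueError(f"goal variables not reachable from the system: {missing}")
    return max((dist[v] for v in goal_variables), default=0)


def locality(radius):
    """Assumed score: decreases with the radius, equals 1.0 at radius 0."""
    return 1.0 / (1 + radius)


# Hypothetical world model, with edges treated as undirected.
WORLD = {
    "system": ["sensor"],
    "sensor": ["system", "room_air"],
    "room_air": ["sensor", "house"],
    "house": ["room_air", "neighborhood"],
    "neighborhood": ["house"],
}

basic_thermostat = goal_radius(WORLD, "system", {"sensor"})
far_reaching_goal = goal_radius(WORLD, "system", {"room_air", "neighborhood"})
print(basic_thermostat, locality(basic_thermostat))    # 1 0.5
print(far_reaching_goal, locality(far_reaching_goal))  # 4 0.2
```

In this toy world the basic thermostat's goal only needs the sensor, so its radius is 1, whereas a goal that also mentions the neighborhood needs radius 4 and scores as far less local.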
Running with this attempt at formalization, a couple of interesting points follow:
In summary, locality is a measure of the distance at which information about the world matters for a system's goal. It appears in various guises in different classifications of goals, and underlies multiple safety issues. What I give is far from a formalization; it is instead a first exploration of the concept, with open directions to boot. Yet I believe that the concept can be put into more formal terms, and that such a measure of locality captures a fundamental aspect of goal-directedness.
Thanks to Victoria Krakovna, Evan Hubinger and Michele Campolo for discussions on this idea.