I think this formulation of goal-directedness is pretty similar to one I suggested in the post before the coherence arguments post (Intuitions about goal-directed behavior, section "Our understanding of the behavior"). I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops). Should they worry that their laptops are going to take over the world?
For a deeper response, I'd recommend Intuitions about goal-directed behavior. I'll quote some of the relevant parts here:
There is a general pattern in which as soon as we understand something, it becomes something lesser. As soon as we understand rainbows, they are relegated to the “dull catalogue of common things”. This suggests a somewhat cynical explanation of our concept of “intelligence”: an agent is considered intelligent if we do not know how to achieve the outcomes it does using the resources that it has (in which case our best model for that agent may be that it is pursuing some goal, reflecting our tendency to anthropomorphize). That is, our evaluation about intelligence is a statement about our epistemic state.
[... four examples ...]
To the extent that the Misspecified Goal argument relies on this intuition, the argument feels a lot weaker to me. If the Misspecified Goal argument rested entirely upon this intuition, then it would be asserting that because we are ignorant about what an intelligent agent would do, we should assume that it is optimizing a goal, which means that it is going to accumulate power and resources and lead to catastrophe. In other words, it is arguing that assuming that an agent is intelligent definitionally means that it will accumulate power and resources. This seems clearly wrong; it is possible in principle to have an intelligent agent that nonetheless does not accumulate power and resources.
Also, the argument is not saying that in practice most intelligent agents accumulate power and resources. It says that we have no better model to go off of than “goal-directed”, and then pushes this model to extreme scenarios where we should have a lot more uncertainty.
See also the summary of that post:
“From the outside”, it seems like a goal-directed agent is characterized by the fact that we can predict the agent’s behavior in new situations by assuming that it is pursuing some goal, and as a result it acquires power and resources. This can be interpreted either as a statement about our epistemic state (we know so little about the agent that our best model is that it pursues a goal, even though this model is not very accurate or precise) or as a statement about the agent (predicting the behavior of the agent in new situations based on pursuit of a goal actually has very high precision and accuracy). These two views have very different implications for the validity of the Misspecified Goal argument for AI risk.
But also, even ignoring all of that, I see this post as compatible with my post. My goal was for people to premise their AI safety risk arguments on the concept of goal-directedness, rather than utility maximization, and this post does exactly that.
I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).
This definition is also supposed to explain why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as your best model of it as an optimizer increases in optimization power.
In "Coherence arguments do not imply goal directed behavior", Rohin Shah argues that a system's merely being model-able at all as an expected utility (EU) maximizer does not imply that it has "goal directed behavior". The argument as I understand it runs something like this:
1: Any behavior whatsoever maximizes some utility function.
2: Not all behaviors are goal directed.
Conclusion: A system's behavior maximizing some utility function does not imply that its behavior is goal directed.
I think this argument is technically sound, but misses an important connection between VNM coherence and goal directed behavior.
Shah does not give a formal definition of "goal directed behavior" but it is basically what you intuitively think it is. Goal directed behavior is the sort of behavior that seems like it is aimed at accomplishing some goal. Shah correctly points out that a system being goal directed and being good at accomplishing its goal is what makes it dangerous, not merely that it is good at maximizing some utility function. Every object in the universe perfectly maximizes the utility function that assigns 1 to all of the actual causal consequences of its behavior, and 0 to any other causal consequences its behavior might have had.
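To make this concrete, here is a minimal Python sketch (the function and variable names are mine, invented purely for illustration) of the trivial utility function that any behavior maximizes:

```python
# Toy illustration of premise 1: any behavior "maximizes" some utility
# function, namely the one that scores 1 on whatever actually happened.

def make_trivial_utility(actual_consequences):
    """Build a utility function that the observed behavior maximizes."""
    def utility(consequences):
        return 1 if consequences == actual_consequences else 0
    return utility

actual = "the robot twitched, and the twitching caused these exact effects"
alternatives = ["the robot sat still", "the robot took over the world"]

u = make_trivial_utility(actual)

# The actual consequences score 1 and every alternative scores 0, so the
# behavior is a perfect maximizer of u without being goal directed in any
# interesting sense.
assert u(actual) == 1
assert all(u(a) == 0 for a in alternatives)
```

Notice that this utility function can only be written down after the behavior is observed, so it buys no predictive leverage at all.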
Shah seems to suggest that being model-able as an EU maximizer is not very closely related to goal directed behavior. Sure, having goal directed behavior implies that you are model-able as an EU maximizer, but so does having any kind of behavior whatsoever.
The implication does not run the other way, according to Shah: something being an EU maximizer for some utility function, even a perfect one, does not imply that its behavior is goal directed. I think this is right, but I will argue that if it is a good idea for you to model an agent as an EU maximizer, then its behavior will nonetheless seem goal directed (at least to you).
Shah gives the example of a twitching robot. This is not a robot that maximizes the probability of its twitching, or that wants to twitch as long as possible. Shah agrees that a robot that maximized those things would be dangerous. Rather, this is a robot that just twitches. Such a robot maximizes a utility function that assigns 1 to whatever the actual consequences of its actual twitching behaviors are, and 0 to anything else that the consequences might have been.
This system is a perfect EU maximizer for that utility function, but it is not an optimization process for any utility function. For a system to be an optimization process, it must be more efficient to predict it by modeling it as an optimization process than by modeling it as a mechanical system. Another way to put it is that it must be a good idea for you to model it as an EU maximizer.
This might be true in two different ways. It might be more efficient in terms of time or compute: my predictions of the behavior when I model the system as an EU maximizer might not be as good as my predictions when I model it as a mechanical system, but the reduced accuracy is worth it, because modeling the system mechanically would take me much longer or be otherwise costly. Think of predicting a chess-playing program. Even though I could predict its next move by reading its source code and working through it by hand on paper, in most contexts I would be better off just thinking about what I would do in its circumstances if I were trying to win at chess.
Another related but distinct sense in which it might be more efficient is that modeling the system as an EU maximizer might allow me to compress its behavior more than modeling it as a mechanical system. Imagine I had to send someone a Python program that makes predictions about the behavior of the twitching robot. I could write a program that just prints "twitch" over and over again, or I could write a program that models the whole world and picks the behavior that maximizes the expected value of a utility function that assigns 1 to whatever the actual consequences of the twitching are, and 0 to whatever else they might have been. I claim that the second program would be longer. It would not, however, allow the receiver of my message to predict the behavior of the robot any more accurately than the program that just prints "twitch" over and over.
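Here is a minimal sketch of the two programs (the world model is a stub and every name in it is made up; the only point is the relative length and the identical predictions):

```python
# Program 1: predict the twitching robot directly.
def predict_simple(n_steps):
    return ["twitch"] * n_steps


# Program 2: predict the same robot by modeling it as an EU maximizer.
# A real version would have to model "the whole world"; even this stub
# already makes the program longer while adding nothing to its accuracy.

def world_model(behavior):
    """Stub standing in for a model of the behavior's consequences."""
    return ("consequences of", tuple(behavior))

def predict_as_eu_maximizer(n_steps):
    # The utility hides all the complexity in "whatever actually happened":
    # it assigns 1 to the consequences of the actual twitching, 0 to the rest.
    actual_consequences = world_model(["twitch"] * n_steps)

    def utility(consequences):
        return 1 if consequences == actual_consequences else 0

    candidates = [
        ["twitch"] * n_steps,
        ["sit still"] * n_steps,
        ["acquire resources"] * n_steps,
    ]
    # Pick the behavior whose modeled consequences maximize utility.
    return max(candidates, key=lambda b: utility(world_model(b)))


# Both programs output exactly the same predictions.
assert predict_simple(100) == predict_as_eu_maximizer(100)
```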
Maybe the exact twitching pattern is complicated, or maybe it stops at some particular time; in that case the first program would have to be more complicated. But as long as the twitching does not seem goal directed, I claim that a Python program that predicts the robot's behavior by modeling the universe and the counterfactual consequences of different kinds of possible twitching will always be longer than one that predicts the twitching by exploiting regularities that follow from the robot's mechanical design. I think this might be what it means for a system to be goal directed.
(Might also be worth pointing out that knowing that there is a utility function which the twitching robot is a perfect optimizer relative to does not allow us to predict its behavior in advance. "It optimizes the utility function that assigns 1 to the consequences of its behavior and 0 to everything else" is a bad theory of the twitching robot in the same way that "the lady down the street is a witch; she did it" is a bad theory of anything.)
A system seems goal directed to you if the best way you have of predicting it is by modeling it as an EU maximizer with some particular utility function and credence function. (Actually, the particulars of the EU formalism might not be very relevant to what makes humans think of a system's behavior as goal directed. Its being a good idea to model it as having something like preferences, and some sort of reasonably accurate model of the world that supports counterfactual reasoning, is probably good enough.) This conception of goal directedness is somewhat awkward because the first notion of "efficiently model" is relative to your capacities and goals, and the second notion is relative to the programming language we choose, but I think it is basically right nonetheless. Luckily, we humans have relatively similar capacities and goals, and it can be shown that, using the second notion of "efficiently model", we will only disagree about how agenty different systems are by at most some additive constant, regardless of what programming languages we choose.
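The additive-constant claim is the invariance theorem from algorithmic information theory. Informally, and in notation I am introducing here: if $K_U(x)$ is the length of the shortest program in language $U$ that predicts behavior $x$, then for any two universal languages $U$ and $V$ there is a constant $c_{U,V}$, depending only on the languages and not on $x$, such that

$$ \lvert K_U(x) - K_V(x) \rvert \;\le\; c_{U,V} \quad \text{for all } x. $$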
One argument that what it means for a system's behavior to seem goal directed to you is just for it to be best for you to model it as an EU maximizer is that if it were a better idea for you to model it some other way, that is probably how you would model it instead. This is why we do not model bottle caps as EU maximizers but do model chess programs as (something at least a lot like) EU maximizers. This is also why the twitching robot does not seem intelligent to us, absent other subsystems that we should model as EU maximizers, but that's a story for a different post.
I think we should expect most systems that it is a good idea for us to model as EU maximizers to pursue convergent instrumental goals like acquiring computational power, ensuring their own survival, and so on. If I know the utility function of an EU maximizer better than I know its specific behavior, often the best way for me to predict its behavior is by imagining what I would do in its circumstances if I had the same goal. Take a complicated utility function, like the one that assigns 1 to whatever the actual consequences of the twitching robot's twitches are and 0 to anything else. Imagine that I did not have the utility function specified that way, which hides all of the complexity in "whatever the actual consequences are." Rather, imagine I had it specified as an extremely specific description of the world that gets scored above all else, without reference to the actual twitching pattern of the robot. If maximizing that utility function were my goal, it would seem like a good idea to me to get more computational power for predicting the outcomes of my available actions, to make sure that I am not turned off prematurely, and to try to get as accurate a model of my environment as possible.
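As a toy illustration of that last step (all of the action names and numbers below are invented for the example, not derived from anything), a planner scored only on reaching one exact world state will rank preparatory actions by how much they raise its chance of hitting that state:

```python
# Toy planner with a utility of 1 on one exact target world state and 0
# elsewhere. Expected utility then equals the probability of reaching that
# state, so the planner prefers whichever preparatory action raises that
# probability most. All probabilities are made up for illustration.

P_REACH_TARGET = {
    "do nothing": 0.001,
    "acquire more computational power": 0.010,
    "improve my model of the environment": 0.015,
    "make sure I am not turned off prematurely": 0.020,
}

def expected_utility(action):
    # EU = 1 * P(reach exact target state | action) + 0 * P(anything else)
    return P_REACH_TARGET[action]

print(max(P_REACH_TARGET, key=expected_utility))
# -> "make sure I am not turned off prematurely"
```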
In conclusion, I agree with Shah that being able to model a system as an EU maximizer at all does not imply that its behavior is goal directed, but I think that sort of misses the point. If the best way for you to model a system is to model it as an EU maximizer, then its behavior will seem goal directed to you, and if the shortest program that predicts a system's behavior does so by modeling it as an EU maximizer, then its behavior will be goal directed (at least up to an additive constant). I think the best way for you to model most systems that are more intelligent than you will be to model them as EU maximizers, or something close, but again, that's a story for a different post.