What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety.
- (What I'm not saying) We shouldn't ascribe any cognition to the system, just find rules of association for its behavior (aka Behaviorism)
- That's not even coherent with my favored approach to goal-directedness, the intentional stance. Dennett clearly ascribes beliefs and desires to beings and systems; his point is that the ascription is done based on the behavior and the circumstances.
I agree pretty strongly with all of this, fwiw. I think Dennett/the intentional stance really gets at the core of what it means for a system to "be an agent"; essentially, a system is one to the extent it makes sense to model it as such, i.e. as having beliefs and preferences, and acting on those beliefs to achieve those preferences, etc. The very reason why we usually consider ourselves and other humans to be "agents" is exactly because that's the model over sensory data that the mind finds most reasonable to use, most of the time. In doing so, we actually are ascribing cognition to these systems, and in practice, of course we'll need to understand how such behavior will actually be implemented in our AIs. (And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)
I'm glad, you're one of the handful of people I wrote this post for. ;)
(And thinking about how "goal-directed behavior" is implemented in humans/biological neural nets seems like a good place to mine for useful insights and analogies for this purpose.)
Definitely. I have tended to neglect this angle, but I'm trying to correct that mistake.
Beyond past confusions
Over the last year, I wrote and thought many confused and confusing ideas on the relationship between goal-directedness and behavior. In the linked post for example, I defended a deconfusion of goal-directedness solely in terms of behavior; in doing so, I might pass for a behaviorist (someone who thinks that mental constructs are not needed and so don't exist), or look like I'm implying that we should never use internal knowledge of our models to determine goal-directedness. And that's without even mentioning the factual errors.
So here is my attempt at a short and clear explanation of the link I see between goal-directedness and behavior. If you're confused by this take, or believe me to be confused, I would really appreciate a comment. My goal isn't to prove that I'm obviously right, just to get less confused and hopefully help lift the fog of confusion for everyone.
Thanks to Jack Koch for a recent discussion that reminded me of this issue, and to Richard Ngo for giving me food for thought on this subject with his comments.
Behavior in all its glory
What's my take? I think that when we talk about goal-directedness, what we really care about is a range of possible behaviors, some of which we worry about in the context of alignment and safety. We might for example think that goal-directed systems have convergent subgoals, which tells us how they could lack corrigibility and cause catastrophic outcomes.
My entire point is that for deconfusing goal-directedness, we want a better understanding of this range of behaviors. At the moment, when thinking about a given behavior, I don't know whether that's the sort of thing a goal-directed system would do. That seems problematic both for understanding the risks of goal-directed systems and for detecting them.
Note that even a purely structural definition of goal-directedness would constrain the structure such that the system behaves in a certain way. So even if we want a structural definition, clarifying the range of behaviors sounds like progress.
What I'm not saying
How I could be wrong
The main crux I see about this take on behavior is whether it's even possible or tractable to deconfuse and formalize the range of behaviors of goal-directed systems. No matter how useful a formalization would be, if we can't get it, we should turn to other approaches.
That being said, I haven't seen any convincing argument that it's impossible, and the more I dig, the more stuff I find, so I am quite convinced that some progress is possible.