Considering how much I’ve been using “the intentional stance” in my thinking about the nature of agency and goals and discussions of the matter recently, I figured it would be a good idea to, y’know, actually read what Dan Dennett originally wrote about it. While doing so, I realized that he was already considering some nuances in the subject that the Wikipedia summary of the intentional stance leaves out but that are nonetheless relevant to the issues we face when attempting to e.g. formalize the approach, or think more clearly about the nature of agency in the context of alignment. I don’t expect many LessWrongers will read the original book in full, but I do expect that some additional clarity on what exactly Dennett was claiming about the nature of agency and goals will be helpful in having less confused intuitions and discussions about the subject.

In what follows, I provide an in-depth summary of Dennett’s exposition of the intentional stance, from Chapter 2 of The Intentional Stance (“True Believers: The Intentional Strategy and Why It Works”), which Dennett considers “the flagship expression” of his position. Then, I discuss a few takeaways for thinking about agency in the context of AI safety. In brief, I think 1) we should stop talking about whether the systems we build will or won’t “be agents,” and instead debate how much it will make sense to consider a given system as “an agent,” from the information available to us, and 2) we should recognize that even our internally-experienced beliefs and desires are the result of parts of our minds “applying the intentional stance” to other parts of the mind or the mind as a whole.

This work was completed as a Summer Research Fellow at the Center on Long-Term Risk under the mentorship of Richard Ngo. Thanks to Richard, Adam Shimi, Kaj Sotala, Alex Fabbri, and Jack Auen for feedback on drafts of this post.

Summarizing Dennett's position

TLDR: There is no observer-independent “fact of the matter” of whether a system is or is not an “agent”. However, there is an objective fact of the matter about how well a particular system’s behavior is modeled by the intentional stance, from the point of view of a given observer. There are, objectively, patterns in the observable behavior of an intentional system that correspond to what we call “beliefs” and “desires”, and these patterns explain or predict the behavior of the system unusually well (but not perfectly) for how simple they are.

In an attempt to be as faithful as possible in my depiction of Dennett’s original position, as well as provide a good resource to point back to on the subject for further discussion[1], I will err on the side of directly quoting Dennett perhaps too frequently, at least in this summary section.

He begins by caricaturing two opposing views on the nature of belief: 1) realism: there’s an “objective internal matter of fact” to the nature of belief; for example, in principle, sufficiently detailed understanding of cognitive psychology, neuroscience, or even physics would allow one to effectively “find the beliefs” inside the believer’s head. 2) interpretationism: “likens the question of whether a person has a particular belief to the question of whether a person is immoral, or has style, or talent, or would make a good wife…. ‘It’s a matter of interpretation.’”

Dennett’s position (the intentional strategy or adopting the intentional stance), then, is that “while belief is a perfectly objective phenomenon (that apparently makes me a realist), it can be discerned only from the point of view of one who adopts a certain predictive strategy, and its existence can be confirmed only by an assessment of the success of that strategy (that apparently makes me an interpretationist).”

The Intentional Strategy: How it works, how well it works

Three Stances

There are several approaches one might take to predicting the future behavior of some system; Dennett compares three: the physical stance, the design stance, and the intentional stance.

In adopting the physical stance towards a system, you utilize an understanding of the laws of physics to predict a system’s behavior from its physical constitution and its physical interactions with its environment. One simple example of a situation where the physical stance is most useful is in predicting the trajectory of a rock sliding down a slope; one would be able to get very precise and accurate predictions with knowledge of the laws of motion, gravitation, friction, etc. In principle (and presuming physicalism), this stance is capable of predicting in full the behavior of everything from quantum mechanical systems to human beings to the entire future of the whole universe.

With the design stance, by contrast, “one ignores the actual (possibly messy) details of the physical constitution of an object, and, on the assumption that it has a certain design, predicts that it will behave as it is designed to behave under various circumstances.” For example, humans almost never consider what their computers are doing on a physical level, unless something has gone wrong; by default, we operate on the level of a user interface, which was designed in order to abstract away messy details that would otherwise hamper our ability to interact with the systems.

Finally, there’s the intentional stance:

Here is how it works: first you decide to treat the object whose behavior is to be predicted as a rational agent; then you figure out what beliefs that agent ought to have, given its place in the world and its purpose. Then you figure out what desires it ought to have, on the same considerations, and finally you predict that this rational agent will act to further its goals in the light of its beliefs. A little practical reasoning from the chosen set of beliefs and desires will in many—but not all—instances yield a decision about what the agent ought to do; that is what you predict the agent will do.
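As a rough illustration (mine, not Dennett’s), the recipe in the quote above can be sketched as a tiny prediction function: given the beliefs and desires we have attributed, predict whichever action a rational agent would pick. Everything here is a hypothetical toy; the names `predict_action`, `expected_outcome`, and `desirability`, the umbrella scenario, and the weights are assumptions made up for the example.

```python
from typing import Callable, Dict, List

# Hypothetical toy types: attributed beliefs and predicted outcomes are just
# maps from propositions to truth values.
Beliefs = Dict[str, bool]
Outcome = Dict[str, bool]


def predict_action(
    actions: List[str],
    beliefs: Beliefs,
    expected_outcome: Callable[[str, Beliefs], Outcome],
    desirability: Callable[[Outcome], float],
) -> str:
    """Predict the action the agent 'ought' to take: the one whose outcome,
    as judged through the attributed beliefs, best serves the attributed desires."""
    return max(actions, key=lambda a: desirability(expected_outcome(a, beliefs)))


def expected_outcome(action: str, beliefs: Beliefs) -> Outcome:
    # What the agent expects to happen, given what we say it believes.
    stays_dry = (not beliefs["raining"]) or action == "take umbrella"
    return {"dry": stays_dry, "hands_free": action != "take umbrella"}


def desirability(outcome: Outcome) -> float:
    # Attributed desires: staying dry matters more than having free hands.
    return 2.0 * outcome["dry"] + 1.0 * outcome["hands_free"]


actions = ["take umbrella", "leave umbrella"]
prediction_rain = predict_action(actions, {"raining": True}, expected_outcome, desirability)
prediction_sun = predict_action(actions, {"raining": False}, expected_outcome, desirability)
print(prediction_rain, "|", prediction_sun)
```

The same attributed desires produce different predicted actions once the attributed belief about the weather changes, which is the “little practical reasoning” step of the strategy in miniature.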

Before further unpacking the intentional stance, one helpful analogy might be that the three stances can be understood as providing gears-level models for the system under consideration, at different levels of abstraction.[2] For purposes of illustration, imagine we want to model the behavior of a housekeeping robot:

  • The physical stance gives us a gears-level model where the gears are the literal gears (or other physical components) of the robot.
  • The design stance gives us a gears-level model where the gears come from the level of abstraction at which the system was designed. The gears could be e.g. the CPU, memory, etc., on the hardware side, or on the level of the robot’s user interface, on the software side.
  • The intentional stance gives us a gears-level model where the relevant gears are the robot’s beliefs, desires, goals, etc.

Attributing Beliefs and Desires

The above description of the intentional stance doesn’t provide many specifics about how to determine the beliefs and desires the intentional system “ought to have”; how do we actually determine the beliefs and desires to be attributed? Dennett first notes that we typically come to form beliefs about the parts of the world we are exposed to through our senses. However, we obviously do not learn or remember all the potentially inferable truths from our sensory data; “what we come to know, normally, are only all the relevant truths our sensory histories avail us.”

This leaves us with one heuristic for attributing beliefs under the intentional stance: “attribute as beliefs all the truths relevant to the system’s interests (or desires) that the system’s experience to date has made available.” For example, imagine that a group of people are having a discussion about AI in a pub where a football match is being shown on television. Those who are interested in AI but not football are more likely to form beliefs about the content of the conversation than about the content on the television, and vice versa. Although a useful rule of thumb, this heuristic fails to capture a system’s false beliefs, not to mention the fact that humans remain incapable of perfectly remembering all of the beliefs relevant to their desires they’ve had the opportunity to learn.[3] To be clear, this rule is derived from the fundamental rule, “attribute the beliefs the system ought to have”—an intentional system should form beliefs about the aspects of its environment relevant to its desires.
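A minimal sketch of this heuristic, using the pub example: treat belief attribution as filtering the truths a system’s experience has made available by the system’s interests. The `Truth` type, the `attribute_beliefs` function, and the sample data are all hypothetical names invented for illustration, and the sketch inherits the limitations just mentioned (it cannot represent false beliefs or forgetting).

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Truth:
    """A truth made available by experience, tagged with a topic (toy model)."""
    topic: str
    content: str


def attribute_beliefs(available: List[Truth], interests: Set[str]) -> List[Truth]:
    """Attribute as beliefs all available truths relevant to the system's interests."""
    return [t for t in available if t.topic in interests]


# Everyone in the pub is exposed to the same two truths.
pub_evening = [
    Truth("AI", "someone argued that agency is observer-relative"),
    Truth("football", "the home team scored in the second half"),
]

ai_fan_beliefs = attribute_beliefs(pub_evening, {"AI"})
football_fan_beliefs = attribute_beliefs(pub_evening, {"football"})
print([t.topic for t in ai_fan_beliefs], [t.topic for t in football_fan_beliefs])
```

Note that the filter takes the interests as an input, which is exactly why the heuristic presumes that desires have already been attributed.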

This heuristic for belief attribution also presumes an attribution of desires to the intentional system. For humans, this is relatively straightforward—the desires we should be quickest to attribute to another human are those that are most common to all of us: “survival, absence of pain, food, comfort, procreation, entertainment.” It doesn’t take a great leap of imagination to reason that other humans want to be happy, safe, and comfortable, because that’s what almost all of us want. (As Dennett notes, “citing any one of these desires typically terminates the ‘Why?’ game of reason giving.”) This is the result of the fundamental rule “attribute the desires the system ought to have”—the desires (at least, the fundamental ones) humans “ought to have” are fairly obvious. We can also attribute desires with the heuristics “attribute desires for those things a system believes to be good for it” and “attribute desires for those things a system believes to be best means to the other ends it desires” (instrumental desires).[4]

There’s also a need to consider how “rational” an intentional system is in the process of belief and desire attribution. A logically omniscient intentional system would believe all the truths derivable from its current beliefs, but obviously humans (and any system that exists in the physical universe, for that matter) fall far short of this ideal. Dennett writes:

One starts with the ideal of perfect rationality and revises downward as circumstances dictate. That is, one starts with the assumption that people believe all the implications of their beliefs and believe no contradictory pairs of beliefs. This does not create a practical problem of clutter (infinitely many implications, for instance), for one is interested only in ensuring that the system one is predicting is rational enough to get to the particular implications that are relevant to its behavioral predicament of the moment. Instances of irrationality, or of finitely powerful capacities of inferences, raise particularly knotty problems of interpretation, which I will set aside on this occasion (see chapter 4, “Making Sense of Ourselves,” and Cherniak 1986).[5]

It works (well)

As Dennett emphasizes, the intentional strategy is the strategy we all already use, all the time when interacting with other humans in the world (not to mention occasionally with some other systems, from animals to robots to plants, even to thermostats); imagine seeing two children pulling at opposing ends of a toy and not thinking “both those kids want that toy”! The intentional strategy, in a sense, is a description of how we attribute beliefs and desires to a system when our minds find it easiest to perceive that system as “being an agent,” “having intentions,” etc.

True Believers

Here we get to the real meat of Dennett’s position, so to speak. Now that he’s described how we attribute beliefs and desires to systems that seem to us to have intentions of one kind or another, “the next task would seem to be distinguishing those intentional systems that really have beliefs and desires from those we may find it handy to treat as if they had beliefs and desires.” (For example, although a thermostat’s behavior can be understood under the intentional stance, most people intuitively feel that a thermostat doesn’t “really” have beliefs.) This, however, cautions Dennett, would be a mistake.

As a thought experiment, Dennett asks us to imagine that some superintelligent Martians descend upon us; to them, we’re as simple as thermostats are to us. If they were capable of predicting the activities of human society on a microphysical level, without ever treating any of us as intentional systems, it seems fair to say that we wouldn’t “really” be believers, to them. This shows that intentionality is somewhat observer-relative—whether or not a system has intentions depends on the modeling capabilities of the observer.

However, this is not to say that intentionality is completely subjective, far from it—there are objective patterns in the observables corresponding to what we call “beliefs” and “desires.” (Although Dennett is careful to emphasize that these patterns don’t allow one to perfectly predict behavior; it’s that they predict the data unusually well for how simple they are. For one, your ability to model an intentional system will fail under certain kinds of distributional shifts; analogously, understanding a computer under the design stance does not allow one to make accurate predictions about what it will do when submerged in liquid helium.)

Another interesting point Dennett argues is that the intentional stance is unavoidable with regard to oneself and one’s fellow intelligent beings; “if they observe, theorize, predict, communicate, they view themselves as intentional systems. [Footnote: Might there not be intelligent beings who had no use for communicating, predicting, observing,...? There might be marvelous, nifty, invulnerable entities lacking these modes of action, but I cannot see what would lead us to call them intelligent.]”[6]

Thus, with the intentional stance, Dennett advocates a “milder sort of realism,” where “there is no fact of the matter of exactly which beliefs and desires a person has in these degenerate cases [e.g. of failures of rationality or distributional shift], but this is not a surrender to relativism or subjectivism, for when and why there is no fact of the matter is itself a matter of objective fact.”

One analogy that might help elucidate this relationship between the objectivity of the belief- and desire-patterns and the apparently subjective point of view required to see them is with the example of a Turing machine implemented in Life. The Turing machine is “objectively” there—anyone who understands what a Turing machine is will not fail to see the pattern in the data—but a specific frame of mind is needed to recognize it. Analogously, the patterns in an intentional system’s observable behavior corresponding to “beliefs” and “desires” are objectively there, once you’ve made the decision to consider the system under the intentional stance.

Thermostats again (but, really, are they agents?)

Now, we return to thermostats and the question of what it means for a system to “really” have beliefs and/or desires. Dennett’s punchline is that “all there is to being a true believer is being a system whose behavior is reliably predictable via the intentional strategy, and hence all there is to really and truly believing that p (for any proposition p) is being an intentional system for which p occurs as a belief in the best (most predictive) interpretation.”
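To make “the best (most predictive) interpretation” a little more concrete, here is a toy sketch of my own (Dennett gives no such procedure): score candidate belief-desire interpretations by how often they predict the observed behavior, and keep the winner. The names `accuracy`, `best_interpretation`, and `wants_temperature`, along with the observations themselves, are assumptions for illustration, and raw accuracy is only one crude stand-in for predictive success.

```python
from typing import Callable, Dict, List, Tuple

# A candidate interpretation maps an observed temperature to a predicted action.
Interpretation = Callable[[float], str]
Observation = Tuple[float, str]


def accuracy(interp: Interpretation, observations: List[Observation]) -> float:
    """Fraction of observed (situation, action) pairs the interpretation predicts."""
    hits = sum(interp(temp) == action for temp, action in observations)
    return hits / len(observations)


def best_interpretation(
    candidates: Dict[str, Interpretation], observations: List[Observation]
) -> str:
    """Return the name of the most predictive candidate interpretation."""
    return max(candidates, key=lambda name: accuracy(candidates[name], observations))


def wants_temperature(target: float) -> Interpretation:
    # "It desires the room to be at `target` and acts on that desire."
    return lambda temp: "cool" if temp > target else "heat"


# Hypothetical observed behavior of a thermostat.
observations = [(25.0, "cool"), (15.0, "heat"), (26.0, "cool"), (14.0, "heat")]
candidates = {
    "wants the room near 20": wants_temperature(20.0),
    "wants the room near 30": wants_temperature(30.0),
}
winner = best_interpretation(candidates, observations)
print(winner)
```

On this reading, the thermostat “really” wants the room near 20 only in the sense that this interpretation outpredicts its rivals on the available observations.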

We might be willing to attribute half a dozen beliefs and desires to a normal thermostat; it believes the room is too hot or too cold, etc. Dennett notes that we could “de-interpret” its beliefs and desires via symbol substitution or abstraction: it believes that the R is too H or C, etc. (“After all, by attaching the thermostatic control mechanism to different input and output devices, it could be made to regulate the amount of water in a tank, or the speed of a train.”)
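The de-interpretation point can be sketched in code: the control logic below never mentions temperature, only a reading R that is “too H” or “too C” relative to a setpoint, so what its states are “about” depends entirely on what it is wired to. This is a hypothetical bang-bang controller written for illustration; `make_regulator`, its parameters, and the sample readings are my assumptions, not a description of any real thermostat.

```python
from typing import Callable


def make_regulator(
    read_r: Callable[[], float], setpoint: float, band: float
) -> Callable[[], str]:
    """Build a regulator over an uninterpreted quantity R."""

    def step() -> str:
        r = read_r()
        if r > setpoint + band:
            return "decrease"  # the device "believes" the R is too H
        if r < setpoint - band:
            return "increase"  # the device "believes" the R is too C
        return "hold"

    return step


# Same mechanism, different input devices: only the wiring differs.
room_thermostat = make_regulator(lambda: 25.0, setpoint=20.0, band=1.0)   # degrees
tank_regulator = make_regulator(lambda: 40.0, setpoint=100.0, band=5.0)   # liters

room_command = room_thermostat()
tank_command = tank_regulator()
print(room_command, tank_command)
```

Swapping the sensor is a one-line change that leaves `step` untouched, which is the sense in which the primitive thermostat’s “beliefs” carry so little meaning of their own.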

On the other hand, if we start to enrich its causal attachments to the world, e.g. by giving it more than one way to learn about temperature, or by giving it a fully general visual system, we also enrich the semantics of the “dummy predicates” (H and C, etc.). Given the actual link to the world, we could endow a state of the primitive thermostat with “meaning” (of a sort), but it was too easy to substitute a different minimal link and altogether change the meaning (“in this impoverished sense”) of that internal state (the “de-interpreted” beliefs weren’t very “meaningful” since they could represent the beliefs of a speed or volume regulator just as well as those of a temperature regulator). However, with “perceptually richer and behaviorally more versatile” systems,

it becomes harder and harder to make substitutions in the actual links of the system to the world without changing the organization of the system itself. If you change its environment, it will notice, in effect, and make a change in its internal state in response. There comes to be a two-way constraint of growing specificity between the device and the environment. (emphasis added) Fix the device in any one state and it demands a very specific environment in which to operate properly (you can no longer switch it easily from regulating temperature to regulating speed or anything else); but at the same time, if you do not fix the state it is in, but just plonk it down in a changed environment, its sensory attachments will be sensitive and discriminative enough to respond appropriately to the change, driving the system into a new state, in which it will operate effectively in the new environment. There is a familiar way of alluding to this tight relationship that can exist between the organization of a system and its environment: you say that the organism continuously mirrors the environment, or that there is a representation of the environment in—or implicit in—the organization of the system.[7]

It is not that we attribute (or should attribute) beliefs and desires only to things in which we find internal representations, but rather that when we discover some object for which the intentional strategy works, we endeavor to interpret some of its internal states or processes as internal representations. What makes some internal feature of a thing a representation could only be its role in regulating the behavior of an intentional system. (emphasis added in italics)

This is important to understand, as we can intuitively feel that any system that’s “really” an agent, or “really” has desires or beliefs, must have internal representations corresponding to those desires and beliefs, but, as Dennett points out, the relative predictive success of the intentional strategy is what really determines how much it makes sense to consider a system as “being intentional” or “having intentions, beliefs, desires, etc.”. An intentional system might represent its environment implicitly in its organization (like bacteria, or ants, for example).

Now the reason for stressing our kinship with the thermostat should be clear. There is no magic moment in the transitions from a simple thermostat to a system that really has an internal representation of the world around it. The thermostat has a minimally demanding representation of the world, fancier thermostats have more demanding representations of the world, fancier robots for helping around the house have still more demanding representations of the world. Finally you reach us.

When it comes to the question of why the intentional strategy works as well as it does, Dennett says that the question is ambiguous, with two very different possible kinds of answers:

  1. The system was designed such that the intentional stance applies well to it. In the case of a thermostat, Dennett simply claims that it was designed to be understood under the intentional stance, but Adam Shimi offers a clearer, more nuanced explanation: tautologically, designed things are designed to be understood from the design stance, but humans tend to have a teleological instinct when designing things, so most things designed by humans end up being interpretable under both the design and intentional stances. In the case of humans, “evolution has designed human beings to be rational, to believe what they ought to believe and want what they ought to want.” (I’m a bit suspicious of the phrasing that evolution designed us to believe and want what we “ought” to, but I think this is pointing at the idea that systems that have been subjected to a lot of selection pressure are more likely to “be coherent” and therefore more likely to be well-described via the intentional stance.)
  2. How the machinery works. The features of the thermostat’s design that explain why its behavior is well-understood under the intentional stance are easily discovered and understood, but not so with human minds. “How do human minds and brains implement ‘goal-directed behavior’?” is a fundamental and open question, insights into which will hopefully prove useful in understanding how prosaic systems will implement similar behavior.

As for one potential explanation for how our machinery works, Dennett suggests that brains themselves may have machinery that corresponds to “beliefs” and “desires.” If this were the case, the explanation for why the intentional stance works would be that its explanatory terms coincide with the actual, physical and/or functional machinery responsible for producing the observed behavior (at some relevant level of abstraction of the mechanics; certainly individual neurons don’t implement goal-directed behavior by themselves!). However, Dennett is careful to distinguish this claim from the claim that the intentional stance identifies objective patterns in observable behavior corresponding to “beliefs” and “desires”:

Those who think that it is obvious, or inevitable, that such a theory will prove true (and there are many who do), are confusing two different empirical claims. The first is that intentional stance description yields an objective, real pattern in the world—the pattern our imaginary Martians missed. This is an empirical claim, but one that is confirmed beyond skepticism. The second is that this real pattern is produced by another real pattern roughly isomorphic to it within the brains of intelligent creatures. Doubting the existence of the second real pattern is not doubting the existence of the first.

Dennett suggests human language as a candidate for this kind of machinery of belief and desire; perhaps a better modern candidate would be something vaguely Bayesian (probabilistic generative models?).[8] (Also, I could be wrong here, but I remember reading somewhere, I think maybe in one of Steve’s posts, that we have some evidence now that natural language mirrors the structure of thought and not the other way around—maybe children independently inventing languages that bear significant structural similarities to existing natural languages? And personally, much of my own thought isn’t in the form of words but more abstract patterns and concepts, but both language and abstract concepts feel like they share underlying structure, or something.)

Takeaways for deconfusing agency

Editorial note: To be clear, these “takeaways” are both “things Dan Dennett is claiming about the nature of agency with the intentional stance” and “ideas I’m endorsing in the context of deconfusing agency for AI safety.” I believe that Dennett really gets at the heart of the matter of agency with the intentional strategy, because it’s the clearest description I know of the process by which the human mind attributes “agency” not only to other systems but also to itself. Although developing a more formal characterization of the strategy is challenging for several reasons, I know of no other better starting point for developing a more rigorous understanding of the nature of agency.

There's no observer-independent fact of the matter about whether a system "is" an agent[9]

If something appears agent-y to us (i.e., we intuitively use the intentional strategy to describe its behavior), our next question tends to be, “but is it really an agent?” (It’s unclear what exactly is meant by this question in general, but it might be interpreted as asking whether some parts of the system correspond to explicit representations of beliefs and/or desires.) In the context of AI safety, we often talk about whether or not the systems we build “will or won’t be agents,” whether or not we should build agents, etc.

One of Dennett’s key messages with the intentional stance is that this is a fundamentally confused question. What it really and truly means for a system to “be an agent” is that its behavior is reliably predictable by the intentional strategy; all questions of internal cognitive or mechanistic implementation of such behavior are secondary. (Put crudely, if it looks to us like an agent, and we don’t have an equally-good-or-better alternative for understanding that system’s behavior, well, then it is one.) In fact, once you have perfectly understood the internal functional mechanics of a system that externally appears to be an agent (i.e. you can predict its behavior more accurately than with the intentional stance, albeit with much more information), that system stops looking like “an agent,” for all intents and purposes. (At least, modeling the system as such becomes only one potential model for understanding the system’s behavior, which you might still use in certain contexts e.g. for efficient inference or real-time action.)

We should therefore be more careful to recognize that the extent to which AIs will “really be agents” is just the extent to which our best model of their behavior is of them having beliefs, desires, goals, etc. If GPT-N appears really agent-y with the right prompting, and we can’t understand this behavior under the design stance (how it results from predicting the most likely continuation of the prompt, given a giant corpus of internet text) or a “mechanistic” stance (how individual neurons, small circuits, and/or larger functional modules interacted to produce the output), then GPT-N with that prompting really is an agent.

Remember that concepts like “agent” and “goal” are representations within the world models in which “we” exist, not things which can actually exist within the territory (presuming physicalism). The representations correspond to artificially-imposed (but one hopes usefully, if imperfectly, drawn) boundaries in Thingspace, so when we ask questions like “what does it mean for a system to ‘be an agent’?”, we’re essentially asking how to most usefully draw[10] that boundary, or more specifically characterize the ‘mass or volume in Thingspace’ to which the label points. Dennett, with the intentional stance, argues that any such satisfactory characterization will primarily be in terms of the system’s behavior (with respect to some observer), not its internal implementation, a point that has been made around here before.

In the end, what we care about are the effects that systems’ behaviors have on the world, not the details of how those behaviors are implemented. This is not to say that understanding how such behavior is cognitively implemented will not be instrumentally useful for better understanding and predicting the behavior of goal-directed agents; having a mechanistic understanding of the implementation seems like the best way to make accurate predictions about how the system will generalize to new (out-of-distribution) inputs. To this end, understanding how the human mind implements goal-directedness seems particularly useful for getting an idea of how we might do something similar with prosaic AI systems.

I can also imagine that we could potentially further constrain the boundary (beyond “systems well-described by the intentional strategy”) we draw around the “agent” cluster in Thingspace by including some cognitive criteria. For example, while explicit internal representations of beliefs or desires might not be necessary for a system to “really be an agent,” we might also believe that sufficiently advanced or “intelligent” intentional systems (especially those implemented via neural networks) will have explicit internal representations of beliefs and desires. If this were the case, then understanding how these beliefs and desires are in general represented or implemented cognitively/mechanistically would facilitate a much finer level of understanding of agency, in the specific context of advanced prosaic systems. In effect, we would be trading off between the generality/applicability of the characterization of agency and the specificity of the predictions that characterization enables us to make (in the context in which it applies).

Indeed, as Dennett mentions, understanding the link between those beliefs and desires that are “predictively attributable” under the intentional stance and potentially-existing-and-corresponding “functionally salient internal state[s] of the machinery, decomposable into functional parts” is perhaps the best approach to understanding why the intentional strategy works in the first place. Again, however, just to be clear, Dennett emphasizes (and I agree) that we should not primarily characterize the boundary in terms of internal representations or the like; the primary characteristic of agency should always be in terms of the behavior of the system under consideration: “It is not that we attribute (or should attribute) beliefs and desires only to things in which we find internal representations, but rather that when we discover some object for which the intentional strategy works, we endeavor to interpret some of its internal states or processes as internal representations.”

"You" and the intentional stance

The intuitive advantage of the intentional strategy is that it merely describes the process by which the human mind automatically ascribes beliefs to other systems it perceives as being or having minds: 1) consider the system as an “agent,” where “agent” points to a fairly primitive representation that the world models of most humans seem to share (this is merely the decision, conscious or unconscious, to apply the intentional stance), 2) deduce its beliefs and desires via commonsense reasoning from available information about its intelligence/rationality and environmental context.

The twin to this advantage, which is perhaps easier to miss, is that it also describes the process by which the human mind models itself as having agency. Our own sense of agency must result, at some level, from some part(s) of the mind “applying the intentional stance” to others. For example, introspection (metacognitive or otherwise) and self-narration (verbal or non-verbal) can be understood as the activity of a module that summarizes mental activity. Additionally, I feel as though there is a legitimate sense in which our conceptual goals and desires (so basically all of them?) are the result of the neocortex building a model of the entire system in which it is embedded as an agent (“applying the intentional stance”) from the signals it receives from the “value function” in the striatum. And verbal beliefs, whether internally experienced or externally expressed, should correspond to a module which is able to understand and produce natural language effectively “translating” the beliefs it ascribes to the mind based on inputs from other parts/modules into a form comprehensible by other higher-level modules (including those necessary for speech production). Even if the human mind models itself as being composed of many “agents” in order to have a model that “leaks” less than modeling itself as a single agent, such an understanding could only result from one part of the mind “applying the intentional stance” (using basically the same pattern or representation it uses to model other humans as agents or itself as a single agent) to the entities it infers to be responsible for various patterns it notices in mental activity.

In general, I think that much of the confusion about whether some system that appears agent-y “really is an agent” derives from an intuitive sense that the beliefs and desires we experience internally are somehow fundamentally different from those that we “merely” infer and ascribe to systems we observe externally. I also think that much of this confusion dissolves with the realization that internally experienced thoughts, beliefs, desires, goals, etc. are actually “external” with respect to the parts of the mind that are observing them—including the part(s) of the mind that is modeling the mind-system as a whole as “being an agent” (or a “multiagent mind,” etc.). You couldn't observe thoughts (or the mind in general) at all if they weren't external to "you" (the observer), in the relevant sense.

The most important thing to understand about the intentional stance is that “all it really means for a system to be an agent is that its behavior is reliably predictable via the intentional strategy (i.e. as having beliefs and desires, acting on those beliefs to satisfy those desires, etc.).” However, I believe that better understanding the above point, that human minds see themselves as agents because they’re “applying the intentional stance” to themselves or parts of themselves, has been perhaps even more helpful for “grokking” the intentional stance well enough that the original question about systems “really” being agents dissolved and I could see through my previous confusion.


  1. The most detailed existing summary I could find is from the literature review on goal-directedness. ↩︎

  2. Thanks to Kaj Sotala for pointing this out in a comment on a draft of this post (previously, I was just drawing the link between “gears-level models” and the design stance)! ↩︎

  3. Cf. Paul Christiano’s mention of “‘justified’ belief” in the context of universality. ↩︎

  4. These heuristics can be applied pretty straightforwardly to humans (since we obviously desire survival, food, shelter, etc.); the question is how we can begin the process of attributing beliefs and desires to an AI if we can’t automatically assume that the AI wants things like survival, food, comfort, entertainment, etc. The set of desires it is possible for AIs or minds-in-general to have is clearly much wider than the set of desires it is possible for humans to have. In this setting, the chicken-and-egg problem with beliefs and desires (where e.g. you attribute the beliefs relevant to the system's desires and the desires that the system believes are the best means to achieving other desires) seems trickier to avoid. ↩︎

  5. Cf. Armstrong and Mindermann, anyone? ↩︎

  6. I’m not very familiar with what Scott Garrabrant has been thinking in the context of human models (and potentially avoiding them), but maybe this hints at the idea that it might just be really difficult to avoid having models, whether implicit or explicit, of humans if you want to do any real-world prediction, even if it’s completely “non-agentic” tool/microscope AI? It seems increasingly difficult to prevent increasingly intelligent systems from discovering this pattern that compresses the data really well, at least until they get much smarter and can model us as e.g. collections of cells (but even then, they’d understand that we model ourselves and other humans this way). In a comment on a draft of this post, Adam Shimi agreed: “My take is that an AI will find the intentional stance really useful for understanding the human, so if it has to model them, it should learn some approximation of it. Then it’s not far from applying it to itself (especially if there are stories in the dataset of humans applying this stance to programs/machines like the AI).” ↩︎

  7. Seems related to the Good Regulator Theorem? ↩︎

  8. The core knowledge model also seems relevant here; it proposes that the human brain contains four systems for representing and reasoning about objects, agents, number, and space, respectively. (Thanks to Kaj Sotala for mentioning this to me!) ↩︎

  9. This is more a reflection of Dennett’s metaphysics than something that is unique to “agents”; I think he would say much the same about e.g. trees, and I would agree, remembering that “agents,” “trees,” and all other things and concepts are representations within the mind’s world model (more on this below). What Dennett is saying is not that it’s completely subjective whether a system “is an agent,” “is a tree,” etc., but rather that “agents” and “trees” are both useful (compressive) encodings of some patterns (that “objectively exist”) in the observable universe. Different minds could encode the same pattern in potentially quite diverse ways, depending on their sensory links to the environment and their desires (which determine which are the “relevant” features of the input which will be preserved under the encoding). “Whether the thing to which the representations point ‘really is’ what the representations themselves are” is basically a nonsensical question; however, how well a given encoding compresses, predicts, and/or explains the observables is not at all subjective, just a matter of information theory! Cf. Real Patterns, also the interlude “Reflections: Real Patterns, Deeper Facts, and Empty Questions” in The Intentional Stance. (Additionally, John Wentworth’s natural abstraction hypothesis in the context of alignment by default?) ↩︎

  10. Note that one must have an application in mind for the concept in order to draw a boundary that is “useful” for that application/in that context. (Thanks to Adam Shimi for reminding me to explicitly point this out.) ↩︎

8 comments

There's no observer-independent fact of the matter about whether a system "is" an agent[9]

Worth saying, I think, that this is true in full generality: there's no observer-independent fact of the matter about whether X "is" Y. That this is true of agents is just particularly relevant to AI.

I mostly agree with everything here, but I think it is understating the extent to which the intentional stance is insufficient for the purposes of AI alignment. I think if you accept "agency = intentional stance", then you need to think "well, I guess AI risk wasn't actually about agency".

A fundamental part of the argument for AI risk is that an AI system will behave in a novel manner when it is deployed out in the world, that then leads to our extinction. The obvious question: why should it behave in this novel manner? Typically, we say something like "because it will be agentic / be goal-directed with the wrong goal".

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.

Some previous discussion:

If you then deconfuse agency as "its behavior is reliably predictable by the intentional strategy", I then have the same question: "why is its behavior reliably predictable by the intentional strategy?" Sure, its behavior in the set of circumstances we've observed is predictable by the intentional strategy, but none of those circumstances involved human extinction; why expect that the behavior will continue to be reliably predictable in settings where the prediction is "causes human extinction"?

Overall, I generally agree with the intentional stance as an explanation of the human concept of agency, but I do not think it can be used as a foundation for AI risk arguments. For that, you need something else, such as mechanistic implementation details, empirical trend extrapolations, analyses of the inductive biases of AI systems, etc.

The requirement for its behavior being "reliably predictable" by the intentional strategy doesn't necessarily limit us to postdiction in already-observed situations; we could require our intentional stance model of the system's behavior to generalize OOD. Obviously, to build such a model that generalizes well, you'll want it to mirror the actual causal dynamics producing the agent's behavior as closely as possible, so you need to make further assumptions about the agent's cognitive architecture, inductive biases, etc. that you hope will hold true in that specific context (e.g. human minds or prosaic AIs). However, these are additional assumptions needed to answer the question of why an intentional stance model will generalize OOD; they do not replace the intentional stance as the foundation of our concept of agency, because, as you say, it explains the human concept of agency, and we're worried that AI systems will fail catastrophically in ways that look agentic and goal-directed... to us.

You are correct that having only the intentional stance is insufficient to make the case for AI risk from "goal-directed" prosaic systems, but having it as the foundation of what we mean by "agent" clarifies what more is needed to make the sufficient case—what about the mechanics of prosaic systems will allow us to build intentional stance models of their behavior that generalize well OOD?

Yeah, I agree with all of that.

Many people believe that they already understand Dennett's intentional stance idea, and due to that will not read this post in detail. That is, in many cases, a mistake. This post makes an excellent and important point, which is wonderfully summarized in the second-to-last paragraph:

In general, I think that much of the confusion about whether some system that appears agent-y “really is an agent” derives from an intuitive sense that the beliefs and desires we experience internally are somehow fundamentally different from those that we “merely” infer and ascribe to systems we observe externally. I also think that much of this confusion dissolves with the realization that internally experienced thoughts, beliefs, desires, goals, etc. are actually “external” with respect to the parts of the mind that are observing them—including the part(s) of the mind that is modeling the mind-system as a whole as “being an agent” (or a “multiagent mind,” etc.). You couldn't observe thoughts (or the mind in general) at all if they weren't external to "you" (the observer), in the relevant sense.

The real point of the intentional stance idea is that there is no fact of the matter about whether something really is an agent, and that point is most potent when applied to ourselves. It is neither the case that we really truly are an agent, nor that we really truly are not an agent.

This post does an excellent job of highlighting this facet. However, I think this post could have been more punchy. There is too much meta-text of little value, like this paragraph:

In an attempt to be as faithful as possible in my depiction of Dennett’s original position, as well as provide a good resource to point back to on the subject for further discussion[1], I will err on the side of directly quoting Dennett perhaps too frequently, at least in this summary section.

In a post like this, do we need to be forewarned that the author will err perhaps too frequently on the side of directly quoting Dennett, at least in the summary section? No, we don't need to know that. In fact the post does not contain all that many direct quotes.

At the top of the "takeaways" section, the author gives the following caveat:

Editorial note: To be clear, these “takeaways” are both “things Dan Dennett is claiming about the nature of agency with the intentional stance” and “ideas I’m endorsing in the context of deconfusing agency for AI safety.”

The word "takeaways" in the heading already tells us that this section will contain points extracted by the reader that may or may not be explicitly endorsed by the original author. There is no need for extra caveats; they just lead to a bad reading experience.

In the comments section, Rohin makes the following very good point:

I mostly agree with everything here, but I think it is understating the extent to which the intentional stance is insufficient for the purposes of AI alignment. I think if you accept "agency = intentional stance", then you need to think "well, I guess AI risk wasn't actually about agency".

Although we can "see through" agency as not-an-ontologically-fundamental-thing, nevertheless we face the practical problem of what to do about the (seemingly) imminent destruction of the world by powerful AI. What actually should we do about that? The intentional stance not only fails to tell us what to do, it also fails to tell us how any approach to averting AI risk can co-exist with the powerful deconstruction of agency offered by the intentional stance idea itself. If agency is in the eye of the beholder, then... what? What do we actually do about AI risk?

Concepts are generally clusters and I would say that being well-predicted by the Intentional Strategy is one aspect of what is meant by agency.

Another aspect relates to the interior functioning of an object. A very simple model would be to say that we generally expect the object to have a) some goals, b) counterfactual modeling abilities and c) to pursue the goals based on these modeling abilities. This definition is less appealing because it is much more vague and each of the elements in the previous sentence would need further clarification; however this doesn't mean that it is any less of a part of what people are generally imagining when they think of an agent. Humans come pre-equipped with at least a vague and casual sense of what these kinds of terms mean, so the above description is already sufficient for us to say, for example, that a metal ball that seems agentic according to the Intentional Stance because it is controlled by a magnet isn't agentic (on its own) according to the interior functioning stance.

I don't have time to expand on every aspect here (especially since these definitions would require further expansion; and so on), so I'll just focus on the notion of goals. Here are some relevant considerations for being considered as a goal:

  • Human-like goals are more likely to be considered goals than, for example, printing out every number that meets 20 conditions without falling into one of 300 exceptions. However, if we were told that there was a simple reason for performing such a weird analysis (e.g. legal compliance), then we'd be more likely to accept this as a goal.
  • We are more likely to consider a system to have goals if it represents them simply, but again, if we're given a sufficient reason we might still accept it as a goal (for example if we were told that the representation was due to the hard drive being protected by encryption).
  • The goals should be used to determine behavior, although we're now moving to part c) of the interior functioning requirements.

Note that a large part of the challenge is that we can't imagine every way of interpreting a system, so it would be very easy to say that a system has goals if it meets these three conditions, where the conditions are broad enough that everything might be considered to have a goal. So what usually ends up happening is that we pick out properties that seem to include most things we consider as having goals and seem to exclude things we generally don't consider to have goals (although we normally just handwave here). And then if someone informs us that our definition picks out too much, we narrow it by adding tighter conditions. So this isn't really an objective process.

Again, our definitions have used vague language, but that's just how our mind works.

Planned summary for the Alignment Newsletter:

This post describes takeaways from [The Intentional Stance](https://mitpress.mit.edu/books/intentional-stance) by Daniel Dennett for the concept of agency. The key idea is that whether or not some system is an “agent” depends on who is observing it: for example, humans may not look like agents to superintelligent Martians who can predict our every move through a detailed understanding of the laws of physics. A system is an agent relative to an observer if the observer’s best model of the system (i.e. the one that is most predictive) is one in which the system has “goals” and “beliefs”. Thus, with AI systems, we should not ask whether an AI system “is” an agent; instead we should ask whether the AI system’s behavior is reliably predictable by the intentional stance.

How is the idea that agency only arises relative to some observer compatible with our view of ourselves as agents? This can be understood as one “part” of our cognition modeling “ourselves” using the intentional stance. Indeed, a system usually cannot model itself in full fidelity, and so it makes a lot of sense that an intentional stance would be used to make an approximate model instead.

Planned opinion:

I generally agree with the notion that whether or not something feels like an “agent” depends primarily on whether or not we model it using the intentional stance, which is primarily a statement about our understanding of the system. (For example, I expect programmers are much less likely to anthropomorphize a laptop than laypeople, because they understand the mechanistic workings of laptops better.) However, I think we do need an additional ingredient in AI risk arguments, because such arguments make claims about how an AI system will behave in novel circumstances that we’ve never seen before. To justify that claim, we need to have an argument that can predict how the agent behaves in new situations; it doesn’t seem like the intentional stance can give us that information by itself. See also [this comment](https://www.alignmentforum.org/posts/jHSi6BwDKTLt5dmsG/grokking-the-intentional-stance?commentId=rS27NBMu478YrwxBh).

Nice summary :) It's relevant for the post that I'm about to publish that you can have more than one intentional-stance view of the same human. The inferred agent-shaped model depends not only on the subject and the observer, but also on the environment, and on what the observer hopes to get by modeling.