We would like to discern the intentions of a hyperintelligent, possibly malicious agent which has every incentive to conceal its evil intentions from us. But what even is intention? What does it mean for an agent to work towards a goal?
Consider the lowly ant and the immobile rock. Intuitively, we feel one has (some) agency and the other doesn't, while a human has more agency than either of them. Yet a sceptic might object that ants seek out sugar and rocks fall down, and that there is no intrinsic difference between the goal of eating yummy sweets and the 'goal' of minimising gravitational potential energy.
Intention is a property that an agent has with respect to a goal. Intention is not a binary value, a number or even a topological vector space. Rather, it is a certain constellation of counterfactuals.
***
Let $W$ be a world, which we imagine as a causal model in the sense of Pearl: a directed acyclic graph with nodes and attached random variables $N_0, \dots, N_k$. Let $R$ be an agent. We imagine $R$ to be a little robot - so not a hyperintelligent malignant AI - and we'd like to test whether it has a goal $G$, say $G = (N_0 = 10)$. To do so we are going to run an Intentional Agency Experiment: we ask $R$ to choose an action $A_1$ from its possible actions $B = \{a_1, \dots, a_n\}$.
Out of the possible actions $B$, one action $a_{best}$ is the 'best' action for $R$ if it has goal $G = (N_0 = 10)$, in the sense that $P(G \mid do(a_{best})) \geq P(G \mid do(a_i))$ for $i = 1, \dots, n$.
If $R$ doesn't choose $A_1 = do(a_{best})$, great! We're done; $R$ doesn't have goal $G$. If $R$ does choose $A_1 = do(a_{best})$, we provide it with a new piece of (counterfactual) information, $P(G \mid A_1) = 0$, and offer it the option of changing its action. Among the remaining actions there is one next-best action $a_{nbest}$. Given the information $P(G \mid A_1) = 0$, if $R$ does not choose $A_2 = do(a_{nbest})$ we stop; if $R$ does, we provide it with the information $P(G \mid A_2) = 0$ and continue as before.
At each round $R$ passes we assign it more and more agency. Rather than a binary 'Yes, $R$ has agency' or 'No, $R$ has no agency', we imagine a continuum going from a rock, which has no possible actions, to an ant, which might pass some of the tests but not all, to humans and beyond.
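To make the procedure concrete, here is a minimal sketch of how one might simulate the experiment. It assumes we compress the world $W$ down to a function mapping each action $a$ to $P(G \mid do(a))$ and model $R$ as a callable that picks an action from those still on offer; the names (`intentional_agency_experiment`, `p_goal_given_do`, the toy routes) are illustrative inventions, not part of any existing library.

```python
# A minimal sketch of the Intentional Agency Experiment (illustrative only).
# The causal model W is compressed into a function a -> P(G | do(a)), and the
# agent R is a callable that picks an action from those still on offer.

def intentional_agency_experiment(actions, p_goal_given_do, agent):
    """Return how many rounds R passes before deviating from the best action."""
    remaining = list(actions)
    rounds_passed = 0
    while remaining:
        # The 'best' remaining action maximises P(G | do(a)).
        best = max(remaining, key=p_goal_given_do)
        choice = agent(remaining)
        if choice != best:
            break  # R did not pick the best action, so we stop the experiment.
        # R passed this round. We "inform" it that P(G | A_i) = 0 by removing
        # the chosen action and letting it choose again from what is left.
        rounds_passed += 1
        remaining.remove(best)
    return rounds_passed


# Toy world: three routes to a sugar source with different success probabilities.
p = {"route_c": 0.1, "route_a": 0.9, "route_b": 0.6}

ant = lambda available: max(available, key=p.get)   # keeps re-planning towards the sugar
rock = lambda available: available[0]               # 'chooses' whatever happens to come first

print(intentional_agency_experiment(list(p), p.get, ant))   # 3: passes every round
print(intentional_agency_experiment(list(p), p.get, rock))  # 0: fails the first round
```

The count of passed rounds is exactly the graded notion of agency described above: the rock-like agent scores zero, while the goal-directed agent keeps re-planning towards $G$ and scores the maximum.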
***
Q: What if $R$ isn't acting rationally? What if it doesn't know all the details of $W$? What if it knows more? What if $R$ has bounded computation? What if...
A: The above is merely a simple model that tries to capture intent; one can complicate it as needed. Most of these objections come down to $R$'s possible inability to choose the action that would be best given unlimited compute and full knowledge of $W$. To remedy this we might allow the Intentional Agency Experiment to continue if $R$ chooses an action that is close to optimal but not optimal. We may introduce a Time to Think parameter when we consider computationally bounded agents, etc. Once again, the point is not to assign a binary value of goal intention to an agent; rather, it is to assign it a degree of agency.
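For instance, in the sketch above the exact-match test `choice != best` could be replaced by a probability comparison, so that a near-optimal but imperfect choice still counts as a pass; the tolerance parameter `epsilon` below is a made-up illustration of that relaxation.

```python
# Relaxed pass criterion for a boundedly rational R (epsilon is a hypothetical tolerance).
def passes_round(choice, remaining, p_goal_given_do, epsilon=0.05):
    best = max(remaining, key=p_goal_given_do)
    return p_goal_given_do(choice) >= p_goal_given_do(best) - epsilon
```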
Q: What if $R$ knows that we are testing it and tries to deceive us?
A: Yes, this breaks the model.
Q: Counterfactuals are weird and might not exist. Even if they did, the Intentional Agency Experiment is impossible to execute in practice.
A: Despite the Intentional Agency Experiment being an idealisation, we may approximate it in the real world. For instance, if we'd like to test whether an ant intends to seek out a sugar source (as opposed to walking randomly), we might first check whether it moves towards the sugar source; if it does, we block off this route and see whether it tries to circumvent the obstruction. In fact, it could be argued that this is how we test agency in real life.