Thanks to Rebecca Gorman for her help with this post.

In the past, I've talked about counterfactual incentives. In that setup, the agent was motivated to maximise not its actual reward, but the (counterfactual) reward it would have gotten if things had been different.

"If things had been different" is ambiguous, though - different in what way, and with what consequences? In that post, I assumed an "evidential" counterfactual, but there are other ways of defining a counterfactual, and they have different consequences. This is important for designing agents that can use counterfactual reasoning for AI safety.

So this post will look at four kinds of counterfactuals, termed "causal" (Pearl's "do" operator), "evidential" (standard conditioning), "closest world", and "closest model", and see how they differ.

How do humans understand counterfactuals?

Imagine you are a jury-member in the trial of one RM, accused of killing one Tybalt C. Your role is to weigh up the evidence, including your impressions of the character of RM and any of the witnesses.

At one point in the trial, a journalist giving evidence blurts out:

  • "But RM was involved in a street brawl with the Capulet gang!"

The judge immediately interrupts, and says:

  • "That conviction is spent. The jury will disregard the implications of that statement, and act as if RM was not involved in the street brawl."

If we wanted to obey the judge's instructions, how should we do it? They obviously want us to reason as if we were in a counterfactual universe, but which counterfactual?

Initial model

Let's model the situation in the following way. You have to decide whether RM is innocent. You've seen some evidence at the trial. You've heard the journalist's statement and the judge's response. You wonder whether RM is of bad character (which is very relevant to assessing his guilt). You also wonder whether RM was involved in the brawl with the Capulets or not.

All this can be encoded in a causal graph, where each node is a true/false boolean:

The pale pink nodes are the ones that you've personally observed; the blue nodes are the ones that you're inferring. The arrows denote the causal structure.

The red arrow denotes that, if the source node is true, then the target node is false (with high probability). So if RM is of bad character, then, based on the other evidence you've seen, he's almost certainly guilty.

Is RM of bad character? That's certainly possible - he's a bit of a Romeo. But it's the brawl node that will be decisive here. If he was involved in the brawl with the Capulet gang, then he's almost certainly of bad character. That fact is denoted by the green arrow: if the top node is true, then the "RM is of bad character" node is also true, with high probability.

Finally, we have the hollow green arrow that connects the brawl node and the statements by the journalist and judge. This hollow green arrow denotes an 'if and only if' relation: you estimate that, with high probability, if the brawl node is true then the statement node is also true, and if the brawl node were false then the statement node would also be false.
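To make the structure concrete, here is a minimal Python sketch of the model: a brute-force Bayesian network evaluated by enumeration. The node names follow the graph above, but every numerical probability is an illustrative assumption, since the exact values aren't pinned down here.

```python
from itertools import product

# Boolean nodes, following the graph in the post:
#   brawl     - RM was involved in the brawl with the Capulets
#   bad_char  - RM is of bad character
#   innocent  - RM is innocent of killing Tybalt
#   statement - the journalist and judge behaved as if the brawl happened
# All numbers below are illustrative assumptions.

def p_bad_char(brawl):
    # Green arrow: if the brawl happened, RM is almost certainly of bad character.
    return 0.95 if brawl else 0.30

def p_innocent(bad_char):
    # Red arrow: if RM is of bad character, he is almost certainly *not* innocent.
    return 0.05 if bad_char else 0.70

def p_statement(brawl):
    # Hollow green arrow ("iff"): the statements track the brawl with high probability.
    return 0.95 if brawl else 0.05

def joint(brawl, bad_char, innocent, statement):
    p = 0.5  # flat prior on the brawl node (illustrative)
    p *= p_bad_char(brawl) if bad_char else 1 - p_bad_char(brawl)
    p *= p_innocent(bad_char) if innocent else 1 - p_innocent(bad_char)
    p *= p_statement(brawl) if statement else 1 - p_statement(brawl)
    return p

def posterior(query, evidence):
    """P(query node is True | evidence), by brute-force enumeration."""
    names = ("brawl", "bad_char", "innocent", "statement")
    num = den = 0.0
    for values in product([True, False], repeat=4):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(**world)
        den += p
        if world[query]:
            num += p
    return num / den

print(posterior("innocent", {}))                   # ~0.29 with these numbers
print(posterior("innocent", {"statement": True}))  # ~0.10: the statements are evidence of the brawl
```

The point of the sketch is only to pin down the structure; nothing that follows depends on the particular numbers.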

The judge has instructed you to take the brawl node as false, so the network should be:

But what does that mean; what is that big "false" actually doing?

"Causal" counterfactual

In the simplest, "causal" counterfactual, the brawl node is cut away from any preceding nodes, and is set to "false"[1].
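As a sketch of the mechanics, continuing the illustrative toy model above: the do operation replaces the brawl node's own mechanism with the constant "false", while every downstream mechanism - including the one producing the judge and journalist's statements - is kept.

```python
from itertools import product

# Same illustrative toy CPTs as in the sketch above.
def p_bad_char(brawl):    return 0.95 if brawl else 0.30
def p_innocent(bad_char): return 0.05 if bad_char else 0.70
def p_statement(brawl):   return 0.95 if brawl else 0.05

def joint_do_brawl_false(brawl, bad_char, innocent, statement):
    """Joint distribution under do(brawl = false): the brawl node's own mechanism
    becomes the constant False, but its outgoing arrows (to bad_char and statement)
    stay in place."""
    if brawl:
        return 0.0
    p = 1.0
    p *= p_bad_char(brawl) if bad_char else 1 - p_bad_char(brawl)
    p *= p_innocent(bad_char) if innocent else 1 - p_innocent(bad_char)
    p *= p_statement(brawl) if statement else 1 - p_statement(brawl)
    return p

def prob(query, joint_fn):
    """Marginal P(query = True) under the given joint, by enumeration."""
    names = ("brawl", "bad_char", "innocent", "statement")
    num = den = 0.0
    for values in product([True, False], repeat=4):
        world = dict(zip(names, values))
        p = joint_fn(**world)
        den += p
        if world[query]:
            num += p
    return num / den

# Under do(brawl = false), the statements you actually heard become very unlikely:
print(prob("statement", joint_do_brawl_false))  # 0.05 with these illustrative numbers
```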

Is there a problem with this? Well, there is one obvious issue: what were the judge and journalist talking about, in that case? Remember that we've set the brawl node to "false"; RM was not involved in the brawl. So why would a journalist mention that, and then a judge tell us to disregard it? Indeed, the judge's instruction to act as if something hadn't happened is itself evidence that it happened[2].

How else could we account for these pronouncements? Well, maybe we could imagine a conspiracy that explains all this, and that put pressure on the judge and journalist. Pretty unlikely, of course, but maybe the best explanation for the simultaneous "RM not involved in the brawl" and "judge and journalist behave as if RM was involved in the brawl":

Does this work? Well, a problem with this setup is that not only does the conspiracy explain the judge and journalist's pronouncements, but it also implies that RM is innocent - why have a conspiracy to frame a guilty man?

This counterfactual seems incorrect - when the judge said "act as if RM was not involved in the brawl", it's pretty clear that they didn't mean "act as if there was a conspiracy explaining my actions, and hence find RM innocent."

This is a general problem with causal counterfactuals: to explain some of the observations while setting a specific node to a counterfactual value, we will generally need to assume odd and unlikely explanations. And unlikely explanations will have unlikely consequences.

Which way does causality flow?

There's another problem with the causal counterfactual, though it doesn't quite break the system the same way the previous problem did.

Suppose for example that RM had a tattoo identifying him as part of the Montague gang, and that the jury had seen this tattoo. Suppose further that the jury knew that everyone in the Montague gang had taken part in the brawl with the Capulets. Then we would get the following setup:

Note the equivalence arrows between the "RM part of Montague gang" node, and the brawl and tattoo nodes. The causal counterfactual will cut the top arrow; nevertheless, based on their observations, the jury seems to be required to assume that "RM is in the Montague gang", "everyone in the Montague gang was involved in the brawl", and "RM was not involved in the brawl". This is, at the very least, an inelegant set of beliefs.

And this becomes a problem if we reverse the direction of the causal arrow:

Why might we reverse that arrow? Well, maybe the brawl with the Capulets was an initiation ceremony for the Montague gang. Thus we can see being in the Montague gang as causing the brawl with the Capulets, or the brawl causing being in the Montague gang. This is a general problem with causal graphs; most laws of physics are time reversible, so many arrows can be drawn in several directions.

But of course, if we reverse the arrow, this changes the effect of cutting the arrows leading into the brawl node. So causal counterfactuals are dependent on the details of the causal graph - details that can be filled in multiple, equivalent ways.

"Evidential" counterfactual

In an evidential counterfactual, the brawl node is still set to false, but no arrows are cut, whether they go into or come out of the brawl node[3].

This still has the "vast conspiracy" problem of causal counterfactuals. But it also has a similar problem with arrows going into the brawl node, not just with arrows coming out of it - which is what we would expect, since the inward arrows are not being cut.
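Here is a sketch of that contrast on the gang/tattoo fragment of the graph (all numbers are, again, illustrative assumptions). The causal cut leaves belief in gang membership untouched by the counterfactual, while plain conditioning drags it down, so the observed tattoo is left needing some other explanation.

```python
from itertools import product

# Toy fragment with the gang and tattoo nodes (illustrative numbers):
#   gang   - RM is part of the Montague gang
#   brawl  - RM took part in the brawl (everyone in the gang did)
#   tattoo - RM has the Montague tattoo (which the jury has seen)

def p_brawl(gang):  return 0.99 if gang else 0.05
def p_tattoo(gang): return 0.99 if gang else 0.01

def joint(gang, brawl, tattoo, cut_gang_to_brawl):
    p = 0.5  # prior on gang membership
    if cut_gang_to_brawl:
        p *= 0.5  # causal cut: brawl no longer depends on gang
    else:
        p *= p_brawl(gang) if brawl else 1 - p_brawl(gang)
    p *= p_tattoo(gang) if tattoo else 1 - p_tattoo(gang)
    return p

def prob_gang(evidence, cut):
    """P(gang = True | evidence), with or without cutting the gang -> brawl arrow."""
    num = den = 0.0
    for gang, brawl, tattoo in product([True, False], repeat=3):
        world = {"gang": gang, "brawl": brawl, "tattoo": tattoo}
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(gang, brawl, tattoo, cut)
        den += p
        if gang:
            num += p
    return num / den

observed = {"brawl": False, "tattoo": True}
print(prob_gang(observed, cut=True))   # causal: ~0.99, the tattoo still implies gang membership
print(prob_gang(observed, cut=False))  # evidential: ~0.51, gang membership is dragged down,
                                       # so the tattoo needs some other explanation
```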

So, to explain why RM has the tattoo, but was not involved in the brawl, we might have to posit a mad tattooist going around tattooing people while they sleep. And the presence of a mad tattooist might cause us to think that weird things are going on here, so maybe RM is innocent:

This is the model of counterfactual that we used in this post here. And the main problems mentioned in that post - trying to reconcile impossible or at least very unlikely data - were variants of the "vast conspiracy" or "mad tattooist" problems here.

Closest (impossible) world

In the previous counterfactuals, we assumed that our observations were correct. So we had to fit the counterfactual assumptions uncomfortably around our observations.

But we could take another approach, which is that the judge is asking us to imagine we were in a counterfactual world that might well be impossible. So we don't need consistency with our own observations; we just need to modify the world as little as needed to get something we can reason with.

In this case, the simplest option is to negate two of our observations; then we can produce the following graph and reason sensibly, with "vast conspiracy" and "mad tattooist" remaining very unlikely and not messing up the analysis:

In this case, the "RM in Montague gang" node would be set to false, but have little effect on the other nodes, which will be assessed independently.

Note that for this kind of counterfactual to work, we need some sort of distance measure between worlds. This distance measure will formalise the intuitive idea of "closest world". This might be discussed in a later post.
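As a toy illustration of what such a distance measure could look like - purely a sketch, not a proposal from this post - we can count how many observed nodes a candidate world negates (a Hamming distance over the observations), treating the strongest links in the graph as hard constraints.

```python
from itertools import product

# A "world" assigns True/False to every node. For this sketch, the strongest links
# in the graph are treated as hard constraints; all of this is an illustrative assumption.
NODES = ("gang", "brawl", "tattoo", "statement")

def consistent(w):
    return (
        (not w["gang"] or w["brawl"])       # everyone in the Montague gang was in the brawl
        and (w["tattoo"] == w["gang"])      # the tattoo marks gang membership
        and (w["statement"] == w["brawl"])  # judge/journalist speak iff the brawl happened
    )

observations = {"tattoo": True, "statement": True}  # what the juror actually observed
constraint   = {"brawl": False}                     # the judge's counterfactual

def distance(world):
    """How many actual observations this world negates."""
    return sum(world[k] != v for k, v in observations.items())

candidates = []
for values in product([True, False], repeat=len(NODES)):
    world = dict(zip(NODES, values))
    if consistent(world) and all(world[k] == v for k, v in constraint.items()):
        candidates.append(world)

closest = min(candidates, key=distance)
print(distance(closest), closest)
# -> 2 {'gang': False, 'brawl': False, 'tattoo': False, 'statement': False}
# i.e. the "negate two of our observations" move from above.
```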

Closest model

A closest world approach is plausible if negating some observations is a minor process. A world in which the judge and journalist had not said what they did, and RM didn't have a tattoo, is an impossible world, one we know is false. But it's a pretty likely impossible world; we can easily imagine the world in which that happens.

But what if we are asked to negate something more significant, something more core to how we understand the world?

Maybe you're a close friend of RM, you've known him since youth. Maybe you saw him planning the brawl over several days; maybe you saw him dreaming of the brawl for many years. Maybe you saw the brawl with your own eyes (why are you a juror in this trial then? Well, maybe Verona is pretty corrupt). Maybe RM sent you videos of his role in the brawl, and the two of you discussed it at length.

There's no need to belabour the point, which is that denying RM's role in the brawl requires denying huge amounts of your own personal experience:

How could you - and by extension an AI - deal with this? One way is to imagine how you would have interpreted the judge's command if you were not so knowledgeable - if you were a generic jury member judging a generic accused.

Expert difficulties

What if you were asked to assume that dinosaurs went extinct 55 million years ago? This seems like something you and I could do quite easily.

An expert palaeontologist, however, would have great difficulty with that. They've been studying fossils their whole life, and they know that the true date of the (non-avian) dinosaur extinction is around 66 million years ago. All their life experience points against the assumption that you or I would be capable of accepting.

Similarly, accepting the counterfactual "the globe isn't warming" is much easier to do if you're not a climatologist or closely connected to their world.

So the problem with you dealing with RM and the brawl is that you are, in this instance, an expert on that subject.

Closest non-expert model

One way of resolving this is to find a model, as close as possible to yours, but where you yourself were not an expert. In this case, the salient facts are that taking part in the brawl was a bad thing that reflects poorly on RM's character - or rather, on the character of the accused that you are now judging. That more generic model could be:

And, in that case, there would be no problem doing a "closest world" counterfactual on this graph; just ignore the brawl in the assessment of RM's character.
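A very small sketch of what that generic-juror model might look like, with illustrative numbers: the brawl node is simply absent, and the accused's character is assessed from the remaining trial evidence alone.

```python
# Generic-juror model: no brawl node, no personal history with the accused.
# The character of the accused is assessed only from the other trial evidence.
# All numbers are illustrative assumptions.

def p_bad_char(other_evidence_bad):
    return 0.70 if other_evidence_bad else 0.20

def p_innocent(bad_char):
    return 0.05 if bad_char else 0.70

def p_innocent_given_evidence(other_evidence_bad):
    """P(innocent) for the generic accused, marginalising over character."""
    pb = p_bad_char(other_evidence_bad)
    return pb * p_innocent(True) + (1 - pb) * p_innocent(False)

print(p_innocent_given_evidence(other_evidence_bad=False))  # ~0.57 with these numbers
```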

Choice of closest model

The choice of the closest model depends on why you/the expert/the AI are supposed to be doing counterfactual reasoning.

For example, maybe a famous TV commentator said that the dinosaurs went extinct 55 million years ago, or that global warming isn't happening. In that case, the closest model counterfactual might simply be requiring you to imagine that the TV commentator is well-researched and accurate; you can ignore the science completely, and just focus on the commentator's personality.

Or maybe we're imagining that there is a conspiracy of palaeontologists/climatologists to conceal the extinction date of dinosaurs/the temperature trend of the planet. In that case, experts might be expected to ignore their own personal knowledge of conspiracies and non-conspiracies, and view the counterfactual conspiracy as an outsider would.

So what you are being asked to take a counterfactual over can be less important than why you are being asked to assume it. In the case of a trial, it's so the accused can have a fair-ish trial; that informs what closest model you should use to implement the counterfactual.

In conclusion

So, in defining a counterfactual, it does not suffice to simply point at a node and require that its value be set to something. There are multiple ways this can be interpreted, and care must be taken to ensure that the AI gets the right interpretation, with the right implications.


  1. This is Pearl's do operator, do(brawl = false). ↩︎

  2. Even if we ignored the evidential aspect of the judge's instructions here, we still have to deal with the journalist's pronouncements. ↩︎

  3. This is a standard conditional, P( · | brawl = false). ↩︎

Comments

Very nice overview! Of course, I think most of the trick is crammed into that last bit :) How do you get a program to find the "common-sense" implied model of the world to use for counterfactuals?

I believe that we need to take a Conceptual Engineering approach here. That is, I don't see counterfactuals as intrinsically part of the world, but rather something we construct. The question to answer is what purpose we are constructing these for? Once we've answered this question, we'll be 90% of the way towards constructing them.

As far as I can see, the answer is that we imagine a set of possible worlds and we notice that agents that use certain notions of counterfactuals tend to perform better than agents that don't. Of course, this raises the question of which possible worlds to consider, at which point we notice that this whole thing is somewhat circular.

However, this is less problematic than people think. Just as we can only talk about what things are true after already having taken some assumptions to be true (see Where Recursive Justification hits Bottom), it seems plausible that we might only be able to talk about possibility after having already taken some things to be possible.

Maybe "logical counterfactuals" are also relevant here (in the way I've used them in this post). For example, consider a reward function that depends on whether the first 100 digits after the th digit in the decimal representation of are all 0. I guess this example is related to the "closest non-expert model" concept.