Planned summary for the previous post for the Alignment Newsletter:
Consider a setting in which an agent can play a game against a predictor. The agent can choose to say zero or one. It gets 3 utility if it says something different from the predictor, and -1 utility if it says the same thing. If the predictor is near-perfect, but the agent models itself as having access to unpredictable randomness, then the agent will continually try to randomize (which it calculates has expected utility 1), and will continually lose.
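(For reference, the arithmetic behind that expected utility of 1, assuming the agent believes it can randomise fifty-fifty against the predictor:

$$\mathbb{E}[U] = \tfrac{1}{2}\cdot 3 + \tfrac{1}{2}\cdot(-1) = 1.)$$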
Planned summary for this post:
The problem with the previous agent is that it never learns that it has the wrong causal model. If the agent is able to learn a better causal model from experience, then it can learn that it is not actually able to use unpredictable randomness, and so it will no longer expect a 50% chance of winning, and it will stop playing the game.
If the predictor is near-perfect, but the agent models itself as having access to unpredictable randomness, then the agent will continually try to randomize (which it calculates has expected utility 1), and will continually lose.
It's actually worse than that for CDT; the agent is not actually trying to randomise. It is compelled to model the predictor as a process that is completely disconnected from its own actions, so it can freely pick the action that the predictor is least likely to pick - according to the CDT agent's model of the predictor - or pick zero in the case of a tie. So the CDT agent is actually deterministic, and even if you gave it a source of randomness, it wouldn't see any need to use it.
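A minimal sketch of this, using the payoffs from the summary above (3 for differing from the predictor, -1 for matching): the CDT agent's choice is a deterministic argmax over its fixed beliefs about the predictor, with ties going to zero.

```python
def cdt_action(p_predictor_picks_one: float) -> int:
    """Deterministic CDT choice: pick the action the predictor is believed
    least likely to pick; ties go to zero."""
    p_match = {0: 1 - p_predictor_picks_one, 1: p_predictor_picks_one}
    # Expected utility: +3 when the actions differ, -1 when they match.
    eu = {a: 3 * (1 - p_match[a]) - 1 * p_match[a] for a in (0, 1)}
    return 0 if eu[0] >= eu[1] else 1
```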
The problem with the previous agent is that it never learns that it has the wrong causal model. If the agent is able to learn a better causal model from experience, then it can learn that it is not actually able to use unpredictable randomness, and so it will no longer expect a 50% chance of winning, and it will stop playing the game.
[...] then it can learn that the predictor can actually predict the agent successfully, and so will no longer expect a 50% [...]
Thanks! I changed it to:
If the predictor is near-perfect, but the agent models its actions as independent of the predictor (since the prediction was made in the past), then the agent will have some belief about the prediction, will choose whichever action it believes is less likely to have been predicted (for an expected utility of at least 1), and will continually lose.
The problem with the previous agent is that it never learns that it has the wrong causal model. If the agent is able to learn a better causal model from experience, then it can learn that the predictor can actually predict the agent successfully, and so will no longer expect a 50% chance of winning, and it will stop playing the game.
What you wrote is good, and not worth changing. But I wanted to mention that CDT is even more bonkers than that: the prediction can be made in the future, just as long as there is no causal path from the agent's action to how the predictor is predicting. In some cases, the predictor can even know the action taken, and still predict in a way that CDT thinks is causally disconnected.
I think that your reasoning here is essentially the same thing I was talking about before:
...the usual philosophical way of thinking about decision theory assumes that the model of the environment is given, whereas in our way of thinking, the model is learned. This is important: for example, if AIXI is placed in a repeated Newcomb's problem, it will learn to one-box, since its model will predict that one-boxing causes the money to appear inside the box. In other words, AIXI might be regarded as a CDT, but the learned "causal" relationships are not the same as physical causality
Since then I evolved this idea into something that wins in counterfactual mugging as well, using quasi-Bayesianism.
There are some minor differences; your approach learns the whole model, whereas mine assumes the model is given, and learns only the "acausalish" aspects of it. But they are pretty similar.
One problem you might have is learning the acausal stuff in the mid-term. If the agent learns that causality exists, and then finds that in the Newcomb problem its decision seems to have a causal effect, then it may search a lot for the causal link. Eventually this won't matter (see here), but in the mid-term it might be a problem.
Or not. We need to test more ^_^
Well, being surprised by Omega seems rational. If I found myself in a real life Newcomb problem I would also be very surprised and suspect a trick for a while.
Moreover, we need to unpack "learns that causality exists". A quasi-Bayesian agent will eventually learn that it is part of a universe ruled by the laws of physics. The laws of physics are the ultimate "Omega": they predict the agent and everything else. Given this understanding, it should not be too difficult to see Newcomb!Omega as a special case of Physics!Omega. (I don't really have an understanding of quasi-Bayesian learning algorithms and how learning one hypothesis affects the learning of further hypotheses, but it seems plausible that things can work this way.)
Inspired by my post on problems with causal decision theory (CDT), here is a hacked version of CDT that seems to be able to imitate timeless decision theory (TDT) and functional decision theory[1] (FDT), as well as updateless decision theory (UDT) under certain circumstances.
Call this ACDT, for (a)causal decision theory. It is, essentially, CDT which can draw extra, acausal arrows on the causal graphs, and which attempts to figure out which graph represents the world it's in. The drawback is its lack of elegance; the advantage, if it works, is that it's simple to specify and focuses attention on the important aspects of deducing the graph.
Defining ACDT
CDT and the Newcomb problem
In the Newcomb problem, there is a predictor Ω who leaves two boxes, and predicts whether you will take only the first one ("one-box") or both ("two-box"). If Ω predicts you will one-box, it has put a large prize in that first box; otherwise that box is empty. There is always a small consolation prize in the second box.
In terms of causal graphs, we can represent it this way:
The dark red node is the decision node, which the agent can affect. The green node is a utility node, whose value the agent cares about.
The CDT agent uses the "do" operator from Pearl's Causality. Essentially, all the incoming arrows to the decision node are cut (though the CDT agent keeps track of any information gained that way), and then the CDT agent maximises its utility by choosing its action:
In this situation, the CDT agent will always two-box, since it treats Ω's decision as fixed, and in that case two-boxing dominates, since you get whatever's in the first box, plus the consolation prize.
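A minimal sketch of that dominance argument (the prize values here are illustrative assumptions, not taken from the post): whatever probability the CDT agent assigns to Ω having predicted one-boxing, two-boxing comes out ahead by exactly the consolation prize.

```python
# Minimal sketch of the CDT calculation once the incoming arrows are cut:
# the prediction is treated as a fixed background fact with some probability.
BIG_PRIZE = 1_000_000   # illustrative value for the first box
CONSOLATION = 1_000     # illustrative value for the second box

def cdt_expected_utility(action: str, p_predicted_one_box: float) -> float:
    first_box = p_predicted_one_box * BIG_PRIZE
    if action == "one-box":
        return first_box
    return first_box + CONSOLATION  # two-boxing always adds the consolation prize

# Two-boxing dominates for every belief about the (fixed) prediction.
for p in (0.0, 0.5, 0.99):
    assert cdt_expected_utility("two-box", p) > cdt_expected_utility("one-box", p)
```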
ACDT algorithm
The ACDT algorithm is similar, except that when it cuts the causal links to its decision, it also adds potential links from that decision node to all the other nodes in the graph. Then it attempts to figure out which diagram is correct, and then maximises its utility in the CDT way.
Note that ACDT doesn't take a position on what these extra links are - whether they are pointing back in time or are reflecting some more complicated structure (such as the existence of predictors). It just assumes the links could be there, and then works from that.
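Here is a rough sketch of that loop, under my own simplifying assumptions about how the candidate graphs are represented and scored (the post doesn't fix these details): each candidate graph is scored by how well it predicts the observed outcomes of past actions, and the best-scoring graph is then used for an ordinary CDT maximisation.

```python
import math

def acdt_choose(candidate_graphs, history, actions, utility):
    """Pick an action the ACDT way: score candidate graphs on past data,
    then maximise expected utility (CDT-style) inside the best-scoring graph.

    candidate_graphs: objects exposing predict(action, history) -> {outcome: prob}
    history:          list of (action, outcome) pairs observed so far
    actions:          iterable of available actions
    utility:          function from outcome to utility
    """
    def log_likelihood(graph):
        # How well the graph retrodicts the outcomes of the agent's past actions.
        return sum(
            math.log(max(graph.predict(a, history).get(o, 0.0), 1e-9))
            for a, o in history
        )

    best_graph = max(candidate_graphs, key=log_likelihood)

    def expected_utility(action):
        dist = best_graph.predict(action, history)
        return sum(p * utility(outcome) for outcome, p in dist.items())

    return max(actions, key=expected_utility)
```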
In a sense, ACDT can be seen as anterior to CDT. How do we know that causality exists, and the rules it runs on? From our experience in the world. If we lived in a world where the Newcomb problem or the "predictors exist" problem were commonplace, then we'd have a different view of causality.
It might seem gratuitous and wrong to draw extra links coming out of your decision node - but it was also gratuitous and wrong to cut all the links that go into your decision node. Drawing these extra arrows undoes some of the damage, in a way that a CDT agent can understand (they don't understand things that cause their actions, but they do understand consequences of their actions).
ACDT and the Newcomb problem
As well as the standard CDT graph above, ACDT can also consider the following graph, with a link from its decision to Ω's prediction:
It now has to figure out which graph represents the better structure for the situation it finds itself in. If it's encountered the Newcomb problem before, and tried to one-box and two-box a few times, then it knows that the second graph gives more accurate predictions. And so it will one-box, just as well as the TDT family does.
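Numerically, reusing the illustrative prize values from the sketch above and treating the predictor as perfect for simplicity: under the learned graph the prediction tracks the decision, so the CDT-style maximisation now favours one-boxing.

```python
# Expected utilities under the learned graph (prediction follows the decision),
# with a perfect predictor and the same illustrative prize values as above.
eu_one_box = 1_000_000          # Omega predicted one-boxing -> big prize
eu_two_box = 0 + 1_000          # Omega predicted two-boxing -> empty box + consolation
assert eu_one_box > eu_two_box  # so ACDT one-boxes under this graph
```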
Generalising from other agents
If the ACDT agent has not encountered Ω itself, but has seen Ω run the Newcomb problem on other agents, then figuring out the true graph becomes more subtle. UDT and TDT are built from the assumption that equivalent algorithms/agents in equivalent situations will produce equivalent results.
But ACDT, built out of CDT and its solipsistic cutting process, has no such assumptions - at least, not initially. It has to learn that the fate of other, similar agents is evidence about its own graph. Once it learns that generalisation, it can start to learn from the experience of others.
ACDT on other decision problems
Predictors exist
Each round of the "predictors exist" game has a graph similar to the Newcomb problem's, with the addition of a node to repeat the game:
After a few rounds, the ACDT agent will learn that the following graph best represents its situation:
And it will then swiftly choose to leave the game.
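A small sketch of the per-round values involved, using the payoffs from the earlier summary (3 for differing from the prediction, -1 for matching) and assuming that leaving the game is worth 0 (my assumption about the exit option): under the original graph, playing looks worth 1 per round; under the learned graph it is worth -1, so leaving wins.

```python
WIN, LOSE, LEAVE = 3, -1, 0

def per_round_values(learned_predictor_link: bool) -> dict:
    if learned_predictor_link:
        # Near-perfect predictor: whatever the agent plays gets matched.
        return {"play": LOSE, "leave": LEAVE}
    # Original (wrong) graph: the agent thinks it wins half the time.
    return {"play": 0.5 * WIN + 0.5 * LOSE, "leave": LEAVE}

old = per_round_values(learned_predictor_link=False)
new = per_round_values(learned_predictor_link=True)
assert max(old, key=old.get) == "play"    # wrong graph: keeps playing
assert max(new, key=new.get) == "leave"   # learned graph: leaves the game
```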
Prisoner's dilemma with identical copy of itself
If confronted by the prisoner's dilemma with an identical copy of itself, the ACDT agent, though unable to formalise "we are identical", will realise that it and its copy always make the same decision:
And it will then choose to cooperate.
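A tiny sketch with standard (illustrative) prisoner's dilemma payoffs: once ACDT has learned the "same decision" link, the only reachable outcomes are mutual cooperation and mutual defection, and cooperation wins.

```python
# Illustrative payoffs: my payoff for (my action, copy's action).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def acdt_payoff(my_action: str) -> int:
    # The learned link forces the copy's action to equal mine.
    return PAYOFF[(my_action, my_action)]

assert acdt_payoff("C") > acdt_payoff("D")  # so ACDT cooperates
```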
Parfit's hitchhiker
The Parfit's hitchhiker problem is as follows:
The agent is stranded in the desert, and a passing driver will only give it a lift to town if the driver predicts that the agent will pay them on arrival. A standard CDT agent, once safely in town, sees no causal benefit to paying, so an accurate driver-predictor will leave it in the desert.
The ACDT agent, however, will learn the following graph:
And will indeed pay the driver.
XOR blackmail
XOR blackmail is one of my favourite decision problems. The agent receives a letter from a trustworthy predictor, stating: "I know whether or not your house has termites. I have sent you this letter if and only if exactly one of the following is true: your house has termites, or you will pay me $1,000 upon receiving this letter."
The CDT agent will have the following graph:
And the CDT agent will make the simple and correct decision not to pay.
ACDT can eventually reach the same conclusion, but may require more evidence. It also has to consider graphs of the following sort:
The error of evidential decision theory (EDT) is, in effect, to act as if the light green arrow existed: as if the agent could affect the existence of the termites through its decision.
ACDT, if confronted with similar problems often enough, will eventually learn that the light green arrow has no effect, while the dark green one does have an effect (more correctly: the model with the dark green arrow is more accurate, while the light green arrow doesn't add accuracy). It will then refuse to pay, just like the CDT agent does.
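A quick sketch of the kind of evidence involved, using the standard XOR blackmail setup (the letter is sent iff exactly one of "termites" and "you pay" holds) and an illustrative termite probability: the agent's policy makes no difference to how often there are termites (so the light green arrow adds no accuracy), but a large difference to how often the letter arrives (which I take to be what the dark green arrow captures).

```python
import random

P_TERMITES = 0.01  # illustrative assumption

def one_episode(pays_when_blackmailed: bool):
    termites = random.random() < P_TERMITES
    # Letter sent iff exactly one of (termites, agent pays) holds.
    letter = termites != pays_when_blackmailed
    return termites, letter

def rate(pays: bool, index: int, n: int = 100_000) -> float:
    return sum(one_episode(pays)[index] for _ in range(n)) / n

print(rate(True, 0), rate(False, 0))  # termite rate: ~0.01 either way
print(rate(True, 1), rate(False, 1))  # letter rate: ~0.99 vs ~0.01
```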
Note that we might define ACDT as only creating links with its own parent nodes - putting back the links it cut, but in the other direction. In that case it would only consider links with "Your decision algorithm" and "Letter sent", not with "Termites in house?", and would never pay. But note that "Your decision algorithm" is a logical node that might not exist in physical reality; that's why I designed ACDT to allow links to arbitrary nodes, not just the ones that are its ancestors, so it can capture more models of how the world works.
Not UDT: counterfactual mugging
The ACDT agent described above differs from UDT in that it doesn't pay the counterfactual mugger:
Ω flips a fair coin. If the coin comes up tails, Ω asks the agent to pay it a small amount; if the coin comes up heads, Ω gives the agent a large reward, but only if it predicts that the agent would have paid in the tails world.
Non-coincidentally, this problem is difficult to represent in a causal graph. One way of seeing it could be this:
Here the behaviour of the agent in the tails world determines Ω's behaviour in the heads world. It would be tempting to try and extend ACDT by drawing an arrow from that decision node to the Ω node in the heads world.
But that doesn't work, because that decision only happens in the tails world - in the heads world, the agent has no decision to make, so ACDT will do nothing. And in the tails world, the heads world is only counterfactually relevant.
Now ACDT, like EDT, can learn, in some circumstances, to pay the counterfactual mugger. If this scenario happens a lot, then it can note that agents that pay in the tails world get rewarded in the heads world, thus getting something like this:
But that's a bit too much of a hack, even for a hack-y method like this. More natural and proper would be to have the ACDT agent not use its decision as the node to cut-and-add-links from, but its policy (or, as in this post, its code). In that case, the counterfactual mugging can be represented as a graph by the ACDT agent:
Fully acausal trade
The ACDT agent might have issues with fully acausal trade (though, depending on your view, this might be a feature, not a bug).
The reason is that, since the ACDT agent never gets to experience fully acausal trade, it never gets to check whether there is a link between itself and the hypothetical other agents - imagine a Newcomb problem where you never get to see the money (which may be going to a charity you support - but that charity may not exist either), nor to find out whether Ω exists.
If an ACDT agent ever discovered acausal trade, it would have to do so in an incremental fashion. It would first have to become comfortable enough with prediction problems that drawing links to predictors is a natural thing for it to do. It would then have to become comfortable enough with hypothetical arguments being correct that it could generalise to situations where it can never get any empirical evidence.
So whether an ACDT agent ever engages in fully acausal trade, depends on how it generalises from examples.
Neural nets learning to be ACDT
It would be interesting to program a neural net ACDT agent, based on these examples. If anyone is interested in doing so, let me know and go ahead.
Learning graphs and priors over graphs
The ACDT agent is somewhat slow and clunky at learning, needing quite a few examples before it can accept unconventional setups.
If we want it to go faster, we can modify its priors. For example, we can look at what evidence would convince us that an accurate predictor existed, and give the agent a prior that assigns high probability to the corresponding graph, conditional on seeing that evidence.
Or if we want to be closer to UDT, we could formalise statements about algorithms, and about their features and similarities (or formalise mathematical results about proofs, and about how to generalise from known mathematical results). Adding that to the ACDT agent gives an agent much closer to UDT.
So it seems that ACDT + "the correct priors" is close to various different acausal agent designs.
Since FDT is still somewhat undefined, I'm viewing it as TDT-like rather than UDT-like for the moment. ↩︎