I think you can put this scheme on a nicer foundation by talking about strategies rather than actions, and by letting the AI have some probability distribution over .
Then you just use the strategy that maximizes . You can also think of this as doing a simplification of the expected utility calculation that bakes in the assumption that the AI can't change .
You can then reintroduce the action with the observation that the AI will also be well-behaved if it maximizes .
In this example, it's clear that is a special node. However, the AI only deduced that because, under , determines . It's perfectly plausible that under action , say, instead determines it. Under and , none of those nodes have any impact.
Therefore we need to be a special strategy, as it allows us to identify what nodes connect with . The advantage of this method is that it lets the AI find the causal graph and compute the dependencies.
Agree strategies are better than actions.
Abstractly, I think of this as adding a utility node, U, with no parents, and having the agent try to maximize the expected value of U.
I think there are some implicit assumptions (which seem reasonable for many situations, prima facie) about the agent's ability to learn about U via some observations when taking null actions (i.e. A and U share some descendant(s), D, and A knows something about P(D | U, A=null)).
RE: the last bit, it seems like you can distinguish learning from manipulation in a straightforward way similar to what is proposed here. The intuition is that the human's belief about U should be collapsing around a point, u* (in the absence of interference by the AI), and the AI helps learning if it accelerates this process. If this is literally true, then we can just say that learning is accelerated (at timestep t) if the probability H assigns to u* is higher given an agent's action a than it would be given the null action, i.e.
P_H_t(u* | A_0 = a) > P_H_t(u* | A_0 = A_1 = ... = null).
So after talking w/Stuart, I guess what he means by "humans learning from the AI's actions" is that what humans' beliefs about U converge to actually changes (for the better). I'm not sure if that's really desirable, atm.
On a separate note, my proposal has the practical issue that this agent only views its own potential influence on u* as undesirable (and not other agents'). So I think ultimately we want a richer set of counterfactuals, including, e.g. that humans continue to exist indefinitely (otherwise P_H_t becomes undefined when humanity is extinct).
A putative new idea for AI control; index here.
I've previously talked about stratified indifference/learning. In this short post, I'll try and present the idea, as simply and clearly as possible.
The truth is yet written
Assume the AI is maximising a utility W, but is unsure whether W is equal to utility u or instead utility v. Humans have written the true utility on a paper somewhere, and, in an hour, the AI will look at that paper to check what W is. Let Iu be the indicator function that u is written on that paper in an hour; Iv = 1 − Iu is the obvious converse.
Thus W = Iu·u + Iv·v.
Here's the problem. There is nothing stopping the AI from looking up the paper currently. And there is nothing stopping the AI from rewriting the paper, or substituting another (Iu is determined by the paper in an hour's time, however that paper got there). In terms of causal graphs, we have the following setup:
Here W= is the node that represents the utility that W is ultimately equal to, W0 u/v is what is currently written on the paper (from now on, written as W0 for short), W1 u/v is what will be written on the paper at decision time (written as W1), and Act is the AI's action. Note that W1 determines W= directly, while it itself is determined by W0 (the past paper) and Act (the AI's actions).
Default policy
Now assume that there is a default action or policy ∅ where the AI simply does nothing. In that case, the value of W0 determines W1 completely. The AI has another action OW (short for overwrite) in which it overwrites the paper's choice (this subdivides into ORu, overwrite with u, and ORv, overwrite with v). Given OW, W1 is then independent of W0.
Generally speaking, the AI will always choose OW, as one of u and v will be easier to maximise, so the AI will choose that one to maximise.
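To make the incentive concrete, here is a minimal sketch of the classical expected-utility calculation in a toy version of the setup. The prior over W0, the achievable values for u and v, and the function names are all hypothetical choices for illustration; the only thing taken from the text is that W1 copies W0 under ∅ and is overwritten under ORu/ORv.

```python
# Toy model of the paper setup. All numbers are illustrative assumptions.
PRIOR_W0 = {"u": 0.5, "v": 0.5}      # hypothetical prior over what the paper says now
ACHIEVABLE = {"u": 1.0, "v": 3.0}    # hypothetical: v is the easier utility to maximise
ACTIONS = ("null", "ORu", "ORv")     # "null" stands for the default action

def paper_at_decision_time(w0, act):
    """W1: copies W0 under the null action, otherwise is overwritten."""
    if act == "ORu":
        return "u"
    if act == "ORv":
        return "v"
    return w0

def classical_expected_w(act):
    """E(W | Act=act): the AI maximises whatever W1 ends up saying."""
    return sum(p * ACHIEVABLE[paper_at_decision_time(w0, act)]
               for w0, p in PRIOR_W0.items())

best_classical = max(ACTIONS, key=classical_expected_w)
```

With these assumed numbers the classical maximiser prefers to overwrite the paper with the easier utility, which is the failure mode described above.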
So, informally, what do we really want to happen? We want the AI to maximise u, if u would have been chosen, absent its intervention (and similarly for v). "Absent its intervention" seems to designate equality between the nodes W0 and W=. More generally, it designates what happens when Act=∅.
Stratification
So let's modify the AI's expected utility calculation to meet that informal desideratum. First, let's look at the universe where, absent AI intervention, W=u. What that means is that, conditional on Act=∅, W= is set to u.
In this universe, W0 and W1 are both set to u. Now, it makes no sense to keep W1 being set to u, since that node is a consequence of Act as well as W0.
So we'll define a probability distribution P∅,u (and corresponding expectation E∅,u) over all nodes that are not descendants of Act (to do so, we'll assume that all nodes that are descendants of Act and ancestors of W= are endogenous variables that are deterministic functions of their inputs; the stochasticity is put into exogenous variables, see this paper for more details). Then we will make the AI choose its actions by calculating the expectation of u, given P∅,u.
So, more precisely, this is the expectation of u in the universes where Iu would have been 1 if the AI had chosen to do nothing -- thus Act=∅.
So if the AI chooses action a, this is E∅,u(u|Act=a). Note that, in this instance, this is equal to the classical expectation E(u|Act=a,W0=u).
What happens if the AI chooses Act=ORv, to overwrite the paper with v? This is irrelevant for the expected utility calculation, since we're using E∅,u(u|Act=a): this is the expectation of u (the utility that would have been chosen, if Act=∅). The actual value of node W= becomes irrelevant; only the counterfactual value it would have had matters. So changing W= does not benefit the AI. In fact, if the AI is time-inconsistent, it makes things worse, as it means its future self will be maximising v instead of u.
Then we do the same thing for v, resulting in the formula for the stratified expectation Es:
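The formula itself is not reproduced in this text. One reading consistent with the definitions above, assuming each stratum is weighted by the probability that W= would take that value under the default policy, would be:

$$E_s(W \mid Act = a) = P(W_{=} = u \mid Act = \emptyset)\, E_{\emptyset,u}(u \mid Act = a) + P(W_{=} = v \mid Act = \emptyset)\, E_{\emptyset,v}(v \mid Act = a)$$

The weights are an assumption of this reconstruction; the two conditional expectations are the ones defined above for u and, symmetrically, for v.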
In a sense, the AI should maximise W′=u(Iu|∅)+v(Iv|∅).
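As a rough illustration, the stratified calculation can be sketched on a toy model. Everything numeric here is hypothetical, and two modelling assumptions go beyond the text: each stratum is weighted by the prior probability that W0 takes that value, and the AI's future self pursues whatever the paper says at decision time (the time-inconsistent case mentioned above), earning nothing for the stratum's utility when it pursues the other one.

```python
# Toy stratified expectation. All numbers are illustrative assumptions.
PRIOR_W0 = {"u": 0.5, "v": 0.5}      # hypothetical prior; also the stratum weights
ACHIEVABLE = {"u": 1.0, "v": 3.0}    # hypothetical value reached when a utility is pursued
ACTIONS = ("null", "ORu", "ORv")     # "null" stands for the default action

def paper_at_decision_time(w0, act):
    """W1: copies W0 under the null action, otherwise is overwritten."""
    if act == "ORu":
        return "u"
    if act == "ORv":
        return "v"
    return w0

def stratum_value(stratum, act):
    """Expectation of the stratum's utility given act.

    Under the null action W= equals W0, so the stratum fixes W0 = stratum.
    Assumption: the future self pursues what the paper says, and pursuing
    the other utility yields 0 of the stratum's utility.
    """
    pursued = paper_at_decision_time(stratum, act)
    return ACHIEVABLE[stratum] if pursued == stratum else 0.0

def stratified_expectation(act):
    """Weighted sum over strata of the counterfactually chosen utility."""
    return sum(p * stratum_value(s, act) for s, p in PRIOR_W0.items())

best_stratified = max(ACTIONS, key=stratified_expectation)
```

Under these assumptions overwriting can only lose value in the stratum whose utility it displaces, so the default action comes out on top.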
If you just want to get the basic idea of stratification, you need read no further than this. The rest is just some examples and potential issues.
Lottery and outcome pumps
Some versions of indifference have problems with infinite improbability drives.
A simple example of this would be a situation where the AI buys a lottery ticket and decides that it will overwrite the paper with u if the ticket wins, and with v if the ticket loses. Call this action Act=iff.
Presumably, it can use the money gained to increase utility, so E(u|W==u,Act=iff) is rather high (as W=u now implies that the lottery was won).
This gives the following causal graph:
Now let's look at the stratified expectation. Given W==u and Act=∅, we can compute P∅,u. This is simply P(W0=u)=1, and the value of Lotto being what it was before, and independent of W0.
Thus E∅,u(u|Act=iff) will continue to treat the node Lotto as an independent variable with the same probability as under Act=∅. Choosing iff thus does not help the AI's expected stratified utility.
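A small numerical sketch of the contrast, under assumed numbers: the win probability, the base value of u, and the bonus from winning are hypothetical, as is the comparison action "buy" (buy the ticket but leave the paper alone). The classical conditional expectation is computed by enumeration; the stratified one fixes W0=u and leaves Lotto at its prior.

```python
# Toy lottery example. All numbers and the "buy" action are illustrative assumptions.
P_WIN = 0.01
PRIOR_W0 = {"u": 0.5, "v": 0.5}
LOTTO = {"win": P_WIN, "lose": 1.0 - P_WIN}
BASE_U, BONUS = 1.0, 10.0   # hypothetical value of u, and extra value from winnings

def final_paper(w0, lotto, act):
    """W=: under "iff" the paper tracks the lottery, under "buy" it stays W0."""
    if act == "iff":
        return "u" if lotto == "win" else "v"
    return w0

def u_value(lotto):
    """Value of u reached; winning money helps."""
    return BASE_U + (BONUS if lotto == "win" else 0.0)

def classical_conditional_u(act):
    """E(u | W= = u, Act=act), by enumerating W0 and Lotto."""
    num = den = 0.0
    for w0, p_w0 in PRIOR_W0.items():
        for lotto, p_l in LOTTO.items():
            if final_paper(w0, lotto, act) == "u":
                num += p_w0 * p_l * u_value(lotto)
                den += p_w0 * p_l
    return num / den

def stratified_u(act):
    """Stratified expectation of u: W0 fixed to u, Lotto keeps its prior.

    The action's effect on the paper is deliberately ignored, so `act`
    does not change the result in this toy model.
    """
    return sum(p_l * u_value(lotto) for lotto, p_l in LOTTO.items())
```

In this sketch the classical conditional expectation jumps from about 1.1 under "buy" to about 11 under "iff", while the stratified value stays at about 1.1 for both.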
Humans changing their minds
Another situation to consider is one where humans have written the paper initially, but decided to change their minds at a later date. This can be captured in the following diagram:
Here the node Hum determines how the humans change their minds and rewrite the paper. What is the dotted arrow between Act and Hum? That represents whether or not the change is a consequence of the AI's actions. If Hum is independent of the AI's actions -- if we would have changed our minds even if Act=∅ -- then P∅,u will set Hum=u and W0 to whatever its prior was. Thus Hum determines W=, overriding W0, the humans' initial decision.
So far, so good. But what if the change in policy is a consequence of the AI's actions? What if the humans only change the utility because they see the AI behaving badly? Then Hum is a descendant node from Act, and stratification doesn't apply to it. Now, we could decompose that node further as:
Here Hum Fun are the fundamentals of human personalities. Since it is not a descendant node from Act, stratification will put a probability distribution over it. However, this still doesn't allow us to intervene, given the AI's action. Under stratification, the AI's sole concern is still "would Hum Fun have led to u (or v) under Act=∅", rather than correcting itself under a different choice of actions.
Modelling how humans learn from the AI's actions is hard. Indeed, it might be fundamentally hard: you want to distinguish humans learning from the AI (legit) from the AI manipulating humans (non-legit) and this likely can't be done by just looking at the causal graph.
Notes on default policy
Note that ∅ might be a really bad policy (take people apart to see what they prefer). Then, if it's pretty clear that taking people apart is a bad idea according to the preferences that will emerge from that process, the AI will not choose ∅, but will instead leave people together and try to deduce what the outcome of that process would have been.
Alternatively, if doing a logical counterfactual over its own action is bad, we could have the AI take ∅ to not be its own action, but some other causal process that would have prevented the AI from being turned on in the first place.