(A longer text-based version of this post is also available on MIRI's blog here, and the bibliography for the whole sequence can be found here.)


Cross-posting some comments from the MIRI Blog:

Konstantin Surkov:

Re: 5/10 problem
I don't get it. A human is obviously (in that regard) an agent reasoning about his actions. A human will also choose 10 without any difficulty. What in the human decision-making process is not formalizable here? Assuming we agree that 10 is the rational choice.

Abram Demski:

Suppose you know that you take the $10. How do you reason about what would happen if you took the $5 instead? It seems easy if you know how to separate yourself from the world, so that you only think of external consequences (getting $5). If you think about yourself as well, then you run into contradictions when you try to imagine the world where you take the $5, because you know it is not the sort of thing you would do. Maybe you have some absurd predictions about what the world would be like if you took the $5; for example, you imagine that you would have to be blind. That's alright, though, because in the end you are taking the $10, so you're doing fine.
Part of the point is that an agent can be in a similar position, except it is taking the $5, knows it is taking the $5, and is unable to figure out that it should be taking the $10 instead due to the absurd predictions it makes about what happens when it takes the $10. It seems kind of hard for a human to end up in that situation, but it doesn't seem so hard to get this sort of thing when we write down formal reasoners, particularly when we let them reason about themselves fully (as natural parts of the world) rather than only reasoning about the external world or having pre-programmed divisions (so they reason about themselves in a different way from how they reason about the world).

It was not until reading this that I really understood that I am in the habit of reasoning about myself as just a part of the environment.

The kicker is that we don't reason directly about ourselves as such, we use a simplified model of ourselves. And we're REALLY GOOD at using that model for causal reasoning, even when it is reflective, and involves multiple levels of self-reflection and counterfactuals - at least when we bother to try. (We try rarely because explicit modelling is cognitively demanding, and we usually use defaults / conditioned reasoning. Sometimes that's OK.)

Example: It is 10PM. A 5-page report is due in 12 hours, at 10AM.

Default: Go to sleep at 1AM, set alarm for 8AM. Result: Don't finish report tonight, have too little time to do so tomorrow.

Conditioned reasoning: Stay up to finish the report first. 5 hours of work, and stay up until 3AM. Result? Write a bad report, still feel exhausted the next day.

Counterfactual reasoning: I should nap / get some amount of sleep so that I am better able to concentrate, which will outweigh the lost time. I could set my alarm for any amount of time; what amount does my model of myself imply will lead to an optimal well-rested / sufficient time trade-off?

Self-reflection problem, second use of mini-self model: I'm worse at reasoning at 1AM than I am at 10PM. I should decide what to do now, instead of delaying until then. I think going to sleep at 12AM and waking at 3AM gives me enough rest and time to do a good job on the report.

Consider counterfactual and impact: How does this impact the rest of my week's schedule? 3 hours is locally optimal, but I will crash tomorrow and I have a test to study for the next day. Decide to work a bit, go to sleep at 12:30 and set alarm for 5:30AM. Finish the report, turn it in by 10AM, then nap another 2 hours before studying.

We built this model based not only on small samples of our own history, but also on learning from others, incorporating data from seeing other people's experiences. We don't consider staying up all night and then driving to hand in the report, because we realize exhausted driving is dangerous - because we heard stories of people doing so, and know that we would be similarly unsteady. Is a person going to explore and try different strategies by staying up all night and driving? If you die, you can't learn from the experience - so you have good ideas about what parts of the exploration space are safe to try. You might use Adderall because it's been tried before and is relatively safe, but you don't ingest arbitrary drugs to see if they help you think.

BUT an AI doesn't (at first) have that sample data to reason from, nor does a singleton have observation of other near-structurally identical AI systems and the impacts of their decisions, nor does it have a fundamental understanding about what is safe to explore.

I don't quite follow why 5/10 example presents a problem.

Conditionals with false antecedents seem nonsensical from the perspective of natural language, but why is this a problem for the formal agent? Since the algorithm as presented doesn't actually try to maximize utility, everything seems to be alright. In particular, there are 4 valid assignments: 

The algorithm doesn't try to select an assignment with the largest U(), but rather just outputs 5 if there's a valid assignment with x > y, and 10 otherwise. Only […] fulfills the condition, so it outputs 5. […] and […] also seem nonsensical because of false antecedents but with attached utility 10 - would that be a problem too?

For this particular problem, you could get rid of assignments with nonsensical values by also considering an algorithm with reversed outputs and then taking the intersection of valid assignments, since only  satisfies both algorithms. 

Hmm. I'm not following. It seems like you follow the chain of reasoning and agree with the conclusion:

The algorithm doesn't try to select an assignment with the largest U(), but rather just outputs 5 if there's a valid assignment with x > y, and 10 otherwise. Only […] fulfills the condition, so it outputs 5.

This is exactly the point: it outputs 5. That's bad! But the agent as written will look perfectly reasonable to anyone who has not thought about the spurious proof problem. So, we want general tools to avoid this kind of thing. For the case of proof-based agents, we have a pretty good tool, namely MUDT (the strategy of looking for the highest-utility such proof rather than any such proof). (However, this falls prey to the Troll Bridge problem, which looks pretty bad.)
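To make the worry concrete, here is a minimal sketch that checks which actions are self-consistent when the agent's sentences are read as material conditionals. It is a truth-table check only: it does not model proof search or Löb's theorem. The names `true_pairs` and `agent_output`, and the range {0, 5, 10} for x and y, are assumptions made for illustration.

```python
from itertools import product

ACTIONS = (5, 10)
UTILITIES = (0, 5, 10)  # assumed range for x and y


def utility(action):
    # Environment: you get the bill you take.
    return action


def implies(p, q):
    # Material conditional: false only when p is true and q is false.
    return (not p) or q


def true_pairs(action):
    """(x, y) pairs for which both "A()=5 -> U()=x" and "A()=10 -> U()=y"
    hold as material conditionals, given that the agent takes `action`."""
    u = utility(action)
    return [
        (x, y)
        for x, y in product(UTILITIES, repeat=2)
        if implies(action == 5, u == x) and implies(action == 10, u == y)
    ]


def agent_output(pairs):
    # Agent as described in the thread: 5 if some pair has x > y, else 10.
    return 5 if any(x > y for x, y in pairs) else 10


# An action is self-consistent if the agent, reasoning from the sentences
# that are true when it takes that action, would indeed take that action.
self_consistent = [a for a in ACTIONS if agent_output(true_pairs(a)) == a]
print(self_consistent)  # [5, 10]
```

Under these assumptions, taking 5 is just as self-consistent as taking 10: when the agent takes 5, the sentence pair with x = 5 and y = 0 is vacuously true in its second half, and that is enough to make the agent take 5.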

Conditionals with false antecedents seem nonsensical from the perspective of natural language, but why is this a problem for the formal agent?

More generally, the problem is that for formal agents, false antecedents cause nonsensical reasoning. EG, for the material conditional (the usual logical version of conditionals), everything is true when reasoning from a false antecedent. For Bayesian conditionals (the usual probabilistic version of conditionals), probability zero events don't even have conditionals (so you aren't allowed to ask what follows from them).

Yet, we reason informally from false antecedents all the time, EG thinking about what would happen if 

So, false antecedents cause greater problems for formal agents than for natural language.
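The two formal conditionals mentioned above can be sketched in a few lines. This is only an illustration under assumed names (`implies`, `conditional`) and an assumed toy distribution in which the agent takes the $10 with certainty.

```python
from fractions import Fraction


def implies(p, q):
    # Material conditional: true whenever the antecedent is false.
    return (not p) or q


def conditional(dist, event, given):
    """P(event | given) for a finite distribution {outcome: probability}.
    Returns None when P(given) is zero, since the conditional is undefined."""
    p_given = sum(p for o, p in dist.items() if given(o))
    if p_given == 0:
        return None
    p_both = sum(p for o, p in dist.items() if given(o) and event(o))
    return p_both / p_given


# Toy world: outcomes are (action, utility); the agent takes 10 for sure.
dist = {(10, 10): Fraction(1), (5, 5): Fraction(0)}
takes_5 = False  # the antecedent "I take 5" is false in this world

# From a false antecedent, every consequent comes out true.
print(implies(takes_5, "utility is 0"))          # True
print(implies(takes_5, "utility is 1000000"))    # True

# Conditioning on a probability-zero event gives no answer at all.
print(conditional(dist, lambda o: o[1] == 5, lambda o: o[0] == 5))    # None
print(conditional(dist, lambda o: o[1] == 10, lambda o: o[0] == 10))  # 1
```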

For this particular problem, you could get rid of assignments with nonsensical values by also considering an algorithm with reversed outputs and then taking the intersection of valid assignments, since only  satisfies both algorithms. 

The problem is also "solved" if the agent thinks only about the environment, ignoring its knowledge about its own source code. So if the agent can form an agent-environment boundary (a "cartesian boundary") then the problem is already solved, no need to try reversed outputs.

The point here is to do decision theory without such a boundary. The agent just approaches problems with all of its knowledge, not differentiating between "itself" and "the environment".

While I agree that the algorithm might output 5, I don't share the intuition that it's something that wasn't 'supposed' to happen, so I'm not sure what problem it was meant to demonstrate. I thought of a few ways to interpret it, but I'm not sure which one, if any, was the intended interpretation:

a) The algorithm is defined to compute argmax, but it doesn't output argmax because of false antecedents. 

- but I would say that it's not actually defined to compute argmax, therefore the fact that it doesn't output argmax is not a problem.

b) Regardless of the output, the algorithm uses reasoning from false antecedents, which seems nonsensical from the perspective of someone who uses intuitive conditionals, which impedes its reasoning.

- it may indeed seem nonsensical, but if 'seeming nonsensical' doesn't actually impede its ability to select actions with the highest utility (when it's actually defined to compute argmax), then I would say that it's also not a problem. Furthermore, wouldn't MUDT be perfectly satisfied with the tuple […]? It also uses 'nonsensical' reasoning 'A()=5 => U()=0' but still outputs the action with the highest utility.

c) Even when the use of false antecedents doesn't impede its reasoning, the way it arrives at its conclusions is counterintuitive to humans, which means that we're more likely to make a catastrophic mistake when reasoning about how the agent reasons.

- Maybe? I don't have access to other people's intuitions, but when I read the example, I didn't have any intuitive feeling of what the algorithm would do, so instead I just calculated all assignments , eliminated all inconsistent ones and proceeded from there. And this issue wouldn't be unique to false antecedents, there are other perfectly valid pieces of logic that might nonetheless seem counterintuitive to humans, for example the puzzle with islanders and blue eyes.


Yet, we reason informally from false antecedents all the time, EG thinking about what would happen if

When I try to examine my own reasoning, I find that when I do so, I'm just selectively blind to certain details and so don't notice any problems. For example: suppose the environment calculates "U=10 if action = A; U=0 if action = B" and I, being a utility maximizer, am deciding between actions A and B. Then I might imagine something like "I chose A and got 10 utils", and "I chose B and got 0 utils" - ergo, I should choose A. 

But actually, if I had thought deeper about the second case, I would also think "hm, because I'm determined to choose the action with highest reward I would not choose B. And yet I chose B. This is logically impossible! OH NO THIS TIMELINE IS INCONSISTENT!" - so I couldn't actually coherently reason about what could happen if I chose B. And yet, I would still be left with the only consistent timeline where I choose A, which I would promptly follow, and get my maximum of 10 utils. 

The problem is also "solved" if the agent thinks only about the environment, ignoring its knowledge about its own source code.

The idea with reversing the outputs and taking the assignment that is valid for both versions of the algorithm seemed to me to be closer to the notion "but what would actually happen if you actually acted differently", i.e. avoiding seemingly nonsensical reasoning while preserving self-reflection. But I'm not sure when, if ever, this principle can be generalized. 

While I agree that the algorithm might output 5, I don't share the intuition that it's something that wasn't 'supposed' to happen, so I'm not sure what problem it was meant to demonstrate.

OK, this makes sense to me. Instead of your (A) and (B), I would offer the following two useful interpretations:

1: From a design perspective, the algorithm chooses 5 when 10 is better. I'm not saying it has "computed argmax incorrectly" (as in your A); an agent design isn't supposed to compute argmax (argmax would be insufficient to solve this problem, because we're not given the problem in the format of a function from our actions to scores), but it is supposed to "do well". The usefulness of the argument rests on the weight of "someone might code an agent like this by accident, if they're not familiar with spurious proofs". Indeed, that's the origin of this code snippet -- something like this was seriously proposed at some point.

2: From a descriptive perspective, the code snippet is not a very good description of how humans would reason about a situation like this (for all the same reasons).

When I try to examine my own reasoning, I find that when I do so, I'm just selectively blind to certain details and so don't notice any problems. For example: suppose the environment calculates "U=10 if action = A; U=0 if action = B" and I, being a utility maximizer, am deciding between actions A and B. Then I might imagine something like "I chose A and got 10 utils", and "I chose B and got 0 utils" - ergo, I should choose A. 

Right, this makes sense to me, and is an intuition which I think many people share. The problem, then, is to formalize how to be "selectively blind" in an appropriate way such that you reliably get good results.

More generally, the problem is that for formal agents, false antecedents cause nonsensical reasoning

No, it's contradictory assumptions. False but consistent assumptions are dual to consistent-and-true assumptions...so you can only infer a mutually consistent set of propositions from either.

To put it another way, a formal system has no way of knowing what would be true or false for reasons outside itself, so it has no way of reacting to a merely false statement. But a contradiction is definable within a formal system.

To put it yet another way... contradiction in, contradiction out.

Yep, agreed. I used the language "false antecedents" mainly because I was copying the language in the comment I replied to, but I really had in mind "demonstrably false antecedents".

Nitpick: in the final sentence of the first paragraph, shouldn't it end with "including "?

I'm not convinced that an inconsequential grain of uncertainty couldn't handle this 5-10 problem. Consider an agent whose actions are probability distributions on {5,10} that are nowhere 0. We can call these points in the open affine space spanned by the points 5 and 10. U is then a linear function from this affine space to utilities. The agent would search for proofs that U is some particular such linear function. Once it finds one, it uses that linear function to compute the optimal action. To ensure that there is an optimum, we can adjoin infinitesimal values to the possible probabilities and utilities.

If the agent were to find a proof that the linear function is the one induced by mapping 5 to 5 and 10 to 0, it would return (1-ε)⋅5+ε⋅10 and get utility 5+5ε instead of the expected 5-5ε, so Löb's theorem wouldn't make this self-fulfilling.
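The arithmetic in the paragraph above can be checked with a small sketch. The thread's ε is an infinitesimal; here a small positive rational stands in for it, which is an assumption made purely so the numbers can be computed. The function name `linear_utility` is likewise illustrative.

```python
from fractions import Fraction


def linear_utility(value_at_5, value_at_10, p10):
    """Linear extension of a utility assignment on {5, 10} to the mixed
    action that takes 10 with probability p10 and 5 otherwise."""
    return (1 - p10) * value_at_5 + p10 * value_at_10


eps = Fraction(1, 1000)  # stand-in for the infinitesimal ε (assumption)

# Spurious belief: taking 5 yields 5 and taking 10 yields 0.
# Under that belief the best nowhere-zero mix puts only eps on taking 10.
believed = linear_utility(5, 0, eps)

# Actual environment: taking 5 yields 5 and taking 10 yields 10.
actual = linear_utility(5, 10, eps)

print(believed)  # 999/200, i.e. 5 - 5*eps
print(actual)    # 1001/200, i.e. 5 + 5*eps
```

The observed utility 5 + 5ε differs from the predicted 5 - 5ε, which is the sense in which the spurious belief is not self-fulfilling in this setup.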

So, your suggestion is not just an inconsequential grain of uncertainty, it is a grain of exploration. The agent actually does take 10 with some small probability. If you try to do this with just uncertainty, things would be worse, since that uncertainty would not be justified.

One problem is that you actually do explore a bunch, and since you don't get a reset button, you will sometimes explore into irreversible actions, like shutting yourself off. However, if the agent has a source of randomness, and also the ability to simulate worlds in which that randomness went another way, you can have an agent that with probability 1 does not explore ever, and learns from the other worlds in which it does explore. So, you can either explore forever, and shut yourself off, or you can explore very very rarely and learn from other possible worlds.

The problem with learning from other possible worlds is that, to get good results out of it, you have to assume that the environment does not also learn from other possible worlds, which is not very embedded.

But you are suggesting actually exploring a bunch, and there is a problem other than just shutting yourself off. You are getting past this problem in this case by only allowing linear functions, but that is not an accurate assumption. Let's say you are playing matching pennies with Omega, who has the ability to predict what probability you will pick but not what action you will pick.

(In matching pennies, you each choose H or T, you win if they match, they win if they don't.)

Omega will pick H if your probability of H is less than 1/2 and T otherwise. Your utility as a function of probability is piecewise linear with two parts. Trying to assume that it will be linear will make things messy.
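A minimal sketch of that piecewise-linear utility, assuming utility is simply the probability of winning (1 for a match, 0 otherwise). The name `win_probability` is illustrative.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def win_probability(p_heads):
    """Chance of matching Omega, who sees p_heads but not the coin flip.
    Omega plays H when p_heads < 1/2 and T otherwise."""
    omega_plays_heads = p_heads < HALF
    # You win when your coin matches Omega's choice.
    return p_heads if omega_plays_heads else 1 - p_heads


a, b = Fraction(1, 4), Fraction(3, 4)
midpoint = (a + b) / 2

print(win_probability(a))         # 1/4
print(win_probability(b))         # 1/4
print(win_probability(midpoint))  # 1/2

# A linear function would give the average of the endpoint values at the
# midpoint. Here it does not, so the utility is not linear in p_heads.
is_linear_here = win_probability(midpoint) == (win_probability(a) + win_probability(b)) / 2
print(is_linear_here)  # False
```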

There is this problem where sometimes the outcome of exploring into taking 10, and the outcome of actually taking 10 because it is good are different. More on this here.

I am talking about the surreal number ε, which is smaller than any positive real. Events of likelihood ε do not actually happen, we just keep them around so the counterfactual reasoning does not divide by 0.

Within the simulation, the AI might be able to conclude that it just made an ε-likelihood decision and must therefore be in a counterfactual simulation. It should of course carry on as it were, in order to help the simulating version of itself.

Why shouldn't the environment be learning?

To the Omega scenario I would say that since we have an Omega-proof random number generator, we get new strategic options that should be included in the available actions. The linear function then goes from the ε-adjoined open affine space generated by {Pick H with probability p | p real, non-negative and at most 1} to the ε-adjoined utilities, and we correctly solve Omega's problem by using p=1/2.

Yeah, so it's like you have this private data, which is an infinite sequence of bits, and if you see all 0's you take an exploration action. I think that by giving the agent these private bits and promising that the bits do not change the rest of the world, you are essentially giving the agent access to a causal counterfactual that you constructed. You don't even have to mix with what the agent actually does, you can explore with every action and ask if it is better to explore and take 5 or explore and take 10. By doing this, you are essentially giving the agent access to a causal counterfactual, because conditioning on these infinitesimals is basically like coming in and changing what the agent does. I think giving the agent a true source of randomness actually does let you implement CDT.

If the environment learns from the other possible worlds, it might punish or reward you in one world for stuff that you do in the other world, so you can't just ask which world is best to figure out what to do.

I agree that that is how you want to think about the matching pennies problem. However the point is that your proposed solution assumed linearity. It didn't empirically observe linearity. You have to be able to tell the difference between the situations in order to know not to assume linearity in the matching pennies problem. The method for telling the difference is how you determine whether or not and in what ways you have logical control over Omega's prediction of you.

I posit that linearity always holds. In a deterministic universe, the linear function is between the ε-adjoined open affine space generated by our primitive set of actions and the ε-adjoined utilities. (Like in my first comment.)

In a probabilistic universe, the linear function is between the ε-adjoined open affine space generated by (the set of points in) the closed affine space generated by our primitive set of actions and the ε-adjoined utilities. (Like in my second comment.)

I got from one of your comments that assuming linearity wards off some problem. Does it come back in the probabilistic-universe case?

My point was that I don't know where to assume the linearity is. Whenever I have private randomness, I have linearity over what I end up choosing with that randomness, but not linearity over what probability I choose. But I think this is not getting at the disagreement, so I pivot to:

In your model, what does it mean to prove that U is some linear affine function? If I prove that my probability p is 1/2 and that U=7.5, have I proven that U is the constant function 7.5? If there is only one value of p, it is not defined what the utility function is, unless I successfully carve the universe in such a way as to let me replace the action with various things and see what happens (or, assuming linearity, replace the probability with enough linearly independent things (in this case 2) to define the function).

In the matching pennies game, would be proven to be . A could maximize this by returning ε when isn't , and (where ε is so small that this is still infinitesimally close to 1) when is .

The linearity is always in the function between ε-adjoined open affine spaces. Whether the utilities also end up linear in the closed affine space (ie nobody cares about our reasoning process) is for the object-level information gathering process to deduce from the environment.

You never prove that you will with certainty decide . You always leave a so-you're-saying-there's-a chance of exploration, which produces a grain of uncertainty. To execute the action, you inspect the ceremonial Boltzmann Bit (which is implemented by being constantly set to "discard the ε"), but which you treat as having an ε chance of flipping.

The self-modification module could note that inspecting that bit is a no-op, see that removing it would make the counterfactual reasoning module crash, and leave up the Chesterton fence.

But how do you avoid proving with certainty that p=1/2?

Since your proposal does not say what to do if you find inconsistent proofs that the linear function is two different things, I will assume that if it finds multiple different proofs, it defaults to 5 for the following.

Here is another example:

You are in a 5 and 10 problem. You have a twin that is also in a 5 and 10 problem. You have exactly the same source code. There is a consistency checker, and if you and your twin do different things, you both get 0 utility.

You can prove that you and your twin do the same thing. Thus you can prove that the function is 5+5p. You can also prove that your twin takes 5 by Löb's theorem. (You can also prove that you take 5 by Löb's theorem, but you ignore that proof, since "there is always a chance".) Thus, you can prove that the function is 5-5p. Your system doesn't know what to do with two functions, so it defaults to 5. (If it is provable that you both take 5, you both take 5, completing the proof by Löb.)

I am doing the same thing as before, but because I put it outside of the agent, it does not get flagged with the "there is always a chance" module. This is trying to illustrate that your proposal takes advantage of a separation between the agent and the environment that was snuck in, and could be done incorrectly.

Two possible fixes:

1) You could say that the agent, instead of taking 5 when finding an inconsistency, takes some action that exhibits the inconsistency (something on which the two functions give different values). This is very similar to the chicken rule, and if you add something like this, you don't really need the rest of your system. Take an agent that, whenever it proves it does something, does something else. This agent will prove (given enough time) that if it takes 5 it gets 5, and if it takes 10 it gets 10.

2) I had one proof system, and just ignored the proofs that I found that I did a thing. I could instead give the agent a special proof system that is incapable of proving what it does, but how do you do that? Chicken rule seems like the place to start.

One problem with the chicken rule is that it was developed in a system that was deductively closed, so you can't prove something that passes through a proof of P without proving P. If you violate this by having a random theorem prover, you might have a system that fails to prove "I take 5" but proves "I take 5 and 1+1=2" and uses this to complete the Löb loop.

I can't prove what I'm going to do and I can't prove that I and the twin are going to do the same thing, because of the Boltzmann Bits in both of our decision-makers that might turn out different ways. But I can prove that we have a chance of doing the same thing, and my expected utility is , rounding to once it actually happens.

I think I'm missing something with the Löb's theorem example.

If  can be proved under the theorem, then can't  also be proved?  What's the cause of the asymmetry that privileges taking $5 in all scenarios where you're allowed to search for proofs for a long time?

Agreed. The asymmetry needs to come from the source code for the agent.

In the simple version I gave, the asymmetry comes from the fact that the agent checks for a proof that x>y before checking for a proof that y>x. If this were reversed, then as you said, the Löbian reasoning would make the agent take the 10, instead of the 5.

In a less simple version, this could be implicit in the proof search procedure. For example, the agent could wait for any proof of the conclusion x>y or y>x, and make a decision based on whichever happened first. Then there would not be an obvious asymmetry. Yet, the proof search has to go in some order. So the agent design will introduce an asymmetry in one direction or the other. And when building theorem provers, you're not usually thinking about what influence the proof order might have on which theorems are actually true; you usually think of the proofs as this static thing which you're searching through. So it would be easy to mistakenly use a theorem prover which just so happens to favor 5 over 10 in the proof search.
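As a toy illustration of that order sensitivity, here is a sketch of an agent that acts on whichever decisive result its search surfaces first. It does not model provability or Löb's theorem; it only shows that a first-match search makes the action depend on enumeration order. The function name `first_decisive_action`, the default of taking 10, and the two candidate pairs are assumptions made for illustration, with each pair (x, y) read as "taking 5 yields x and taking 10 yields y".

```python
def first_decisive_action(proof_stream):
    """Act on the first result establishing x > y (take 5) or y > x (take 10).
    Falls back to 10 if the search ends without a decisive result."""
    for x, y in proof_stream:
        if x > y:
            return 5
        if y > x:
            return 10
    return 10


# Same two candidate results, enumerated in opposite orders.
spurious = (5, 0)   # "taking 5 yields 5, taking 10 yields 0"
intended = (0, 10)  # "taking 5 yields 0, taking 10 yields 10"

print(first_decisive_action([spurious, intended]))  # 5
print(first_decisive_action([intended, spurious]))  # 10
```

Nothing in the agent's code looks biased toward 5, yet the outcome is fixed entirely by which result the search happens to reach first.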