A comment I made on social media last year about why MIRI cares about making progress on decision theory:


We aren't working on decision theory in order to make sure that AGI systems are decision-theoretic, whatever that would involve. We're working on decision theory because there's a cluster of confusing issues here (e.g., counterfactuals, updatelessness, coordination) that represent a lot of holes or anomalies in our current best understanding of what high-quality reasoning is and how it works.

As an analogy: it might be possible to build a probabilistic reasoner without having a working understanding of classical probability theory, through sufficient trial and error. (Evolution "built" humans without understanding probability theory.) But you'd fundamentally be flying blind when it comes to designing the system — to a large extent, you couldn't predict in advance which classes of design were likely to be most promising to consider, couldn't look at particular proposed designs and make good advance predictions about safety/capability properties of the corresponding system, couldn't identify and address the root causes of problems that crop up, etc.

The idea behind looking at (e.g.) counterfactual reasoning is that counterfactual reasoning is central to what we're talking about when we talk about "AGI," and going into the development process without a decent understanding of what counterfactual reasoning is and how it works means you'll be flying blind, to a significantly greater extent, when it comes to designing, inspecting, repairing, and otherwise working with your system. The goal is to put AGI developers in a position where they can make advance plans and predictions, shoot for narrow design targets, and understand what they're doing well enough to avoid the kinds of kludgey, opaque, non-modular approaches that aren't really compatible with how secure or robust software is developed.

Nate's way of articulating it:

The reason why I care about logical uncertainty and decision theory problems is something more like this: The whole AI problem can be thought of as a particular logical uncertainty problem, namely, the problem of taking a certain function f : Q → ℝ and finding an input that makes the output large. To see this, let f be the function that takes the AI agent’s next action (encoded in Q) and determines how "good" the universe is if the agent takes that action. The reason we need a principled theory of logical uncertainty is so that we can do function optimization, and the reason we need a principled decision theory is so we can pick the right version of the "if the AI system takes that action..." function.

The work you use to get to AGI presumably won't look like probability theory, but it's still the case that you're building a system to do probabilistic reasoning, and understanding what probabilistic reasoning is is likely to be very valuable for doing that without relying on brute force and trial-and-error. Similarly, the work that goes into figuring out how to design a rocket, actually building one, etc. doesn't look very much like the work that goes into figuring out that there's a universal force of gravity that operates by an inverse square law; but you'll have a vastly easier time approaching the rocket-building problem with foresight and an understanding of what you're doing if you have a mental model of gravitation already in hand.

In pretty much the same way, developing an understanding of roughly what counterfactuals are and how they work won't get you to AGI, and the work of implementing an AGI design won't look like decision theory, but you want to have in mind an understanding of what "AGI-style reasoning" is (including "what probabilistic reasoning about empirical propositions is" but also "what counterfactual reasoning is", "what probabilistic reasoning about mathematical propositions is", etc.), and very roughly how/why it works, before you start making effectively irreversible design decisions.
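To make Nate's f : Q → ℝ framing concrete, here is a minimal sketch in Python. The action encoding and the "goodness" numbers are invented for illustration; the hard parts Nate points at (reasoning under logical uncertainty about f, and pinning down what "if the AI system takes that action" means) are exactly what this toy version glosses over.

```python
# Toy rendering of "the AI problem as optimizing f : Q -> R": enumerate
# candidate next actions (elements of Q) and pick one whose predicted
# "goodness of the universe" is largest. Everything here is a placeholder.

def goodness(action: str) -> float:
    """Stand-in for f: how good the universe is if the agent takes `action`."""
    toy_scores = {"do_nothing": 0.0, "fetch_coffee": 1.0, "break_vase": -2.0}
    return toy_scores[action]

def choose_action(candidate_actions):
    # Function optimization: find an input to f with a large output.
    return max(candidate_actions, key=goodness)

print(choose_action(["do_nothing", "fetch_coffee", "break_vase"]))  # fetch_coffee
```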


Eliezer adds:

I do also remark that there are multiple fixpoints in decision theory. CDT does not evolve into FDT but into a weirder system Son-of-CDT. So, as with utility functions, there are bits we want that the AI does not necessarily generate from self-improvement or local competence gains.


I don't suppose you could clarify what the unresolved issues in decision theory are. What are the biggest issues that haven't been solved for UDT or FDT? What is a co-ordination problem that hasn't been solved? And what still isn't known about counterfactuals?

What are the biggest issues that haven't been solved for UDT or FDT?

UDT was a fairly simple and workable idea in classical Bayesian settings with logical omniscience (or with some simple logical uncertainty treated as if it were empirical uncertainty), but it was always intended to utilize logical uncertainty at its core. Logical induction, our current-best theory of logical uncertainty, doesn't turn out to work very well with UDT so far. The basic problem seems to be that UDT required "updates" to be represented in a fairly explicit way: you have a prior which already contains all the potential things you can learn, and an update is just selecting certain possibilities. Logical induction, in contrast, starts out "really ignorant" and adds structure, not just content, to its beliefs over time. Optimizing via the early beliefs doesn't look like a very good option, as a result.
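To illustrate the kind of "update" UDT wants, here is a minimal sketch (my own toy example): a prior that explicitly enumerates every possible world, where updating is just selecting the worlds consistent with an observation and renormalizing. Logical induction doesn't fit this mold, since its early beliefs don't yet contain the structure its later beliefs will have.

```python
# Toy "explicit prior": every possible world is enumerated up front, and an
# update just selects the worlds consistent with the evidence and renormalizes.
# The worlds and numbers are invented for illustration.

prior = {
    ("rain", "wet_grass"): 0.375,
    ("rain", "dry_grass"): 0.0,
    ("no_rain", "wet_grass"): 0.125,
    ("no_rain", "dry_grass"): 0.5,
}

def update(prior, consistent_with):
    """Condition by selecting possibilities, then renormalizing."""
    selected = {w: p for w, p in prior.items() if consistent_with(w)}
    total = sum(selected.values())
    return {w: p / total for w, p in selected.items()}

posterior = update(prior, lambda world: world[1] == "wet_grass")
print(posterior)  # {('rain', 'wet_grass'): 0.75, ('no_rain', 'wet_grass'): 0.25}
```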

FDT requires a notion of logical causality, which hasn't appeared yet.

What is a co-ordination problem that hasn't been solved?

Taking logical uncertainty into account, all games become iterated games in a significant sense, because players can reason about each other by looking at what happens in very close situations. If the players have T seconds to think, they can simulate the same game but given t<<T time to think, for many t. So, they can learn from the sequence of "smaller" games.

This might seem like a good thing. For example, the single-shot prisoner's dilemma has only the Nash equilibrium of mutual defection, while iterated play has cooperative equilibria, such as tit-for-tat.

Unfortunately, the folk theorem of game theory implies that there are a whole lot of fairly bad equilibria for iterated games as well. It is possible that each player enforces a cooperative equilibrium via tit-for-tat-like strategies. However, it is just as possible for players to end up in a mutual blackmail double bind, as follows:

Both players initially have some suspicion that the other player is following strategy X: "cooperate 1% of the time if and only if the other player is playing consistently with strategy X; otherwise, defect 100% of the time." As a result of this suspicion, both players play via strategy X in order to get the 1% cooperation rather than 0%.
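A rough payoff check (with standard prisoner's-dilemma numbers I picked for illustration, and assuming a deviator simply defects and is detected) shows why neither player unilaterally abandons strategy X once both suspect the other is playing it:

```python
# Toy payoff check for the "mutual blackmail" equilibrium.
# PAYOFF maps (my_move, their_move) -> my per-round payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def expected_payoff(p_me_cooperate, p_them_cooperate):
    """Expected payoff when each side independently cooperates with the given probability."""
    total = 0.0
    for my_move, p_mine in [("C", p_me_cooperate), ("D", 1 - p_me_cooperate)]:
        for their_move, p_theirs in [("C", p_them_cooperate), ("D", 1 - p_them_cooperate)]:
            total += p_mine * p_theirs * PAYOFF[(my_move, their_move)]
    return total

print(expected_payoff(0.01, 0.01))  # both follow X: ~1.03 per round
print(expected_payoff(0.00, 0.00))  # I deviate, they punish with 100% defection: 1.0
```

So each player does slightly better by conforming to X than by deviating, even though both would vastly prefer a tit-for-tat-style cooperative equilibrium.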

Ridiculously bad "coordination" like that can be avoided via cooperative oracles, but that requires everyone to somehow have access to such a thing. Distributed oracles are more realistic in that each player can compute them just by reasoning about the others, but players using distributed oracles can be exploited.

So, how do you avoid supremely bad coordination in a way which isn't too badly exploitable?

And what still isn't known about counterfactuals?

The problem of specifying good counterfactuals sort of wraps up any and all other problems of decision theory into itself, which makes this a bit hard to answer. Different potential decision theories may lean more or less heavily on the counterfactuals. If you lean toward EDT-like decision theories, the problem of counterfactuals is mostly just the problem of making UDT-like solutions work. For CDT-like decision theories, it is the other way around: the problem of getting UDT to work is mostly the problem of getting the right counterfactuals!

The mutual-blackmail problem I mentioned in my "coordination" answer is a good motivating example. How do you ensure that the agents don't come to think "I have to play strategy X, because if I don't, the other player will cooperate 0% of the time?"

My basic opinion is that this makes sense and agrees with my model. On the other hand, I'm not planning to switch to working on decision theory now, so perhaps I should say why. Partly it's that I have a comparative advantage at ML work, but it's also an impression that Agent Foundations will not help much with the first powerful AI systems we build. On one axis, I wouldn't be surprised if the first powerful AI systems don't look like the good reasoners that MIRI studies, and so Agent Foundations research won't apply. On another axis, Agent Foundations seems like a hard problem that we may not solve before powerful AI systems are created.

I do find it plausible that to build _aligned_ AI systems that are _much_ more powerful than humans, we need something like an Agent-Foundations level of understanding. (Though I also find the opposite statement plausible.) However, I think we will first build powerful AI systems that are not that much more powerful than humans, and that direct alignment of ML techniques will be sufficient to make that safe (even though they do pose an x-risk). (I suspect this is where my main disagreement with people at MIRI is.) We can then use those systems to help us solve Agent Foundations before we scale up.

I'm not sure what it means for this work to "not apply" to particular systems. It seems like the claim is that decision theory is a way to understand AI systems in general and reason about what they will do, just as we use other theoretical tools to understand current ML systems. Can you spell this out a bit more? (Note that I'm also not really sure what it means for decision theory to apply to all AI systems: I can imagine kludgy systems where it seems really hard in some sense to understand their behavior with decision theory, but I'm not confident at all)

I claim (with some confidence) that Updateless Decision Theory and Logical Induction don't have much to do with understanding AlphaGo or OpenAI Five, and you are better off understanding those systems using standard AI/ML thinking.

I further claim (with less confidence) that in a similar way, at the time that we build our first powerful AI systems, the results of Agent Foundations research at that time won't have much to do with understanding those powerful AI systems.

Does that explain what it means? And if so, do you disagree with either of the claims?

I claim (with some confidence) that Updateless Decision Theory and Logical Induction don't have much to do with understanding AlphaGo or OpenAI Five, and you are better off understanding those systems using standard AI/ML thinking.

Eh, this is true, but it's also true that causal decision theory, game theory, and probability theory have a lot to do with understanding how to build AlphaZero or OpenAI Five (and by extension, with those systems themselves). I think the relevant question here must be whether you think the embedded agency program can succeed as much as the classical decision theory/probability theory program did, and whether, conditional on that success, it can be as influential (probably with a shorter lag between the program succeeding and wanting to influence AI development).

Yeah, my second claim is intended to include that scenario as well. That is, if embedded agency succeeded and significantly influenced the development of the first powerful AI systems, I would consider my second claim to be false.

This scenario (of embedded agency influencing AI development) would surprise me conditional on short timelines. Conditional on long timelines, I'm not sure, and would want to think about it more.

Note also that in a world where you can't build powerful AI without Agent Foundations, it's not a big loss if you don't work on Agent Foundations right now. The worry is in a world where you can build powerful AI without Agent Foundations, but it leads to catastrophe. I'm focusing on the worlds in which that is true and in which powerful AI is developed soon.

That is all sensible, I was just slightly annoyed by what I read as an implication that "AlphaGo doesn't use UDT therefore advanced AI won't" or something.

I agree with both your claims, but maybe with less confidence than you (I also agree with DanielFilan's point below).

Here are two places I can imagine MIRI's intuitions here coming from, and I'm interested in your thoughts on them:

(1) The "idealized reasoner is analogous to a Carnot engine" argument. It seems like you think advanced AI systems will be importantly disanalogous to this idea, and that's not obvious to me.

(2) 'We might care about expected utility maximization / theoretical rationality because there is an important sense in which you are less capable / dumber / irrational if e.g. you are susceptible to money pumps. So advanced agents, since they are advanced, will act closer to ideal agents.'

(I don't have much time to comment so sorry if the above is confusing)

(1) I am unsure whether there exists an idealized reasoner analogous to a Carnot engine (see Realism about rationality). Even if such a reasoner exists, it seems unlikely that we will a) figure out what it is, b) understand it in sufficient depth, and c) successfully use it to understand and improve ML techniques, before we get powerful AI systems through other means. Under short timelines, this cuts particularly deeply, because a) there's less time to do all of these things and b) it's more likely that advanced AI is built out of "messy" deep learning systems that seem less amenable to this sort of theoretical understanding.

(2) I certainly agree that all else equal, advanced agents should act closer to ideal agents. (Assuming there is such a thing as an ideal agent.) I also agree that advanced AI should be less susceptible to money pumps, from which I learn that their "preferences" (i.e. world states that they work to achieve) are transitive. I'm also on board that more advanced AI systems are more likely to be described by some utility function that they are maximizing the expected utility of, per the VNM theorem. I don't agree that the utility function must be simple, or that the AI must be internally reasoning by computing the expected utility over all actions and then choosing the one that's highest. I would be extremely surprised if we built powerful AI such that when we say the English sentence "make paperclips" it acts in accordance with the utility function U(universe history) = number of paperclips in the last state of the universe history. I would be very surprised if we built powerful AI such that we hardcode in the above utility function and then design the AI to maximize its expected value.
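As a concrete version of the money-pump point: an agent with cyclic preferences A ≻ B ≻ C ≻ A will pay a small fee for each trade it prefers and end up holding its original item, strictly poorer. A toy sketch of that cycle (my own illustration, not anyone's proposed design):

```python
# Money pump against cyclic (intransitive) preferences: A > B > C > A.
# The agent pays a small fee for each trade up to something it prefers,
# and after one trip around the cycle it is back where it started, minus fees.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means x is preferred to y
fee = 1.0

holding, money = "B", 10.0
for offered in ["A", "C", "B"]:           # offer A for B, then C for A, then B for C
    if (offered, holding) in prefers:     # agent prefers the offer, so it pays the fee
        holding, money = offered, money - fee

print(holding, money)  # 'B' again, but with 7.0 instead of 10.0
```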

The Value Learning sequence expands on position 2, especially in Chapter 2. The conclusion is a short version of it, but still longer than the parent comment.

I still don't understand the whole deal about counterfactuals, exemplified as "If Oswald had not shot Kennedy, then someone else would have". Maybe MIRI means something else by the counterfactuals?

If it's the counterfactual conditionals, then the approach is pretty simple, as discussed with jessicata elsewhere: there is the macrostate of the world (i.e. a state known to a specific observer, which consists of many possible substates, or microstates). One of these microstates led to the observed macroscopic event, while other possible microstates would have led to the same or to different macrostates: Oswald shoots Kennedy, Oswald's gun jams, someone else shoots Kennedy, and so on. The problem is constructing the set of microstates, and their probability distribution, that together make up the pre-shooting macrostate. Once you know those, you can predict the odds of each post-shooting-time macrostate. When you think about the problem this way, there are no counterfactuals, only state evolution. It can be applied to the past, to the present or to the future.

I posted about it before, but just to reiterate my question. If you can "simply" count possible (micro-)states and their probabilities, then what is there except this simple counting?

Just to give an example: in, say, Newcomb's problem, the pre-decision microstates of the brain of the "agent", while known to the Predictor, are not known to the agent. Some of these microstates lead to the macrostate corresponding to two-boxing, and some lead to the macrostate corresponding to one-boxing. Knowing what these microstates might be, and assigning our best-guess probabilities to them, lets us predict what action the agent would take, if not as perfectly as the Predictor would, then as well as we ever can. What do UDT or FDT say beyond that, or contrary to that?
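A minimal sketch of that microstate-counting picture, with the microstate labels, their probabilities, and their evolution all invented for illustration:

```python
# Toy version of "weight the microstates, evolve each one, read off macrostate odds".

microstates = {        # P(microstate | what the observer knows about the agent's brain)
    "m1": 0.25,
    "m2": 0.5,
    "m3": 0.25,
}
evolves_to = {          # deterministic evolution of each microstate to a decision
    "m1": "two-box",
    "m2": "one-box",
    "m3": "one-box",
}

macrostate_odds = {}
for m, p in microstates.items():
    outcome = evolves_to[m]
    macrostate_odds[outcome] = macrostate_odds.get(outcome, 0.0) + p

print(macrostate_odds)  # {'two-box': 0.25, 'one-box': 0.75}
```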

When you think about the problem this way, there are no counterfactuals, only state evolution. It can be applied to the past, to the present or to the future.

This doesn't give very useful answers when the state evolution is nearly deterministic, as with an agent made of computer code.

For example, consider an agent trying to decide whether to turn left or turn right. Suppose for the sake of argument that it actually turns left, if you run physics forward. Also suppose that the logical uncertainty has figured that out, so that the best-estimate macrostate probabilities are mostly on that. Now, the agent considers whether to turn left or right.

Since the computation (as pure math) is deterministic, counterfactuals which result from supposing the state evolution went right instead of left mostly consist of computer glitches in which the hardware failed. This doesn't seem like what the agent should be thinking about when it considers the alternative of going right instead of left. For example, the grocery store it is trying to get to could be on the right-hand path. The potential bad results of a hardware failure might outweigh the desire to turn toward the grocery store, so that the agent prefers to turn left.

For this story to make sense, the (logical) certainty that the abstract algorithm decides to turn left in this case has to be higher than the confidence that hardware will not fail, so that turning right seems likely to imply hardware failure. This can happen due to Löb's theorem: the whole above argument, as a hypothetical argument, suggests that the agent would turn left on a particular occasion if it happened to prove ahead of time that its abstract algorithm would turn left (since it would then be certain that turning right implied a hardware failure). But this means a proof of left-turning results in left-turning; so, by Löb's theorem, left-turning is indeed provable.
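Schematically (my notation, not the commenter's): let L stand for "the abstract algorithm turns left" and □ for provability in the agent's reasoning system. The hypothetical argument above establishes that a proof of left-turning yields left-turning, and Löb's theorem then discharges the hypothesis:

$$\text{if } \vdash \Box L \rightarrow L, \text{ then (by Löb's theorem) } \vdash L.$$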

The Newcomb's-problem example you give also seems problematic. Again, if the agent's algorithm is deterministic, it does basically one thing as long as the initial conditions are such that it is in Newcomb's problem. So, essentially all of the uncertainty about the agent's action is logical uncertainty. I'm not sure exactly what your intended notion of counterfactual is, but I don't see how reasoning about microstates helps the agent here.