[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future. EDIT: So far this has stood the test of time. EDIT: As of September 2020 I think this is one of the most important things to be thinking about.]

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.

For this post I use "consequentialists" to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don't care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. "obeying the moral law" is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible

Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won't retaliate.

Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.

In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won't give up their lunch money to a schoolyard bully if they think it will invite much more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.

So, amongst consequentialists, there is sometimes a huge advantage to "winning the commitment race." If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in a position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so--unless the would-be victim credibly commits not to give in first! If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can "move first" can get much more than the one that "moves second." In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.
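
To make the Chicken example concrete, here is a toy best-response check (the payoff numbers are illustrative, not from the post); the (straight, straight) entry is the mutual-commitment crash discussed below:

```python
# Toy payoff matrix for Chicken; the numbers are illustrative only.
payoffs = {  # (row action, column action) -> (row payoff, column payoff)
    ("swerve",   "swerve"):   (0,    0),
    ("swerve",   "straight"): (-1,  +1),
    ("straight", "swerve"):   (+1,  -1),
    ("straight", "straight"): (-10, -10),   # both committed: the crash
}

def best_response_of_row(column_action: str) -> str:
    """Row's best reply once column's action is fixed, e.g. by a visible commitment."""
    return max(("swerve", "straight"),
               key=lambda row_action: payoffs[(row_action, column_action)][0])

# If column credibly commits to "straight" (throws out the steering wheel),
# row's best response is to swerve, handing column the +1 outcome.
print(best_response_of_row("straight"))   # -> swerve
```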

Note: "first" and "later" in the above don't refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what's going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things) and the choices a consequentialist makes at subjective time t are cravenly submissive to the commitments they've learned about by t.

Logical updatelessness and acausal bargaining combine to create a particularly important example of a dangerous commitment race. There are strong incentives for consequentialist agents to self-modify to become updateless as soon as possible, and going updateless is like making a bunch of commitments all at once. Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

Anecdote: Playing a board game last week, my friend Lukas said (paraphrase) "I commit to making you lose if you do that move." In rationalist gaming circles this sort of thing is normal and fun. But I suspect his gambit would be considered unsportsmanlike--and possibly outright bullying--by most people around the world, and my compliance would be considered cowardly. (To be clear, I didn't comply. Practice what you preach!)

When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in may be one of these times.

This situation is already ridiculous: There is something very silly about two supposedly rational agents racing to limit their own options before the other one limits theirs. But it gets worse.

Sometimes commitments can be made "at the same time"--i.e. in ignorance of each other--in such a way that they lock in an outcome that is disastrous for everyone. (Think both players in Chicken throwing out their steering wheels simultaneously.)

Here is a somewhat concrete example: Two consequentialist AGIs think for a little while about game theory and commitment races and then self-modify to resist and heavily punish anyone who bullies them. Alas, they had slightly different ideas about what counts as bullying and what counts as a reasonable request--perhaps one thinks that demanding more than the Nash Bargaining Solution is bullying, and the other thinks that demanding more than the Kalai-Smorodinsky Bargaining Solution is bullying--so many years later they meet each other, learn about each other, and end up locked into all-out war.
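
For concreteness, here is a toy bargaining problem (made up for illustration, not from the post) in which the two solution concepts genuinely disagree, so each agent's "fair point" can look like bullying to the other:

```python
# Toy bargaining problem where the Nash and Kalai-Smorodinsky solutions disagree:
# A's utility is their share x of a pie, B's utility is sqrt(1 - x) (B is
# risk-averse); the disagreement point is (0, 0). Numbers are illustrative.
import numpy as np

x = np.linspace(0.0, 1.0, 1_000_001)   # A's share along the Pareto frontier
uA, uB = x, np.sqrt(1.0 - x)

# Nash solution: maximize the product of gains over the disagreement point.
nash_x = x[np.argmax(uA * uB)]

# Kalai-Smorodinsky: frontier point where payoffs are proportional to each
# player's ideal payoff (both ideals are 1 here, so the point where uA == uB).
ks_x = x[np.argmin(np.abs(uA - uB))]

print(f"Nash gives A {nash_x:.3f}; Kalai-Smorodinsky gives A {ks_x:.3f}")
# -> about 0.667 vs 0.618, so a demand that looks fair under one solution
#    concept looks like bullying under the other.
```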

I'm not saying disastrous AGI commitments are the default outcome; I'm saying the stakes are high enough that we should put a lot more thought into preventing them than we have so far. It would really suck if we create a value-aligned AGI that ends up getting into all sorts of fights across the multiverse with other value systems. We'd wish we built a paperclip maximizer instead.

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that bad outcome coming. A better commitment would be..."

Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.
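
A toy backward-induction sketch of that centipede-like structure (all numbers made up): waiting grows the shared pot, i.e. everyone gets wiser, but the current mover always prefers to lock in the first-mover advantage now, leaving both players worse off than if they could have coordinated to wait:

```python
# Backward-induction model of the "commit early vs. learn more" race as a
# centipede-like game. At each round the mover can COMMIT (grab a first-mover
# advantage now) or WAIT (both players get wiser, so the pot grows).
def solve(round_, pot, advantage=0.6, growth=1.1, max_rounds=6):
    """Return (mover_payoff, other_payoff, action) under backward induction."""
    if round_ == max_rounds:              # nobody committed: split the grown pot
        return pot / 2, pot / 2, "end"
    commit = (advantage * pot, (1 - advantage) * pot, "commit")
    # If the mover waits, roles swap next round and the pot grows.
    next_mover, next_other, _ = solve(round_ + 1, pot * growth,
                                      advantage, growth, max_rounds)
    wait = (next_other, next_mover, "wait")
    return commit if commit[0] >= wait[0] else wait

print(solve(0, pot=10.0))                 # -> (6.0, 4.0, 'commit'): race to commit
print(10.0 * 1.1 ** 6 / 2)                # -> ~8.86 each if both could wait it out
```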

Objection: "Threats, submission to threats, and costly fights are rather rare in human society today. Why not expect this to hold in the future, for AGI, as well?"

Reply: Several points:

1. Devastating commitments (e.g. "Grim Trigger") are much more possible with AGI--just alter the code! Inigo Montoya is a fictional character and even he wasn't able to summon lifelong commitment on a whim; it had to be triggered by the brutal murder of his father.

2. Credibility is also much easier to achieve, especially in an acausal context (see above).

3. Some AGI bullies may be harder to retaliate against than humans, lowering their disincentive to make threats.

4. AGI may not have sufficiently strong reputation effects in the sense relevant to consequentialists, partly because threats can be made more devastating (see above) and partly because they may not believe they exist in a population of other powerful agents who will bully them if they show weakness.

5. Finally, these terrible things (brutal threats, costly fights) do happen to some extent even among humans today--especially in situations of anarchy. We want the AGI we build to be less likely to do that stuff than humans, not merely as likely.

Objection: "Any AGI that falls for this commit-now-before-the-others-do argument will also fall for many other silly do-X-now-before-it's-too-late arguments, and thus will be incapable of hurting anyone."

Reply: That would be nice, wouldn't it? Let's hope so, but not count on it. Indeed perhaps we should look into whether there are other arguments of this form that we should worry about our AI falling for...

Anecdote: A friend of mine, when she was a toddler, would threaten her parents: "I'll hold my breath until you give me the candy!" Imagine how badly things would have gone if she was physically capable of making arbitrary credible commitments. Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.

Conclusion

Overall, I'm not certain that this is a big problem. But it feels to me that it might be, especially if acausal trade turns out to be a real thing. I would not be surprised if "solving bargaining" turns out to be even more important than value alignment, because the stakes are so high. I look forward to a better understanding of this problem.

Many thanks to Abram Demski, Wei Dai, John Wentworth, and Romeo Stevens for helpful conversations.

36 comments

IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently 'committed' agent.  If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them... then I offer $6 with slightly less than 5/6 probability, so they do no better than if they demanded $5, the amount I think is fair.  They cannot evade that by trying to make some 'commitment' earlier than I do.  I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases.
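
A quick numerical check of this policy, assuming a $10 pie, $5 treated as the fair demand, and a small epsilon standing in for "slightly less than":

```python
# Accept a greedy demand d (out of a $10 pie, $5 considered fair) with probability
# just under 5/d, so the demander's expected take never exceeds the fair $5.
FAIR, EPSILON = 5.0, 1e-3

def accept_probability(demand: float) -> float:
    return 1.0 if demand <= FAIR else FAIR / demand - EPSILON

for demand in (5.0, 6.0, 7.0, 9.0):
    p = accept_probability(demand)
    print(f"demand ${demand:.0f}: accepted with p={p:.3f}, "
          f"expected take ${demand * p:.3f}")
# Every demand above $5 yields slightly *less* than $5 in expectation, so there is
# nothing to be gained by racing to commit to a greedy demand.
```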

I am not locked into warfare with things that demand $6 instead of $5. I do not go around figuring out how to invert their utility function for purposes of threatening them back - 'destroy all utility-function inverters (but do not invert their own utility functions)' was my guessed commandment that would be taught to kids in dath ilan, because you don't want reality to end up full of utility-function inverters.

From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner's dilemma.  I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself.  I suggest cultivating the same suspicion with respect to the imagination of commitment races between Ultimatum Game players, in which whoever manages to make some move logically first walks away with $9 and the other poor agent can only take $1 - especially if you end up reasoning that the computationally weaker agent should be the winner.

I agree with all this I think.

This is why I said commitment races happen between consequentialists (I defined that term more narrowly than you do; the sophisticated reasoning you do here is nonconsequentialist by my definition). I agree that agents worthy of the label "rational" will probably handle these cases gracefully and safely. 

However, I'm not yet supremely confident that the AGIs we end up building will handle these cases gracefully and safely. I would love to become more confident & am looking for ways to make it more likely. 

If today you go around asking experts for an account of rationality, they'll pull off the shelf CDT or EDT or game-theoretic rationality (Nash equilibria, best-respond to opponent) -- something consequentialist in the narrow sense. I think there is a nonzero chance that the relevant AGI will be like this too, either because we explicitly built it that way or because in some young dumb early stage it (like humans) picks up ideas about how to behave from its environment. Or else maybe because narrow-consequentialism works pretty well in single-agent environments and many multi-agent environments too, and maybe by the time the AGI is able to self-modify to something more sophisticated it is already thinking about commitment races and already caught in their destructive logic.

(ETA: Insofar as you are saying: "Daniel, worrying about this is silly, any AGI smart enough to kill us all will also be smart enough to not get caught in commitment races" then I say... I hope so! But I want to think it through carefully first; it doesn't seem obvious to me, for the above reasons.)

I think I'm less sure than @Eliezer Yudkowsky that there is a good solution to the problem of commitment races, even in theory, or that if there is a solution, it has the shape that he thinks it has. I've been thinking about this problem off and on since 2009, and haven't made much progress. Others have worked on this too (as you noted in the OP), and all seem to have gotten stuck at roughly the same place that I got stuck. Eliezer described what he would do in a particular game, but I don't know how to generalize his reasoning (which you call "nonconsequentialist") and incorporate it into a decision theory, even informally (e.g., on the same level of formality as my original description of UDT1.1 or UDT2).

As an alternative to Eliezer's general picture, it also seems plausible to me that the solution to commitment races looks like everyone trying to win the races by being as clever as they can (using whatever tricks one can think of to make the best commitments as quickly as possible while minimizing the downsides of doing so), or a messy mix of racing and trading/cooperating. UDT2 sort of fits into or is compatible with this picture, but might be far from the cleverest thing we can do (if this picture turns out to be correct).

To summarize, I think the commitment races problem poses a fundamental challenge to decision theory, and is not just a matter of "we know roughly or theoretically what should be done, and we just have to get AGI to do it." (I'm afraid some readers might get the latter impression from your exchange with Eliezer.)

TBC, I definitely agree that there's some basic structural issue here which I don't know how to resolve.  I was trying to describe properties I thought the solution needed to have, which ruled out some structural proposals I saw as naive; not saying that I had a good first-principles way to arrive at that solution.

Great comment. To reply, I'll say a bit more about how I've been thinking about this stuff over the past few years:

I agree that the commitment races problem poses a fundamental challenge to decision theory, in the following sense: There may not exist a simple algorithm in the same family of algorithms as EDT, CDT, UDT 1.0, 1.1, and even 2.0, that does what we'd consider a good job in a realistic situation characterized by many diverse agents interacting over some lengthy period with the ability to learn about each other and make self-modifications (including commitments). Indeed it may be that the top 10% of humans by performance in environments like this, or even the top 90%, outperform the best possible simple-algorithm-in-that-family. Thus any algorithm for making decisions that would intuitively be recognized as a decision theory, would be worse in realistic environments than the messy neural net wetware of many existing humans, and probably far worse than the best superintelligences. (To be clear, I still hold out hope that this is false and such a simple in-family algorithm does exist.)

I therefore think we should widen our net and start considering algorithms that don't fit in the traditional decision theory family. For example, think of a human role model (someone you consider wise, smart, virtuous, good at philosophy, etc.) and then make them into an imaginary champion by eliminating what few faults they still have, and increasing their virtues to the extent possible, and then imagine them in a pleasant and secure simulated environment with control over their own environment and access to arbitrary tools etc. and maybe also the ability to make copies of themselves HCH style. You have now described an algorithm that can be compared to the performance of EDT, UDT 2.0, etc. and arguably will be superior to all of them (because this wise human can use their tools to approximate or even directly compute such things to the extent that they deem it useful to do so). We can then start thinking about flaws in this algorithm, and see if we can fix them. (Another algorithm to consider is the human champion alone, without all the fancy tooling and copy-generation ability. Even this might still be better than CDT, EDT, UDT, etc.)

Another example:

Consider a standard expected utility maximizer of some sort (e.g. EDT) but with the following twist: It also has a deontological or almost-deontological constraint that prevents it from getting exploited. How is this implemented? Naive first attempt: It has some "would this constitute me being exploited?" classifier which it can apply to imagined situations, and which it constantly applies whenever it's thinking about what to do, and it doesn't take actions that trigger the classifier to a sufficiently high degree. Naive second attempt: "Me getting exploited" is assigned huge negative utility. (I suspect these might be equivalent, but also they might not be, anyhow moving on...) What can we say about this agent?
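
Here is a minimal code sketch of that "naive first attempt"; the classifier, the outcome model, and the threshold are hypothetical placeholders rather than a concrete proposal:

```python
# Sketch of an expected-utility maximizer with a veto-style exploitation classifier.
from typing import Callable, Dict, List

def choose_action(
    actions: List[str],
    outcomes: Callable[[str], Dict[str, float]],   # action -> {imagined outcome: probability}
    utility: Callable[[str], float],               # outcome -> utility
    exploited_score: Callable[[str], float],       # outcome -> "am I being exploited?" in [0, 1]
    threshold: float = 0.5,
) -> str:
    def triggers_classifier(action: str) -> bool:
        # Veto any action whose imagined outcomes look too much like being exploited.
        return any(exploited_score(o) > threshold for o in outcomes(action))

    def expected_utility(action: str) -> float:
        return sum(p * utility(o) for o, p in outcomes(action).items())

    allowed = [a for a in actions if not triggers_classifier(a)]
    return max(allowed or actions, key=expected_utility)   # fall back to plain EU if all vetoed
```

The "naive second attempt" would instead fold the classifier into the utility function as a large negative term on exploited outcomes.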

Well, it all depends on how good its classifier is, relative to the adversaries it is realistically likely to face. Are its adversaries able to find any adversarial examples to its classifier that they can implement in practice? Things that in some sense SHOULD count as exploitation, but which it won't classify as exploitation and thus will fall for?

Moreover, is its classifier wasteful/clumsy/etc., hurting its own performance in other ways in order to achieve the no-exploitation property?

I think this might not be a hard problem. If you are facing adversaries significantly more intelligent than you, or who can simulate you in detail such that they can spend lots of compute to find adversarial examples by brute force, you are kinda screwed anyway probably and so it's OK if you are vulnerable to exploitation by them. Moreover there are probably fixes to even those failure modes -- e.g. plausibly "they used their simulation of me + lots of compute to find a solution that would give them lots of my stuff but not count as exploitation according to my classifier" can just be something that your classifier classifies as exploitation. Anything even vaguely resembling that can be classified as exploitation. So you'd only be exploitable in practice if they had the simulation of you but you didn't know they did. 

Moreover, that's just the case where you have a fixed/frozen classifier. More sophisticated designs could have more of a 'the constitution is a living document' vibe, a process for engaging in Philosophical/Moral Reasoning that has the power to modify the classifier as it sees fit -- but importantly, still applies the classifier to its own thinking processes, so it won't introduce a backdoor route to exploitation.

Another tool in the toolbox: Infohazard management. There's a classic tradeoff which you discovered, in the context of UDT 2.0 at least, where if you run the logical inductor for longer you risk making yourself exploitable or otherwise losing to agents that are early enough in logical time that you learn about their behavior (and they predict that you'll learn about their behavior) and so they exploit you. But on the other hand, if you pause the logical inductor and let the agent make self-modifications too soon, the self-modifications it makes might be really stupid/crazy. Well, infohazard management maybe helps solve this problem. Make a cautious self-modification along the lines of "let's keep running the logical inductor, but let's not think much yet about what other potentially-exploitative-or-adversarial agents might do." Perhaps things mostly work out fine if the agents in the commitment race are smart enough to do something like this before they stumble across too much information about each other.

Another tool in the toolbox: Learn from history: Heuristics / strategies / norms / etc. for how to get along in commitment race environment can be learned from history via natural selection, cultural selection, and reading history books. People have been in similar situations in the past, e.g. in some cultures people could unilaterally swear oaths/promises and would lose lots of status if they didn't uphold them. Over history various cultures have developed concepts of fairness that diverse agents with different interests can use to coordinate without incentivizing exploiters; we have a historical record which we can use to judge how well these different concept work, including how well they work when different people come from different cultures with different fairness concepts.

Another thing to mention: The incentive to commit to brinksmanshippy, exploitative policies is super strong to the extent that you are confident that the other agents you will interact with are consequentialists. But to the extent that you expect many of those agents to be nonconsequentialists with various anti-exploitation defenses (e.g. the classifier system I described above, or whatever sort of defenses they may have evolved culturally or genetically), the incentive goes in the opposite direction -- doing brinksmanshippy / bully-ish strategies is going to waste resources at best and get you into lots of nasty fights with high probability and plausibly even get everyone to preemptively gang up on you.

And this is important because once you understand the commitment races problem, you realize that consequentialism is a repulsor state, not an attractor state; moreover, realistic agents (whether biological or artificial) will not begin their life as consequentialists except if specifically constructed to be that way. Moreover their causal history will probably contain lots of learned/evolved anti-exploitation defenses, some of which may have made its way into their minds.

Zooming out again: The situation seems extremely messy, but not necessarily grim. I'm certainly worried--enough to make this one of my main priorities!--but I think that agents worthy of being called "rational" will probably handle all this stuff more gracefully/competently than humans do, and I think (compared to how naive consequentialists would handle it, and certainly compared to how it COULD go) humans handle it pretty well. That is, I agree that "the solution to commitment races looks like everyone trying to win the races by being as clever as they can (using whatever tricks one can think of to make the best commitments as quickly as possible while minimizing the downsides of doing so), or a messy mix of racing and trading/cooperating," but I think that given what I've said in this comment, and some other intuitions which I haven't articulated, overall I expect things to go significantly better in expectation than they go with humans. The sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan. (Dath Ilan is Yudkowsky's fantasy utopia of cooperative competence)

I think that agents worthy of being called “rational” will probably handle all this stuff more gracefully/competently than humans do

Humans are kind of terrible at this, right? Many give in even to threats (bluffs) conjured up by dumb memeplexes and backed up by nothing (i.e., heaven/hell), popular films are full of heroes giving in to threats, an apparent majority of philosophers have 2-boxing intuitions (hence the popularity of CDT, which IIUC was invented specifically because some philosophers were unhappy with EDT choosing to 1-box), governments negotiate with terrorists pretty often, etc.

The sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan.

If we build AGI that learn from humans or defer to humans on this stuff, do we not get human-like (in)competence?[1][2] If humans are not atypical, large parts of the acausal society/economy could be similarly incompetent? I imagine there could be a top tier of "rational" superintelligences, built by civilizations that were especially clever or wise or lucky, that cooperate with each other (and exploit everyone else who can be exploited), but I disagree with this second quoted statement, which seems overly optimistic to me. (At least for now; maybe your unstated reasons to be optimistic will end up convincing me.)


  1. I can see two ways to improve upon this: 1) AI safety people seem to have better intuitions (cf popularity of 1-boxing among alignment researchers) and maybe can influence the development of AGI in a better direction, e.g., to learn from / defer to humans with intuitions more like themselves. 2) We figure out metaphilosophy, which lets AGI figure out how to improve upon humans. (ETA: However, conditioning on there not being a simple and elegant solution to decision theory also seems to make it much less likely that metaphilosophy is simple and elegant. So what would "figure out metaphilosophy" mean in that case?) ↩︎

  2. I can also see the situation potentially being even worse, since many future threats will be very "out of distribution" for human evolution/history/intuitions/reasoning, so maybe we end up handling them even worse than current threats. ↩︎

Yes. Humans are pretty bad at this stuff, yet still, society exists and mostly functions. The risk is unacceptably high, which is why I'm prioritizing it, but still, by far the most likely outcome of AGIs taking over the world--if they are as competent at this stuff as humans are--is that they talk it over, squabble a bit, maybe get into a fight here and there, create & enforce some norms, and eventually create a stable government/society. But yeah also I think that AGIs will be by default way better than humans at this sort of stuff. I am worried about the "out of distribution" problem though; I expect humans to perform worse in the future than they perform in the present for this reason.

Yes, some AGIs will be better than others at this, and presumably those that are worse will tend to lose out in various ways on average, similar to what happens in human society.

Consider that in current human society, a majority of humans would probably pay ransoms to free loved ones being kidnapped. Yet kidnapping is not a major issue; it's not like 10% of the population is getting kidnapped and paying ransoms every year. Instead, the governments of the world squash this sort of thing (well, except for failed states etc.) and do their own much more benign version, where you go to jail if you don't pay taxes & follow the laws. When you say "the top tier of rational superintelligences exploits everyone else" I say that is analogous to "the most rational/clever/capable humans form an elite class which rules over and exploits the masses." So I'm like yeah, kinda sorta I expect that to happen, but it's typically not that bad? Also it would be much less bad if the average level of rationality/capability/etc. was higher?

I'm not super confident in any of this to be clear.

 

But yeah also I think that AGIs will be by default way better than humans at this sort of stuff.

What are your reasons for thinking this? (Sorry if you already explained this and I missed your point, but it doesn't seem like you directly addressed my point that if AGIs learn from or defer to humans, they'll be roughly human-level at this stuff?)

When you say “the top tier of rational superintelligences exploits everyone else” I say that is analogous to “the most rational/clever/capable humans form an elite class which rules over and exploits the masses.” So I’m like yeah, kinda sorta I expect that to happen, but it’s typically not that bad?

I think it could be much worse than current exploitation, because technological constraints prevent current exploiters from extracting full value from the exploited (have to keep them alive for labor, can't make them too unhappy or they'll rebel, monitoring for and repressing rebellions is costly). But with superintelligence and future/acausal threats, an exploiter can bypass all these problems by demanding that the exploited build an AGI aligned to itself and let it take over directly.

I agree that if AGIs defer to humans they'll be roughly human-level, depending on which humans they are deferring to. If I condition on really nasty conflict happening as a result of how AGI goes on earth, a good chunk of my probability mass (and possibly the majority of it?) is this scenario. (Another big chunk, possibly bigger, is the "humans knowingly or unknowingly build naive consequentialists and let rip" scenario, which is scarier because it could be even worse than the average human, as far as I know). Like I said, I'm worried.

If AGIs learn from humans though, well, it depends on how they learn, but in principle they could be superhuman.

Re: analogy to current exploitation: Yes there are a bunch of differences which I am keen to study, such as that one. I'm more excited about research agendas that involve thinking through analogies like this than I am about what people interested in this topic seem to do by default, which is think about game theory and Nash bargaining and stuff like that. Though I do agree that both are useful and complementary.

The Ultimatum game seems like it has pretty much the same type signature as the prisoner's dilemma: Payoff matrix for different strategies, where the players can roll dice to pick which strategy they use. Does timeless decision theory return the "correct answer" (second player rejects greedy proposals with some probability) when you feed it the Ultimatum game?

Okay, so now having thought about this a bit...

I at first read this and was like "I'm confused – isn't this what the whole agent foundations agenda is for? Like, I know there are still kinks to work out, and some of these kinks are major epistemological problems. But... I thought this specific problem was not actually that confusing anymore."

"Don't have your AGI go off and do stupid things" is a hard problem, but it seemed basically to be restating "the alignment problem is hard, for lots of finnicky confusing reasons."

Then I realized "holy christ most AGI research isn't built off the agent foundations agenda and people regularly say 'well, MIRI is doing cute math things but I don't see how they're actually relevant to real AGI we're likely to build.'"

Meanwhile, I have several examples in mind of real humans who fell prey to something similar to commitment-race concerns. i.e. groups of people who mutually grim-triggered each other because they were coordinating on slightly different principles. (And these were humans who were trying to be rationalist and even agent-foundations-based)

So, yeah actually it seems pretty likely that many AGIs that humans might build might accidentally fall into these traps. 

So now I have a vague image in my head of a rewrite of this post that ties together some combo of:

  • The specific concerns noted here
  • The rocket alignment problem "hey man we really need to make sure we're not fundamentally confused about agency and rationality."
  • Possibly some other specific agent-foundations-esque concerns

Weaving those into a central point of:

"If you're the sort of person who's like 'Why is MIRI even helpful? I get how they might be helpful but they seem more like a weird hail-mary or a 'might as well given that we're not sure what else to do?'... here is a specific problem you might run into if you didn't have a very thorough understanding of robust agency when you built your AGI. This doesn't (necessarily) imply any particular AGI architecture, but if you didn't have a specific plan for how to address these problems, you are probably going to get them wrong by default.

(This post might already exist somewhere, but currently these ideas feel like they just clicked together in my mind in a way they hadn't previously. I don't feel like I have the ability to write up the canonical version of this post but feel like "someone with better understanding of all the underlying principles" should)

One big factor this whole piece ignores is communication channels: a commitment is completely useless unless you can credibly communicate it to your opponent/partner. In particular, this means that there isn't a reason to self-modify to something UDT-ish unless you expect other agents to observe that self-modification. On the other hand, other agents can simply commit to not observing whether you've committed in the first place - effectively destroying the communication channel from their end.

In a game of chicken, for instance, I can counter the remove-the-steering-wheel strategy by wearing a blindfold. If both of us wear a blindfold, then neither of us has any reason to remove the steering wheel. In principle, I could build an even stronger strategy by wearing a blindfold and using a beeping laser scanner to tell whether my opponent has swerved - if both players do this, then we're back to the original game of chicken, but without any reason for either player to remove their steering wheel.

I think in the acausal context at least that wrinkle is smoothed out.

In a causal context, the situation is indeed messy as you say, but I still think commitment races might happen. For example, why is [blindfold+laserscanner] a better strategy than just blindfold? It loses to the blindfold strategy, for instance. Whether or not it is better than blindfold depends on what you think the other agent will do, and hence it's totally possible that we could get a disastrous crash (just imagine that for whatever reason both agents think the other agent will probably not do pure blindfold. This can totally happen, especially if the agents don't think they are strongly correlated with each other and sometimes even if they do (e.g. if they use CDT)). The game of chicken doesn't cease being a commitment race when we add the ability to blindfold and the ability to visibly attach laser scanners.

Blindfold + scanner does not necessarily lose to blindfold. The blindfold does not prevent swerving, it just prevents gaining information - the blindfold-only agent acts solely on its priors. Adding a scanner gives the agent more data to work with, potentially allowing the agent to avoid crashes. Foregoing the scanner doesn't actually help unless the other player knows I've foregone the scanner, which brings us back to communication - though the "communication" at this point may be in logical time, via simulation.

In the acausal context, the communication issue kicks in even harder, because either player can unilaterally destroy the communication channel: they can simply choose to not simulate the other player. The game will never happen at all unless both agents expect (based on priors) to gain from the trade.

If you choose not to simulate the other player, then you can't see them, but they can still see you. So it's destroying one direction of the communication channel. But the direction that remains (they seeing you) is the dimension most relevant for e.g. whether or not there is a difference between making a commitment and credibly communicating it to your partner. Not simulating the other player is like putting on a blindfold, which might be a good strategy in some contexts but seems kinda like making a commitment: you are committing to act on your priors in the hopes that they'll see you make this commitment and then conform their behavior to the incentives implied by your acting on your priors.

It seems like we can kind of separate the problem of equilibrium selection from the problem of “thinking more”, if “thinking more” just means refining one’s world models and credences over them. One can make conditional commitments of the form: “When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.”

The set of solution concepts you commit to regarding as acceptable still poses an equilibrium selection problem. But, on the face of it at least, the “thinking more” part is handled by conditional commitments to act on the basis of future beliefs.

I guess there’s the problem of what protocols for specifying future world-models you commit to regarding as acceptable. Maybe there are additional protocols that haven’t occurred to you, but which other agents may have committed to and which you would regard as acceptable when presented to you. Hopefully it is possible to specify sufficiently flexible methods for determining whether protocols proposed by your future counterparts are acceptable that this is not a problem.

If I read you correctly, you are suggesting that some portion of the problem can be solved, basically -- that it's in some sense obviously a good idea to make a certain sort of commitment, e.g. "When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.” So the commitment races problem may still exist, but it's about what other commitments to make besides this one, and when. Is this a fair summary?

I guess my response would be "On the object level, this seems like maybe a reasonable commitment to me, though I'd have lots of questions about the details. We want it to be vague/general/flexible enough that we can get along nicely with various future agents with somewhat different protocols, and what about agents that are otherwise reasonable and cooperative but for some reason don't want to agree on a world-model with us? On the meta level though, I'm still feeling burned from the various things that seemed like good commitments to me and turned out to be dangerous, so I'd like to have some sort of stronger reason to think this is safe."

Yeah I agree the details aren’t clear. Hopefully your conditional commitment can be made flexible enough that it leaves you open to being convinced by agents who have good reasons for refusing to do this world-model agreement thing. It’s certainly not clear to me how one could do this. If you had some trusted “deliberation module”, which engages in open-ended generation and scrutiny of arguments, then maybe you could make a commitment of the form “use this protocol, unless my counterpart provides reasons which cause my deliberation module to be convinced otherwise”. Idk.

Your meta-level concern seems warranted. One would at least want to try to formalize the kinds of commitments we’re discussing and ask if they provide any guarantees, modulo equilibrium selection.

I think we are on the same page then. I like the idea of a deliberation module; it seems similar to the "moral reasoning module" I suggested a while back. The key is to make it not itself a coward or bully, reasoning about Schelling points and universal principles and the like instead of about what-will-lead-to-the-best-expected-outcomes-given-my-current-credences.

Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn't actually commit to anything then.

 

Why is that?

All the versions of updatelessness that I know of would have led to some pretty disastrous, not-adding-up-to-normality behaviors, I think. I'm not sure. More abstractly, the commitment races problem has convinced me to be more skeptical of commitments, even ones that seem probably good. If I was a consequentialist I might take the gamble, but I'm not a consequentialist -- I have commitments built into me that have served my ancestors well for generations, and I suspect for now at least I'm better off sticking with that than trying to self-modify to something else.

(This is some of what I tried to say yesterday, but I was very tired and not sure I said it well)

Hm, the way I understand UDT is that you give yourself the power to travel back in logical time. This means that you don't need to actually make commitments early in your life when you are less smart.

If you are faced with blackmail or transparent Newcomb's problem, or something like that, where you realise that if you had thought of the possibility of this sort of situation before it happened (but with your current intelligence), you would have pre-committed to something, then you should now do as you would have pre-committed to.

This means that a UDT agent doesn't have to make tons of pre-commitments. It can figure things out as it goes, and still get the benefit of early pre-committing. Though as I said when we talked, it does lose some transparency, which might be very costly in some situations. Though I do think that you lose transparency in general by being smart, and that it is generally worth it.

(Now something I did not say)

However, there is one commitment that you (maybe?[1]) have to make to get the benefit of UDT if you are not already UDT, which is to commit to becoming UDT. And I get that you are wary of commitments.

Though more concretely, I don't see how UDT can lead to worse behaviours. Can you give an example? Or do you just mean that UDT gets into commitment races at all, which is bad? But I don't know any DT that avoids this, other than always giving in to blackmail and bullies, which I already know you don't, given one of the stories in the blogpost.

[1] Or maybe not. Is there a principled difference between never giving in to blackmail because you pre-committed to something, and just never giving in to blackmail without any binding pre-commitment? I suspect not really, which means you are UDT as long as you act UDT, with no pre-commitment needed, other than for your own sake.

Thanks for the detailed reply!

where you realise that if you had thought of the possibility of this sort of situation before it happened (but with your current intelligence), you would have pre-committed to something, then you should now do as you would have pre-committed to.

The difficulty is in how you spell out that hypothetical. What does it mean to think about this sort of situation before it happened but with your current intelligence? Your current intelligence includes lots of wisdom you've accumulated, and in particular, includes the wisdom that this sort of situation has happened, and more generally that this sort of situation is likely, etc. Or maybe it doesn't -- but then how do we define current intelligence then? What parts of your mind do we cut out, to construct the hypothetical?

I've heard of various ways of doing this and IIRC none of them solved the problem, they just failed in different ways. But it's been a while since I thought about this.

One way they can fail is by letting you have too much of your current wisdom in the hypothetical, such that it becomes toothless -- if your current wisdom is that people threatening you is likely, you'll commit to giving in instead of resisting, so you'll be a coward and people will bully you. Another way they can fail is by taking away too much of your current wisdom in the hypothetical, so that you commit to stupid-in-retrospect things too often.

Imagine your life as a tree (as in data structure). Every observation which (from your point of view of prior knowledge) could have been different, and every decision which (from your point of view) could have been different, is a node in this tree. 

Ideally you would want to pre-analyse the entire tree, and decide the optimal pre-commitment for each situation. This is too much work.

So instead you wait and see which branch you find yourself in, only then make the calculations needed to figure out what you would do in that situation, given a complete analysis of the tree (including logical constraints, e.g. people predicting what you would have done, etc). This is UDT. In theory, I see no drawbacks with UDT. Except in practice UDT is also too much work. 

What you actually do, as you say, is to rely on experience-based heuristics. Experience-based heuristics are much superior for computational efficiency, and will give you a leg up in raw power. But you will slide away from optimal DT, which will give you a negotiating disadvantage. Given that I think raw power is more important than negotiating advantage, I think this is a good trade-off.

The only situations where you want to rely more on DT principles are super important one-off situations, and you basically only get those in weird acausal trade situations. Like, you could frame us building a friendly AI as acausal trade, like Critch said, but that framing does not add anything useful.

And then there are things like this and this and this, which I don't know how to think of. I suspect it breaks somehow, but I'm not sure how. And if I'm wrong, getting DT right might be the most important thing.

But in any normal situation, you will either have repeated games among several equals, where some coordination mechanism is just uncomplicatedly in everyone's interest, or you're in a situation where one person just has much more power over the other one.

h/t Anthony DiGiovanni who points to this new paper making a weaker version of this point, in the context of normative ethics: Johan E. Gustafsson, Bentham’s Mugging - PhilPapers

I mostly agree with this post, except I'm not convinced it is very important. (I wrote some similar thoughts here.)

Raw power (including intelligence) will always be more important than having the upper hand in negotiation, because I can only shift you by up to the amount I can threaten you with.

Let's say I can cause you up to X utility of harm, according to your utility function. If I'm maximally skilled at blackmail negotiation, then I can determine your action within the set of actions such that your utility is within (max-X, max].
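
Restating that bound (a paraphrase, on the assumption that the victim simply refuses any demand that would leave them worse off than absorbing the full harm X): writing $u^* = \max_a u(a)$, the demands the blackmailer can enforce are contained in

$$\{\, a : u(a) > u^* - X \,\},$$

since for any demanded action $a$ with $u(a) \le u^* - X$, refusing and absorbing the harm yields at least $u^* - X \ge u(a)$.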

If X utility is a lot, then I can influence you a lot. If X is not so much then I don't have much power over you. If I'm strong then X will be large, and influencing your action will probably be of little importance to me. 

Blackmail is only important when players are of similar strength, which is probably unlikely, or if the power to destroy is much greater than the power to create, which I also find unlikely.

The main scenario where I expect blackmail to seriously matter (among superintelligences) is in acausal trade between different universes. I'm sceptical of this being a real thing, but admit I don't have strong arguments on this point.

I agree raw power (including intelligence) is very useful and perhaps generally more desirable than bargaining power etc. But that doesn't undermine the commitment races problem; agents with the ability to make commitments might still choose to do so in various ways and for various reasons, and there's general pressure (collective action problem style) for them to do it earlier while they are stupider, so there's a socially-suboptimal amount of risk being taken.

I agree that on Earth there might be a sort of unipolar takeoff where power is sufficiently imbalanced and credibility sufficiently difficult to obtain and "direct methods" easier to employ, that this sort of game theory and bargaining stuff doesn't matter much. But even in that case there's acausal stuff to worry about, as you point out.

I was confused about this post, and... I might have resolved my confusion by the time I got ready to write this comment. Unsure. Here goes:

My first* thought: 

Am I not just allowed to precommit to "be the sort of person who always figures out whatever the optimal game theory is, and commit to that"? I thought that was the point.

i.e. I wouldn't precommit to treating either the Nash Bargaining Solution or Kalai-Smorodinsky Solution as "the permanent grim trigger bullying point", I'd precommit to something like "have a meta-policy of not giving into bullying, pick my best-guess-definition-of-bullying as my default trigger, and my best-guess grim-trigger response, but include an 'oh shit I didn't think about X' parameter." (with some conditional commitments thrown in)

Where X can't be an arbitrary new belief – the whole point of having a grim trigger clause is to be able to make appropriately weighted threats that AGI-Bob really thinks will happen. But, if I legitimately didn't think of the Kalai-Smordinwhatever solution as something an agent might legitimately think was a good coordination tool, I want to be able to say, depending on circumstances (a rough sketch in code follows this list):

  1. If the deal hasn't resolved yet "oh, shit I JUUUST thought of the Kalai-whatever thing and this means I shouldn't execute my grim trigger anti-bullying clause without first offering some kind of further clarification step."
  2. If the deal already resolved before I thought of it, say "oh shit man I really should have realized the Kalai-Smorodinsky thing was a legitimate Schelling point and not started defecting hard as punishment. Hey, fellow AGI, would you like me to give you N remorseful utility in return for which I stop grim-triggering you and you stop retaliating at me and we end the punishment spiral?"
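
A rough, heavily hypothetical sketch of that meta-policy (the classifier, the set of recognized fair solution concepts, and all names are placeholders, not a worked-out proposal):

```python
# Default grim trigger for bullying, with an "oh shit, I didn't think of X"
# clarification step for demands backed by an unfamiliar but possibly legitimate
# solution concept.
from dataclasses import dataclass, field
from typing import Callable, Set

@dataclass
class AntiBullyingMetaPolicy:
    looks_like_bullying: Callable[[str], bool]        # best-guess bullying classifier
    recognized_fair_solutions: Set[str] = field(
        default_factory=lambda: {"nash_bargaining"}   # placeholder starting set
    )

    def respond(self, demand: str, claimed_solution_concept: str) -> str:
        if not self.looks_like_bullying(demand):
            return "accept"
        if claimed_solution_concept not in self.recognized_fair_solutions:
            # Possibly a legitimate Schelling point I hadn't considered:
            # pause and clarify instead of firing the grim trigger.
            # (As noted above, "X" can't be an arbitrary new claim; a real
            # version would need some way to vet it.)
            return "request_clarification"
        return "grim_trigger"

policy = AntiBullyingMetaPolicy(looks_like_bullying=lambda d: d == "demand_more_than_half")
print(policy.respond("demand_more_than_half", "kalai_smorodinsky"))  # -> request_clarification
```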

My second* thought:

Okay. So. I guess that's easy for me to say. But, I guess the whole point of all this updateless decision theory stuff was to actually formalize that in a way that you could robustly program an AGI that you were about to give the keys to the universe. 

Having a vague handwavy notion of it isn't reassuring enough if you're about to build a god. 

And while it seems to me like this is (relatively) straightforward... do I really want to bet that?

I guess my implicit assumption was that game theory would turn out to not be that complicated in the grand scheme of things. Surely once you're a Jupiter Brain you'll have it figured out? And, hrmm, maybe that's true, but maybe it's not, or maybe it turns out the fate of the cosmos gets decided by smaller AGIs fighting over Earth with much more limited compute.

Third thought:

Man, just earlier this year, someone offered me a coordination scheme that I didn't understand, and I fucked it up, and the deal fell through because I didn't understand the principles underlying it until just-too-late. (this is an anecdote I plan to write up as a blogpost sometime soon)

And... I guess I'd been implicitly assuming that AGIs would just be able to think fast enough that that wouldn't be a problem.

Like, if you're talking to a used car salesman, and you say "No more than $10,000", and then they say "$12,000 is final offer", and then you turn and walk away, hoping that they'll say "okay, fine, $10,000"... I suppose metaphorical AGI used car buyers could say "and if you take more than 10 compute-cycles to think about it, the deal is off." And that might essentially limit you to only be able to make choices you'd precomputed, even if you wanted to give yourself the option to think more.

That seems to explain why my "Just before deal resolves, realize I screwed up my decision theory" idea doesn't work. 

It seems like my "just after deal resolves and I accidentally grim trigger, turn around and say 'oh shit, I screwed up, here is remorse payment + a costly proof that I'm not fudging my decision theory'" idea should still work though?

I guess in the context of Acausal Trade, I can imagine things like "they only bother running a simulation of you for 100 cycles, and it doesn't matter if on the 101st cycle you realize you made a mistake and am sorry." They'll never know it.

But... I dunno man. I figured the first rule of Acausal Trade was "build a galaxy brain and think really goddamn carefully about acausal trade and philosophical competence" before you actually try simulating anything, and I'm skeptical a galaxy brain can't figure out the right precommitments.

I dunno. Maybe I'm still confused.

But, I wanted to check in on whether I was on the right track in understanding what considerations were at play here.

...

*actually there were like 20 thoughts before I got to the one I've labeled 'first thought' here. But, "first thought that seemed worth writing down."

Thanks! Reading this comment makes me very happy, because it seems like you are now in a similar headspace to me back in the day. Writing this post was my response to being in this headspace.

But... I dunno man. I figured the first rule of Acausal Trade was "build a galaxy brain and think really goddamn carefully about acausal trade and philosophical competence" before you actually try simulating anything, and I'm skeptical a galaxy brain can't figure out the right precommitments.

This sounds like a plausibly good rule to me. But that doesn't mean that every AI we build will automatically follow it. Moreover, thinking about acausal trade is in some sense engaging in acausal trade. As I put it:

Since real agents can't be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and what the outputs of various programs are before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won't say more about that here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis. So bluffing is impossible and there is constant pressure to make commitments quickly before thinking longer. (That's my take on it anyway)

As for your handwavy proposals, I do agree that they are pretty good. They are somewhat similar to the proposals I favor, in fact. But these are just specific proposals in a big space of possible strategies, and (a) we have reason to think there might be flaws in these proposals that we haven't discovered yet, and (b) even if these proposals work perfectly there's still the problem of making sure that our AI follows them:

Objection: "Surely they wouldn't be so stupid as to make those commitments--even I could see that bad outcome coming. A better commitment would be..."
Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance these motivations against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven't anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don't know so much.

If you want to think and talk more about this, I'd be very interested to hear your thoughts. Unfortunately, while my estimate of the commitment races problem's importance has only increased over the past year, I haven't done much to actually make intellectual progress on it.

Yeah I'm interested in chatting about this. 

I feel I should disclaim "much of what I'd have to say about this is a watered down version of whatever Andrew Critch would say". He's busy a lot, but if you haven't chatted with him about this yet you probably should, and if you have I'm not sure whether I'll have much to add.

But I am pretty interested right now in fleshing out my own coordination principles, and my understanding of how they scale up from "200 human rationalists" to coalitions of 1,000-10,000, to all of humanity, and to AGI and beyond. I'm currently working on a sequence that could benefit from chatting with other people who think seriously about this.

This feels like an important question in Robust Agency and Group Rationality, which are major topics of my interest.

I have another "objection", although it's not a very strong one--it's really more of a comment.

One reason game-theoretic reasoning doesn't work very well for predicting human behavior is that games are always embedded in a larger context, and this tends to wreck the game-theory analysis by bringing in reputation and collusion as major factors. This seems like something that would be true for AIs as well (e.g. "the code" might not tell the whole story; I/"the AI" can throw away my steering wheel but rely on an external steering-wheel-replacing buddy to jump in at the last minute if needed).

In apparent contrast to much of the rationalist community, I think by default one should probably view game theoretic analyses (and most models) as "just one more way of understanding the world" as opposed to "fundamental normative principles", and expect advanced AI systems to reason more heuristically (like humans).

But I understand and agree with the framing here: "this isn't definitely a problem, but it seems important enough to worry about".


I think you're missing at least one key element in your model: uncertainty about future predictions. Commitments have a very high cost in terms of future consequence-affecting decision space. Consequentialism does _not_ imply a very high discount rate, and we're allowed to recognize the limits of our predictions and to give up some power in the short term to reserve our flexibility for the future.

Also, one of the reasons that this kind of interaction is rare among humans is that commitment is impossible for humans. We can change our minds even after making an oath--often with some reputational consequences, but it's still possible if we deem it worthwhile. Even so, we're rightly reluctant to make serious commitments. An agent who can actually enforce its self-limitations is going to be orders of magnitude more hesitant to do so.

All that said, it's worth recognizing that an agent that's significantly better at predicting the consequences of potential commitments will pay a lower cost for the best of them, and has a material advantage over those who need flexibility because they don't have information. This isn't a race in time, it's a race in knowledge and understanding. I don't think there's any way out of that race - more powerful agents are going to beat weaker ones most of the time.

I don't think I was missing that element. The way I think about it is: There is some balance that must be struck between making commitments sooner (risking making foolish decisions due to ignorance) and later (risking not having the right commitments made when a situation arises in which they would be handy). A commitment race is a collective action problem where individuals benefit from going far to the "sooner" end of the spectrum relative to the point that would be optimal for everyone if they could coordinate.
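
Here is one way to see that collective action structure in a toy model. Everything below (the shape of the "quality" curve, the size of the first-mover bonus, the specific numbers) is an assumption of mine for illustration only: each agent picks how long to deliberate before committing, longer deliberation yields a better commitment, but whoever commits first captures a large bargaining bonus.

```python
# Toy model of the "sooner vs. later" tradeoff; all numbers are illustrative.
TIMES = range(0, 11)      # possible deliberation times before committing

def quality(t):
    # The commitment itself gets better with deliberation, with diminishing returns.
    return 10 * (1 - 0.8 ** t)

def payoff(t_me, t_other):
    # Whoever commits first also captures a large bargaining bonus.
    if t_me < t_other:
        bonus = 20.0
    elif t_me == t_other:
        bonus = 10.0
    else:
        bonus = 0.0
    return quality(t_me) + bonus

def best_reply(t_other):
    return max(TIMES, key=lambda t: payoff(t, t_other))

# Jointly optimal (symmetric) deliberation time: think as long as possible.
joint_optimum = max(TIMES, key=lambda t: payoff(t, t))

# Individually, each agent wants to undercut the other; iterate best replies.
t = max(TIMES)
for _ in range(len(TIMES) + 5):
    t = best_reply(t)

print(joint_optimum, t)   # -> 10 0: the race drives commitments far earlier than is jointly optimal
```

Whenever the first-mover bonus is large relative to the value of further deliberation, best-reply reasoning pushes the equilibrium commitment time far below the jointly optimal one.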

I agree about humans not being able to make commitments--at least, not arbitrary commitments. (Arguably, getting angry and seeking revenge when someone murders your family is a commitment you made when you were born.) I think we should investigate whether this inability is something evolution "chose" or not.

I agree it's a race in knowledge/understanding as well as time. (The two are related.) But I don't think more knowledge = more power. For example, suppose I don't know anything and commit to "plan X, which benefits me, or else war", while you know more than me--in particular, you know enough about me to know what I will commit to--and you are cowardly. Then you'll go along with my plan.


This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016.

I found some related discussions going back to 2009. They're mostly highly confused, as you might expect, but I did notice this part, which I'd forgotten and which may actually be relevant:

But if you are TDT, you can’t always use less computing power, because that might be correlated with your opponents also deciding to use less computing power

This could potentially be a way out of the "racing to think as little as possible before making commitments" dynamic, but if we have to decide how much to let our AIs think initially before making commitments, on the basis of reasoning like this, that's a really hairy thing to have to do. (This seems like another good reason for wanting to go with a metaphilosophical approach to AI safety instead of a decision theoretic one. What's the point of having a superintelligent AI if we can't let it figure these kinds of things out for us?)

If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can “move first” can get much more than the one that “moves second.”

I'm not sure how the folk theorem shows this. Can you explain?

going updateless is like making a bunch of commitments all at once

Might be a good idea to offer some examples here to help explain updateless and for pumping intuitions.

Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn’t actually commit to anything then.

Interested to hear more details about this. What would have happened if you were actually able to become updateless?

Thanks, edited to fix!

I agree with your push towards metaphilosophy.

I didn't mean to suggest that the folk theorem proves anything. Nevertheless, here is the intuition: the way the folk theorem shows that any (individually rational) status quo is possible is by assuming that players start off expecting everyone else to grim-trigger them for violating that status quo. So in a two-player game, if both players start off expecting player 1 to grim-trigger player 2 for violating player 1's preferred status quo, then player 1 will get what they want. One way to get this to happen is for player 1 to be "earlier in logical time" than player 2 and make a credible commitment.
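
For what it's worth, here is that grim-trigger arithmetic as a minimal sketch; the discount factor and per-round payoffs are toy numbers I chose for illustration. If player 2 expects player 1 to punish any deviation forever, then going along with player 1's preferred status quo beats deviating, which is what makes being "earlier in logical time" with such a commitment so valuable.

```python
# Toy grim-trigger calculation; payoffs and discount factor are illustrative.
DELTA = 0.9    # player 2's per-round discount factor

# Player 2's per-round payoff under player 1's preferred (lopsided) status quo,
# in the single round where it deviates, and in every punishment round afterwards:
COMPLY, DEVIATION_GAIN, PUNISHED = 1.0, 3.0, 0.0

def value_of_complying():
    # 1 + delta + delta^2 + ... = 1 / (1 - delta), scaled by the per-round payoff.
    return COMPLY / (1 - DELTA)

def value_of_deviating():
    # One profitable round, then grim punishment forever after.
    return DEVIATION_GAIN + DELTA * PUNISHED / (1 - DELTA)

print(value_of_complying(), value_of_deviating())   # -> roughly 10.0 vs 3.0: complying wins
```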

As for updatelessness: Well, updateless agents follow the policy that is optimal from the perspective of the credences they have at the time they go updateless. So e.g. if there is a cowardly agent who simulates you at that time or later and then caves to your demands (if you make any), then an updateless agent will be a bully and make demands, i.e. commit to punishing people it identifies as cowards who don't do what it wants. But of course updateless agents are also cowards themselves, in the sense that the best policy from the perspective of credences C is to cave in to any demands that have, according to C, already been committed to. I don't have a super clear example of how this might lead to disaster, but I intend to work one out in the future...

Same goes for my own experience. I don't have a clear example in mind of something bad that would have happened to me if I had actually self-modified, but I get a nervous feeling about it.
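
To make the "bully" half of the updatelessness point above a bit more concrete, here is a toy sketch: an updateless agent picks the policy that maximizes expected utility under its prior credences, so if that prior puts enough weight on the other agent being a coward who will simulate it and cave, the ex-ante optimal policy is to demand. The probability and all payoffs below are assumptions of mine, not anything from the post or the comments.

```python
# Toy ex-ante policy choice for an updateless agent; all numbers are illustrative.
P_COWARD = 0.7     # prior credence that the other agent simulates us and caves to commitments

# Payoffs to our agent under each outcome:
FAIR_SPLIT = 5.0   # neither side makes demands
EXTORTION  = 9.0   # we demand and the other agent caves
CONFLICT   = 0.0   # we demand and the other agent resists

def expected_value(policy):
    if policy == "demand":
        return P_COWARD * EXTORTION + (1 - P_COWARD) * CONFLICT
    return FAIR_SPLIT              # policy == "fair"

best_policy = max(["demand", "fair"], key=expected_value)
print(best_policy, expected_value("demand"), expected_value("fair"))   # -> demand 6.3 5.0
```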