A friend and I went on a long drive recently and listened to this podcast with Andrew Critch on ARCHES. On the way back from our drive we spent some time brainstorming solutions to the problems he outlines. Here are some notes on the podcast + some notes on our brainstorming.

In a possibly inaccurate nutshell, Critch argues that what we think of as the "alignment problem" is most likely going to get solved because there are strong economic incentives to solve it. However, Critch is skeptical of forming a singleton--he says people tend to resist that kind of concentration of power, and it will be hard for an AI team that has this as their plan to recruit team members. Critch says there is really a taxonomy of alignment problems:

  • single-single, where we have a single operator aligning a single AI with their preferences
  • single-multi, where we have a single operator aligning multiple AIs with their preferences
  • multi-single, where we have multiple operators aligning a single AI with their preferences
  • multi-multi, where we have multiple operators aligning multiple AIs with their preferences

Critch says that although there are commercial incentives to solve the single-single alignment problem, there aren't commercial incentives to solve all of the others. He thinks the real alignment failures might look like the sort of diffusion of responsibility you see when navigating bureaucracy.

I'm a bit skeptical of this perspective. For one thing, I'm not convinced commercial incentives for single-single alignment will extrapolate well to exotic scenarios such as the "malign universal prior" problem--and if hard takeoff happens then these exotic scenarios might come quickly. For another thing, although I can see why advocating a singleton would be a turnoff to the AI researchers that Critch is pitching, I feel like the question of whether to create a singleton deserves more than the <60 seconds of thought that an AI researcher having a casual conversation with Critch likely puts into their first impression. If there are commercial incentives to solve single-single alignment but not other kinds, shouldn't we prefer that single-single is the only kind which ends up being load-bearing? Why can't we form an aligned singleton and then tell it to design a mechanism by which everyone can share their preferences and control what the singleton does (democracy but with better reviews)?

I guess a big issue is the plausibility of hard takeoff, because if hard takeoff is implausible, that makes it less likely that a singleton will form under any circumstances, and it also means that exotic safety problems aren't likely to crop up as quickly. If this is Critch's worldview then I could see why he is prioritizing the problems he is prioritizing.

Anyway my friend and I spent some time brainstorming about how to solve versions of the alignment problem besides single-single. Since we haven't actually read ARCHES or much relevant literature, it's likely that much of what comes below is clueless, but it might also have new insights due to being unconstrained by existing paradigms :P

One scenario which is kind of in between multi-single and multi-multi alignment is one where everyone has an AI agent which negotiates with some kind of central server on their behalf. We could turn multi-single into this scenario by telling the single AI to run internal simulations of everyone's individual AI agent, or we could turn multi-multi into this scenario if we have enough cooperation/enforcement for different people to abide by the agreements that their AI agents make with one another on their behalf.

Most of the game theory we're familiar with deals with a fairly small space of possible agreements, but it occurred to us that in an ideal world, these super smart AIs would be doing a lot of creative thinking, trying to figure out a clever way for everyone's preferences to be satisfied simultaneously. Let's assume each robot agent has a perfect model of its operator's preferences (or can acquire one as needed by querying the operator). The central server queries the agents about how much utility their operator assigns to various scenarios, or whether they prefer Scenario A to Scenario B, or something like that. The agents can respond either truthfully or deceptively ("data poisoning"), trying to steer towards a final agreement which is as favorable as possible for their operator. Then the central server searches the space of possible agreements in a superintelligent way and tries to find an agreement that everyone likes. (You can also imagine a distributed version with no central server, where individual robot agents try to come up with a proposal that everyone likes.)
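
To make this a bit more concrete, here is a minimal sketch of the central-server version, with made-up agreements, made-up reported utilities, and the Nash product of reported utilities standing in for whatever fairness criterion the server would actually use:

```python
# Toy sketch (not from ARCHES): a central server queries each operator's agent
# for the utility it assigns to a set of candidate agreements, then picks the
# agreement that maximizes the Nash product of reported utilities. All names
# and numbers are made up for illustration.

# Candidate agreements the server is searching over (in reality the server
# would be generating this space itself, superintelligently).
agreements = ["split resources 50/50", "alternate control yearly", "status quo"]

# Each agent's (possibly deceptive) reported utility for each agreement,
# normalized to [0, 1]. A truthful agent reports its operator's true utilities.
reports = {
    "agent_A": {"split resources 50/50": 0.8, "alternate control yearly": 0.6, "status quo": 0.2},
    "agent_B": {"split resources 50/50": 0.7, "alternate control yearly": 0.9, "status quo": 0.3},
}

def nash_product(agreement):
    """Product of reported utilities -- one simple notion of a 'fair' compromise."""
    score = 1.0
    for agent_reports in reports.values():
        score *= agent_reports[agreement]
    return score

best = max(agreements, key=nash_product)
print(best, nash_product(best))  # -> the agreement the agents jointly report liking most
```

Of course, the interesting parts--the server generating and searching an enormous space of candidate agreements, and the agents reporting strategically--are exactly what this toy leaves out.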

How does this compare to the scenario I mentioned above, where an aligned AI designs a mechanism and collects preferences from humans directly, without any robot agent as an intermediary? The advantage of robot agents is that if everyone gets a superintelligent agent, it is harder for individuals to gain an edge through the use of secret robot agents, so the overall result ends up being more fair. However, robot agents arguably make the mechanism design problem harder: if it's humans answering the preference queries, their finite intelligence makes it harder for them to predict the strategic consequences of answering one way or another, so maybe they're better off just stating their true preferences to minimize downside risk. Additionally, an FAI is probably better at mechanism design than humans. But then again, if the mechanism for discovering fair agreements between superintelligent robot agents fails, and a single agent manages to negotiate really well on behalf of its operator's preferences, then arguably you are back in the singleton scenario. So maybe the robot agents scenario has the singleton scenario as its worst case.

I said earlier that it will be harder for humans to predict the strategic results of responding in various ways to preference queries. But we might be able to get a similar result for supersmart AI agents by using secret random numbers during the negotiation process to create enough uncertainty that revealing true preferences becomes the optimal strategy. (For example, imagine two mechanisms, one of which incentivizes strategic deception in one direction and the other of which incentivizes deception in the opposite direction; if we collect preferences and then flip a coin to decide which mechanism to use, the best strategy might be to do no deception at all.)
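
Here's a toy numerical version of that coin-flip idea (my own construction, not something from the podcast or ARCHES), using quadratic payoffs so the skewed incentives are easy to see:

```python
# Toy model: mechanism A rewards overstating your value by d, mechanism B
# rewards understating it by d. If the mechanism is chosen by a fair coin
# *after* reports are submitted, the report that maximizes expected payoff
# is the true value.
import numpy as np

true_value = 5.0
d = 2.0                                   # how far each mechanism's incentive is skewed
reports = np.linspace(0.0, 10.0, 1001)    # candidate reports to evaluate

payoff_A = -(reports - (true_value + d)) ** 2   # best response under A alone: overstate
payoff_B = -(reports - (true_value - d)) ** 2   # best response under B alone: understate
expected = 0.5 * payoff_A + 0.5 * payoff_B      # fair coin flip between mechanisms

print("Best report under A alone:", reports[np.argmax(payoff_A)])        # ~7.0
print("Best report under B alone:", reports[np.argmax(payoff_B)])        # ~3.0
print("Best report under the coin flip:", reports[np.argmax(expected)])  # ~5.0 (the truth)
```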

Another situation to consider is one where we don't have as much cooperation/enforcement and individual operators are empowered to refuse to abide by any agreement--let's call this "declaring war". In this world, we might prefer to overweight the preferences of more powerful players, because if everyone is weighted equally regardless of power, then the powerful players might have an incentive to declare war and take more than their share. However, it's unclear how to do power estimation in an impartial way. Also, such a setup incentivizes the accumulation of power.
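
A toy calculation with made-up numbers illustrates both points:

```python
# Toy illustration (my own numbers): two players split a pie of size 1. War
# destroys a fraction of the pie, and each side wins with probability equal
# to its share of total power.
war_cost = 0.3          # fraction of the pie destroyed by war

def war_payoff(power_share):
    return power_share * (1 - war_cost)

for power_share in (0.5, 0.7, 0.9):
    equal_split = 0.5
    weighted_split = power_share
    print(f"power={power_share:.1f}  "
          f"equal split={equal_split:.2f}  "
          f"war payoff={war_payoff(power_share):.2f}  "
          f"power-weighted split={weighted_split:.2f}")

# With power 0.9, war yields 0.63 > 0.50 from the equal split, so the equal
# weighting isn't stable; a power-weighted split (0.90) removes the incentive
# to declare war, but now everyone wants to accumulate power before negotiating.
```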

One idea which seems like it might be helpful at first blush would be to invent some way of verifiably implementing particular utility functions, so competing teams could know that a particular AI will take their utility function into account. However, this could be abused: in the same way the game of chicken incentivizes tearing out your steering wheel so your opponent has no choice but to swerve, Team Evil could verifiably implement a utility function such that their AI will declare war unless competing teams verifiably implement a utility function Team Evil specifies.
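
Here's the chicken analogy written out as a toy payoff matrix (the numbers are arbitrary):

```python
# Toy payoff matrix for the commitment threat described above, modeled as a
# game of chicken. Rows are Team Evil, columns are the other team. Payoffs
# are (Team Evil, other team); mutual "stand_firm" is war, the worst outcome.
payoffs = {
    ("back_down",  "back_down"):  (0, 0),
    ("back_down",  "stand_firm"): (-1, 2),
    ("stand_firm", "back_down"):  (2, -1),
    ("stand_firm", "stand_firm"): (-10, -10),   # mutual war
}

def best_response(other_team_options, evil_action):
    """What the other team should do, given Team Evil's (fixed) action."""
    return max(other_team_options, key=lambda a: payoffs[(evil_action, a)][1])

# Normally either action can be rationalized for the other team, depending on
# what it expects Team Evil to do. But if Team Evil verifiably commits to
# "stand_firm" (its AI's utility function leaves it no option to back down),
# the other team's best response is forced:
print(best_response(["back_down", "stand_firm"], "stand_firm"))  # -> "back_down"
```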

Anyway looking back it doesn't seem like what I've written actually does much for the "bureaucratic diffusion of responsibility" scenario. I'd be interested to know concretely how this might occur. Maybe what we need is a mechanism for incentivizing red teaming/finding things that no one is responsible for/acquiring responsibility for them?

In this reaction to Critch's podcast, I wrote about some reasons to think that a singleton would be preferable to a multipolar scenario. Here's another rather exotic argument.

[The dark forest theory] is explained very well near the end of the science fiction novel, The Dark Forest by Liu Cixin.

...

When two [interstellar] civilizations meet, they will want to know if the other is going to be friendly or hostile. One side might act friendly, but the other side won't know if they are just faking it to put them at ease while armies are built in secret. This is called chains of suspicion. You don't know for sure what the other side's intentions are. On Earth this is resolved through communication and diplomacy. But for civilizations in different solar systems, that's not possible due to the vast distances and time between message sent and received. Bottom line is, every civilization could be a threat and it's impossible to know for sure, therefore they must be destroyed to ensure your survival.

Source. (Emphasis mine.)

Secure second strike is the ability to retaliate with your own nuclear strike if someone hits you with nukes. Secure second strike underpins mutually assured destruction. If nuclear war had a "first mover advantage"--where whoever launches nukes first wins because the country that is hit is unable to retaliate--that would be much worse from a game theory perspective, because there would be an incentive to be the first mover and launch a nuclear war (especially if you think your opponent might do the same).
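
A toy pair of payoff matrices (again, made-up numbers) shows the difference:

```python
# Toy model of the first-mover-advantage point. Each side chooses "strike" or
# "wait"; payoffs are (row player, column player).

# Without secure second strike: whoever strikes first wins outright.
no_second_strike = {
    ("wait",   "wait"):   ( 0,   0),
    ("strike", "wait"):   ( 5, -10),
    ("wait",   "strike"): (-10,  5),
    ("strike", "strike"): (-8,  -8),
}

# With secure second strike: any strike triggers retaliation and both sides lose.
secure_second_strike = {
    ("wait",   "wait"):   ( 0,  0),
    ("strike", "wait"):   (-9, -9),
    ("wait",   "strike"): (-9, -9),
    ("strike", "strike"): (-9, -9),
}

def best_row_action(game, col_action):
    """Row player's best response to a fixed column-player action."""
    return max(["wait", "strike"], key=lambda row: game[(row, col_action)][0])

for name, game in [("no second strike", no_second_strike),
                   ("secure second strike", secure_second_strike)]:
    print(name, {col: best_row_action(game, col) for col in ["wait", "strike"]})

# Without second strike, "strike" is the best response to "wait" (5 > 0) and to
# "strike" (-8 > -10), so both sides have an incentive to move first. With
# secure second strike, waiting is (weakly) best, so mutual waiting is stable.
```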

My understanding is that the invention of nuclear submarines was helpful for secure second strike. There is so much ocean for them to hide in that it's difficult to track and eliminate all of your opponent's nuclear submarines and ensure they won't be able to hit you back.

However, in Allan Dafoe's article AI Governance: Opportunity and Theory of Impact, he mentions that AI processing of undersea sensors could increase the risk of nuclear war (presumably because it makes it harder for nuclear submarines to hide).

Point being, we don't know what the game theory of a post-AGI world looks like. And we really don't know what interstellar game theory between different AGIs looks like. ("A colonized solar system is plausibly a place where predators can see most any civilized activities of any substantial magnitude, and get to them easily if not quickly."--source.) It might be that the best strategy is for multipolar AIs to unify into a singleton anyway.

Potential counterargument: Second-strike capabilities are still relevant in the interstellar setting. You could build a bunch of hidden ships in the Oort cloud to ram the foe and do equal devastation if the other party strikes first, deterring a first strike even with high tensions and an absence of communication. Further, while the "ram with high-relativistic objects" idea works pretty well for preemptively ending a civilization confined to a handful of planets, AIs would be able to colonize a bunch of little asteroids and KBOs and comets in the Oort cloud, and that higher level of dispersal would make preemptive total elimination less viable.

That's possible, but I'm guessing that it's not hard for a superintelligent AI to suddenly swallow an entire system using something like gray goo.