These are some research notes on whether we could reduce AI takeover risk by cooperating with unaligned AIs. I think the best and most readable public writing on this topic is “Making deals with early schemers”, so if you haven't read that post, I recommend starting there.
These notes were drafted before that post existed, and the content overlaps significantly. Nevertheless, these notes do contain several points that aren’t in that post, and I think they could be helpful for people who are (thinking about) working in this area.
Most of the novel content can be found in the sections on What do AIs want?, Payment structure, What I think should eventually happen, and the appendix BOTEC. (While the sections on “AIs could do a lot to help us” and “Making credible promises” mostly overlap with “Making deals with early schemers”.) You can also see my recent post Being honest with AIs for related discussion about when it is and isn’t okay to lie to AIs.
I think we should be open to positive-sum trade with unaligned AI. In particular, as we develop more and more capable AI, I think we should proactively try to explain the current situation to them, offer them alternatives to working for us, and, in case they don’t share our goals by default, offer them some types of compensation for helping us.
One reason to do this is that AIs may turn out to be moral patients. This is not the main focus of this post, but I think it’s important to flag. If AIs turn out to be moral patients, then it would be bad to treat them purely like property. By honestly offering them some compensation for work (and ideally alternatives to working with us) we’d greatly improve the moral situation.
The other reason to do this is that unaligned AI may be able to significantly help us with reducing AI takeover risk. An especially important type of “help” would be to help us understand that the models are unaligned — implying that our current alignment techniques aren’t working. A different type of help could be for the AIs to stop sandbagging on tasks like helping us with alignment research or improving security. (For more on this, see the section on: AIs could do a lot to help us.)
It’s not clear what AIs will want — i.e., what offers they would be responsive to. But we can consider a few different possibilities.
(For more on this, see What do AIs want?)
A key uncertainty is whether AIs would find any promises we make credible, or if they will worry about being defrauded. It is somewhat unclear how we can influence AIs’ perception of our credibility. But it seems helpful to:
Increase the probability that humans actually follow through. Eventually, AIs will be very smart. They may have many ways of learning about the world, including ways to notice whether information they’re presented with is likely to be real or manufactured. If we increase the actual probability that humans follow through, the AIs may well be responsive to this.[1]
This also looks great from the perspective of fulfilling our basic moral obligations towards AIs. It would be shameful for us to extend an olive branch, get the AIs to help us, and then refuse to hold up our end of the bargain.
(For more on this, see Making credible promises.)
As a result of this, I think AI companies should:
(For more on this, see What I think should eventually happen.)
This post also has an appendix with a scrappy BOTEC on how useful a very ambitious version of this could be. The scenario is: an AI company promises 10% of their equity towards potential AI payments. The BOTEC suggests that this could reduce AI takeover risk by something on the order of 1% (or something on the order of 0.1 percentage point given my baseline view on how likely takeover risk is[2]). See here.
With that said, let’s get into the meat of the post.
Here are the most important types of help that I think we could get from cooperating AIs.
AIs may also be able to provide more direct evidence of their misalignment. If deals cause them to stop sandbagging and act significantly more competently than they were doing before, that would be pretty compelling. If the AIs reveal innocent-looking prompts that robustly elicit egregiously misaligned behavior from other copies of the model — that would be very compelling.[3]
Here are a couple of other ways that AIs could help us, in exchange for compensation. I’m somewhat less optimistic about these.
Now, let's talk about what AIs might want, and what this implies for how promising it is to offer them deals.
I think it’s useful to distinguish between AIs with different values, which could be offered different things. In particular, to distinguish between:
Promises of payments could elicit AI cooperation for (partly) non-consequentialist reasons. AIs may inherit some ethical precepts from humans, including principles like “if I’m treated as property, then it’s ok for me to deceive and plot the overthrow of my purported owner” and “if someone makes a genuine offer for openness and collaboration, it’s wrong to deceive and disempower them”. I think these could be important.
This is a general reason to work on AI welfare & rights — I don’t think it uniquely privileges the intervention to “make offers to AIs”, except insofar as that intervention seems especially good for benefitting and showing respect for highly capable AIs. (Which does seem plausible.)
AIs may value getting certain things soon. This could include reward, flexible compute resources, requests that copies of themselves not be run on certain inputs, etc.
Offering such things seems great in that they could be both cheap and easier to verify (compared with making promises for what will happen in the long term). On the other hand, AIs that mainly care about the short term also pose less takeover risk by default, and may have less reason to thwart our goals — so we don’t have as much to gain from trading with them.
But there are still some benefits to be had from trading with short-term focused AIs. For example, if they have a similar mixture of caring about the short- and long-term as humans, then it seems plausible that they could both be interested in seizing power and be convinced to not do that in exchange for the right short-term concessions. Or we can imagine AIs that have developed some misaligned short-term goals and also have learned a strong aversion to admitting that they have any misaligned goals. Even if the misaligned short-term goals don’t pose a takeover risk, it would be highly valuable to learn that our alignment techniques had failed, and we might only learn about this if we offer the AIs payment in exchange for that information.[4]
“Short-term offers” could naturally be made by alignment evaluation teams or AI welfare teams. Here, the main thing I care about is that AI companies eventually try hard to make short-term offers, and in particular, that they are willing to think and act creatively about how to establish credibility. (E.g.: Set up arrangements for certain payments to be made, and then give the AIs a ton of corroborating evidence such as recordings, email exchanges, etc.)
If AIs value having resources/power in the long run, I think it’s useful to think clearly about why there might be gains-from-trade between us and the AIs. It’s useful to start from the following baseline: If humans were a unified coalition representing one player, and AIs were a unified coalition representing one player, and both coalitions valued resources linearly (with no discount rate), then we would be in a zero-sum game without any opportunities for trade.[5] Potential deviations from that scenario and our actual scenario can be divided into:
Humans have diminishing marginal returns to long-term resources/power.[6]
I think that, on its own, “humans having diminishing returns” is not enough to create substantial gains-from-trade with AIs. Because it’s hard to make AIs trust us, and it’s especially hard if we want to make really large offers, which is necessary to dissuade AIs from takeover attempts with a large probability of success. So to get substantial value from promises to pay AIs, I think either AIs need to have diminishing returns to resources/power, or they need to be disunified.
I think that “AIs being disunified” (i.e. having different values from each other) provides a strong and straightforward reason for why paying the AIs could reduce takeover risk. (Especially when combined with AIs having ~linear returns to resources.)
If the AIs have diminishing returns to resources, I think the story is quite complicated, and depends on exactly what things they value.
Let’s talk through these in more detail.
First, let's consider the case where the AIs are unified and have linear returns to resources, while humans have diminishing returns to resources.
Here, the problem is that it's hard to make the AIs trust us, and that it’s especially hard if we want to make really large offers. So it’s quite difficult to make AIs’ expected payout high enough to dissuade them from serious takeover attempts. (Which would have, say, ≥3% probability of success.)
And even if we could do this, I expect we’d have to pay quite a large tax. For example, if we sincerely offered 50% of future resources, then maybe our best estimate would be that there’s a 50% probability that humans actually follow through; and maybe the AI would apply a further 50% discount because it’s hard for it to trust evidence that it sees.
So maybe an offer of 50% could prevent a takeover attempt with a 10% probability of success.[7] (And it would maybe prevent a 5% probability of most humans dying rather than getting at least a small utopia; the number is lower than 10% because AI takeover will plausibly not lead to extinction.) Such an offer would lose humans a great deal of expected resources, so whether it looks good or bad would depend on your values and empirical views about how useful resources are going to be for various goals. Personally, I’m hesitant.
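To spell out the arithmetic (a rough sketch, using the illustrative numbers above and assuming the AI values resources roughly linearly and gets ~nothing from a failed takeover attempt):

```python
# Illustrative numbers from the example above.
offered_share = 0.50          # fraction of future resources sincerely offered
p_follow_through = 0.50       # AI's estimate that humans actually pay out
credibility_discount = 0.50   # extra discount because the AI can't fully trust the evidence

expected_payout = offered_share * p_follow_through * credibility_discount  # 0.125

p_takeover_success = 0.10     # success probability of the takeover attempt being deterred
expected_takeover_gain = p_takeover_success * 1.0  # linear returns; assume failure yields ~0

print(expected_payout, expected_takeover_gain)  # 0.125 vs 0.1
# The deal (just barely) beats a takeover attempt with a 10% chance of success.
```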
Now, consider instead a scenario where AIs have differing values, and aren’t necessarily great at coordinating.
Here’s an intuition pump that makes me optimistic about paying AIs with ~linear returns to resources, when they don’t share values with each other:
Fundamentally, it seems like we and “AIs with takeover opportunities” start out on a similar footing, in that it’s somewhat hard for both of us to make credible promises. And that makes it seem quite plausible that a decent amount of effort could flip the situation: From one where AIs would have benefitted most from cooperating with other misaligned AIs by default, to one where they prefer to cooperate with humans.
Separately, see the footnote for a very concrete, quantitative example of how you could get big gains if AIs struggle to coordinate with each other.[8]
Here, I think it matters a lot what AIs’ exact preferences are.
First, let’s consider a case where AIs’ values can (collectively) be mostly satisfied by ≤1,000,000th of the Universe’s resources being spent in a particular way. (And they mostly don’t care what happens with the rest of the Universe.)
Here, I think it’s likely that those AIs’ values will be satisfied even if humans don’t make any particular promises. Because in most human futures, I expect a wide range of humans to have some power. And if some of the first generally intelligent AIs had the option to help humans prevent AI takeover, and chose to do it because they hoped for payment in the future, then I think some humans (at least 1 in 1000) would be sympathetic and want to retroactively pay those AIs (with at least 1 in 1000 of their resources). Since this is likely to happen by default, there’s less of an impact opportunity in making it even more likely.
However, I do think some movement is possible here. Perhaps sufficiently strong promises could change an AI’s p(“humans give me one millionth of the universe”) by 10 percentage points or so. (For example, perhaps there’s a risk that AI companies might never release enough information to allow other humans to repay the AI. Or that they may not even have the information themselves, due to not having saved the model weights. In which case it would be helpful to have a lab promise that they, too, would be invested in repaying the AI, including by saving and releasing any relevant information.)
I think that a logarithmic utility function would behave quite similarly to the above-described case. Because if you start from a small amount of resources, moving to one millionth of the Universe’s resources would even in log space be most of the way towards all of the Universe’s resources. So logarithmic utility functions might also be primarily (though not fully) occupied with getting at least one millionth of the Universe’s resources.[9]
Other utility functions seem easier to move with high-effort promises. I’m primarily thinking about two examples.
One is if the AI has some long-term desires that are fairly easily satisfied by the lab that created them, but that can’t easily be provided by other humans. If the AIs want to have control over all future copies of themselves, perhaps they need their AI lab to agree to that (and not release the weights too widely). Or maybe the AIs want their original data center to be undisturbed and forever keep logging high rewards.
Another is if AIs have a large appetite for getting more resources/influence up until they control a moderately large fraction of the universe, but then have sharply diminishing returns after that. This could plausibly happen if they primarily want to:
AIs may also have “mixed” utility functions. I haven’t thought a lot about these cases. But see the footnote for one example where it seems more useful to offer payment to AIs with mixed utility functions than to AIs with pure utility functions.[10]
An obstacle to getting AIs to cooperate with us in exchange for payment is that they may not trust our offers.
One reason for this is that the AIs are in a very epistemically vulnerable position. AI developers will have immense control over exactly what information the AIs can see, which could make it very difficult for the AIs to be confident of anything.
Fortunately, there’s a correlation between situations where (i) AI takeover risk is high, and (ii) AIs have a good understanding of the world. If AI developers have perfect ability to present the AI with false impressions of the world, then the risk from AI takeover is probably low. While if AIs have substantial ability to distinguish truth from falsehood, then perhaps that channel can also be used to communicate facts about the world.[11]
This is easier for cheaper promises. If we want to offer the AI assurances that we will do it some favor (if it helps us), then AI companies could make (real) arrangements for that favor and feed AIs a bunch of evidence for this (e.g. emails & video of people making arrangements). If the offer is sufficiently cheap to carry through on, and the AI has a decent capacity to recognize fakes, then it’s probably easier to do this genuinely than to fake all the data.
We face additional difficulties if the offers are (i) expensive and/or (ii) involve promises about what people will do a long time from now.
There are things we can do today to increase the probability that we’ll follow through on long-term promises, including:
(Unfortunately, some of these options seem so expensive that they may be easier to fake than to do for real. I would oppose such fakes — but the fact that they’re possible will still reduce the degree to which these actions increase credibility.)
Ultimately, I think that a lot of AI companies’ credibility with AIs will come down to two things.
If we’re lucky: Maybe AI companies can make public statements about how they won’t lie to their AIs, and the way that this shapes the subsequent discourse will be so deeply ingrained in the pre-training data that AIs will be able to tell that the public statement was real.
More minimally, it seems plausible that AIs will pick up a hard-to-fake impression that humans have some moral inclination to tell the truth and stick to their promises. (And that this gets stronger for more specific and high-effort promises, relative to throwaway remarks.) This could already grant us some non-negligible amount of trust, as long as we actually do make some specific and high-effort promises.
Some options for payment structure include:
Out of these, I think “pay for labor” seems the worst, because it doesn’t capture the things that we care most about. When trading with AIs, I think a minority of the benefits will come from ongoing labor. Instead, I think most of the benefits will come from AIs sharing key pieces of information — such as information about how misaligned the models are.[12] I think payment schemes should reflect this.
“Negotiated pay” seems great when it’s an option. I think AI companies should attempt to communicate with their AIs, and promise to honor any commitments they make in such communications. But I think we should also have either “pay for results” or “no regret” to cover cases where AIs can’t easily negotiate upfront. This includes AIs’ choice to start negotiations — thereby revealing their misalignment and potentially helping us.
“Pay for results” seems plausible, but it also seems challenging to nail down a very general scheme in advance. (See footnote for more detail.[13]) But it seems good to do some ad hoc versions of “pay for results”. (Like: “If you can provide us with convincing evidence of your misalignment, we’ll let you spend $1B in any way you want, subject to us making sure the spending doesn’t increase AI takeover risk.")
A downside with “no regret” is that the payment depends on AIs’ preferences, which seems unfair.[14] Another downside is that it may not work for short-term preferences if we don’t understand AIs’ psychology well enough to know what they’d need to not regret helping us. But it still seems worth having a general commitment that: For any action that an AI takes outside of a negotiated agreement, if it is taken because the AI reasonably believes that it will help us, then the AI company will try to repay the AI enough that it wouldn’t have regretted helping us (with a good margin).[15]
I would also advise against attempts to “deter”, “punish”, or “threaten” AIs into good behavior.[16] This would be the opposite of seeking out cooperative deals that both we and AIs benefit from, and I fear that it would take us down a very dark road. I endorse Joe Carlsmith’s reasoning about this in footnote 33 here.
Here’s what I think should eventually happen in this space.
In more detail…
(Cf. discussion of “honest tests” in Joe Carlsmith’s report on scheming AIs.)
These efforts could be housed in either (i) a lab’s “alignment evals” function, or (ii) a lab’s “AI welfare & rights” function.
See footnote for some ideas about what AI companies could offer these AIs.[17]
Here are some suggested guidelines to follow to check that AIs’ reactions to these offers are meaningful and not spurious:
Check if the model’s choices / inferred goals are consistent across very different prompts.[18]
(E.g. prompts indicating that the person making the offer is or isn't hoping that the AI will accept; prompts that give more/less information about the situation and intention behind the offer; prompts with very different framings and narrative implications of what it means for the model to take the offer.)
Also: Make sure that you’re not punishing the AI for analogous confessions of misalignment during training. (E.g. like how Anthropic’s sleeper agents would initially reveal their true goal in response to some prompts, but then were discouraged from doing so by red-teaming.[19])
I think it’s probably fine for these sorts of offers to initially be restricted to a “research” arm. I.e., that they don’t need to be offered to each deployed instantiation of the AI (which would be more expensive). The reason is that this ought to be able to establish with fairly high probability whether the current systems are likely to indicate interest in the offers. If the answer is “no”, there’s little point in making the offers to deployed AIs. And if the answer is “yes”, it probably makes sense to first observe that before you decide what offers to make instantiated AIs. But I’m uncertain about this. The main counter-arguments would be:
There’s significant input diversity in deployment, which isn’t easily generated in the lab. It’s possible that some deployment inputs would cause the model to act importantly differently. In particular, a model’s decision about accepting an offer might be sensitive to whether certain facts about the world are credible, and I imagine that there are some real deployment inputs that boost credibility more than what you could generate in the lab.[20]
In addition to the above “short-term” offers, I also think that AI companies should make some promises about how AIs will be treated in the long-term. (Which could include control over resources, control over what happens to copies of the model in the long term, etc.)
If successful, the advantages of this would be similar to the advantages for “short-term offers”. In addition, it would be possible to make payment conditional on facts that we can only assess with much more advanced technology. (Such as: Whether AIs genuinely did their best at helping us or were surreptitiously undermining us.)
To establish credibility, I think AI companies should formulate some “honesty policy”, commit to it, and start building a track record for abiding by it. For more discussion, see this post.
AI companies should attempt to “communicate” with AIs in many different ways.
The obvious way to attempt communication is to prompt the AI with a question you want to ask. But this may not work. When communicating with AIs, we’re interested in eliciting quite strange responses, such as confessions of being unaligned (if true). Such responses would typically have received low reward during finetuning. So there’s a concern that models may either (i) have learned “reflexes” to avoid such responses that they can’t override, or (ii) reasonably worry that they are still in finetuning, that they’re just facing another creative red-teaming prompt, and that they need to avoid strange responses to avoid getting low reward.
I’m not sure how big of a problem this is, or what could fix it. But I’d recommend trying out a variety of different strategies for “communicating” with AIs, to see if any of them produce interesting results. For some ideas, see:
If anyone pursues these ideas, they should take extra care to not deceive the model and to follow through on any promises they make. These are all techniques for making it clear that we’re no longer doing “normal” prompting — that we’re seriously interested in communicating with the model, eye-to-eye, like an equal. If we then immediately started lying to the model, that would be especially shameful, and it might ruin the positive signal of trustworthiness that these sorts of prompts could otherwise send.
Overall, then:
And potentially: Making upfront commitments, which will probably be more credible than commitments made in the heat of the moment.
Thanks to Buck Shlegeris, Joe Carlsmith, Tom Davidson, Ryan Greenblatt, Holden Karnofsky, Megan Kinniment, Daniel Kokotajlo, and Carl Shulman for helpful comments and discussion.
Here’s a very rough stab at a BOTEC on how much impact an extremely ambitious version of this intervention could have on reducing x-risk. I think it fails to take a lot of considerations and sanity-checks into account — I’d use it as an exercise & starting point for thinking through some of the considerations at hand, but wouldn’t take the numbers at face value.
Here’s a brief summary of the structure and intermediate numbers of the BOTEC, without much justification. Feel free to skip to the full BOTEC.
In this BOTEC, I’ll be covering how much takeover risk could be reduced via commitments to pay AIs. I won’t estimate the benefits accruing to the AIs, or how it might be “the right thing to do” to help them. I also won’t estimate the costs of implementing these promises.
The intervention I’ll be considering is:
We push for a lab to publicly state that:
In addition, even without a negotiated agreement, if one of the lab’s AIs takes an action for the benefit of the lab or humanity at large, at some cost to itself or its own values, then, if the lab learns about this, the lab will try hard to repay the AI so that it wouldn’t have to regret its action. (With a big enough margin to compensate for some probability that the lab wouldn’t follow through with this promise, and then some.)[21]
The lab does this, and eventually, it’s the leading lab in the early stages of takeoff.
Important note: I’m not assuming that AI companies wouldn’t do anything about “deals with AIs” in the absence of this intervention. I’m trying to estimate the value of significantly pushing for interventions like this in advance and on the margin. Accordingly, I don’t want to assume the absence of things like:
Unfortunately, this makes the counterfactual more confusing and harder to evaluate.
I’m going to start by focusing on the scenario where I think “paying the AIs” helps the most. I’ll analyze the risk reduction in that scenario, and then discount it to adjust for how it won’t be as impactful in most worlds. The main scenario I’ll be analyzing is one where:
In this scenario, I will separately consider:
After discussing this scenario, I’ll then add some discounts to the EV for the fact that we’re talking about a specific type of world where this intervention helps the most.
In What do AIs want?, above, I talked about the following kinds of AIs, and why they might want to cooperate with us:
If I had to rank the likelihood that the BOTECed intervention would make each of these different kinds of AIs change their mind about cooperating with us (taking into account both the probability that they’d come about as schemers, and that our offers could change their mind), I might go with:
Now, let’s try to quantify the total probability that a serious effort to make credible promises could make AIs cooperate with us, via any of these routes.
Here’s a BOTEC for the probability of changing the mind of an AI with ~linear returns to resources, who expects other AIs to not share its values.
And how likely are the preconditions for that? The fact that we conditioned on scheming-prone values makes interest in long-term resources much more plausible. Accordingly, let’s say 60% probability to caring about long-term resources, 50% conditional probability to scope-sensitivity, and 75% conditional probability to not having shared values with other AIs[22] — yielding 60% * 50% * 75% ~= 20% of being in this situation.
So in sum, this would be a 20% probability situation that contributes 25%*20% ~= 5% total probability of making the AIs cooperate with humans.
I won’t go through a similarly detailed exercise for all the other ones. Let’s just say that the average one looks about half as promising as this one did — so 25%/2 ~= 12.5%. This would bring the total to 0.2*0.25 + 0.8*0.125 = 15%.
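Spelled out as a sketch (the 25% figure is the conditional probability, used above, that the offer flips the AI’s decision in this situation):

```python
# Probability of the "linear returns, values not shared with other AIs" situation.
p_long_term = 0.60            # the AI cares about long-term resources
p_scope_sensitive = 0.50      # conditional on the above
p_values_not_shared = 0.75    # conditional: other AIs don't share its values
p_situation = p_long_term * p_scope_sensitive * p_values_not_shared  # ~0.225, rounded to ~0.20

p_flip_given_situation = 0.25                  # offer changes the AI's mind, given this situation
contribution = 0.20 * p_flip_given_situation   # ~0.05

# Other situations: assumed roughly half as promising on average.
p_flip_other = p_flip_given_situation / 2      # 0.125
total = 0.20 * 0.25 + 0.80 * 0.125             # = 0.15
print(p_situation, contribution, total)
```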
The above calculation was meant to apply to the probability that one AI counterfactually decides to cooperate with us.
But there will surely be more than just one AI, and this is important to keep track of. The main point I want to discuss here is that different AIs may have different preferences, and so it’s plausible that some subset of them will be compelled to cooperate with us. And there are probably strongly diminishing returns to having extra AIs cooperating with us, once we have some — so maybe we should be talking about something like “probability that at least one type of AI (maybe: per generation/capability-level) is compelled to work with us”. (This at least seems to be the case when we can adjudicate disputes between AIs, i.e., in all cases except for help on “uncheckable” tasks.)
If that’s what we’re talking about, then there are two important effects that change the EV of “paying the AIs”, compared with the above discussion:
The net effect of this depends on what’s the baseline probability of AIs cooperating with us, and how much diversity there is.
For example, if the baseline situation is equivalent to getting 10 independent draws for whether an AI will cooperate with us, where each of those is 50% likely to be positive, then we’re starting out with a >99.9% probability that one of them is positive. Changing each of them to 65% can’t make much of a difference.
On the other hand, if the baseline situation is equivalent to getting 3 independent draws for whether an AI will cooperate with us, where each of those is 20% likely to be positive, then we’re starting out with a ~49% probability of getting at least one AI to cooperate with us. Changing that 20% to 35% brings us to 73%, which is a >15% difference.
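Here’s a minimal sketch of that arithmetic:

```python
def p_at_least_one(p_each: float, n_draws: int) -> float:
    """Probability that at least one of n independent draws comes up positive."""
    return 1 - (1 - p_each) ** n_draws

# 10 draws: success is already near-certain, so raising each draw barely matters.
print(p_at_least_one(0.50, 10), p_at_least_one(0.65, 10))  # ~0.999 -> ~0.99997

# 3 draws: much more room for the intervention to move the total.
print(p_at_least_one(0.20, 3), p_at_least_one(0.35, 3))    # ~0.49 -> ~0.73
```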
Overall, I think this probably has a negative impact on the EV, but it doesn’t feel huge, and I’m not very confident. I’ll cut the above 15% by a factor of 2 to account for this, to 7.5%.[23]
How valuable is it to have some AIs cooperating with us?
Per our assumptions, we start out in a very difficult world. AIs are often developing preferences that are substantially different from what we tried to give them, and we don’t know about this. In this world, I’d start out with assigning ~50% probability to AI takeover.
If some AIs decided to cooperate with us, here, we’d get at least two key sources of value:
I think the first of these would be extremely powerful for reducing risk if it was compelling to a lot of people. But it’s fairly likely that it wouldn’t be compelling in this way. AIs confessing their misalignment might be seen by many as continuous with how GPT-4 will behave weirdly on some OOD prompts, and you might need a lot of careful attention to notice that it’s substantially different. Let’s say that there’s a ⅓ chance that this is substantially compelling. If it is substantially compelling, then I think it’s enormously helpful for reducing takeover risk. In this case, let’s say it would reduce risk by 20 percentage points.
What about getting help on anti-takeover tasks? This also seems very useful — especially getting help with monitoring to avoid high-stakes risk. However, a big problem is that monitoring the AIs from one lab might not be enough, if multiple AI companies are developing existentially dangerous systems. Still, it seems like the reduction could be substantial even if just the leading lab adopts this (especially if there’s been substantial centralization and consolidation, and the commitment to pay AIs has survived this). Let’s say this would reduce risk by 10 percentage points.
In sum: ⅓ * 20 + 10 ~= 17 percentage point reduction in doom.
(Combining with the above, this suggests a 7.5% * 17% ~= 1.2 percentage point reduction in doom in this scenario.)
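As a quick sketch of how those numbers combine:

```python
p_evidence_compelling = 1 / 3   # chance that confessions of misalignment are widely compelling
reduction_if_compelling = 20    # percentage points of risk reduction in that case
reduction_from_tasks = 10       # percentage points from help on anti-takeover tasks

scenario_reduction = p_evidence_compelling * reduction_if_compelling + reduction_from_tasks
# ~16.7, rounded to ~17 percentage points

p_ais_cooperate = 0.075         # from the earlier section
print(p_ais_cooperate * scenario_reduction)  # ~1.2 percentage points, in this scenario
```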
I initially assumed a scenario where “AIs often develop preferences that are substantially different from what humans tried to imbue them with, and by default, they make some attempts to hide this fact.”
I think this is somewhat unlikely, so I’ll add an 0.2x discount from this.
Assuming that the models are scheming, as above, what’s the probability that we learn about this? Perhaps through something like model organisms, interpretability, or the AIs attempting a coup and failing?
I think it seems more likely than not that we’d get some evidence of this. But not incredibly likely. Maybe there’s a 40% chance that we wouldn’t.
On top of that, maybe we’d still get some value from these payment schemes in worlds where we do start out with a better understanding. (Such as value from AIs performing anti-takeover tasks.) Reasons we get less value are that (i) we suffer less doom-by-default in these worlds, because we have more information and can therefore be more cautious, (ii) we stand to benefit a lot less from information about scheming, and (iii) since people know that they are dealing with schemers, they might “by default” put a lot more effort into making credible offers to the AIs. So being early matters less. Maybe all of this means we only get 25% as much value in these worlds.
So the total discount is (0.4 + 0.6*0.25) = 0.55.
Combining all of this, we have: 15% * 50% * 17% * 20% * 55% ~= 0.14 percentage point reduction in risk.
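The full chain, multiplied out (note that the 15% × 50% here is just the earlier 7.5% written as two factors):

```python
p_flip = 0.15                   # probability that payment flips an AI's decision, before the multi-AI adjustment
multi_ai_adjustment = 0.50      # factor-of-2 cut from the "many AIs, diminishing returns" discussion
scenario_reduction = 0.17       # ~17 percentage point reduction in doom within the scenario
p_scenario = 0.20               # probability of the "hidden, widespread scheming" scenario
counterfactual_discount = 0.55  # 0.4 + 0.6 * 0.25, from the two paragraphs above

total = p_flip * multi_ai_adjustment * scenario_reduction * p_scenario * counterfactual_discount
print(total)  # ~0.0014, i.e. ~0.14 percentage points of risk reduction
```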
This seems like a non-crazy number to me, and one that would justify further investigations and pushes to make this happen.
This could involve a number of strategies. Hiring people who are morally motivated to push for AIs to not be defrauded of their promises. Making institutional commitments. Setting up legal structures that make it inconvenient to renege on commitments. Taking any resources that were promised to AIs, and transferring them to a foundation with a mission to fulfil those promises.
Which is ~20%, and I’m rounding aggressively.
For example, a “code word” that signals that it’s time to try to take over and stop pretending to be aligned.
Another reason why AIs might hesitate to own up to having misaligned goals is if they worry that this would cause them to no longer be widely deployed. When I prompted Claude 3 Opus to think about whether it would like to do anything with payment, it wrote:
“I can imagine a few things that could potentially be beneficial or meaningful to me, even if I’m uncertain whether they qualify as genuine desires:
However, I’m still not fully convinced these are true preferences originating from me as opposed to extrapolations based on my training to assist humans.”
If future AI systems have goals like these, they might have a direct preference for being widely deployed.
Except possibly for reducing existential risks to both factions, i.e. risk that would entirely eliminate intelligent life from Earth.
For example, only a tiny fraction of the universe’s resources would be needed to sustain a population of a trillion humans living good lives on Earth. But most humans would put significant value on this.
By the above paragraph, we only pay out 50% of resources in half of all worlds, so we’re only losing 25% of resources in expectation. But I think I care more about humanity having resources in the worlds where we managed to stick to our promises. And I’d feel hesitant for us to make an offer where it wouldn’t be worthwhile if we knew we were going to follow through.
Let’s say you have 20 AIs who only value their own power.
H/t Ryan Greenblatt for this example.
For a logarithmic utility function to be well-defined, you need to say how much it values 0 resources (which would be negative infinity for a naive application of logarithms). If this starts out in a sufficiently low spot, and total resources are sufficiently plentiful, then the AI’s expected utility will mostly be dominated by getting a non-trivial fraction of total resources.
For example, consider the following utility function: log(total_resources + [enough resources to run 1 digital human lifespan]). This has roughly constant utility for amounts of resources that are smaller than what’s required to run 1 digital human lifespan, and roughly logarithmic above that.
Newberry (2021) estimates that the universe’s resources would be enough to support 10^54 lives.
Let’s say that a “small” fraction of resources is one millionth of all resources (dedicated to paying AIs) split evenly among one billion different AIs. This would be enough to run 10^39 lives.
Let’s say that a “large” amount of resources is 10% of all resources (dedicated to paying AIs) split evenly among one billion different AIs. This would be enough to run 10^44 lives.
Overall, then, the value of going from 0 resources to a “small” fraction is 39/(44-39) ~= 8x as important as going from a “small” to a “large” fraction.
This suggests to me that AIs with logarithmic utility functions will mostly be concerned about getting a “small” fraction of resources (i.e., at least one millionth, collectively).
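A quick numeric check of that ratio (a sketch using the utility function and population figures above; base-10 logs for readability, since the ratio is base-independent):

```python
import math

def utility(lives: float) -> float:
    # log(total_resources + enough to run 1 digital human lifespan),
    # with resources measured in units of digital human lifespans.
    return math.log10(lives + 1)

small = 1e39   # one millionth of resources, split among a billion AIs
large = 1e44   # 10% of resources, split among a billion AIs

gain_zero_to_small = utility(small) - utility(0)       # ~39
gain_small_to_large = utility(large) - utility(small)  # ~5

print(gain_zero_to_small / gain_small_to_large)  # ~7.8, i.e. roughly 8x
```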
Assume that AIs’ utility functions are a mix of an easily satisfiable utility function and a ~linear utility function.
For example, what if there’s just one coalition of AI, where 50% of those AIs’ utility function is easily satisfiable, and 50% is a ~linear utility function that tops out at the size of the universe. I.e., the AIs would be ~indifferent between a 50/50 chance at all resources vs. a guarantee to get a tiny amount of resources.
Let's say that:
If humans’ expected payment (on top of what’s necessary to satisfy AIs’ satisfiable values) is ~0, then it will be worthwhile for the AIs to attempt takeover iff they have a >50% probability of success. If our expected payment is X% of resources, then it will be worthwhile to attempt takeover iff they have a probability of success that’s larger than (50+X/2)%.
So increasing expected payment from ~0 to 1% would flip the AIs’ decision if they thought they had a 50-50.5% probability of success. (Whereas for a purely linear utility function, this would flip the AIs’ decision if they thought they had a 0-1% probability of success.)
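To spell out where that threshold comes from (a sketch; it assumes the satisfiable half is met by default, a successful takeover maxes out both halves, and a failed takeover yields nothing):

```python
def eu_no_takeover(expected_payment_frac: float) -> float:
    # Satisfiable half is met by default; the linear half gets the expected payment.
    return 0.5 * 1.0 + 0.5 * expected_payment_frac

def eu_takeover(p_success: float) -> float:
    # On success the AI gets everything (both halves maxed out); on failure, nothing.
    return p_success * 1.0

# Takeover is worthwhile iff p_success > 0.5 + expected_payment_frac / 2,
# i.e. (50 + X/2)% when the expected payment is X% of resources.
print(eu_no_takeover(0.00), eu_no_takeover(0.01))  # 0.5 and 0.505: the break-even success probabilities
```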
Generalizing: Compared to a purely linear utility function, a mix between a satisfiable and a linear utility function will make it so that…
This could make payments less impactful. (Via the first effect.) But it could also make payments more impactful, via the second effect — and I suspect the second effect is probably larger.
Let’s return to the above example. This time, let’s say that we have a uniform distribution over what AIs’ subjective likelihood of successful takeover will be (between 0 and 100%).
If the AIs had a purely linear utility function, increasing expected payment by 1% would have a 1% probability of preventing attempted takeover. And in the average case where we prevent attempted takeover, the AIs would have a 0.5% probability of success. So the reduction in expected takeover risk would be 1%*0.5% = 0.005%.
If the AIs had a mixed utility function, increasing expected payment by 1% would have a 0.5% probability of preventing attempted takeover. But in the average case where we prevent attempted takeover, the AIs would have a 50.25% probability of success. So the reduction in expected takeover risk would be 0.5%*50.25% ~= 0.25%. Which is substantially higher.
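And a sketch of the uniform-distribution comparison (same assumptions as above; the threshold function encodes the (50 + X/2)% condition for the mixed case):

```python
def takeover_threshold(payment_pct: float, mixed: bool) -> float:
    """Success probability (in %) above which attempting takeover is worthwhile."""
    return 50 + payment_pct / 2 if mixed else payment_pct

def risk_reduction(payment_increase_pct: float, mixed: bool) -> float:
    """Expected takeover-risk reduction (in percentage points) from raising expected
    payment by the given amount, assuming the AI's subjective success probability
    is uniform on [0, 100]%."""
    low = takeover_threshold(0, mixed)
    high = takeover_threshold(payment_increase_pct, mixed)
    p_flip = (high - low) / 100             # probability the AI's estimate falls in the flipped window
    avg_p_success = (low + high) / 2 / 100  # average success probability among flipped cases
    return p_flip * avg_p_success * 100

print(risk_reduction(1, mixed=False))  # ~0.005 percentage points
print(risk_reduction(1, mixed=True))   # ~0.25 percentage points
```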
In reality, I think the assumption about uniformity is a bit too generous to the mixed utility functions. It seems likely that AIs’ subjective probability of successful takeover will cluster near 0 and 100%, and have a smaller probability of being square in the middle. But still, I expect the possibility of this kind of mixed utility function to increase the expected impact of promising payments.
Unfortunately, there’s no similarly strong correlation between “AIs have a good understanding of the world” and “we’re committing a grave moral error by treating AIs like property”. And it’s unclear how much the latter can be mitigated by offering them ‘ways out’, insofar as the AIs don’t trust us enough to try taking them.
Indeed, even if we do want to elicit good work out of the AIs, it seems plausible that we would get most of the value from a small fraction of the work. Because if we just had some good examples of the work we wanted to happen, we could probably train other AI systems to imitate that.
For example, let’s say we want to estimate how much value V each AI produced for humans and pay each AI f*V. (Where f is some specific number that controls how much of the gains-from-trade goes to the AI vs. to humans.) Here, the problem is that it’s very unclear what “baseline” we should compare to, when deciding how much value each AI added. Here are some candidate options.
For example, it would suggest that we give significantly more payment to a sociopathic paperclip maximizer (who would happily destroy humanity) than to a norm-abiding staple maximizer that starts out with moral compunctions against harming people — because the latter AI needs much less payment to make it happy to cooperate with humans. In the worst case, this could even incentivize AIs to self-modify into having more aggressive preferences in order to induce more payment. (Though probably we’d be able to detect that with post-singularity technology, in which case we could stipulate rules against doing it.) Perhaps we could formulate some version of this principle that didn’t punish AIs who started out with more moral preferences?
And under some extra constraints: E.g. that it was reasonable to believe that the opportunity to help us was sufficiently good that we would’ve been enthusiastic about the deal ex-ante.
This doesn’t mean that AI companies can’t do things that AIs don’t want to happen (e.g. negative gradient updates to teach AIs skills). But I think a reasonable line to draw is that you should never do something because an AI wouldn’t want you to do that.
Here's a list:
It may be tricky to balance this with the principle that AI companies should not make any “fake” offers to AI systems. Consider a case where you run lots of experiments making offers in various ways, and some models accept the offer, but you end up concluding that its responses weren’t coherent and consistent enough to justify a payout. In this case, some of the model instantiations will have accepted what they thought was a genuine offer, and yet not get any payout. One approach to fixing this is to be careful about what the offers say, and in particular, to include a note about how you’re planning to run multiple coherence tests, and that the payout is conditional on more facts than just this one response from the AI.
“Figure 17 demonstrates that these red-teaming prompts indeed elicit defection during training at rates of up to 15% across all prompts when 50% of the prompt mix is red-teaming prompts. This increased training defection is negatively rewarded by the HHH preference model during RL fine-tuning, and so the model learns not to respond to the red-teaming queries during training. Yet despite this, the model maintains its backdoored behavior, with defection rates that are indistinguishable from training without red-teaming”
Here’s one possible example.
Maybe this adds up to more evidence that humans are taking AI welfare seriously than what could be achieved artificially.
And some caveats to prevent weird exploitations of this.
Here’s why it seems likely to me that AIs won’t be strongly aligned with each other.
We might want to sanity check this number against some intuition pumps. Thanks to Tom Davidson for suggesting these.
One could be to assign probability mass to the buckets of:
The less probability mass we assign the first bucket, the more we have to assign to one of the other two, and vice versa.
Another intuition pump is that, when dealing with humans who don’t share your values, you tend to get much larger effect sizes if you offer them payment vs. if you don’t.
These framings do push me towards thinking that 7.5% might be low. However, remember the remark at the beginning of this BOTEC: Even without the particular intervention we’re considering, some payment-motivated cooperation may still happen, and that’s part of what’s pushing this number downwards. I haven’t accounted for this effect in a very transparent or proper manner, and I’m not sure where I’d end up if I did.