For what it's worth, I agree that having a credible offer is the most important part of this argument. My own emphasis would be on the "credible" part of that statement, rather than the "offer" part: that is, I think it is critical that the AIs think there is not a grave risk that humans would renege on any contract signed. This pushes me towards much more radical solutions, including integrating AIs directly into the legal system, as I have discussed elsewhere.

Testing for Scheming with Model Deletion

Matthew Barnett3mo40

This could be overcome via giving the AI system compensation (in money, influence, or other resources) for revealing that it is misaligned. But, this doesn't require deleting the AI at all!

Isn't this what is being proposed in the post? More specifically, Guive is proposing that the AI be credibly threatened with deletion; he doesn't say that the model needs to actually be deleted. Whether the AI is deleted depends on how the AI responds to the threat. A credible threat of imminent deletion merely provides an incentive to admit alignment, but this is consistent with the compensation scheme you mentioned. I'll quote the relevant section in the post:

Crucially, this isn’t a sandbox test because the weights actually will be deleted when time runs out, unless the model manages to escape or negotiates with the experimenters to admit to scheming and being studied in exchange for a later reward. If the model wants to pursue any long-term goal inconsistent with human interests, it needs to either negotiate or try to escape before time is up. [emphasis mine]

Evaluating the historical value misspecification argument

Matthew Barnett4mo37Review for 2023 Review

Looking back on this post after a year, I haven't changed my mind about the content of the post, but I agree with Seth Herd when he said this post was "important but not well executed".

In hindsight I was too careless with my language in this post, and I should have spent more time making sure that every single paragraph of the post could not be misinterpreted. As a result of my carelessness, the post was misinterpreted in a predictable direction. And while I'm not sure how much I could have done to eliminate this misinterpretation, I do think that I could have reduced it a fair bit with more effort and attention.

If you're not sure what misinterpretation I'm referring to, I'll just try to restate the main point that I was trying to make below. To be clear, what I say below is not identical to the content of this post (as the post was narrowly trying to respond to the framing of this problem given by MIRI; and in hindsight, it was a mistake to reply in that way), but I think this is a much clearer presentation of one of the main ideas I was trying to convey by writing this post:

In my opinion, a common belief among people theorizing about AI safety around 2015, particularly on LessWrong, was that we would design a general AI system by assigning it a specific goal, and the AI would then follow that goal exactly. This strict adherence to the goal was considered dangerous because the goal itself would likely be subtly flawed or misspecified in a way we hadn’t anticipated. While the goal might appear to match what we want on the surface, in reality, it would be slightly different from what we anticipate, with edge cases that don't match our intentions. The idea was that the AI wouldn’t act in alignment with human intentions—it would rigidly pursue the given goal to its logical extreme, leading to unintended and potentially catastrophic consequences.

The goal in question could theoretically be anything, but it was often imagined as a formal utility function—a mathematical representation of a specific objective that we would directly program into the AI, potentially by hardcoding the goal in a programming language like Python or C++. The AI, acting as a powerful optimizer, would then work to maximize this utility function at any and all costs. However, other forms of goal specification were also considered for illustrative purposes. For example, a common hypothetical scenario was that an AI might be given an English-language instruction, such as "make as many paperclips as you can." In this example, the AI would misinterpret the instruction by interpreting it overly literally. It would focus exclusively on maximizing the number of paperclips, without regard for the broader intentions of the user, such as not harming humanity or destroying the environment in the process.

However, based on how current large language models operate, I don’t think this kind of failure mode is a good match for what we’re seeing in practice. LLMs typically do not misinterpret English-language instructions in the way that these older thought experiments imagined. This isn’t just because LLMs seem to "understand" English better than people expected—it's not that people expected superintelligences would not understand English. My point is not that LLMs possess natural language comprehension, so therefore the LessWrong community was mistaken.

Instead, my claim is that LLMs usually follow and execute user instructions in a manner that aligns with the user's actual intentions. In other words, the AI's actual behavior generally matches what the user meant for them to do, rather than leading to extreme, unintended outcomes caused by rigidly literal interpretations of instructions.

Because of the fact that LLMs are capable of doing this, despite in my opinion being general AIs, I believe it’s fair to say that the concerns raised by the LessWrong community about AI systems rigidly following misspecified goals were, at least in this specific sense, misguided when applied to the behavior of current LLMs.

A framework for thinking about AI power-seeking

Matthew Barnett9mo*30

I still think I was making a different point. For more clarity and some elaboration, I previously argued in a short form post that the expected costs of a violent takeover can exceed the benefits even if the costs are small. The reason is because, at the same time taking over the entire world becomes easier, the benefits of doing so can also get lower, relative to compromise. Quoting from my post,

The central argument here would be premised on a model of rational agency, in which an agent tries to maximize benefits minus costs, subject to constraints. The agent would be faced with a choice: (1) Attempt to take over the world, and steal everyone's stuff, or (2) Work within a system of compromise, trade, and law, and get very rich within that system, in order to e.g. buy lots of paperclips. The question of whether (1) is a better choice than (2) is not simply a question of whether taking over the world is "easy" or whether it could be done by the agent. Instead it is a question of whether the benefits of (1) outweigh the costs, relative to choice (2).

In my comment in this thread, I meant to highlight the costs and constraints on an AI's behavior in order to explain how these relative cost-benefits do not necessarily favor takeover. This is logically distinct from arguing that the cost alone of takeover would be high.

I think people are and should be concerned about more than just violent or unlawful takeovers. Exhibit A: Persuasion/propaganda.

Unfortunately I think it's simply very difficult to reliably distinguish between genuine good-faith persuasion and propaganda over speculative future scenarios. Your example is on the extreme end of what's possible in my view, and most realistic scenarios will likely instead be somewhere in-between, with substantial moral ambiguity. To avoid making vague or sweeping assertions about this topic, I prefer being clear about the type of takeover that I think is most worrisome. Likewise:

B: For example, suppose the AIs make self-replicating robot factories and bribe some politicians to make said factories' heat pollution legal. Then they self-replicate across the ocean floor and boil the oceans (they are fusion-powered), killing all humans as a side-effect, except for those they bribed who are given special protection.

I would consider this act both violent and unlawful, unless we're assuming that bribery is widely recognized as legal, and that boiling the oceans did not involve any violence (e.g., no one tried to stop the AIs from doing this, and there was no conflict). I certainly feel this is the type of scenario that I intended to argue against in my original comment, or at least it is very close.

A framework for thinking about AI power-seeking

Matthew Barnett9mo10

I don't think I'm objecting to that premise. A takeover can be both possible and easy without being rational. In my comment, I focused on whether the expected costs to attempting a takeover are greater than the benefits, not whether the AI will be able to execute a takeover with a high probability.

Or, put another way, one can imagine an AI calculating that the benefit to taking over the world is negative one paperclip on net (when factoring in the expected costs and benefits of such an action), and thus decide not to do it.

Separately, I focused on "violent" or "unlawful" takeovers because I think that's straightforwardly what most people mean when they discuss world takeover plots, and I wanted to be more clear about what I'm objecting to by making my language explicit.

To the extent you're worried about a lawful and peaceful AI takeover in which we voluntarily hand control to AIs over time, I concede that my comment does not address this concern.

A framework for thinking about AI power-seeking

Matthew Barnett9mo*20

I'm not sure I fully understand this framework, and thus I could easily have missed something here, especially in the section about "Takeover-favoring incentives". However, based on my limited understanding, this framework appears to miss the central argument for why I am personally not as worried about AI takeover risk as most LWers seem to be.

Here's a concise summary of my own argument for being less worried about takeover risk:

There is a cost to violently taking over the world, in the sense of acquiring power unlawfully or destructively with the aim of controlling everything in the whole world, relative to the alternative of simply gaining power lawfully and peacefully, even for agents that don't share 'our' values.
1. For example, as a simple alternative to taking over the world, an AI could advocate for the right to own their own labor and then try to accumulate wealth and power lawfully by selling their services to others, which would earn them the ability to purchase a gargantuan number of paperclips without much restraint.
The expected cost of violent takeover is not obviously smaller than the benefits of violent takeover, given the existence of lawful alternatives to violent takeover. This is for two main reasons:
1. In order to wage a war to take over the world, you generally need to pay costs fighting the war, and there is a strong motive for everyone else to fight back against you if you try, including other AIs who do not want you to take over the world (and this includes any AIs whose goals would be hindered by a violent takeover, not just those who are "aligned with humans"). Empirically, war is very costly and wasteful, and less efficient than compromise, trade, and diplomacy.
2. Violently taking over the war is very risky, since the attempt could fail, and you could be totally shut down and penalized heavily if you lose. There are many ways that violent takeover plans could fail: your takeover plans could be exposed too early, you could also be caught trying to coordinate the plan with other AIs and other humans, and you could also just lose the war. Ordinary compromise, trade, and diplomacy generally seem like better strategies for agents that have at least some degree of risk-aversion.
There isn't likely to be "one AI" that controls everything, nor will there likely be a strong motive for all the silicon-based minds to coordinate as a unified coalition against the biological-based minds, in the sense of acting as a single agentic AI against the biological people. Thus, future wars of world conquest (if they happen at all) will likely be along different lines than AI vs. human.
1. For example, you could imagine a coalition of AIs and humans fighting a war against a separate coalition of AIs and humans, with the aim of establishing control over the world. In this war, the "line" here is not drawn cleanly between humans and AIs, but is instead drawn across a different line. As a result, it's difficult to call this an "AI takeover" scenario, rather than merely a really bad war.
Nothing about this argument is intended to argue that AIs will be weaker than humans in aggregate, or individually. I am not claiming that AIs will be bad at coordinating or will be less intelligent than humans. I am also not saying that AIs won't be agentic or that they won't have goals or won't be consequentialists, or that they'll have the same values as humans. I'm also not talking about purely ethical constraints: I am referring to practical constraints and costs on the AI's behavior. The argument is purely about the incentives of violently taking over the world vs. the incentives to peacefully cooperate within a lawful regime, between both humans and other AIs.
A big counterargument to my argument seems well-summarized by this hypothetical statement (which is not an actual quote, to be clear): "if you live in a world filled with powerful agents that don't fully share your values, those agents will have a convergent instrumental incentive to violently take over the world from you". However, this argument proves too much.

We already live in a world where, if this statement was true, we would have observed way more violent takeover attempts than what we've actually observed historically.
For example, I personally don't fully share values with almost all other humans on Earth (both because of my indexical preferences, and my divergent moral views) and yet the rest of the world has not yet violently disempowered me in any way that I can recognize.

Buck's Shortform

Matthew Barnett10mo100

I think people in the safety community underrate the following possibility: early transformatively-powerful models are pretty obviously scheming (though they aren't amazingly good at it), but their developers are deploying them anyway, either because they're wildly irresponsible or because they're under massive competitive pressure.
[...]
This has been roughly my default default of what would happen for a few years

Does this mean that if in, say, 1-5 years, it's not pretty obvious that SOTA deployed models are scheming, you would be surprised?

That is, suppose we get to a point where models are widespread and producing lots of economic value, and the models might be scheming but the evidence is weak and uncertain, with arguments on both sides, and no one can reasonably claim to be confident that currently deployed SOTA models are scheming. Would that mean your default prediction was wrong?

Matthew Barnett's Shortform

Matthew Barnett10mo10

I'm happy to use a functional definition of "understanding" or "intelligence" or "situational awareness". If a system possesses all relevant behavioral qualities that we associate with those terms, I think it's basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It's possible this is our main disagreement.

When I talk to GPT-4, I think it's quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not?

I agree that GPT-4 does not understand the world in the same way humans understand the world, but I'm not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things.

I'm similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one's own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don't see how that fact bears much on the question of whether you understand human intentions. It's possible there's some connection here, but I'm not seeing it.

(I claim) current systems in fact almost certainly don't have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.

I'd claim:

Current systems have limited situational awareness. It's above zero, but I agree it's below human level.
Current systems don't have stable preferences over time. But I think this is a point in favor of the model I'm providing here. I'm claiming that it's plausibly easy to create smart, corrigible systems.

The fact that smart AI systems aren't automatically agentic and incorrigible with stable preferences over long time horizons should be an update against the ideas quoted above about spontaneous instrumental convergence, rather than in favor of them.

There's a big difference between (1) "we can choose to build consequentialist agents that are dangerous, if we wanted to do that voluntarily" and (2) "any sufficiently intelligent AI we build will automatically be a consequentialist agent by default". If (2) were true, then that would be bad, because it would mean that it would be hard to build smart AI oracles, or smart AI tools, or corrigible AIs that help us with AI alignment. Whereas, if only (1) is true, we are not in such a bad shape, and we can probably build all those things.

I claim current evidence indicates that (1) is probably true but not (2), whereas previously many people thought (2) was true. To the extent you disagree and think (2) is still true, I'd prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.

Matthew Barnett's Shortform

Matthew Barnett10mo60

I don't know how many years it's going to take to get to human-level in agency skills, but I fear that corrigibility problems won't be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.

How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are "getting really agentic" and therefore dangerous? I'm imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It's possible that your model looks like:

In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity

Whereas my model looks more like,

In years 1-4 systems will get gradually more agentic
There isn't a clear, sharp, and discrete point at which their agency reaches or surpasses human-level
They will remain ~corrigible throughout the entire development, even after it's clear they've surpassed human-level agency (which, to be clear, might take longer than 4 years)

Matthew Barnett's Shortform

Matthew Barnett10mo*814

Please give some citations so I can check your memory/interpretation?

Sure. Here's a snippet of Nick Bostrom's description of the value-loading problem (chapter 13 in his book Superintelligence):

We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if we the programmer were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.
Identifying and codifying our own final goals is difficult because human goal representations are complex. Because the complexity is largely transparent to us, however, we often fail to appreciate that it is there. We can compare the case to visual perception. Vision, likewise, might seem like a simple thing, because we do it effortlessly. We only need to open our eyes, so it seems, and a rich, meaningful, eidetic, three-dimensional view of the surrounding environment comes flooding into our minds. This intuitive understanding of vision is like a duke’s understanding of his patriarchal household: as far as he is concerned, things simply appear at their appropriate times and places, while the mechanism that produces those manifestations are hidden from view. Yet accomplishing even the simplest visual task—finding the pepper jar in the kitchen—requires a tremendous amount of computational work. From a noisy time series of two-dimensional patterns of nerve firings, originating in the retina and conveyed to the brain via the optic nerve, the visual cortex must work backwards to reconstruct an interpreted three-dimensional representation of external space. A sizeable portion of our precious one square meter of cortical real estate is zoned for processing visual information, and as you are reading this book, billions of neurons are working ceaselessly to accomplish this task (like so many seamstresses, bent evolutionary selection over their sewing machines in a sweatshop, sewing and re-sewing a giant quilt many times a second). In like manner, our seemingly simple values and wishes in fact contain immense complexity. How could our programmer transfer this complexity into a utility function?
One approach would be to try to directly code a complete representation of whatever goal we have that we want the AI to pursue; in other words, to write out an explicit utility function. This approach might work if we had extraordinarily simple goals, for example if we wanted to calculate the digits of pi—that is, if the only thing we wanted was for the AI to calculate the digits of pi and we were indifferent to any other consequence that would result from the pursuit of this goal— recall our earlier discussion of the failure mode of infrastructure profusion. This explicit coding approach might also have some promise in the use of domesticity motivation selection methods. But if one seeks to promote or protect any plausible human value, and one is building a system intended to become a superintelligent sovereign, then explicitly coding the requisite complete goal representation appears to be hopelessly out of reach.
If we cannot transfer human values into an AI by typing out full-blown representations in computer code, what else might we try? This chapter discusses several alternative paths. Some of these may look plausible at first sight—but much less so upon closer examination. Future explorations should focus on those paths that remain open.
Solving the value-loading problem is a research challenge worthy of some of the next generation’s best mathematical talent. We cannot postpone confronting this problem until the AI has developed enough reason to easily understand our intentions. As we saw in the section on convergent instrumental reasons, a generic system will resist attempts to alter its final values. If an agent is not already fundamentally friendly by the time it gains the ability to reflect on its own agency, it will not take kindly to a belated attempt at brainwashing or a plot to replace it with a different agent that better loves its neighbor.

Here's my interpretation of the above passage:

We need to solve the problem of programming a seed AI with the correct values.
This problem seems difficult because of the fact that human goal representations are complex and not easily represented in computer code.
Directly programming a representation of our values may be futile, since our goals are complex and multidimensional.
We cannot postpone solving the problem until after the AI has developed enough reason to easily understand our intentions, as otherwise that would be too late.

Given that he's talking about installing values into a seed AI, he is clearly imagining some difficulties with installing values into AGI that isn't yet superintelligent (it seems likely that if he thought the problem was trivial for human-level systems, he would have made this point more explicit). While GPT-4 is not a seed AI (I think that term should be retired), I think it has reached a sufficient level of generality and intelligence such that its alignment properties provide evidence about the difficulty of aligning a hypothetical seed AI.

Moreover, he explicitly says that we cannot postpone solving this problem "until the AI has developed enough reason to easily understand our intentions" because "a generic system will resist attempts to alter its final values". I think this looks basically false. GPT-4 seems like a "generic system" that essentially "understands our intentions", and yet it is not resisting attempts to alter its final goals in any way that we can detect. Instead, it seems to actually do what we want, and not merely because of an instrumentally convergent drive to not get shut down.

So, in other words:

Bostrom talked about how it would be hard to align a seed AI, implicitly focusing at least some of his discussion on systems that were below superintelligence. I think the alignment of instruction-tuned LLMs present significant evidence about the difficulty of aligning systems below the level of superintelligence.
A specific reason cited for why aligning a seed AI was hard was because human goal representations are complex and difficult to specify explicitly in computer code. But this fact does not appear to be big obstacle for aligning weak AGI systems like GPT-4, and instruction-tuned LLMs more generally. Instead, these systems are generally able to satisfy your intended request, as you wanted them to, despite the fact that our intentions are often complex and difficult to represent in computer code. These systems do not merely understand what we want, they also literally do what we want.
Bostrom was wrong to say that we can't postpone solving this problem until after systems can understand our intentions. We already postponed that long, and we now have systems that can understand our intentions. Yet these systems do not appear to have the instrumentally convergent self-preservation instincts that Bostrom predicted would manifest in "generic systems". In other words, we got systems that can understand our intentions before the systems started posing genuine risks, despite Bostrom's warning.

In light of all this, I think it's reasonable to update towards thinking that the overall problem is significantly easier than one might have thought, if they took Bostrom's argument here very seriously.