Ability to solve long-horizon tasks correlates with wanting things in the behaviorist sense

So8res

AI ALIGNMENT FORUM

24th Nov 2023

Review by

Steven Byrnes

Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.

Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?

(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it’s being eroded, etc.)

And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?

(Modulo, e.g., the fact that it can play chess pretty well, which indicates a certain type of want-like behavior in the behaviorist sense. An AI's ability to win no matter how you move is the same as its ability to reliably steer the game-board into states where you're check-mated, as though it had an internal check-mating “goal” it were trying to achieve. This is again a quantitative gap that’s being eroded.)

Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.

Relatedly: to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the "behaviorist sense" expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise. Because the way to achieve long-horizon targets in a large, unobserved, surprising world that keeps throwing wrenches into one's plans, is probably to become a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target no matter what wrench reality throws into its plans.

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".

I make no claim about the AI's internal states and whether those bear any resemblance to the internal state of a human consumed by the feeling of desire. To paraphrase something Eliezer Yudkowsky said somewhere: we wouldn't say that a blender "wants" to blend apples. But if the blender somehow managed to spit out oranges, crawl to the pantry, load itself full of apples, and plug itself into an outlet, then we might indeed want to start talking about it as though it has goals, even if we aren’t trying to make a strong claim about the internal mechanisms causing this behavior.

If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I'll say it "wants" that outcome “in the behaviorist sense”.

Why might we see this sort of "wanting" arise in tandem with the ability to solve long-horizon problems and perform long-horizon tasks?

Because these "long-horizon" tasks involve maneuvering the complicated real world into particular tricky outcome-states, despite whatever surprises and unknown-unknowns and obstacles it encounters along the way. Succeeding at such problems just seems pretty likely to involve skill at figuring out what the world is, figuring out how to navigate it, and figuring out how to surmount obstacles and then reorient in some stable direction.

(If each new obstacle causes you to wander off towards some different target, then you won’t reliably be able to hit targets that you start out aimed towards.)
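
As a rough illustration of the parenthetical above (not something from the original post), here is a toy sketch in Python. Everything in it is an assumption made for illustration: the 1-D world, the hypothetical names `run_episode` and `success_rate`, the knock probability, and the tolerance. One agent keeps its original aim after every knock; the other picks a fresh random aim each time it is knocked. The behaviorist notion of "wanting" then corresponds to how often the agent ends up near the original target across many perturbed runs.

```python
import random


def run_episode(reorients: bool, seed: int, target: int = 10,
                steps: int = 200, knock_prob: float = 0.2,
                tolerance: int = 5) -> bool:
    """Return True if the agent ends the episode near the original target."""
    rng = random.Random(seed)
    position = 0
    aim = target  # what the agent is currently steering toward
    for _ in range(steps):
        # "Wrench": with some probability, reality shoves the agent off course.
        if rng.random() < knock_prob:
            position += rng.choice([-3, 3])
            if not reorients:
                # The wandering agent treats each obstacle as a cue to
                # chase some new, unrelated target.
                aim = rng.randint(-20, 20)
        # Both agents then take one step toward whatever they currently aim at.
        if position < aim:
            position += 1
        elif position > aim:
            position -= 1
    return abs(position - target) <= tolerance


def success_rate(reorients: bool, episodes: int = 500) -> float:
    """Fraction of episodes (one per seed) that end near the original target."""
    wins = sum(run_episode(reorients, seed) for seed in range(episodes))
    return wins / episodes


if __name__ == "__main__":
    print("reorienting agent:", success_rate(reorients=True))
    print("wandering agent:  ", success_rate(reorients=False))
```

In this toy setup the two agents have identical "skills" (both can step toward an aim); the only difference is whether the aim is stable under perturbation, and that alone should separate their hit rates.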

If you're the sort of thing that skillfully generates and enacts long-term plans, and you're the sort of planner that sticks to its guns and finds a way to succeed in the face of the many obstacles the real world throws your way (rather than giving up or wandering off to chase some new shiny thing every time a new shiny thing comes along), then the way I think about these things, it's a little hard to imagine that you don't contain some reasonably strong optimization that strategically steers the world into particular states.

(Indeed, this connection feels almost tautological to me, such that it feels odd to talk about these as distinct properties of an AI. "Does it act as though it wants things?" isn’t an all-or-nothing question, and an AI can be partly goal-oriented without being maximally goal-oriented. But the more the AI’s performance rests on its ability to make long-term plans and revise those plans in the face of unexpected obstacles/opportunities, the more consistently it will tend to steer the things it's interacting with into specific states—at least, insofar as it works at all.)

The ability to keep reorienting towards some target seems like a pretty big piece of the puzzle of navigating a large and complex world to achieve difficult outcomes.

And this intuition is backed up by the case of humans: it's no mistake that humans wound up having wants and desires and goals—goals that they keep finding clever new ways to pursue even as reality throws various curveballs at them, like “that prey animal has been hunted to extinction”.

These wants and desires and goals weren’t some act of a god bequeathing souls into us; this wasn't some weird happenstance; having targets like “eat a good meal” or “impress your friends” that you reorient towards despite obstacles is a pretty fundamental piece of being able to eat a good meal or impress your friends. So it's no surprise that evolution stumbled upon that method, in our case.

(The implementation specifics in the human brain—e.g., the details of our emotional makeup—seem to me like they're probably fiddly details that won’t recur in an AI that has behaviorist “desires”. But the overall "to hit a target, keep targeting it even as you encounter obstacles" thing seems pretty central.)

The above text vaguely argues that doing well on tough long-horizon problems requires pursuing an abstract target in the face of a wide array of real-world obstacles, which involves doing something that looks from the outside like “wanting stuff”. I’ll now make a second claim (supported here by even less argument): that the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular.

For instance, humans find themselves wanting things like good meals and warm nights and friends who admire them. And all those wants added up in the ancestral environment to high inclusive genetic fitness. Observing early hominids from the outside, aliens might have said that the humans are “acting as though they want to maximize their inclusive genetic fitness”; when humans then turn around and invent birth control, it’s revealed that they were never actually steering the environment toward that goal in particular, and instead had a messier suite of goals that correlated with inclusive genetic fitness, in the environment of evolutionary adaptedness, at that ancestral level of capability.

Which is to say, my theory says “AIs need to be robustly pursuing some targets to perform well on long-horizon tasks”, but it does not say that those targets have to be the ones that the AI was trained on (or asked for). Indeed, I think the actual behaviorist-goal is very unlikely to be the exact goal the programmers intended, rather than (e.g.) a tangled web of correlates.

A follow-on inference from the above point is: when the AI leaves training, and it’s tasked with solving bigger and harder long-horizon problems in cases where it has to grow smarter than ever before and develop new tools to solve new problems, and you realize finally that it’s pursuing neither the targets you trained it to pursue nor the targets you asked it to pursue—well, by that point, you've built a generalized obstacle-surmounting engine. You've built a thing that excels at noticing when a wrench has been thrown in its plans, and at understanding the wrench, and at removing the wrench or finding some other way to proceed with its plans.

And when you protest and try to shut it down—well, that's just another obstacle, and you're just another wrench.

So, maybe don't make those generalized wrench-removers just yet, until we do know how to load proper targets in there.

27 comments, sorted by top scoring

paulfchristiano (2y)

Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing? [...] And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior? [...] Well, I claim that these are more-or-less the same fact.

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether, if you ask that system "Which action would maximize the expected amount of Y?", it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.

I think that a system may not even be able to "want" things in the behaviorist sense, and this is correlated with being unable to solve long-horizon tasks. So if you think that systems can't want things or solve long horizon tasks at all, then maybe you shouldn't update at all when they don't appear to want things.

But that's not really where we are at---AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question

Could you give an example of a task you don't think AI systems will be able to do before they are "want"-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like "go to the moon" and that you will still be writing this kind of post even once AI systems have 10x'd the pace of R&D.)

(The foreshadowing example doesn't seem very good to me. One way a human or an AI would write a story with foreshadowing is to first decide what will happen, and then write the story and include foreshadowing of the event you've already noted down. Do you think that series of steps is hard? Or that the very idea of taking that approach is hard? Or what?)

Like you, I think that future more powerful AI systems are more likely to want things in the behaviorist sense, but I have a different picture and think that you are overstating the connection between "wanting things" and "ability to solve long horizon tasks" (as well as overstating the overall case). I think a system which gets high reward across a wide variety of contexts is particularly likely to want reward in the behaviorist sense, or to want something which is consistently correlated with reward or for which getting reward is consistently instrumental during training. This seems much closer to a tautology. I think this tendency increases as models get more competent, but that it's not particularly about "ability to solve long-horizon tasks," and we are obviously getting evidence about it each time we train a new language model.

dxu (2y)

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether, if you ask that system "Which action would maximize the expected amount of Y?", it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

Here's an existing Nate!comment that I find reasonably persuasive, which argues that these two things are correlated in precisely those cases where the outcome requires routing through lots of environmental complexity:

Part of what's going on here is that reality is large and chaotic. When you're dealing with a large and chaotic reality, you don't get to generate a full plan in advance, because the full plan is too big. Like, imagine a reasoner doing biological experimentation. If you try to "unroll" that reasoner into an advance plan that does not itself contain the reasoner, then you find yourself building this enormous decision-tree, like "if the experiments come up this way, then I'll follow it up with this experiment, and if instead it comes up that way, then I'll follow it up with that experiment", and etc. This decision tree quickly explodes in size. And even if we didn't have a memory problem, we'd have a time problem -- the thing to do in response to surprising experimental evidence is often "conceptually digest the results" and "reorganize my ontology accordingly". If you're trying to unroll that reasoner into a decision-tree that you can write down in advance, you've got to do the work of digesting not only the real results, but the hypothetical alternative results, and figure out the corresponding alternative physics and alternative ontologies in those branches. This is infeasible, to say the least.

Reasoners are a way of compressing plans, so that you can say "do some science and digest the actual results", instead of actually calculating in advance how you'd digest all the possible observations. (Note that the reasoner specification comprises instructions for digesting a wide variety of observations, but in practice it mostly only digests the actual observations.)

Like, you can't make an "oracle chess AI" that tells you at the beginning of the game what moves to play, because even chess is too chaotic for that game tree to be feasibly representable. You've gotta keep running your chess AI on each new observation, to have any hope of getting the fragment of the game tree that you consider down to a manageable size.

Like, the outputs you can get out of an oracle AI are "no plan found", "memory and time exhausted", "here's a plan that involves running a reasoner in real-time" or "feed me observations in real-time and ask me only to generate a local and by-default-inscrutable action". In the first two cases, your oracle is about as useful as a rock; in the third, it's the realtime reasoner that you need to align; in the fourth, all [the] word "oracle" is doing is mollifying you unduly, and it's this "oracle" that you need to align.
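
To make the size argument in the quoted comment concrete, here is a minimal back-of-the-envelope sketch in Python. It is an editorial illustration, not part of either comment, and it assumes a constant number of distinguishable observations per step (`branching`) and a fixed horizon (`depth`). The function names are hypothetical. It compares the number of decision points an advance plan must spell out with the number a reasoner run in real time actually visits.

```python
def advance_plan_size(branching: int, depth: int) -> int:
    """Decision nodes in a plan that pre-commits a response to every possible
    observation history: 1 + b + b^2 + ... + b^(depth - 1)."""
    return sum(branching ** level for level in range(depth))


def online_decisions(depth: int) -> int:
    """A reasoner run in real time only decides along the one history
    that actually occurs, so it makes one decision per step."""
    return depth


if __name__ == "__main__":
    for depth in (5, 10, 20):
        print(depth, advance_plan_size(10, depth), online_decisions(depth))
```

Under these assumptions the unrolled plan grows exponentially in the horizon while the online reasoner's work grows linearly, which is the sense in which a reasoner "compresses" the plan.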

Could you give an example of a task you don't think AI systems will be able to do before they are "want"-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like "go to the moon" and that you will still be writing this kind of post even once AI systems have 10x'd the pace of R&D.)

Here's an existing Nate!response to a different-but-qualitatively-similar request that, on my model, looks like it ought to be a decent answer to yours as well:

a thing I don't expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.

Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.

(The original discussion that generated this example was couched in terms of value alignment, but it seems to me the general form "delete all discussion pertaining to some deep insight/set of insights from the training corpus, and see if the model can generate those insights from scratch" constitutes a decent-to-good test of the model's cognitive planning ability.)

(Also, I personally think it's somewhat obvious that current models are lacking in a bunch of ways that don't nearly require the level of firepower implied by a counterexample like "go to the moon" or "generate this here deep insight from scratch", s.t. I don't think current capabilities constitute much of an update at all as far as "want-y-ness" goes, and continue to be puzzled at what exactly causes [some] LLM enthusiasts to think otherwise.)

ryan_greenblatt (2y)

I don't see why you can't just ask at each point in time "Which action would maximize the expected value of X". It seems like asking once and asking many times as new things happen in reality don't have particularly different properties.

More detailed comment

Paul noted:

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether, if you ask that system "Which action would maximize the expected amount of Y?", it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

An earlier Nate comment (not in response) is:

Part of what's going on here is that reality is large and chaotic. When you're dealing with a large and chaotic reality, you don't get to generate a full plan in advance, because the full plan is too big. Like, imagine a reasoner doing biological experimentation. If you try to "unroll" that reasoner into an advance plan that does not itself contain the reasoner, then you find yourself building this enormous decision-tree, like "if the experiments come up this way, then I'll follow it up with this experiment, and if instead it comes up that way, then I'll follow it up with that experiment", and etc. This decision tree quickly explodes in size. And even if we didn't have a memory problem, we'd have a time problem -- the thing to do in response to surprising experimental evidence is often "conceptually digest the results" and "reorganize my ontology accordingly". If you're trying to unroll that reasoner into a decision-tree that you can write down in advance, you've got to do the work of digesting not only the real results, but the hypothetical alternative results, and figure out the corresponding alternative physics and alternative ontologies in those branches. This is infeasible, to say the least.

Reasoners are a way of compressing plans, so that you can say "do some science and digest the actual results", instead of actually calculating in advance how you'd digest all the possible observations. (Note that the reasoner specification comprises instructions for digesting a wide variety of observations, but in practice it mostly only digests the actual observations.)

But, can't you just query the reasoner at each point for what a good action would be? And then, it seems unclear if the AI actually "wants" the long run outcome vs just "wants" to give a good response or something else entirely.

Maybe the claim is that if you do this, it's equivalent to just training the reasoner to do the long term outcome (which will get you a reasoner which wants long term outcomes). Or it would only work if the reasoner had the ability to solve long-horizon tasks directly, which itself might imply it's likely to want to do this. But this seems at least unclear for reasonable training schemes.

For instance, imagine you train an AI with purely process based feedback to take actions. As in, I want to train my AI to accomplish objectives over the course of 6 months. So, I have a human review actions the AI took over a 1 hour period and rate these actions based on how good these actions seem for accomplishing the long term objective. It seems like this feedback is likely to deviate considerably from the best way to accomplish the long run objective in ways which make danger less likely. In particular, it seems far less likely that the AI will 'want' long term outcomes rather than 'wanting' to do an action such that the human rater will think the action will lead to good long term consequences (or some other proxy 'want' entirely).

(Note that just because the feedback differs considerably doesn't mean it's way less competitive, it might be, but that will depend on more details.)

It's totally consistent to have the view 'AIs which just aim to satisfy local measures of goodness (e.g. a human thinks this action is good) will never be able to accomplish long run outcomes without immense performance penalties', but I think this seems at least unclear. Further, training based mostly on long run feedback is very expensive (even if we're thinking about time scales more like 2 hours than 6 months, which is more plausible anyway).

ryan_greenblatt (2y)

More generally, it seems like we can build systems that succeed in accomplishing long run goals without having the core components which are doing this actually 'want' to accomplish any long run goal.

It seems like this is common for corporations and we see similar dynamics for language model agents.

(Again, efficiency concerns are reasonable.)

Daniel Kokotajlo (2y)

I am confused what your position is, Paul, and how it differs from So8res' position. Your statement of your position at the end (the bit about how systems are likely to end up wanting reward) seems like a stronger version of So8res' position, and not in conflict with it. Is the difference that you think the main dimension of improvement driving the change is general competence, rather than specifically long-horizon-task competence?

paulfchristiano (2y)

Differences:

I don't buy the story about long-horizon competence---I don't think there is a compelling argument, and the underlying intuitions seem like they are faring badly. I'd like to see this view turned into some actual predictions, and if it were I expect I'd disagree.

Calling it a "contradiction" or "extreme surprise" to have any capability without "wanting" looks really wrong to me.

Nate writes:

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".

I think this is a semantic motte and bailey that's failing to think about mechanics of the situation. LM agents already have the behavior "reorient towards a target in response to obstacles," but that's not the sense of "wanting" about which people disagree or that is relevant to AI risk (which I tried to clarify in my comment). No one disagrees that an LM asked "how can I achieve X in this situation?" will be able to propose methods to achieve X, and those methods will be responsive to obstacles. But this isn't what you need for AI risk arguments!

I think this post is a bad answer to the question "when are the people who expected 'agents' going to update?" I think you should be updating some now and you should be discussing that in an answer. I think the post also fails to engage with the actual disagreements so it's not really advancing the discussion.

Daniel Kokotajlo (2y)

Thanks for the response. I'm still confused but maybe that's my fault. FWIW I think my view is pretty similar to Nate's probably, though I came to it mostly independently & I focus on the concept of agents rather than the concept of wanting. (For more on my view, see this sequence.)

I definitely don't endorse "it's extremely surprising for there to be any capabilities without 'wantings'" and I expect Nate doesn't either.

What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?

paulfchristiano (2y)

If your AI system "wants" things in the sense that "when prompted to get X it proposes good strategies for getting X that adapt to obstacles," then you can control what it wants by giving it a different prompt. Arguments about AI risk rely pretty crucially on your inability to control what the AI wants, and your inability to test it. Saying "If you use an AI to achieve a long-horizon task, then the overall system definitionally wanted to achieve that task" + "If your AI wants something, then it will undermine your tests and safety measures" seems like a sleight of hand, most of the oomph is coming from equivocating between definitions of want.

You say:

I definitely don't endorse "it's extremely surprising for there to be any capabilities without 'wantings'" and I expect Nate doesn't either.

But the OP says:

to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the "behaviorist sense" expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise

This seems to strongly imply that a particular capability---succeeding at these long horizon tasks---implies the AI has "wants/desires." That's what I'm saying seems wrong.

Daniel Kokotajlo (2y)

I agree that arguments for AI risk rely pretty crucially on human inability to notice and control what the AI wants. But for conceptual clarity I think we shouldn't hardcode that inability into our definition of 'wants!' Instead I'd say that So8res is right that ability to solve long-horizon tasks is correlated with wanting things / agency, and then say that there's a separate question of how transparent and controllable the wants will be around the time of AGI and beyond. This then leads into a conversation about visible/faithful/authentic CoT, which is what I've been thinking about for the past six months and which is something MIRI started thinking about years ago. (See also my response to Ryan elsewhere in this thread)

paulfchristiano (2y)

If you use that definition, I don't understand in what sense LMs don't "want" things---if you prompt them to "take actions to achieve X" then they will do so, and if obstacles appear they will suggest ways around them, and if you connect them to actuators they will frequently achieve X even in the face of obstacles, etc. By your definition isn't that "want" or "desire" like behavior? So what does it mean when Nate says "AI doesn't seem to have all that much "want"- or "desire"-like behavior"?

I'm genuinely unclear what the OP is asserting at that point, and it seems like it's clearly not responsive to actual people in the real world saying "LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?” People who say that kind of thing mostly aren't saying that LMs can't be prompted to achieve outcomes. They are saying that LMs don't want things in the sense that is relevant to usual arguments about deceptive alignment or reward hacking (e.g. don't seem to have preferences about the training objective, or that are coherent over time).

Daniel Kokotajlo (2y)

I would say that current LLMs, when prompted and RLHF'd appropriately, and especially when also strapped into an AutoGPT-type scaffold/harness, DO want things. I would say that wanting things is a spectrum and that the aforementioned tweaks (appropriate prompting, AutoGPT, etc.) move the system along that spectrum. I would say that future systems will be even further along that spectrum. IDK what Nate meant but on my charitable interpretation he simply meant that they are not very far along the spectrum compared to e.g. humans or prophesied future AGIs.

It's a response to "LLMs turned out to not be very want-y, when are the people who expected 'agents' going to update?" because it's basically replying "I didn't expect LLMs to be agenty/wanty; I do expect agenty/wanty AIs to come along before the end and indeed we are already seeing progress in that direction."

To the people saying "LLMs don't want things in the sense that is relevant to the usual arguments..." I recommend rephrasing to be less confusing: Your claim is that LLMs don't seem to have preferences about the training objective, or that are coherent over time, unless hooked up into a prompt/scaffold that explicitly tries to get them to have such preferences. I agree with this claim, but don't think it's contrary to my present or past models.

ryan_greenblatt (2y)

(I'm obviously not Paul)

What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?

In the case of literal current LLM agents with current models:

Humans manually engineer the prompting and scaffolding (and we understand how and why it works)
We can read the intermediate goals directly via just reading the CoT.

Thus, we don't have risk from hidden, unintended, or unpredictable objectives. There is no reason to think that goal seeking behavior due to the agency from the engineered scaffold or prompting will result in problematic generalization.

It's unclear if this will hold in the future even for LLM agents, but it's at least plausible that this will hold (which defeats Nate's rather confident claim). In particular, we could run into issues from the LLM used within the LLM agent having hidden goals, but insofar as the retargeting and long run agency is a human engineered and reasonably understood process, the original argument from Nate doesn't seem very relevant to risk. We also could run into issues from imitating very problematic human behavior, but this seems relatively easy to notice in most cases, as it would likely be discussed out loud with non-negligible probability.

We'd also lose this property if we did a bunch of RL and most of the power of LLM agents was coming from this RL rather than imitating human optimization or humans engineering particular optimization processes.

Daniel Kokotajlo (2y)

It sounds like you are saying "In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we'll be able to choose what they want (at least imperfectly, via the prompt) and we'll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won't be able to successfully plot against us."

Yes of course. My research for the last few months has been focused on what happens after that, when the systems get smart enough and/or get trained so that the chain of thought is unfaithful when it needs to be faithful, e.g. the system uses euphemisms when it's thinking about whether it's misaligned and what to do about that.

Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn't contradict anything we've said. Nate and I both agree that if we can create & maintain some sort of faithful/visible thoughts property through human-level AGI and beyond, then we are in pretty good shape & I daresay things are looking pretty optimistic. (We just need to use said AGI to solve the rest of the problem for us, whilst we monitor it to make sure it doesn't plot against us or otherwise screw us over.)

[-]ryan_greenblatt2y32

It sounds like you are saying "In the current paradigm of prompted/scaffolded instruction-tuned LLMs, we get the faithful CoT property by default. Therefore our systems will indeed be agentic / goal-directed / wanting-things, but we'll be able to choose what they want (at least imperfectly, via the prompt) and we'll be able to see what they are thinking (at least imperfectly, via monitoring the CoT), therefore they won't be able to successfully plot against us."

Basically, but more centrally: in literal current LLM agents, the scary part of the system that we don't understand (the LLM) doesn't generalize in any scary way due to wanting, while we can still get the overall system to achieve specific long-term outcomes in practice. And it's at least plausible that this property will be preserved in the future.

I edited my earlier comment to hopefully make this more clear.

[-]ryan_greenblatt2y10

Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn't contradict anything we've said. Nate and I both agree that if we can create & maintain some sort of faithful/visible thoughts property through human-level AGI and beyond, then we are in pretty good shape & I daresay things are looking pretty optimistic. (We just need to use said AGI to solve the rest of the problem for us, whilst we monitor it to make sure it doesn't plot against us or otherwise screw us over.)

Even if we didn't have the visible thoughts property in the actual deployed system, the fact that all of the retargeting behavior is based on explicit human engineering is still relevant and contradicts the core claim Nate makes in this post IMO.

[-]ryan_greenblatt2y10

Anyhow I think this is mostly just a misunderstanding of Nate and my position. It doesn't contradict anything we've said.

I think it contradicts things Nate says in this post directly. I don't know if it contradicts things you've said.

To clarify, I'm commenting on the following chain:

First Nate said:

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".

as well as

Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.

Then, Paul responded with

I think this is a semantic motte and bailey that's failing to think about mechanics of the situation. LM agents already have the behavior "reorient towards a target in response to obstacles," but that's not the sense of "wanting" about which people disagree or that is relevant to AI risk (which I tried to clarify in my comment). No one disagrees that an LM asked "how can I achieve X in this situation?" will be able to propose methods to achieve X, and those methods will be responsive to obstacles. But this isn't what you need for AI risk arguments!

Then you said

What do you think is the sense of "wanting" needed for AI risk arguments? Why is the sense described above not enough?

And I was responding to this.

So, I was just trying to demonstrate at least one plausible example of a system which plausibly could pursue long term goals and doesn't have the sense of wanting needed for AI risk arguments. In particular, LLM agents where the retargeting is purely based on human engineering (analogous to a myopic employee retargeted by a manager who cares about longer term outcomes).

This directly contradicts "Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.".

[-]Daniel Kokotajlo2y52

Thanks for the explanation btw.

My version of what's happening in this conversation is that you and Paul are like "Well, what if it wants things but in a way which is transparent/interpretable and hence controllable by humans, e.g. if it wants what it is prompted to want?" My response is "Indeed that would be super safe, but it would still count as wanting things. Nate's post is titled 'ability to solve long-horizon tasks correlates with wanting', not 'ability to solve long-horizon tasks correlates with hidden uncontrollable wanting.'"

One thing at a time. First we establish that ability to solve long-horizon tasks correlates with wanting, then we argue about whether or not the future systems that are able to solve diverse long-horizon tasks better than humans can will have transparent controllable wants or not. As you yourself pointed out, insofar as we are doing lots of RL it's dubious that the wants will remain as transparent and controllable as they are now. I meanwhile will agree that a large part of my hope for a technical solution comes from something like the Faithful CoT agenda, in which we build powerful agentic systems whose wants (and more generally, thoughts) are transparent and controllable.

[-]paulfchristiano2y30

If this is what's going on, then I basically can't imagine any context in which I would want someone to read the OP rather than a post showing examples of LM agents achieving goals and saying "it's already the case that LM agents want things, more and more deployments of LMs will be agents, and those agents will become more competent such that it would be increasingly scary if they wanted something at cross-purposes to humans." Is there something I'm missing?

I think your interpretation of Nate is probably wrong, but I'm not sure and happy to drop it.

[-]Daniel Kokotajlo2y20

FWIW, your proposed pitch "it's already the case that..." is almost exactly the elevator pitch I currently go around giving. So maybe we agree? I'm not here to defend Nate's choice to write this post rather than some other post.

[-]TurnTrout2y182

Relatedly: to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the "behaviorist sense" expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise.

This seems like a great spot to make some falsifiable predictions which discriminate your particular theory from the pack. (As it stands, I don't see a reason to buy into this particular chain of reasoning.)

AIs will increasingly be deployed and tuned for long-term tasks, so we can probably see the results relatively soon. So—do you have any predictions to share? I predict that AIs can indeed do long-context tasks (like writing books with foreshadowing) without having general, cross-situational goal-directedness.^[1]

I have a more precise prediction:

AIs can write novels with at least 50% winrate against a randomly selected novel from a typical American bookstore, as judged by blinded human raters or LLMs which have at least 70% agreement with human raters on reasonably similar tasks.
Credence: 70%; resolution date: 12/1/2025

Conditional on that, I predict with 85% confidence that it's possible to do this with AIs which are basically as tool-like as GPT-4. I don't know how to operationalize that in a way you'd agree to.
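For concreteness, here is one way the resolution arithmetic for the first prediction could be written down. This is a sketch under my own assumptions, not an agreed procedure: the function names are hypothetical, pairwise judgments are reduced to booleans, and "agreement" is taken to mean simple percent agreement on a shared calibration set. It does not attempt the "tool-like" condition at all.

```python
from typing import List, Optional


def winrate(ai_wins: List[bool]) -> float:
    """Fraction of pairwise comparisons in which the AI-written novel was preferred."""
    return sum(ai_wins) / len(ai_wins)


def agreement(judge: List[bool], humans: List[bool]) -> float:
    """Percent agreement between an LLM judge and human raters on the same items."""
    if len(judge) != len(humans):
        raise ValueError("judge and human labels must cover the same items")
    return sum(j == h for j, h in zip(judge, humans)) / len(humans)


def resolves_yes(ai_wins: List[bool],
                 judge_calibration: Optional[List[bool]] = None,
                 human_calibration: Optional[List[bool]] = None) -> Optional[bool]:
    """True/False if the bet resolves; None if an LLM judge fails the 70% bar."""
    if judge_calibration is not None and human_calibration is not None:
        # Assumption: an LLM judge below 70% agreement is simply not a valid judge.
        if agreement(judge_calibration, human_calibration) < 0.70:
            return None
    return winrate(ai_wins) >= 0.50


# Made-up illustrative data, not results.
example = resolves_yes([True, False, True, True],
                       judge_calibration=[True, True, False, True, False],
                       human_calibration=[True, True, False, False, False])
print(example)
```

With blinded human raters, the calibration arguments are just left out and only the 50% threshold applies.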

(I also predict that on 12/1/2025, there will be a new defense offered for MIRI-circle views, and a range of people still won't update.)

^[1] I expect most of real-world "agency" to be elicited by the scaffolding directly prompting for it (e.g. setting up a plan/critique/execute/summarize-and-postmortem/repeat loop for the LLM), and for that agency to not come from the LLM itself.
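A toy version of the kind of loop described in the footnote, to show where the "keep going" behavior lives. This is a sketch with assumptions throughout: `stub_llm` is a hypothetical stand-in that just echoes its prompt rather than a real model API, and the stage names are taken from the footnote's wording.

```python
from typing import Callable, Dict, List

# Stage names follow the footnote; the ordering is fixed by the scaffold author.
STAGES = ["plan", "critique", "execute", "summarize-and-postmortem"]


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; it just echoes the request."""
    return f"<response to: {prompt}>"


def scaffold(task: str, llm: Callable[[str], str], rounds: int = 2) -> List[Dict[str, str]]:
    """Run the fixed stage sequence `rounds` times, carrying notes forward."""
    history: List[Dict[str, str]] = []
    notes = "none yet"
    for _ in range(rounds):
        record: Dict[str, str] = {}
        for stage in STAGES:
            # The scaffold, not the model, decides which stage comes next
            # and re-inserts the task into every prompt.
            record[stage] = llm(f"Task: {task}. Stage: {stage}. Notes so far: {notes}")
        # The "repeat" step: the postmortem becomes the next round's notes.
        notes = record["summarize-and-postmortem"]
        history.append(record)
    return history


history = scaffold("write chapter 3", stub_llm)
print(len(history), list(history[0].keys()))
```

In this sketch the persistence across rounds comes entirely from the `for` loops and the `notes` variable, both of which a human wrote; each model call only ever sees a single prompt.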

[-]aysja2y1619

From my perspective, meaningfully operationalizing “tool-like” seems like A) almost the whole crux of the disagreement, and B) really quite difficult (i.e., requiring substantial novel scientific progress to accomplish), so it seems weird to leave as a simple to-do at the end.

Like, I think that “tool versus agent” shares the same confusion that we have about “non-life versus life”—why do some pieces of matter seem to “want” things, to optimize for them, to make decisions, to steer the world into their preferred states, and so on, while other pieces seem to “just” follow a predetermined path (algorithms, machines, chemicals, particles, etc.)? What’s the difference? How do we draw the lines? Is that even the right question? I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.

Concrete operationalizations seem great to ask for, when they’re possible to give—but I suspect that expecting/requesting them before they’re possible is more likely to muddy the discourse than clarify it.

[-]TurnTrout2y*54

I didn't leave it as a "simple" to-do, but rather an offer to collaboratively hash something out.

That said: If people don't even know what it would look like when they see it, how can one update on evidence? What is Nate looking at which tells him that GPT doesn't "want things in a behaviorist sense"? (I bet he's looking at something real to him, and I bet he could figure it out if he tried!)

I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.

To be clear, I'm not talking about formalizing the boundary. I'm talking about a bet between people, adjudicated by people.

(EDIT: I'm fine with a low sensitivity, high specificity outcome -- we leave it unresolved if it's ambiguous / not totally obvious relative to the loose criteria we settled on. Also, the criterion could include randomly polling n alignment / AI people and asking them how "behaviorally-wanting" the system seemed on a Likert scale. I don't think you need fundamental insights for that to work.)
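A rough sketch of what a people-adjudicated resolution rule along these lines could look like. The 1-to-5 scale, the use of the mean, and the two thresholds are all assumptions chosen for illustration; the actual criteria would be whatever the bettors settle on.

```python
from statistics import mean
from typing import List


def resolve(ratings: List[int], low: float = 2.0, high: float = 4.0) -> str:
    """Resolve a poll of 1-5 Likert ratings of how 'behaviorally-wanting' a system seemed.

    Anything between the two (hypothetical) thresholds is left unresolved,
    which is the 'high specificity, low sensitivity' part.
    """
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be a non-empty list of integers from 1 to 5")
    avg = mean(ratings)
    if avg >= high:
        return "behaviorally-wanting"
    if avg <= low:
        return "tool-like"
    return "unresolved"


# Made-up ratings, purely illustrative.
print(resolve([5, 4, 5, 4]))
print(resolve([1, 2, 1, 2]))
print(resolve([3, 3, 4, 2]))
```

Widening the gap between `low` and `high` makes the bet resolve less often but makes any resolution harder to dispute.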

[-]Daniel Kokotajlo2y1522

The thing people seem to be disagreeing about is the thing you haven't operationalized--the "and it'll still be basically as tool-like as GPT4" bit. What does that mean and how do we measure it?

[-]abramdemski2y75

Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?
(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it’s being eroded, etc.)
And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?
(Modulo, e.g., the fact that it can play chess pretty well, which indicates a certain type of want-like behavior in the behaviorist sense. An AI's ability to win no matter how you move is the same as its ability to reliably steer the game-board into states where you're check-mated, as though it had an internal check-mating “goal” it were trying to achieve. This is again a quantitative gap that’s being eroded.)

I don't think the following is all that relevant to the point you are making in this post, but someone cited this post of yours in relation to the question of whether LLMs are "intelligent" (summarizing the post as "Nate says LLMs aren't intelligent") and then argued against the post as goalpost-moving, so I wanted to discuss that.

It may come as a shock to some, that Abram Demski adamantly defends the following position: GPT4 is AGI. I would be goalpost-moving if I said otherwise. I think the AGI community is goalpost-moving to the extent that it says otherwise.

I think there is some tendency in the AI Risk community to equate "AGI" with "the sort of AI which kills all the humans unless it is aligned". But "AGI" stands for "artificial general intelligence", not "kills all the humans". I think it makes more sense for the definition of AGI to be up to the community of AI researchers who use the term AGI to distance their work from narrow AI, rather than for it to be up to the AI risk community. And GPT4 is definitely not narrow AI.

I'll argue an even stronger claim: if you come up with a task which can be described and completed entirely in text format (and then evaluated somehow for performance quality), for most such tasks the performance of GPT4 is at or above the performance of a random human. (We can even be nice and only randomly sample humans who speak whichever languages are appropriate to the task; I'll still stand by the claim.) Yes, GPT4 has some weaknesses compared to a random human. But most claims of weaknesses I've heard are in fact contrasting GPT4 to expert humans, not random humans. So my stronger claim is: GPT4 is human-level AGI, maybe not by all possible definitions of the term, but by a very reasonable-seeming definition which 2014 Abram Demski might have been perfectly happy with. To deny this would be goalpost-moving for me; and, I expect, for many.

So (and I don't think this is what you were saying) if GPT4 were being ruled out of "human-level AGI" because it cannot write a coherent set of novels on its own, or do a big engineering project, well, I call shenanigans. Most humans can't do that either.

[-]Vladimir_Nesov2y*22

GPT-4 as a human level AGI is reasonable as a matter of evaluating the meaning of words, but this meaning of "AGI" doesn't cut reality at its joints. Humans are a big deal not for the reason of being at human level, but because there is capability for unbounded technological progress, including through building superintelligence. Ability for such progress doesn't require being superintelligent, so it's a different thing. For purposes of AI timelines it's the point where history starts progressing at AI speed rather than at human speed. There should be a name for this natural concept, and "AGI" seems like a reasonable option.

[-]abramdemski2y30

I agree that this is an important distinction, but I personally prefer to call it "transformative AI" or some such.

[-]Steven Byrnes1y*60 Review for 2023 Review

The main insight of the post (as I understand it) is this:

In the context of a discussion of whether we should be worried about AGI x-risk, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—hooray, why were people so worried about AGI risk?”
In the context of a discussion among tech people and VCs about how we haven't yet made an AGI that can found and run companies as well as Jeff Bezos, someone might say “LLMs don't seem like they're trying hard to autonomously accomplish long-horizon goals—alas, let's try to fix that problem.”

One sounds good and the other sounds bad, but there’s a duality connecting them. They’re the same observation. You can’t get one without the other.

This is an important insight because it helps us recognize the fact that people are trying to solve the second-bullet-point problem (and making nonzero progress), and to the extent that they succeed, they’ll make things worse from the perspective of the people in the first bullet point.

This insight is not remotely novel! (And OP doesn’t claim otherwise.) …But that’s fine, nothing wrong with saying things that many readers will find obvious.

(This “duality” thing is a useful formula! Another related example that I often bring up is the duality between positive-coded “the AI is able to come up with out-of-the-box solutions to problems” versus the negative-coded “the AI sometimes engages in reward hacking”. I think another duality connects positive-coded “it avoids catastrophic forgetting” to negative-coded “it’s hard to train away scheming”, at least in certain scenarios.)

(…and as comedian Mitch Hedberg sagely noted, there’s a duality between positive-coded “cheese shredder” and negative-coded “sponge ruiner”.)

The post also chats about two other (equally “obvious”) topics:

Instrumental convergence: “the AI seems like it's trying hard to autonomously accomplish long-horizon goals” involves the AI routing around obstacles, and one might expect that to generalize to “obstacles” like programmers trying to shut it down
Goal (mis)generalization: If “the AI seems like it's trying hard to autonomously accomplish long-horizon goal X”, then the AI might actually “want” some different Y which partly overlaps with X, or is downstream from X, etc.

But the question on everyone’s mind is: Are we doomed?

In and of itself, nothing in this post proves that we’re doomed. I don’t think OP ever explicitly claimed it did? In my opinion, there’s nothing in this post that should constitute an update for the many readers who are already familiar with instrumental convergence, and goal misgeneralization, and the fact that people are trying to build autonomous agents. But OP at least gives a vibe of being an argument for doom going beyond those things, which I think was confusing people in the comments.

Why aren’t we necessarily doomed? Now this is my opinion, not OP’s, but here are three pretty-well-known outs (at least in principle):

The AI can “want” to autonomously accomplish a long-horizon goal, but also simultaneously “want” to act with integrity, helpfulness, etc. Just like it’s possible for humans to do. And if the latter “want” is strong enough, it can outvote the former “want” in cases where they conflict. See my post Consequentialism & corrigibility.
The AI can behaviorist-“want” to autonomously accomplish a long-horizon goal, but where the “want” is internally built in such a way as to not generalize OOD to make treacherous turns seem good to the AI. See e.g. my post Thoughts on “Process-Based Supervision”, which is skeptical about the practicalities, but I think the idea is sound in principle.
We can in principle simply avoid building AIs that autonomously accomplish long-horizon goals, notwithstanding the economic and other pressures—for example, by keeping humans in the loop (e.g. oracle AIs). This one came up multiple times in the comments section.

There are plenty of challenges in these approaches, and interesting discussions to be had, but the post doesn’t engage with any of these topics.

Anyway, I’m voting strongly against including this post in the 2023 review. It’s not crisp about what it’s arguing for and against (and many commenters seem to have gotten the wrong idea about what it’s arguing for), it’s saying obvious things in a meandering way, and it’s not refuting or even mentioning any of the real counterarguments / reasons for hope. It’s not “best of” material.

