Over the last few months I’ve talked with Eric Drexler a number of times about his Comprehensive AI Services (CAIS) model of AI development, and read most of his technical report on the topic. I think these are important ideas which are well worth engaging with, despite personally being skeptical about many of the conclusions. Below I’ve summarised what I see as the core components of Eric’s view, followed by some of my own arguments. Note that these are only my personal opinions. I did make some changes to the summary based on Eric’s comments on early drafts, to better reflect his position - however, there are likely still ways I’ve misrepresented him. Also note that this was written before reading Rohin’s summary of the same report, although I do broadly agree with most of Rohin’s points.

One useful piece of context for this model is Eric's background in nanotechnology, and his advocacy for the development of nanotech as "atomically precise manufacturing" rather than self-replicating nanomachines. The relationship between these two frameworks has clear parallels with the relationship between CAIS and a recursively self-improving superintelligence.

The CAIS model:

  1. The standard arguments in AI safety are concerned with the development of a single AGI agent doing open-ended optimisation. Before we build such an entity (if we do so at all), we will build AI services which each perform a bounded task with bounded resources, and which can be combined to achieve superhuman performance on a wide range of tasks. 
  2. AI services may or may not be “agents”. However, under CAIS there will be no entity optimising extremely hard towards its goals in the way that most AI safety researchers have been worrying about, because: 
    1. Each service will be relatively specialised and myopic (focused on current episodic performance, not maximisation over the whole future). This is true of basically all current AI applications, e.g. image classifiers or Google Translate.
    2. Although rational agents can be proved equivalent to utility-maximisers, the same is not necessarily true of systems of rational agents. Most such systems are fundamentally different in structure from rational agents - for example, individual agents within the system can compete with or criticise each other. And since AI services aren’t “rational agents” in the first place, a system composed of them is even less likely to implement a utility-maximiser. 
    3. There won't be very much demand for unified AIs which autonomously carry out large-scale tasks requiring general capabilities, because systems of AI services will be able to perform those tasks just as well or better. 
  3. Early AI services could do things like massively disrupt financial markets, increase the rate of scientific discovery, help run companies, etc. Eventually they should be able to do any task that humans can, at our level or higher. 
    1. They could also be used to recursively improve AI technologies and to develop AI applications, but usually with humans in the loop - in roughly the same way that science allows us to build better tools with which to do better science. 
  4. Our priorities in doing AI safety research can and should be informed by this model: 
    1. A main role for technical AI safety researchers should be to look at the emergent properties of systems of AI services, e.g. which combinations of architectures, tasks and selection pressures could lead to risky behaviour, as well as the standard problems of specifying bounded tasks. 
    2. AI safety experts can also give ongoing advice and steer the development of AI services. AI safety researchers shouldn't think of safety as a one-shot problem, but rather a series of ongoing adjustments. 
    3. AI services will make it much easier to prevent the development of unbounded agent-like AGI through methods like increasing coordination and enabling surveillance, if the political will can be mustered. 

I'm broadly sympathetic to the empirical claim that we'll develop AI services which can replace humans at most cognitively difficult jobs significantly before we develop any single superhuman AGI (one unified system that can do nearly all cognitive tasks as well as or better than any human). One plausible mechanism is that deep learning continues to succeed on tasks where there's lots of training data, but doesn't learn how to reason in general ways - e.g. it could learn from court documents how to imitate lawyers well enough to replace them in most cases, without being able to understand law in the way humans do. Self-driving cars are another pertinent example. If that pattern repeats across most human professions, we might see massive societal shifts well before AI becomes dangerous in the adversarial way that’s usually discussed in the context of AI safety.

If I had to sum up my objections to Eric’s framework in one sentence, it would be: “the more powerful each service is, the harder it is to ensure it’s individually safe; the less powerful each service is, the harder it is to combine them in a way that’s competitive with unified agents.” I’ve laid out my arguments in more detail below.

Richard’s view:

  1. Open-ended agentlike AI seems like the most likely candidate for the first strongly superhuman AGI system. 
    1. As a basic prior, our only example of general intelligence so far is ourselves - a species composed of agentlike individuals who pursue open-ended goals. So it makes sense to expect AGIs to be similar - especially if you believe that our progress in artificial intelligence is largely driven by semi-random search with lots of compute (like evolution was) rather than principled intelligent design. 
      1. In particular, the way we trained on the world - both as a species and as individuals - was by interacting with it in a fairly unconstrained way. Many machine learning researchers believe that we’ll get superhuman AGI via a similar approach, by training RL agents in simulated worlds. Even if we then used such agents as “services”, they wouldn’t be bounded in the way predicted by CAIS. 
    2. Many complex tasks don’t easily decompose into separable subtasks. For instance, while writing this post I had to keep my holistic impression of Eric’s ideas in mind most of the time. This impression was formed through having conversations and reading essays, but was updated frequently as I wrote this post, and also draws on a wide range of my background knowledge. I don’t see how CAIS would split the task of understanding a high-level idea between multiple services, or (if it were done by a single service) how that service would interact with an essay-writing service, or an AI-safety-research service. 
      1. Note that this isn’t an argument against AGI being modular, but rather an argument that requiring the roles of each module and the ways they interface with each other to be human-specified or even just human-comprehensible will be very uncompetitive compared with learning them in an unconstrained way. Even on today’s relatively simple tasks, we already see end-to-end training outcompeting other approaches, and learned representations outperforming human-made representations. The basic reason is that we aren’t smart enough to understand how the best cognitive structures or representations work. Yet it’s key to CAIS that each service performs a specific known task, rather than just doing useful computation in general - otherwise we could consider each lobe of the human brain to be a “service”, and the combination of them to be unsafe in all the standard ways. 
      2. It’s not clear to me whether this is also an argument against IDA. I think that it probably is, but to a lesser extent, because IDA allows multiple layers of task decomposition which are incomprehensible to humans before bottoming out in subtasks which we can perform. 
    3. Even if task decomposition can be solved, humans reuse most of the same cognitive faculties for most of the tasks that we can carry out. If many AI services end up requiring similar faculties to each other, it would likely be more efficient to unify them into a single entity. It would also be more efficient if that entity could pick up new tasks in the same rapid way that humans do, because then you wouldn’t need to keep retraining. At that point, it seems like you no longer have an AI service but rather the same sort of AGI that we’re usually worried about. (In other words, meta-learning is very important but doesn’t fit naturally into CAIS). 
    4. Humans think in terms of individuals with goals, and so even if there's an equally good approach to AGI which doesn't conceive of it as a single goal-directed agent, researchers will be biased against it. 
  2. Even assuming that the first superintelligent AGI is in fact a system of services as described by the CAIS framework, it will be much more like an agent optimising for an open-ended goal than Eric claims. 
    1. There'll be significant pressure to reduce the extent to which humans are in the loop of AI services, for efficiency reasons. E.g. when a CEO can't improve on the strategic advice given to them by an AI, or the implementation by another AI, there's no reason to have that CEO any more. Then we’ll see consolidation of narrow AIs into one overall system which makes decisions and takes actions, and may well be given an unbounded goal like "maximise shareholder value". (Eric agrees that this is dangerous, and considers it more relevant than other threat models). 
    2. Even if we have lots of individually bounded-yet-efficacious modules, the task of combining them to perform well in new tasks seems like a difficult one which will require a broad understanding of the world. An overseer service which is trained to combine those modules to perform arbitrary tasks may be dangerous because if it is goal-oriented, it can use those modules to fulfil its goals (on the assumption that for most complex tasks, some combination of modules performs well - if not, then we’ll be using a different approach anyway). 
      1. While I accept that many services can be trained in a way which makes them naturally bounded and myopic, this is much less clear to me in the case of an overseer which is responsible for large-scale allocation of other services. In addition to superhuman planning capabilities and world-knowledge, it would probably require arbitrarily long episodes so that it can implement and monitor complex plans. My guess is that Eric would argue that this overseer would itself be composed of bounded services, in which case the real disagreement is how competitive that decomposition would be (which relates to point 1.2 above). 
  3. Even assuming that the first superintelligent AGI is in fact a system of services as described by the CAIS framework, focusing on superintelligent agents which pursue unbounded goals is still more useful for technical researchers. (Note that I’m less confident in this claim than the others). 
    1. Eventually we’ll have the technology to build unified agents doing unbounded maximisation. Once built, such agents will eventually overtake CAIS superintelligences because they’ll have more efficient internal structure and will be optimising harder for self-improvement. We shouldn’t rely on global coordination to prevent people from building unbounded optimisers, because it’s hard and humans are generally bad at it. 
    2. Conditional on both sorts of superintelligences existing, I think (and I would guess that Eric agrees) that CAIS superintelligences are significantly less likely to cause existential catastrophe. And in general, it’s easier to reduce the absolute likelihood of an event the more likely it is (even a 10% reduction of a 50% risk is more impactful than a 90% reduction of a 5% risk). So unless we think that technical research to reduce the probability of CAIS catastrophes is significantly more tractable than other technical AI safety research, it shouldn’t be our main focus.
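The parenthetical arithmetic above can be checked directly. A minimal sketch (the function name is my own, purely for illustration): absolute risk reduction is just the baseline probability times the fractional reduction achieved.

```python
def absolute_risk_reduction(p_event: float, fractional_reduction: float) -> float:
    """Absolute drop in probability from mitigating a fraction of a risk."""
    return p_event * fractional_reduction

# A 10% reduction of a 50% risk: 5 percentage points.
r_likely = absolute_risk_reduction(0.50, 0.10)
# A 90% reduction of a 5% risk: 4.5 percentage points.
r_unlikely = absolute_risk_reduction(0.05, 0.90)

print(r_likely > r_unlikely)  # True: the likelier event offers more absolute reduction
```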

As a more general note, I think that one of the main strengths of CAIS is in forcing us to be more specific about what tasks we envisage AGI being used for, rather than picturing it divorced from development and deployment scenarios. However, I worry that the fuzziness of the usual concept of AGI has now been replaced by a fuzzy notion of “service” which makes sense in our current context, but may not in the context of much more powerful AI technology. So while CAIS may be a good model of early steps towards AGI, I think it is a worse model of the period I’m most worried about. I find CAIS most valuable in its role as a research agenda (as opposed to a predictive framework): it seems worth further investigating the properties of AIs composed of modular and bounded subsystems, and the ways in which they might be safer (or more dangerous) than alternatives.

Many thanks to Eric for the time he spent explaining his ideas and commenting on drafts. I also particularly appreciated feedback from Owain Evans, Rohin Shah and Jan Leike.


However, I worry that the fuzziness of the usual concept of AGI has now been replaced by a fuzzy notion of “service” which makes sense in our current context, but may not in the context of much more powerful AI technology.

It seems to me that "AGI" is actually relatively crisp compared to "service": it's something that approximates an expected utility maximizer, which seems like a pretty small and relatively compact cluster in thing-space. "Service" seems to cover a lot more varied ground, from early simple things like image classifiers to later strategic planners, natural language advice givers, AI researchers, etc., with the later things shading into AGI in a way that makes it hard to distinguish between them.

it definitely seems worth investigating ways to make modular and bounded AIs more competitive, and CAIS more likely.

A major problem in predicting CAIS safety is to understand the order in which various services are likely to arise, in particular whether risk-reducing services are likely to come before risk-increasing services. This seems to require a lot of work in delineating various kinds of services and how they depend on each other as well as on algorithmic advancements, conceptual insights, computing power, etc. (instead of treating them as largely interchangeable or thinking that safety-relevant services will be there when we need them). Since this analysis seems very hard to do much ahead of time, I think we'll have to put very wide error bars on any predictions of whether CAIS would be safe or unsafe, until very late in the game. (This seems like a natural perspective for thinking about CAIS safety, which appears to be missing from Eric's report.)

Having said that, my feeling is that many risk-reducing services (especially ones that can address human safety problems) seem to require high-level general reasoning abilities, whereas many risk-increasing services can just be technical problem solvers or other kinds of narrow intelligences or optimizers, so the latter is likely to arrive earlier than the former, and as a result CAIS is actually quite unsafe, and hard to make safe, whereas AGI is by default highly unsafe, but with appropriate advances in safety research can perhaps be made safe. So I disagree with the proposal to push for CAIS, at least until we can better understand the strategic landscape. See also this comment where I made some related points.

AGI is ... something that approximates an expected utility maximizer.

This seems like a trait which AGIs might have, but not a part of how they should be defined. I think Eric would say that the first AI system which can carry out all the tasks we would expect an AGI to be capable of won't actually approximate an expected utility maximiser, and I consider it an open empirical question whether or not he's right.

Many risk-reducing services (especially ones that can address human safety problems) seem to require high-level general reasoning abilities, whereas many risk-increasing services can just be technical problem solvers or other kinds of narrow intelligences or optimizers, so CAIS is actually quite unsafe, and hard to make safe, whereas AGI / goal-directed agents are by default highly unsafe, but with appropriate advances in safety research can perhaps be made safe.

Yeah, good point. I guess that my last couple of sentences were pretty shallowly-analysed, and I'll retract them and add a more measured conclusion.

This seems like a trait which AGIs might have, but not a part of how they should be defined.

There's a thing that Eric is arguing against in his report, which he calls an "AGI agent". I think it is reasonable to say that this thing can be fuzzily defined as something that approximates an expected utility maximizer.

(By your definition of AGI, which seems to be something like "thing that can do all tasks that humans can do", CAIS would be AGI, and Eric is typically contrasting CAIS and AGI.)

That said, I disagree with Wei that this is relatively crisp: taken literally, the definition is vacuous because all behavior maximizes some expected utility. Maybe we mean that it is long-term goal-directed, but at least I don't know how to cash that out. I think I agree that it is more crisp than the notion of a "service", but it doesn't feel that much more crisp.

Responding to this very late, but: If I recall correctly, Eric has told me in personal conversation that CAIS is a form of AGI, just not agent-like AGI. I suspect Eric would agree broadly with Richard's definition.

And since AI services aren’t “rational agents” in the first place

AI services can totally be (approximately) VNM rational -- for a bounded utility function. The point is the boundedness, not the lack of VNM rationality. It is true that AI services would not be rational agents optimizing a simple utility function over the history of the universe (which is what I read when I see the phrase "AGI agent" from Eric).

As a basic prior, our only example of general intelligence so far is ourselves - a species composed of agentlike individuals who pursue open-ended goals.

Note that CAIS is suggesting that we should use a different prior: the prior based on "how have previous advances in technology come about". I find this to be stronger evidence than how evolution got to general intelligence.

Humans think in terms of individuals with goals, and so even if there's an equally good approach to AGI which doesn't conceive of it as a single goal-directed agent, researchers will be biased against it. 

I'm curious how strong an objection you think this is. I find it weak; in practice most of the researchers I know think much more concretely about the systems they implement than "agent with a goal", and these are researchers who work on deep RL. And in the history of AI, there were many things to be done besides "agent with a goal"; expert systems/GOFAI seems like the canonical counterexample.

There'll be significant pressure to reduce the extent to which humans are in the loop of AI services, for efficiency reasons.

Agreed for tactical decisions that require quick responses (eg. military uses, surgeries); this seems less true for strategic decisions. Humans are risk-averse and the safety community is cautioning against giving control to AI systems. I'd weakly expect that humans continue to be in the loop for nearly all important decisions (eg. remaining as CEOs of companies, but with advisor AI systems that do most of the work), until eg. curing cancer, solving climate change, ending global poverty, etc. (I'm not saying they'll stop being in the loop after that, I'm saying they'll remain in the loop at least until then.) To be clear, I'm imagining something like how I use Google Maps: basically always follow its instructions, but check that it isn't eg. routing me onto a road that's closed.

A clear counterargument is that some companies will have AI CEOs, and they will outcompete the others, and so we'll quickly transition to the world where all companies have AI CEOs. I think this is not that important -- having a human in the loop need not slow down everything by a huge margin, since most of the cognitive work is done by the AI advisor, and the human just needs to check that it makes sense (perhaps assisted by other AI services).

To the extent that you are using this to argue that "the AI advisor will be much more like an agent optimising for an open-ended goal than Eric claims", I agree that the AI advisor will look like it is "being a very good CEO". I'm not sure I agree that it will look like an agent optimizing for an open-ended goal, though I'm confused about this.

Even if we have lots of individually bounded-yet-efficacious modules, the task of combining them to perform well in new tasks seems like a difficult one which will require a broad understanding of the world.

Broad understanding isn't incompatible with services; Eric gives the example of language translation.

An overseer service which is trained to combine those modules to perform arbitrary tasks may be dangerous because if it is goal-oriented, it can use those modules to fulfil its goals

The main point of CAIS is that services aren't long-term goal-oriented; I agree that if services end up being long-term goal-oriented they become dangerous. In that case, there are still approaches that help us monitor when something bad happens (eg. looking at which services are being called upon for which task, limiting the information flow into any particular service), but the adversarial optimization danger is certainly present. (I think but am not sure that Eric would broadly agree with this take.)

My guess is that Eric would argue that this overseer would itself be composed of bounded services, in which case the real disagreement is how competitive that decomposition would be

Yup, that's the argument I would make.

Conditional on both sorts of superintelligences existing, I think (and I would guess that Eric agrees) that CAIS superintelligences are significantly less likely to cause existential catastrophe. And in general, it’s easier to reduce the absolute likelihood of an event the more likely it is (even a 10% reduction of a 50% risk is more impactful than a 90% reduction of a 5% risk). So unless we think that technical research to reduce the probability of CAIS catastrophes is significantly more tractable than other technical AI safety research, it shouldn’t be our main focus.

If you go via the CAIS route you definitely want to prevent unbounded AGI maximizers from being created until you are sure of their safety or that you can control them. (I know you addressed that in the previous point, but I'm pretty sure that no one is arguing to focus on CAIS conditional on AGI agents existing and being more powerful than CAIS, so it feels like you're attacking a strawman.)

Eventually we’ll have the technology to build unified agents doing unbounded maximisation. Once built, such agents will eventually overtake CAIS superintelligences because they’ll have more efficient internal structure and will be optimising harder for self-improvement.

Given a sufficiently long delay, we could use CAIS to build global systems that can control any new AGIs, in the same way that government currently controls most people.

I also am not sure why you think that AGI agents will optimize harder for self-improvement.

So while CAIS may be a good model of early steps towards AGI, I think it is a worse model of the period I’m most worried about.

Compared to what? If the alternative is "a vastly superintelligent AGI agent that is acting within what is effectively the society of 2019", then I think CAIS is a better model. I'm guessing that you have something else in mind though.

AI services can totally be (approximately) VNM rational -- for a bounded utility function.

Suppose an AI service realises that it is able to seize many more resources with which to fulfil its bounded utility function. Would it do so? If no, then it's not rational with respect to that utility function. If yes, then it seems rather unsafe, and I'm not sure how it fits Eric's criterion of using "bounded resources".

Note that CAIS is suggesting that we should use a different prior: the prior based on "how have previous advances in technology come about". I find this to be stronger evidence than how evolution got to general intelligence.

I agree with Eric's claim that R&D automation will speed up AI progress. The point of disagreement is more like: when we have AI technology that's able to do basically all human cognitive tasks (which for want of a better term I'll call AGI, as an umbrella term to include both CAIS and agent AGI), what will it look like? It's true that no past technologies have looked like unified agent AGIs - but no past technologies have also looked like distributed systems capable of accomplishing all human tasks either. So it seems like the evolution prior is still the most relevant one.

"Humans think in terms of individuals with goals, and so even if there's an equally good approach to AGI which doesn't conceive of it as a single goal-directed agent, researchers will be biased against it."
I'm curious how strong an objection you think this is. I find it weak; in practice most of the researchers I know think much more concretely about the systems they implement than "agent with a goal", and these are researchers who work on deep RL. And in the history of AI, there were many things to be done besides "agent with a goal"; expert systems/GOFAI seems like the canonical counterexample.

I think the whole paradigm of RL is an example of a bias towards thinking about agents with goals, and that as those agents become more powerful, it becomes easier to anthropomorphise them (OpenAI Five being one example where it's hard not to think of it as a group of agents with goals). I would withdraw my objection if, for example, most AI researchers took the prospect of AGI from supervised learning as seriously as AGI from RL.

A clear counterargument is that some companies will have AI CEOs, and they will outcompete the others, and so we'll quickly transition to the world where all companies have AI CEOs. I think this is not that important -- having a human in the loop need not slow down everything by a huge margin, since most of the cognitive work is done by the AI advisor, and the human just needs to check that it makes sense (perhaps assisted by other AI services).

I claim that this sense of "in the loop" is irrelevant, because it's equivalent to the AI doing its own thing while the human holds a finger over the stop button. I.e. the AI will be equivalent to current CEOs, the humans will be equivalent to current boards of directors.

To the extent that you are using this to argue that "the AI advisor will be much more like an agent optimising for an open-ended goal than Eric claims", I agree that the AI advisor will look like it is "being a very good CEO". I'm not sure I agree that it will look like an agent optimizing for an open-ended goal, though I'm confused about this.

I think of CEOs as basically the most maximiser-like humans. They have pretty clear metrics which they care about (even if it's not just share price, "company success" is a clear metric by human standards), they are able to take actions that are as broad in scope as basically any actions humans can take (expand to new countries, influence politics, totally change the lives of millions of employees), and almost all of the labour is cognitive, so "advising" is basically as hard as "doing" (modulo human interactions). To do well they need to think "outside the box" of stimulus and response, and deal with worldwide trends and arbitrarily unusual situations (has a hurricane just hit your factory? do you need to hire mercenaries to defend your supply chains?). Most of them have some moral constraints, but also there's a higher percentage of psychopaths among CEOs than in any other role, and it's plausible that we'd have no idea whether an AI doing well as a CEO actually "cares about" these sorts of bounds or is just (temporarily) constrained by public opinion in the same way as the psychopaths.

The main point of CAIS is that services aren't long-term goal-oriented; I agree that if services end up being long-term goal-oriented they become dangerous.

I then mentioned that to build systems which implement arbitrary tasks, you may need to be operating over arbitrarily long time horizons. But probably this also comes down to how decomposable such things are.

If you go via the CAIS route you definitely want to prevent unbounded AGI maximizers from being created until you are sure of their safety or that you can control them. (I know you addressed that in the previous point, but I'm pretty sure that no one is arguing to focus on CAIS conditional on AGI agents existing and being more powerful than CAIS, so it feels like you're attacking a strawman.)

People are arguing for a focus on CAIS without (to my mind) compelling arguments for why we won't have AGI agents eventually, so I don't think this is a strawman.

Given a sufficiently long delay, we could use CAIS to build global systems that can control any new AGIs, in the same way that government currently controls most people.

This depends on having pretty powerful CAIS and very good global coordination, both of which I think of as unlikely (especially given that in a world where CAIS occurs and isn't very dangerous, people will probably think that AI safety advocates were wrong about there being existential risk). I'm curious how likely you think this is though? If agent AGIs are 10x as dangerous, and the probability that we eventually build them is more than 10%, then agent AGIs are the bigger threat.

I also am not sure why you think that AGI agents will optimize harder for self-improvement.

Because they have long-term convergent instrumental goals, and CAIS doesn't. CAIS only "cares" about self-improvement to the extent that humans are instructing it to do so, but humans are cautious and slow. Also because even if building AGI out of task-specific strongly-constrained modules is faster at first, it seems unlikely that it's anywhere near the optimal architecture for self-improvement.

Compared to what? If the alternative is "a vastly superintelligent AGI agent that is acting within what is effectively the society of 2019", then I think CAIS is a better model. I'm guessing that you have something else in mind though.

It's something like "the first half of CAIS comes true, but the services never get good enough to actually be comprehensive/general. Meanwhile fundamental research on agent AGI occurs roughly in parallel, and eventually overtakes CAIS." As a vague picture, imagine a world in which we've applied powerful supervised learning to all industries, and applied RL to all tasks which are either as constrained and well-defined as games, or as cognitively easy as most physical labour, but still don't have AI which can independently do the most complex cognitive tasks (Turing tests, fundamental research, etc).

Suppose an AI service realises that it is able to seize many more resources with which to fulfil its bounded utility function. Would it do so? If no, then it's not rational with respect to that utility function. If yes, then it seems rather unsafe, and I'm not sure how it fits Eric's criterion of using "bounded resources".

Yes, it would. The hope is that there do not exist ways to seize and productively use tons of resources within the bound. (To be clear, I'm imagining a bound on time, i.e. finite horizon, as opposed to a bound on the maximum value of the utility function.)
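A toy sketch of that hope (entirely my own construction, not from the report): with a hard episode cutoff, even a reward-maximizing service can find that seizing resources never pays, because the up-front cost of the seizure eats the whole bounded horizon. All the numbers and the reward structure below are made up for illustration.

```python
def total_reward(horizon: int, seize: bool,
                 setup_cost_steps: int = 5,
                 base_reward: float = 1.0,
                 boosted_reward: float = 2.0) -> float:
    """Total episodic reward. Seizing resources wastes setup steps up front,
    then yields a higher per-step reward for the remaining steps."""
    if not seize:
        return horizon * base_reward
    productive_steps = max(0, horizon - setup_cost_steps)
    return productive_steps * boosted_reward

# Short (bounded) episode: the setup cost never pays off, so the rational
# choice within the bound is not to seize.
print(total_reward(horizon=4, seize=True) < total_reward(horizon=4, seize=False))    # True
# Long horizon: seizing dominates -- the bound, not the agent's rationality,
# was doing the safety work.
print(total_reward(horizon=100, seize=True) > total_reward(horizon=100, seize=False))  # True
```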

I agree with Eric's claim that R&D automation will speed up AI progress. The point of disagreement is more like: when we have AI technology that's able to do basically all human cognitive tasks (which for want of a better term I'll call AGI, as an umbrella term to include both CAIS and agent AGI), what will it look like? It's true that no past technologies have looked like unified agent AGIs - but no past technologies have also looked like distributed systems capable of accomplishing all human tasks either. So it seems like the evolution prior is still the most relevant one.

I don't really know what to say to this beyond "I disagree", it seems like a case of reference class tennis. I'm not sure how much we disagree -- I do agree that we should put weight on the evolution prior.

I think the whole paradigm of RL is an example of a bias towards thinking about agents with goals, and that as those agents become more powerful, it becomes easier to anthropomorphise them (OpenAI Five being one example where it's hard not to think of it as a group of agents with goals).

But there were so many other paradigms that did not look like that.

I would withdraw my objection if, for example, most AI researchers took the prospect of AGI from supervised learning as seriously as AGI from RL.

There are lots of good reasons not to expect AGI from supervised learning, most notably that with supervised learning you are limited to human performance.

I claim that this sense of "in the loop" is irrelevant, because it's equivalent to the AI doing its own thing while the human holds a finger over the stop button. I.e. the AI will be equivalent to current CEOs, the humans will be equivalent to current boards of directors.

I've lost sight of what original claim we were disagreeing about here. But I'll note that I do think that we have significant control over current CEOs, relative to what we imagine with "superintelligent AGI optimizing a long-term goal".

I think of CEOs as basically the most maximiser-like humans.

I agree with this (and the rest of that paragraph) but I'm not sure what point you're trying to make there. If you're saying that a CAIS-CEO would be risky, I agree. This seems markedly different from worries that a CAIS-anything would behave like a long-term goal-directed literally-actually-maximizer.

I then mentioned that to build systems which implement arbitrary tasks, you may need to be operating over arbitrarily long time horizons. But probably this also comes down to how decomposable such things are.

Agreed that decomposability is the crux.

People are arguing for a focus on CAIS without (to my mind) compelling arguments for why we won't have AGI agents eventually, so I don't think this is a strawman.

Eventually is the key word here. Conditional on AGI agents existing before CAIS, I certainly agree that we should focus on AGI agent safety, which is the claim I thought you were making. Conditional on CAIS existing before AGI agents, I think it's a reasonable position to say "let's focus on CAIS, and then coordinate to either prevent AGI agents from existing or to control them from the outside if they will exist". In particular, approaches like boxing or supervision by a strong overseer become much more likely to work in a world where CAIS already exists.

Also, there is one person working on CAIS and tens to hundreds working on AGI agents (depending on how you count), so arguing for more of a focus on CAIS doesn't mean that you think that CAIS is the most important scenario.

This depends on having pretty powerful CAIS and very good global coordination, both of which I think of as unlikely (especially given that in a world where CAIS occurs and isn't very dangerous, people will probably think that AI safety advocates were wrong about there being existential risk). I'm curious how likely you think this is though?

I don't find it extremely unlikely that we'll get something along these lines. I don't know, maybe something like 5%? (Completely made up number; it's especially meaningless because I don't have a concrete enough sense of what counts as CAIS and what counts as good global coordination to make a prediction about it.) But I also think that the actions we need to take look very different in different worlds, so most of this is uncertainty over which world we're in, as opposed to confidence that we're screwed except in this 5% probability world.

If agent AGIs are 10x as dangerous, and the probability that we eventually build them is more than 10%, then agent AGIs are the bigger threat.

While this is literally true, I have a bunch of problems with the intended implications:

  • Saying "10x as dangerous" is misleading. If CAIS leads to >10% x-risk, it is impossible for agent AGI to be 10x as dangerous (ignoring differences in outcomes like s-risks). So by saying "10x as dangerous" you're making an implicit claim of safety for CAIS. If you phrase it in terms of probabilities, "10x as dangerous" seems much less plausible.
  • The research you do and actions you take in the world where agent AGI comes first are different from those in the world where CAIS comes first. I expect most research to significantly affect one of those two worlds but not both. So the relevant question is the probability of a particular one of those worlds.
  • I expect our understanding of low-probability / edge-case worlds to be very bad, in which case most research aimed at improving those worlds is much more likely to be misguided and useless. This cuts against arguments of the form "We should focus on X even though it is unlikely or hard to understand, because if it happens then it would be really bad/dangerous." Yes, you can apply this to AI safety in general, and yes, I do think that a majority of AI safety research will turn out to be useless, primarily because of this argument.
  • This is an argument only about importance. As I mentioned above, CAIS is much more neglected, and plausibly is more tractable.
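
To make the arithmetic in this exchange concrete, here is a toy calculation (all numbers hypothetical). It shows both the original expected-threat claim and the first bullet's objection: once "danger" is read as a probability of catastrophe, a 10x multiplier saturates at certainty.

```python
def expected_threat(p_scenario: float, danger: float) -> float:
    """Expected badness = P(scenario occurs) * conditional danger."""
    return p_scenario * danger

# Original claim: if agent AGI is 10x as dangerous and P(agent AGI) > 10%,
# it is at least as big a threat as CAIS, even if CAIS were certain to occur.
d_cais = 1.0                # normalise CAIS danger to 1 (abstract units)
d_agent = 10 * d_cais
assert expected_threat(0.1, d_agent) >= expected_threat(1.0, d_cais)

# Objection: if "danger" is a probability of catastrophe, it is capped at 1.
# With CAIS x-risk above 10%, "10x as dangerous" is then impossible:
p_cais_risk = 0.15                         # assumed CAIS x-risk (>10%)
p_agent_risk = min(1.0, 10 * p_cais_risk)  # saturates at certainty
assert p_agent_risk < 10 * p_cais_risk     # strictly less than "10x"
```
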

Because they have long-term convergent instrumental goals, and CAIS doesn't. CAIS only "cares" about self-improvement to the extent that humans are instructing it to do so, but humans are cautious and slow.

Agreed, though I don't think this is a huge effect. We aren't cautious and slow about current AI development, because we're confident it isn't dangerous; the same could happen in CAIS with basic AI building blocks. But good point, I agree this pushes me towards thinking that AGI agents will self-improve faster.

Also because even if building AGI out of task-specific strongly-constrained modules is faster at first, it seems unlikely that it's anywhere near the optimal architecture for self-improvement.

Idk, that seems plausible to me. I don't see strong arguments in either direction.

It's something like "the first half of CAIS comes true, but the services never get good enough to actually be comprehensive/general. Meanwhile fundamental research on agent AGI occurs roughly in parallel, and eventually overtakes CAIS." As a vague picture, imagine a world in which we've applied powerful supervised learning to all industries, and applied RL to all tasks which are either as constrained and well-defined as games, or as cognitively easy as most physical labour, but still don't have AI which can independently do the most complex cognitive tasks (Turing tests, fundamental research, etc).

I agree that seems like a good model. It doesn't seem clearly superior to CAIS though.