I have a problem with section 32, "Unaligned superintelligent agents need not threaten world stability". Here's the summary of that section from the paper:
- Powerful SI-level capabilities can precede AGI agents.
- SI-level capabilities could be applied to strengthen defensive stability.
- Unopposed preparation enables strong defensive capabilities.
- Strong defensive capabilities can constrain problematic agents.
So the key idea here seems to be that good actors will have a period of time to use superintelligent AI services to prepare some sort of ubiquitous defense that will constrain any subsequent AGI agents. But I don't understand where this period of "unopposed preparation" comes from. Why wouldn't someone create an AGI by cobbling together a bunch of AI services, or hire a bunch of AI services to help them design an AGI, as soon as they could? If they did that, then superintelligent AGI agents would arise nearly simultaneously with SI-level capabilities, and there would be no such period of unopposed preparation. In section 32.2, Eric only argues that SI-level capabilities can precede AGI agents. Since I think they wouldn't, at least not by a significant margin, the whole argument seems to either fall apart or have to be interpreted in a way that makes it strategically irrelevant.
Eric seems to think that no one would bother to create AGI because "AGI agents offer no compelling value", by which he means "Because general AI-development capabilities can provide stable, comprehensive AI services, there is no compelling, practical motivation for undertaking the more difficult and potentially risky implementation of self-modifying AGI agents." But if quickly building an AGI can potentially allow someone to take over the world before "unopposed preparation" can take place, isn't that a compelling motivation by itself for many people?
Why wouldn't someone create an AGI by cobbling together a bunch of AI services, or hire a bunch of AI services to help them design an AGI, as soon as they could?
Because any task that an AGI could do, CAIS could do as well. (Though I don't agree with this -- unified agents seem to work better.)
But if quickly building an AGI can potentially allow someone to take over the world before "unopposed preparation" can take place, isn't that a compelling motivation by itself for many people?
I suspect he would claim that quickly building an AGI would not allow you to take over the world, because the AGI would not be that much more capable than the CAIS service cluster.
It may be the case that people try to take over the world just with CAIS, and maybe that could succeed. I think he's arguing only against AGI accident risk here, not against malicious uses of AI. (I think you already knew that, but it wasn't fully clear on reading your comment.)
I suspect he would claim that quickly building an AGI would not allow you to take over the world, because the AGI would not be that much more capable than the CAIS service cluster.
That does not seem to be his position though, because if AGI is not much more capable than CAIS, then there would be no need to talk specifically about how to defend the world against AGI, as he does at length in section 32. If that was his position, he could just talk about how ordinary policing and military defense would work in a CAIS world (i.e., against human adversaries wielding CAIS) and say that the same policing/defense would also work against AGI because AGI is not much more capable than CAIS.
Instead it seems clear that he thinks AGI requires special effort to defend against, which is made possible by a delay between SI-level CAIS and AGI, which he proposes that we use to do a very extensive "unopposed preparation". I've been trying to figure out why he thinks there will be such a delay and my current best guess is "Implementation of the AGI model is widely regarded as requiring conceptual breakthroughs." (page 75) which he repeats on page 77, "AGI (but not CAIS) calls for conceptual breakthroughs to enable both implementation and subsequent safe application." I don't understand why he thinks such conceptual breakthroughs will be required though. Why couldn't someone just take some appropriate AI services, connect them together in a straightforward way, and end up with an AGI? Do you get it? Or am I on the wrong track here?
Do you get it?
I doubt I will ever be able to confidently answer yes to that question.
That does not seem to be his position though, because if AGI is not much more capable than CAIS, then there would be no need to talk specifically about how to defend the world against AGI, as he does at length in section 32.
My model is that he does think AGI won't be much more capable than CAIS (see sections 12 and 13 in particular, and 10, 11 and 16 also touch on the topic), but lots of people (including me) kept making the argument that end-to-end training tends to improve performance and so AGI would outperform CAIS, and so he decided to write a response to that.
In general, my impression from talking to him and reading earlier drafts is that the earlier chapters are representative of his core models, while the later chapters are more like responses to particular arguments, or specific implications of those models.
I can give one positive argument for AGI being harder to make than SI-level CAIS. All of our current techniques for building AI systems create things that are bounded in the time horizon they are optimizing over. It's actually quite unclear how we would use current techniques to get something that does very-long-term-planning. (This could be the "conceptual breakthroughs" point.) Seems a lot easier to get a bunch of bounded services and hook them up together in such a way that they can do the sorts of things that AGI agents could do.
The one scenario that is both concrete and somewhat plausible to me is that we run powerful deep RL on a very complex environment, and this finds an agent that does very-long-term-planning, because that's what it takes to do well on the environment. I don't know what Eric thinks about this scenario, but it doesn't seem to influence his thinking very much (and in fact in the OP I argued that CAIS isn't engaging enough with this scenario).
Why couldn't someone just take some appropriate AI services, connect them together in a straightforward way, and end up with an AGI?
If you take a bunch of bounded services and connect them together in some straightforward way, you wouldn't get something that is optimizing over the long term. Where did the long term optimization come from?
For example, you could take any long term task and break it down into the "plan maker" which thinks for an hour and gives a plan for the task, and the "plan executor" which takes an in-progress plan and executes the next step. Both of these are bounded and so could be services, and their combination is generally intelligent, but the combination wouldn't have convergent instrumental subgoals.
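A toy sketch of this decomposition may make the structure concrete. Everything in it is hypothetical: `Plan`, `plan_maker`, `plan_executor`, and `run` are illustrative names, the "planning" is a stub that emits placeholder steps, and the `review` hook is an assumed place where a human could inspect the plan before anything is executed. It shows only that each service call is bounded and that the long-running loop sits outside both services; it says nothing about what a real planner would put in its plans.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Plan:
    goal: str
    steps: List[str]
    next_step: int = 0  # index of the next step to execute

    @property
    def done(self) -> bool:
        return self.next_step >= len(self.steps)


def plan_maker(goal: str, budget: int = 5) -> Plan:
    """Bounded service: produce a plan of at most `budget` steps, then stop."""
    steps = [f"step {i + 1} toward {goal!r}" for i in range(budget)]
    return Plan(goal=goal, steps=steps)


def plan_executor(plan: Plan, log: List[str]) -> Plan:
    """Bounded service: execute exactly one step of an in-progress plan."""
    if not plan.done:
        log.append(plan.steps[plan.next_step])  # stand-in for real execution
        plan.next_step += 1
    return plan


def run(goal: str, review: Callable[[Plan], bool] = lambda plan: True) -> List[str]:
    """Outer loop that wires the two services together (hypothetical glue code)."""
    plan = plan_maker(goal)
    if not review(plan):  # optional human-in-the-loop check on the plan
        return []
    log: List[str] = []
    while not plan.done:
        plan = plan_executor(plan, log)
    return log
```

Calling `run("cure cancer")` executes five placeholder steps, while `run("cure cancer", review=lambda plan: False)` executes none, because the plan is rejected before the executor is ever called.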
Thanks, I think this is helpful for me to understand Eric's model better, but I'm still pretty confused.
It’s actually quite unclear how we would use current techniques to get something that does very-long-term-planning. (This could be the “conceptual breakthroughs” point.)
But it's quite unclear how to use current techniques to do a lot of things. Why should we expect that this conceptual breakthrough would come later than other conceptual breakthroughs needed to achieve CAIS? (Given your disagreement with Eric on this, I guess this is more a question for him than for you.)
Where did the long term optimization come from?
I was assuming that long term strategic planners (as described in section 27) are available as an AIS, and would be one of the components of the hypothetical AGI.
For example, you could take any long term task and break it down into the “plan maker” which thinks for an hour and gives a plan for the task, and the “plan executor” which takes an in-progress plan and executes the next step. Both of these are bounded and so could be services, and their combination is generally intelligent, but the combination wouldn’t have convergent instrumental subgoals.
I don't see why it wouldn't, unless these services are specifically designed to be corrigible (in which case the "corrigible" part seems much more important than the "service" part). For example, suppose you asked the plan maker to create a plan to cure cancer. Why would the mere fact that it's a bounded service prevent it from coming up with a plan that involves causing human extinction (and a bunch of convergent instrumental subgoals like deceiving humans who might stop it)? (If there was a human in the loop, then you could look at the plan and reject it, but I'm imagining that someone, in order to build an AGI as quickly and efficiently as possible, stripped off the "optimize for human consumption" part of the strategic planner and instead optimized it to produce plans for direct machine consumption.)
Why should we expect that this conceptual breakthrough would come later than other conceptual breakthroughs needed to achieve CAIS?
I think I share Eric's intuition that this problem is hard in a more fundamental way than other things, but I don't really know why I have this intuition. Some potential generators:
I don't see why it wouldn't, unless these services are specifically designed to be corrigible (in which case the "corrigible" part seems much more important than the "service" part).
Yeah, you're right, I definitely said the wrong thing there. I guess the difference is that the convergent instrumental subgoals are now "one level up" -- they aren't subgoals of the AI service itself, they're subgoals of the plan that was created by the AI service. It feels like this is qualitatively different and easier to address, but I can't really say why. More generators:
(You could of course say "just because it's easier that doesn't mean people will do it", but I could imagine that if it's easy enough this becomes best practice and people do it by default, and you don't actually gain very much by taking these parts out.)
I was assuming that long term strategic planners (as described in section 27) are available as an AIS, and would be one of the components of the hypothetical AGI.
Yeah, here also what I should have said is that the long term optimization is happening one level up, whereas with the typical AGI agent scenario it feels like the long term optimization needs to happen at the base level, and that's the thing we don't know how to do.
Unfortunately, I only vaguely understand the points that you're trying to make in this comment... Would it be fair to just say at this point that this is an important crux that Eric failed to convincingly argue for?
I agree that it's an important crux, and that the arguments are not sufficiently strong that everyone should believe Eric's position. I do think that he has provided arguments that support his position, though they are in a different language/ontology than is usually used here.
Ah, ok, what sections would you suggest that I (re)read to understand his arguments better? (You mentioned 12, 13, 10, 11 and 16 earlier in this thread but back then we were talking about "AGI won’t be much more capable than CAIS" and here the topic is whether we should expect AGI to come later than CAIS or require harder conceptual breakthroughs.)
I quickly skimmed the table of contents to generate this list, so it might have both false positives and false negatives.
Section 1: We typically make progress using R&D processes; this can get us to superintelligence. Implicitly also makes the claim that this is qualitatively different from AGI, though doesn't really argue for that.
Section 8: Optimization pressure points away from generality, not towards it, which suggests that strong optimization pressure doesn't give you AGI.
Section 12.6: AGI and CAIS solve problems in different ways. (Combined with the claim, argued elsewhere: CAIS will happen first.)
Section 13: AGI agents are more complex. (Implicit claim: and so harder to build.)
Section 17: Most complex tasks involve several different subtasks that don't interact much; so you get efficiency and generality gains by splitting the subtasks up into separate services.
Section 38: Division of labor + specialization are useful for good performance.
Most of these sections seem to only contain arguments that AGI won't come earlier than CAIS, but not that it would come later than CAIS. In other words, they don't argue against the likelihood that under CAIS someone can easily build an AGI by connecting existing AI services together in a straightforward way. The only section I can find among the ones you listed that tries to argue in this direction is Section 13, but even it mostly just argues that AGI isn't simpler than CAIS, and not that it's more complex, except for this paragraph in the summary, Section 13.5:
To summarize, in each of the areas outlined above, the classic AGI model both obscures and increases complexity: In order for general learning and capabilities to fit a classic AGI model, they must not only exist, but must be integrated into a single, autonomous, self-modifying agent. Further, achieving this kind of integration would increase, not reduce, the challenges of aligning AI behaviors with human goals: These challenges become more difficult when the goals of a single agent must motivate all (and only) useful tasks.
So putting alignment aside (I'm assuming that someone would be willing to build an unaligned AGI if it's easy enough), the only argument Eric gives for greater complexity of AGI vs CAIS is "must be integrated into a single, autonomous, self-modifying agent", but why should this integration add a non-negligible amount of complexity? Why can't someone just take a plan maker, connect it to a plan executor, and connect that to the Internet to access other services as needed? (I think your argument that strategic planning may be one of the last AIS to arrive is plausible, but it doesn't seem to be an argument that Eric himself makes.) Where is the additional complexity coming from?
Why can't someone just take a plan maker, connect it to a plan executor, and connect that to the Internet to access other services as needed?
I think Eric would not call that an AGI agent.
Setting aside what Eric thinks and talking about what I think: There is one conception of "AGI risk" where the problem is that you have an integrated system that has optimization pressure applied to the system as a whole (similar to end-to-end training) such that the entire system is "pointed at" a particular goal and uses all of its intelligence towards that. The goal is a long-term goal over universe-histories. The agent can be modeled as literally actually maximizing the goal. These are all properties of the AGI itself.
With the system you described, there is no end-to-end training, and it doesn't seem right to say that the overall system is aimed at a long-term goal, since it depends on what you ask the plan maker to do. I agree this does not clearly solve any major problem, but it does seem markedly different to me.
I think that Eric's conception of "AGI agent" is like the first thing I described. I agree that this is not what everyone means by "AGI", and it is particularly not the thing you mean by "AGI".
You might argue that there seems to be no effective safety difference between an Eric-AGI-agent and the plan maker + plan executor. The main differences seem to be about what safety mechanisms you can add -- such as looking at the generated plan, or using human models of approval to check that you have the right goal. (Whereas an Eric-AGI-agent is so opaque that you can't look at things like "generated plans", and you can't check that you have the right goal because the Eric-AGI-agent will not let you change its goal.)
With an Eric-AGI-agent, if you try to create a human model of approval, that would need to be an Eric-AGI-agent itself in order to effectively supervise the first Eric-AGI-agent, but in that case the model of approval will be literally actually maximizing some goal like "be as accurate as possible", which will lead to perverse behavior like manipulating humans so that what they approve is easier to predict. In CAIS, this doesn't happen, because the approval model is not searching over possibilities that involve manipulating humans.
This is one of the documents I was responding to when I wrote A general model of safety-oriented AI development, Three AI Safety Related Ideas, and Two Neglected Problems in Human-AI Safety. (I didn't cite it because it was circulating semi-privately in draft form, and Eric apparently didn't want its existence to be publicly known.) I'm disappointed that although Eric wrote to me "I think that your two neglected problems are critically important", the perspectives in those posts didn't get incorporated more into the final document, which spends only 3 short paragraphs out of hundreds of pages to talk about what I think of as "human safety problems". (I think those paragraphs were in the draft even before I wrote my posts.)
I worry about the framing adopted in this document that the main problem in human-AI safety is "questions of what humans might choose to do with their capabilities", as opposed to my preferred framing of "how can we design human-AI systems to minimize total risk". (To be fair to Eric, a lot of other AI safety people also only talk about "misuse risk" and not about how AI is by default likely to exacerbate human safety problems, e.g., by causing rapid distributional shifts for humans.) I worry that this gives AI researchers and developers license to think, "I'm just developing an AI service. AI services will be comprehensive anyway so there's no reason for me to hold back or think more about what I'm doing. It's someone else's job to worry about what humans might choose to do with these capabilities."
I actually think the CAIS model gives me optimism for these sorts of problems. As long as we acknowledge that the problems exist and can be an issue, we could develop services that help us mitigate them. Safety in the CAIS world already depends on having services that are in charge of good engineering, testing, red teaming, monitoring, etc., as well as services that evaluate objectives and make sure humans would approve of them. It seems fairly easy to expand this to include services that consider how disruptive new technologies will be, how underdetermined human values are, whether a proposed plan reduces option value, what risk aversion implies about a particular plan of action, what blind spots people have, etc.
I'd be interested in a list of services that you think would be helpful for addressing human safety problems. You might think of this as "our best current guess at metaphilosophy and metaphilosophy research".
(I know you were mainly talking about the document's framing; I don't have much to say about that.)
It seems fairly easy to expand this to include services that consider how disruptive new technologies will be, how underdetermined human values are, whether a proposed plan reduces option value, what risk aversion implies about a particular plan of action, what blind spots people have, etc.
Can you explain how you'd implement these services? Take "how disruptive new technologies will be" for example. I imagine you can't just apply ML given the paucity of training data and how difficult it would be to generalize from historical data to new technologies and new social situations. And it seems to me that if you base it on any kind of narrow AI technology, it would be easy to miss some of the novel implications/consequences of the new technologies and social situations and end up with a wrong answer. Maybe you could instead base it on a general purpose reasoner or question-answerer, but if something like that exists, AI would already have created a lot of new technologies that are risky for humans to face. Plus, the general purpose AI could replace a lot of discrete/narrow AI services, so I feel like we would already have moved past the CAIS world at that point. BTW, if the service is not just a thin wrapper on top of a general purpose AI which is generally trustworthy, I also don't know how you'd know whether you can trust the answers that it gives.
I’d be interested in a list of services that you think would be helpful for addressing human safety problems. You might think of this as “our best current guess at metaphilosophy and metaphilosophy research”.
I could try to think in that direction after I get a better sense of what kinds of services might be both feasible and trustworthy in the CAIS world. It seems easy to become too optimistic/complacent under the CAIS model if I just try to imagine what safety-enhancing services might be helpful without worrying about whether those services would be feasible or how well they'd work at the time when they're needed.
Can you explain how you'd implement these services?
Not really. I think of CAIS as suggesting that we take an outside view that says "looking at how AI has been progressing, and how humans generally do things, we'll probably be able to do more and more complex tasks as time goes on". But the emphasis that CAIS places is that the things we'll be able to do will be domain-specific tasks, rather than getting a general-purpose reasoner. I don't have a detailed enough inside view to say how complex tasks might be implemented in practice.
I agree with the rest of what you said, which feels to me like considering a few possible inside-view scenarios and showing that they don't work.
One way to think about this is through the lens of iterated amplification. With iterated amplification, we also get the property that our AI systems will be able to do more and more complex tasks as time goes on. The key piece that enables this is the ability to decompose problems, so that iterated amplification always bottoms out into a tree of questions and subquestions down to leaves which the base agent can answer. You could think of (my conception of) CAIS as a claim that a similar process will happen in a decentralized way for all of ML by default, and at any point the things we can do will look like an explicit iterated amplification deliberation tree of depth one or two, where the leaves are individual services and the top level question will be some task that is accomplished through a combination of individual services.
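The decomposition idea can be sketched with a deliberately trivial task. This is an illustration under assumptions, not the actual iterated amplification algorithm: the names `base_agent`, `decompose`, `combine`, and `amplify` are hypothetical, the "base agent" can only add up to two numbers, and the `depth` argument stands in for the depth of the deliberation tree.

```python
from typing import List, Tuple

Question = Tuple[float, ...]  # toy "question": add up these numbers


def base_agent(question: Question) -> float:
    """Leaf-level competence: can only add up to two numbers."""
    if len(question) > 2:
        raise ValueError("base agent only handles up to two numbers")
    return sum(question)


def decompose(question: Question) -> List[Question]:
    """Split a question the base agent cannot answer into two subquestions."""
    if len(question) <= 2:
        return []  # already a leaf
    mid = len(question) // 2
    return [question[:mid], question[mid:]]


def combine(answers: List[float]) -> float:
    """Combining two sub-answers is itself a task the base agent can do."""
    return base_agent(tuple(answers))


def amplify(question: Question, depth: int) -> float:
    """Answer `question` with a deliberation tree of at most `depth` levels."""
    subquestions = decompose(question)
    if not subquestions:
        return base_agent(question)
    if depth == 0:
        raise ValueError("question too complex for this tree depth")
    return combine([amplify(q, depth - 1) for q in subquestions])
```

With these definitions, a four-number question is answerable at depth one and an eight-number question needs depth two; asking the eight-number question at depth one raises an error, which mirrors the claim that what we can do at any point is limited by how deep a tree of available services we can assemble.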
I could try to think in that direction after I get a better sense of what kinds of services might be both feasible and trustworthy in the CAIS world. It seems easy to become too optimistic/complacent under the CAIS model if I just try to imagine what safety-enhancing services might be helpful without worrying about whether those services would be feasible or how well they'd work at the time when they're needed.
Agreed, I'm making a bid for generating ideas without worrying about feasibility and trustworthiness, but not spending too much time on this and not taking the results too seriously.
Eric and I have exchanged a few emails since I posted this summary; I'm posting parts of that exchange here (with his permission), edited by me for conciseness and clarity. The paragraphs in the quotes are Eric's, but I have rearranged his paragraphs and omitted some of them for better flow in this comment.
There is a widespread intuition that AGI agents would by nature be more integrated, flexible, or efficient than comparable AI services. I am persuaded that this is wrong, and stems from an illusion of simplicity that results from hiding mechanism in a conceptually opaque box, a point that is argued at some length in Section 13.
Overall, I think that many of us have been in the habit of seeing flexible optimization itself as a problem, when optimization is instead (in the typical case) a strong constraint on a system’s behavior (see Section 8). Flexibility of computation in pursuit of optimization for bounded tasks seems simply useful, regardless of planning horizon, scope of considerations, or scope of required knowledge.
I agree that AGI agents hide mechanism in an opaque box. I also agree that the sort of optimization that current ML does, which is very task-focused, is a strong constraint on behavior. There seems to be a different sort of optimization that humans are capable of, where we can enter a new domain and perform well in it very quickly; I don't have a good understanding of that sort of optimization, and I think that's what the classic AGI agent risks are about.
Relatedly, I've used the words "monolithic AGI agent" a bunch in the summary and the post. Now, I want to instead talk about whether AI systems will be opaque and well-integrated, since that's the main crux of our disagreement. It's plausible to me that even if they are opaque and well-integrated, you don't get the classic AGI agent risks, because you don't get the sort of optimization I was talking about above.
In this connection, you cite the power of end-to-end training, but Section 17.4 (“General capabilities comprise many tasks and end-to-end relationships”) argues that, because diverse tasks encompass many end-to-end relationships, the idea that a broad set of tasks can be trained “end to end” is mistaken, a result of the narrowness of current trained systems in which services form chains rather than networks that are more wide than deep. We should instead expect that broad capabilities will best be implemented by sets of systems (or sets of end-to-end chains of systems) that comprise well-focused competencies: Systems that draw on distinct subtask competencies will typically be easier to train and provide more robust and general performance (Section 17.5). Modularity typically improves flexibility and generality, rather than impeding it.
Note that the ability to employ subtask components in multiple contexts constitutes a form of transfer learning, and [...] this transfer learning can carry with it task-specific aspects of behavioral alignment.
This seems like the main crux of the disagreement. My claim is that for any particular task, given enough compute, data and model size, an opaque, well-integrated, unstructured AI system will outperform a transparent, modular collection of services. This is only on the axis of performance at the task: I agree that the structured system will generalize better out of distribution (which leads to robustness, flexibility, and better transfer learning). I'm basing this primarily off of empirical evidence and intuitions:
Humans definitely solve large tasks in a very structured way; I hypothesize that this is because for those tasks the limits of human compute/data/brain size prevent us from getting the benefits of an unstructured heuristic approach.
Speaking of integration:
Regarding integration, I’ve argued that classic AGI-agent models neither simplify nor explain general AI capabilities (Section 13.3), including the integration of competencies. Whatever integration of functionality one expects to find inside an opaque AGI agent must be based on mechanisms that presumably apply equally well to integrating relatively transparent systems of services. These mechanisms can be dynamic, rather than static, and can include communication via opaque vector embeddings, jointly fine-tuning systems that perform often-repeated tasks, and matching of tasks to services (including service-development services) in semantically meaningful “task spaces” (discussed in Section 39 “Tiling task-space with AI services can provide general AI capabilities”).
[...]
Direct lateral links between competencies such as organic synthesis, celestial mechanics, ancient Greek, particle physics, image interpretation, algorithm design, traffic planning (etc.) are likely to be sparse, particularly when services perform object-level tasks. This sparseness is, I think, inherent in natural task-structures, quite independent of human cognitive limitations.
(The paragraphs above were written in a response to me while I was still using the phrase "AGI agents")
I expect that the more you integrate the systems of services, the more opaque they will become. The resulting system will be less interpretable; it will be harder to reason about what information particular services do not have access to (Section 9.4); and it will be harder to tell when malicious behavior is happening. The safety affordances identified in CAIS no longer apply because there is not enough modularity between services.
Re: sparseness inherent in task-structures, I think this is a result of human cognitive limitations but don't know how to argue more for that perspective.
That was the summary :P The full thing was quite a bit longer. I also didn't want to misquote Eric.
Maybe the shorter summary is: there are two axes which we can talk about. First, will systems be transparent, modular and structured (call this CAIS-like), or will they be opaque and well-integrated? Second, assuming that they are opaque and well-integrated, will they have the classic long-term goal-directed AGI-agent risks or not?
Eric and I disagree on the first one: my position is that for any particular task, while CAIS-like systems will be developed first, they will gradually be replaced by well-integrated ones, once we have enough compute, data, and model capacity.
I'm not sure how much Eric and I disagree on the second one: I think it's reasonable to predict that the resulting systems are specialized for particular bounded tasks and so won't be running broad searches for long-term plans. I would still worry about inner optimizers; I don't know what Eric thinks about that worry.
This summary is more focused on my beliefs than Eric's, and is probably not a good summary of the intent behind the original comment, which was "what does Eric think Rohin got wrong in his summary + opinion of CAIS", along with some commentary from me trying to clarify my beliefs.
Updates were mainly about actually carving up the space in the way above. Probably others, but I often find it hard to introspect on how my beliefs are updating.
I don't understand why this crux needs to be dichotomous. Setting aside the opacity question for the moment, why can't services in a CAIS be differentiable w.r.t. each other?
Example: Consider a language modeling service (L) that is consumed by several downstream tasks, including various text classifiers, an auto-correction service for keyboards, and a machine translation service. In the end-to-end view, it would be wise for these downstream services to use a language representation from L and to propagate their own error information back to L so that it can improve its shared representation. Since the downstream services ultimately make up L's raison d'etre, it will be obliged to do so.
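A heavily simplified sketch of this setup, assuming scalar parameters in place of real models: the shared service L is a single weight `w`, each downstream service is a single weight in `heads`, and the function names `total_loss` and `train_shared` are hypothetical. The point it illustrates is only that the gradient reaching `w` is the sum of the error signals propagated back from every downstream consumer.

```python
from typing import List, Tuple

XS = [0.5, 1.0, 1.5]      # toy inputs
TARGETS = [2.0, 3.0]      # downstream service k should learn y = TARGETS[k] * x


def total_loss(w: float, heads: List[float]) -> float:
    """Mean squared error summed over all downstream services."""
    return sum(
        (a * w * x - c * x) ** 2 for a, c in zip(heads, TARGETS) for x in XS
    ) / len(XS)


def train_shared(steps: int = 200, lr: float = 0.01) -> Tuple[float, float, float]:
    w = 0.5               # parameter of the shared service L
    heads = [1.0, 1.0]    # one parameter per downstream service
    initial = total_loss(w, heads)
    for _ in range(steps):
        grad_w = 0.0
        grad_heads = [0.0 for _ in heads]
        for k, (a, c) in enumerate(zip(heads, TARGETS)):
            for x in XS:
                err = a * w * x - c * x            # downstream prediction error
                grad_heads[k] += 2 * err * w * x / len(XS)
                grad_w += 2 * err * a * x / len(XS)  # error sent back to L
        w -= lr * grad_w
        heads = [a - lr * g for a, g in zip(heads, grad_heads)]
    return initial, total_loss(w, heads), w
```

Running `train_shared()` returns the loss before training, the loss after training, and the final shared weight, so one can check that the shared parameter moved in response to its consumers' errors.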
For situations that are not so neatly differentiable, we can describe the services network as a stochastic computation graph if there is a benefit for end-to-end learning the entire system. This should lead to a slightly more precise conjecture about the relationship between a CAIS agent and a utility-maximizing agent: A CAIS agent that can be described as a stochastic computation graph is equivalent to some utility-maximizing agent when trained end-to-end via approximate backpropagation.
It's likely that CAIS agents aren't usefully described as stochastic computation graphs, or that we may need to extend the usage of "stochastic computation graph" here to deal with services that create other services as offspring and attach them to the graph. But the possibility itself suggests a spectrum between the archetypal modular CAIS and an end-to-end CAIS, in which subgraphs of the services network are trained end-to-end. It's not obvious to me that the CAIS as defined in the text discounts this scenario, despite Eric's comments here.
I broadly agree, especially if you set aside opacity; I very rarely mean to imply a strict dichotomy.
I do think in the scenario you outlined the main issue would be opacity: the learned language representation would become more and more specialized between the various services, becoming less interpretable to humans and more "integrated" across services.
One way to test the "tasks don't overlap" idea is to have two nets do two different tasks, but connect their internal layers. Then see how high the weights on those layers get. Like, is the internal processing done by Mario AI useful for Greek translation at all? If it is then backprop etc should discover that.
Promoted to curated: I think the linked document is one of the most interesting things to be written in AI Alignment in the last year, and this is the best summary and commentary of it that currently exists. Quality-wise, I think everything that I have to say has already been covered by the other commenters, but overall I found reading the linked document, as well as this summary, to be quite helpful in my thinking about AI Alignment, though I also disagree with large parts of it (however, I am not at the research level, and so have a harder time judging how useful it would be for the people who are spending even more time thinking about AI Alignment).
Thanks a lot for writing this summary, and thanks a lot to Eric for all the work he is doing.
I trust past-me to have summarized CAIS much better than current-me; back when this post was written I had just finished reading CAIS for the third or fourth time, and I haven't read it since. (This isn't a compliment -- I read it multiple times because I had a lot of trouble understanding it.)
I've put in two points of my own in the post. First:
(My opinion: I think this isn't engaging with the worry with RL agents -- typically, we're worried about the setting where the RL agent is learning or planning at test time, which can happen in learn-to-learn and online learning settings, or even with vanilla RL if the learned policy has access to external memory and can implement a planning process separately from the training procedure.)
I agree even more with this two years later. There is an important point that CAIS makes, which is that learning is separate from competence. Nonetheless, the fact that an AI system must first learn about a domain before it can act in it does not mean that we will notice it doing so. The AI does not learn to take over the world by trying to take over the world and failing. Instead, it makes a plan to take over the world and learns about the relevant domains (e.g. if it wants to engineer a pandemic, it learns about genetics by reading textbooks) until it is confident that its plan will succeed. This can be true even if the AI was trained using PPO.
(To connect with current discourse, this is basically saying "this doesn't engage with mesa optimization")
Second:
(My opinion: It seems like the lesson of deep learning is that if you can do something end-to-end, that will work better than a structured approach. This has happened with computer vision, natural language processing, and seems to be in the process of happening with robotics. So I don't buy this -- while it seems true that we will get CAIS before AGI since structured approaches tend to be available sooner and to work with less compute, I expect that a monolithic AGI agent would outperform CAIS at most tasks once we can make one.)
I still agree with this, though I'd phrase it differently now. Now I would say that there is some level of data, model capacity, and compute at which an end-to-end / monolithic approach outperforms a structured approach on the training distribution (this is related to but not the same as the bitter lesson). However, at low levels of these three, the structured approach will typically perform better. The levels required for the end-to-end approach to work better depend on the particular task, and increase with task difficulty.
Since we expect all three of these factors to grow over time, I then expect that there will be an expanding Pareto frontier where at any given point the most complex tasks are performed by structured approaches, but as time progresses these are replaced by end-to-end / monolithic systems (but at the same time new, even more complex tasks are found, that can be done in a structured way).
(Really I expect this will be true up till human-level AI and a little past that, and after that who knows what happens.)
----
On CAIS itself:
I think the CAIS framing that Eric Drexler proposed gave concrete shape to a set of intuitions that many people have been relying on for their thinking about AGI. I also tend to think that those intuitions and models aren't actually very good at modeling AGI, but I nevertheless think it productively moved the discourse forward a good bit.
In particular, I am very grateful for the comment thread between Wei Dai and Rohin, which really helped me engage with the CAIS ideas, and which I think was necessary to get me to my current understanding of CAIS and to pass the basic ITT of CAIS (which I think I have succeeded at in a few conversations I've had since the report came out).
An additional reference that has not been brought up in the comments or the post is Gwern's writing on this, under the heading: "Why Tool AIs Want to Be Agent AIs"
My main objection to this idea is that it is a local solution, and doesn't have built-in mechanisms to become a global AI safety solution, that is, to prevent the creation of other AIs, which could be agential superintelligences. One can try to make "AI police" as a service, but it could be less effective than agential police.
Another objection is probably Gwern's idea that any Tool AI "wants" to become an agential AI.
This idea also excludes the robotic direction in AI development, which will produce agential AIs anyway.
If by agent we mean "system that takes actions in the real world", then services can be agents. As I understand it, Eric is only arguing against monolithic AGI agents that are optimizing a long-term utility function and that can learn/perform any task.
Current factory robots definitely look like a service, and even the soon-to-come robots-trained-with-deep-RL will be services. They execute particular learned behaviors.
If I remember correctly, Gwern's argument is basically that Agent AI will outcompete Tool AI because Agent AI can optimize things that Tool AI cannot, such as its own cognition. In the CAIS world, there are separate services that improve cognition, and so the CAIS services do get the benefit of ever-improving cognition, without being classical AGI agents. But overall I agree with this point (and disagree with Eric) because I expect there to be lots of gains to be had by removing the boundaries between services, at least where possible.
One can try to make “AI police” as a service, but it could be less effective than agential police.
This seems likely to me as well, especially since "service" is by definition bounded and agent is not.
Monitoring surveillance in order to see if anyone is breaking rules seems to be quite a bounded task, and in fact is one that we are already in the process of automating (using our current AI systems, which are basically all bounded).
Of course, there are lots of other tasks that are not as clear. But to the extent that you believe the Factored Cognition hypothesis, you should believe that we can make bounded services that nevertheless do a very good job.
Since the CAIS technical report is a gargantuan 210 page document, I figured I'd write a post to summarize it. I have focused on the earlier chapters, because I found those to be more important for understanding the core model. Later chapters speculate about more concrete details of how AI might develop, as well as the implications of the CAIS model on strategy. ETA: This comment provides updates based on more discussion with Eric.
The Model
The core idea is to look at the pathway by which we will develop general intelligence, rather than assuming that at some point we will get a superintelligent AGI agent. To predict how AI will progress in the future, we can look at how AI progresses currently -- through research and development (R&D) processes. AI researchers consider a problem, define a search space, formulate an objective, and use an optimization technique in order to obtain an AI system, called a service, that performs the task.
A service is an AI system that delivers bounded results for some task using bounded resources in bounded time. Superintelligent language translation would count as a service, even though it requires a very detailed understanding of the world, including engineering, history, science, etc. Episodic RL agents also count as services.
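As a rough illustration of what "bounded resources in bounded time" could look like operationally, here is a minimal sketch of a wrapper that runs a task under a step budget and a wall-clock limit. This is my own reading of "bounded", not a definition from the report; `run_bounded`, `BudgetExceeded`, and `count_to_five` are hypothetical names invented for this example.

```python
import time

class BudgetExceeded(Exception):
    """Raised when a task hits its step or time bound (hypothetical)."""

def run_bounded(task_step, state, max_steps, max_seconds):
    """Advance a task one step at a time until it finishes or a bound is hit.

    task_step takes the current state and returns (new_state, done).
    """
    deadline = time.monotonic() + max_seconds
    for _ in range(max_steps):
        if time.monotonic() > deadline:
            raise BudgetExceeded("time bound reached")
        state, done = task_step(state)
        if done:
            return state
    raise BudgetExceeded("step bound reached")

def count_to_five(n):
    """A stand-in task: increment a counter and stop at 5."""
    n += 1
    return n, n >= 5

result = run_bounded(count_to_five, 0, max_steps=10, max_seconds=1.0)
```

The point of the sketch is only that the bound is enforced from outside the task: the task either returns a result within its budget or is cut off.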
While each of the AI R&D subtasks is currently performed by a human, as AI progresses we should expect that we will automate these tasks as well. At that point, we will have automated R&D, leading to recursive technological improvement. This is not recursive self-improvement, because the improvement comes from R&D services creating improvements in basic AI building blocks, and those improvements feed back into the R&D services. All of this should happen before we get any powerful AGI agents that can do arbitrary general reasoning.
Why Comprehensive?
Since services are focused on particular tasks, you might think that they aren't general intelligence, since there would be some tasks for which there is no service. However, pretty much everything we do can be thought of as a task -- including the task of creating a new service. When we have a new task that we would like automated, our service-creating-service can create a new service for that task, perhaps by training a new AI system, or by taking a bunch of existing services and putting them together, etc. In this way, the collection of services can perform any task, and so as an aggregate is generally intelligent. As a result, we can call this Comprehensive AI Services, or CAIS. The "Comprehensive" in CAIS is the analog of the "General" in AGI. So, we'll have the capabilities of an AGI agent, before we can actually make a monolithic AGI agent.
Isn't this just as dangerous as AGI?
You might argue that each individual service must be dangerous, since it is superintelligent at its particular task. However, since the service is optimizing for some bounded task, it is not going to run a long-term planning process, and so it will not have any of the standard convergent instrumental subgoals (unless the subgoals are helpful for the task before reaching the bound).
In addition, all of the optimization pressure on the service is pushing it towards a particular narrow task. This sort of strong optimization tends to focus behavior. Any long-term planning processes that consider weird plans for achieving goals (similar to "break out of the box") will typically not find any such plan and will be eliminated in favor of cognition that will actually help achieve the task. Think of how a racecar is optimized for speed, while a bus is optimized for carrying passengers, rather than having a "generally capable vehicle".
It's also worth noting what we mean by superintelligent here. In this case, we mean that the service is extremely competent at its assigned task. It need not be learning at all. We see this distinction with RL agents -- when they are trained using something like PPO, they are learning, but at test time you can simply execute them without any PPO and they will perform the behavior they previously learned and won't change that behavior at all.
(My opinion: I think this isn't engaging with the worry with RL agents -- typically, we're worried about the setting where the RL agent is learning or planning at test time, which can happen in learn-to-learn and online learning settings, or even with vanilla RL if the learned policy has access to external memory and can implement a planning process separately from the training procedure.)
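The distinction between learning and executing can be sketched with a toy example. Note the assumptions: the "training" here is simple round-robin averaging over a three-armed bandit, not PPO, and `train` and `act` are hypothetical names. The sketch only shows that a frozen policy can be executed without any update rule running. It does not address the worry in my opinion above, where the learned policy itself does learning or planning at test time.

```python
# Toy illustration only: training is round-robin averaging, not PPO.

def train(arm_rewards, rounds=10):
    """Learning phase: estimate each arm's value by pulling it repeatedly."""
    values = [0.0] * len(arm_rewards)
    counts = [0] * len(arm_rewards)
    for _ in range(rounds):
        for arm, reward in enumerate(arm_rewards):
            counts[arm] += 1
            # Incremental mean of the rewards observed for this arm.
            values[arm] += (reward - values[arm]) / counts[arm]
    # Freeze the result: a tuple cannot be updated in place.
    return tuple(values)

def act(frozen_values):
    """Execution phase: pick the best-known arm; no update rule is called."""
    return max(range(len(frozen_values)), key=frozen_values.__getitem__)

policy = train([0.2, 0.9, 0.5])
choices = [act(policy) for _ in range(5)]
```

However many times `act` is called, `policy` stays the same, which is the sense in which the executed system "need not be learning at all".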
On a different note, you might argue that if we analyze the system of services as a whole, then it certainly looks generally intelligent, and so should be regarded as an AGI agent. However, "AGI agent" usually carries the anthropomorphic connotation of VNM rationality / expected utility maximization / goal-directedness. While it seems possible and even likely that each individual service can be well-modeled as VNM rational (albeit with a bounded utility function), it is not the case that a system of VNM rational agents will itself look VNM rational -- in fact, game theory is all about how systems of rational agents have weird behavior.
In addition, there are several aspects of CAIS that make it safer than a classic monolithic AGI agent. Under CAIS, each service interacts with other services via clearly defined channels of communication, so that the system is interpretable and transparent, even though each service may be opaque. We can reason about what information is present in the inputs to infer what the service could possibly know. We could also provide access to some capability through an external resource during training, so that the service doesn't develop that capability itself.
This interpretability allows us to monitor the service -- for example, we could look at which subservices it accesses in order to make sure it isn't doing anything crazy. But what if having a human in the loop leads to unacceptable delays? Well, this would only happen for deployed applications, where having a human in the loop seems expected, and should also be economically incentivized because it leads to better behavior. Basic AI R&D can continue to be improved autonomously without a human in the loop, so you could still see an intelligence explosion. Note that tactical tasks requiring quick reaction times probably would be delegated to AI services, but the important strategic decisions could still be left in human hands (assisted by AI services, of course).
What happens when we create AGI?
Well, it might not be valuable to create an AGI. We want to perform many different tasks, and it makes sense for these to be done by diverse services. It would not be competitive to include all capabilities in a single monolithic agent. This is analogous to how specialization of labor is a good idea for us humans.
(My opinion: It seems like the lesson of deep learning is that if you can do something end-to-end, that will work better than a structured approach. This has happened with computer vision, natural language processing, and seems to be in the process of happening with robotics. So I don't buy this -- while it seems true that we will get CAIS before AGI since structured approaches tend to be available sooner and to work with less compute, I expect that a monolithic AGI agent would outperform CAIS at most tasks once we can make one.)
That said, if we ever do build AGI, we can leverage the services from our CAIS-world in order to make it safe. We could use superintelligent security services to constrain any AGI agent that we build. For example, we could have services trained to identify long-term planning processes and to perform adversarial testing and red teaming.
Safety in the CAIS world
While CAIS suggests that we will not have AGI agents, this does not mean that we automatically get safety. We will still have AI systems that take high impact actions, and if they take even one wrong action of this sort it could be catastrophic. One way this could happen is if the system of services starts to show agentic behavior -- our standard AI safety work could apply to this scenario.
In order to ensure safety, we should have AI safety researchers figure out and codify the best development practices that need to be followed. For example, we could try to always use predictive models of human (dis)approval as a sanity check on any plan that is being enacted. We could also train AI services that can adversarially check new services to make sure they are safe.
Summary
The CAIS model suggests that before we get to a world with monolithic AGI agents, we will already have seen an intelligence explosion due to automated R&D. This reframes the problems of AI safety and has implications for what technical safety researchers should be doing.