Jebari and Lundborg have recently published an article entitled Artificial superintelligence and its limits: why AlphaZero cannot become a general agent. It focuses on the thorny issue of agency in superintelligent AIs. I’m glad to see more work on this crucial topic; however, I have significant disagreements with their terminology and argumentation, as I outline in this reply. Note that this reply was written rather quickly, and so may lack clarity in places, or fail to convey some nuances of the original article. I welcome comments and further responses.

Their paper runs roughly as follows: Jebari and Lundborg first discuss the belief–desire model for intentional action, under which agents act in ways that they believe will bring about their desires. They then distinguish between different degrees of generality that agents can have: “general agency is, loosely speaking, the ability to act in a diverse set of situations.” They consider thermostats to be very specialised agents, and dogs to be fairly general agents. They then introduce Legg and Hutter’s definition of intelligence as “an agent’s ability to achieve its goals in a wide range of environments”. This is, unfortunately, a misquotation of Legg and Hutter, and one which leads Jebari and Lundborg astray, as they follow it with:

Claim 1: If an agent has very specialized goals or desires, it can be superintelligent with regards to those desires without being a general agent.

The actual definition given in both A collection of definitions of intelligence and Universal intelligence is: “Intelligence measures an agent’s ability to achieve goals in a wide range of environments”. Note the important distinction between an agent which can achieve its goals versus one which can achieve goals in general. In the original context, it is clear that to be highly intelligent according to this definition, an agent doesn’t just need to be able to achieve a single (potentially specialised) set of goals, but rather a wide variety of them! More specifically, in Universal intelligence goals are formalised as a reward function implemented by the environment, with intelligence measured over all computable environments (and therefore all computable reward functions). So Legg and Hutter’s definition does not allow us to call a non-general agent superintelligent. Bostrom, too, defines superintelligence to refer specifically to "intellects that greatly outperform the best current human minds across many very general cognitive domains".
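
For concreteness, the formal measure defined in Universal intelligence is (reproducing it here from memory, so treat the exact notation as a sketch rather than a quotation):

$$\Upsilon(\pi) := \sum_{\mu \in E} 2^{-K(\mu)} \, V_\mu^\pi$$

where $E$ is the set of all computable environments, $K(\mu)$ is the Kolmogorov complexity of environment $\mu$, and $V_\mu^\pi$ is the expected total reward that agent $\pi$ accumulates in $\mu$. The sum ranges over every computable environment, and hence every computable reward function, so a high score requires doing well at many different goals rather than at a single specialised one.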

Perhaps Jebari and Lundborg aren’t making a claim based on Legg and Hutter’s definition, but rather trying to define a new concept: intelligence with respect to a set of desires. But this is still a highly misleading use of the term ‘intelligence’, because there are many possible desires whose achievement doesn’t require any of the cognitive faculties traditionally associated with intelligence (planning, memory, and so on). Consider an agent with a goal that is trivially easy to fulfil in almost all environments - for example, the goal of never taking any actions. Claim 1 implies that, even if this agent has no other cognitive faculties beyond the possession of that goal, it can nevertheless be superintelligent with respect to that goal, because it achieves the goal in almost all environments. Yet why would it be useful to conclude that an agent with nearly no cognitive abilities is superintelligent in any way? Or, perhaps more pithily: if this definition forces us to conclude that rocks could be superintelligent if only they possessed the goal of behaving in a rocklike way, we should suspect that it’s a misuse of the term.

In their next paragraph, Jebari and Lundborg argue that superintelligence also requires the ability to perform a task better than humans. But this doesn’t salvage Claim 1, because there are many tasks on which intelligence is not needed to outperform humans - e.g. both of the examples I discussed in the previous paragraph. Humans, unlike rocks, are notoriously bad at taking no actions; eventually we need to eat and sleep, or just get bored. They also note, in defense of their argument, that “the term ‘intelligence’ has a lot of connotations that are misleading”. But at some point, when you separate a term from all of its connotations, you are merely talking about a different concept. We can already say that objects have the property of x, or a strong ability or capability to do x; we do not need to claim that this is equivalent to them being superintelligent at x.

Having said all this, I do agree with them that it’s reasonable and indeed necessary to separate agency and intelligence. I think this will require us to move beyond Legg and Hutter’s definition of intelligence, in a way which I discuss briefly here and hope to explicate further soon.

 

Ascribing desires to simple agents

Jebari and Lundborg go on to note: “In many situations, machines acting in our environment are often described as trying to attain some general aim, for example, the goal of a vacuum cleaner robot is often described as “clean a room”… But vacuum cleaner robots are not literally trying to clean a room. Rather they follow a set of specific instructions that often results in cleaning a room.”

This is a very reasonable observation. To me, the implication is that it is not very useful to ascribe goals to vacuum cleaner robots; saying that they have the goal to “follow a set of specific instructions that often results in cleaning a room” adds nothing. In Dennett’s terminology, we can take the design stance towards such objects, and merely say that they have been designed to clean rooms without needing to invoke any goals belonging to the vacuum cleaner itself.

But Jebari and Lundborg disagree. For example, they claim that “[a] thermostat has no desire with respect to the temperature of the room. It only has a desire to activate the AC if the temperature, as registered by its thermometer, reaches a certain threshold.” They introduce a distinction between these and other desires: they call desires like the ones they ascribe to thermostats and vacuum cleaners unproductive desires.

I should note, however, that this is primarily a semantic dispute. I say that unproductive desires shouldn’t qualify as desires; Jebari and Lundborg think that they are a special type of desire. I think that their terminology is misleading because, like their definition of intelligence, it’s too permissive - it allows us to ascribe unproductive desires to rocks and other inanimate objects. And I expect that if this terminology becomes more common, there’ll be a lot of unnecessary contention between people who use this terminology and people who think that rocks and thermostats don’t really have desires. But it’s not a crucial issue.

 

Two ways of acquiring desires

A more serious disagreement arises in the subsequent section, where they argue that “an AI with a set of desires constrained to a specific domain cannot acquire desires relevant to other domains”, because “desires can only be reinforced from within”. We should immediately be suspicious of this argument, because it gives rise to a regress problem: how does an AI acquire its first desire in any given domain? In fact, Jebari and Lundborg are conflating two ways in which new desires can arise. The first is that new desires can arise by thinking about ways to achieve old desires; these are often called instrumental desires. But the second is that desires can arise via a modification to the system itself. For example, if I train an AI to play Starcraft, then that AI goes from an initial state in which it has no desires, to a state in which it desires (whether “productively” or “unproductively”) to win at Starcraft, due to the parameter updates implemented by its optimiser.
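
To make the second route concrete, here is a minimal sketch in Python. The environment, reward, and parameter count are all invented for illustration (a one-step, two-action toy task standing in for Starcraft): the policy is initialised with random parameters and “wants” nothing in particular, and it is the optimiser’s updates, external to the policy itself, that instil a reliable tendency to pick the rewarded action.

```python
import numpy as np

# Toy sketch: a randomly initialised policy plus an external optimiser.
# Only action 1 is rewarded; the policy starts with no preference for it.

rng = np.random.default_rng(0)

def action_probabilities(params):
    """Policy: a softmax over two actions, parameterised by two logits."""
    exp_logits = np.exp(params - params.max())  # subtract max for numerical stability
    return exp_logits / exp_logits.sum()

def environment_step(action):
    """Toy environment: reward 1 for action 1, reward 0 for action 0."""
    return 1.0 if action == 1 else 0.0

params = rng.normal(size=2)   # randomly initialised policy: no "desires" yet
learning_rate = 0.1

for episode in range(500):
    probs = action_probabilities(params)
    action = rng.choice(2, p=probs)
    reward = environment_step(action)

    # The optimiser: a REINFORCE-style update. It is this update rule, not
    # anything inside the initial policy, that pushes the policy towards
    # reliably choosing the rewarded action.
    grad_log_prob = np.eye(2)[action] - probs
    params += learning_rate * reward * grad_log_prob

print(action_probabilities(params))   # now heavily weighted towards action 1
```

Nothing in this sketch requires the policy to start with any desires at all; the goal-directed behaviour is produced by modifications to the system from outside it.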

In this case, I have described the situation as if there is a “policy” agent which chooses what actions to take, and a separate optimiser. But we could also think about this as one system: a single agent which contains a policy module and an optimiser module that updates the policy module. Some sources use the former terminology, some use the latter. Yet in either case, the “agent” starts off without general agency, and might develop general agency during training if the training environment is complex enough. Because of this, the distinction Jebari and Lundborg draw between “spontaneous emergence” and “accidental emergence” of generality seems to merely reflect the arbitrary choice of where we draw the line of what qualifies as an “agent”.

Jebari and Lundborg might reply that the combination of a randomly initialised policy and an optimiser already constitutes a general agent, because the optimiser is able to update the policy in a wide range of circumstances. Yet if so, they are incorrect to claim that Bostrom argues for the spontaneous emergence of a general agent from a non-general agent. In fact, the example Bostrom uses is of evolutionary search processes, which are a type of optimisation algorithm. So if an optimiser updating a randomly initialised policy constitutes a general agent, then so does Bostrom’s proposed evolutionary search process. But I think that calling either of these a “general” agent is too much of a stretch; they are simple, blind optimisation processes that are able to give rise to general agents. I consider this distinction to be crucial in thinking about agency and what behaviour we should expect from superintelligences.

Comments

I skimmed the paper when they announced it on Twitter. It seemed like it fundamentally ignores every possibility vaguely like mesa-optimization or imitation learning, and can't deal with things like, say, GPT-3 meta-learning agency to better predict data derived from agents (i.e. humans). They leave themselves an out by handwaving away all such inconveniences as 'iron ore agents', but then it's thoroughly useless and circular; "what's an iron ore agent?" "It's one which has dangerous outcomes due to hidden agency." "OK, which agents are those, how can you tell AlphaZero from GPT-3 from AGI?" "Well, try them and see!"

I agree with almost every element of your response, especially the part about how desires emerge in an AI.

Yet I do find the concept of productive and unproductive desires interesting. It seems to point at a distinction between types of goals. For example, the difference between a robot that twitches (its code literally just makes it twitch) and a robot that wants to twitch (an optimizer trying to maximize the number of times it twitches before getting destroyed). The latter is clearly more goal-directed than the former, and I only fear instrumental subgoals for the latter, not the former. I guess you could say that the former just doesn't have a goal, and that's alright if there really is a binary switch between "productive" and "unproductive". But if there is a continuum between the two, then thinking about the end points in terms of goals is relevant in interpreting the degrees of productiveness of goals in the middle.

But if there is a continuum between the two, then thinking about the end points in terms of goals is relevant in interpreting the degrees of productiveness of goals in the middle.

I don't see why this is the case - you can just think about the continuum from non-goal to goal instead, which should get you the same benefits.

Yeah, rereading myself, you're right. I think the important thing I wanted to say is just that productive/unproductive desires or goals seem like an interesting idea for formalizing an aspect of goal-directedness.