This is an interesting perspective on the AI safety problem. I really like the ethos of this post, where there isn't a huge opposition between AI capabilities and AI safety, but instead we are simply trying to figure out how to use the (helpful!) capabilities developed by AI researchers to do useful things.
If I think about this from the perspective of reducing existential risk, it seems like you would also need to make the argument that AI systems are unlikely to pose a great threat before they are human-level (a claim I mostly agree with), or that the solutions will generalize to sub-human-level AI systems. Is there a reason this isn't in the post? I worry that I'm not properly understanding the motivations or generators behind the post.
Reading this post was the first time I felt I understood what Paul's (and many others') research was motivated by. I think about it regularly, and it comes up in conversation a fair bit.
Most AI research focuses on reproducing human abilities: to learn, infer, and reason; to perceive, plan, and predict. There is a complementary problem which (understandably) receives much less attention: if you had these abilities, what would you do with them?
The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities?
This post explains what the steering problem is and why I think it’s worth spending time on.
Introduction
A capable, well-motivated human can be extremely useful: they can work without oversight, produce results that need not be double-checked, and work towards goals that aren’t precisely defined. These capabilities are critical in domains where decisions cannot be easily supervised, whether because they are too fast, too complex, or too numerous.
In some sense “be as useful as possible” is just another task at which a machine might reach human-level performance. But it is different from the concrete capabilities normally considered in AI research.
We can say clearly what it means to "predict well," "plan well," or "reason well." If we ignored computational limits, machines could achieve any of these goals today. And before the existing vision of AI is realized, we must necessarily achieve each of these goals.
For now, "be as useful as possible" is in a different category. We can't say exactly what it means. We could not do it no matter how fast our computers could compute. And even if we resolved the most salient challenges in AI, we could remain in the dark about this one.
Consider a capable AI tasked with running an academic conference. How should it use its capabilities to make decisions?
Everyday experience with humans shows how hard delegation can be, and how much easier it is to assign a task to someone who actually cares about the outcome.
Of course there is already pressure to write useful programs in addition to smart programs, and some AI research studies how to efficiently and robustly communicate desired behaviors. For now, available solutions apply only in limited domains or to weak agents. The steering problem is to close this gap.
Motivation
A system which "merely" predicted well would be extraordinarily useful. Why does it matter whether we know how to make a system which is “as useful as possible”?
Our machines will probably do some things very effectively. We know what it means to "act well" in the service of a given goal. For example, using human cognitive abilities as a black box, we could probably design autonomous corporations which very effectively maximized growth. If the black box was cheaper than the real thing, such autonomous corporations could displace their conventional competitors.
If machines can do everything equally well, then this would be great news. If not, society’s direction may be profoundly influenced by what can and cannot be done easily. For example, if we can only maximize what we can precisely define, we may inadvertently end up with a world filled with machines trying their hardest to build bigger factories and better widgets, uninterested in anything we consider intrinsically valuable.
All technologies are more useful for some tasks than others, but machine intelligence might be particularly problematic because it can entrench itself. For example, a rational profit-maximizing corporation might distribute itself throughout the world, pay people to help protect it, make well-crafted moral appeals for equal treatment, or campaign to change policy. Although such corporations could bring large benefits in the short term, in the long run they may be difficult or impossible to uproot, even once they serve no one’s interests.
Why now?
Reproducing human abilities gets a lot of deserved attention. Figuring out exactly what you’d do once you succeed feels like planning the celebration before the victory: it might be interesting, but why can’t it wait?
But at large scales it becomes hard to speed up progress on a problem simply by adding researchers. Fewer people working for longer may ultimately be more efficient (even if earlier researchers are at a disadvantage). This is particularly pressing if we expect to eventually want to invest much more effort in the steering problem.
Section 3 of the full document discusses some other reasons not to work on the steering problem: Is work done now likely to be relevant? Is there any concrete work to do now? Should we wait until we can do experiments? Aren't there already adequate incentives to resolve this problem?
Defining the problem precisely
Recall our problem statement:
The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities?
We’ll adopt a particular human, Hugh, as our “well-motivated human:” we’ll assume that we have black-box access to Hugh-level cognitive abilities, and we’ll try to write a program which is as useful as Hugh.
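To make the setup concrete, here is a minimal sketch of the shape of the problem (the names and interface are hypothetical illustrations, not part of the formal statement): we are handed an opaque Hugh-level ability and must assemble it into a program whose behavior is as useful as Hugh's.

```python
from typing import Callable

# A Hugh-level cognitive ability, treated as an opaque black box: it maps a
# natural-language query (plus any relevant context) to an answer roughly as
# good as the one Hugh would give. We get to call it, not inspect it.
HughLevelBox = Callable[[str], str]


def steered_agent(ability: HughLevelBox, task_description: str) -> str:
    """Fill in this body so that, for the tasks we care about, the result is
    as useful as handing the task to a capable, well-motivated Hugh.

    The naive attempt below just forwards the task to the box once. Nothing
    guarantees the answer is aimed at what the user actually wants, which is
    exactly the gap the steering problem names.
    """
    return ability("Please help with the following task: " + task_description)
```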
Abilities
In reality, AI research yields complicated sets of related abilities, with rich internal structure and no simple performance guarantees. But in order to do concrete work in advance, we will model abilities as black boxes with well-defined contracts.
We’re particularly interested in tasks which are “AI complete” in the sense that human-level performance on that task could be used as a black box to achieve human-level performance on a very wide range of tasks. For now, we’ll further focus on domains where performance can be unambiguously defined.
Some examples:
When talking about Hugh’s predictions, judgments, or decisions, we imagine that Hugh has access to a reasonably powerful computer, which he can use to process or display data. For example, if Hugh is given the binary data from a camera, he can render it on a screen in order to make predictions about it.
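As an illustration of what a "black box with a well-defined contract" might look like (a hypothetical sketch, not one of the document's original examples), consider a sequence predictor whose guarantee is stated purely in terms of measurable performance:

```python
from typing import Protocol, Sequence


class Predictor(Protocol):
    """Black-box contract for one ability: given the bits seen so far, return
    the probability that the next bit is 1. The contract is purely behavioral
    (e.g. "log loss no worse than Hugh's after seeing the same data") and says
    nothing about how the box works internally.
    """

    def next_bit_probability(self, history: Sequence[int]) -> float:
        ...
```

Performance here is unambiguous (log loss on held-out data), which is what lets us treat the ability as a module without opening it up.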
We can also consider a particularly degenerate ability:
Although unlimited computation seems exceptionally powerful, it’s not immediately clear how to solve the steering problem even using such an extreme ability.
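To see why, here is a hedged sketch of the degenerate ability (the interface is invented for illustration): even with it, the hard part is writing down what to ask for.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def unlimited_compute(thunk: Callable[[], T]) -> T:
    """Degenerate black box: returns the result of any (terminating)
    computation; pretend the call costs nothing."""
    return thunk()


# Even with this box in hand, using it to solve the steering problem would
# require a formal objective to search for -- something like "the program
# whose behavior is most useful to the user" -- and that is precisely the
# specification we do not know how to write down.
```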
Measuring usefulness
What does it mean for a program to be “as useful” as Hugh?
We’ll start by defining “as useful for X as Hugh,” and then we will informally say that a program is “as useful” as Hugh if it’s as useful for the tasks we care most about.
Consider H, a black box that simulates Hugh or perhaps consults a version of Hugh who is working remotely. We’ll suppose that running H takes the same amount of time as consulting our Hugh-level black boxes. A project to accomplish X could potentially use as many copies of H as it can afford to run.
A program P is more useful than Hugh for X if, for every project using H to accomplish X, we can efficiently transform it into a new project which uses P to accomplish X. The new project shouldn’t be much more expensive---it shouldn’t take much longer, use much more computation or many additional resources, involve much more human labor, or have significant additional side-effects.
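One way to make this quantitative (the cost function and the slack factor $1+\epsilon$ are illustrative choices, not part of the original statement): $P$ is more useful than Hugh for $X$ if, for every project $\pi$ that accomplishes $X$ using copies of $H$, we can efficiently construct a project $\pi'$ that accomplishes $X$ using copies of $P$ with

$$\mathrm{cost}(\pi') \le (1+\epsilon)\cdot \mathrm{cost}(\pi),$$

where $\mathrm{cost}$ aggregates time, computation, other resources, human labor, and side-effects, and $\epsilon$ is small.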
Well-motivated
What does it mean for Hugh to be well-motivated?
The easiest approach is universal quantification: for any human Hugh, if we run our program using Hugh-level black boxes, it should be as useful as Hugh.
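In symbols (an informal rendering; $B_{\text{Hugh}}$ denotes black boxes at Hugh's level and "useful" is the notion defined above):

$$\forall\, \text{Hugh}: \quad P[B_{\text{Hugh}}] \text{ is as useful as Hugh.}$$

Since this must hold for every Hugh, it holds in particular when the abilities come from a well-motivated one, which is how this formulation sidesteps defining "well-motivated" directly.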
Alternatively, we can leverage our intuitive sense of what it means for someone to be well-motivated to do X, and define “well-motivated” to mean “motivated to help the user’s project succeed.”
Scaling up
If we are given better black boxes, we should make a better program. This is captured by the requirement that our program should be as useful as Hugh, no matter how capable Hugh is (as long as the black boxes are equally capable).
Ideally, our solutions should scale far past human-level abilities. This is not a theoretical concern---in many domains computers already have significantly superhuman abilities. This requirement is harder to make precise, because we can no longer talk about the “human benchmark.” But in general, we would like to build systems which are (1) working towards their owner’s interests, and (2) nearly as effective as the best goal-directed systems that can be built using the available abilities. The ideal solution to the steering problem will have these characteristics in general, even when the black-box abilities are radically superhuman.
This is an abridged version of this document from 2014; most of the document is now superseded by later posts in this sequence.
Tomorrow's AI Alignment Forum sequences post will be 'Embedded Agency (text)' in the sequence Embedded Agency, by Scott Garrabrant and Abram Demski.
The next post in this sequence will come out on Thursday 15th November, and will be 'Clarifying "AI Alignment"' by Paul Christiano.