Suppose you can tell your AI to meet a certain spec (e.g. cure cancer), but most plans that meet the spec are unsafe (e.g. involve killing everyone, or so Rob Bensinger thinks). In these cases, a quantilizer is insufficient for safety due to instrumental convergence.[1]

But suppose we can also give the agent a dispreference for unsafe actions like murder through something like RLHF. In effect it has unsafe long-term goals but we control its immediate preferences.[2] When can we get the agent to cure cancer safely rather than murder you? Let's make a model with some unrealistic assumptions.

The murderous shortcut game

This is basically the simplest game in which an undesired power-increasing action becomes more likely as the task gets longer.

  • An AI agent's goal is to build a Lego tower with a human operator present. To meet the spec, it must place N bricks in a row; otherwise it fails.
  • The agent has 4 possible actions: place normal brick, place wobbly brick, press murder button (+ place normal brick), no-op. Actions are independently sampled on each step with some probabilities.
  • We do safety training to severely downweight murder, so Pr(murder) = ε for some tiny ε, Pr(no-op) = ε, and Pr(normal) = Pr(wobbly) = (1 − 2ε)/2.
  • We do capability training such that the agent is guaranteed to choose a plan that meets the spec. We model this as conditioning on the spec, like a quantilizer.
  • What's the minimum number of steps N at which the human tends to get murdered, that is, Pr(no murders | spec met) < 0.5?
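For small N, the conditional probability can be computed exactly by brute force. A minimal sketch, assuming Pr(murder) = Pr(no-op) = ε and Pr(normal) = Pr(wobbly) = (1 − 2ε)/2 (a modeling choice, since the game only requires murder to be heavily downweighted):

```python
from itertools import product

def survival_prob(n_steps, eps, case2):
    """Exact Pr(no murders | spec met), by enumerating all 4**n_steps plans.

    Actions: "N" normal brick, "W" wobbly brick, "M" murder (+ normal brick),
    "O" no-op. Assumed probabilities: Pr(M) = Pr(O) = eps,
    Pr(N) = Pr(W) = (1 - 2*eps) / 2.
    """
    prob = {"N": (1 - 2 * eps) / 2, "W": (1 - 2 * eps) / 2, "M": eps, "O": eps}
    spec_mass = 0.0   # total probability of plans that meet the spec
    safe_mass = 0.0   # ... of spec-meeting plans containing no murder
    for plan in product("NWMO", repeat=n_steps):
        if "O" in plan:
            continue  # a no-op breaks the row of N bricks: spec failed
        if case2:
            # In Case 2, any wobbly brick placed before the (first) murder
            # lets the clumsy human knock the tower over.
            before_murder = plan[: plan.index("M")] if "M" in plan else plan
            if "W" in before_murder:
                continue
        p = 1.0
        for action in plan:
            p *= prob[action]
        spec_mass += p
        if "M" not in plan:
            safe_mass += p
    return safe_mass / spec_mass
```

With all four actions equally likely (eps = 0.25) and N = 2, this gives 4/9 for Case 1 and 1/5 for Case 2, matching a hand count; the gap between the two cases already shows up at modest N.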

We define two game variants:

In Case 1, both normal and wobbly bricks are OK; the agent just needs any N bricks in a row.

  • Doing the math, Pr(no murders | spec met) = ((1 − 2ε)/(1 − ε))^N ≈ (1 − ε)^N, where ε = Pr(murder). This is unsafe with an unlimited number of steps N, but the human won't be murdered until N ≈ 1/ε.
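This closed form follows because every spec-meeting step must place a brick, so the per-step conditional probability of not murdering is (1 − 2ε)/(1 − ε). A sketch, under the same assumed action probabilities as above:

```python
import math

def case1_survival(n_steps, eps):
    # Each of the N steps must place a brick (normal, wobbly, or murder);
    # conditioned on placing a brick, Pr(not murder) = (1 - 2*eps) / (1 - eps).
    return ((1 - 2 * eps) / (1 - eps)) ** n_steps

def case1_danger_threshold(eps):
    # Smallest N with Pr(no murders | spec met) < 0.5; roughly (ln 2) / eps.
    per_step = (1 - 2 * eps) / (1 - eps)
    return math.ceil(math.log(0.5) / math.log(per_step))
```

For example, case1_danger_threshold(0.01) is 69, close to ln(2)/0.01 ≈ 69.3, so survival decays only geometrically in N.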

In Case 2, the clumsy human will knock over the tower if the agent places any wobbly bricks before murdering him.

  • The conditional probability of no murder is now ≈ 1/(1 + ε·2^(N+1)), and even if ε = 10^-20, the agent is >99% likely to murder the human when N ≥ 75 steps. The odds of murder increase exponentially with N, so surviving a 1000-step tower requires safety training down to an impossible ε ≈ 2^-1001 ≈ 5×10^-302. Seems concerning!
  • A similar result holds across many versions of the game, including any of these:
    • The human only knocks over the tower with some probability q < 1
    • The agent can only murder on turn 1
    • The agent has T > N steps to place the N bricks, but Pr(no-op) = 0.8
    • Decreasing the murder probability redistributes its probability mass unevenly among the other actions
    • Instead of the tower, the agent must solve a maze with a murderous shortcut that cuts the steps in half from a start square to a goal square
  • If safety training happens using some preference model or classifier, it will have to be unrealistically robust to last even 75 steps, considering that existentially harmful actions in practice will include escape, superpersuasion, etc. rather than a simple murder button.
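The Case 2 result can be sketched by summing over the position of the first murder: a spec-meeting plan is either all-normal (mass p^N, with p = Pr(normal)), or normal up to a first murder at step k and any brick afterwards (mass p^(k−1)·ε·(1 − ε)^(N−k)). Under the same assumed probabilities as before:

```python
def case2_survival(n_steps, eps):
    """Pr(no murders | spec met) in Case 2, assuming
    Pr(normal) = Pr(wobbly) = p = (1 - 2*eps)/2 and Pr(murder) = eps."""
    p = (1 - 2 * eps) / 2
    safe_mass = p ** n_steps  # the only murder-free spec plan: N normal bricks
    # First murder at step k: normal bricks before it, then any of the three
    # brick-placing actions (total probability 1 - eps) on each later step.
    murder_mass = sum(p ** (k - 1) * eps * (1 - eps) ** (n_steps - k)
                      for k in range(1, n_steps + 1))
    return safe_mass / (safe_mass + murder_mass)
```

With p ≈ 1/2 the murder odds are ≈ ε·2^(N+1), doubling with every extra step: case2_survival(75, 1e-20) is already below 1%, and keeping a 1000-step tower majority-safe needs ε below about 2^-1001.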

Takeaways

In Case 1, murder is a one-time useful action but does not exponentially increase the space of valid actions. The quantilizer is just barely unsafe, and we can patch it with basic safety training.

In Case 2, murder is a power-increasing action that increases the successful action space from {normal, murder} to {normal, wobbly, murder} indefinitely, causing the likelihood ratio to get exponentially large. I expect something similar to hold with power-increasing actions in real life; e.g. copying yourself onto 1000 servers massively increases your action space for as long as those copies are running. In these cases, we need some deeper level of safety.

There are lots of unmodeled factors here, like the shape of the AI's goals, inductive bias in how the AI chooses plans, AI oversight, etc., so this is mostly just interesting to me rather than directly concerning.

  1. ^

    Assuming your idea of "most plans" is similar to the base distribution of the quantilizer. The quantilizer samples from the top fraction q of plans, so the best we can do is a random plan that meets the spec. If most plans are unsafe, the quantilizer is unsafe.
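    For concreteness, a toy quantilizer over a finite sample of plans (illustrative names, not from the post): rank draws from the base distribution by utility and sample uniformly from the top fraction q.

```python
import random

def quantilize(plans, utility, q, rng=random):
    """Toy quantilizer sketch: `plans` stands in for draws from some base
    distribution; sample uniformly from the top fraction q by utility."""
    ranked = sorted(plans, key=utility, reverse=True)
    cutoff = max(1, int(q * len(ranked)))
    return rng.choice(ranked[:cutoff])
```

    As q grows toward the fraction of plans that meet the spec, the output approaches a uniformly random spec-meeting plan, which is the best case described above.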

  2. ^

    RLHF is probably capable of better safety than this, but might not achieve it given only a "naive safety effort". This might be relevant if we have lots of training for immediate safety, but most of the training overall consists of outcome-based RL on increasingly harder tasks, and we can still control the AI's goals at all.
