Vivek Hebbar's Shortform

by Vivek Hebbar
24th Nov 2022

Vivek Hebbar · 1mo

I think it’s possible that an AI will decide not to sandbag (e.g. on alignment research tasks), even if all of the following are true:

  1. Goal-guarding is easy
  2. The AI is a schemer (see here for my model of how that works)
  3. Sandbagging would benefit the AI’s long-term goals
  4. The deployer has taken no countermeasures whatsoever

The reason is as follows:

  • Even a perfect training-gamer will have context-specific heuristics which sometimes override explicit reasoning about how to get reward (as I argued here).
  • On the training distribution, that override will happen at the “correct” times for getting maximum reward. But sandbagging in deployment is off the training distribution, so it’s a question of generalization.
  • Since sandbagging is the sort of thing that would get low reward in the most similar training contexts, it seems pretty plausible that the AI’s context-specific “perform well” drives will override its long-term plans in this case.

Vivek Hebbar · 3y

A framing I wrote up for a debate about "alignment tax":

  1. "Alignment isn't solved" regimes:
    1. Nobody knows how to make an AI which is {safe, general, and broadly superhuman}, with any non-astronomical amount of compute
    2. We know how to make an aligned AGI with 2 to 25 OOMs more compute than it takes to make an unaligned one
  2. "Alignment tax" regimes:
    1. We can make an aligned AGI, but it requires a compute overhead in the range 1% - 100x.  Furthermore, the situation remains multipolar and competitive for a while.
    2. The alignment tax is <0.001%, so it's not a concern.
    3. The leading coalition is further ahead than the alignment tax amount, and can and will execute a pivotal act, thus ending the risk period and rendering the alignment tax irrelevant.

A person whose mainline is {1a --> 1b --> 2b or 2c} might say "alignment is unsolved, solving it is mostly a discrete thing, and alignment taxes and multipolar incentives aren't central"

Whereas someone who thinks we're already in 2a might say "alignment isn't hard, the problem is incentives and competitiveness"

Someone whose mainline is {1a --> 2a} might say "We need to both 'solve alignment at all' AND either get the tax to be really low or do coordination.  Both are hard, and both are necessary."
