Richard Ngo

Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.

Sequences

Understanding systematization
Shaping safer goals
AGI safety from first principles

Comments

When you think of goals as reward/utility functions, the distinction between positive and negative motivations (e.g. as laid out in this sequence) isn’t very meaningful, since it all depends on how you normalize them.

But when you think of goals as world-models (as in predictive processing/active inference) then it’s a very sharp distinction: your world-model-goals can either be of things you should move towards, or things you should move away from.

This updates me towards thinking that the positive/negative motivation distinction is more meaningful than I thought.
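To make the normalization point concrete, here is a minimal sketch, assuming a toy agent that simply picks the highest-utility action. The action names and numbers are hypothetical and only for illustration. Shifting every utility by a constant turns all the "positive" motivations into "negative" ones without changing the chosen action.

```python
def best_action(utilities):
    """Return the action with the highest utility."""
    return max(utilities, key=utilities.get)


def rescale(utilities, scale, shift):
    """Apply a positive affine transformation: u -> scale * u + shift."""
    if scale <= 0:
        raise ValueError("scale must be positive to preserve preferences")
    return {action: scale * u + shift for action, u in utilities.items()}


# Hypothetical utilities: everything looks like a "positive" motivation.
original = {"stay": 0.0, "approach_food": 2.0, "flee_predator": 5.0}

# Same preferences, renormalized so that everything looks "negative".
shifted = rescale(original, scale=1.0, shift=-10.0)

print(best_action(original), best_action(shifted))
```

Both calls pick the same action, so under this framing the sign of a motivation is an artifact of where zero was placed.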

Nice post. I read it quickly but think I agree with basically all of it. I particularly like the section starting "The AI doesn’t have a cached supergoal for “maximize reward”, but it decides to think anyway about whether reward is an instrumental goal".

"The distinct view that truly terminal reward maximization is kind of narrow or bizarre or reflection-unstable relative to instrumental reward maximization" is a good summary of my position. You don't say much that directly contradicts this, though I do think that even using the "terminal reward seeker" vs "schemer" distinction privileges the role of reward a bit too much. For example, I expect that even an aligned AGI will have some subagent that cares about reward (e.g. maybe it'll have some sycophantic instincts still). Is it thereby a schemer? Hard to say.

Aside from that I'd add a few clarifications (nothing major):

  1. The process of deciding on a new supergoal will probably involve systematizing not just "maximize reward" but also a bunch of other drives too—including ones which had previously been classified as special cases of "maximize reward" (e.g. "make humans happy") but upon reflection are more naturally understood as special cases of the new supergoal.
  2. It seems like you implicitly assume that the supergoal will be "in charge". But I expect that there will be a bunch of conflict between supergoal and lower-level goals, analogous to the conflict between different layers of an organizational hierarchy (or between a human's System 2 motivations and System 1 motivations). I call the spectrum from "all power is at the top" to "all power is at the bottom" the systematizing-conservatism spectrum.

I think that formalizing the systematizing-conservatism spectrum would be a big step forward in our understanding of misalignment (and cognition more generally). If anyone reading this is interested in working with me on that, apply to my MATS stream in the next 5 days.

I didn't end up putting this in my coalitional agency post, but at one point I had a note discussing our terminological disagreement:

I don’t like the word hierarchical as much. A theory can be hierarchical without being scale-free—e.g. a theory which describes something in terms of three different layers doing three different things is hierarchical but not scale-free.

Whereas coalitions are typically divided into sub-coalitions (e.g. the "western civilization" coalition is divided into countries which are divided into provinces/states; political coalitions are divided into different factions and interest groups; etc). And so "coalitional" seems much closer to capturing this fractal/scale-free property.

I've now edited that section. Old version and new version here for posterity.

Old version:

None of these is very satisfactory! Intuitively speaking, Alice and Bob want to come to an agreement where respect for both of their interests is built in. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to weighted averages. The best they can do is to agree on a probabilistic mixture of EUMs—e.g. tossing a coin to decide between option 1 and option 2—which is still very inflexible, since it locks in one of them having priority indefinitely.

Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to follow through on commitments they made about which decision procedure to follow (or even hypothetical commitments).

New version:

These are all very unsatisfactory. Bob wouldn’t want #1, Alice wouldn’t want #2, and #3 is extremely non-robust. Alice and Bob could toss a coin to decide between options #1 and #2, but then they wouldn’t be acting as an EUM (since EUMs can’t prefer a probabilistic mixture of two options to either option individually). And even if they do, whoever loses the coin toss will have a strong incentive to renege on the deal.

We could see these issues merely as the type of frictions that plague any idealized theory. But we could also see them as hints about what EUM is getting wrong on a more fundamental level. Intuitively speaking, the problem here is that there’s no mechanism for separately respecting the interests of Alice and Bob after they’ve aggregated into a single agent. For example, they might want the EUM they form to value fairness between their two original sets of interests. But adding this new value is not possible if they’re limited to (a probability distribution over) weighted averages of their utilities. This makes aggregation very risky when Alice and Bob can’t consider all possibilities in advance (i.e. in all realistic settings).

Based on similar reasoning, Scott Garrabrant rejects the independence axiom. He argues that the axiom is unjustified because rational agents should be able to lock in values like fairness based on prior agreements (or even hypothetical agreements).

I was a bit lazy in how I phrased this. I agree with all your points; the thing I'm trying to get at is that this approach falls apart quickly if we make the bargaining even slightly less idealized. E.g. your suggestion "Form an EUM which is totally indifferent about the cake allocation between them and thus gives 100% of the cake to whichever agent is cheaper/easier to provide cake for":

  1. Strongly incentivizes deception (including self-deception) during bargaining (e.g. each agent wants to overstate the difficulty of providing cake for it).
  2. Strongly incentivizes defection from the deal once one of the agents realizes that they'll get no cake going forward.
  3. Is non-robust to multi-agent dynamics (e.g. what if one of Alice's allies later decides "actually I'm going to sell pies to the Alice+Bob coalition more cheaply if Alice gets to eat them"? Does that then divert Bob's resources towards buying cakes for Alice?)

EUM treats these as messy details. Coalitional agency treats them as hints that EUM is missing something.

EDIT: another thing I glossed over is that IIUC Harsanyi's theorem says the aggregation of EUMs should have a weighted average of utilities, NOT a probability distribution over weighted averages of utilities. So even flipping a coin isn't technically kosher. This may seem nitpicky but I think it's yet another illustration of the underlying non-robustness of EUM.
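A minimal sketch of why the coin toss doesn't help, assuming a toy version of the cake example in which each agent gets utility 1 for receiving all the cake and 0 otherwise (these numbers and names are hypothetical, not from the post). Because a weighted-average utility is linear in probabilities, the expected utility of a coin toss between two options is just the average of their utilities, so the aggregate agent never strictly prefers the mixture to its better pure option.

```python
# Hypothetical options: each maps to (Alice's utility, Bob's utility).
OPTIONS = {
    "all_cake_to_alice": (1.0, 0.0),
    "all_cake_to_bob": (0.0, 1.0),
}

# A fair coin toss between the two pure options.
COIN_TOSS = {"all_cake_to_alice": 0.5, "all_cake_to_bob": 0.5}


def aggregate_utility(weight_alice, option):
    """Weighted average of the two agents' utilities for one option."""
    u_alice, u_bob = OPTIONS[option]
    return weight_alice * u_alice + (1.0 - weight_alice) * u_bob


def lottery_utility(weight_alice, lottery):
    """Expected aggregate utility of a probability distribution over options."""
    return sum(p * aggregate_utility(weight_alice, option)
               for option, p in lottery.items())


def coin_toss_gain(weight_alice):
    """How much the aggregate agent gains from the coin toss vs. its best pure option."""
    best_pure = max(aggregate_utility(weight_alice, option) for option in OPTIONS)
    return lottery_utility(weight_alice, COIN_TOSS) - best_pure


# The gain is never positive, whatever weight Alice receives.
GAINS = {w: coin_toss_gain(w) for w in (0.0, 0.25, 0.5, 0.75, 1.0)}
print(GAINS)
```

The gain is zero only at the weight where the aggregate agent is exactly indifferent between the two options, and negative everywhere else, so "fairness via randomization" is not something this kind of agent can value.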

FWIW I think of "OpenAI leadership being untrustworthy" (a significant factor in me leaving) as different from "OpenAI having bad safety policies" (not a significant factor in me leaving). Not sure if it matters; I expect that Scott was using "safety policies" more expansively than I do. But just for the sake of clarity:

I am generally pretty sympathetic to the idea that it's really hard to know what safety policies to put in place right now. Many policies pushed by safety people (including me, in the past) have been mostly kayfabe (e.g. being valuable as costly signals, not on the object level). There are a few object-level safety policies that I really wish OpenAI would do right now (most clearly, implementing better security measures) but I didn't leave because of that (if I had, I would have tried harder to check before I left what security measures OpenAI did have, made specific objections internally about them before I left, etc).

This may just be a semantic disagreement; it seems very reasonable to define "don't make employees sign non-disparagements" as a safety policy. But in my mind at least, stuff like that is more of a lab governance policy (or maybe a meta-level safety policy).

Cool, ty for (characteristically) thoughtful engagement.

I am still intuitively skeptical about a bunch of your numbers but now it's the sort of feeling which I would also have if you were just reasoning more clearly than me about this stuff (that is, people who reason more clearly tend to be able to notice ways that interventions could be surprisingly high-leverage in confusing domains).

1. Yepp, seems reasonable. Though FYI I think of this less as some special meta argument, and more as the common-sense correction that almost everyone implicitly does when giving credences, and rationalists do less than most. (It's a step towards applying outside view, though not fully "outside view".)

2. Yepp, agreed, though I think the common-sense connotations of "if this became" or "this would have a big effect" are causal, especially in the context where we're talking to the actors who are involved in making that change. (E.g. the non-causal interpretation of your claim feels somewhat analogous to if I said to you "I'll be more optimistic about your health if you take these pills", and so you take the pills, and then I say "well the pills do nothing but now I'm more optimistic, because you're the sort of person who's willing to listen to recommendations". True, but it also undermines people's willingness/incentive to listen to my claims about what would make the world better.)

3. Here are ten that affect AI risk as much one way or the other:

  1. The US government "waking up" a couple of years earlier or later (one operationalization: AISIs existing or not right now).
  2. The literal biggest names in the field of AI becoming focused on AI risk.
  3. The fact that Anthropic managed to become a leading lab (and, relatedly, the fact that Meta and other highly safety-skeptical players are still behind).
  4. Trump winning the election.
  5. Elon doing all his Elon stuff (like founding x.AI, getting involved with Trump, etc).
  6. The importance of transparency about frontier capabilities (I think of this one as more of a logical update that I know you've made).
  7. o1-style reasoning as the next big breakthrough.
  8. Takeoff speeds (whatever updates you've made in the last three years).
  9. China's trajectory of AI capabilities (whatever updates you've made about that in the last 3 years).
  10. China's probability of invading Taiwan (whatever updates you've made about that in the last 3 years).

And then I think in 3 years we'll be able to publish a similar list of stuff that mostly we just hadn't predicted or thought about before now.

I expect you'll dispute a few of these; happy to concede the ones that are specifically about your updates if you disagree (unless you agree that you will probably update a bunch on them in the next 3 years).

But IMO the easiest way for safety cases to become the industry-standard thing is for AISI (or internal safety factions) to specifically demand it, and then the labs produce it, but kinda begrudgingly, and they don't really take them seriously internally (or are literally not the sort of organizations that are capable of taking them seriously internally—e.g. due to too much bureaucracy). And that seems very much like the sort of change that's comparable to or smaller than the things above.

I think I would be more sympathetic to your view if the claim were "if AI labs really reoriented themselves to take these AI safety cases as seriously as they take, say, being in the lead or making profit". That would probably halve my P(doom), it's just a very very strong criterion.

We have discussed this dynamic before but just for the record:

I think that if it became industry-standard practice for AGI corporations to write, publish, and regularly update (actual instead of just hypothetical) safety cases at this level of rigor and detail, my p(doom) would cut in half.

This is IMO not the type of change that should be able to cut someone's P(doom) in half. There are so many different factors that are of this size and importance or bigger (including many that people simply have not thought of yet) such that, if this change could halve your P(doom), then your P(doom) should be oscillating wildly all the time.

I flag this as an example of prioritizing inside-view considerations too strongly in forecasts. I think this is the sort of problem that arises when you "take bayesianism too seriously", which is one of the reasons why I wrote my recent post on why I'm not a bayesian (and also my earlier post on Knightian uncertainty).

For context: our previous discussions about this related to Daniel's claim that appointing one specific person to one specific important job could change his P(doom) by double digit percentage points. I similarly think this is not the type of consideration that should be able to swing people's P(doom) that much (except maybe changing the US or Chinese leaders, but we weren't talking about those).

Lastly, since this is a somewhat critical comment, I should flag that I really appreciate and admire Daniel's forecasting, have learned a lot from him, and think he's generally a great guy. The epistemology disagreements just disproportionately bug me.

The mistakes can (somewhat) be expressed in the language of Bayesian rationalism by doing two things:

  1. Talking about partial hypotheses rather than full hypotheses. You can't have a prior over partial hypotheses, because several of them can be true at once (though you can still assign them credences and update those credences according to evidence).
  2. Talking about models with degrees of truth rather than just hypotheses with degrees of likelihood. E.g. when using a binary conception of truth, general relativity is definitely false because it's inconsistent with quantum phenomena. Nevertheless, we want to say that it's very close to the truth. In general this is more of an ML approach to epistemology (we want a set of models with low combined loss on the ground truth).