All of jon_crescent's Comments + Replies

The type 1 vs. type 2 feedback distinction here seems really central. I'm interested in whether this seems like a fair characterization to both of you.

Type 1: Feedback which we use for training (via gradient descent)
Type 2: Feedback which we use to decide whether to deploy the trained agent.

(There's a bit of a gray area between Type 1 and Type 2, since choosing whether to deploy is another form of selection, but I'm assuming we're okay stating that gradient descent and model selection operate in qualitatively distinct regimes.)
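For concreteness, here is a minimal toy sketch of how I'm picturing the two roles (all names, data, and the threshold are illustrative, not anyone's actual setup): type 1 feedback is a differentiable loss we run gradient descent on, while type 2 feedback is a held-out evaluation that only gates deployment and never produces gradients.

```python
# Toy illustration (hypothetical names/threshold): type 1 feedback drives gradient
# descent; type 2 feedback is only used to decide whether to deploy the result.
import numpy as np

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 3)), rng.normal(size=100)  # type 1 data
X_eval, y_eval = rng.normal(size=(30, 3)), rng.normal(size=30)      # type 2 data

w = np.zeros(3)
for _ in range(500):  # type 1: gradient descent on the training loss
    grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.1 * grad

eval_loss = np.mean((X_eval @ w - y_eval) ** 2)  # type 2: never used for gradients
deploy = eval_loss < 1.5                         # hypothetical deployment threshold
print(f"type 2 eval loss = {eval_loss:.3f}, deploy = {deploy}")
```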

The key disagreement is whether we expect type ... (read more)

2Paul Christiano
I think that by default we will search for ways to build systems that do well on type 2 feedback. We do likely have a large dataset of type-2-bad behaviors from the real world, across many applications, and can make related data in simulation. It also seems quite plausible that this is a very tiny delta, if we are dealing with models that have already learned everything they would need to know about the world and this is just a matter of selecting a motivation, so that you can potentially get good type 2 behavior using a very small amount of data. Relatedly, it seems like all you really need is to train predictors for type 2 feedback (in order to use those predictions for training/planning), and the relevant prediction problems often seem much easier than the actual sophisticated behaviors we are interested in.

Another important part of my view about type 1 ~ type 2 is that if gradient descent handles the scale from [1 second, 1 month], then it's not actually very far to get to [1 month, 2 years]. It seems like we've already come 6 orders of magnitude and now we are talking about generalizing 1 more order of magnitude.

At a higher level, I feel like the important thing is that type 1 and type 2 feedback are going to be basically the same kind of thing but with a quantitative difference (or at least we can set up type 1 feedback so that this is true). On the other hand, "what we really want" is a completely different thing (that we basically can't even define cleanly). So prima facie it feels to me like if models generalize "well" then we can get them to generalize from type 1 to type 2, whereas no such thing is true for "what we really care about."
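To make the "train predictors for type 2 feedback" idea concrete, here's a hedged toy sketch (my own illustrative stand-in, not a description of anyone's proposal): fit a predictor on logged examples of type-2-bad behavior, then use its predictions to score candidate plans before acting.

```python
# Toy stand-in: train a predictor of type 2 feedback on logged behavior, then use
# its predictions at planning time. Feature/label generation is purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 4))                                 # features of past behaviors
type2_bad = (feats[:, 0] + 0.5 * feats[:, 1] > 0).astype(float)   # observed type-2-bad label

w = np.zeros(4)
for _ in range(2000):  # logistic-regression predictor of type 2 feedback
    p = 1.0 / (1.0 + np.exp(-feats @ w))
    w -= 0.1 * feats.T @ (p - type2_bad) / len(type2_bad)

def predicted_type2_badness(candidate_feats):
    """Score candidate plans with the learned predictor before acting."""
    return 1.0 / (1.0 + np.exp(-candidate_feats @ w))

candidates = rng.normal(size=(5, 4))                              # hypothetical candidate plans
chosen = candidates[np.argmin(predicted_type2_badness(candidates))]
```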
2Richard Ngo
A couple of clarifications:

Let's also include feedback which we can use to decide whether to stop deploying an agent; the central example in my head is an agent which has been deployed for some time before we discover that it's doing bad things.

Relatedly, another argument for type 1 !~ type 2 which seems important to me: type 2 feedback can look at long time horizons, which I expect to be very useful. (Maybe you included this in the cost estimate, but idk how to translate between longer times and higher cost directly.)

This doesn't seem right. We design type 1 feedback so that resulting agents perform well on our true goals. This only matches up with type 2 feedback insofar as type 2 feedback is closely related to our true goals. But if that's the case, then it would be strange for agents to learn the motivation of doing well on type 2 feedback without learning the motivation of doing well on our true goals.

In practice, I expect that misaligned agents which perform well on type 2 feedback will do so primarily by deception, for instrumental purposes. But it's hard to picture agents which carry out this type of deception, but which don't also decide to take over the world directly.

Okay, that makes sense (and seems compelling, though not decisive, to me). I'm happy to leave it here - thanks for the answers!

> It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*, definitely you shouldn't train on D then run on D* (and you don't need to!).

Sorry, yes, you're completely right. (I previously didn't like that there's a model trained on D which only gets used for finding z*, but realized it's not a big deal.)
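For concreteness, here's a toy sketch of the options being contrasted above (illustrative stand-ins only, nothing from the post): separate models for D and D*, one model jointly trained on both, and the failure mode to avoid of training on D and then running on D*.

```python
# Toy stand-in for the three options; fit_linear is a placeholder for whatever
# learned model is actually used, and the data is synthetic.
import numpy as np

def fit_linear(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
X_D, y_D = rng.normal(size=(100, 3)), rng.normal(size=100)        # training distribution D
X_Ds, y_Ds = rng.normal(size=(40, 3)) + 2.0, rng.normal(size=40)  # shifted deployment distribution D*
# (In the learning-the-prior setting, y_Ds would be labels produced by the human using z*.)

# Option 1: separate models, each only ever run on its own distribution.
model_D, model_Ds = fit_linear(X_D, y_D), fit_linear(X_Ds, y_Ds)

# Option 2: one model jointly trained on D and D*.
model_joint = fit_linear(np.vstack([X_D, X_Ds]), np.concatenate([y_D, y_Ds]))

# What to avoid: predictions X_Ds @ fit_linear(X_D, y_D) rely on OOD generalization.
```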

> The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.

I agree - I mean for the... (read more)

3Paul Christiano
Even if you were taking D as input and ignoring tractability, IDA still has to decide what to do with D, and that needs to be at least as useful as what ML does with D (and needs to not introduce alignment problems in the learned model). In the post I'm kind of vague about that and just wrapping it up into the philosophical assumption that HCH is good, but really we'd want to do work to figure out what to do with D, even if we were just trying to make HCH aligned (and I think even for HCH competitiveness matters because it's needed for HCH to be stable/aligned against internal optimization pressure).

Thanks. This is helpful. I agree that LTP with the distilled core assumption buys us a lot, both theoretically and probably in practice too.

> The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation... My main reservation is that this seems really hard... If we require competitiveness then it seems like z has to look quite a lot like the weights of a neural network

Great, agreed with all of this.

> In writing the original post I w... (read more)

3Paul Christiano
It seems like you should either run separate models for D and D*, or jointly train the model on both D and D*, definitely you shouldn't train on D then run on D* (and you don't need to!). The goal is to be as good as an unaligned ML system though, not just to be better than humans. And the ML system updates on D, so we need to update on D too.

I'm trying to get a better handle on the benefits that come from LTP. Here's my current picture - are there points here where I've misunderstood?
_________

The core problem: We have a training distribution (x, y) ~ D and a deployment distribution (x*, y*) ~ D*, where D != D*. We would rather not rely on ML OOD generalization from D to D*. Instead, we would rather have a human label D*, train an ML model on those labels, and only rely on IID generalization. Suppose D is too large for a human to process. If the human knows how to label D* without learning... (read more)
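One way to write down the setup I have in mind (notation mine; a sketch of how I'm reading the learning-the-prior proposal, not a quotation from it):

```latex
% z ranges over the "background knowledge" the human can condition on; Prior_H and
% P_H are the human's prior over z and the human's likelihood of a label given (x, z).
z^* \;=\; \arg\max_z \Big[ \log \mathrm{Prior}_H(z) \;+\; \sum_{(x,y) \in D} \log P_H(y \mid x, z) \Big],
\qquad
\hat{y}^* \;=\; H(x^*, z^*) \quad \text{for } x^* \sim D^*.
```

On this reading, both stages only ask ML to imitate the human on-distribution (amortizing Prior_H, P_H, and H), so we rely on IID rather than OOD generalization.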

3Paul Christiano
I think your description is correct.

The distilled core assumption seems right to me because the neural network weights are already a distilled representation of D, and we only need to compete with that representation. For that reason, I expect z* to have roughly the same size as the neural network parameters.

My main reservation is that this seems really hard (and maybe in some sense just a reframing of the original problem). We want z to be a representation of what the neural network learned that a human can manipulate in order to reason about what it implies on D*. But what is that going to look like? If we require competitiveness then it seems like z has to look quite a lot like the weights of a neural network...

In writing the original post I was imagining z* being much bigger than a neural network but distilled by a neural network in some way. I've generally moved away from that kind of perspective, partly based on the kinds of considerations in this post. I now think we're going to have to actually have z* reflect something more like the structure of the unaligned neural network, rather than another model (Mz) that outputs all of the unaligned neural network's knowledge.

That said, I'm not sure we require OOD generalization even if we represent z via a model Mz. E.g. suppose that Mz(i) is the ith word of the intractably-large z. Then the prior calculation can access all of the words i in order to evaluate the plausibility of the string represented by Mz. We then use that same set of words at training time and test time. If some index i is used at test time but not at training time, then the model responsible for evaluating Prior(z) is incentivized to access that index in order to show that z is unnecessarily complex. So every index i should be accessed on the training distribution. (Though they need not be accessed explicitly, just somewhere in the implicit exponentially large tree.)

Like I said, I'm a bit less optimistic about doing this kind of massiv…
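To illustrate the Mz indexing idea in the paragraph above, here is a hedged toy sketch (everything here is an illustrative stand-in; in the actual proposal Mz would be a learned model and the prior evaluation would be the amortized human judgment): Mz maps an index i to the ith word of the intractably large z, and the prior score only touches z through such index queries, so the same queries can be reused at training and test time.

```python
# Toy stand-in for "represent z via a model Mz": Mz(i) returns the i-th word of an
# intractably large z, and the prior score only ever sees z through such queries.
from typing import Callable, List

def make_Mz(words: List[str]) -> Callable[[int], str]:
    """Toy Mz: in the real proposal this would be a learned model, not a lookup table."""
    return lambda i: words[i]

def log_prior(Mz: Callable[[int], str], indices: List[int]) -> float:
    """Toy prior evaluation; every index it queries is one that training has covered."""
    accessed = [Mz(i) for i in indices]
    return -float(sum(len(word) for word in accessed))  # e.g. penalize unnecessarily long z

Mz = make_Mz(["background", "knowledge", "relevant", "to", "answering", "questions"])
score = log_prior(Mz, indices=[0, 1, 2, 3, 4, 5])
```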