All of Liam Donovan's Comments + Replies

Huh, I thought that all amplification/distillation procedures were intended as a way to approximate HCH, which is itself a tree. Can you not meaningfully discuss "this amplification procedure is like an n-depth approximation of HCH at step x", for any amplification procedure?

For example, the internal structure of the distilled agent described in Christiano's paper is unlikely to look anything like a tree. However, my (potentially incorrect?) impression is that the agent's capabilities at step x are identical to an HCH tree of depth x i...

Rohin Shah
No, you can't. For example, if your amplification procedure only allows you to ask a single subagent a single question, that will approximate a linear HCH instead of a tree-based HCH. And if your amplification procedure doesn't invoke subagents at all, but instead provides more and more facts to the agent, it doesn't look anything like HCH. The canonical implementations of iterated amplification are trying to approximate HCH, though.

That sounds right to me.
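To make the tree-vs-linear contrast concrete, here is a toy Python sketch; it is my own illustration, and `decompose`, `answer_directly`, and `combine` are hypothetical stubs standing in for whatever the human (or model) at each node would do, not anything from the posts:

```python
# Toy sketch contrasting two HCH-style recursion shapes.
# All helper functions are illustrative stubs.

def answer_directly(question: str) -> str:
    return f"<direct answer to {question!r}>"

def decompose(question: str) -> list[str]:
    # A real node would produce meaningful subquestions; this is a stub.
    return [f"{question} / part {i}" for i in (1, 2)]

def combine(question: str, subanswers: list[str]) -> str:
    return f"<answer to {question!r} from {len(subanswers)} subanswer(s)>"

def hch_tree(question: str, depth: int) -> str:
    """Tree-based HCH: each node may consult several subagents."""
    if depth == 0:
        return answer_directly(question)
    subanswers = [hch_tree(q, depth - 1) for q in decompose(question)]
    return combine(question, subanswers)

def hch_linear(question: str, depth: int) -> str:
    """Linear HCH: each node may ask exactly one subagent one question."""
    if depth == 0:
        return answer_directly(question)
    return combine(question, [hch_linear(decompose(question)[0], depth - 1)])
```

At depth x the tree version consults exponentially many nodes while the linear version consults only x, which is why the two procedures approximate different objects even though both are "iterated".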

Huh, what would you recommend I do to reduce my uncertainty around meta-execution (e.g. "read x", "ask about it as a top-level question", etc.)?

Rohin Shah
I'm not sure how relevant meta-execution is anymore; I haven't seen it discussed much recently. So you'd probably want to ask Paul, or someone else who was around earlier than I was.

Is this necessarily true? It seems like this describes what Christiano calls "delegation" in his paper, but it wouldn't apply to IDA schemes with other capability amplification methods (such as the other examples in the appendix of "Capability Amplification").

Rohin Shah
"Depth" only applies to the canonical tree-based implementation of IDA. If you slot in other amplification or distillation procedures, then you won't necessarily have "depths" any more. You'll still have recursion, and that recursion will lead to more and more capability. Where it ends up depends on your initial agent and how good the amplification and distillation procedures are.

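For concreteness, the recursion Rohin describes can be written as a simple loop. This is only a schematic, with `amplify` and `distill` as placeholders for whichever procedures you slot in (the names are illustrative, not from any paper):

```python
# Schematic of the amplification/distillation recursion. `amplify` and
# `distill` are placeholders: different choices give different schemes,
# and "depth" is only meaningful for the canonical tree-based `amplify`.

def iterate_ida(initial_agent, amplify, distill, num_steps: int):
    agent = initial_agent
    for _ in range(num_steps):
        amplified = amplify(agent)   # e.g. let the agent consult copies of itself
        agent = distill(amplified)   # train a fast agent to imitate the slow system
    return agent
```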
I found this immensely helpful overall, thank you!

However, I'm still somewhat confused by meta-execution. Is it essentially a more sophisticated capability amplification strategy that fills the role played by "deliberation" in Christiano's IA paper?

Rohin Shah
I maybe think so? I'm not sure about meta-execution (as the comment thread above shows).

Two basic questions I couldn't figure out (sorry):

1. Can you use a different oracle for every subquestion? If you can, how would this affect the concern Wei_Dai raises?

2. If we know the oracle is only optimizing for the specified objective function, are mesa-optimizers still a problem for the proposed system as a whole?

Evan Hubinger
You can use a different oracle for every subquestion, but it's unclear what exactly that does if you don't know what the oracle's actual objective is. For example, you could imagine one system that cares about the reward given to copies as much as the reward given to itself, and another system that only cares about the reward given to itself, and these two systems would be near-indistinguishable if you were just doing empirical analysis on some training distribution.

The key here, I think, is the degree to which you're willing to make an assumption of the form you mention--that is, how much leeway you are willing to allow in assuming that the oracle is in fact only going to be optimizing for the specified objective function.

On one level, it makes sense to separate out outer alignment and inner alignment concerns, but the problem in this case is that the sorts of objectives you're allowed to specify here are going to be very dependent on whatever inner alignment solution you're assuming you have access to. For example, does your inner alignment solution require access to training data? If so, then that's a big constraint on the sorts of objectives you can specify. Or, going back to your previous question, whether or not I get to pick between an objective that doesn't care about other copies and one that does is also going to depend on exactly what sort of inner alignment solution you're assuming.
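To pin down what "a different oracle for every subquestion" would mean mechanically, here is a minimal, entirely hypothetical sketch (`Oracle` and its fields are illustrative names). The point it illustrates is that nothing in this interface reveals whether the instances' learned objectives care about each other's reward, which is why two such systems can be near-indistinguishable empirically:

```python
# Hypothetical sketch of the per-subquestion-oracle variant. The
# `specified_objective` field is only the objective we wrote down; a
# mesa-optimizer's learned objective could differ without changing
# observed behavior on the training distribution.

class Oracle:
    def __init__(self, specified_objective: str):
        self.specified_objective = specified_objective

    def answer(self, question: str) -> str:
        return f"<answer to {question!r}>"

def answer_subquestions(subquestions: list[str]) -> list[str]:
    # A fresh, independently instantiated oracle for each subquestion.
    return [Oracle("predict accurately").answer(q) for q in subquestions]
```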