(ETA: The name "policy approval" wasn't great. I think I will use the term "policy alignment" to contrast with "value alignment" going forward, at the suggestion of Wei Dai in the comments.)

I recently had a conversation with Stuart Armstrong in which I claimed that an agent which learns your utility function (pretending for a moment that "your utility function" really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you. He challenged me to write up specific examples to back up my claims.

I'll also give a very sketchy alternative to value learning, which I call policy alignment. (The policy alignment idea emerged out of a conversation with Andrew Critch.)

Background

Stuart Armstrong has recently been doing work showing the difficulty of inferring human values. To summarize: because humans are irrational, a value-learning approach like CIRL needs to jointly estimate the human utility function and the degree to which the human is rational -- otherwise, it would take all the mistakes humans make to be preferences. Unfortunately, this leads to a severe problem of identifiability: humans can be assigned any values whatsoever if we assume the right kind of irrationality, and the usual trick of preferring simpler hypotheses doesn't seem to help in this case.

I also want to point out that a similar problem arises even without irrationality. Vladimir Nesov explored how probability and utility can be mixed into each other without changing any decisions an agent makes. So, in principle, we can't determine the utility or probability function of an agent uniquely based on the agent's behavior alone (even including hypothetical behavior in counterfactual situations). This fact was discovered earlier by Jeffrey and Bolker, and is analyzed in more detail in the book The Logic of Decision. For this reason, I call the transform "Jeffrey-Bolker rotation".

To give an illustrative example: it doesn't matter whether we assign very low probability to an event, or care very little about what happens given that event. Suppose a love-maximizing agent is unable to assign nonzero utility to a universe where love isn't real. The agent may appear to ignore evidence that love isn't real. We can interpret this as not caring what happens conditioned on love not being real; or, equally valid (in terms of the actions which the agent chooses), we can interpret the agent as having an extremely low prior probability on love not being real.

At MIRI, we sometimes use the term "probutility" to indicate the (probability, utility) pair in a way which reminds us that they can't be disentangled from one another. Jeffrey-Bolker rotation changes probabilities and utilities, but does not change the overall probutilities.
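To make the rotation concrete, here is a minimal sketch on a hypothetical four-state world. It assumes the usual fractional-linear form of the transformation, restricted to the case where only a single mixing parameter `c` is varied; the state names, numbers, and function names are all made up for illustration. It checks that the rotated probabilities still sum to one, that they differ from the originals, and that the ranking of options is unchanged.

```python
from fractions import Fraction as F  # exact arithmetic, no float noise


def desirability(prob, util, event):
    """Expected utility conditional on `event` (a collection of states)."""
    mass = sum(prob[s] for s in event)
    return sum(prob[s] * util[s] for s in event) / mass


def rotate(prob, util, c):
    """Mix utility into probability, state by state.

    d is chosen so the new probabilities still sum to 1. The caller must
    pick c so that c * util[s] + d > 0 for every state s (and d > 0).
    """
    d = 1 - c * desirability(prob, util, list(prob))
    new_prob = {s: prob[s] * (c * util[s] + d) for s in prob}
    new_util = {s: util[s] / (c * util[s] + d) for s in prob}
    return new_prob, new_util


# Hypothetical toy world: four equally likely states.
prob = {"s1": F(1, 4), "s2": F(1, 4), "s3": F(1, 4), "s4": F(1, 4)}
util = {"s1": F(0), "s2": F(4), "s3": F(1), "s4": F(2)}
# Options are events: the set of states in which that option is taken.
options = {"A": ["s1", "s2"], "B": ["s3", "s4"], "C": ["s2", "s3"]}


def ranking(prob, util):
    """Option names sorted from most to least desirable."""
    return sorted(
        options,
        key=lambda name: desirability(prob, util, options[name]),
        reverse=True,
    )


new_prob, new_util = rotate(prob, util, F(1, 7))
print(ranking(prob, util), ranking(new_prob, new_util))
```

Both the probabilities and the utilities change, yet an observer who only sees which options are preferred cannot tell the two descriptions apart.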

Given these problems, it would be nice if we did not actually need to learn the human utility function. I'll advocate that position.

My understanding is that Stuart Armstrong is optimistic that human values can be inferred despite these problems, because we have a lot of useful prior information we can take advantage of.

It is intuitive that a CIRL-like agent should learn what is irrational and then "throw it out", IE, de-noise human preferences by looking only at what we really prefer, not at what we mistakenly do out of short-sightedness or other mistakes. On the other hand, it is not so obvious that the probability/utility distinction should be handled in the same way. Should an agent disentangle beliefs from preferences just so that it can throw out human beliefs and optimize the preferences alone? I argue against this here.

Main Claim

Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.

Suppose a robot is trying to help a perfectly rational human. The human has probability function $P_H$ and utility function $U_H$. The robot is in epistemic state $e$. The robot has a set of actions $\{a_1, \ldots, a_n\}$. The proposition "the robot takes the $i$th action when in epistemic state $e$" is written as $\pi(e) = a_i$. The set of full world-states is $S$. What the human would like the robot to do is given by:

$$\operatorname*{argmax}_{a_i} \sum_{s \in S} P_H(s \mid \pi(e) = a_i) \, U_H(s)$$

(Or by the analogous causal counterfactual, if the human thinks that way.)

This notion of what the human wants is invariant to Jeffrey-Bolker rotation; the robot doesn't need to disentangle probability and utility! It only needs to learn probutilities.

The equation written above can't be directly optimized, since the robot doesn't have direct access to human probutilities. However, I'll broadly call any attempt to approximate that equation "policy alignment".
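As a minimal sketch of what "the action the human would prefer" means computationally, suppose the human's conditional beliefs and utilities were simply handed to us as tables (which, as noted, they are not). All names and numbers below are hypothetical, and the tea-carrying scenario is invented purely for illustration.

```python
def expected_utility(p_states_given_action, utility):
    """Sum over world-states of probability times utility."""
    return sum(p * utility[s] for s, p in p_states_given_action.items())


def human_preferred_action(p_human, u_human):
    """Pick the robot action with the highest expected utility,
    where both the expectation and the utility are the human's."""
    return max(p_human, key=lambda a: expected_utility(p_human[a], u_human))


# Hypothetical human utility over world-states.
u_human = {"tea_served": 1.0, "tea_spilled": -5.0, "nothing": 0.0}

# Hypothetical human beliefs about the world-state, conditional on each
# action the robot might take in its current epistemic state.
p_human = {
    "carry_fast": {"tea_served": 0.6, "tea_spilled": 0.4},
    "carry_slow": {"tea_served": 0.95, "tea_spilled": 0.05},
    "wait": {"nothing": 1.0},
}

choice = human_preferred_action(p_human, u_human)
print(choice)
```

Only the products of probability and utility enter the sum, which is why nothing here requires separating the two.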

Notice that this is closely analogous to UDT. UDT solves dynamic inconsistencies -- situations in which an AI could predictably dislike the decisions of its future self -- by optimizing its actions from the perspective of a fixed prior, IE, its initial self. Policy alignment resolves inconsistencies between the AI and the human by optimizing the AI's actions from the human's perspective. The main point of this post is that we can use this analogy to produce counterexamples to the typical value-learning approach, in which the AI tries to optimize human utility but not according to human beliefs.

I will somewhat ignore the distinction between UDT1.0 and UDT1.1.

Examples

These examples serve to illustrate that "optimizing human utility according to AI beliefs" is not exactly the same as "do what the human would want you to do", even when we suppose "the human utility function" is perfectly well-defined and can be learned exactly by the AI.

In these examples, I will suppose that the AI has its own probability distribution $P_R$. It reasons updatelessly with respect to evidence $e$ it sees, but with full prior knowledge of the human utility function:

$$\operatorname*{argmax}_{a_i} \sum_{s \in S} P_R(s \mid \pi(e) = a_i) \, U_H(s)$$

I use an updateless agent to avoid accusations that of course an updateful agent would fail classic UDT problems. However, it is not really very important for the examples.

I assume prior knowledge of the human utility function $U_H$ to avoid any tricky issues which might arise by attempting to combine updatelessness with value learning.

Counterfactual Mugging

It seems reasonable to suppose that the AI will start out with some mathematical knowledge. Imagine that the AI has a database of theorems in memory when it boots up, including the first million digits of pi. Treat these as part of the agent's prior.

Suppose, on the other hand, that the human which the AI wants to help does not know more than a hundred digits of pi.

The human and the AI will disagree on what to do about counterfactual mugging with a logical coin involving digits of pi which the AI knows and the human does not. If Omega approaches the AI, the AI will refuse to participate, but the human will wish the AI would. If Omega approaches the human, the AI may try to prevent the human from participating, to the extent that it can do so without violating other aspects of the human utility function.
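A toy calculation shows where the disagreement comes from. The sketch assumes the usual shape of counterfactual mugging: if the logical coin comes up one way, Omega asks for a payment; if it comes up the other way, Omega pays out a reward only to agents whose policy is to pay when asked. The dollar amounts and all names are hypothetical.

```python
# Payoff for each (policy, outcome of the logical coin).
# "asks": Omega asks for money. "rewards": Omega rewards agents who would pay.
PAYOFF = {
    ("pay", "asks"): -100,
    ("pay", "rewards"): 10_000,
    ("refuse", "asks"): 0,
    ("refuse", "rewards"): 0,
}


def policy_value(policy, prior):
    """Expected payoff of a policy, evaluated from a fixed prior."""
    return sum(p * PAYOFF[(policy, world)] for world, p in prior.items())


def best_policy(prior):
    return max(["pay", "refuse"], key=lambda policy: policy_value(policy, prior))


# The human does not know the relevant digit of pi: a 50/50 prior.
human_prior = {"asks": 0.5, "rewards": 0.5}
# The AI's prior already contains the digit, so it is certain of the outcome
# (here, the outcome in which Omega asks for money).
ai_prior = {"asks": 1.0, "rewards": 0.0}

print(best_policy(human_prior), best_policy(ai_prior))
```

The utility function is identical in both evaluations; only the prior differs, and that alone flips the chosen policy.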

"Too Updateless"

Maybe the problem with the counterfactual mugging example is that it doesn't make sense to program the AI with a bunch of knowledge in its prior which the human doesn't have.

We can go in the opposite extreme, and make $P_R$ a broad prior such as the Solomonoff distribution, with no information about our world in particular.

I believe the observation has been made before that running UDT on such a prior could have weird results. There could be a world with higher prior probability than ours, inhabited by Omegas who ask the AI to optimize alien values in most universes (including Earth) in exchange for the Omegas maximizing $U_H$ in their own world. (This particular scenario doesn't seem particularly probable, but it does seem quite plausible that some weird universes will have higher probability than our universe in the Solomonoff prior, and may make some such bargain.)

Again, this is something which can happen in the maximization using the AI's $P_R$ but not in the one using the human's $P_H$ -- unless humans themselves would approve of the multiversal bargain.

"Just Having a Very Different Prior"

Maybe the AI's prior $P_R$ is neither strictly more knowledgeable than the human's $P_H$ nor less, but the two are very different on some specific issues. Perhaps there's a specific plan which, when $P_R$ is conditioned on evidence so far, looks very likely to have many good consequences. $P_H$ considers the plan very likely to have many bad consequences. Also suppose that there aren't any interesting consequences of this plan in counterfactual branches, so UDT considerations don't come in.

Also, suppose that there isn't time to test the differing hypotheses involved which make humans think this is such a bad plan while AIs think it is so good. The AI has to decide right now whether to enact the plan.

The value-learning agent will implement this plan, since it seems good on net for human values. The policy-alignment agent will not, since humans wouldn't want it to.

Obviously, one might question whether it is reasonable to assume that things got to a point where there was such a large difference of opinion between the AI and the humans, and no time to resolve it. Arguably, there should be safeguards against this scenario which the value-learning AI itself would want to set up, due to facts about human values such as "the humans want to be involved in big decisions about their future" or the like.

Nonetheless, faced with this situation, it seems like policy-alignment agents do the right thing while value-learning agents do not.
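The contrast can be sketched in a few lines. Both agents below share the same (hypothetical) human utility function; they differ only in whose beliefs are used to take the expectation. The probabilities and utilities are invented for illustration.

```python
def expected_utility(p_states, utility):
    return sum(p * utility[s] for s, p in p_states.items())


def choose(beliefs, utility):
    """Best action under the given conditional beliefs and utility."""
    return max(beliefs, key=lambda a: expected_utility(beliefs[a], utility))


# Hypothetical human utility over outcomes of the plan.
u_human = {"good": 10.0, "bad": -10.0, "neutral": 0.0}

# The AI's beliefs say the plan will very likely turn out well...
p_robot = {
    "enact_plan": {"good": 0.9, "bad": 0.1},
    "hold_off": {"neutral": 1.0},
}
# ...while the human's beliefs say it will very likely turn out badly.
p_human = {
    "enact_plan": {"good": 0.1, "bad": 0.9},
    "hold_off": {"neutral": 1.0},
}

value_learning_choice = choose(p_robot, u_human)     # human utility, AI beliefs
policy_alignment_choice = choose(p_human, u_human)   # human utility, human beliefs
print(value_learning_choice, policy_alignment_choice)
```

Swapping the belief table is the only change between the two calls, which is the whole of the disagreement in this example.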

Issues/Objections

Aren't human beliefs bad?

Isn't it problematic to optimize via human beliefs, since human beliefs are low-quality?

I think this is somewhat true and somewhat not.

  • Partly, this is like saying "isn't UDT bad because it doesn't learn?" -- actually, UDT acts as if it updates most of the time, so it is wrong to think of it as incapable of learning. Similarly, although the policy-alignment agent uses the human prior $P_H$, it will mostly act as if it has updated on a lot of information. So, maybe you believe human beliefs aren't very good -- but do you think we're capable of learning almost anything eventually? If so, this may address a large component of the concern. In particular, if you trust the output of certain machine learning algorithms more than you trust yourself, the AI can run those algorithms and use their output.
  • On the other hand, humans probably have incoherent $P_H$, and not just because of logical uncertainty. So, the AI still needs to figure out what is "irrational" and what is "real" in $P_H$, just like value-learning needs to do for $U_H$.

If humans would want an AI to optimize via human beliefs, won't that be reflected in the human utility function?

Or: If policy-alignment were good, wouldn't a value-learner self modify into policy-alignment anyway?

I don't think this is true, but I'm not sure. Certainly there could be simple agents who value-learners cooperate with without ever deciding to self-modify into policy-alignment agents. Perhaps there is something about human preference which desires the AI to cooperate with the human even when the AI thinks this is (otherwise) net-negative for human values.

Aren't I ignoring the fact that the AI needs its own beliefs?

In "Just Having a Very Different Prior", I claimed that if $P_H$ and $P_R$ disagree about the consequences of a plan, value-learning can do something humans strongly don't want it to do, whereas policy-alignment cannot. However, my definition of policy-alignment ignores learning. Realistically, the policy-alignment agent needs to also have beliefs $P_R$, which it uses to approximate the human approval of its actions. Can't the same large disagreement emerge from this?

I think the concern is qualitatively less, because the policy-alignment agent uses $P_R$ only to estimate $P_H$ and $U_H$. If the AI knows that humans would have a large disagreement with the plan, the policy-alignment agent would not implement the plan, while the value-learning agent would.

For policy-alignment to go wrong, it needs to have a bad estimate of $P_H$ and $U_H$.
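One possible reading of this, offered only as a sketch: the AI holds several hypotheses about what the human's beliefs and utilities are, weights them by its own credence, and averages the human's approval of each action across them. The averaging scheme, the hypotheses, and every number below are assumptions made for illustration, not something specified above.

```python
def expected_utility(p_states, utility):
    return sum(p * utility[s] for s, p in p_states.items())


def estimated_human_approval(action, hypotheses):
    """Average, over the AI's hypotheses about the human, of how good the
    action looks by the human's own lights under each hypothesis."""
    return sum(
        credence * expected_utility(model["beliefs"][action], model["utility"])
        for credence, model in hypotheses
    )


utility = {"good": 10.0, "bad": -10.0, "neutral": 0.0}

# (AI's credence, hypothesised human model). Hypothetical numbers.
hypotheses = [
    (0.2, {  # maybe the human actually expects the plan to go well
        "beliefs": {"enact_plan": {"good": 0.8, "bad": 0.2},
                    "hold_off": {"neutral": 1.0}},
        "utility": utility,
    }),
    (0.8, {  # more likely, the human expects the plan to go badly
        "beliefs": {"enact_plan": {"good": 0.1, "bad": 0.9},
                    "hold_off": {"neutral": 1.0}},
        "utility": utility,
    }),
]

actions = ["enact_plan", "hold_off"]
choice = max(actions, key=lambda a: estimated_human_approval(a, hypotheses))
print(choice)
```

The AI's own opinion about whether the plan would turn out well never appears in the calculation; its beliefs matter only through the credences it assigns to models of the human.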

The policy is too big.

Even if the process of learning $P_H$ is doing the work to turn it into a coherent probability distribution (removing irrationality and making things well-defined), it still may not be able to conceive of important possibilities. The evidence which the AI uses to decide how to act, in the equations given earlier, may be a large data stream with some human-incomprehensible parts.

As a result, it seems like the AI needs to optimize over compact/abstract representations of its policy, similarly to how policy selection in logical inductors works.

This isn't an entirely satisfactory answer, since (1) the representation of a policy as a computer program could still escape human understanding, and (2) it is unclear what it means to correctly represent the policy in a human-understandable way.

Terminology

[Aside from issues with the approach, my term "policy approval" may be terrible. It sounds too much like "approval-directed agent", which means something different. I think there are similarities, but they aren't strong enough to justify referring to both as "approval". Any suggestions?]

[Now using "Policy Alignment" for this. Editing post accordingly.]

Advantages

(These are very speculative.)

Logical Updatelessness?

One of the major obstacles to progress in decision theory right now is that we don't know of a good updateless perspective for logical uncertainty. Maybe a policy-alignment agent doesn't need to solve this problem, since it tries to optimize from the human perspective rather than its own. Roughly: logical updatelessness is hard because it tends to fall into the "too updateless" issue above. So, maybe it can be a non-issue in the right formulation of policy alignment.

Corrigibility?

Stuart Armstrong is somewhat pessimistic about corrigibility. Perhaps there is something which can be done in policy-alignment land which can't be done otherwise. The "Just Having a Very Different Prior" example points in this direction; it is an example where policy-alignment acts in a much more corrigible way.

A value-learning agent can always resist humans if it is highly confident that its plan is a good one which humans are opposing irrationally. A policy-alignment agent can think its plan is a good one but also think that humans would prefer it to be corrigible on principle regardless of that.

On the other hand, a policy-alignment agent isn't guaranteed to think that. Perhaps policy-alignment learning can be specified with some kind of highly corrigible bias, so that it requires a lot of evidence to decide that humans don't want it to behave corrigibly in a particular case?

Conclusion

I've left out some speculation about what policy-alignment agents should actually look like, for the sake of keeping mostly to the point (the discussion with Stuart). I like this idea because it involves a change in perspective of what an agent should be, similar to the change which UDT itself made.

13 comments

What about calling it "policy alignment" in analogy with "value alignment"?

So, the AI still needs to figure out what is “irrational” and what is “real” in $P_H$, just like value-learning needs to do for $U_H$.

Since I'm very confused about what my $P_H$ should be (I may be happy to change it in any number of ways if someone gave me the correct solutions to a bunch of philosophical problems), there may not be anything "real" in my $P_H$ that I'd want an AI to learn and use in an uncritical way. It seems like this mostly comes down to what probabilities really are: if probabilities are something objective like "how real" or "how much existence" each possible world is/has, then I'd want an AI to use its greater intellect to figure out what is the correct prior and use that, but if probabilities are something subjective like how much I care about each possible world, then maybe I'd want the AI to learn and use my $P_H$. I'm kind of confused that you give a bunch of what seem to me to be less important considerations on whether the AI should use my probability function or its own to make decisions, and don't mention this one.

"Policy alignment" seems like an improvement, especially since "policy approval" invokes government policy.

With respect to the rest:

On the one hand, I'm tempted to say that to the extent you recognize how confused you are about what probabilities are, and that this confusion has to do with how you reason in the real world, your $P_H$ is going to change a lot when updated on certain philosophical arguments. As a result, optimizing a strategy updatelessly via $P_H$ is going to take that into account, shifting behavior significantly in contingencies in which various philosophical arguments emerge, and potentially putting a significant amount of processing power toward searching for such arguments.

On the other hand, I buy my "policy alignment" proposal only to the extent that I buy UDT, which is not entirely. I don't know how to think about UDT together with the shifting probabilities which come from logical induction. The problem is similar to the one you outline: just as it is unclear that a human should think its own $P_H$ has any useful content which should be locked in forever in an updateless reasoner, it is similarly unclear that a fixed logical inductor state (after running for a finite amount of time) has any useful content which one would want to lock in forever.

I don't yet know how to think about this problem. I suspect there's something non-obvious to be said about the extent to which $P_H$ trusts other belief distributions (IE, something at least a bit more compelling than the answer I gave first, but not entirely different in form).

I'm confused. If the AI knows a million digits of pi, and it can prevent Omega from counterfactually mugging me where it knows I will lose money... shouldn't it try to prevent that from happening? That seems like the right behavior to me. Similarly, if I knew that the AI knows a million digits of pi, then if it gets counterfactually mugged, it shouldn't give up the money.

(Perhaps the argument is that as long as Omega was uncertain about the digit when deciding what game to propose, then you should pay up as necessary, regardless of what you know. But if that's the argument, then why can't the AI go through the same reasoning?)

Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.

If the AI knows the winning numbers for the lottery, then it should buy that ticket for me, even though (if I don't know that the AI knows the winning numbers) I would disprefer that action. Even better would be if it explained to me what it was doing, after which I would prefer the action, but let's say that wasn't possible for some reason (maybe it performed a very complex simulation of the world to figure out the winning number).

It seems like if the AI knows my utility function and is optimizing it, that does perform well. Now for practical reasons, we probably want to instead build an AI that does what we prefer it to do, but this seems to be because it would be hard to learn the right utility function, and errors along the way could lead to catastrophe, not because it would be bad for the AI to optimize the right utility function.

ETA: My strawman-ML-version of your argument is that you would prefer imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things). This seems wrong to me.

I'm confused. If the AI knows a million digits of pi, and it can prevent Omega from counterfactually mugging me where it knows I will lose money... shouldn't it try to prevent that from happening? That seems like the right behavior to me. Similarly, if I knew that the AI knows a million digits of pi, then if it gets counterfactually mugged, it shouldn't give up the money.

If you don't think one should pay up in counterfactual mugging in general, then my argument won't land. Rather than arguing that you want to be counterfactually mugged, I'll try and argue a different decision problem.

Suppose that Omega is running a fairly simple and quick algorithm which is nonetheless able to predict an AI with more processing power, due to using a stronger logic or similar tricks. Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10.

We suppose that, since there is a short proof of exactly what Omega does, it is already present in the mathematical database included in the AI's prior.

If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is -- taking less money just has a lower expected utility. So, it will get only $10 from Omega.

If the AI is a policy-approval agent, it will think about what would have a higher expectation in the human's expectation: taking half, or taking it all. It's quite possible in this case that it takes all the money.

(Perhaps the argument is that as long as Omega was uncertain about the digit when deciding what game to propose, then you should pay up as necessary, regardless of what you know. But if that's the argument, then why can't the AI go through the same reasoning?)

That is part of the argument for paying up in counterfactual mugging, yes. But both we and Omega need to be uncertain about the digit, since if our prior can already predict that Omega is going to ask us for $10 rather than give us any money, there's no reason for us to pay up. So, it depends on the prior, and can turn out differently if our vs the agent's prior is used.

If the AI knows the winning numbers for the lottery, then it should buy that ticket for me, even though (if I don't know that the AI knows the winning numbers) I would disprefer that action.

If I think that the AI tends to be miscalibrated about lottery-ticket beliefs, there is no reason for me to want it to buy the ticket. If I think it is calibrated about lottery-ticket beliefs, I'll like the policy of buying lottery tickets in such cases, so the AI will buy.

You could argue that an AI which is trying to be helpful will buy lottery tickets in such cases no matter how deluded the humans think it is. But, not only is this not very corrigible behavior, but also it doesn't make any sense from our perspective to make an AI reason in that way: we don't want the AI to act in ways which we have good reason to believe are unreliable.

ETA: My strawman-ML-version of your argument is that you would prefer imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things). This seems wrong to me.

The analogy isn't perfect, since the AI can still do things to maximize human approval which the human would never have thought of, as well as things which the human could think of but didn't have the computational resources to do. It does seem like a fairly good analogy, though.

Okay, I think I misunderstood what you were claiming in this post. Based on the following line:

I claimed that an agent which learns your utility function (pretending for a moment that "your utility function" really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you.

I thought you were arguing, "Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you." (Yes, having written it down I can see that is not what you actually said, but that's the interpretation I originally ended up with.)

I would now rephrase your claim as "Even assuming we know the true utility function, optimizing it is hard."

Examples:

You could argue that an AI which is trying to be helpful will buy lottery tickets in such cases no matter how deluded the humans think it is. But, not only is this not very corrigible behavior, but also it doesn't make any sense from our perspective to make an AI reason in that way: we don't want the AI to act in ways which we have good reason to believe are unreliable.

Yeah, an AI that optimizes the true utility function probably won't be corrigible. From a theoretical standpoint, that seems fine -- corrigibility seems like an easier target to shoot for, not a necessary aspect of an aligned AI. The reason we don't want the scenario above is "we have good reason to believe [the AI is] unreliable", which sounds like the AI is failing to optimize the utility function correctly.

If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is -- taking less money just has a lower expected utility. So, it will get only $10 from Omega.
If the AI is a policy-approval agent, it will think about what would have a higher expectation in the human's expectation: taking half, or taking it all. It's quite possible in this case that it takes all the money.

This also sounds like the value-learning agent is simply bad at correctly optimizing the true utility function. (It seems to me that all of decision theory is about how to properly optimize a utility function in theory.)

We can go in the opposite extreme, and make $P_R$ a broad prior such as the Solomonoff distribution, with no information about our world in particular.
I believe the observation has been made before that running UDT on such a prior could have weird results.

Again, seems like this proposal for making an aligned AI is just bad at optimizing the true utility function.

So I guess the way I would summarize this post:

  • Value learning is hard.
  • Even if you know the correct utility function, optimizing it is hard.
  • Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.

Is this right?

I thought you were arguing, "Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you." (Yes, having written it down I can see that is not what you actually said, but that's the interpretation I originally ended up with.)

I would correct it to "Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this in expectation according to some prior is still not aligned with you."

I would now rephrase your claim as "Even assuming we know the true utility function, optimizing it is hard."

This part is tricky for me to interpret.

On the one hand, yes: specifically, even if you have all the processing power you need, you still need to optimize via a particular prior (AIXI optimizes via Solomonoff induction) since you can't directly see what the consequences of your actions will be. So, I'm specifically pointing at an aspect of "optimizing it is hard" which is about having a good prior. You could say that "utility" is the true target, and "expected utility" is the proxy which you have to use in decision theory.

On the other hand, this might be a misleading way of framing the problem. It suggests that something with a perfect prior (magically exactly equal to the universe we're actually in) would be perfectly aligned: "If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned." This isn't necessarily objectionable, but it is not the notion of alignment in the post.

  • If the AI magically has the "true universe" prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
  • The Jeffrey-Bolker rotation (mentioned in the post) gives me some reason to think of the prior and the utility function as one object, so that it doesn't make sense to think about "the true human utility function" in isolation. None of my choice behavior (be it revealed preferences or verbally claimed preferences etc) can differentiate between me assigning small probability to a set of possibilities (but caring moderately about what happens in those possibilities) and assigning a moderate probability (but caring very little what happens one way or another in those worlds). So, I'm not even sure it is sensible to think of $U_H$ alone as capturing human preferences; maybe $U_H$ doesn't really make sense apart from $P_H$.

So, to summarize,

1. I agree that "even assuming we know the true utility function, optimizing it is hard" -- but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.

2. Even under the idealized assumption that humans are perfectly coherent decision-theoretic agents, I'm not sure it makes sense to say there's a "true human utility function" -- the VNM theorem only gets a utility function which is unique up to such-and-such by assuming a fixed notion of probability. The Jeffrey-Bolker representation theorem, which justifies rational agents having probability and utility functions in one theorem rather than justifying the two independently, shows that we can do this "rotation" which shifts which part of the preferences are represented in the probability vs in the utility, without changing the underlying preferences.

3. If we think of the objective as "building AI such that there is a good argument for humans trusting that the AI has human interest in mind" rather than "building AI which optimizes human utility", then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don't actually have to figure out which part of preferences are "probability" vs "utility".

I'm not even sure whether you are closer or further from understanding what I meant, now. I think you are probably closer, but stating it in a way I wouldn't. I see that I need to do some careful disambiguation of background assumptions and language.

Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.

This part, at least, is getting at the same intuition I'm coming from. However, I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn't end up talking much about directly learning the policies. (I am thinking I'll write another post to make that connection clearer.)

I will have to think harder about the difference between how you're framing things and how I would frame things, to try to clarify more.

Separately, I still don't understand the counterfactual mugging case. (Disclaimer, I haven't gone through any math around counterfactual mugging.) It seems really strange that if the human was certain about the digit, they wouldn't pay up, but if the human is uncertain about the digit but is certain that the AI knows the digit, then the human would not want the AI to intervene. But possibly it's not worth getting into this detail.

Omega will put either $10 or $1000 in a box. Our AI can press a button on the box to get either all or half of the money inside. Omega puts in $1000 if it predicts that our AI will take half the money; otherwise, it puts in $10.
We suppose that, since there is a short proof of exactly what Omega does, it is already present in the mathematical database included in the AI's prior.
If the AI is a value-learning agent, it will take all the money, since it already knows how much money there is -- taking less money just has a lower expected utility. So, it will get only $10 from Omega.
If the AI is a policy-approval agent, it will consider which option has the higher expected value under the human's beliefs: taking half, or taking it all. It's quite possible in this case that it takes half the money.
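To make the arithmetic in this example concrete, here is a minimal sketch in Python. It is not from the post: the function names are hypothetical, and it assumes that Omega predicts perfectly and that the policy-level evaluator treats Omega's prediction as tracking whichever policy is chosen. Under those assumptions, an agent that already takes the box contents as fixed always prefers taking everything, while a policy-level comparison favors taking half.

```python
# Illustrative payoff model for the Omega box example.
# All names are hypothetical; Omega is assumed to predict perfectly.

ACTIONS = ("all", "half")


def box_contents(predicted_action: str) -> int:
    """Omega puts in $1000 if it predicts 'half', otherwise $10."""
    return 1000 if predicted_action == "half" else 10


def payout(action: str, contents: int) -> float:
    """Money received for taking all or half of the box contents."""
    return contents if action == "all" else contents / 2


def choose_with_known_contents(contents: int) -> str:
    """Agent whose prior already fixes the contents: best response to them."""
    return max(ACTIONS, key=lambda a: payout(a, contents))


def choose_policy_under_prediction() -> str:
    """Evaluator that assumes the contents depend on the chosen policy."""
    return max(ACTIONS, key=lambda a: payout(a, box_contents(a)))


if __name__ == "__main__":
    for contents in (10, 1000):
        print(contents, choose_with_known_contents(contents))
    best = choose_policy_under_prediction()
    print(best, payout(best, box_contents(best)))
```

With the contents held fixed, "all" dominates for both possible amounts, which is why the agent that already knows the amount ends up with $10. The policy-level comparison is between $10 (take all) and $500 (take half).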

I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior. Presumably, "what Omega does" depends on your own policy, so if you have a proof about what Omega does, that proof also determines your action, and there is nothing left for the agent to consider.

To be clear, I think it's reasonable to consider AIs that try to figure out proofs of "what Omega does", but if that's taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does. And if it's not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.

I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior.

You may not recognize it as such, especially if Omega is using a different axiom system than you. So, you can still be ignorant of what you'll do while knowing what Omega's prediction of you is. This makes it impossible for your probability distribution to treat the two as correlated anymore.

but if that's taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does

Yeah, that's the problem here.

And if it's not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.

Only if the agent takes that one proof out of the prior, but still has enough structure in the prior to see how the decision problem plays out. This is the problem of constructing a thin prior. You can (more or less) solve any decision problem by making the agent sufficiently updateless, but you run up against the problem of making it too updateless, at which point it behaves in absurd ways (lacking enough structure to even understand the consequences of policies correctly).

Hence the intuition that the correct prior to be updateless with respect to is the human one (which is, essentially, the main point of the post).

Hey there!

A useful thing would be an example of when a policy approval agent would do something that a human wouldn't, and what gains in efficiency the policy approval agent has over a normal human acting.

I feel that the formulation "the humans have a utility function" may obscure part of what's going on. Part of the advantage of approval agents is that they allow humans to express their sometimes incoherent meta-preferences as well ("yeah, I want to do X, but don't force me to do it"). Assuming the human preferences are already coherent reduces the attraction of the approach.

Ah, I agree that this proposal may have better ways to relax the assumption that the human has a utility function than value-learning does. I wanted to focus on the simpler case here. Perhaps I'll write a follow-up post considering the generalization.

Maybe I'll try to insert an example where the policy approval agent does something the human wouldn't into this post, though.

Here's a first stab: suppose that the AI has a subroutine which solves complex planning problems. Furthermore, the human trusts the subroutine (does not expect it to be cleverly choosing plans which solve the problems as stated but cause other problems). The human is smart enough to formulate day-to-day management problems which arise at work as formally-specified planning problems, and would like to be told what the answers to those problems are. In this case, the AI will tell the human those answers.

This also illustrates a limited way the policy-approval agent can avoid over-optimizing simplified problem statements: if the human does not trust the planning subroutine (expects it to goodhart or such), then the AI will not use such a subroutine.

(This isn't maximally satisfactory, since the human may easily be mistaken about what subroutines to trust. I think the AI can do a little better than this, but maybe not in a way which addresses the fundamental issue.)

Iterated distillation and amplification seems like an example of a thing that is like policy approval, and it could do lots of things that a human is unable to, such as becoming really good at chess or Go. (You can imagine removing the distillation steps if those seem too different from policy approval, and the point still applies.)

I think there are interesting connections between HCH/IDA and policy approval, which I hope to write more about some time.