Coherent behaviour in the real world is an incoherent concept

[-]Wei Dai7y110

I wonder if we can rescue Eliezer's argument. Informally (as far as I understand it) Eliezer's argument is that if an agent is the result of some optimization process, that optimization process will tend to notice and fix any incoherent behavior in the agent because that behavior will likely cause the agent to do something that counts as a clear loss from the optimization process's perspective.

So instead of letting O be either world states or world trajectories, make it the set of all possible combinations of properties of world trajectories that optimization processes in our world might care about. Formally we can define this as a partition of all possible world trajectories into mutually exclusive subsets where two trajectories are in the same subset iff no optimization process in our light-cone is likely to distinguish between them in any way. (BTW I believe it's standard or at least not unusual in decision theory to think of O as coarse-grained outcomes that people might care about, rather than micro states or micro trajectories.)

Now Rohin's objection no longer applies because we can't always find "a utility function which assigns maximal utility to all and only the world-trajectories in which those choices were made". Consider an agent that twitches according to some random sequence R. Since no optimization process in our world is likely to care that an agent twitches exactly according to R, any element of O that contains a trajectory where the agent twitches according to R would also contain a trajectory where the agent twitches according to some other sequence R', so there is no utility function which assigns maximal utility to all and only the world-trajectories in which the agent twitches according to R.
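To make the partition argument concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the argument itself: trajectories are modelled as length-4 twitch sequences, and `twitch_count` is a hypothetical stand-in for "the properties that optimization processes might care about". The sketch only checks whether R's cell of the partition contains anything other than R.

```python
from itertools import product


def cells(trajectories, coarse):
    """Group trajectories into the cells of the partition induced by `coarse`."""
    grouped = {}
    for t in trajectories:
        grouped.setdefault(coarse(t), []).append(t)
    return grouped


def singles_out(target, trajectories, coarse):
    """True iff a utility function defined on cells can be maximal on `target` alone.

    That is possible exactly when `target` is the only member of its cell.
    """
    return cells(trajectories, coarse)[coarse(target)] == [target]


# Toy world (assumption): a trajectory is a sequence of 4 twitch/no-twitch bits.
trajectories = list(product((0, 1), repeat=4))
R = (1, 0, 1, 1)  # the "random" twitch sequence


def identity(t):
    """O = micro-trajectories: every trajectory is its own outcome."""
    return t


def twitch_count(t):
    """O = coarse outcomes (hypothetical): only the number of twitches matters."""
    return sum(t)


fine_ok = singles_out(R, trajectories, identity)        # True
coarse_ok = singles_out(R, trajectories, twitch_count)  # False
```

With micro-trajectories as outcomes, the vacuous rationalisation goes through; with the coarse partition, R shares a cell with other sequences such as `(0, 1, 1, 1)`, so any utility function over cells must rate them equally.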

Having (hopefully) formalized the argument in a way that is no longer vacuous, I have to say I'm not entirely sure what the larger point of it is. Rohin seems to think the point is "Simply knowing that an agent is intelligent lets us infer that it is goal-directed" but Eliezer doesn't seem to think that corrigible (hence not goal-directed) agents are impossible to build. (That's actually one of MIRI's research objectives even though they take a different approach from Paul's.) Can anyone link to places where Eliezer uses this argument as part of some larger argument?

[-]Rohin Shah7y40

Here's an example of Eliezer using the argument: AI Alignment: Why It’s Hard, and Where to Start

[-]Richard_Ngo7y20

From Rohin's post, a quote which I also endorse:

You could argue that while [building AIs with really weird utility functions] is possible in principle, no one would ever build such an agent. I wholeheartedly agree, but note that this is now an argument based on particular empirical facts about humans (or perhaps agent-building processes more generally).

And if you're going to argue based on particular empirical facts about what goals we expect, then I don't think that doing so via coherence arguments helps very much.

[-]Wei Dai7y40

And if you’re going to argue based on particular empirical facts about what goals we expect, then I don’t think that doing so via coherence arguments helps very much.

I note that the first sentence of your post is "Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function." so it seems worth pointing out that there's a reasonable way to interpret “sufficiently optimised agents appear coherent” which isn't subject to that criticism.

Beyond that, as I mentioned, it's not clear to me what Eliezer was arguing for. (It seems plausible that he considered “sufficiently optimised agents appear coherent”, or the immediate corollary that such agents can be viewed as approximate EU maximizers with utility functions over the O that I defined, interesting in itself as a possibly surprising prediction that we can make about such agents.) What larger conclusion do you think he was arguing for, and why (preferably with citations)? Once we settle that, maybe then we can discuss whether his argumentative strategy was a good one?

[-]Rohin Shah7y10

Rohin seems to think the point is "Simply knowing that an agent is intelligent lets us infer that it is goal-directed" but Eliezer doesn't seem to think that corrigible (hence not goal-directed) agents are impossible to build. (That's actually one of MIRI's research objectives even though they take a different approach from Paul's.)

I think the point (from Eliezer's perspective) is "Simply knowing that an agent is intelligent lets us infer that it is an expected utility maximizer". The main implication is that there is no way to affect the details of a superintelligent AI except by affecting its utility function, since everything else is fixed by math (specifically the VNM theorem). Note that this is (or rather, appears to be) a very strong condition on what alignment approaches could possibly work -- you can throw out any approach that isn't going to affect the AI's utility function. I think this is the primary reason for Eliezer making this argument. Let's call this the "intelligence implies EU maximization" claim.

Separately, there is another claim that says "EU maximization by default implies goal-directedness" (or the presence of convergent instrumental subgoals, if you prefer that instead of goal-directedness). However, this is not required by math, so it is possible to avoid this implication, by designing your utility function in just the right way.

Corrigibility is possible under this framework by working against the second claim, i.e. designing the utility function in just the right way that you get corrigible behavior out. And in fact this is the approach to corrigibility that MIRI looked into.

I am primarily taking issue with the "intelligence implies EU maximization" argument. The problem is that "intelligence implies EU maximization" is true; it just happens to be vacuous. So I can't say that that's what I'm arguing against. This is why I rounded it off to arguing against "intelligence implies goal-directedness", though this is clearly a bad enough summary that I shouldn't be saying that any more.

[-]Wei Dai7y50

I think the point (from Eliezer’s perspective) is “Simply knowing that an agent is intelligent lets us infer that it is an expected utility maximizer”.

Eliezer explicitly disclaimed this:

A cognitively powerful agent might not be sufficiently optimized

Scenarios that negate "Relevant powerful agents will be highly optimized", such as brute forcing non-recursive intelligence, can potentially evade the 'sufficiently optimized' condition required to yield predicted coherence. E.g., it might be possible to create a cognitively powerful system by overdriving some fixed set of algorithms, and then to prevent this system from optimizing itself or creating offspring agents in the environment. This could allow the creation of a cognitively powerful system that does not appear to us as a bounded Bayesian. (If, for some reason, that was a good idea.)

In Relevant powerful agents will be highly optimized he went into even more detail about how one might create an intelligent agent that is not "highly optimized" and hence not an expected utility maximizer.

In summary it seems like you misunderstood Eliezer due to not noticing a distinction that he draws between "intelligent" (or "cognitively powerful") and "highly optimized".

[-]Rohin Shah7y10

In summary it seems like you misunderstood Eliezer due to not noticing a distinction that he draws between "intelligent" (or "cognitively powerful") and "highly optimized".

That's true, I'm not sure what this distinction is meant to capture. I'm updating that the thing I said is less likely to be true, but I'm still somewhat confident that it captures the general gist of what Eliezer meant. I would bet on this at even odds if there were some way to evaluate it.

Eliezer explicitly disclaimed this: [...]

In Relevant powerful agents will be highly optimized he went into even more detail about how one might create an intelligent agent that is not "highly optimized" and hence not an expected utility maximizer.

This is a tiny bit of his writing, and his tone makes it clear that this is unlikely. This is different from what I expected (when something has the force of a theorem you don't usually call its negation just "unlikely" and have a story for how it could be true), but it still seems consistent with the general story I said above.

In any case, I don't want to spend any more time figuring out what Eliezer believes, he can say something himself if he wants. I mostly replied to this comment to clarify the particular argument I'm arguing against, which I thought Eliezer believed, but even if he doesn't it seems like a common implicit belief in the rationalist AI safety crowd and should be debunked anyway.

[-]Wei Dai7y50

In any case, I don’t want to spend any more time figuring out what Eliezer believes, he can say something himself if he wants. I mostly replied to this comment to clarify the particular argument I’m arguing against, which I thought Eliezer believed, but even if he doesn’t it seems like a common implicit belief in the rationalist AI safety crowd and should be debunked anyway.

It seems fine to debunk what you think is a common implicit belief in the rationalist AI safety crowd, but I think it's important to be fair to other researchers and not attribute errors to them when you don't know or aren't sure that they actually committed such errors. For people who aren't domain experts (which is most people), reputation is highly important for them to evaluate claims in a technical field like AI safety, so we should take care not to misinform them about, for example, how often someone makes technical errors.

[-]Rohin Shah7y30

I'm pretty sure I have never mentioned Eliezer in the Value Learning sequence. I linked to his writings because they're the best explanation of the perspective I'm arguing against. (Note that this is different from claiming that Eliezer believes that perspective.) It was this post and comment thread, not me, that attributed the argument and belief to Eliezer. I responded because it was specifically about what I was arguing against in my post, and I didn't say "I am clarifying the particular argument I am arguing against and am unsure what Eliezer's actual position is" because a) I did think that it was Eliezer's actual position, b) this is a ridiculous amount of boilerplate and c) I try not to spend too much time on comments.

I'm not feeling particularly open to feedback currently, because honestly I think I take far more care about this sort of issue than the typical researcher, but if you want to list a specific thing I could have done differently, I might try to consider how to do that sort of thing in the future.

[-]Ben Pace6y20

Just a note that in the link that Wei Dai provides for "Relevant powerful agents will be highly optimized", Eliezer explicitly assigns '75%' to 'The probability that an agent that is cognitively powerful enough to be relevant to existential outcomes, will have been subject to strong, general optimization pressures.'

even if he doesn't it seems like a common implicit belief in the rationalist AI safety crowd and should be debunked anyway.

Agreed.

[-]Rohin Shah6y20

Just a note that in the link that Wei Dai provides for "Relevant powerful agents will be highly optimized", Eliezer explicitly assigns '75%' to 'The probability that an agent that is cognitively powerful enough to be relevant to existential outcomes, will have been subject to strong, general optimization pressures.'

Yeah, it's worth noting that I don't understand what this means. By my intuitive read of the statement, I'd have given it 95+% of being true, in the sense that you aren't going to randomly stumble upon a powerful agent. But also by my intuitive read, the negative example given on that page would be a positive example:

An example of a scenario that negates RelevantPowerfulAgentsHighlyOptimized is KnownAlgorithmNonrecursiveIntelligence, where a cognitively powerful intelligence is produced by pouring lots of computing power into known algorithms, and this intelligence is then somehow prohibited from self-modification and the creation of environmental subagents.

On my view, known algorithms are already very optimized? E.g. Dijkstra's algorithm is highly optimized for efficient computation of shortest paths.

So TL;DR idk what optimized is supposed to mean here.

[-]James Payor7y60

First problem with this argument: there are no coherence theorems saying that an agent needs to maintain the same utility function over time.

This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future "you" and merge utility functions, which seems strictly better than not. (Side note: I'm pretty annoyed with all the use of "there's no coherence theorem for X" in this post.)

As a separate note, the "further out" your goal is and the more that your actions are for instrumental value, the more it should look like world 1 in which agents are valuing abstract properties of world states, and the less we should observe preferences over trajectories to reach said states.

(This is a reason in my mind to prefer the approval-directed-agent frame, in which humans get to inject preferences that are more about trajectories.)

[-]Richard_Ngo7y10

This seems pretty false to me.

I agree that this problem is not a particularly important one, and explicitly discard it a few sentences later. I hadn't considered your objection though, and will need to think more about it.

(Side note: I'm pretty annoyed with all the use of "there's no coherence theorem for X" in this post.)

Mind explaining why? Is this more a stylistic preference, or do you think most of them are wrong/irrelevant?

the "further out" your goal is and the more that your actions are for instrumental value, the more it should look like world 1 in which agents are valuing abstract properties of world states, and the less we should observe preferences over trajectories to reach said states.

Also true if you make world states temporally extended.

[-]TurnTrout6y10

And in fact, definition 1 turns out to have further problems. For example: I haven't yet defined how a coherent agent is meant to choose between equally good options. One natural approach is to simply allow it to make any choice in those situations - it can hardly be considered irrational for doing so, since by assumption whatever it chooses is just as good as any other option. However, in that case any behaviour is consistent with the indifferent preference function (which rates all outcomes as equal). So even under definition 1, any sequence of actions is coherent. Now, I don't think it's very realistic that superintelligent AGIs will actually be indifferent about the effects of most of their actions, so perhaps we can just rule out preferences which feature indifference too often. But note that this adds an undesirable element of subjectivity to our definition.

For what it's worth, under any continuous distribution over reward functions, only a measure zero subset of reward functions has more than one optimal trajectory from any state. So, it's a little less subjective to rule out indifference (assume continuity and ignore measure zero events), but it is still subjective and doesn't deal with the other problems with definition 1.
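As a rough illustration of the measure-zero point, here is a small Python sketch. The setup is an assumption made for the example, not something from the post: a single state with 8 available trajectories, and rewards drawn independently from a uniform distribution. It contrasts the indifferent reward function, under which every trajectory is optimal, with sampled reward functions, where ties for the optimum essentially never occur.

```python
import random


def optimal_set(reward):
    """Return every trajectory that attains the maximum reward."""
    best = max(reward.values())
    return [t for t, r in reward.items() if r == best]


# Hypothetical toy setting: 8 trajectories available from a single state.
trajectories = range(8)

# The indifferent reward function: all trajectories tie, so any choice is optimal.
indifferent = {t: 0.0 for t in trajectories}
n_optimal_indifferent = len(optimal_set(indifferent))

# Rewards drawn i.i.d. from a continuous (uniform) distribution.
rng = random.Random(0)
n_samples = 1000
n_tied = sum(
    len(optimal_set({t: rng.random() for t in trajectories})) > 1
    for _ in range(n_samples)
)
```

Here `n_optimal_indifferent` is 8, while `n_tied` counts how many sampled reward functions have more than one optimal trajectory. Floating-point numbers are only an approximation of a continuous distribution, so a tie is not strictly impossible, just vanishingly unlikely.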
