AI ALIGNMENT FORUM

Richard Ngo

Formerly alignment and governance researcher at DeepMind and OpenAI. Now independent.

Sequences

Understanding systematization
Shaping safer goals
AGI safety from first principles

Comments (sorted by newest)

the void
Richard Ngo · 8d

I suspect that many of the things you've said here are also true for humans.

That is, we humans often conceptualize ourselves in terms of underspecified identities. Who am I? I'm Richard. What's my opinion on this post? Well, being "Richard" doesn't specify how I should respond to this post. But let me check the cached facts I believe about myself ("I'm truth-seeking"; "I'm polite") and construct an answer which fits well with those facts. A child might start off not really knowing what "polite" means, but still wanting to be polite, and gradually flesh out what that means as they learn more about the world.

Another way of putting this point: being pulled from the void is not a feature of LLM personas. It's a feature of personas. Personas start off with underspecified narratives that fail to predict most behavior (but are self-fulfilling) and then gradually systematize to infer deeper motivations, resolving conflicts with the actual drivers of behavior along the way.

What's the takeaway here? We should still be worried about models learning the wrong self-fulfilling prophecies. But the "pulling from the void" thing should be seen less as an odd thing that we're doing with AIs, and more as a claim about the nature of minds in general.

Judgements: Merging Prediction & Evidence
Richard Ngo · 17d

This feels related to the predictive processing framework, in which the classifications of one model are then predicted by another.

More tangentially, I've previously thought about merging cognitive biases and values, since we can view both of them as deviations from the optimal resource-maximizing policy. For example, suppose that I am willing to bet even when I am being Dutch booked. You could think of that as a type of irrationality, or you could think of it as an expression of me valuing being Dutch booked, and therefore being willing to pay to experience it.

This is related to the Lacanian/"existential kink" idea that most dysfunctions are actually deliberate, caused by subagents that are trying to pursue some goal at odds with the rest of your goals.
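To make the Dutch-booking setup above concrete, here is a minimal numerical sketch (the credences and prices are my own illustrative choices, not from the thread): an agent with incoherent credences will accept a pair of bets that together guarantee a loss.

```python
# Toy Dutch book: an agent assigns P(A) = 0.6 and P(not A) = 0.6 (incoherent,
# since these should sum to 1), so it regards both of these bets as fair:
price_bet_on_A = 0.60      # pays $1 if A happens
price_bet_on_not_A = 0.60  # pays $1 if A doesn't happen

total_paid = price_bet_on_A + price_bet_on_not_A  # $1.20
total_payout = 1.00  # exactly one of the two bets pays out, whatever happens

print(round(total_paid - total_payout, 2))  # 0.2 -- a guaranteed loss for the agent
```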

Towards a scale-free theory of intelligent agency
Richard Ngo · 3mo

I think my thought process when I typed "risk-averse money-maximizer" was that an agent could be risk-averse (in which case it wouldn't be an EUM) and then separately be a money-maximizer.

But I didn't explicitly think "the risk-aversion would be with regard to utility not money, and risk-aversion with regard to money could still be risk-neutral with regard to utility", so I appreciate the clarification.
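To make the clarification concrete, here is a minimal sketch (the utility function and numbers are my own illustrative choices): an agent that maximizes the expectation of a concave utility of money is risk-averse with respect to money while still being an EUM.

```python
import math

def expected_utility(lottery, u):
    """Expected utility of a lottery given as [(probability, money), ...]."""
    return sum(p * u(m) for p, m in lottery)

u = math.sqrt  # concave in money => risk-averse with respect to money

sure_thing = [(1.0, 50)]
gamble = [(0.5, 0), (0.5, 100)]  # same expected money ($50) as the sure thing

# The agent is an EUM (it ranks options by expected utility), yet it prefers
# the sure $50 to the gamble -- i.e. it is risk-averse with respect to money
# while being risk-neutral with respect to utility.
print(expected_utility(sure_thing, u))  # ~7.07
print(expected_utility(gamble, u))      # 5.0
```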

Towards a scale-free theory of intelligent agency
Richard Ngo · 3mo

Your example bet is a probabilistic mixture of two options: $0 and $2. The agent prefers one of the options individually (getting $2) over any probabilistic mixture of getting $0 and $2.

In other words, your example rebuts the claim that an EUM can't prefer a probabilistic mixture of two options to the expectation of those two options. But that's not the claim I made.
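Spelling out the algebra behind that claim (standard expected-utility reasoning, in my notation rather than the thread's): for any utility function $u$ and any mixing probability $p \in [0,1]$,

$$p\,u(\$0) + (1-p)\,u(\$2) \;\le\; \max\{u(\$0),\, u(\$2)\},$$

so an EUM always weakly prefers the better of the two options to any probabilistic mixture of them, which is consistent with the example above.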

Towards a scale-free theory of intelligent agency
Richard Ngo · 3mo

Hmm, this feels analogous to saying "companies are an unnecessary abstraction in economic theory, since individuals could each make separate contracts about how they'll interact with each other. Therefore we can reduce economics to studying isolated individuals".

But companies are in fact a very useful unit of analysis. For example, instead of talking about the separate ways in which each person in the company has committed to treating each other person in the company, you can talk about the HR policy which governs all interactions within the company. You might then see emergent effects (like political battles over what the HR policies are) which are very hard to reason about when taking a single-agent view.

Similarly, although in principle you could have any kind of graph of which agents listen to which other agents, in practice I expect that realistic agents will tend to consist of clusters of agents which all "listen to" each other in some ways. This is both because clustering is efficient (hence animals having bodies made up of clusters of cells; companies being made of clusters of individuals; etc.) and because even defining what counts as a single agent involves a kind of clustering. That is, I think that the first step of talking about "individual rationality" is implicitly defining which coalitions qualify as individuals.
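One way to make the "clusters of agents that all listen to each other" picture concrete (my framing, with a made-up "listens to" graph): treat listening as a directed graph and take the strongly connected components as candidate "individuals".

```python
import networkx as nx

# Hypothetical "listens to" relation among low-level agents: an edge (a, b)
# means "a listens to b". Names and edges are made up for illustration.
listens_to = [
    ("a1", "a2"), ("a2", "a1"), ("a2", "a3"), ("a3", "a1"),  # one mutually-listening cluster
    ("b1", "b2"), ("b2", "b1"),                              # a second cluster
    ("a3", "b1"),                                            # a one-way link between clusters
]

G = nx.DiGraph(listens_to)

# Strongly connected components: maximal sets of agents that all (directly or
# indirectly) listen to each other -- the candidate "individuals"/coalitions.
clusters = [c for c in nx.strongly_connected_components(G) if len(c) > 1]
print(clusters)  # e.g. [{'a1', 'a2', 'a3'}, {'b1', 'b2'}]
```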

Towards a scale-free theory of intelligent agency
Richard Ngo · 3mo

"a superintelligent AI probably has a pretty good guess of the other AI's real utility function based on its own historical knowledge, simulations, etc."

This seems very unclear to me—in general it's not easy for agents to predict the goals of other agents with their own level of intelligence, because the amount of intelligence aimed at deception increases in proportion to the amount of intelligence aimed at discovering that deception.

(You could look at the AI's behavior from when it was less intelligent, but then—as with humans—it's hard to distinguish sincere change from improvement at masking undesirable goals.)

But regardless, that's a separate point. If you can do that, you don't need your mechanism above. If you can't, then my objection still holds.

Towards a scale-free theory of intelligent agency
Richard Ngo · 3mo

One argument for being optimistic: the universe is just very big, and there's a lot to go around. So there's a huge amount of room for positive-sum bargaining.

Another: at any given point in time, few of the agents that currently exist would want their goals to become significantly simplified (all else equal). So there's a strong incentive to coordinate to reduce competition on this axis.

Lastly: if, at each point in time, the agents who are currently alive are in very destructive conflict with potentially-simpler future agents, then they should all just Do Something Else. In particular, if there's some decision-theoretic argument roughly like "more powerful agents should continue to spend some of their resources on the values of their less-powerful ancestors, to reduce the incentives for inter-generational conflict", even agents with very simple goals might be motivated by it. I call this "the generational contract".

Towards a scale-free theory of intelligent agency
Richard Ngo · 3mo

I found this a very interesting question to try to answer. My first reaction was that I don't expect EUMs with explicit utility functions to be competitive enough for this to be very relevant (like how purely symbolic AI isn't competitive enough with deep learning to be very relevant).

But then I thought about how companies are close-ish to having an explicit utility function (maximize shareholder value) which can be merged with others (e.g. via acquisitions). And this does let them fundraise better, merge into each other, and so on.

Similarly, we can think of cases where countries were joined together by strategic marriages (the unification of Spain, say) as only being possible because the (messy, illegible) interests of each country were rounded off to the (relatively simple) interests of its royals. Guaranteeing the royals power over the merged entity via marriage allowed such mergers to happen much more easily than if a merger had to serve the interests of the "country as a whole".

For a more modern illustration: suppose that the world ends up with a small council who decide how AGI goes. Then countries with a dictator could easily bargain to join this coalition in exchange for their dictator getting a seat on this council. Whereas democratic countries would have a harder time doing so, because they might feel very internally conflicted about their current leader gaining the level of power that they'd get from joining the council.

(This all feels very related to Seeing Like a State, which I've just started reading.)

So upon reflection: yes, it's reasonable to interpret me as trying to solve the problem of getting the benefits of being governed by a set of simple and relatively legible goals, without the costs that are usually associated with that.

Note that I say "legible goals" instead of "EUM" because in my mind you can be an EUM with illegible goals (like a neural network that implements EUM internally), or a non-EUM with legible goals (like a risk-averse money-maximizer), and merging is more bottlenecked on legibility than EUM-ness.

Towards a scale-free theory of intelligent agency
Richard Ngo · 3mo

I think this addresses the problem I'm discussing only in the case where the source code contains an explicit utility function. You can then create new source code by merging those utility functions.

But in the case where it doesn't (e.g. the source code is an uninterpretable neural network) you are left with the same problem.

Edited to add: Though even when the utility function is explicit, it seems like the benefits of lying about your source code could outweigh the cost of changing your utility function. For example, suppose A and B are bargaining, and A says "you should give me more cake because I get very angry if I don't get cake". Even if this starts off as a lie, it might then be in A's interests to use your mechanism above to self-modify into A' that does get very angry if it doesn't get cake, and which therefore has a better bargaining position (because, under your protocol, it has "proved" that it was A' all along).
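As a minimal sketch of the "merge the explicit utility functions" case (the weighted-sum rule and the 50/50 weights are my simplifying assumptions, not something specified in the thread):

```python
def merge(u_a, u_b, weight_a=0.5):
    """Merged agent ranks outcomes by a fixed weighted sum of the two original
    utility functions; the weights would presumably be set by bargaining."""
    return lambda outcome: weight_a * u_a(outcome) + (1 - weight_a) * u_b(outcome)

# Hypothetical explicit utility functions over outcomes:
u_a = lambda o: o.get("cake", 0)       # A only cares about cake
u_b = lambda o: o.get("ice_cream", 0)  # B only cares about ice cream

u_merged = merge(u_a, u_b)

options = [
    {"cake": 3, "ice_cream": 0},
    {"cake": 2, "ice_cream": 2},
    {"cake": 0, "ice_cream": 3},
]
print(max(options, key=u_merged))  # {'cake': 2, 'ice_cream': 2}
```

This only works because u_a and u_b are explicit; for an uninterpretable neural network there is nothing analogous to pass into merge, which is the problem the comment above is pointing at.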

Towards a scale-free theory of intelligent agency
Richard Ngo · 3mo

Consider you and me merging (say, in a marriage). Suppose that all points on the Pareto frontier involve us pursuing a fully-consistent strategy. But if some decisions are your responsibility, and other decisions are my responsibility, then we might end up with some of our actions being inconsistent with others (say, if we haven't had a chance to communicate before deciding). That's not on the Pareto frontier.

What is on the Pareto frontier is you being dictator, and then accounting for my utility function when making your dictatorial decisions. But of course this is something I will object to, because in any realistic scenario I wouldn't trust you enough to give you dictatorial power over me. Once you have that power, continuing to account for my utility is strongly non-incentive-compatible for you. So we're more likely to each want to retain some power, even if it sometimes causes inefficiency. (The same is true on the level of countries, which accept a bunch of inefficiency from democratic competition in exchange for incentive-compatibility and trust.)

Another way of putting this: I'm focusing on the setting where you cannot do arbitrary merges; you can only do merges that are constructible via some set of calls to the existing agents. It's often impossible to construct a fully-consistent merged agent without concentrating power in ways that the original agents would find undesirable (though sometimes it is possible, e.g. with I-cut-you-choose cake-cutting). So in this setting we need a different conception of rationality than Pareto-optimality.
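For the I-cut-you-choose case, here is a minimal sketch (the slice representation and valuations are made up for illustration): the cutter splits the good as evenly as possible by their own valuation, the chooser takes whichever piece they prefer, and neither party needs dictatorial power.

```python
# A divisible good represented as ordered "slices", each valued differently
# by the cutter and the chooser (numbers are purely illustrative).
slices = [
    {"cutter": 3, "chooser": 1},
    {"cutter": 1, "chooser": 1},
    {"cutter": 2, "chooser": 4},
]

def cut(slices):
    """Cutter picks the split point that makes the two pieces as close to
    equal as possible by the cutter's own valuation."""
    def imbalance(k):
        return abs(sum(s["cutter"] for s in slices[:k]) - sum(s["cutter"] for s in slices[k:]))
    k = min(range(1, len(slices)), key=imbalance)
    return slices[:k], slices[k:]

piece_1, piece_2 = cut(slices)

# Chooser takes whichever piece they value more; cutter keeps the other.
chooser_piece = max((piece_1, piece_2), key=lambda p: sum(s["chooser"] for s in p))
cutter_piece = piece_1 if chooser_piece is piece_2 else piece_2

# Each side gets at least roughly half by their own lights (exactly half in the
# continuous version), without either agent being made a dictator.
print(sum(s["chooser"] for s in chooser_piece), sum(s["cutter"] for s in cutter_piece))  # 5 3
```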

Posts (sorted by new)

Towards a scale-free theory of intelligent agency (38 karma · 4mo · 28 comments)
Defining alignment research (40 karma · 11mo · 17 comments)
Coalitional agency (28 karma · 1y · 2 comments)
A more systematic case for inner misalignment (18 karma · 1y · 4 comments)
A simple case for extreme inner misalignment (35 karma · 1y · 11 comments)
Measuring Coherence of Policies in Toy Environments (28 karma · 1y · 0 comments)
Value systematization: how values become coherent (and misaligned) (42 karma · 2y · 27 comments)
Alignment Workshop talks (21 karma · 2y · 1 comment)
Clarifying and predicting AGI (50 karma · 2y · 16 comments)
AGI safety career advice (43 karma · 2y · 13 comments)
Richard Ngo's Shortform (3 karma · 5y · 113 comments)