Vanessa Kosoy

Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I'm also supported by the LTFF. See also LinkedIn.

E-mail: {first name}@alter.org.il

Comments

Another excellent catch, kudos. I've really been sloppy with this shortform. I corrected it to say that we can approximate the system arbitrarily well by VNM decision-makers. Although I think it's also possible to argue that a system that selects a non-exposed point is not quite maximally influential, because it's selecting something that is very close to delegating some decision power to chance.

Also, maybe this cannot happen when  is the inverse limit of finite sets? (As is the case in sequential decision-making with finite action/observation spaces.) I'm not sure.

Example: Let , and  consist of the probability intervals  and . Then, it is (I think) consistent with the desideratum to have .

 

Not only does interpreting  require an unusual decision rule (which I will be calling a "utility hyperfunction"), but applying any ordinary utility function to this example yields a non-unique maximum. This is another point in favor of the significance of hyperfunctions.

You're absolutely right, good job! I fixed the OP.

TLDR: Systems with locally maximal influence can be described as VNM decision-makers.

There are at least 3 different motivations leading to the concept of "agent" in the context of AI alignment:

  1. The sort of system we are concerned about (i.e. which poses risk)
  2. The sort of system we want to build (in order to defend from dangerous systems)
  3. The sort of systems that humans are (in order to meaningfully talk about "human preferences")

Motivation #1 naturally suggests a descriptive approach, motivation #2 naturally suggests a prescriptive approach, and motivation #3 is sort of a mix of both: on the one hand, we're describing something that already exists; on the other hand, the concept of "preferences" inherently comes from a normative perspective. There are also reasons to think these different motivations should converge on a single, coherent concept.

Here, we will focus on motivation #1.

A central reason why we are concerned about powerful unaligned agents is that they are influential. Agents are the sort of system that, when instantiated in a particular environment, is likely to heavily change this environment, potentially in ways inconsistent with the preferences of other agents.

Bayesian VNM

Consider a nice space[1]  of possible "outcomes", and a system that can choose[2] out of a closed set of distributions . I propose that an influential system should satisfy the following desideratum:

The system cannot select  which can be represented as a non-trivial lottery over other elements in . In other words,  has to be an extreme point of the convex hull of .

Why? Because a system that selects a non-extreme point leaves something to chance. If the system can force outcome , or outcome  but chooses instead outcome , for  and , this means the system gave up on its ability to choose between  and  in favor of a -biased coin. Such a system is not "locally[3] maximally" influential[4].
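To make this concrete, here is a small numerical sketch (my own illustration, with made-up numbers): for a finite menu of distributions over three outcomes, a strict mixture of two menu elements is not an extreme point of the convex hull, and the mixture weight is exactly the "coin" the system delegates to.

```python
# Minimal sketch (made-up numbers): a strict mixture of two forceable outcome
# distributions is not an extreme point of the convex hull of the menu D.
import numpy as np
from scipy.optimize import linprog

mu1 = np.array([0.8, 0.1, 0.1])  # the system can force this distribution over 3 outcomes
mu2 = np.array([0.1, 0.1, 0.8])  # ...or this one
p = 0.3
mix = p * mu1 + (1 - p) * mu2    # a non-trivial lottery over mu1 and mu2

D = [mu1, mu2, mix]

def is_extreme(point, others):
    """`point` is extreme in conv(others + [point]) iff it is NOT a convex
    combination of `others`, i.e. iff the feasibility LP below has no solution."""
    A_eq = np.vstack([np.column_stack(others), np.ones(len(others))])
    b_eq = np.append(point, 1.0)
    res = linprog(c=np.zeros(len(others)), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * len(others))
    return not res.success

for i, mu in enumerate(D):
    others = [d for j, d in enumerate(D) if j != i]
    print(f"D[{i}] extreme: {is_extreme(mu, others)}")
# Expected output: D[0] and D[1] are extreme, D[2] (the mixture) is not.
```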

[EDIT: The original formulation was wrong, h/t @harfe for catching the error.]

The desideratum implies that there is a convergent sequence of utility functions  s.t.

  • For every  has a unique maximum  in .
  • The sequence  converges to .

In other words, such a system can be approximated by a VNM decision-maker to within any precision. For finite , we don't need the sequence; instead, there is some  s.t.  is the unique maximum of  over . This observation is mathematically quite simple, but I haven't seen it made elsewhere (though I would not be surprised if it appears somewhere in the decision theory literature).
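For the finite case, the observation can be phrased as a small linear program: given the chosen element (assumed here to be a vertex of the convex hull of the menu), find a utility function on outcomes whose expected value is strictly and uniquely maximized at it. A sketch with made-up numbers:

```python
# Sketch (my own, made-up numbers): for a finite menu D and a chosen element that
# is a vertex of conv(D), find a utility function u on outcomes such that the
# chosen element is the unique maximizer of expected utility over D.
import numpy as np
from scipy.optimize import linprog

mu1 = np.array([0.8, 0.1, 0.1])
mu2 = np.array([0.1, 0.1, 0.8])
mu3 = np.array([0.4, 0.5, 0.1])
D = [mu1, mu2, mu3]
chosen = 0                       # pretend the system selected mu1

# Variables: u in R^3 (utility per outcome) and a margin t; maximize t subject to
#   (mu_chosen - nu) . u >= t  for every other nu in D,  with -1 <= u_i <= 1.
# If the chosen point is a vertex of conv(D), the optimal margin is positive.
others = [d for j, d in enumerate(D) if j != chosen]
A_ub = np.array([np.append(nu - D[chosen], 1.0) for nu in others])  # (nu - mu).u + t <= 0
b_ub = np.zeros(len(others))
c = np.append(np.zeros(3), -1.0)                                    # minimize -t
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(-1, 1)] * 3 + [(None, None)])
u, t = res.x[:3], res.x[3]
print("utility u =", u.round(3), " margin t =", round(t, 3))
print("expected utilities over D:", [round(float(d @ u), 3) for d in D])
# Expected: a positive margin, with D[0] attaining the strictly largest value.
```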

Infra-Bayesian VNM?

Now, let's say that the system is choosing out of a set of credal sets (crisp infradistributions) . I propose the following desideratum:

[EDIT: Corrected according to a suggestion by @harfe, original version was too weak.]

Let  be the closure of  w.r.t. convex combinations and joins[5]. Let  be selected by the system. Then:

  • For any  and , if  then .
  • For any , if  then .

The justification is that a locally maximally influential system should leave the outcome neither to chance nor to ambiguity (the two types of uncertainty we have with credal sets).
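Spelled out in placeholder notation (my own shorthand for this sketch, writing Θ for the selected credal set and D̄ for the closure described above), the intended conditions are roughly that Θ is neither a non-trivial convex combination nor a non-trivial join of available elements:

```latex
% Placeholder notation (my shorthand): \Theta is the selected credal set,
% \bar{D} is the closure of the original set under convex combinations and joins.
\forall\, \Theta_1, \Theta_2 \in \bar{D},\ p \in (0,1):\quad
  \Theta = p\,\Theta_1 + (1-p)\,\Theta_2 \;\Longrightarrow\; \Theta_1 = \Theta_2 = \Theta
\qquad
\forall\, \Theta_1, \Theta_2 \in \bar{D}:\quad
  \Theta = \Theta_1 \vee \Theta_2 \;\Longrightarrow\; \Theta_1 = \Theta \ \text{or}\ \Theta_2 = \Theta
```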

We would like to say that this implies that the system is choosing according to maximin relative to a particular utility function. However, I don't think this is true, as the following example shows:

Example: Let , and  consist of the probability intervals  and . Then, it is (I think) consistent with the desideratum to have .
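For concreteness, here is what the maximin rule looks like on interval-type credal sets over a binary outcome (the numbers below are made up for illustration and are not the ones from the example above):

```python
# Minimal sketch (made-up numbers): the maximin rule over credal sets given by
# probability intervals for a binary outcome.  A credal set (a, b) means
# "P(outcome = 1) is somewhere between a and b"; maximin evaluates each set by
# its worst-case expected utility.
def maximin_value(interval, u0, u1):
    """Worst-case expected utility over p in [a, b], where u(0)=u0, u(1)=u1.
    Expected utility is affine in p, so the worst case sits at an endpoint."""
    a, b = interval
    return min(p * u1 + (1 - p) * u0 for p in (a, b))

credal_sets = {"A": (0.2, 0.4), "B": (0.3, 0.9)}   # hypothetical intervals
u0, u1 = 0.0, 1.0                                   # utility of outcome 0 / outcome 1

values = {name: maximin_value(iv, u0, u1) for name, iv in credal_sets.items()}
print(values)                        # {'A': 0.2, 'B': 0.3}
print(max(values, key=values.get))   # 'B': with this utility, only the lower endpoints matter
```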

Instead, I have the following conjecture:

Conjecture: There exists some space , some  and convergent sequence  s.t.

As before, the maxima should be unique.

Such a "generalized utility function" can be represented as an ordinary utility function with a latent -valued variable, if we replace  with  defined by

However, using utility functions constructed in this way leads to issues with learnability, which probably means there are also issues with computational feasibility. Perhaps in some natural setting, there is a notion of "maximally influential under computational constraints" which implies an "ordinary" maximin decision rule.

This approach does rule out optimistic or "mesomistic" decision rules. Optimistic decision-makers tend to give up on influence, because they believe that "nature" will decide favorably for them. Influential agents cannot give up on influence; therefore, they should be pessimistic.
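A toy comparison (made-up numbers, not from the post) of a pessimistic (maximin) rule versus an optimistic (maximax) rule, showing the sense in which the optimist defers to ambiguity:

```python
# Sketch (made-up numbers): why an optimistic rule "gives up on influence".
# Two credal sets over a binary outcome, as probability intervals for outcome 1:
#   - "narrow": the system pins the outcome distribution down almost exactly
#   - "wide":   the system leaves the distribution highly ambiguous
narrow = (0.55, 0.60)
wide = (0.05, 0.95)
u0, u1 = 0.0, 1.0   # utility of outcome 0 / outcome 1

def expected(p):     # expected utility when P(outcome = 1) = p
    return p * u1 + (1 - p) * u0

maximin = {"narrow": min(map(expected, narrow)), "wide": min(map(expected, wide))}
maximax = {"narrow": max(map(expected, narrow)), "wide": max(map(expected, wide))}

print(max(maximin, key=maximin.get))  # 'narrow': the pessimist keeps control of the outcome
print(max(maximax, key=maximax.get))  # 'wide': the optimist defers to ambiguity ("nature will be kind")
```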

Sequential Decision-Making

What would be the implications in a sequential setting? That is, suppose that we have a set of actions , a set of observations , a prior  and

In this setting, the result is vacuous because of an infamous issue: any policy can be justified by a contrived utility function that favors it. However, this is only because the formal desideratum doesn't capture the notion of "influence" sufficiently well. Indeed, a system whose influence boils down entirely to its own outputs is not truly influential. What motivation #1 asks of us is to talk about systems that influence the world at large, including relatively "faraway" locations.
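To spell out the standard construction behind the vacuousness claim (in my own notation): given any policy, reward exactly the histories in which the system acted as that policy prescribes; the policy is then trivially optimal.

```latex
% Standard construction (my notation): h = a_1 o_1 a_2 o_2 \ldots is an
% action-observation history and \pi is an arbitrary (deterministic) policy.
u_\pi(h) \;=\;
\begin{cases}
  1 & \text{if } a_t = \pi\big((a o)_{<t}\big) \text{ for every } t,\\
  0 & \text{otherwise.}
\end{cases}
% Under any prior over environments, \pi attains expected utility 1, so it is
% optimal for u_\pi; by itself, the desideratum places no constraint on behavior.
```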

One way to fix some of the problem is to take  and define  accordingly. This singles out systems that have influence over their observations rather than only their actions, which is already non-vacuous (not all policies qualify). However, such a system can still be myopic. We can take this further and select for "long-term" influence by projecting onto late observations or some statistics over observations. In order to talk about genuinely "far-reaching" influence, though, we probably need to switch to the infra-Bayesian physicalism setting. There, we can set , i.e. select for systems that have influence over physically manifest computations.

  1. ^

I won't keep track of topological technicalities here; probably everything here works at least for compact Polish spaces.

  2. ^

    Meaning that the system has some output, and different counterfactual outputs correspond to different elements of .

  3. ^

    I say "locally" because it refers to something like a partial order, not a global scalar measure of influence.

  4. ^

    See also Yudkowsky's notion of efficient systems "not leaving free energy".

  5. ^

    That is, if  then their join (convex hull)   is also in , and so is  for every . Moreover,  is the minimal closed superset of  with this property. Notice that this implies  is closed w.r.t. arbitrary infra-convex combinations, i.e. for any  and , we have .

Do you mean that seeing the opponent make dumb moves makes the AI infer that its own moves are also supposed to be dumb, or something else?

Apparently someone let LLMs play against the random policy, and for most of them most games end in a draw. It seems like o1-preview is the best of those tested, managing to win 47% of the time.

This post states and speculates on an important question: are there different mind types that are in some sense "fully general" (the author calls it "unbounded") but nevertheless qualitatively different? The author calls these hypothetical mind taxa "cognitive realms".

This is how I think about this question, from within the LTA:

To operationalize "minds", we should be thinking of learning algorithms. Learning algorithms can be classified according to their "syntax" and "semantics" (my own terminology). Here, semantics refers to questions such as (i) what type of object the algorithm is learning, (ii) what feedback/data is available to the algorithm, and (iii) what the success criterion/parameter of the algorithm is. Syntax, on the other hand, refers to the prior and/or hypothesis class of the algorithm (where the hypothesis class might be parameterized in a particular way, with particular requirements on how the learning rate depends on the parameters).

Among different semantics, we are especially interested in those that are in some sense agentic. Examples include reinforcement learning, infra-Bayesian reinforcement learning, metacognitive agents and infra-Bayesian physicalist agents.

Do different agentic semantics correspond to different cognitive realms? Maybe, but maybe not: it is plausible that most of them are reflectively unstable. For example, Christiano's malign prior might be a mechanism by which all agents converge to infra-Bayesian physicalism.

Agents with different syntaxes are another candidate for cognitive realms. Here, the question is whether there is an (efficiently learnable) syntax that is in some sense "universal": all other (efficiently learnable) syntaxes can be efficiently translated into it. This is a wide-open question. (See also "frugal universal prior".)

In the context of AI alignment, in order to achieve superintelligence it is arguably sufficient to use a syntax equivalent to whatever is used by human brain algorithms. Moreover, it's plausible that any algorithm we can come up with can only have an equivalent or weaker syntax (the process of us discovering the new syntax suggests an embedding of the new syntax into our own). Therefore, even if there are many cognitive realms, for our purposes we mostly only care about one of them. However, the multiplicity of realms has implications for how simple/natural/canonical we should expect the choice of syntax for our theory of agents to be (the fewer realms, the more canonical).
