Vanessa Kosoy

Director of AI research at ALTER, where I lead a group working on the learning-theoretic agenda for AI alignment. I'm also supported by the LTFF. See also LinkedIn.

E-mail: {first name}@alter.org.il

This post tries to push back against the role of expected utility theory in AI safety by arguing against various ways to derive expected utility axiomatically. I heard many such arguments before, and IMO they are never especially useful. This post is no exception.

The OP presents the position it argues against as follows (in my paraphrasing): "Sufficiently advanced agents don't play dominated strategies, therefore, because of [theorem], they have to be expected utility maximizers, therefore they have to be goal-directed and [other conclusions]". They then proceed to argue that there is no theorem that can make this argument go through.

I think that this entire framing is attacking a weak man. The real argument for expected utility theory is:

  • In AI safety, we are from the get-go interested in goal-directed systems because (i) we want AIs to achieve goals for us (ii) we are worried about systems with bad goals and (iii) stopping systems with bad goals is also a goal.
  • The next question is then, what is a useful mathematical formalism for studying goal-directed systems.
  • The theorems quoted in the OP are moderate evidence that expected utility has to be part of this formalism, because their assumptions resonate a lot with our intuitions for what "rational goal-directed behavior" is. Yes, of course we can still quibble with the assumptions (like the OP does in some cases), which is why I say "moderate evidence" rather than "completely watertight proof", but given how natural the assumptions are, the evidence is good.
  • More importantly, the theorems are only a small part of the evidence base. A philosophical question is never fully answered by a single theorem. Instead, the evidence base is holistic: looking at the theoretical edifices growing up from expected utility (control theory, learning theory, game theory etc) one becomes progressively more and more convinced that expected utility correctly captures some of the core intuitions behind "goal-directedness".
  • If one does want to present a convincing case against expected utility, quibbling with the assumptions of VNM or whatnot is an incredibly weak move. Instead, show us where the entire edifice of existing theory runs aground because of expected utility and how some alternative to expected utility can do better (as an analogy, see how infra-Bayesianism supplants Bayesian decision theory).

In conclusion, there are coherence theorems. But, more important than individual theorems are the "coherence theories".

This remains the best overview of the learning-theoretic agenda to date. As a complementary pedagogic resource, there is now also a series of video lectures.

Since the article was written, there have been several new publications:

In addition, some new developments were briefly summarized in short-forms:

  • A proposed solution for the monotonicity problem in infra-Bayesian physicalism. This is potentially very important since the monotonicity problem was by far the biggest issue with the framework (and as a consequence, with PSI).
  • Multiple developments concerning metacognitive agents (see also recorded talk). This framework seems increasingly important, but an in-depth analysis is still pending.
  • A conjecture about a possible axiomatic characterization of the maximin decision rule in infra-Bayesianism. If true, it would go a long way to allaying any concerns about whether maximin is the "correct" choice.
  • Ambidistributions: a cute new mathematical gadget for formalizing the notion of "control" in infra-Bayesianism.

Meanwhile, active research proceeds along several parallel directions:

  • I'm working towards the realization of the "frugal compositional languages" dream. So far, the problem is still very much open, but I obtained some interesting preliminary results which will appear in an upcoming paper (codename: "ambiguous online learning"). I also realized this direction might have tight connections with categorical systems theory (the latter being a mathematical language for compositionality). An unpublished draft was written by my MATS scholars on the subject of compositional polytope MDPs, hopefully to be completed some time during '25.
  • Diffractor achieved substantial progress in the theory of infra-Bayesian regret bounds, producing an infra-Bayesian generalization of decision-estimation coefficients (the latter is a nearly universal theory of regret bounds in episodic RL). This generalization has important connections to Garrabrant induction (of the flavor studied here), finally sketching a unified picture of these two approaches to "computational uncertainty" (Garrabrant induction and infra-Bayesianism). Results will appear in an upcoming paper.
  • Gergely Szucs is studying the theory of hidden rewards, starting from the realization in this short-form (discovering some interesting combinatorial objects beyond what was described there).

It remains true that there are more shovel-ready open problems than researchers, and hence the number of (competent) researchers is still the bottleneck.

Seems right, but is there a categorical derivation of the Wentworth-Lorell rules? Maybe they can be represented as theorems of the form: given an arbitrary Markov category C, such-and-such identities between string diagrams in C imply (more) identities between string diagrams in C.

This article studies a potentially very important question: is improving connectomics technology net harmful or net beneficial from the perspective of existential risk from AI? The author argues that it is net beneficial. Connectomics seems like it would help with understanding the brain's reward/motivation system, but not so much with understanding the brain's learning algorithms. Hence it arguably helps more with AI alignment than AI capability. Moreover, it might also lead to accelerating whole brain emulation (WBE) which is also helpful.

The author mentions 3 reasons why WBE is helpful: 

  • We can let WBEs work on alignment.
  • We can more easily ban de novo AGI by letting WBEs fill its economic niche.
  • Maybe we can derive aligned superintelligence from modified WBEs.

I think there is another reason: some alignment protocols might rely on letting the AI study a WBE and use it for e.g. inferring human values. The latter might be viable even if actually running the WBE is too slow to be useful with contemporary technology.

I think that performing this kind of differential benefit analysis for various technologies might be extremely important, and I would be glad to see more of it on LW/AF (or anywhere).

This article studies a natural and interesting mathematical question: which algebraic relations hold between Bayes nets? In other words, if a collection of random variables is consistent with several Bayes nets, what other Bayes nets does it also have to be consistent with? The question is studied both for exact consistency and for approximate consistency: in the latter case, the joint distribution is KL-close to a distribution that's consistent with the net. The article proves several rules of this type, some of them quite non-obvious. The rules have concrete applications in the authors' research agenda.

Some further questions that I think would be interesting to study:

  • Can we derive a full classification of such rules?
  • Is there a category-theoretic story behind the rules? Meaning, is there a type of category for which Bayes nets are something akin to string diagrams and the rules follow from the categorical axioms?
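To make the notion of "exact consistency" concrete, here is a minimal illustrative sketch (all names are mine; binary variables only) of checking whether a joint distribution factorizes according to a given DAG:

```python
def marginal(joint, keep):
    """Marginalize a joint distribution (dict: assignment tuple -> prob)
    onto the variable indices in `keep`."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def consistent_with_dag(joint, parents, tol=1e-9):
    """True iff `joint` factorizes as prod_i P(x_i | parents(i)) for the
    DAG given by `parents` (dict: node index -> tuple of parent indices)."""
    n = len(next(iter(joint)))
    for x, p in joint.items():
        q = 1.0
        for i in range(n):
            pa = parents[i]
            num = marginal(joint, (i,) + pa)[tuple(x[j] for j in (i,) + pa)]
            den = marginal(joint, pa)[tuple(x[j] for j in pa)] if pa else 1.0
            q *= num / den if den > 0 else 0.0
        if abs(p - q) > tol:
            return False
    return True

# A perfectly correlated pair is consistent with the chain X -> Y,
# but not with the DAG where X and Y are independent.
chain = {(0, 0): 0.5, (1, 1): 0.5}
assert consistent_with_dag(chain, {0: (), 1: (0,)})
assert not consistent_with_dag(chain, {0: (), 1: ()})
```

The approximate version discussed in the article would replace the exact comparison with a KL-divergence threshold to the nearest distribution consistent with the net.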

Tbf, you can fit a quadratic polynomial to any 3 points. But triangular numbers are certainly an aesthetically pleasing choice. (Maybe call it "triangular voting"?)
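For concreteness, the "any 3 points fit a quadratic" observation, checked on the first triangular numbers (a quick illustrative sketch):

```python
import numpy as np

# Any 3 points with distinct x-coordinates determine a unique quadratic.
# Fitting one through the first three triangular numbers T(n) = n(n+1)/2:
n = np.array([1, 2, 3])
t = np.array([1, 3, 6])
coeffs = np.polyfit(n, t, deg=2)   # recovers 0.5*n^2 + 0.5*n exactly
assert np.allclose(coeffs, [0.5, 0.5, 0.0])
# The fitted quadratic extrapolates to the next triangular number:
assert np.isclose(np.polyval(coeffs, 4), 10.0)
```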

I feel that this post would benefit from having the math spelled out. How is inserting a trader a way to do feedback? Can you phrase classical RL like this?

Two thoughts about the role of quining in IBP:

  • Quines are non-unique (there can be multiple fixed points). This means that, viewed as a prescriptive theory, IBP produces multi-valued prescriptions. It might be the case that this multi-valuedness can resolve problems with UDT such as Wei Dai's 3-player Prisoner's Dilemma and the anti-Newcomb problem[1]. In these cases, a particular UDT/IBP (corresponding to a particular quine) loses to CDT. But, a different UDT/IBP (corresponding to a different quine) might do as well as CDT.
  • What to do about agents that don't know their own source-code? (Arguably humans are such.) Upon reflection, this is not really an issue! If we use IBP prescriptively, then we can always assume quining: IBP is just telling you to follow a procedure that uses quining to access its own (i.e. the procedure's) source code. Effectively, you are instantiating an IBP agent inside yourself with your own prior and utility function. On the other hand, if we use IBP descriptively, then we don't need quining: Any agent can be assigned "physicalist intelligence" (Definition 1.6 in the original post, can also be extended to not require a known utility function and prior, along the lines of ADAM) as long as the procedure doing the assigning knows its source code. The agent doesn't need to know its own source code in any sense.
  1. ^

    @Squark is my own old LessWrong account.

Ambidistributions

I believe that all or most of the claims here are true, but I haven't written all the proofs in detail, so take it with a grain of salt.

Ambidistributions are a mathematical object that simultaneously generalizes infradistributions and ultradistributions. It is useful to represent how much power an agent has over a particular system: which degrees of freedom it can control, which degrees of freedom obey a known probability distribution and which are completely unpredictable.

Definition 1: Let $X$ be a compact Polish space. A (crisp) ambidistribution on $X$ is a function $Q\colon C(X)\to\mathbb{R}$ s.t.

  1. (Monotonicity) For any $f,g\in C(X)$, if $f\leq g$ then $Q(f)\leq Q(g)$.
  2. (Homogeneity) For any $f\in C(X)$ and $\lambda\geq 0$, $Q(\lambda f)=\lambda Q(f)$.
  3. (Constant-additivity) For any $f\in C(X)$ and $c\in\mathbb{R}$, $Q(f+c)=Q(f)+c$.

Conditions 1+3 imply that  is 1-Lipschitz. We could introduce non-crisp ambidistributions by dropping conditions 2 and/or 3 (and e.g. requiring 1-Lipschitz instead), but we will stick to crisp ambidistributions in this post.

The space of all ambidistributions on  will be denoted .[1] Obviously,  (where  stands for (crisp) infradistributions), and likewise for ultradistributions.

Examples

Example 1: Consider compact Polish spaces  and a continuous mapping . We can then define  by

That is,  is the value of the zero-sum two-player game with strategy spaces  and  and utility function .

Notice that  in Example 1 can be regarded as a Cartesian frame: this seems like a natural connection to explore further.
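A minimal finite sketch of Example 1, computing the game value as a max-min (the maximizer picks from the first strategy space, the minimizer from the second, as described above); the three axioms of Definition 1 can be checked numerically. All variable names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite toy version of Example 1: the strategy sets and the outcome
# mapping g are encoded by an |S| x |T| array of outcome indices into X.
S, T, X = 3, 4, 5
g = rng.integers(0, X, size=(S, T))

def Q(f):
    """Value of the zero-sum game with payoff f(g(s, t)):
    the maximizer picks s, the minimizer picks t."""
    return np.max(np.min(f[g], axis=1))

f = rng.random(X)
h = f + rng.random(X)          # h >= f pointwise

# The three crisp-ambidistribution axioms, checked numerically:
assert Q(h) >= Q(f)                         # monotonicity
assert np.isclose(Q(2.0 * f), 2.0 * Q(f))   # homogeneity
assert np.isclose(Q(f + 7.0), Q(f) + 7.0)   # constant-additivity
```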

Example 2: Let  and  be finite sets representing actions and observations respectively, and  be an infra-Bayesian law. Then, we can define  by

In fact, this is a faithful representation:  can be recovered from .

Example 3: Consider an infra-MDP with finite state set , initial state  and transition infrakernel . We can then define the "ambikernel"  by

Thus, every infra-MDP induces an "ambichain". Moreover:

Claim 1:  is a monad. In particular, ambikernels can be composed. 

This allows us to define

This object is the infra-Bayesian analogue of the convex polytope of accessible state occupancy measures in an MDP.

Claim 2: The following limit always exists:

Legendre-Fenchel Duality

Definition 3: Let  be a convex space and . We say that  occludes  when for any , we have

Here,  stands for convex hull.

We denote this relation . The reason we call this "occlusion" is apparent for the  case.

Here are some properties of occlusion:

  1. For any .
  2. More generally, if  then .
  3. If  and  then .
  4. If  and  then .
  5. If  and  for all , then .
  6. If  for all , and also , then .

Notice that occlusion has similar algebraic properties to logical entailment, if we think of  as " is a weaker proposition than ".

Definition 4: Let  be a compact Polish space. A cramble set[2] over  is  s.t.

  1.  is non-empty.
  2.  is topologically closed.
  3. For any finite  and , if  then . (Here, we interpret elements of  as credal sets.)

Question: If instead of condition 3, we only consider binary occlusion (i.e. require it only for pairs), do we get the same concept?

Given a cramble set , its Legendre-Fenchel dual ambidistribution is

Claim 3: Legendre-Fenchel duality is a bijection between cramble sets and ambidistributions.

Lattice Structure

Functionals

The space  is equipped with the obvious partial order:  when for all  . This makes  into a distributive lattice, with

This is in contrast to  which is a non-distributive lattice.

The bottom and top elements are given by

Ambidistributions are closed under pointwise suprema and infima, and hence  is complete and satisfies both infinite distributive laws, making it a complete Heyting and co-Heyting algebra.

 is also a De Morgan algebra with the involution

For  is not a Boolean algebra:  and for any  we have .

One application of this partial order is formalizing the "no traps" condition for infra-MDPs:

Definition 2: A finite infra-MDP is quasicommunicating when for any 

Claim 4: The set of quasicommunicating finite infra-MDPs (or even infra-RDPs) is learnable.

Cramble Sets

Going to the cramble set representation,  iff 

 is just , whereas  is the "occlusion hull" of  and .

The bottom and the top cramble sets are

Here,  is the top element of  (corresponding to the credal set ).

The De Morgan involution is

Operations

Definition 5: Given  compact Polish spaces and a continuous mapping , we define the pushforward  by

When  is surjective, there are both a left adjoint and a right adjoint to , yielding two pullback operators :

 

Given  and  we can define the semidirect product  by

There are probably more natural products, but I'll stop here for now.
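Continuing the finite toy setting, here is a sketch of the pushforward from Definition 5, assuming the usual formula $(s_*Q)(f) = Q(f \circ s)$ (an assumption on my part; all names are illustrative):

```python
import numpy as np

# A map s : X -> Y with |X| = 4 and |Y| = 2, encoded as an index array.
s = np.array([0, 0, 1, 1])

def Q(f):
    """A toy crisp ambidistribution on X: a max of mins is monotone,
    homogeneous and constant-additive."""
    return max(min(f[0], f[1]), min(f[2], f[3]))

def pushforward(Q, s):
    """Return s_* Q, acting on functions f : Y -> R via (s_* Q)(f) = Q(f o s)."""
    return lambda f: Q(np.asarray(f)[s])

sQ = pushforward(Q, s)
f = np.array([3.0, 1.0])
# f o s = [3, 3, 1, 1], so (s_* Q)(f) = max(min(3,3), min(1,1)) = 3
assert np.isclose(sQ(f), 3.0)
```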

Polytopic Ambidistributions

Definition 6: The polytopic ambidistributions  are the (incomplete) sublattice of  generated by .

Some conjectures about this:

  • For finite , an ambidistribution  is polytopic iff there is a finite polytope complex  on  s.t.  is affine on every cell of .
  • For finite , a cramble set  is polytopic iff it is the occlusion hull of a finite set of polytopes in .
  •  and  from Example 3 are polytopic.
  1. ^

    The non-convex shape  reminds us that ambidistributions need not be convex or concave.

  2. ^

    The expression "cramble set" is meant to suggest a combination of "credal set" with "ambi".

Here's a sketch of an AIT toy-model theorem saying that, in complex environments without traps, applying selection pressure reliably produces learning agents. I view it as an example of Wentworth's "selection theorem" concept.

Consider any environment  of infinite Kolmogorov complexity (i.e. uncomputable). Fix a computable reward function

Suppose that there exists a policy  of finite Kolmogorov complexity (i.e. computable) that's optimal for  in the slow discount limit. That is,

Then,  cannot be the only environment with this property. Otherwise, this property could be used to define  using a finite number of bits, which is impossible[1]. Since  requires infinitely many more bits to specify than  and , there have to be infinitely many environments with the same property[2]. Therefore,  is a reinforcement learning algorithm for some infinite class of hypotheses.
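For concreteness, the slow-discount optimality condition in this sketch can be written as follows (a hedged guess at the intended formalization, with $\gamma$ the geometric discount factor, $r_t$ the rewards received in the environment $\mu$, and $\pi^*$ the computable policy):

```latex
\lim_{\gamma \to 1} \mathbb{E}^{\pi^*}_{\mu}\!\left[(1-\gamma)\sum_{t=0}^{\infty}\gamma^t r_t\right]
= \lim_{\gamma \to 1} \max_{\pi}\, \mathbb{E}^{\pi}_{\mu}\!\left[(1-\gamma)\sum_{t=0}^{\infty}\gamma^t r_t\right]
```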

Moreover, there are natural examples of  as above. For instance, let's construct  as an infinite sequence of finite communicating infra-RDP refinements that converges to an unambiguous (i.e. "not infra") environment. Since each refinement involves some arbitrary choice, "most" such  have infinite Kolmogorov complexity. In this case,  exists: it can be any learning algorithm for finite communicating infra-RDPs with an arbitrary number of states.

Besides making this a rigorous theorem, there are many additional questions for further investigation:

  • Can we make similar claims that incorporate computational complexity bounds? It seems that it should be possible to at least constrain our algorithms to be PSPACE in some sense, but it's not obvious how to go beyond that (maybe it would require the frugal universal prior).
  • Can we argue that  must be an infra-Bayesian learning algorithm? Relatedly, can we make a variant where computable/space-bounded policies can only attain some part of the optimal asymptotic reward of ?
  • The setting we described requires that all the traps in  can be described in a finite number of bits. If this is not the case, can we make a similar sort of argument that implies  is Bayes-optimal for some prior over a large hypothesis class?
  1. ^

    Probably, making this argument rigorous requires replacing the limit with a particular regret bound. I ignore this for the sake of simplifying the core idea.

  2. ^

    There probably is something more precise that can be said about how "large" this family of environments is. For example, maybe it must be uncountable.
