Seeking Collaborators

abramdemski

I've been accepted as a mentor for the next AI Safety Camp. You can apply to work with me on the tiling problem. The goal will be to develop reflectively consistent UDT-inspired decision theories, and try to prove tiling theorems for them.

The deadline for applicants is November 17.

The program will run from January 11 to April 27. It asks for a commitment of 10 hours per week.

I am not being funded for this.^[1] You can support my work on Patreon.

My project description follows:

Summary

The Tiling Agents problem (aka reflective consistency) consists of analysing when one agent (the "predecessor") will choose to deliberately modify another agent (the "successor"). Usually, the predecessor and successor are imagined as the same agent across time, so we are studying self-modification. A set of properties "tiles" if those properties, when present in both predecessor and successor, guarantee that any self-modifications will avoid changing those properties.

You can think of this as the question of when agents will preserve certain desirable properties (such as safety-relevant properties) when given the opportunity to self-modify. Another way to think about it is the slightly broader question: when can one intelligence trust another? The bottleneck for avoiding harmful self-modifications is self-trust; so getting tiling results is mainly a matter of finding conditions for trust.

The search for tiling results has four main motivations:

AI-AI tiling, for the purpose of finding conditions under which AI systems will want to preserve safety-relevant properties.
Human-AI tiling, for the purpose of understanding when we can justifiably trust AI systems.
AI-Human tiling, as a model of corrigibility. (Will AIs choose to interfere with human decisions?)
Tiling as a consistency constraint on decision theories, for the purpose of studying rationality.

These four application areas have a large overlap, and all four seem important.

The non-summary

Motivation

In the big picture, tiling seems like perhaps the single most well-motivated approach to theoretical AI safety: it allows us to directly formally address the question of how humans can justifiably trust AIs. However, extremely few people are directly working on this approach. Indeed, out of all the people who have worked on tiling in the past, the only person I’m currently aware of who continues to work on this is myself.

I think part of this is about timelines. Tiling results remain very theoretical. I am optimistic that significant improvements to tiling results are possible with some focused attention, but I think there is a long way to go before tiling results will be realistic enough to offer concrete advice which helps with the real problems we face. (At least, a long way to go to yield “non-obvious” advice.)

However, it still seems to me like more people should be working on this approach overall. There are many unpublished results and ideas, so my aim with this project is to get some people up to speed and in a position to make progress.

Tiling Overview

The basic idea of tiling is that an agent architecture is “no good” in some sense if we can show that an agent designed according to that architecture would self-modify to something which does not match the architecture, given the choice. This is a highly fruitful sort of coherence requirement to impose, in the sense that it rules out a lot of proposals, including most existing decision theories.
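As a toy illustration of this criterion (not a tiling theorem, and not a formalism taken from the tiling literature), the sketch below lets an agent pick its successor from a small menu and asks whether the chosen successor still has the property we care about. The names `Design`, `capability_only`, and `safety_first` are hypothetical, and the two-element menu is an assumption made purely for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Design:
    """A candidate successor design (hypothetical toy representation)."""
    name: str
    safe: bool          # the property we hope will tile
    capability: float   # what a naive agent cares about


Scorer = Callable[[Design], float]


def chosen_successor(score: Scorer, menu: List[Design]) -> Design:
    """The successor an agent with this scoring rule would self-modify into."""
    return max(menu, key=score)


def tiles(score: Scorer, menu: List[Design]) -> bool:
    """True if the agent's chosen successor keeps the 'safe' property."""
    return chosen_successor(score, menu).safe


MENU = [
    Design("keep_constraint", safe=True, capability=1.0),
    Design("drop_constraint", safe=False, capability=2.0),
]


def capability_only(d: Design) -> float:
    # Architecture A: ranks successors by capability alone.
    return d.capability


def safety_first(d: Design) -> float:
    # Architecture B: refuses any successor that lacks the property.
    return d.capability if d.safe else float("-inf")


naive_tiles = tiles(capability_only, MENU)
guarded_tiles = tiles(safety_first, MENU)
print(naive_tiles, guarded_tiles)
```

In this toy, architecture A fails the criterion because it would drop its constraint given the choice, while architecture B passes. A real tiling result would have to establish this for every menu and environment in some class, and under uncertainty, rather than for one hand-picked example.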

One motivation for this criterion is for building robust AI systems: if an AI system would remove its safety constraints given the chance, then we are effectively in a fight with it (we have to avoid giving it a chance). Hence, even if we don’t plan to allow the AI to self-modify, it seems wise to build safety precautions in a way which we can show to be self-preserving. This perspective on tiling focuses on AI-AI tiling.

A broader motivation is to study the structure of trust. AI alignment is, from this perspective, the study of how to build trustworthy systems. Tiling is the study of how one mindlike thing can trust another mindlike thing. If we can make tiling theorems sufficiently realistic, then we can derive principles of trust which can provide guidance about how to create trustworthy systems in the real world. This perspective focuses on human-AI tiling.

A third motivation imposes tiling as a constraint on decision theories, as a way of trying to understand rationality better, in the hopes of subsequently using the resulting decision theory to guide our decision-making (eg, with respect to AI risk). Tiling appears to be a rather severe requirement for decision theories, so (provided one is convinced that tiling is an important consistency requirement) it weeds out a lot of bad answers that might otherwise seem good. Novel decision-theoretic insights might point to crucial considerations which would otherwise have been missed.

Reflective Oracles

The first big obstacle to a theorem showing when Bayesian-style agents can be reflectively stable was the difficulty of Bayesian agents representing themselves at all. Tiling was first studied in the setting of formal logic, because logic has thoroughly researched tools for self-reference. Probability theory lacked such tools.

MIRI research carried out by Jessica Taylor, Paul Christiano, and Benja Fallenstein gave rise to the necessary tools to represent Bayesian agents capable of having hypotheses about themselves and other agents similar to themselves: Reflective Oracles (and very similarly, Paul’s reflective probability distribution).

However, these conceptual tools solve the problem by specifying a computational complexity class in which a sort of “prophecy” is possible: computations can look ahead to the output of any computation, including themselves. We don’t live in a universe in which this flavor of computation appears feasible.

As a consequence, tiling results based on Reflective Oracles would not solve the “Vingean Reflection” problem: they do not show us how it is possible to trust another agent without being able to reason about that agent’s future actions in detail. It is easy to trust someone if you can anticipate everything they will do with precision. Unfortunately, that is not at all realistic.

Vingean Reflection is a crucial desideratum to keep in mind when looking for tiling results.

Logical Uncertainty

We call the more realistic domain, where an agent does not have the ability to run arbitrary computations to completion, “logical uncertainty” (although in hindsight, “computational uncertainty” would have been a clearer name).

MIRI’s Logical Induction (primarily due to Scott G) gives us a mathematical framework which slightly generalizes Bayesianism, to address logical uncertainty in a more suitable way. This gives rise to a “bounded rationality” perspective, where agents are not perfectly rational, but avoid a given class of easily recognizable inconsistencies.

We have a limited tiling result for logical-induction-based decision theory (due to Sam). I hope to significantly improve upon this by building on research into logically uncertain updateless decision theory which Martin Soto and I collaborated on last summer (none of which has been published yet), and also by building on the idea of open-minded updatelessness.

This area contains the clearest open problems and the lowest-hanging fruit for new tiling theorems.

However, I can imagine this area of research getting entirely solved without yet providing significant insight into the human-AI tiling problem (that is, the AI safety problem). My intuition is that it primarily addresses AI-AI tiling, and specifically in the case where the “values” of the AI are entirely pinned down in a strong sense. Therefore, to derive significant insights about AI risks, it seems important to generalise tiling further, including more of the messiness of the real-world problems we face.

Value Uncertainty

Open-minded updatelessness allows us to align with an unknown prior, in the same way that regular updatelessness allows us to align with a known prior.

Specifying agents who are aligned with unknown values is the subject of value learning & Stuart Russell’s alignment program, focusing on assistance games (CIRL).

If we combine the two, we get a general notion of aligning with unknown preference structures. This gives us a highly general decision-theoretic concept which I hope to formally articulate and study over the next year. This is effectively a new type of uncertainty. You could call it meta-uncertainty, although that’s probably a poor name choice since it could point to so many other things. Perhaps “open-minded uncertainty”?

In particular, with traditional uncertainty, we can model an AI which is uncertain about human values and trying to learn them from humans; however, the humans themselves have to know their own values (as is assumed in CIRL). With open-minded uncertainty, I think there will be a much better picture of AIs aligning with humans who are themselves uncertain about their own values. My suspicion is that this will offer a clear solution to the corrigibility problem.

More generally, I think of this as a step on a trajectory toward stripping away traditional decision-theoretic assumptions and creating an “empty” tiling structure which can tile a broad variety of belief structures, value structures, decision theories, etc. If the AI just wants to do “what the humans would want in this situation” then it can conform to whatever policy humans “would want”, even if it is irrational by some definition.

Another important research thread here is how to integrate the insights of Quantilization; softly optimizing imperfect proxy values seems like a critical safety tool. Infrabayesianism appears to offer important theoretical insight into making quantilizer-like strategies tile.
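For readers unfamiliar with quantilization, here is a minimal sketch of the basic idea as it is usually described: instead of taking the action that maximizes a proxy utility, draw an action at random from the top q fraction of a trusted base distribution, ranked by that proxy. The function below is a finite-sample approximation, and its name, signature, and the Gaussian base distribution are assumptions made for illustration. It says nothing about how to make such strategies tile.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    base_samples: Sequence[A],
    proxy_utility: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Pick uniformly from the top-q fraction of base samples by proxy utility.

    q = 1 reproduces the base distribution (over the given samples);
    q close to 0 approaches plain maximization of the proxy.
    """
    if not base_samples:
        raise ValueError("need at least one base sample")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(base_samples, key=proxy_utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])


# Usage: actions are real numbers, the proxy is the identity function.
rng = random.Random(0)
base = [rng.gauss(0.0, 1.0) for _ in range(1000)]
pick = quantilize(base, lambda x: x, q=0.1, rng=rng)
threshold = sorted(base, reverse=True)[99]  # 100th best base sample
print(pick >= threshold)
```

The design choice worth noticing is that the optimization pressure on the proxy is bounded by q: the chosen action is never more than a factor of 1/q more likely than it would be under the base distribution.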

Value Plurality

Here’s where we bring in the concerns of Critch’s negotiable reinforcement learning: an agent aligned to multiple stakeholders whose values may differ from each other. Several directions for moving beyond Critch’s result present themselves:

Making it into a proper ‘tiling’ result by using tools from previous subsections; Critch only shows Pareto-optimality, but we would like to show that we can trust such a system in a deeper sense.
Combining tools outlined in previous subsections to analyze realistic stakeholders who don’t fully know what they want or what they believe, and who are themselves boundedly rational.
Using concepts from bargaining theory and voting theory / social choice theory to go beyond Pareto-optimality and include notions of fairness. In particular, we would like to ensure that the outcome is not catastrophic with respect to the values of any of the stakeholders. We also need to care about how manipulable the values are under ‘strategic voting’ and how this impacts the outcome.
Is there an appealing formalism for multi-stakeholder Quantilization?
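To make concrete why Pareto-optimality alone is a weak guarantee, the sketch below computes the Pareto front of a few hypothetical two-stakeholder outcomes and then filters it by a minimum acceptable utility for every stakeholder. The outcome values and the `floor` threshold are invented for illustration. This is not Critch's formalism, only the standard definition of Pareto dominance.

```python
from typing import List, Tuple

Outcome = Tuple[float, ...]  # one utility per stakeholder


def dominates(a: Outcome, b: Outcome) -> bool:
    """a Pareto-dominates b: no stakeholder is worse off, and one is better off."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(outcomes: List[Outcome]) -> List[Outcome]:
    """Outcomes not dominated by any other outcome in the list."""
    return [o for o in outcomes if not any(dominates(p, o) for p in outcomes)]


def non_catastrophic(outcomes: List[Outcome], floor: float) -> List[Outcome]:
    """Outcomes in which every stakeholder gets at least `floor` utility."""
    return [o for o in outcomes if min(o) >= floor]


# Hypothetical utilities for (stakeholder 1, stakeholder 2).
outcomes = [(10.0, 0.0), (6.0, 6.0), (5.0, 5.0), (0.0, 10.0)]

front = pareto_front(outcomes)
acceptable = non_catastrophic(front, floor=1.0)
print(front)
print(acceptable)
```

Here the outcomes (10, 0) and (0, 10) are Pareto-optimal even though each is as bad as possible for one stakeholder. Only (6, 6) survives the additional floor constraint, which is the kind of fairness condition the third bullet asks for.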

In terms of general approach: I want to take the formalism of logical induction and try to extend it to recognize the “traders” as potentially having their own values, rather than only beliefs. This resembles some of the ideas of shard theory.

Ontology Plurality

Critch’s negotiable RL formalism not only assumes that the beliefs and values of the stakeholders are known; it also assumes that the stakeholders and the AI agent all share a common ontology in which these beliefs and values can be described. To meaningfully address problems such as ontological crisis, we need to move beyond such assumptions, and model “where the ontology comes from” more deeply.

My ideas here are still rather murky and speculative. One thread involves extending the market metaphor behind logical induction, to model new stocks being introduced to the market. Another idea involves modeling the market structure using linear logic.

Cooperation & Coordination

“Value plurality” addresses the many-to-one alignment problem, IE, many stakeholders with one AI serving their interests. To mitigate risk scenarios associated with multipolar futures (futures containing multiple superhuman intelligences), we want to address the more general many-to-many case. This involves seeking some assurance that powerful AIs with differing values will cooperate with each other rather than engage in value-destroying conflict.

The basic picture of multi-agent rationality that we get from game theory appears very pessimistic in this regard. While cooperative equilibria are possible in some situations, those same situations usually support “folk theorems” which show that “almost any” equilibrium is possible, even really bad equilibria where no one gets what they want.

Two research threads which interest me here are Nisan’s higher-order game theory and Payor’s Lemma. Nisan has some unpublished results about cooperation. My hope is that these results can also apply to more realistic formalisms by reinterpreting his hierarchy in terms of iterated policy selection. Meanwhile, Payor’s Lemma offers new hope for getting phenomena similar to Löbian cooperation to be more robust.
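For reference, here is the standard statement of Payor's Lemma next to Löb's Theorem, with the usual short proof. The notation ($\Box$ for provability, $\vdash$ for derivability in a normal modal logic with necessitation and distribution) is an assumption of this sketch rather than something fixed by the text above.

```latex
\textbf{L\"ob's Theorem.} If $\vdash \Box x \to x$, then $\vdash x$.

\textbf{Payor's Lemma.} If $\vdash \Box(\Box x \to x) \to x$, then $\vdash x$.

\textbf{Proof of Payor's Lemma.}
\begin{enumerate}
  \item $\vdash x \to (\Box x \to x)$ \hfill (propositional tautology)
  \item $\vdash \Box x \to \Box(\Box x \to x)$ \hfill (necessitation of 1, then distribution)
  \item $\vdash \Box(\Box x \to x) \to x$ \hfill (hypothesis)
  \item $\vdash \Box x \to x$ \hfill (chaining 2 and 3)
  \item $\vdash \Box(\Box x \to x)$ \hfill (necessitation of 4)
  \item $\vdash x$ \hfill (modus ponens on 3 and 5)
\end{enumerate}
```

Unlike the usual proof of Löb's Theorem, this argument uses only necessitation and distribution. It needs neither a fixed-point construction nor the axiom $\Box x \to \Box\Box x$.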

^[1] I'm still seeking funding, although I have some options I am considering.

Thanks for writing up some of the theory of change for the tiling agents agenda!

I'd be curious on your take on the importance of the Löbian obstacle: I feel like it's important to do this research for aligning full-blown RSI-to-superintelligence, but at the same time it introduces quite some extra difficulty, and I'd be more excited about research (which ultimately aims for pivotal-act level alignment) where we're fine assuming some "fixed meta level" in the learning algorithm but general enough that the object-level AI can get very powerful. It seems to me that this might make it easier to prove/heuristically-argue-for that the AI will end up with some desirable properties.

Relatedly, I feel like on Arbital there were the categories "RSI" and "KANSI", but AFAICT not clearly some third category like "unknown-algorithm non-full-self-improving (UANFSI?) AI". (Where IMO current deep learning clearly fits into the third category, though there might be a lot more out there which would too.) I'm currently working on KANSI AI, but if I weren't I'd be a bit more excited about (formal) UANFSI approaches than full RSI theory, especially since the latter seems to have been tried more. (E.g. I guess I'd classify Vanessa Kosoy's work as UANFSI, but I didn't look much at it yet.) (Also there can still be some self-improvement for UANFSI AIs, but as said there would be some meta level that would be fixed.)

But possibly I strongly misunderstand something (e.g. maybe the Löbian obstacle isn't that central?).

(In any case I think there ought to be multiple people continuing this line of work.)

I have lost interest in the Löbian approach to tiling, because probabilistic tiling results seem like they can be strong enough and with much less suspicious-looking solutions. Expected value maximization is a better way of looking at agentic behavior anyway. Trying to logically prove some safety predicate for all actions seems like a worse paradigm than trying to prove some safety properties for the system overall (including proving that those properties tile under as-reasonable-as-possible assumptions, plus sanity-checking what happens when those assumptions aren't precisely true).

I do think Löb-ish reasoning still seems potentially important for coordination and cooperation, which I expect to feature in important tiling results (if this research program continues to make progress). However, I am optimistic about replacing Löb's Theorem with Payor's Lemma in this context.

I don't completely discount the pivotal-act approach, but I am currently more optimistic about developing safety criteria & designs which could achieve some degree of consensus amongst researchers, and make their way into commercial AI, perhaps through regulation.

How would you become confident that a UANFSI approach was NFSI?

Idk, that could be part of finding heuristic arguments for desirable properties of what a UANFSI converges to. Possibly it's easier to provide probabilistic convergence guarantees for systems that don't do FSI, so this would already give some implicit evidence. But we could also just say that it's fine if FSI happens as long as we have heuristic convergence arguments - like that UANFSI is just allowing for a broader class of algorithms which might make stuff easier - though I mostly don't expect we'd get FSI alignment through this indirect alignment path from UANFSI, but rather that we'd get an NFSI AI if we get some probabilistic convergence guarantees.

(Also I didn't think much about it at all. As said I'm trying KANSI for now.)

I think there are some deeper insights around inner optimization that you are missing that would make you more pessimistic here. "Unknown Algorithm" to me means that we don't know how to rule out the possibility of inner agents which have opinions about recursive self-improvement. Part of it is that we can't just think about what it "converges to" (convergence time will be too long for interesting learning systems).

Hm interesting. I mean I'd imagine that if we get good heuristic guarantees for a system it would basically mean that all the not-perfectly-aligned subsystems/subsearches are limited and contained enough that they won't be able to engage in RSI. But maybe I misunderstand your point? (Like maybe you have specific reason to believe that it would be very hard to predict reliably that a subsystem is contained enough to not engage in RSI or so?)

(I think inner alignment is very hard and humans are currently not (nearly?) competent enough to figure out how to set up training setups within two decades. Like for being able to get good heuristic guarantees I think we'd need to at least figure out something sorta like the steering subsystem which tries to align the human brain, only better because it's not good enough for smart humans I'd say. (Though Steven Byrnes' agenda is perhaps a UANFSI approach that might have sorta a shot because it might open up possibilities of studying in more detail how values form in humans. Though it's a central example of what I was imagining when I coined the term.))

How bottlenecked is your agenda by philosophy skills (like being good at thought experiments for deriving stuff like UDT, or like being good at figuring out the right ontology for thinking about systems or problems) vs math skill vs other stuff?

I think I've articulated a number of concrete subgoals that require less philosophical skill (they can be approached as math problems). However, in the big picture, novel tiling theorems require novel ideas. This requires philosophical skill.