
I'm especially interested in the analogy between AI alignment and democracy. (I guess this goes under "Social Structures and Institutions".) Democracy is supposed to align a superhuman entity with the will of the people, but there are a lot of failures, closely analogous to well-known AI alignment issues:

politicians optimize for the approval of low-information voters, rather than truly optimizing the people's wellbeing (deceptive alignment)
politicians, PACs, parties, and permanent bureaucrats are agents with their own goals that don't align with those of the populace (mesa optimizers)

I think it's more likely that insights will transfer from the field of AI alignment to the field of government design than vice versa. Easier to do experiments on the AI side, and clearer thinkers.

Koen.Holtman (4y)

I'm especially interested in the analogy between AI alignment and democracy.

This is indeed a productive analogy. Sadly, on this forum, this analogy is used in 99% of the cases to generate AI alignment failure mode stories, whereas I am much more interested in using it to generate useful ideas about AI safety mechanisms.

You may be interested in my recent paper 'demanding and designing', just announced here, where I show how to do the useful idea generating thing. I transfer some insights about aligning powerful governments and companies to the problem of aligning powerful AI.


Introduction

How would you go about scientifically studying aliens? Arik Kershenbaum’s The Zoologist’s Guide to the Galaxy proposes using evolutionary thinking to uncover constraints on how alien species could evolve. One of his most interesting points is that evolution constrains function far more than form, because function depends significantly less on the details of the environment. Hence we should expect crisper answers to “How would aliens behave?” than to “What would aliens look like?”. And over the course of his book, he gives the best answer he can find to the former question.

So when confronted with the question of how to study something he couldn’t gather data on, Kershenbaum leveraged analogies to biological systems he could and had studied, and the underlying constraints brought on by the mechanisms of natural selection.

On a completely unrelated note, the new summer fellowship Principles of Intelligent Behavior in Biological and Social Systems (PIBBSS) (funded by the LTFF) aims at creating valuable AI alignment research through studying analogies to many complex systems (evolution, brains, language, social structures…). Fellows will have graduate research experience in fields studying such systems, working on a concrete alignment project in collaboration with an established alignment researcher. The fellowship will run during all of Summer 2022.

The point of this post is to introduce this fellowship, explain the reasoning behind it and give more concrete details about how it will go. Note that I’m not an organizer of this fellowship, I’m just assisting with the writing of this post; credits for the ideas and arguments should go to Nora Ammann and TJ, the organizers of the fellowship.

Analogies as General Epistemic Strategies for Alignment

As I’ve written elsewhere, alignment cannot directly leverage most epistemic strategies and approaches used in Science and Engineering, because it’s about solving a problem that doesn’t exist yet on a technology we still have to invent.

One epistemic strategy that survives this major problem is the leveraging of analogies with existing biological or social systems that implement complex or intelligent behavior. Consider how a wide variety of such systems (from biology, physics, linguistics and other fields) exhibit similar properties: adaptation, robustness, goal-directed behavior, learning, embeddedness, modularity, phase transitions, and more. Since AI research focuses on mechanisms that lead to complex and intelligent behavior with many of these properties, careful analogies with these complex systems may allow us to transfer knowledge about all these behaviors and properties to the study of alignment and AGI.

Also note that the other main epistemic strategy used in alignment, figuring things out from first principles, can and often does take inspiration from other existing systems like evolution, brains and languages.

Examples of Successful Analogies in Alignment

If analogical thinking is such a valuable epistemic approach to alignment, we should find ample examples of valuable alignment research using such analogies. And that’s indeed what we see.

Risks from Learned Optimization by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant explicitly uses biological evolution of humans as an example of/inspiration for mesa-optimizers. Evolution here is the search process, with a base-objective of increasing fitness and propagation of genes, but the learned models (humans) end up doing search for different goals.
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) by Andrew Critch explicitly builds on the concept of fields from sociology, and the structure and agency debate in the social sciences about the place of social structure vs personal choices in human behavior.
Value loading in the human brain: a worked example by Steve Byrnes is one of many, many posts where Steve uses neuroscience as a source of analogies for alignment. Note that here the analogy is more literal and does more work than in the previous examples, since Steve thinks that brain-like AGI might be the first kind to be created in the near future.
Public Static: What is Abstraction? by John Wentworth (and really all his work on abstraction) uses many examples (transistors and logic gates, statistical mechanics and temperature, maps and city streets) to direct and explain his deconfusion of abstraction. This is a general trend, as John’s method starts with finding many different examples from distinct fields and thinking about what they have in common.

Of course, none of these works completely exploit the analogy, nor do they encompass all analogies relevant for alignment. The previous list just serves to illustrate that analogical thinking is an integral part of many examples of current alignment research.

The Problem: Difficulty of Epistemic Translation

If analogies to other systems already abound in alignment, what is the point of the fellowship?

Here, the concept of epistemic translation, as discussed by Nora Ammann, might give us a better idea of what it takes to make fruitful analogies. Fundamentally, linking system A with system B requires the creation of a translation between the two, a bridge faithful enough to let us transform insights about system B into ones about system A (for example, the analogy between magic and pizza fails this condition).

Exploiting analogies faithfully thus involves:

Understanding the other field in detail
Understanding alignment in detail
Being able to see the powerful concepts and ideas of the other fields that share similarities with alignment and AGI
Thinking through the analogy in enough detail to check that it holds and can tell us something significant about alignment
Understanding where and how a given analogy “breaks” and what that can tell us about alignment

Currently, alignment researchers are supposed to do all of that, including building deep expertise in the field they’re drawing from. With the exception of Steve Byrnes, who basically did learn neuroscience for his research, most people don’t have the time to do that. As a consequence, they don’t find all the valuable analogies for their work, or they let them die in a drawer, or, if they do find and explore them, they might do so insufficiently or badly.

On the other hand, most people in fields that are ripe for analogies with alignment don’t know about the latter and don’t have any incentive to work on it. It’s also hard to get up to speed on alignment, especially when coming from a field outside computer science.

So alignment researchers want more varied and detailed analogies, and experts in other fields have the ability to help with these analogies and provide tools for studying the systems in question (and some could be interested in alignment as a challenging problem or a cause), but the current incentives and constraints make it hard for the two sides to interface.

The PIBBSS Fellowship exists to bridge this gap by providing an institutional context for these collaborations.

Proposed Solution: Creating Institutional Context for Collaborations around Analogies

For the fellowship, alignment researchers get to propose projects related to analogies with biological or social systems that they would like to explore. Fellows with expertise in the corresponding field receive funding for the duration of the fellowship (12 weeks in Summer 2022) to collaborate with the alignment researcher on exploring the analogy and what it can bring to alignment.

Which fields are most promising? Fields don’t seem like the right granularity for discussing promising analogies here. The complex systems that serve as the analogues fit better, since each can be studied from different angles by different fields. Recall that for epistemic translation and analogies to be useful, there need to be insights, concepts and epistemic strategies about the analogous complex system to transfer over. So the systems most ripe for this work are those that have been studied enough to accumulate a long tradition of results and appropriate epistemic hygiene, and that are underrepresented in alignment relative to their expected density of insight.

This leads us to a tentative list that currently includes:

Evolution and Ecology
Brains
Languages and Media
Social Structures and Institutions
Engines

That said, the fellowship is open to other complex systems and fields that may have been overlooked so far but share the properties that we care about (i.e. insightfulness of the existing literature and associated community, and relevance to AI alignment). In practice, how promising a given analogy is gets evaluated at the level of specific project proposals more than at the level of entire disciplines.

Lastly, in some cases, the fellowship is open to epistemic transfer towards topic areas that do not fall under AI safety or governance, narrowly construed. Examples include relevant topics on digital and emulated minds, advanced institutional design and collective intelligence, and industrial and scientific automation and progress.

Pre-mortem: what could go wrong?

This is a nice story, but let’s ask ourselves the important question: how could it fail?

First, even if new analogies result from this fellowship, there is a risk that they are shallow, at best useless and at worst confusing. An example of a condemnation of a class of such analogies is Yudkowsky’s criticism of biology-based timelines.

Proposed solution: In part, this will come from having experts in the other field provide enough details to reveal the shallowness of the analogy. And in cases where the core of the issue comes from understanding the mechanisms behind AGI and how it will appear, the alignment researchers involved should be able to catch it eventually. The mentor-fellow pair thus represents the first line of defense against epistemic pollution, and the wider epistemic communities in which they are embedded provide a further source of feedback and scrutiny. All in all, the focus on analogies and their non-shallowness in this fellowship should increase the scrutiny enough to catch most of the shallow proposals.

Another issue comes from the difficulty of distinguishing valuable/deep analogies from useless/shallow ones at a glance, before investing a lot of work in them and potentially wasting time.

Proposed solution:

The fellowship addresses this problem by letting (epistemic) demand drive (epistemic) supply. Concretely, this means that alignment researchers (and not fellows with expertise in other fields) propose projects according to how valuable they expect them to be. Thus the current proxy of the expected value of a given analogy is whether or not a given alignment researcher is sufficiently excited about a project to want to invest time in mentoring it.

The time constraints of the fellowship also privilege exploration, which is the main way to find out about the value of each research direction. Only a small fraction of projects needs to turn into fruitful research agendas to make up for many failed attempts.

Lastly, maybe interdisciplinary research between alignment researchers and experts from other fields is just too hard and fraught with miscommunication to work in most cases.

Proposed solution: Interdisciplinary research is hard, and so is doing good research in general. The purpose of the fellowship is to find out more about these potential issues and solve them as well as possible.

Details of the Fellowship Program

From the website of the fellowship

The fellowship is designed for individuals with graduate-level research experience, or equivalent, in their domain of expertise who are motivated by the mission of making AI systems safe and aligned.
Between June and August 2022, fellows will work on selected projects at the intersection between the fellow’s field of expertise and AI alignment and governance. Each fellow will work in close collaboration with a mentor who will help them facilitate the domain interface.
Fellows, mentors, and selected guests will meet at two multi-day, in-person retreats held in Europe to learn about AI alignment, complex systems, epistemic challenges of interdisciplinary work, and more. Throughout the summer, fellows will benefit from a diverse program consisting of regular talks by external speakers, social events, and personal support sessions with program facilitators.
Fellows will receive a stipend of 3’000 USD per month and are expected to work full-time on their projects over the course of the fellowship, though exceptions may be possible.
Fellows can work from anywhere in the world or participate in a local residency. Any travel costs, within reason, will be reimbursed by the program.

Appendix: Sample of project proposals

The sample of project proposals below is meant to give readers a taste of the types of projects PIBBSS is hoping to facilitate.

Biodiversity and Heterogeneity in Energy Flows

Source Domain: Systems Ecology

Topic Summary:

A commonly discussed puzzle in ecology concerns the latitudinal distribution of biodiversity. A number of scholars have proposed that this is related to metabolism and the amount of energy flowing through the ecosystem. (Brown, James H., Why are there so many species in the tropics? Journal of Biogeography 2013) An additional observation we might make is that in energy-rich ecosystems, such as tropical rainforests, where we encounter higher biodiversity, we also find a large number of organisms engaging in relatively simpler forms of energy consumption. By contrast, in energy-scarce ecosystems, there are fewer species, and several organisms among them exhibit relatively more general intelligence in terms of their ability to source food and energy.

There has been a debate in the last few years regarding whether we should anticipate artificial agents with general intelligence or ecosystems of specialized services. To inform this debate, we want to understand:

Whether the observation about the relationship between specialized vs general energy consumption strategies and the energy-richness of ecosystems generalizes to a wide range of other biological ecosystems, e.g. deserts, alpine areas, marine ecosystems, etc.
Whether existing formal models of the relationship between total energy flow, metabolic rates and biodiversity are helpful in modelling the degree of specialization of energy sourcing strategies at the organism level.
Whether these models teach us something about how the presence of economic incentives and/or compute availability influences whether specialized AI services will be more efficient than integrated agent-like systems.

[h/t Jan Kulveit]

Basins of Robustness in Search Spaces

Source Domain: Evolutionary Biology

Topic Summary:

Within evolutionary theory, there are two approaches to explaining robustness observed in biological systems. The first is that random search is likely to find basins of robustness simply because such basins occupy significant probability mass. The second approach argues that robustness is selected for by evolution as a response to mutations and environmental perturbations. (Wagner, A., Robustness and Evolvability in Living Systems. Princeton University Press 2005).

Better understanding of the relative causal roles played by these phenomena can help us in building better models for the study of robustness and corrigibility in AI.

For example, we may ask:

What are the core confusions and disagreements between these explanatory approaches, and what are conventional justifications used within the evolutionary biology literature to distinguish them?
In the case of the ‘basins of robustness are large’ approach, are there any structural justifications used within the evolutionary biology literature to posit the large-ness of these basins? If not, can we uncover the justifications implicit in the works of biologists and philosophers of biology who have written on this subject?
Finally, can we synthesize the key insights of the field into some formalisms that can help us better model how evolutionary search encounters basins of robust phenotypes? Could such formalisms help us distinguish between different kinds of parametric search trajectories and their distinct likelihoods of encountering such basins? And how can we design training methods of ML systems that can look for robust and corrigible models, using these insights?

[h/t TJ]

Institutional Foundations of Linguistic Innovation

Source Domain: Sociolinguistics

Topic Summary:

Language, its evolution, and its current usage within society might limit which novel concepts can be acquired and become broadly recognized. Participants in a linguistic community experience agency to use language in innovative ways (‘creativity’ in Chomsky 1965), and therefore also exert influence over how linguistic affordances (concepts, vocabularies, etc.) evolve over time. Often such creativity is also built on top of existing morphological and lexical resources (‘productivity’ in Hockett 1958, Bauer 2001). (Also see: Expanding the Lexicon, eds. S Arndt-Lappe et al. 2018)

These forms of linguistic innovation and evolution, however, are also balanced by evolutionary pressures that help maintain reasonable levels of lexical and semantic stability in the language, allowing language to be useful for coordination. The generation, diffusion and autoregulation of linguistic innovation can therefore also be seen as being mediated by cultural and institutional factors. By better understanding the different factors that shape linguistic innovation and evolution, we can both (a) better reflect on the role played by the differential deployment of Large Language Models (LLMs) in the near term, and (b) better understand which of these dynamics can be extrapolated to understand the linguistic and cognitive competencies of future AI systems. Some specific questions of interest might include:

What implications do the different frameworks for studying linguistic evolution have for modeling the rate of evolution and ways to measure it? Can they help model growth rates for linguistic competencies?
Can existing theories of asymmetric agency over linguistic and conceptual affordances (Fricker 2007, Anderson 2012) provide insights, or be generalized, to perform normative analysis of the evolutionary structure?
Are there theories that identify signs of stagnation in linguistic evolution? How do these theories see the relative role of cognitive, cultural and institutional factors? How would they change when dealing with linguistically competent artificial systems?

Social Learning and the Limitations of the RL Framework

Source Domain: Cognitive science

Topic Summary:

Reinforcement learning is the dominant framework at the moment in the psychology and neuroscience of human and animal behavior. In thinking about digital minds, that is convenient because, if it's true that humans and animals are basically reinforcement learners, it follows that artificial RL systems are (in some sense) things of basically the same kind. It also seems to influence thinking about agency and motivation in the alignment space.

An interesting question to us is thus: what are the limitations of the RL framework for explaining human behavior? In particular, there is preliminary evidence (Ho et al., 2017) for a limitation of RL in the area of social learning from evaluative feedback, which would seem particularly relevant to alignment.

[h/t Patrick Butlin]