When we're trying to do AI alignment, we're often studying systems which don't yet exist. This is a pretty weird epistemic activity, and seems really hard to get right. This post offers one frame for thinking about what we're actually doing when we're thinking about AI alignment: using parts of the space of maps to reason about parts of the space of intelligent systems.
In this post, we:
Introduce a simple model of the epistemic situation, and
Share some desiderata for maps useful for alignment.
We hope that the content is mostly the second kind of obvious: obvious once you see things in this way, which you maybe already do. In our experience, this comes with a risk: reading too fast, you may miss most of the nuance and useful insight the deceptively simple model brings, or come away with a version of the model which is rounded off to something less useful (e.g. "yeah, there is this map and territory distinction").
As a meta-recommendation, we suggest reading this post slowly, and ideally immediately trying to apply the model to some confusion or disagreement about AI alignment.
The space of systems and the space of maps
Imagine the space of possible intelligent systems:
Two things seem especially important about this space:
It’s very large; much larger than the space of current systems.[1]
We don’t get direct epistemic access to it.
This is obviously true of systems which don’t currently exist.
In a weaker sense, it also seems true of systems which do exist. Even when we get to directly interact with a system:[2]
Our thinking about these parts of the space is still filtered through our past experiences, priors, predictive models, cultural biases, theories…
We often don’t understand the emergent complexity of the systems in question.
If we don’t get direct epistemic access to the space of systems, what are we doing when we reason about it?
Let’s imagine a second space, this time a space of “maps”:
The space of maps is an abstract representation of all the possible “maps” that can be constructed about the space of intelligent systems. The maps are ways of thinking about (parts of) the space of systems. For example:
Replicable descriptions of how a machine learning model works and was trained are a way of thinking about that model (a point in the space of intelligent systems).
An ethnographic study of a particular human community is a way of thinking about that community (another point in the space of systems).
The theory of evolution is a way of thinking about evolved creatures, including intelligent ones.
Expected utility theory is a way of thinking about some part of the space which may or may not include future AI systems (see the toy sketch after this list).
Historical analysis of trends in technological development is a way of thinking about whichever parts of the space of intelligent systems are governed by similar dynamics to those governing past technological developments.
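To make one of these maps concrete: expected utility theory models a system as choosing whichever action maximises probability-weighted utility over outcomes. The sketch below is only illustrative; the actions, probabilities, and utilities are invented, and any given part of the space of systems may or may not be well described by this map.

```python
# Toy illustration of the expected-utility "map": model a system as picking the
# action with the highest probability-weighted utility over possible outcomes.
# All names and numbers below are invented purely for illustration.

outcomes = {
    "act_cautiously": [(0.9, 1.0), (0.1, -1.0)],   # list of (probability, utility) pairs
    "act_boldly":     [(0.5, 3.0), (0.5, -4.0)],
}

def expected_utility(action):
    return sum(p * u for p, u in outcomes[action])

best = max(outcomes, key=expected_utility)
print(best, {a: round(expected_utility(a), 2) for a in outcomes})
# On this toy map, "act_cautiously" (EU 0.8) beats "act_boldly" (EU -0.5).
```

The point is not the numbers, but that the map makes definite assumptions (a fixed option set, known probabilities, a single utility function) which may or may not hold for the part of the space of systems we care about.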
When we’re reasoning about intelligent systems, we’re using some part of the space of maps to think about some part of the space of intelligent systems:[3]
Different maps correspond to different regions of the space of intelligent systems.
Of course, thinking in terms of the space of systems and the space of maps is a simplification. Some of the ways that reality is more complicated:
The space of systems looks different on different maps.
Maps can affect which parts of the space of systems actually get developed.[4]
Maps are themselves embedded in the space of systems.
Which maps and systems actually exist at a given time is evolving and dynamic.
AI will play a big role in both the space of maps and the space of systems.
We think that the space of systems and the space of maps is a useful simplification which helps us to think more clearly about future AI systems. Some salient examples of how this simplification can help:
Disagreements are often driven by using different maps, or talking about different parts of the space of systems.
Thinking about the distinction and the interplay between the space of maps and the space of systems makes it more obvious that our actions and research directions influence which systems end up getting built, which seems like an important strategic consideration.
What sorts of maps do we need for AI alignment?
When it comes to AI alignment, we need accurate maps which hold for systems which don’t exist yet, and which are good enough to help us build these systems in ways that are safe.
There are a few different properties it would be good for these maps to have:
Generality/robustness: maps which cover large parts of the space of systems.
Precision: maps which are very detailed.
Accuracy: maps which actually correspond well to the parts of the space of systems which they are mapping.
Usefulness: maps which help us to chart paths toward states we want.
Probably other things too.
And there are trade-offs here between the properties. For example:
It would be great to have very precise maps of advanced AI systems in particular, but this seems hard to do robustly.
Some theories are very general and cover very large parts of the space of systems (e.g. information theory), but alone these theories don’t tell us much about how to chart paths towards states we want.
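As a concrete illustration of the generality end of this trade-off, consider the standard Shannon entropy formula (quoted here from textbook information theory, purely as an example):

$$H(X) = -\sum_{x} p(x)\,\log p(x)$$

It is defined for essentially any system we can describe probabilistically, which is what makes information theory such a general map; but nothing in the formula tells us which states are desirable or how to reach them.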
Finding maps which are useful for AI alignment
A lot of AI alignment work involves taking maps that have been developed for thinking about one part of the space of systems, and applying them to a part of the space of systems that we hope includes “potentially dangerous future AI systems”. For example:
Experimental work often involves developing research methods for looking at existing AI systems that will (hopefully) scale to future AI systems.
Decision theory is a region in the space of maps that was built to model a (heavily idealised) human decision maker. Many of its ideas have been applied to possible future AI systems.
“Convergence” comes from evolutionary biology, a set of maps built to think about biological systems. In a future post, we’ll try to apply these ideas to AI systems.
Being aware of which maps you are using and their potential limitations for the systems you want to study seems super useful for doing good research.[5]
We don’t know that much about where in the space of systems potentially dangerous AI will be. As a result, one good bet seems to be to try to find maps that are general enough to cover everywhere in the space of systems that future AI could be.
One way of making general maps is to try to decontextualise / generalise existing maps, by unpicking which of their features are specific to (~contingent on) the context they were built for, and which could generalise beyond it.
Another is to start with maps that are already pretty general (whilst still being accurate, confirmed by experiment, and falsifiable). This is one of the reasons why we are excited about active inference.
Given that we care about aligning AI to humans and human collectives, it also seems useful for maps to cover these areas of system-space as well (or more specifically, to cover relations between the human part of system space and the “possible future AI systems” part of system space).
Finding general maps isn’t the only promising approach here:
Finding precise maps can also be a useful tactic in some contexts.
As there’s a trade-off between generality and precision, it’s important to try to identify what features general maps are likely to miss, and what work is needed to fill in these (contingent) features.
Another strategy, which seems at the core of all existing sensible approaches to alignment, is to try to skilfully combine insights from multiple maps.[6]
The ideas in this post come variously from Jan, Nora and Clem (some ideas come from one person; others were independently generated by multiple people) or from an older FHI project on AGI epistemics done by Jan with Chris van Merwijk and Ondřej Bajgar. Rose did most of the writing.
[1] See also Design space of minds in general.
[2] See this post for another discussion of this sort of epistemic challenge.
[3] This post implicitly argues something similar. Visualising the space of AI systems here is also related.
[4] Other ways of saying this: some maps are design paradigms/blueprints. This post draws a distinction between maps (for understanding reality) and blueprints (for building new parts of reality). The way we’re using ‘maps’ here is broader and contains both of those kinds of map.
[5] Cf. Adam Shimi on Epistemological Vigilance.
[6] Cf. Adam Shimi on pluralism and “no one-size-fits-all epistemic strategy”.