These all seem to be pointing to different aspects of the same problem.

  • Cross-ontology goal translation: given a utility function over a latent variable in one model, find an equivalent utility function over latent variables in another model with a different ontology. One subquestion here is how the first model’s input data channels and action variables correspond to the other model’s input data channels and action variables - after all, the two may not be “in” the same universe at all, or they may represent entirely separate agents in the same universe who may or may not know of each other's existence.
  • Correspondence theorems: quantum mechanics should reduce to classical mechanics in places where classical worked well, special relativity should reduce to Galilean relativity in places where Galilean worked well, etc. As we move to new models with new ontologies, when and how should the structure of the old models be reproduced?
  • The indexing problem: I have some system containing three key variables A, B, and C. I hire someone to study these variables, and after considerable effort they report that X is 2.438. Apparently they are using different naming conventions! What is this variable X? Is it A? B? C? Something else entirely? Where does their X fit in my model? (A toy matching sketch follows this list.)
  • How do different people ever manage to point to the same thing with the same word in the first place? Clearly the word “tree” is not a data structure representing the concept of a tree; it’s just a pointer. What’s the data structure? What’s its type signature? Similarly, when I point to a particular tree, what’s the data structure for the concept of that particular tree? How does the “pointer” aspect of these data structures work?
  • When two people are using different words for the same thing, how do they figure that out? What about the same word for different things?
  • I see a photograph of a distinctive building, and wonder “Where is this?”. I have some data - i.e. I see the distinctive building - but I don’t know where in the world the data came from, so I don’t know where in my world-model to perform an update. Presumably I need to start building a little side-model of “wherever this picture was taken”, and then patch that side-model into my main world model once I figure out “where it goes”.
  • Distributed models and learning: a bunch of different agents study different (but partially overlapping) subsystems of a system - e.g. biologists study different subsystems of a bacterium. Sometimes the agents end up using different names or even entirely different ontologies - e.g. some parts of a biological cell require thinking about spatial diffusion, while some just require overall chemical concentrations. How do we combine submodels from different agents, different ontologies, and different data? How can we write algorithms which learn large model structures by stitching together small structures, each learned independently from different subsystems/data?
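To make the indexing problem concrete, here’s a minimal sketch (all numbers and the matching criterion are made up for illustration): given our own samples of A, B, and C and the collaborator’s samples of their mystery variable X, guess the correspondence by comparing empirical distributions. A real solution would also need to match relationships between variables, not just marginals.

```python
# Toy sketch of the indexing problem: which of our variables is their "X"?
# Everything here (distributions, sample sizes, the Wasserstein criterion)
# is an illustrative assumption, not a canonical method.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Our samples of the system's key variables (stand-ins for a real model).
our_samples = {
    "A": rng.normal(loc=0.0, scale=1.0, size=1000),
    "B": rng.normal(loc=2.4, scale=0.3, size=1000),
    "C": rng.exponential(scale=1.5, size=1000),
}

# The collaborator's samples of their variable X, with unknown correspondence.
their_x = rng.normal(loc=2.438, scale=0.3, size=800)

# Guess: X is whichever of our variables has the closest empirical distribution.
distances = {name: wasserstein_distance(their_x, samples)
             for name, samples in our_samples.items()}
best_match = min(distances, key=distances.get)
print(f"X most plausibly corresponds to {best_match}; distances: {distances}")
```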

Abstraction plays a role in these, but it’s not the whole story. It tells us how high-level concepts relate to low-level ones, and why very different cognitive architectures would lead to surprisingly similar abstractions (e.g. neural nets learning concepts similar to humans’). If we can ground two sets of high-level abstractions in the same low-level world, then abstraction can help us map from one set of high-level concepts down to the low level and back up to the other set. But if two neural networks are trained on different data, and possibly even different kinds of data (like infrared vs visual spectrum photos), then we need a pretty detailed outside model of the shared low-level world in order to map between them.
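As a toy picture of that “ground both in the same low-level world” route, here’s a sketch under strong assumptions: pretend we already have both models’ abstraction functions and can sample shared low-level states, then learn a direct map between the two sets of high-level variables. The functions and data below are stand-ins invented for illustration.

```python
# Sketch: translate between two high-level ontologies by routing through a
# shared low-level world. abstraction_a / abstraction_b are hypothetical
# stand-ins for each model's abstraction function.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

def sample_low_level(n):
    # Toy "low-level world": a handful of real-valued state variables.
    return rng.normal(size=(n, 6))

def abstraction_a(states):
    # Model A's high-level variable: the mean of the first three coordinates.
    return states[:, :3].mean(axis=1, keepdims=True)

def abstraction_b(states):
    # Model B's high-level variable: a rescaled, shifted sum of the same coordinates.
    return 2.0 * states[:, :3].sum(axis=1, keepdims=True) + 1.0

states = sample_low_level(5000)
high_a, high_b = abstraction_a(states), abstraction_b(states)

# The learned correspondence between the two ontologies.
a_to_b = LinearRegression().fit(high_a, high_b)
print("R^2 of the A -> B translation:", a_to_b.score(high_a, high_b))
```

The catch, of course, is that this route needs a shared low-level model and both abstraction functions, which is exactly what the next paragraph notes humans seem to get by without.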

Humans do not seem to need a shared low-level world model in order to pass concepts around from human to human. Things should ultimately be groundable in abstraction from the low level, but it seems like we shouldn’t need a detailed low-level model in order to translate between ontologies.

In some sense, this looks like Ye Olde Symbol Grounding Problem. I do not know of any existing work on that subject which would be useful for something like “given a utility function over a latent variable in one model, find an equivalent utility function over latent variables in another model”, but if anybody knows of anything promising then let me know.

Not Just Easy Mode

After poking at these problems a bit, they usually seem to have an “easy version” in which we fix a particular Cartesian boundary.

In the utility function translation problem, it’s much easier if we declare that both models use the same Cartesian boundary - i.e. same input/output channels. Then it’s just a matter of looking for functional isomorphism between latent variable distributions.
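A minimal sketch of what that easy-mode search could look like, using tiny made-up discrete models: both models share the same input/output channels, so we just look for a relabeling of one model’s latent states under which the two conditional distributions coincide.

```python
# Easy mode with a shared Cartesian boundary: find a relabeling of model 2's
# latent states that makes its distribution match model 1's. The probability
# tables are toy numbers chosen for illustration.
import itertools
import numpy as np

# Each model: P(latent, output | input), indexed [input, latent, output].
p_model_1 = np.array([
    [[0.7, 0.1], [0.1, 0.1]],   # input = 0
    [[0.2, 0.2], [0.1, 0.5]],   # input = 1
])
# Model 2 happens to be model 1 with its latent state labels swapped.
p_model_2 = p_model_1[:, ::-1, :]

def find_latent_correspondence(p1, p2, atol=1e-9):
    """Search over relabelings of p2's latent states for one matching p1."""
    n_latent = p1.shape[1]
    for perm in itertools.permutations(range(n_latent)):
        if np.allclose(p1, p2[:, list(perm), :], atol=atol):
            return perm
    return None

print(find_latent_correspondence(p_model_1, p_model_2))  # -> (1, 0)
```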

For correspondence theorems, it’s much easier if we declare that all models predict exactly the same data, or at least the same observable distribution. Again, the problem roughly reduces to functional isomorphism.

Similarly with distributed models/learning: if a bunch of agents build their own models of the same data, then there are obvious (if sometimes hacky) ways to stitch them together. But what happens when they’re looking at different data on different variables, and one agent’s inferred latent variable may be another agent’s observable?
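For concreteness, here’s roughly what the easy-mode stitching looks like in the simplest case, with toy numbers and an explicit conditional-independence assumption doing the real work: two agents model overlapping variable sets {A, B} and {B, C}, agree on what the shared variable B means, and their submodels compose into one joint model.

```python
# Easy-mode stitching of two overlapping submodels. Assumes both agents use
# the same name and meaning for B, and that A and C are conditionally
# independent given B. All probability tables are made-up toy numbers.
import numpy as np

# Agent 1's submodel: P(A, B), indexed [a, b].
p_ab = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Agent 2's submodel: P(B, C), indexed [b, c].
p_bc = np.array([[0.25, 0.25],
                 [0.10, 0.40]])

# Consistency check on the shared variable: both agents should agree on P(B).
assert np.allclose(p_ab.sum(axis=0), p_bc.sum(axis=1)), "agents disagree on P(B)"

# Stitch: P(A, B, C) = P(A, B) * P(C | B).
p_c_given_b = p_bc / p_bc.sum(axis=1, keepdims=True)
p_abc = p_ab[:, :, None] * p_c_given_b[None, :, :]
assert np.isclose(p_abc.sum(), 1.0)
print(p_abc)
```

Even this hacky version leans on the agents sharing the variable B’s name and meaning; the hard case is when that shared interface has to be discovered rather than assumed.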

The point here is that I don’t just want to solve these problems on easy mode, although I do think some insights from the Cartesian version of the problem might carry over to the more general version.

Once we open the door to models with different Cartesian boundaries in the same underlying world, things get a lot messier. To translate a variable from model A into the space of model B, we need to “locate” model B’s boundary in model A, or locate model A’s boundary in model B, or locate both in some outside model. That’s the really interesting part of the problem: how do we tell when two separate agents are pointing to the same thing? And how does this whole “pointing” thing work to begin with?

Motivation

I’ve been poking around the edges of this problem for about a month, with things like correspondence theorems and seeing how some simple approaches to cross-ontology translation break. Something in this cluster is likely to be my next large project.

Why this problem?

From an Alignment as Translation viewpoint, this seems like exactly the right problem to make progress on alignment specifically (as opposed to embedded agency in general, or AI in general). To the extent that the “hard part” of alignment is translating from human concept-space to some AI’s concept-space, this problem directly tackles the bottleneck. Also closely related is the problem of an AI building a goal into a successor AI - though that’s probably somewhat easier, since the internal structure of an AI is easier to directly probe than a human brain.

Work on cross-ontology transport is also likely to yield key tools for agency theory more generally. I can already do some neat things with embedded world models using the tools of abstraction, but it feels like I’m missing data structures to properly represent certain pieces - in particular, data structures for the “interface” where a model touches the world (or where a self-embedded model touches itself). The indexing problem is one example of this. I think those interface-data-structures are the main key to solving this whole cluster of problems.

Finally, this problem has a lot of potential for relatively-short-term applications, which makes it easier to build a feedback cycle. I could imagine identifying concept-embeddings by hand or by ad-hoc tricks in one neural network or probabilistic model, then using ontology translation tools to transport those concept-embeddings into new networks or models. I could even imagine whole “concept libraries”, able to import pre-identified concepts into newly trained models. This would give us a lot of data on how robust identified abstract concepts are in practice. We could even run stress tests, transporting concepts from model to model to model in a game of telephone, to see how well they hold up.
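Here is a speculative sketch of what a single concept-transport step might look like, under heavy simplifying assumptions: the “networks” below are stand-in random feature maps rather than trained models, the concept is a linear probe, and transport is a linear stitching map fit on activations from shared inputs. None of this is an established pipeline; it’s just to make the vision concrete.

```python
# Hypothetical concept-transport step: identify a concept as a linear probe in
# model 1, then reuse it in model 2 via a learned stitching map between the
# two activation spaces. The "models" here are toy random feature maps.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(2)

# Shared inputs and a ground-truth "concept" label (here: is feature 0 positive?).
inputs = rng.normal(size=(2000, 10))
concept = (inputs[:, 0] > 0).astype(int)

# Stand-ins for two networks' internal activations on the same inputs.
w1, w2 = rng.normal(size=(10, 32)), rng.normal(size=(10, 32))
acts_1, acts_2 = np.tanh(inputs @ w1), np.tanh(inputs @ w2)

# Identify the concept in model 1: a linear probe on its activations.
probe_1 = LogisticRegression(max_iter=1000).fit(acts_1, concept)

# Fit a stitching map from model 2's activation space into model 1's, then
# reuse model 1's probe on the mapped activations: a transported concept.
stitch = Ridge(alpha=1.0).fit(acts_2, acts_1)
transported_preds = probe_1.predict(stitch.predict(acts_2))
print("transported-probe accuracy:", (transported_preds == concept).mean())
```

Chaining this step across several models, and tracking how the accuracy decays, would be the “game of telephone” stress test.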

Anyway, that’s one potential vision. For now, I’m still figuring out the problem framing. Really, the reason I’m looking at this problem is that I keep running into it as a bottleneck to other, not-obviously-similar problems, which makes me think that this is the limiting constraint on a broad class of problems I want to solve. So, over time I expect to notice additional possibilities which a solution would unblock.

Comments

Interesting. I can't recall if I commented about this on the alignment-as-translation post, but I think this is in fact the key thing standing in the way of addressing alignment; I put together a formal model that identified this as the problem, i.e. how do you ensure that two minds agree about a preference ordering, or really even about the statements being ordered.

Clearly the word “tree” is not a data structure representing the concept of a tree; it’s just a pointer. What’s the data structure?

I have some thoughts, but if they're right, then this would be getting into the domain of "detailed AGI algorithm design", which I don't think is productive to share given the state of the world vis-à-vis AGI prep; and if they're wrong (more likely anyway), there's similarly no point in sharing them.

I was not thinking about it before reading this comment, but even partial solutions to the problem in this post would probably advance both capabilities and safety. My first impression is that it helps build capability in a way that ensures more alignment, so it might be a net positive for alignment and safety. But that wouldn't necessarily hold if we also care about the misuse of aligned AI (which we probably should).

As always, nice post. The problem does indeed seem central to many applications of abstraction, especially assuming, as you do, that alignment reduces to translation between our ontology and the AI's ontology.

I especially like this summary/main takeaway:

Things should ultimately be groundable in abstraction from the low level, but it seems like we shouldn’t need a detailed low-level model in order to translate between ontologies.

Also, reading this, it seems like you consider abstraction solved (you write about this being your next project). Is that the case, or are you just changing problems for a while to keep things fresh?

At this point, I think that I personally have enough evidence to be reasonably sure that I understand abstraction well enough that it's not a conceptual bottleneck. There are still many angles to pursue - I still don't have efficient abstraction learning algorithms, there are probably good ways to generalize it, and of course there's empirical work. I also do not think that other people have enough evidence that they should believe me at this point, when I claim to understand it well enough. (In general, if someone makes a claim and backs it up by citing X, then I should assign the claim lower credence than if I stumbled on X organically, because the claimant may have found X via motivated search. This leads to an asymmetry: sometimes I believe a thing, but I do not think that my claim of the thing should be sufficient to convince others, because others do not have visibility into my search process. Also, I just haven't clearly written up every little piece of evidence.)

Anyway, when I consider what barriers are left - assuming my current model of abstraction, and of how it plays with the world, is (close enough to) correct - the problems in the OP are the biggest. One of the main qualitative takeaways from the abstraction project is that clean cross-model correspondences probably do exist surprisingly often (a prediction which neural network interpretability work has confirmed to some degree). But that's an answer to a question I don't know how to properly set up yet, and the details of the question itself seem important. What criteria do we want these correspondences to satisfy? What criteria does the abstraction picture predict they satisfy in practice? What criteria do they actually satisfy in practice? I don't know yet.