Consider concepts such as "a vector", "a game-theoretic agent", or "a market". Intuitively, those are "purely theoretical" abstractions: they don't refer to any specific real-world system. Those abstractions would be useful even in universes very different from ours, and reasoning about them doesn't necessarily involve reasoning about our world.

Consider concepts such as "a tree", "my friend Alice", or "human governments". Intuitively, those are "real-world" abstractions. While "a tree" bundles together lots of different trees, and so doesn't refer to any specific tree, it still refers to a specific type of structure found on Earth, and shaped by Earth-in-particular's specific conditions. While tree-like structures can exist in other places in the multiverse, there's an intuitive sense that any such "tree" abstraction would "belong" to the region of the multiverse in which the corresponding trees grow.

Is there a way to formalize this, perhaps in the natural-abstraction framework? To separate the two categories, to find the True Name of "purely theoretical concepts"?


Motivation

Consider a superintelligent agent/optimization process. For it to have disastrous real-world consequences, some component of it would need to reason about the real world. It would need to track where in the world it's embedded, what input-output pathways there are, and how it can exploit these pathways in order to hack out of the proverbial box/cause other undesirable consequences.

If we could remove its ability to think about "unapproved" real-world concepts, and make it model itself as not part of the world, then we'd have something plausibly controllable. We'd be able to pose it well-defined problems (in math and engineering, up to whatever level of detail we can specify without exposing it to the real world – which is plenty) and it'd spit out solutions to them, without ever even thinking about causing real-world consequences. The idea of doing this would be literally outside its hypothesis space!

There are tons of loopholes and open problems here, but I think there's promise too.


Ideas

(I encourage you to think about the topic on your own before reading my attempts.)

 

Take 1: Perhaps this is about "referential closure". For concepts such as "vectors" or "agents", we can easily specify the list of formal axioms that would define the frameworks within which these concepts make sense. For things like "trees", however, we would have to refer to the real world directly: to the network of causes and effects entangled with our senses.
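For concreteness, the entire formal setting in which "vectors" make sense fits in a few lines (the standard vector-space axioms, abbreviated here), while nothing comparably short pins down "a tree" without routing through our particular world:

$$
\begin{aligned}
&\text{A vector space over a field } F \text{ is a set } V \text{ with operations } +\colon V \times V \to V \text{ and } \cdot\colon F \times V \to V \text{ such that,} \\
&\text{for all } u, v, w \in V \text{ and } a, b \in F\colon \\
&\quad (u+v)+w = u+(v+w), \qquad u+v = v+u, \qquad \exists\, 0 \in V\colon v+0 = v, \qquad \exists\, (-v)\colon v+(-v) = 0, \\
&\quad a \cdot (u+v) = a\cdot u + a\cdot v, \qquad (a+b)\cdot v = a\cdot v + b\cdot v, \qquad (ab)\cdot v = a\cdot(b\cdot v), \qquad 1\cdot v = v.
\end{aligned}
$$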

... Except that we more or less can, nowadays, specify the mathematical axioms underlying the processes generating our universe (something something Poincaré group). To a sufficiently advanced superintelligence, there'd be no real difference.

Take 2: Perhaps the intuitions are false, and the difference is quantitative, not qualitative.

"Vectors" are concepts such that there's a simple list of axioms under which they're simple to describe/locate: they have low Kolmogorov complexity. By comparison, "trees" have a simple generator, but locating them within that generator's output (the quantum multiverse) takes very many bits.

  • Optimistic case: The distribution of complexities is bimodal, with "theoretical concepts" clustered at the low-complexity end and "real-world abstractions" at the high-complexity end. We can lop off the high-complexity end of the distribution and end up with just the "theoretical" concepts.
  • Pessimistic case: "Theoretical concepts" and "real-world abstractions" sit on a continuum, from e. g. "this specific bunch of atoms" to "my friend Alice across time" to "humans" to "agents". It's impossible to usefully separate them into two non-overlapping categories.

I guess this is kind of plausible – indeed, it's probably the null hypothesis – but it doesn't feel satisfying.

Especially the pessimistic case: the "continuum" idea doesn't make sense to me. I think there's a big jump between "a human" and "an agent", and I don't see what abstractions could sit between them. (An abstraction over {humans, human governments, human corporations}, which is nevertheless more specific than "an agent in general"? Empirically, humanity hasn't been making use of this abstraction – we don't have a term for it – so it's evidently not convergently useful.)

Take 3: Causality-based definitions. Perhaps "theoretical abstractions" are convergently useful abstractions which can't be changed by any process within our universe (i. e., within the net of causes and effects entangled with our senses)? "Trees" can be wiped out or modified, "vectors" can't be.
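One way to try writing this down (just a sketch: $\mathrm{Applies}_C$ is a placeholder predicate for "the concept $C$ is instantiated somewhere in the resulting world", and it already commits to the first of the two approaches below):

$$
\mathrm{Theoretical}(C) \;:\Longleftrightarrow\; \mathrm{Applies}_C\!\big(\operatorname{do}(X{=}x)\big) = \mathrm{Applies}_C(\varnothing) \quad \text{for every physical intervention } \operatorname{do}(X{=}x) \text{ available within our causal net.}
$$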

This doesn't really work, I think. There are two approaches to cashing out "changed":

  • We model "changing a concept" as "physical interventions that change whether this concept is applicable". Then coloring all tree leaves in the world purple would causally impact the "tree" abstraction.
    • ... except then blowing up the Earth would "causally impact" the "agent" or "market" abstractions as well, by making the corresponding "purely theoretical" concepts inapplicable.
  • We model "concepts" as timeless...
    • ... in which case "a green-leaved tree" would remain unchanged by our coloring all tree leaves purple.

Intuitively, it feels like there's something to the "causality" angle, but I haven't been able to find a useful approach here.

Take 4: Perhaps this is about reoccurrence.

Consider the "global ontology" of convergently useful concepts defined over our universe. A concept such as "an Earthly tree" appears in it exactly once: as an abstraction over all of Earth's trees (which are abstractions over their corresponding bundles-of-atoms which have specific well-defined places, etc.). "An Earthly tree", specifically, doesn't reoccur anywhere else, at higher or lower or sideways abstraction levels.

Conversely, consider "vectors" or "markets". They never show up directly. Rather, they serve as "ingredients" in the makeup of many different "real-world" abstractions. "Markets" can model human behavior in a specific shop, or in the context of a country, and in relation to many different types of "goods" – or even the behavior of biological and even purely physical systems.

Similar for "agents" (animals, humans, corporations, governments), and even more obviously for "vectors".

Potential counterarguments:

  • "An Earthly tree" can be meaningfully used to model abstract processes: for example, you can reason about trees-the-data-structures as being physical-tree-like (rather than the other way around). Similarly, you can define "an agent" by taking the "human" abstraction and then subtracting various human idiosyncrasies from it...
    • ... but that's the key point here: subtraction. Even if you start out using the "human" abstraction to model agents in general (e. g., ascribing personhood to governments or corporations), you'd eventually "wash out" the human details, until you're left with a general-purpose "agent" abstraction that can be fitted with human-specific or government-specific details when you're talking about humans or governments.
    • That is, "a human" is not actually a usefully reoccurring abstraction.
  • Consider concepts such as "trees as used in human cultural symbolism" or "trees in this book I'm reading". Those are copies of the "tree" abstraction that "exist" in different places in the global ontology, aren't they?
    • ... But those concepts are either (1) meaningfully different from "an Earthly tree" (e. g., alien trees in a sci-fi book), or (2) pointers to the "Earthly tree" abstraction rather than copies of it.
  • Are "reoccurring ingredients" and "pointers" the same thing? That is, if "a human artist's conception of a tree" is defined as "the pointer to 'a tree' abstraction PLUS that artist's idiosyncrasies", should we not consider "the US market" as "the pointer to 'a market' abstraction PLUS that country's idiosyncrasies"? Then "a tree" would be as purely theoretical as "a market", once again.
    • I think there's a meaningful difference. "Pointers" feel like a relationship between abstractions that shows up specifically in the context of embedded agents – with them having world-models, desires to talk about convergently useful abstractions, et cetera. By comparison, "vectors" don't reoccur in the makeup of gravitational systems and fluids "because" gravity and fluids want to "talk about" them.
    • I. e.: for "pointers", there's an embedded causal process that "copies" these abstractions across layers, whereas for "reoccurring ingredients", the reoccurrence happens spontaneously.

Take 4 seems fairly promising to me, overall. Can you spot any major issues with it? Alternatively, a way to more properly flesh it out/formalize it?



Comments

I don't think this has much direct application to alignment, because although you can build safe AI with it, it doesn't differentially get us towards the endgame of AI that's trying to do good things and not bad things. But it's still an interesting question.

It seems like the way you're thinking about this, there are some directed relations you care about (the main one being "this is like that, but with some extra details") between concepts, and something is "real"/"applied" if it's near the edge of this network - if it doesn't have many relations directed towards even-more-applied concepts. It seems like this is the sort of thing you could only ever learn by learning about the real world first - you can't start from a blank slate and only learn "the abstract stuff", because you only know which stuff is abstract by learning about its relationships to less abstract stuff.

"It seems like this is the sort of thing you could only ever learn by learning about the real world first"

Yep. The idea is to try to get a system that develops all practically useful "theoretical" abstractions, including those we haven't discovered yet, without developing desires about the real world. So we train some component of it on real-world data, then somehow filter out the "real-world" stuff, leaving only a purified superhuman abstract reasoning engine.

One of the nice-to-have properties here would be if we didn't need to be able to interpret its world-model to filter out the concepts – if, in place of human understanding and judgement calls, we could blindly use some ground-truth-correct definition of what is and isn't a real-world concept.