This is part of a series covering my current research agenda. Refer to the linked post for additional context.
Suppose we have a dataset consisting of full low-level histories of various well-abstracting universes. For simplicity, imagine them as high-dimensional vectors, corresponding e. g. to the position of each particle at a given moment, or to the coordinates of that state in a classical state-space.
Suppose we wanted to set up a system which maps any such sample to its minimal-description-length representation; i. e., to a well-structured world-model that takes advantage of its abstract regularities. How would it work? What are the algorithms that implement search for such representations?
I'll be building the model piece by piece, incrementally introducing complications and proposed solutions to them.
Part 1 covers the following problem: if we're given a set of variables over which an abstraction hierarchy is defined, what is the type signature of that hierarchy, and how can we learn it?
Suppose we have a set of random variables $X = \{X_1, \dots, X_n\}$. What is the most compact way to represent it without loss of information?
In the starting case, let's also assume there's no synergistic information.
Formally, we can define the task as follows: define a set of deterministic functions $f_1, \dots, f_m$ of the variables such that $\sum_k H(f_k(X))$ is minimized under the constraint of $H(f_1(X), \dots, f_m(X)) = H(X)$.
Intuitively, we need to remove all redundancies. If two variables have some redundant information, we need to factor it out into a separate variable, and only leave them with the information unique to them. But if two different redundant-information variables, produced by two different subsets of our variables, also have some shared information, we need to factor it out into an even "higher-level" redundant-information variable as well, leaving the two of them with only whatever information is "uniquely shared" within their respective subsets. This already hints at a kind of hierarchy...
A natural algorithm for the general case is:
1. Consider every subset of the variables in $X$ (including one-element subsets).
2. For each subset, define a redundant-information variable: a deterministic function of that subset's variables containing all the information shared by every variable in it.
3. Factor out ("extract"): remove, from each such variable, everything that's also contained in the redundant-information variables of larger subsets, so that each is left with only the information uniquely shared by exactly its subset.
4. Delete the variables that end up empty.
Intuitively, each subset becomes associated with a variable containing all information uniquely shared among the variables in that subset – information that is present in all of them, but nowhere else. Many subsets would have no such information, their associated variables deleted.
Those redundant-information variables would then form a natural hierarchy/partial ordering. The highest-level ones would contain information redundant across most $X$-variables, which was initially also present in most intermediate-level redundant-information variables. The lowest-level ones would contain information unique to a given $X$-variable.
A way to think about it is to imagine each $X$-variable as a set of "atomic" random variables – call them "abstraction factors" – with some of those factors reoccurring in other $X$-variables. What the above procedure does is separate out those factors. (The term "abstraction factors" is chosen to convey the idea that those factors, in separation, may not actually constitute independent abstractions. I'll elaborate on that later.)
(All this assumes that we can just decompose the variables like this via deterministic functions... But that assumption is already largely baked into the idea that this can be transformed into a well-structured world-model. In practice, we'd use terms like "approximately extract", "approximately deterministic", et cetera.)
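To make that concrete, here's a toy version of the procedure in code. It assumes the idealized setting from the paragraph above, where each variable literally is a set of independent "atomic" factors; real variables would need approximate, learned extractors instead of literal set operations, and the function names here are just for illustration.

```python
from itertools import combinations

def factor_out(variables: dict[str, frozenset]) -> dict[frozenset, frozenset]:
    """Toy factorization: each variable is a set of 'atoms'. The factor associated with
    a subset S is the set of atoms present in every variable of S and in no variable
    outside S ('uniquely shared'); subsets with no such atoms get no factor."""
    names = list(variables)
    factors = {}
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            inside = set.intersection(*(set(variables[v]) for v in subset))
            outside = set().union(*(variables[v] for v in names if v not in subset))
            uniquely_shared = frozenset(inside - outside)
            if uniquely_shared:  # "emptied" factors are simply never kept
                factors[frozenset(subset)] = uniquely_shared
    return factors

# Atom "a" is shared by all three variables, "b" only by X1 and X2, "c" is unique to X3:
toy = {"X1": frozenset("ab"), "X2": frozenset("ab"), "X3": frozenset("ac")}
print(factor_out(toy))
# -> {X1, X2, X3}: {a},  {X1, X2}: {b},  {X3}: {c}
```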
Example: Suppose that we have six variables $X_1, \dots, X_6$ whose shared information has some specific structure. The resultant hierarchy would be:
Consider picking any given node and "restoring" all the information from the factors that are its ancestors. This procedure recovers the redundant-information variables that existed prior to the factorization/"extraction" procedure (Step 3 in the above algorithm).
To me, this structure already vaguely feels like the right "skeleton" of an abstraction hierarchy, although it's of course rudimentary so far. Each "restored" node would represent a valid abstract object, and the factors are the "concepts" it's made of, from the most abstract (e. g., "what animal this is") to the most concrete (e. g., "details about this specific animal").
Further, there's a simple connection to natural latents! If you take any given factor (diagram 1), and condition the "restored" variables corresponding to its children-factors (diagram 2) on all of their other ancestral factors, this factor becomes the natural latent for those variables. It makes them independent (because variables are independent if conditioned on all of their abstraction factors, since those represent all shared information between them), and they all tell the same information about it (namely, they all contain that factor).
(Specifically, what we do here is: pick a node on the graph, go down to its children, then iteratively follow the arrows pointing to them upwards, and condition on all of the variables you encounter except the one you picked.)
In fact, this works with any set of factors with a total ordering (i. e., without "siblings"/incomparable elements); or, equivalently stated, for any subset of the factors present in a given "restored" variable. By contrast, a set containing incomparable "sibling" factors is not a natural latent for any set of variables: relative to any candidate set, the extra sibling factor is effectively just random noise, and the actual latent is the totally ordered part alone.
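Here's that conditioning recipe as code, over a hypothetical parent-pointer representation of the factor graph (all the node names are made up for the example):

```python
def conditioning_set(parents: dict[str, set[str]], chosen: str) -> tuple[set[str], set[str]]:
    """Given a child -> set-of-parents map and a chosen factor, return the chosen
    factor's children and the set of their other ancestral factors to condition on."""
    children = {node for node, ps in parents.items() if chosen in ps}
    ancestors = set()
    frontier = set(children)
    while frontier:
        node = frontier.pop()
        for parent in parents.get(node, set()):
            if parent not in ancestors:
                ancestors.add(parent)
                frontier.add(parent)
    return children, ancestors - {chosen}

# Hypothetical toy hierarchy: F_top sits above F_mid, and both feed the leaves.
toy_parents = {
    "F_mid": {"F_top"},
    "X1": {"F_mid", "F_top"},
    "X2": {"F_mid", "F_top"},
    "X3": {"F_top"},
}
children, cond = conditioning_set(toy_parents, "F_mid")
print(children)  # {'X1', 'X2'} -- F_mid should be a natural latent over these...
print(cond)      # {'F_top'}    -- ...conditional on this set
```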
Some examples of how it all might work in concrete terms, highlighting important aspects of the picture. All assume the hierarchy from the example above.
1. Toy store. Imagine that we have a list of products sold at a toy store. The top-level factor is "a product", below it is "a toy", below that are "a car toy" and "a constructor toy", and the low-level $X$-variables specify the details of individual products. One of them is a LEGO car, and another is some non-toy product being sold on the side.
Note that it's specifically the "restored" variables that allow this kind of interpretation, not the individual abstraction factors. It's the restored node that is "a toy", not the bare factor beneath it. That factor is some "essence of toy-ness" such that, when combined with the "product" factor above it, it transforms an unspecified "product" into a "toy product". Taken in isolation, the factor's values may not actually have a clear interpretation! They're "correction terms", which may be meaningless without knowing what we're correcting from. This is the sense in which abstraction factors function as "ingredients" for abstractions, rather than as full abstractions in themselves.
Note also the "uneven stacking": there are no "globally synchronized" abstraction levels, some variables/regions have more layers of organization than others, like versus all other variables. This makes broader intuitive sense (the tower of abstractions on Earth is taller than on Mars, since Earth has cells/organisms/populations/societies/civilization).
2. Photo. Imagine that we're looking at a photo. One factor is "the lighting conditions", another is "specific objects in the photo", with "dogs in the photo" and "sculptures in the photo" below it. Of the low-level variables, two are dogs, one is a dog sculpture, two are some non-dog sculptures, and one is the background.
Consider the idea of "abstract edits". If we have this decomposition, we could: (1) decompose the picture into factors, (2) modify a specific factor, e. g. "lighting conditions", then (3) re-compose the picture, but now with modified lighting.
3. Machine-Environment System. Consider a system involving the interactions between some machine and the environment. One factor is the machine's state (e. g., "releasing energy" or "consuming energy"), two factors below it add specific details about the states of its two modules, five of the $X$-variables contain specifics about the states of the machine's individual components, a sixth is information about the state of the external environment, and the top-level factor is the system's overall highest-level state (e. g., "there's currently a positive-feedback loop between the machine and the environment"). One component serves as the interface between the two modules; the other components belong to the modules separately.
Notes:
4. Machine-Environment System, #2. Consider a machine-environment system made of gears and pulleys. The higher-level nodes here are "a gear", "a pulley", "human manufacturing practices", "this external environment", and "the overall type of this system". Two of the low-level variables are specific gears, two are specific pulleys, and one is a toothed pulley.
This system can be the same system as in example (3): it's just that here, we focus on a different subgraph in our partial-order abstraction diagram, the "sibling" of the one in (3). (3) focuses on the system as an abstraction over its constituent parts, while (4) focuses on representing the constituents as instances of commonly reoccurring types of objects.
Note how all of the tools described above can be combined. We can take this system, decompose it into the representation in (4), make some abstract edit to the gears' manufacturing practices, re-assemble the system, re-disassemble it into the (3) representation, take the two module nodes, condition on the environment variable (to remove interactions with the environment), and restore the machine's overall state to them (to recover the dynamics between the modules). Intuitively, this corresponds to evaluating how the interactions between the modules change if we switch up our gear manufacturing practices; perhaps running the corresponding simulation.
5. People. Suppose that the $X$-variables are six different people living in the same city. The factors are "the city's state", "the state of the city's socioeconomic sphere", "the state of a specific company", and "the state of a popular political movement". Three people work at the company, three people participate in the movement, one person is part of both the company and the movement, and one person is unemployed and not part of the movement.
Focus specifically on the person that's part of both the corporate dynamics and the political dynamics. Note that (a) those dynamics can be largely non-interacting, and (b) both of those dynamics "share" the substrate on which they run: that same person. This is "polysemanticity" of a sort: a given low-level system can simultaneously implement several independent high-level systems. (This is essentially just another way of looking at "sibling subgraphs", but it conveys different intuitions.)
Distilling, some notable features are:
I think it's pretty neat that all of this expressivity falls out of the simple algorithm described at the beginning. Granted, there's a fair amount of "creative interpretation" going on on my end, but I think it's a promising "skeleton". However, major pieces are still missing.
We'd like to get rid of the "no synergistic information" assumption. But first: what is synergistic information?
Usually, it's described as "the information the set of variables $X$ gives us about some target variable $Y$, which we can't learn by inspecting any strict subset of $X$ in isolation". The typical example is an XOR gate: if $Y = X_1 \oplus X_2$, and $X_1$ and $X_2$ are independent random bits, looking at either bit tells us nothing about $Y$, while both bits taken together let us compute $Y$ exactly.
But note that synergistic information can be defined by referring purely to the system we're examining, with no "external" target variable. If we have a set of variables $X = \{X_1, \dots, X_n\}$, we can define the variable $s$ such that $I(s; X)$ is maximized under the constraint of $I(s; S) = 0$ for all $S \in \mathcal{P}^-(X)$. (Where $\mathcal{P}^-(X)$ is the set of all subsets of $X$ except $X$ itself.)
That is: $s$ conveys information about the overall state of $X$ without saying anything about any specific variable (or set of variables) in it.
The trivial example is two independent bits: $s$ is their XOR.
A more complicated toy example: Suppose our random variable $X$ is a 100-by-100 grid of binary variables $X_{i,j}$, and each sample of $X$ is a picture where some 8 adjacent variables are set to 1, and all others to 0. $s$ can then return the shape the activated variables make. Across all realizations, it tells us approximately zero information about any given subset (because the number of active variables is always the same), but we still learn something about $X$'s overall state.
This is the "true nature" of synergistic information: it tells us about the "high-level" features of the joint samples, and it ignores which specific low-level variables implement that feature.
Another example: emergent dynamics. Consider the difference between the fundamental laws of physics and Newtonian physics: the fundamental laws hold for each particle individually, regardless of what the rest of the system is doing, while "this bunch of particles behaves like a Newtonian-mechanical system" is a fact about how all of them are jointly arranged.
I. e.: unlike with the fundamental laws, "does this system approximately implement Newtonian physics, yes/no?" depends on synergistic information in large sets of fundamental particles, and many conditions need to be met simultaneously for the answer to be "yes".
Note also that synergistic information is effectively the "opposite" of redundant information. Conditioning lower-level variables on synergistic information creates, not removes, mutual information between them. (Consider conditioning the XOR setup on the value of the input $X_1$. Suddenly, there's mutual information between the other input $X_2$ and the output $Y$! Or: consider conditioning a bunch of fundamental particles on "this is a Newtonian-physics system". Suddenly, we know they have various sophisticated correspondences!)
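A quick numerical check of both claims, using the XOR setup and a plug-in mutual-information estimate over the four equally likely joint samples:

```python
from itertools import product
from collections import Counter
from math import log2

def mutual_information(pairs):
    """I(A;B) in bits, for a list of equally likely (a, b) samples."""
    n = len(pairs)
    joint, pa, pb = Counter(pairs), Counter(a for a, _ in pairs), Counter(b for _, b in pairs)
    return sum((c / n) * log2((c / n) / ((pa[a] / n) * (pb[b] / n))) for (a, b), c in joint.items())

samples = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

print(mutual_information([(x1, y) for x1, x2, y in samples]))             # I(X1; Y) = 0: synergy
print(mutual_information([((x1, x2), y) for x1, x2, y in samples]))       # I(X1, X2; Y) = 1 bit
print(mutual_information([(x2, y) for x1, x2, y in samples]))             # I(X2; Y) = 0...
print(mutual_information([(x2, y) for x1, x2, y in samples if x1 == 0]))  # ...but I(X2; Y | X1=0) = 1
```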
How can we add synergistic information into our model from 1.1?
One obvious approach is to just treat synergistic variables as... variables. That is: compute the synergistic variable for every subset of the low-level variables, add all of those to the pool as ordinary variables, and then run the algorithm from 1.1 on the enlarged pool.
As far as I can tell, this mostly just works. Synergistic variables can have shared information with other synergistic variables, or with individual variables; the algorithm from 1.1 handles them smoothly. They always have zero mutual information with their underlying $X$-variables, and with any synergistic variables defined over subsets of their underlying $X$-variables, but that's not an issue.
Note that no further iteration is needed: we don't need to define synergistic variables over sets of synergistic variables. They would just contain parts of the information contained in a "first-iteration" higher-level synergistic variable, and so the algorithm from 1.1 would empty out and delete them.
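Schematically, in code (`learn_synergistic_variable` is a placeholder name for whatever approximate method would actually produce these variables; `factor_out` is the redundancy-factoring procedure from 1.1):

```python
from itertools import combinations

def learn_synergistic_variable(subset):
    """Placeholder: a function of the subset's joint state carrying maximal information
    about it and none about any strict sub-subset; returns None if there isn't one."""
    raise NotImplementedError  # the hard, unspecified part

def factor_with_synergy(variables, factor_out):
    pool = dict(variables)
    # Compute a synergistic variable for every subset of the low-level variables...
    for size in range(2, len(variables) + 1):
        for subset in combinations(variables, size):
            syn = learn_synergistic_variable({v: variables[v] for v in subset})
            if syn is not None:
                pool[("syn", subset)] = syn
    # ...then treat them as ordinary variables and run the 1.1 algorithm on the pool.
    # (No second round of synergy-extraction is needed, per the note above.)
    return factor_out(pool)
```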
Important caveats:
(3) is kind of very majorly inconvenient. It's clear why it happens: as stated, the synergistic variable does not actually "extract" any entropy/information from the variables on which it's defined (the way we do when we factor out redundant information), so some information ends up double-counted. I do have some ideas for reconciling this with the overall framework...
The current-best one (which may be fundamentally confused, this reasoning is relatively recent) is:
Again, this is a relatively fresh line of argument, and it may be obviously flawed from some angle I haven't yet looked at it from.
As it turns out, the picture we now have cleanly maps to a known concept from information theory: partial information decomposition (PID); or partial entropy decomposition.
This paper provides a basic overview of it. (It uses a very ad-hoc-ish definition of redundant information, but is otherwise fine.)
PID's steps mirror pretty much all of the steps we've gone through so far. Starting from some set of variables $X$ with entropy $H(X)$, PID: defines a measure of the information redundantly shared by each collection of subsets of $X$; arranges those collections into a lattice; and then computes the "partial information atoms" by subtracting, from each node's value, everything already accounted for by the nodes below it, so that the atoms sum back to the total.
There's a clear correspondence between PID's "atoms" and my abstraction factors: the pre-subtraction atoms are equivalent to my "restored" nodes, there's a procedure isomorphic to deleting "emptied" variables, et cetera.
Major difference: PID does not define any variables. It just expands the expression quantifying the total entropy into a sum of the entropies of those atoms. Which is why the entropies of all the atoms are able to add up to the total entropy with no complications, by the way: we have no trouble subtracting synergistic-information entropy.
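For reference, in the standard two-source case (Williams & Beer's target-based formulation, rather than the target-free entropy version gestured at here), the atoms are the redundancy $R$, the unique informations $U_1, U_2$, and the synergy $S$, and they satisfy:

$$I(Y; X_1, X_2) = R + U_1 + U_2 + S, \qquad I(Y; X_1) = R + U_1, \qquad I(Y; X_2) = R + U_2.$$

The "subtraction" step is exactly solving this kind of system for the atoms, given a choice of redundancy measure; the multivariate case does the same thing over a larger lattice.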
One issue with PID worth mentioning is that the field hasn't figured out what measure to use for quantifying multivariate redundant information. It's the same problem we seem to have. But it's probably not a major issue in the setting we're working in (the well-abstracting universes).
And if we're assuming exact abstraction in our universe, I expect we'd get exact correspondence: every PID information atom would correspond to an abstraction factor, and the factor's entropy would be the value of that PID atom.
Overall, the fact that the framework I built up incrementally here turns out to near-perfectly match another extant framework is a good sign that I'm getting at something real. In addition, PID fundamentally feels like the "right" way to do things, rather than being some arbitrary ad-hoc construction.
Also: this offers a new way to look at the problem. The goal is to find a correct "constructive" way to do partial entropy decomposition. The functions for learning the abstract-factor variables may, in fact, directly correspond to the functions defining the "correct" way to do partial entropy decomposition.
Let's see what adding synergistic variables to the broader framework lets us do.
I'll be focusing on natural latents here. As stated in 1.1, in my framework, a natural latent can be defined as any set of abstraction factors with a total order (considered over the variables in the appropriate set conditioned on their other ancestral factors).
Let's consider a dog at the atomic scale. Intuitively, the correct theory of abstraction should be able to define the relationship between the dog and the atoms constituting it. However, there are complications:
What I propose is that, in this context, "a dog" is the value of the natural latent over (functions of) specific synergistic variables. "A DNA string", "the shape of this animal's skull", "the sounds this animal makes", "the way this animal thinks" all let us know what animal we're looking at; and they're exactly the features made independent by our knowing what animal this is; and, intuitively, they contain some synergistic information.
However, that seems to just push the problem one level lower. Isn't "a DNA string" itself in the same position as the dog relative to the atoms (or subatomic particles) constituting it, with all the same complications? I'll get to that.
First, a claim: every natural latent/abstraction $\Lambda$ is a function of the synergistic variable over the set of "subvariables" constituting the variables in the set for which $\Lambda$ is a natural latent. (Degenerate case: the synergistic variable $s$ over a one-variable set $\{X_i\}$ is just $X_i$.)
Let me unpack that one.
We have some set of variables $X = \{X_1, \dots, X_n\}$ over which a natural latent $\Lambda$ is defined – such as a set of $n$ dogs, or a set of $n$ features of some object. $\Lambda$ is a function which can take any $X_i$ as an input, and return some information, such as properties of dogs (or, e. g., nuclear reactors).
But those individual $X_i$ are not, themselves, in the grand scheme of things, necessarily "atomic". Rather, they're themselves (functions of) sets of some other variables. And relative to those lower-level variables – let's dub them "subvariables" $x_{i,j}$, with $X_i$ a function of $\{x_{i,1}, \dots, x_{i,m_i}\}$ – the function $\Lambda$ is a function whose value is dependent on the synergistic variable.
Example: Consider a set of programs which all implement some function $F$, but which implement it in different ways: using different algorithms, different programming languages, etc. The set of programs is independent given $F$: it's a natural latent/abstraction over it. But each program itself consists of a set of lower-level operations, and relative to those, $F$ is a synergistic variable: all of them must be in the correct places for the emergent behavior of "implements $F$" to arise. Simultaneously, the presence of any specific basic operation, especially in a specific place in the program's code, is not required, so $F$ provides little information about them.
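To make the example concrete (a toy illustration; the choice of function and of implementations is of course arbitrary): here are three structurally different programs that all implement "sort a list". "Implements sorting" is a property of each whole program – all the low-level operations have to line up – while telling you almost nothing about which specific operations appear where.

```python
def impl_builtin(xs):            # delegate to the standard library
    return sorted(xs)

def impl_insertion(xs):          # insertion sort, written imperatively
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] < x:
            i += 1
        out.insert(i, x)
    return out

def impl_quicksort(xs):          # recursive quicksort
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    return impl_quicksort([x for x in rest if x < pivot]) + [pivot] + \
           impl_quicksort([x for x in rest if x >= pivot])

test = [3, 1, 2, 5, 4]
assert impl_builtin(test) == impl_insertion(test) == impl_quicksort(test) == [1, 2, 3, 4, 5]
```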
Another example: Consider the difference between "this specific dog" and "the concept of a dog". What we'd been analyzing above is the "this specific dog" one: an abstraction redundantly represented in some synergistic features/subsystems forming a specific physical system.
But "the concept of a dog" is a synergistic variable over all of those features/subsystems. From the broader perspective of "what kind of object is this?", just the presence of a dog's head is insufficient. For example, perhaps we're looking at a dog's statue, or at some sort of biotech-made chimera?
Here's what I think is going on here. Before we went "what type of animal is this?" → "this is a dog", there must have been a step of the form "what type of object is this?" → "this is an ordinary animal". And the mutual information between "dog's head", "dog DNA", and all the other dog-features only appeared after we conditioned on the answer to "what object is this?"; after we conditioned on "this is an ordinary animal".
If we narrow our universe of possibilities to the set of animals, then "a dog" is indeed deducible from any feature of a dog. But in the whole wide world with lots of different weird things, "a dog" is defined as an (almost) exhaustive list of the properties of dog-ness. (Well, approximately, plus/minus a missing leg or tail, some upper/lower bounds on size, etc. I'll explain how that's handled in a bit.)
Those descriptions are fairly rough, but I hope it's clear what I'm pointing at. Conditioning some variables on some latent while treating that latent as encoding synergistic information over those variables may create redundant information corresponding to a different valid natural latent. There's also a natural connection with the idea I outlined at the end of 1.3, about synergistic variables creating probabilistic structures.
This is also how we handle the "downwards ladder". Conditional on "this is an animal", "this is a dog" is redundantly represented in a bunch of lower-level abstract variables like "DNA" or "skull". Whether a given structure we observe qualifies as one of those variables, however, may likewise be either information redundant across some even-lower-level abstractions (e. g., if we observe a distinct part of dog DNA conditional on "this is part of a whole DNA string"), or it may be synergistic information (if the corresponding universe of possibilities isn't narrowed down, meaning we need to observe the whole DNA string to be sure it's dog DNA).
(Note: In (1.1), I described situations where natural latents are valid over some set only conditional on other natural latents. The situation described here is a different way for that to happen: in (1.1), conditioning on other latents removed redundant information and let our would-be natural latent induce independence; here, conditioning on a latent creates redundant information which the new natural latent is then defined with respect to.)
Moving on: I claim this generalizes. Formally speaking, every natural latent/abstraction $\Lambda$ is a function of the synergistic variable over the set of subvariables constituting the variables in the set for which $\Lambda$ is a natural latent. Simultaneously, $\Lambda$ is the redundant information between those variables. (And, again, the synergistic variable over a one-variable set $\{X_i\}$ is just $X_i$.)
General justification: Suppose that $X_i$ contains all the necessary information about $\Lambda$ even without some subvariable $x_{i,j}$. That is, $I(\Lambda; X_i \setminus \{x_{i,j}\}) = I(\Lambda; X_i)$.
In that case... Why is $x_{i,j}$ there? Intuitively, we want $X_i$ to be, itself, the minimal instance of the latent $\Lambda$, and yet we have some random noise added to it. Common-sense check: our mental representation of "a dog" doesn't carry around e. g. random rocks that were lying around.
On the contrary, this would create problems. Suppose that we feel free to "overshoot" regarding what subvariables to put into the $X_i$, such that sometimes we include irrelevant stuff. Then variables in some subset of $X$ may end up with the same unrelated subvariables added to them (e. g., dogs + rocks), and variables in a different subset may have different unrelated subvariables (e. g., blades of grass). "A dog" would then fail to make all variables in $X$ independent of each other.
Indeed, $X$ would not have a natural latent at all. We'd end up with two abstractions, "dog + rocks" and "dog + grass"... Except more likely, there'd be many more different types of unrelated subvariables added in, so we wouldn't be able to form a sufficiently big dataset for forming a "dog" abstraction to begin with.
So the "nuisance" subvariables end up "washed out" at the abstraction-formation stage. Meaning all subvariables constituting each are necessary to compute .
Admittedly, this still leaves the possibility that $\Lambda$ is a function of the unique information in each $x_{i,j}$, rather than of the synergistic information. I think that makes less intuitive sense, however:
Nitpick: Suppose that the set is such that every $k$-sized subset of it has a nontrivial synergistic variable which contains the same information.
Example: Shamir's secret-sharing scheme, where knowing any $k$ shares out of $n$, with $n$ arbitrarily larger than $k$, lets you perfectly reconstruct the secret, but knowing even one share fewer gives you zero information about it. Intuitively, all $n$ shares should be bundled together into one abstraction... but the above argument kicks out all but some random subset of $k$ of them, right?
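For concreteness, here's a minimal toy implementation of Shamir's scheme (illustrative only, not hardened; the field size and parameters are arbitrary):

```python
import random

P = 2**61 - 1  # prime modulus for the toy field

def make_shares(secret, k, n):
    """Split `secret` into n shares; any k of them suffice to reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange-interpolate the hidden polynomial at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = make_shares(secret=42, k=3, n=10)
print(reconstruct(shares[:3]))        # 42: any 3 shares work...
print(reconstruct(shares[4:7]))       # 42: ...regardless of which 3
print(reconstruct(shares[:2]) == 42)  # False (with overwhelming probability): 2 shares reveal nothing
```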
And we can imagine similar situations arising in practice, with some abstractions defined over any small subset of some broader set of e. g. features. See the previous dog example: dogs usually have four legs, but some of them have three; most have tails, but some don't; etc.
But the framework as stated handles this at a different step. Recall that we compute the synergistic variables for all subsets, add them to the pool of variables, and then try defining redundant-information variables for all subsets. Since the synergistic variables for all subsets of Shamir shares contain the same information, they would be bundled up when we're defining redundant-information variables.
Similar with dogs: we'd end up with many different sets of features, each sufficient for "this is a dog", without there necessarily being a minimal set of features which is both (1) sufficient for "this is a dog", and (2) a subset of each of those sets.
You may argue this is clunky. It very much is; a better, more general way will be outlined in the next part.
Clarification: some things which I wasn't necessarily arguing above.
Aside: I think there's also a fairly straightforward way to generalize claims from this section to causality. Causality implies correlation/mutual information between a system's states at different times. The way this information is created is by conditioning a history (a set of time-ordered system-states) on a synergistic variable defined over it, with this synergistic variable having the meaning of e. g. "this is a Newtonian-mechanics system". This likewise naturally connects with/justifies the end of 1.3, about interpreting synergistic variables as information about the distribution, rather than about specific joint samples.
Summing up:
Part 1 outlined my high-level model of how we can learn an abstraction hierarchy given a clean set of low-level variables out of which it "grows".
Namely: we can cast it as a "constructive" cousin of partial information decomposition. The algorithm goes as follows: compute a synergistic variable for every subset of the low-level variables and add those to the pool; for every subset of the pooled variables, define a redundant-information variable containing the information shared by all of its members; factor that information out, leaving each such variable with only the information uniquely shared by exactly its subset; and delete the variables that end up empty.
Each resulting redundant-information variable can then be considered an "abstraction factor". Any set of abstraction factors with a total ordering functions as a natural latent for the lowest-level factor's children, conditional on those children's other ancestral variables.
This setup has a rich set of features and functionalities, in line with what we'd expect the theory of abstraction to provide, and it seems to easily handle a wide variety of thought experiments/toy models/case studies. Notably, it offers a way to deal with situations in which abstractions seem to share the properties of redundant-information variables and synergistic-information variables: by defining abstractions as natural latents of functions of synergistic variables.
One notable unsolved fatal problem/small inconvenience is that the presence of synergistic variables directly opposes the idea of producing the minimal representation of the initial set of variables. I've not dealt with this yet, but I expect there to be a simple conceptual explanation. A promising line of argument involves treating the probabilistic structure as just higher-abstraction-level random variables, which increases the total entropy to which our decomposition should add up.
Another problem is, of course, the computational intractability. The above algorithm features several steps where we learn a specific variable for every subset of a (likely vast) number of low-level variables we're given. Obviously, a practical implementation would use some sort of heuristics-based trick to decide on which variable-groupings to try.
All those caveats aside, the problem of learning an abstraction hierarchy (the way it's defined in my framework) is now potentially reducible to a machine-learning problem, under a bunch of conditions. That is:
(3) is actually a deceptively major problem. The next part is about that.
What I'm particularly interested in for the purposes of the bounties: Is there some better way to handle synergistic information here? A more "correct" definition for synergistic-information variables? A different way to handle the "overcounting" problem?