Condensation: a theory of concepts is a model of concept-formation by Sam Eisenstat. Its goals and methods resemble John Wentworth's natural abstractions/natural latents research.[1] Both theories seek to provide a clear picture of how to posit latent variables, such that once someone has understood the theory, they'll say "yep, I see now, that's how latent variables work!".
The goal of this post is to popularize Sam's theory and to give my own perspective on it; however, it will not be a full explanation of the math. For technical details, I suggest reading Sam's paper.
Shannon's information theory focuses on the question of how to encode information when you have to encode everything. You get to design the coding scheme, but the information you'll have to encode is unknown (and you have some subjective probability distribution over what it will be). Your objective is to minimize the total expected code-length.
Algorithmic information theory similarly focuses on minimizing the total code-length, but it uses a "more objective" distribution (a universal algorithmic distribution), and a fixed coding scheme (some programming language). This allows it to talk about the minimum code-length of specific data (talking about particulars rather than average expectations becomes more central).
Both versions of information theory study compression. Sam's alternative is called condensation in contrast.
Condensation doesn't concern itself with the total code-length. Instead, it concerns itself with how easy it is to use the encoding to answer questions about the data. This forces the encoding to be organized in a conceptually meaningful way, rather than compressed together into an uninterpretable blob.
Compression creates density. Condensation forms discrete droplets.
Condensing data creates a system of latent variables, which organize the data into meaningful parts. Sam has proven a theorem showing that different approximately-efficient condensations will posit approximately-isomorphic random variables. IE, two people who condense the data well will have similar "understandings" of the data and will be able to translate between their different conceptual schemes.
Like classical information theory, condensation has both a probabilistic and algorithmic variant. I'll briefly review classical information theory for the sake of comparison.
Claude Shannon's notion of information can be justified in the following way:
This analysis can be further refined by taking the above discussion to apply to the case where all configurations are equally probable. When the probabilities of configurations differ, the information contained in a configuration with probability $p$ is measured as if there were $\frac{1}{p}$ configurations (even if $\frac{1}{p}$ is not a whole number); ie, the information is $\log\left(\frac{1}{p}\right)$.[4]
Most commonly, we take log base 2, in which case we are measuring information in bits.
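To make this concrete, here is a tiny Python snippet computing the information content of an event from its probability (the specific probabilities are arbitrary examples):

```python
import math

def information_content(p: float) -> float:
    """Information (surprisal) of an event with probability p, in bits."""
    return math.log2(1 / p)

# A fair coin flip carries 1 bit; rarer events carry more.
print(information_content(0.5))    # 1.0
print(information_content(0.25))   # 2.0
print(information_content(0.9))    # ~0.152 -- fractional bits are allowed
```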
I find this derivation to be beautiful, but the immense impact of information theory cannot be justified from this alone. The practical power of this theory comes from the relationship between information and codes.
Imagine that Alice is keeping a diary of birds that visit her birdfeeder. She wants to conserve ink, paper, and her own time spent writing the diary; so, she wants her diary entries to be as brief as possible. However, she does want to accurately record the type of each bird, and the time and duration of its visit. Therefore, Alice wants to come up with an efficient encoding scheme for her diary.
If Alice has probabilistic beliefs, then it makes sense for Alice to leverage these beliefs in designing her encoding scheme by assigning shorter codes to more common events. If most birds that visit are starlings, Alice might decide to record only the time and duration of a starling visit. Bird type only needs to be specified for non-starlings, saving Alice time and ink.
Shannon proved that, no matter how clever Alice's code, the expected code length must be at least the expected information.[5] This gives a lower bound on the compression of information Alice can expect to achieve with her code. Although Shannon did not give good upper bounds, the study of near-optimal codes has continued to progress; these days, we can use arithmetic coding to get extremely close to this lower bound.
Thus, there is a tight relationship between probability distributions and efficient codes. In the case of codes written with the alphabet $\{0, 1\}$ ("binary codes"), the number of bits of information corresponds closely to the number of 0s and 1s. (Indeed, this correspondence is so close that the difference is commonly elided; 0s and 1s are bits, in common parlance.)
Given a binary code, I can infer a probability distribution: by imagining the 0s and 1s are heads and tails from coin-flips, I get a probability distribution such that $p(x) = 2^{-\ell(x)}$, where $\ell(x)$ is the length of the code for $x$. Given a probability distribution, I can infer a binary code: I can apply arithmetic encoding so that $\ell(x)$ is very close to $\log_2\left(\frac{1}{p(x)}\right)$.[6]
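Here is a small Python sketch of this two-way correspondence, using made-up bird probabilities in the spirit of Alice's diary (the code words and numbers are just for illustration):

```python
import math

# From a probability distribution to (idealized) code lengths:
# an optimal code assigns roughly log2(1/p) bits to each event.
birds = {"starling": 0.7, "sparrow": 0.2, "cardinal": 0.1}
ideal_lengths = {b: math.log2(1 / p) for b, p in birds.items()}
# {'starling': ~0.51, 'sparrow': ~2.32, 'cardinal': ~3.32}

# From a binary code back to a probability distribution:
# treat each codeword as a run of fair coin-flips, so p(x) = 2^(-length).
code = {"starling": "0", "sparrow": "10", "cardinal": "11"}
implied_probs = {b: 2 ** -len(w) for b, w in code.items()}
# {'starling': 0.5, 'sparrow': 0.25, 'cardinal': 0.25}

# Expected code length is bounded below by the entropy of the distribution.
entropy = sum(p * math.log2(1 / p) for p in birds.values())
expected_len = sum(birds[b] * len(w) for b, w in code.items())
print(entropy, expected_len)   # ~1.16 bits <= 1.3 bits
```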
This means information is a very good proxy for minimal compression, while being much nicer mathematically. Code-lengths can only ever be whole numbers, but information is a continuous function of probability, and is typically fractional (even when probabilities are rational numbers). The length of the optimal code for an event is a complicated combinatorial problem, which depends on the whole probability distribution; the information of an event is simply a function of that one event's probability. Code-length is a very direct physical quantity, whereas measuring information requires subjective probabilities.
I emphasize this because condensation is going to be a similar sort of abstraction. Optimal codes are a complicated optimization problem; information is a simpler thing which acts like an optimal-or-slightly-better-than-possible code. Similarly, condensation is going to discuss schemes for organizing data, but it discusses them in terms of a bound that's not always achievable.
One more idea will be useful: universal codes (the subject of algorithmic information theory). The basic idea: any useful compression needs to be computable -- meaning, I can write a computer program which decompresses the code to return the original events.[7] However, computer programs are themselves represented via a binary coding scheme (a programming language). Instead of coding a decoder ($c$ bits) and then coding the compressed data ($k$ bits) and then running the decoder on my compressed data to get the result, I could instead just directly write a computer program which, if run, produces the decompressed data.
This approach will take roughly $c + k$ bits to compress the data. Hence, any programming language[8] is "almost as good as" whatever other code you might come up with: it is at most $c$ bits worse, where $c$ is the description length (in the chosen programming language) of a decompression algorithm for your better code. This constant $c$ is ubiquitous in algorithmic information theory. Since it is a constant, it becomes insignificant (in compression-ratio terms) as the data grows. A special-purpose code will always be better if you have a strong probabilistic model of the data you want to compress, but a programming language is a good general-purpose code if you don't have a specific probability distribution to model your data.
In some circumstances, it is conceptually nicer to represent the complexity of the coding scheme; that is, $c + k$ can be a more meaningful measure than $k$ alone. For example, the Hutter Prize is a competition for compressing Wikipedia. Since the objective is to compress a single file which is known ahead of time and does not change, from the probabilistic perspective, the optimal compression scheme should compress it down to 0 bits (an empty file). However, if we looked at the code for such a compression scheme, we would see that it "cheats" by memorizing the data. (It could be something like "print: " followed by the Wikipedia text.) Algorithmic information theory penalizes this sort of cheating by counting the length of the decompression program plus the length of the compressed file, whereas probabilistic information theory doesn't.
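As a rough illustration of this accounting (not anything from Sam's paper), here is a Python sketch that counts a compressed payload alone versus program-plus-payload, with a memorizing "cheater" for comparison; the decoder stub is just a stand-in I wrote for this example:

```python
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 200

# Probabilistic-style accounting: only the compressed payload (k bits) counts.
payload = zlib.compress(data)
k = 8 * len(payload)

# Algorithmic-style accounting also charges for the decompression program
# (c bits).  A rough stand-in for that program:
decoder_src = b"import sys, zlib; sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))"
c = 8 * len(decoder_src)

# A "cheating" scheme that memorizes the data: its payload is empty, but the
# program itself contains the whole file, so c + k stays large.
cheat_src = b"import sys; sys.stdout.buffer.write(" + repr(data).encode() + b")"
cheat_total = 8 * len(cheat_src)

print(k, c + k, cheat_total)   # payload only vs. program + payload vs. memorization
```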
For our purposes here, the important thing I want to emphasize is how to translate back and forth between the probabilistic view and the algorithmic view.
Shannon & Weaver's seminal book introducing information theory was titled The Mathematical Theory of Communication. However, they do call out some ways in which it falls short of a full theory of communication. In particular, they note that it is not a theory of semantics or meaning (concepts which are, informally, strongly associated with the terms "information" and "communication").
I want to call out a slightly different way in which it falls short.
In my story about Alice keeping a bird-watching journal, Alice's problem was to record everything faithfully, so that she could reconstruct the entire series of events later. Realistically, however, Alice might want to consult her journal to reconstruct specific events (to answer specific questions). Shannon-style optimal codes do not optimize for this.
When a file is compressed, it typically has to be decompressed before anything else can be done to it. Arithmetic Coding, in particular, turns things into an incomprehensible mess of 1s and 0s. Shannon's notion of optimal codes prioritizes compression over everything else.[9]
Similarly, I think many proponents of Algorithmic Information Theory imagine the shortest program that outputs a target string of data to be something like a beautiful scientific theory of that data, but this need not be the case. It can instead be an incomprehensible mess. The only thing the math talks about is compression.
Sam's condensation is, roughly, an attempt to put the right constraints on the encoding to force it to be a beautiful scientific theory instead of an unintelligible compressed mess.
Shannon's theory of communication prioritizes compression alone. I claim that natural languages (as well as most artificial languages, EG, programming languages[10]) prioritize other things as well. I said some informal things about comprehensibility. How can we formalize these ideas?
Some time ago, I had the idea that there should be a theory of universal data-structures. Algorithmic information theory has given rise to something like a theory of universal algorithms,[11] but (as taught in a typical undergrad degree) that's only half of the picture of computer science; the other half is data-structures.
Sam's condensation is the first thing I've seen that feels like progress towards this goal.
Condensation viewed as a universal data-structure: organize the data into "bins" so that any of a variety of question-sets can be answered by retrieving only the relevant bins.
Sam considers a best-case scenario, where the information can be organized very well (I'll describe this in more detail later). Under this best-case assumption, the prior over question-sets does not actually matter: well-organized information can focus on answering each set of questions efficiently, rather than prioritizing one set over another.
The main point here is that we are optimizing for answering a variety of questions, rather than just optimizing for efficient reproduction of the whole blob of data, as in Shannon's picture.
More generally, I find the idea of optimizing representations for more than just compression highly appealing. It is extremely cognitively plausible that we need to optimize our representations for how they are used. Sam's picture is one step in this direction, and shockingly, gives us an elegant theory rather than a big mess.
I find the "universal data-structure" picture highly technically motivating,[12] but for the purpose of explaining things more intuitively, I prefer to describe things in terms of organizing notebooks.
Suppose you're in charge of managing an information desk. Guests come to the desk with (one or more) questions. The full list of questions you're obliged to answer is known; for anything outside of this list, "I don't know" is an acceptable response. However, you can't get all your clerks to memorize all the answers. Instead, you rely on a set of reference notebooks.
Each reference notebook is "tagged" with one or more questions. Guests ask all their questions at once, before the clerk answers. Once a guest has asked all their questions, the clerk is supposed to retrieve all the notebooks which are tagged with any of the questions.
The clerks can follow instructions in the notebooks perfectly (like a computer), and we don't care about how much cognitive labor the clerk has to do.
The "score" -- the only thing we care about -- is retrieval cost. This is just the total amount of text in all the notebooks retrieved to answer the questions. Or you can think of it as the total weight of the pile of notebooks the clerk has to retrieve.
We've got a probability distribution over what a guest might ask when they approach the information desk, so we can compute an expected score for a set of notebooks based on how well they organize information. (However, as I think I mentioned previously, Sam's theory doesn't end up being very sensitive to this probability distribution.)
Why would we ever tag a notebook with multiple questions? The idea is that such a notebook holds the common information between those two questions. Perhaps you have a notebook about fluid mechanics and a notebook about population dynamics; they both use differential equations, so your notebook on differential equations has a superset of the tags of both, so that a clerk will retrieve the differential equation notebook whenever they retrieve either fluid mechanics or population dynamics. You could have duplicated information about differential equations in both the other notebooks instead, but the duplicated information would worsen your score in cases where you're asked questions about both things. (This is why it is so crucial to Sam's theory that clerks might handle multiple questions at once, rather than dealing with one question at a time.)
So, the game we're playing is similar to compressing the answers individually, except we're incentivized to share information between answers, creating a "library" of useful abstractions.
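Here's a toy Python sketch of the retrieval-cost game (the notebook names, lengths, and tags are all invented for illustration; this is my simplification of the setup, not Sam's formalism):

```python
# Each notebook has a text length (its "weight") and a set of question tags.
shared_library = {
    "differential_equations": (40, {"fluids", "population"}),
    "fluid_mechanics":        (30, {"fluids"}),
    "population_dynamics":    (30, {"population"}),
}

# Alternative: duplicate the differential-equations material in both notebooks.
duplicated_library = {
    "fluid_mechanics_full": (70, {"fluids"}),
    "population_full":      (70, {"population"}),
}

def retrieval_cost(library, questions):
    """Total length of every notebook tagged with any of the asked questions."""
    return sum(length for length, tags in library.values() if tags & questions)

for questions in [{"fluids"}, {"population"}, {"fluids", "population"}]:
    print(questions,
          retrieval_cost(shared_library, questions),
          retrieval_cost(duplicated_library, questions))
# Asked a single question, both libraries cost 70.
# Asked both questions at once, sharing costs 100 while duplication costs 140.
```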
One of the notebooks can have "all the tags" (IE, clerks always retrieve that notebook). This is similar to the constant $c$ from algorithmic information theory; it tells the clerks what they need to know to interpret any other notebook in the library. (This includes information about how to use notebooks together, not just interpreting them individually.)
At the beginning, I said that condensation was a theory of latent variables, but I haven't mentioned them so far. That's because I've secretly been describing the algorithmic version of condensation, rather than the probabilistic version. The probabilistic version is what Sam actually describes in his paper. Here's a translation key:
| Algorithmic | Probabilistic |
|---|---|
| answers we might need to compute / "given strings" | given variables |
| uncondensed data / data from which the answers are to be computed | underlying probability space in which the givens live |
| notebooks / "latent strings" | latent variables |
| tags | contribution function |
| notebook with all the tags / program which says how to interpret the rest of the latent strings in service of computing answers | "top" latent (latent which contributes to all givens) |
| notebooks with one tag | "bottom" latents (latents which contribute to just one given variable) |
| score (length of all strings retrieved) | conditioned-perfect score |
Since this is a theory of how to postulate latent variables, it might be tempting to interpret Sam's "given variables" as the observations we abstract latents from. This isn't a terrible interpretation, but Sam avoided "observable variables" for a reason: factoring our observations into variables is already putting a conceptual frame on them. Moreover, the "givens" do not have to be directly observable. They can be variables we've come to believe in already.
I think it is better to interpret the given variables as questions of interest to us (or rather, the variables track answers to such questions).
As I mentioned previously, the algorithmic version deals in particulars where the probabilistic version deals in possibilities: the algorithmic givens are given strings (specific answers, EG the static information the clerks need to give guests at the information desk).
Random variables in Sam's paper are just maps from one probability space to another, giving us a "projection" (a limited perspective) of the first space. If the given variables are our pre-existing concepts, their shared probability space is the underlying reality they describe.
Shannon's information theory allows us to talk about the information of arbitrary things, while algorithmic information theory requires everything to already be in the format of bits. Similarly, probabilistic condensation treats the underlying space as raw, unfactored reality, but algorithmic condensation needs the uncondensed data to already be formatted as a string of 1s and 0s.
In the algorithmic version, the "latent strings" are the contents of the notebooks (or the "bins" I mentioned when talking about universal data-structures). In the probabilistic version, these become latent variables which we postulate to help "explain" the probabilistic structure of the givens. Ideally (in "perfect" condensation), these variables are all independent of each other.
Crucially, the latent variables can live in a larger underlying space than the givens. This represents the idea that new concepts need not be definable in terms of pre-existing concepts. When we come to understand something new about the world, it can actually expand our language, representing the postulation of more to the underlying reality than we previously understood.
In Sam's paper, the "contribution relation" takes the place of the "tags" in my notebook story. This relates latents to givens, telling us which latents are needed in order to reconstruct which givens (which notebooks are needed to answer which questions).
The notebook with all the tags is the "top" latent, IE, the variable which contributes to all the givens. This tells us common information between everything. For example, if the given variables are coin-flips from the same biased coin, the top latent would be the bias.
This can give us information which we need to properly interpret the impact of the rest of the latent variables (similarly to how the notebook with all the tags works), but unlike the algorithmic version, this information needs to be variable in order to be present in top (otherwise it would just be background knowledge, with no need to represent it in any latent). For example, the bias of the coin needs to be unknown in order to be represented in top.
Correspondingly, the "bottom" latents are those which contribute to just one given variable. This can be thought of as the remaining "noise" after we've abstracted out all the common information between things. In the coin-flip example, these would encode the individual coinflips.[13]
I believe John's theory of natural latents excludes these from consideration, marking one difference from Sam's theory. (A natural latent is information that's represented in many places.)
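To make the coin example concrete, here is a small Python sketch (my own simplification: I give the bias just two possible values so the entropies are easy to compute):

```python
import math
from itertools import product

def H(probs):
    """Shannon entropy (bits) of a finite distribution given as probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

bias_prior = {0.1: 0.5, 0.9: 0.5}   # hypothetical two-valued prior over the coin's bias
n = 4                               # number of coin flips (the given variables)

# Top latent: the bias itself -- the information common to every flip.
H_top = H(bias_prior.values())                                        # 1 bit

# Bottom latents: the per-flip "noise" that remains once the bias is known.
H_bottom = sum(pb * H([b, 1 - b]) for b, pb in bias_prior.items())    # ~0.469 bits each

# For comparison, the joint entropy of the flips themselves.
joint = {}
for flips in product([0, 1], repeat=n):
    joint[flips] = sum(pb * math.prod(b if f else 1 - b for f in flips)
                       for b, pb in bias_prior.items())

print(H_top, H_bottom, H_top + n * H_bottom, H(joint.values()))
# The top and bottom latents together carry slightly more entropy than the
# flips alone, because four flips never pin down the bias exactly.
```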
The "score" in algorithmic condensation is quite straightforward (the concatenated string-length for all strings retrieved), but it gets more complex for probabilistic condensation.
First, Sam defines the simple score. This is the sum of the entropies for latent variables which contribute to the given variables queried. Entropy is just expected information, and information is just a very good proxy for description length, so this roughly makes sense as just the expectation of the retrieval cost (we need to introduce an expectation because this is the probabilistic version).
There's a problem, however. The simple score computes each entropy individually. This amounts to an assumption that each notebook's encoding has to be decodable on its own.[14] This doesn't make very much sense given the scenario: "higher-level" notebooks (with more tags) are always retrieved when specific "lower-level" notebooks (with some of those tags) are retrieved. This means we can use the information in the higher-level notebooks to help encode the lower-level notebooks, making lower-level notebooks impossible to decode without the right higher-level notebooks (EG, fluid mechanics is impossible to understand without differential equations). Since we can do this, we absolutely should, because it gets us a better score.[15]
This nuance is accounted for in Sam's conditioned score, which sums conditional entropies rather than simple entropies, as appropriate to represent the above-described optimization. (Note that the conditioned score is always better than the simple score.)
Shannon's bound on compression must also hold for condensation: the expected code-length must be at least the expected information. This means the simple score and conditioned score must both be at least the expected information (aka joint entropy) of the questions being asked.
When the conditioned score is equal to the joint entropy, Sam calls this perfect condensation. When even the simple score is equal to the joint entropy, he calls this simply-perfect condensation.
Perfect condensation means that the latent variables contributing to a set of givens are exactly the information we need to reconstruct the givens. Imagine I'm examining a machine without disassembling it. I can look at the input/output behaviors of this machine (givens), and I'm trying to postulate an internal mechanism (latents). Perfect condensation means not postulating any extra hidden states beyond what is needed to explain (reconstruct) the input/output behavior.
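Here is a small worked example of these scores, based on my reading of the definitions above (the variables A, B, C and the two candidate condensations are invented for illustration, so take the details with a grain of salt):

```python
import math
from itertools import product

def H(dist):
    """Entropy (bits) of a dict mapping outcomes to probabilities."""
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)

# Invented underlying space: three independent fair bits A, B, C.
space = {(a, b, c): 1 / 8 for a, b, c in product([0, 1], repeat=3)}

def marginal(*coords):
    """Marginal distribution of the chosen coordinates (0=A, 1=B, 2=C)."""
    out = {}
    for outcome, p in space.items():
        key = tuple(outcome[i] for i in coords)
        out[key] = out.get(key, 0.0) + p
    return out

# Givens: X = (A, B) and Y = (A, C).  Query both at once.
joint_entropy = H(marginal(0, 1, 2))                            # H(X, Y) = 3 bits

# Condensation 1: top latent = A (tagged with X and Y),
#                 bottoms = B (tagged X) and C (tagged Y).
simple_1 = H(marginal(0)) + H(marginal(1)) + H(marginal(2))     # 1 + 1 + 1 = 3

# Condensation 2: same top, but the bottom for X redundantly stores (A, B).
# Its simple score double-counts A; its conditioned score does not, because
# that bottom is encoded given the top it is always retrieved with.
simple_2 = H(marginal(0)) + H(marginal(0, 1)) + H(marginal(2))                             # 4
conditioned_2 = H(marginal(0)) + (H(marginal(0, 1)) - H(marginal(0))) + H(marginal(2))     # 3

print(joint_entropy, simple_1, simple_2, conditioned_2)
# Condensation 1 hits the joint-entropy bound even with the simple score
# ("simply-perfect"); condensation 2 only hits it with the conditioned score.
```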
Sam's comparison theorem shows that systems of latent variables which are approximately perfect condensations of the same givens must approximately correspond to each other. Those who condense well must share similar perspectives on the world.
This is liable to have far-reaching consequences, but does it mean Sam has solved the interpretability problem?
Condensation has at least two plausible applications, which could improve our ability to interpret what machine-learning models are "thinking".
First, we could treat the internal activations of machine-learning models such as artificial neural networks as "givens" and try to condense them to yield more interpretable features.
Second, we could take condensation as inspiration and try to create new machine-learning models which resemble condensation, in the hopes that their structure will be more interpretable.
These avenues seem potentially promising, but should we think of condensation as a general theory of interpretability? Is this the theoretical tool we need to crack interpretability open?
I'll raise some concerns pointing at limitations.
While it is possible to create compression algorithms very close to the theoretical bounds articulated by information theory, it is not always possible to do so with condensation.
First, perfect condensation is not always possible (not even approximately). I'll skip the details for now (maybe I'll explain more in a future post), but this is basically because interaction information can be negative.
Second, the way compression smooshes everything together allows codes to take advantage of less-than-one-bit information by aggregating things. (If a coin is biased, observing a coin-flip gives us less than one bit of information in expectation. This allows us to compress a sequence of coinflips in a way that encodes more than one coin-flip per bit on average.) The way condensation splits things into latent variables doesn't always allow this. (If you have to encode a single coin-flip, you'll have to use a full bit, no matter how biased the coin.)
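A quick numerical illustration of the aggregation point (nothing here is specific to condensation):

```python
import math

p = 0.9                                                         # a heavily biased coin
h = p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))     # ~0.469 bits per flip

# Encoded one flip at a time, each flip still costs a whole bit.
per_flip_individual = 1.0

# Aggregated (e.g. with arithmetic coding), n flips cost about n * h bits.
n = 1000
print(n * per_flip_individual, n * h)   # 1000 bits vs ~469 bits
```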
Beyond these gaps in tightness, some of the modeling assumptions themselves concern me. First, condensation completely ignores computational cost. The only cost it models is retrieval cost. While this takes us one step in the direction of modeling how information is used, it might not be far enough to represent the sorts of abstractions humans and AIs use in practice. It still ignores the issue of bounded computational power, like so many other models of rationality.
Second, the tag-retrieval model seems artificial to me. Sam's assumptions -- that we can get multiple questions at once, and that we retrieve all notebooks tagged with any of the questions -- are quite important for the results. However, these assumptions are difficult to justify (other than via the results they're able to get).
When I look at condensation through my "universal data-structure" goggles, I want to lift these assumptions, replacing them with a more free-flowing information-retrieval model.
Asking multiple questions at once can be modeled as just a more complex question ("Tell me the capital of France and the capital of Germany"), which makes it tempting to discard the multiple-questions aspect.
As for the tags, perhaps the top instruction booklet gives me an algorithm by which I can decide which other notebooks to retrieve. (These other notebooks could recursively contain instructions for when to retrieve yet-more notebooks.)
However, under optimization pressure, this becomes trivial: I can simply compress the answer to each question individually, and retrieve only that.
This can be made nontrivial again if we reintroduce storage costs. Compressing the answer to each question individually, rather than taking advantage of common abstractions, can take quite a bit of extra storage.
This suggests modeling a trade-off between retrieval cost and storage cost. In the limit of caring only about storage cost, we compress everything together. In the limit of caring only about retrieval cost, we compress answers individually. In the middle, we might get a nontrivial theory of abstraction, somewhat similar to condensation.
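Here's a rough sketch of what such a weighted objective might look like. To be clear, this is my own toy framing, not part of Sam's theory, and all the lengths and probabilities are invented so that joint compression wins on storage while per-answer compression wins on retrieval:

```python
# Score a library by  alpha * storage  +  beta * expected retrieval cost.
def total_cost(library, question_dist, alpha, beta):
    storage = sum(length for length, _tags in library.values())
    retrieval = sum(prob * sum(length for length, tags in library.values()
                               if tags & questions)
                    for questions, prob in question_dist)
    return alpha * storage + beta * retrieval

# Hypothetical distribution over question-sets.
question_dist = [({"fluids"}, 0.45), ({"population"}, 0.45),
                 ({"fluids", "population"}, 0.10)]

shared   = {"diff_eq": (40, {"fluids", "population"}),   # middle ground: shared abstraction
            "fluids":  (30, {"fluids"}),
            "pop":     (30, {"population"})}
merged   = {"everything": (90, {"fluids", "population"})}   # everything compressed together
separate = {"fluids_full": (65, {"fluids"}),                # each answer compressed on its own
            "pop_full":    (65, {"population"})}

for name, lib in [("shared", shared), ("merged", merged), ("separate", separate)]:
    print(name,
          total_cost(lib, question_dist, alpha=1, beta=0),   # storage-only limit
          total_cost(lib, question_dist, alpha=0, beta=1))   # retrieval-only limit
# "merged" wins the storage-only limit, "separate" wins the retrieval-only
# limit, and "shared" is the compromise in between.
```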
Condensation is in its early days; there is a great deal of low-hanging research fruit here, whether it is working out further details of the theory of condensation, trying to apply it to machine learning, or working out alternatives by modifying condensation. I think it is quite exciting, and I hope you feel the same way.
Sam says his goals are somewhat different: he sees John's theory as too objective, suggesting a universally correct way to represent things (an underlying reality, with respect to abstractions). Sam's ideas are more intersubjective, admitting commonalities between agents without supposing a universally correct way of doing things.
However, condensation does not reflect Sam's anti-objective streak. The math, as it stands, provides conditions under which there is a correct set of latent variables to postulate (up to isomorphism).
We can also conclude that the function must be increasing in the number of configurations: if the knob has 27 settings, then I can still communicate letters using the knob, but I can no longer faithfully communicate knob-settings using letters.
(assuming the two objects can be freely configured in an independent way, IE, do not constrain each other)
For a precise argument, see the Wikipedia page on Information Content. There are also similar technical arguments for the concept of entropy; see Shannon's A Mathematical Theory of Communication or, for just the specific argument I'm referring to, here are some course notes.
This close relationship between compression and probability also implies a close relationship between compression and prediction: if you can compress data well, you must be able to predict it well.
This idea inherently assumes that the original events being encoded by the compression scheme were themselves already a sequence of 1s and 0s, unlike the example of Alice's birds used earlier.
For example, PNG is a common image compression format. We can take a bitmap image (which is already represented as 1s and 0s) and convert the file to PNG, which will usually make the file shorter (fewer 1s and 0s). This works because PNG takes advantage of some probabilistic beliefs about typical images, whereas the bitmap format does not.
(so long as it is Turing-universal)
This is not quite true. I haven't reviewed the relevant ideas, but Shannon's information theory does also capture one other objective: redundancy. My review only covered ideas relevant to communication across noiseless communication channels, where compression is the only goal. Shannon also studied communication across noisy channels, where you want your message to be adequately redundant, so that the intended message can be reconstructed with high probability (error-correcting codes).
This feature is apparent in natural languages, which tend to contain redundancies (such as subject-verb agreement) rather than compressing everything as efficiently as possible.
(when put to ordinary use, not as used in Algorithmic Information Theory)
I haven't covered that part of Algorithmic Information Theory here. I have in mind stuff like Levin Search, and similar things discussed by Ray Solomonoff in Progress in Incremental Machine Learning.
Granted, "universal data-structure" is just a vague idea, and perhaps poorly-named. Maybe this is closer to a "general theory of data-structures" IE something in the direction of taking a description of a set of use-cases & deriving the appropriate data-structure.
You can't really save any code-length for a single coin; no matter how biased it is (ie no matter how much fractional-bit information you have about it), you still have to write either 1 or 0 to record the bit. This is one way in which condensation is a less-tight abstraction. Or perhaps I should say: one way in which information is a less-tight abstraction when applied to condensation as opposed to compression. Information theory says that the information in a biased coin is less than one bit. This works out when we aggregate many coins together: we can anticipate that a coin biased towards 0 will create many runs of several 0s in a row (and far fewer runs of 1s). This allows us to compress (by representing runs of 0s more compactly). However, it seems to me, this is precisely the sort of compression which doesn't reflect our understanding of the data. If I understand correctly, condensation won't do this. We could condense several coin-flips together to encode them more efficiently, and it would help our score in cases where we are asked about (most of) those specific coinflips. Unfortunately, this could be a disaster for our score in cases where we're only asked about one or two coinflips in a bunch, but are forced to retrieve the condensation of the whole set. As you can see, this creates a high dependence on the distribution of sets of questions, which means it gets us into the territory where condensation isn't such a nice abstraction.
(I could be wrong about these details.)
What does it even mean to decode a notebook "individually"? After all: the point of the notebooks is that (if you've retrieved the right notebooks) you can collectively put them together and decode the answer to the questions you've been asked.
Well: recall that condensation is a theory of postulating latent variables. This does and should imply that a notebook is meaningful on its own. Decoding a notebook individually means figuring out what that notebook says about the state of the latent. Once we know the state of all the relevant latents, we put those together to figure out the state of the given variables being queried.
It might seem like this paragraph is confusing the algorithmic and probabilistic versions of condensation, since I'm using the algorithmic terminology to describe probabilistic condensation.
This is partly because I'm using the algorithmic version as a more concrete and intuitive analogy for what's going on in the probabilistic version, and partly because "description length" still makes sense in Shannon's information theory, so the notebook analogy is still apt.