What Does The Natural Abstraction Framework Say About ELK?

johnswentworth

Credit to Adam Shimi, Alex Flint, and Rob Miles for discussions, counterexamples, and general input to the ideas here.

Quick recap for anyone who didn’t read the hundred-page Eliciting Latent Knowledge document:

We have a diamond in a vault, with a bunch of automated theft-defenses.
We train a predictor to take in the vault’s video-stream and a plan for the vault’s actuators, and predict future video frames.
We train a planner to find plans that the predictor predicts will end with the video feed still showing a diamond in the vault.
We want some way for a human to probe the latent knowledge of the predictor, e.g. to check if the predictor expects a screen showing a diamond will be placed in front of the camera.

More generally, the central problem of ELK is to robustly extract whatever latent knowledge is inside of some predictive model (the diamond/vault thing is just an example).

The general version of this problem is one of the main intended use-cases of the natural abstraction framework: the hypothesis is that the sort of things humans recognize as “things” are natural abstractions, so by looking for natural abstractions in the predictor’s model we can find human-legible latent structure. So, what does natural abstraction have to say about ELK?

First and foremost: the natural abstraction framework is still under development. There are some powerful theorems, but there’s still a fair bit of legwork to be done before we can e.g. directly calculate the abstractions used by a trained predictor. We do at least have enough of the math in place that we can sketch out what it will probably look like, once the framework is ready for prime time, and that sketch is the purpose of this post.

Setup

At the level of abstraction needed for our purposes, we can think of the predictor as a probability distribution $P[\text{Observations} \mid \text{Actions}]$. This is not the “real” distribution in any sense; it is the predictor’s distribution. We will generally talk about the natural abstractions within that distribution. That means the abstractions we talk about will be the predictor’s natural abstractions, i.e. the abstract summaries implicitly used in its predictions.

Note that Observations and Actions are taken to be the observations and actions within a single episode. We can also think about the predictor’s distribution of all Observations and Actions over all episodes; we may talk about that occasionally, but won’t introduce any particular notation for it.

What Kinds Of Things Are Natural Abstractions In The Vault?

One reasonable summary of the core theorems of the natural abstraction framework: given a distribution over a large number of random variables $X_{1} \dots X_{n}$, we can compute a (lower-entropy) summary $F(X)$ such that most variables are approximately independent given $F$.
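
One rough way to write that condition down, as a sketch of the informal summary above rather than the exact statement of the theorems: for most pairs of variables $i \neq j$,

$$P[X_i, X_j \mid F(X)] \approx P[X_i \mid F(X)] \, P[X_j \mid F(X)]$$

or, equivalently, the conditional mutual information $I(X_i; X_j \mid F(X))$ is close to zero for most pairs.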

What would this abstract math of abstraction look like in the context of the diamond example?

Our random variables $X_{1} \dots X_{n}$ might be frames from the video feed. These frames are of course correlated: the video feed shows a diamond in a vault, and it’s the same diamond in each frame (at least in the pre-action part of the feed), so all the parts of each frame showing the diamond will have real-diamond induced correlations. If there’s a tiny imperfection on one part of the diamond visible in one frame, then I expect that tiny imperfection to be visible on the corresponding part of the diamond in other frames.

Assuming we have a lot of video frames, the natural abstraction framework then allows us to extract some summary $F(X)$ which would include things visible in lots of frames, including the appearance of the diamond itself, the appearance of the vault, and the appearance of any tiny imperfections in the diamond which are visible in many frames. Conditional on that summary, most video frames are independent - e.g. there might be some minor thing visible in frame 192 which also shows up in frame 193, but it will only be in a handful of frames and will not be relevant to the rest of the frames (otherwise it would be included in the summary $F(X)$).
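
To make that concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration, not part of the framework's actual machinery: each "frame" is a single number equal to a shared latent (standing in for the diamond's appearance) plus independent noise, and the summary $F(X)$ is simply the average over frames. The names `sample_episode`, `summary`, and `pearson` are hypothetical helpers.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def sample_episode(rng, n_frames=50, noise=0.5):
    """One episode: every frame sees the same latent, plus its own noise."""
    latent = rng.gauss(0.0, 1.0)  # toy stand-in for "the diamond's appearance"
    return [latent + rng.gauss(0.0, noise) for _ in range(n_frames)]

def summary(frames):
    """Toy summary F(X): the information shared across frames (their mean)."""
    return sum(frames) / len(frames)

rng = random.Random(0)
episodes = [sample_episode(rng) for _ in range(2000)]

# Correlation between frame 0 and frame 1 across episodes, before conditioning.
raw_corr = pearson([e[0] for e in episodes], [e[1] for e in episodes])

# Same two frames after subtracting the summary of their own episode.
resid_corr = pearson(
    [e[0] - summary(e) for e in episodes],
    [e[1] - summary(e) for e in episodes],
)

print(f"raw correlation:      {raw_corr:.3f}")
print(f"residual correlation: {resid_corr:.3f}")
```

In this toy, the two raw frames should come out strongly correlated, while the residuals after removing the summary should be close to uncorrelated. For jointly Gaussian variables like these, that residual correlation is the conditional correlation given the summary, so it is a (very simplified) instance of "most frames are independent given $F(X)$".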

Of course, it’s not just abstractions-across-video-frames. Even within one frame, many patches of the wall may look similar; that shared appearance is a natural abstraction. Or many patches of the diamond may look similar; their shared appearance is another natural abstraction. The lighting may vary in a predictable way, such that we could reconstruct the lighting of the whole room from just a few patches; that lighting is a natural abstraction. Seeing just a few patches of the diamond may allow us to reconstruct the diamond’s position and orientation, and then use those to guess what other patches are visible and how they appear; the position and orientation are a natural abstraction. Whenever some information is relevant to a large number of places-and-times in the video feed, it’s a natural abstraction.

So What Would “The Diamond” Be, As A Natural Abstraction?

“The diamond” is visible as a similar-looking chunk of pixels in many different video frames. If we can see the diamond in one frame, and black out the area where the diamond is in many other frames, then we can make a pretty good guess about what it looks like in the latter frames using the information from the first. So that’s the abstraction of “the diamond” as a geometric pattern, i.e. the shape and optical properties of the diamond.

There’s also the diamond’s position and orientation: given the diamond’s shape and optical properties, we still need its position and orientation (relative to the camera) at a given time to guess what the pixels will actually look like. Conditional on the diamond’s position and orientation (and maybe a few other background variables, like e.g. lighting), the pixels showing many different little parts of the diamond can all be predicted roughly-independently of each other. So position and orientation of the diamond at each time are another natural abstraction.

Moving up to higher-level abstractions, the diamond’s position and orientation at two adjacent timesteps let us estimate the diamond’s position and orientation at the next timestep; that’s physics. That relationship between position and orientation in three sequential frames would itself be a higher-level natural abstraction: we could estimate it from many different sets of three sequential frames.
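
As a toy illustration of estimating that relationship from many sets of three sequential frames: assume, purely for the example, a one-dimensional position moving at constant velocity with a little observation noise, and fit a linear rule $p_{t+1} \approx a \, p_t + b \, p_{t-1}$ by least squares over all triples from many trajectories. The constant-velocity dynamics and the helper names (`sample_trajectory`, `triples`, `fit_dynamics`) are assumptions of this sketch.

```python
import random

def sample_trajectory(rng, length=20, noise=0.01):
    """Noisy observations of a position moving at constant velocity."""
    start = rng.gauss(0.0, 1.0)
    velocity = rng.gauss(0.0, 1.0)
    return [start + velocity * t + rng.gauss(0.0, noise) for t in range(length)]

def triples(trajectory):
    """All (previous, current, next) position triples in one trajectory."""
    return [
        (trajectory[t - 1], trajectory[t], trajectory[t + 1])
        for t in range(1, len(trajectory) - 1)
    ]

def fit_dynamics(all_triples):
    """Least-squares fit of next ~= a * current + b * previous (no intercept)."""
    s_cc = s_cp = s_pp = t_c = t_p = 0.0
    for prev, cur, nxt in all_triples:
        s_cc += cur * cur
        s_cp += cur * prev
        s_pp += prev * prev
        t_c += cur * nxt
        t_p += prev * nxt
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s_cc * s_pp - s_cp * s_cp
    a = (t_c * s_pp - t_p * s_cp) / det
    b = (t_p * s_cc - t_c * s_cp) / det
    return a, b

rng = random.Random(0)
data = [tr for _ in range(200) for tr in triples(sample_trajectory(rng))]
a, b = fit_dynamics(data)
print(f"next ~= {a:.3f} * current + {b:.3f} * previous")
```

For constant-velocity motion the underlying rule is $p_{t+1} = 2 p_t - p_{t-1}$, so the fitted coefficients should land near 2 and -1. The point of the toy is that the same rule is recoverable from many different triples, which is what makes it a candidate higher-level abstraction.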

That last example is especially crucial when we want to query what the predictor expects to happen “off-screen”, e.g. whether the real diamond is still in the vault after a screen is placed in front of the camera. The fundamental barrier to answering this sort of question is that, if the predictor is computationally efficient, it might just not calculate what’s happening off-screen at all. The predictor just needs to predict what the screen will show, and it doesn’t necessarily need to consider whether the real diamond is still in the vault in order to do so; its representation of the world could be some lazy data structure in which it only computes what it actually needs for video feed predictions. But if the predictor’s natural abstractions include information about the dynamics of the world, then we can potentially “play the simulation forward” within the predictor’s implicit world-model, even if the predictor itself isn’t actually running that calculation.

How Would This Play With A Screen In Front Of The Camera?

Let’s consider the screen-problem more generally: imagine that the camera is (predicted to be) covered by a screen showing a simulated diamond after some timestep. What would that do to the natural abstraction of “the diamond”?

Well, if the simulated-diamond on screen is anything less than a perfect simulation of the actual diamond, then it will be a different natural abstraction. Consider the shape-and-optical-properties part of the diamond-abstraction, for instance. If we see a little imperfection in the diamond at one of the earlier timesteps (when the camera is still unscreened), then that imperfection is added to our information about the abstract “diamond”; we expect to see that same imperfection in other frames showing the same abstract “diamond”. If the diamond-on-the-screen lacks that imperfection, then it diverges from the original diamond’s abstraction; the screen-diamond may still be internally consistent with itself (in which case it is itself a natural abstraction), but it will be separate.

One step closer to the actual math: if the predictor expects that some imperfection might show up in the earlier video frames, and that imperfection would not give us any evidence about later video frames, then that indicates that any apparent “diamond” in the later frames is generated by a different abstraction than the apparent “diamond” in the earlier frames.
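
A toy version of that check, under loud assumptions: the "predictor" here is just a hand-written sampler (`sample_pair`, hypothetical), the imperfection is a single bit observed with a small flip probability in both an early and a late frame, and "evidence" is measured as the plug-in mutual information between the two observations. In the screened case, the late frame shows a different diamond whose imperfection bit is drawn independently.

```python
import math
import random
from collections import Counter

def sample_pair(rng, screened, flip=0.05):
    """Return (early_obs, late_obs) of an imperfection bit for one episode."""
    flaw = rng.random() < 0.5                 # does the real diamond have the flaw?
    early = flaw != (rng.random() < flip)     # noisy view in an unscreened frame
    # Assumption: a screen shows a separate diamond with its own independent flaw bit.
    late_source = (rng.random() < 0.5) if screened else flaw
    late = late_source != (rng.random() < flip)
    return early, late

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

rng = random.Random(0)
mi_real = mutual_information([sample_pair(rng, screened=False) for _ in range(5000)])
mi_screen = mutual_information([sample_pair(rng, screened=True) for _ in range(5000)])
print(f"unscreened: {mi_real:.3f} bits, screened: {mi_screen:.4f} bits")
```

When the later frames show the same diamond, the early imperfection carries substantial information about them; when they show the screen-diamond, the estimate should be close to zero. A real predictor would of course not hand us a clean "imperfection bit", so this only illustrates the shape of the test.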

That’s as close as we’ll come to specifying an ELK strategy in this post; see this post for hacky ideas generated by approximating this one. The rest of this post will talk about the still-unsolved barriers to operationalization of the natural abstraction approach to the ELK problem.

What Steps Do We Not Yet Know How To Do?

From the predictor’s distribution $P[\text{Observations} \mid \text{Actions}]$, we can in principle directly compute a natural abstraction summary function $F(X)$ via the resampling method.

Other than computational limitations, the biggest open problem I know of to make direct calculation of $F(X)$ useful for something like ELK is that we’d just get a single summary which encodes all the natural abstractions simultaneously. It wouldn’t separate out “the wall” from “the diamond” from “the lighting”, or “the diamond’s shape” from “the diamond’s position at each time”. That would all just be mixed together in one fairly-arbitrary representation; we need some way to factor out the “parts” of the abstract information. We might be able to tell that e.g. a screen-diamond has been added to our ontology separate from the real-diamond just by counting the number of possible values of $F$, but that could easily be messed up by other things changing; what we really want is to factor out those conceptually-separate abstract “things”.

(Meta: this paragraph requires the resampling post as background.) I expect that the factorization of the summary (i.e. breaking $F (X)$ into diamond, wall, lighting, etc) will drop out naturally from varying free parameters in the resampling process. In practice, any abstract summary will only be approximately conserved as we run a resampler, so even the approximately-conserved information $F (X)$ will be lost over some resampler-timescale. By varying the relative frequency at which we resample variables, different information will be lost faster or slower - e.g. if we resample pixels-showing-the-diamond more often, then information about the diamond will be lost faster. This works exactly to the extent that the information is mostly in some specific variables and not others - e.g. information about the diamond is mostly in pixels showing the diamond, so resampling those more often changes info-loss timescales in a way that e.g. resampling odd-indexed pixels more often does not. Where the pure resampling-based summary gives us a strictly “global” summary, this approach would bring back a notion of locality at the abstract level.
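
Here is a small toy sketch of the frequency-varying idea, with the caveat that it is an illustration under made-up assumptions and not the worked-out math. The model is two blocks of ±1 "pixels" (block A and block B), coupled within each block and not across blocks, so each block carries its own shared information. A Gibbs-style resampler picks one variable at a time according to `weights` and resamples it conditional on the rest. The model, the coupling strength, and the helper names are all hypothetical.

```python
import math
import random

BLOCK = 10       # variables per block: A is indices 0..9, B is indices 10..19
COUPLING = 0.5   # hypothetical within-block coupling strength

def resample_once(spins, weights, rng):
    """Pick one variable (by weight) and resample it given the others."""
    i = rng.choices(range(len(spins)), weights=weights)[0]
    start = (i // BLOCK) * BLOCK
    others = sum(spins[start:start + BLOCK]) - spins[i]
    field = COUPLING * others / (BLOCK - 1)
    p_up = 1.0 / (1.0 + math.exp(-2.0 * field))
    spins[i] = 1 if rng.random() < p_up else -1

def retained_signal(weights, steps=50, runs=1000, seed=0):
    """Start both blocks at all +1; return the mean block averages after resampling."""
    rng = random.Random(seed)
    total_a = total_b = 0.0
    for _ in range(runs):
        spins = [1] * (2 * BLOCK)
        for _ in range(steps):
            resample_once(spins, weights, rng)
        total_a += sum(spins[:BLOCK]) / BLOCK
        total_b += sum(spins[BLOCK:]) / BLOCK
    return total_a / runs, total_b / runs

# Resample block A four times as often as block B.
block_weights = [4] * BLOCK + [1] * BLOCK
a_block, b_block = retained_signal(block_weights)

# Resample odd-indexed variables four times as often, regardless of block.
parity_weights = [4 if i % 2 else 1 for i in range(2 * BLOCK)]
a_parity, b_parity = retained_signal(parity_weights)

print(f"block-weighted:  A={a_block:.2f}  B={b_block:.2f}")
print(f"parity-weighted: A={a_parity:.2f}  B={b_parity:.2f}")
```

In this toy, weighting by block should make block A forget its starting state much faster than block B, while weighting by index parity treats the two blocks symmetrically and so should leave them at about the same level. That mirrors the claim above: the reweighting only separates out information to the extent that the information lives mostly in the variables being reweighted.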

… but I don’t have the math worked out for all that yet, it’s just an “obvious” next step. (In practice that step may involve yet another new-but-equivalent conceptualization of natural abstractions; both information-at-a-distance and resampling feel like not-quite-the-right setups for factorization of the summary.) Hopefully I’ll work it out over the next few months.
