In some sense, any new measurement device automates a part of research. A thermometer automates the task of sticking one's finger in something to check how hot it is, a scale automates the task of holding something to check how heavy it is, etc. The automated version is not only more convenient, but more precise and reproducible (as is usually the case when automating things). For alignment, one analogue might be interpretability tools which automate the work done when some human looks at a part of a neural net and sees what it's doing.

Let's take that last example and dig into it a bit more: interpretability tools which automate the work done when some human looks at a part of a net and sees what it's doing. We want to leverage the analogy to other measurement tools, like a thermometer or a scale, to better understand automation of interpretability.

Here's one type of proposal I hear a lot: to automate interpretability, have some human researchers look at parts of a net, poke at them, and write up an explanation of how they're interpreting it. Collect data from many such instances, and train a neural net to take net-parts and produce explanations.

We want to leverage the analogy to thermometers or scales, so what would be the analogous strategy for making a thermometer or scale? Well, have a bunch of humans stick their fingers in a bunch of stuff and report how hot the stuff is, then train a neural net to replicate the humans' hotness-reports. Or, have a bunch of humans hold things and report how heavy they are, then train a net to replicate the humans' heaviness-reports.

Hopefully it is obvious that the "train a net to replicate human reports" results would not be nearly as useful, for purposes of scientific progress, as actual thermometers or scales. But what's missing? And how can we carry that insight back to the interpretability problem?

The thermometer has two great powers: a simple legible data type, and reproducibility. First, simple legible data type: the thermometer's output is a single number (the temperature), and we can compare that number with other thermometer-outputs. That's a kind-of-thing for which we have very precise mathematical understanding: we know exactly what kinds-of-things we can do with these numbers, we have a nice general representation, we're confident that different people mean the same thing by numbers, etc. This is in contrast to natural language, which is typically ambiguous, doesn't necessarily make it obvious what we can do, leads to frequent miscommunication, etc.

Second, reproducibility: if the thermometer says X is hotter than Y, then when I put X and Y in contact, X gets cooler and Y gets hotter (all else equal). I can use the thermometer to rank hotness of a bunch of things, sort them by thermometer reading, and consistently (approximately-deterministically) find that the things on the hotter end feel hotter than the things on the colder end. This is what makes the single-number output (temperature) actually useful: it approximately-deterministically predicts some stuff, across a broad range of contexts, based on just those simple numbers.
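The two powers above can be made concrete with a toy simulation (my own illustration, not from the post): each object has a hidden "hotness", the thermometer reports a single rounded number, and that one number suffices to predict the direction of heat flow for every pair of objects put in contact.

```python
import random

random.seed(0)

# Hidden physical state: each object's true temperature.
objects = {name: random.uniform(0.0, 100.0) for name in "ABCDEF"}

def thermometer(name):
    # Simple legible data type: one number out, rounded to the
    # instrument's precision.
    return round(objects[name])

# Rank objects coldest -> hottest using only thermometer readings.
ranked = sorted(objects, key=thermometer)

def contact(x, y):
    # Ground-truth physics: heat flows from the hotter object to the
    # colder one (all else equal).
    return "x cools, y warms" if objects[x] > objects[y] else "x warms, y cools"

# Reproducibility: the single-number readings approximately-
# deterministically predict the contact outcome for every pair.
for i, colder in enumerate(ranked):
    for hotter in ranked[i + 1:]:
        assert contact(hotter, colder) == "x cools, y warms"
```

The point of the sketch is that nothing about natural-language hotness-reports gives you this: the value comes from the fact that a single number per object, measured the same way every time, predicts an outcome across all pairs and contexts.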

Exercise for the reader: walk through the same analogy for a scale.

What would be the analogue of a thermometer for an interpretability tool? Well, something we can point at part of a net and get back a simple legible output, one which approximately-deterministically predicts some stuff across a broad range of contexts.

When you look at it like that, it's clear that building a measurement tool like a scale or thermometer is itself a task which requires scientific insight. It requires finding some approximately-deterministically-reproducible pattern, which can be predicted by some simple legible summary data. Just picking some ad-hoc measurement convention (e.g. reporting hotness using natural language) fails to capture most of the value. Most of the value isn't in the automation work itself; it's in noticing the reproducible pattern.

I'm generally pretty pessimistic about research-automation proposals analogous to "train a net to mimic a human sticking their finger in some stuff and reporting hotness". But I am much more optimistic about things analogous to "notice that a single number for each subsystem allows us to reproducibly predict which things get hotter/colder when in contact".


I think this is going to be wrong as an approach. Weight and temperature are properties of physical systems at specific points in time, and can be measured coherently because we understand laws about those systems. Alignment could be measured as a function of a particular system at a specific point in time, once we have a clear understanding of what? All of human values? 

I'm not arguing that "alignment" specifically is the thing we should be measuring.

More generally, a useful mantra is "we do not get to choose the ontology". In this context, it means that there are certain things which are natural to measure (like temperature and weight), and we do not get to pick what they are; we have to discover what they are.

That's correct. My point is that measuring goals which are not natural to measure will, in general, have many more problems with Goodharting and similar misoptimization and overoptimization pressures. Other approaches can be more productive, or at least more care is needed with the design of metrics, as opposed to the discovery of what to measure and how.