Why Agent Foundations? An Overly Abstract Explanation

Let’s say you’re relatively new to the field of AI alignment. You notice a certain cluster of people in the field who claim that no substantive progress is likely to be made on alignment without first solving various foundational questions of agency. These sound like a bunch of weird pseudophilosophical questions, like “what does it mean for some chunk of the world to do optimization?”, or “how does an agent model a world bigger than itself?”, or “how do we ‘point’ at things?”, or in my case “how does abstraction work?”. You feel confused about why otherwise-smart-seeming people expect these weird pseudophilosophical questions to be unavoidable for engineering aligned AI. You go look for an explainer, but all you find is bits and pieces of worldview scattered across many posts, plus one post which does address the question but does so entirely in metaphor. Nobody seems to have written a straightforward explanation for why foundational questions of agency must be solved in order to significantly move the needle on alignment.

This post is an attempt to fill that gap. In my judgment, it mostly fails; it explains the abstract reasons for foundational agency research, but in order to convey the intuitions, it would need to instead follow the many paths by which researchers actually arrive at foundational questions of agency. But a better post won’t be ready for a while, and maybe this one will prove useful in the meantime.

Note that this post is not an attempt to address people who already have strong opinions that foundational questions of agency don't need to be answered for alignment; it's just intended as an explanation for those who don't understand what's going on.

Starting Point: The Obvious Stupid Idea

Let’s start from the obvious stupid idea for how to produce an aligned AI: have humans label policies/plans/actions/outcomes as good or bad, and then train an AI to optimize for the good things and avoid the bad things. (This is intentionally general enough to cover a broad range of setups; if you want something more specific, picture RL from human feedback.)

Assuming that this strategy could be efficiently implemented at scale, why would it not produce an aligned AI?

I see two main classes of problems:

  1. In cases where humans label bad things as “good”, the trained system will also be selected to label bad things as “good”. In other words, the trained AI will optimize for things which look “good” to humans, even when those things are not very good.
  2. The trained system will likely end up implementing strategies which do “good”-labeled things in the training environment, but those strategies will not necessarily continue to do the things humans would consider “good” in other environments. The canonical analogy here is to human evolution: humans use condoms, even though evolution selected us to maximize reproductive fitness.

Note that both of these classes of problems are very pernicious: in both cases, the trained system’s results will look good at first glance.

Neither of these problems is obviously all that bad. In both cases, the system is behaving at least approximately well, at least within contexts not-too-different-from-training. These problems don’t become really bad until we apply optimization pressure, and Goodhart kicks in.

Goodhart’s Law

There’s a story about a Soviet nail factory. The factory was instructed to produce as many nails as possible, with rewards for high numbers and punishments for low numbers. Within a few years, the factory was producing huge numbers of nails - tiny useless nails, more like thumbtacks really. They were not very useful for nailing things.

So the planners changed the incentives: they decided to reward the factory for the total weight of nails produced. Within a few years, the factory was producing big heavy nails, more like lumps of steel really. They were still not very useful for nailing things.

This is Goodhart’s Law: when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.

In everyday life, if something looks good to a human, then it is probably actually good (i.e. that human would still think it’s good if they had more complete information and understanding). Obviously there are plenty of exceptions to this, but it works most of the time in day-to-day dealings. But if we start optimizing really hard to make things look good, then Goodhart’s Law kicks in. We end up with instagram food - an elaborate milkshake or salad or burger, visually arranged like a bouquet of flowers, but impractical to eat and kinda mediocre-tasting.

Returning to our two alignment subproblems from earlier:

  1. In cases where humans label bad things as “good”, the trained system will also be selected to label bad things as “good”. In other words, the trained AI will optimize for things which look “good” to humans, even when those things are not very good.
  2. The trained system will likely end up implementing strategies which do “good”-labeled things in the training environment, but those strategies will not necessarily continue to do the things humans would consider “good” in other environments. The canonical analogy here is to human evolution: humans use condoms, even though evolution selected us to maximize reproductive fitness.

Goodhart in the context of problem (1): train a powerful AI to make things look good to humans, and we have the same problem as instagram food, but with way more optimization power applied. Think “Potemkin village world” - a world designed to look amazing, but with nothing behind the facade. Maybe not even any living humans behind the facade - after all, even generally-happy real humans will inevitably sometimes put forward appearances which would not appeal to the “good”/“bad”-labellers.

Goodhart in the context of problem (2): pretend our “good”/“bad” labels are perfect, but the system ends up optimizing for some target which doesn’t quite track our “good” labels, especially in new environments. Then that system ends up optimizing for whatever proxy it learned; we get the AI-equivalent of humans wearing condoms despite being optimized for reproductive fitness. And the AI then optimizes for that really hard.

Now, we’ve only talked about the problems with one particular alignment strategy. (We even explicitly picked a pretty stupid one.) But we’ve already seen the same basic issue come up in two different subproblems: Goodhart’s Law means that proxies which might at first glance seem approximately-fine will break down when lots of optimization pressure is applied. And when we’re talking about aligning powerful future AI, we’re talking about a lot of optimization pressure. That’s the key idea which generalizes to other alignment strategies: crappy proxies won’t cut it when we start to apply a lot of optimization pressure.
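
To make this concrete, here is a minimal toy sketch (my own illustration, not from the post): a proxy which tracks the true objective closely inside a training-like region but diverges outside it. As the search space widens - i.e. as more optimization pressure is applied - maximizing the proxy finds points where the proxy score is enormous and the true utility is terrible. All functions and constants here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_utility(x):
    # How good the outcome actually is: best at x = 2, worse as we move away.
    return -(x - 2.0) ** 2

def proxy(x):
    # Agrees closely with true_utility for |x| <= 3 (the "training regime"),
    # but generalizes badly outside it.
    return true_utility(x) + 0.05 * np.maximum(np.abs(x) - 3.0, 0.0) ** 3

for radius in (3, 10, 30, 100):  # wider search = more optimization pressure
    xs = rng.uniform(-radius, radius, size=200_000)
    best = xs[np.argmax(proxy(xs))]  # optimize the proxy, not the true utility
    print(f"search radius {radius:>3}: proxy-optimal x = {best:8.2f}, "
          f"true utility there = {true_utility(best):10.2f}")
```

With a narrow search, the proxy-optimum lands near the true optimum; with a wide search, the optimizer runs off to the edge of the space, where the proxy is huge and the true utility is strongly negative - the toy version of thumbtack-nails and instagram food.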

Goodhart Is Not Inevitable

Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel. We design the system so that the leaked radio signal has zero correlation with whatever signals are passed around inside the system.

Some time later, a clever adversary is able to use the radio side-channel to glean information about those internal signals using fourth-order statistics. Zero correlation was an imperfect proxy for zero information leak, and the proxy broke down under the adversary’s optimization pressure.

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.
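
A small sketch of the gap between the two design criteria (my own example, not from the post): an internal signal and a leaked signal can have essentially zero correlation while still sharing plenty of information. The specific signals below are invented for illustration; the mutual-information estimate is a simple plug-in calculation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Internal signal: uniform over {-1, 0, +1}.
internal = rng.integers(-1, 2, size=100_000)
# Leaked radio signal: a nonlinear function of the internal signal.
leak = internal ** 2

# Correlation is ~0, so a "zero correlation" design criterion looks satisfied...
print("correlation:", np.corrcoef(internal, leak)[0, 1])

# ...but the mutual information is clearly positive: the leak reveals whether
# the internal signal was 0 or not, which an adversary can exploit.
def mutual_information_bits(x, y):
    """Plug-in estimate of I(X;Y) in bits from paired samples of discrete variables."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

print("mutual information (bits):", mutual_information_bits(internal, leak))
```

Zero correlation only rules out a linear relationship, and higher-order statistics (like the fourth-order statistics in the story) can still leak everything; zero mutual information rules out any statistical relationship at all, which is why it holds up under arbitrary optimization pressure.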

Many people have an intuition like “everything is an imperfect proxy; we can never avoid Goodhart”. The point of the mutual information example is that this is basically wrong. Figuring out the True Name of a thing, a mathematical formulation sufficiently robust that one can apply lots of optimization pressure without the formulation breaking down, is absolutely possible and does happen. That said, finding such formulations is a sufficiently rare skill that most people will not ever have encountered it firsthand; it’s no surprise that many people automatically assume it impossible.

This is (one framing of) the fundamental reason why alignment researchers work on problems which sound like philosophy, or like turning philosophy into math. We are looking for the True Names of various relevant concepts - i.e. mathematical formulations robust enough that they will continue to work as intended even under lots of optimization pressure.

Aside: Accidentally Stumbling On True Names

You may have noticed that the problem of producing actually-good nails has basically been solved, despite all the optimization pressure brought to bear by nail producers. That problem was solved mainly by competitive markets and reputation systems. And it was solved long before we had robust mathematical formulations of markets and reputation systems.

Or, to reuse the example of mutual information: one-time pad encryption was intuitively obviously secure long before anyone could prove it.

So why do we need these “True Names” for alignment?

We might accidentally stumble on successful alignment techniques. (Alignment By Default is one such scenario.) On the other hand, we might also fuck it up by accident, and without the True Name we’d have no idea until it’s too late. (Remember, our canonical failure modes still look fine at first glance, even setting aside the question of whether the first AGI fooms without opportunity for iteration.) Indeed, people did historically fuck up markets and encryption by accident, repeatedly and to often-disastrous effect. It is generally nonobvious which pieces are load-bearing.

Aside from that, I also think the world provides lots of evidence that we are unlikely to accidentally stumble on successful alignment techniques, as well as lots of evidence that various specific classes of things which people suggest will not work. This evidence largely comes from failure to solve analogous existing problems “by default”. That’s a story for another post, though.

What “True Names” Do We Want/Need For Alignment?

What kind of “True Names” are needed for the two alignment subproblems discussed earlier?

  1. In cases where humans label bad things as “good”, the trained system will also be selected to label bad things as “good”. In other words, the trained AI will optimize for things which look “good” to humans, even when those things are not very good.
  2. The trained system will likely end up implementing strategies which do “good”-labeled things in the training environment, but those strategies will not necessarily continue to do the things humans would consider “good” in other environments. The canonical analogy here is to human evolution: humans use condoms, even though evolution selected us to maximize reproductive fitness.

In the first subproblem, our “good”/“bad” labeling process is an imperfect proxy of what we actually want, and that proxy breaks down under optimization pressure. If we had the “True Name” of human values (insofar as such a thing exists), that would potentially solve the problem. Alternatively, rather than figuring out a “True Name” for human values directly, we could figure out a “pointer” to human values - something from which the “True Name” of human values could be automatically generated (analogous to the way that a True Name of nail-value is implicitly generated in an efficient market). Or, we could figure out the “True Names” of various other things as a substitute, like “do what I mean” or “corrigibility”.

In the second subproblem, the goals in the trained system are an imperfect proxy of the goals on which the system is trained, and that proxy breaks down when the trained system optimizes for it in a new environment. If we had the “True Names” of things like optimizers and goals, we could inspect a trained system directly to see if it contained any “inner optimizer” with a goal very different from what we intended. Ideally, we could also apply such techniques to physical systems like humans, e.g. as a way to point to human values.

Again, this is only one particular alignment strategy. But the idea generalizes: in order to make alignment strategies robust to lots of optimization pressure, we typically find that we need robust formulations of some intuitive concepts, i.e. “True Names”.

Regardless of the exact starting point, seekers of “True Names” quickly find themselves recursing into a search for “True Names” of lower-level components of agency, like:

  • Optimization
  • Goals
  • World models
  • Abstraction
  • Counterfactuals
  • Embeddedness

Aside: Generalizability

Instead of framing all this in terms of Goodhart’s Law, we could instead frame it in terms of generalizability. Indeed, Goodhart’s Law itself can be viewed as a case/driver of generalization failure: optimization by default pushes things into new regimes, and Goodhart’s Law consists of a proxy failing to generalize as intended into those new regimes.

In this frame, a “True Name” is a mathematical formulation which robustly generalizes as intended.

That, in turn, suggests a natural method to search for and recognize “True Names”. In some sense, they’re the easiest possible things to find, because they’re exactly the things which show up all over the place! We should be able to look at many different instances of some concept, and abstract out the same “True Name” from any of them.

Of course, the real acid test of a “True Name” is to verify, both empirically and mathematically, that systems which satisfy the conditions of the Name also have the other properties which one intuitively expects of the concept. Then we have a clear idea of just how robustly the formulation generalizes as intended.

Summary

We started out from one particular alignment strategy - a really bad one, but we care mainly about the failure modes. A central feature of the failure modes was Goodhart’s Law: when a proxy is used as an optimization target, it ceases to be a good proxy for the thing it was intended to measure. Some people would frame this as the central reason why alignment is hard.

Fortunately, Goodhart is not inevitable. It is possible to come up with formulations which match our concepts precisely enough that they hold up under lots of optimization pressure; mutual information is a good example. This is (one frame for) why alignment researchers invest in pseudophilosophical problems like “what are agents, mathematically?”. We want “True Names” of relevant concepts, formulations which will robustly generalize as intended.

Thank you to Jack, Eli, and everyone who attended our discussion last week, which led to this post.

Comments

I mostly agree with this post.

Figuring out the True Name of a thing, a mathematical formulation sufficiently precise that one can apply lots of optimization pressure without the formulation breaking down, is absolutely possible and does happen.

Precision feels pretty far from the true name of the important feature of true names. I am not quite sure what precision means, but on one definition, precision is the opposite of generality, and true names seem anti-precise. I am not saying precision is not a virtue, and it does seem like precision is involved. (Like precision on some meta level, maybe?)

The second half about robustness to optimization pressure is much closer, but still not right. (I think it is a pretty direct consequence of true names.) It is clearly not yet a true name, in the same way that "It is robust to people trying to push it" is not the true name of inertia.

Precision feels pretty far from the true name of the important feature of true names

You're right, I wasn't being sufficiently careful about the wording of a bolded sentence. I should have said "robust" where it said "precise". Updated in the post; thank you.

Also I basically agree that robustness to optimization is not the True Name of True Names, though it might be a sufficient condition.

TLW:

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system?

You fundamentally cannot, so it's a moot point. There is no way to confirm zero mutual information[1], and even if there were, there is zero probability that the mutual information was zero[2]. Very small, perhaps. Zero, no.

I do not follow your seeming dismissal of this. You acknowledge it, and then... assert it's not a problem?

An analogy: solving the Halting problem is impossible[3]. It is sometimes useful to handwave a Halting oracle as a component of proofs regardless - but at the end of the day saying 'solving the Halting problem is easy, just use a Halting oracle' is not a solution.

Many people have an intuition like “everything is an imperfect proxy; we can never avoid Goodhart”. The point of the mutual information example is that this is basically wrong. 

"Many people have an intuition like "everything is an imperfect halting-problem solver; we can never avoid Turing". The point of the Halting oracle example is that this is basically wrong."

Hopefully this illustrates my point.

  1. In particular, it requires calculating the distributions to infinite accuracy, which in turn requires an infinite sample. (Consider if I have two independent perfectly fair coins. I flip each of them 3x and get HHT/HHT. The mutual information estimated from those samples is non-zero!)
  2. For a sufficient example: gravity causes any[4] two things in the universe[5] to correlate[6].
  3. At least assuming the Church-Turing hypothesis is correct.
  4. Except potentially if there's an event horizon, although even that's an open question, and in that case it's a moot point because an AI in an event horizon is indistinguishable from no AI.
  5. Strictly speaking, within each other's lightcones.
  6. And as soon as you have anything causing a correlation, the probability that other factors exactly cancel said correlation is zero.

There is no way to confirm zero mutual information[1], and even if there were, there is zero probability that the mutual information was zero[2]. Very small, perhaps. Zero, no.

Thanks for bringing this up; it raises a technical point which didn't make sense to include in the post, but which I was hoping someone would raise in the comments.

The key point: Goodhart problems are about generalization, not approximation.

Suppose I have a proxy ũ for a true utility function u, and ũ is always within ε of u (i.e. |u(x) − ũ(x)| ≤ ε for all x). I maximize ũ. Then the true utility u achieved will be within 2ε of the maximum achievable utility. Reasoning: in the worst case, ũ is ε lower than u at the u-maximizing point, and ε higher than u at the ũ-maximizing point.
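
Spelled out as a chain of inequalities (my reconstruction of the reasoning above, writing x* for the ũ-maximizing point and x° for the u-maximizing point):

u(x*) ≥ ũ(x*) − ε ≥ ũ(x°) − ε ≥ u(x°) − 2ε,

where the first and last steps use |u − ũ| ≤ ε, and the middle step uses the fact that x* maximizes ũ.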

Point is: if a proxy is close to the true utility function everywhere, then we will indeed achieve close-to-maximal utility upon maximizing the proxy. Goodhart problems require the proxy to not even be approximately close, in at least some places.

When we look at real-world Goodhart problems, they indeed involve situations where some approximation only works well within some region, and ceases to even be a good approximation once we move well outside that region. That's a generalization problem, not an approximation problem.

So approximations are fine, so long as they generalize well.

TLW:

This is an interesting observation; I don't see how it addresses my point.

There is no exact solution to mutual information from two finite samples. There is no ε-approximation of mutual information from two finite samples, either.

=====

On the topic of said observation: beware that ε-approximations of many things are provably difficult to compute, and in some cases are even uncomputable. (The classic being Chaitin's Constant[1].)

In particular, you very often end up with Halting-problem style contradictions when computing properties of systems capable of unbounded computation, and even approximations thereof.

Unfortunately, 'value function of a powerful AI' tends to fall into that category[2].

  1. Which isn't "a" constant, but that's another matter.
  2. Well, as closely as anything in the physical world does, anyway.

The Hardness of computing mutual information in general is not a very significant barrier to designing systems with (near-)zero mutual information between two components, in exactly the same way that the Hardness of computing whether a given program halts in general is not a very significant barrier to designing software which avoids infinite loops.

TLW:

Let us make a distinction here between two cases:

  1. Observing the input and output of a blackbox X, and checking a property thereof.
  2. Whitebox knowledge of X, and checking a property thereof.

In physical systems, we do not have whitebox knowledge. We merely have a finite sample of a blackbox[1]. Sometimes said finite sample of a blackbox appears to match a fairly straightforward machine Y, but that's about the best we can say[2].

And yes, checking if two specific Turing-complete blackboxes are equivalent is undecidable[3], even though checking if two specific Turing-complete whiteboxes are equivalent may be decidable.

in exactly the same way

It is not exactly the same way, due to the above.

 

  1. Namely, 'the laws of physics'.
  2. (And worse, often doesn't exactly match in the observations thus far, or results in contradictions.)
  3. Trivially, due to indistinguishability issues. For any finite sequence of inputs and outputs, there are multiple machines X and X' which produce that sequence of outputs given the input, but which have later output that diverges. This is not a problem in the whitebox case because said machines are distinguishable.

You can ensure zero mutual information by building a sufficiently thick lead wall. By convention in engineering, any number is understood as a range, based on the number of significant digits relevant to the calculation. So "zero" is best understood as "zero within some tolerance". So long as we are not facing an intelligent and resourceful adversary, there will probably be a human-achievable amount of lead which cancels the signal sufficiently. 

This serves to illustrate the point that sometimes we can find ways to bound an error to within desirable tolerances, even if we do not yet know how to do such a thing in the face of the immense optimization pressure which superhuman AGI would bring to bear on a problem. 

We need plans to have achievable tolerances. For example, we need to assume a realistic amount of hardware failure. We can't treat the hardware as blackboxes; we know how it operates, and we have to make use of that knowledge. But we can't pretend perfect mathematical knowledge of it, either; we have error tolerances. 

So your blackbox/whitebox dichotomy doesn't fit the situation very well. 

But do you really buy the whole analogy with mutual information, IE buy the claim that we can judge the viability of escaping goodhart from this one example, and only object that the judgement with respect to this example was incorrect?

Perhaps we should really look at a range of examples, not just one? And judge John's point as reasonable if and only if we can find some cases where effectively perfect proxies were found?

Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed. So John's plan sounds doomed to failure, because it relies on finding an actually-perfect proxy, when all realistic proxies are imprecise at least in their physical tolerances. 

In which case, I would reply that the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course. So there is hope that we will not end up in a situation where every tiny flaw is exploited. What we are looking for is plans which robustly get us to that point. 

TLW:

Ah, but perhaps your objection is that the difficulty of the AI alignment problem suggests that we do in fact need the analog of perfect zero correlation in order to succeed.

My objection is actually mostly to the example itself.

As you mention:

the idea is not to try and contain a malign AGI which is already not on our side. The plan, to the extent that there is one, is to create systems that are on our side, and apply their optimization pressure to the task of keeping the plan on-course.

Compare with the example:

Suppose we’re designing some secure electronic equipment, and we’re concerned about the system leaking information to adversaries via a radio side-channel.

[...]

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.

This is analogous to the case of... trying to contain a malign AI which is already not on our side.

Fair enough! I admit that John did not actually provide an argument for why alignment might be achievable by "guessing true names". I think the approach makes sense, but my argument for why this is the case does differ from John's arguments here.

Curated. Laying out a full story for why the work you're doing is solving AI alignment is very helpful, and this framing captures different things from other framings (e.g. Rocket Alignment, Embedded Curiosities, etc). Also it's simply written and mercifully short, relative to other such things. Thanks for this step in the conversation.

My usual framing/explanation (in conversations)

I. Individual future AI systems can be thought of as points in some abstract "space of intelligent systems".

II. Notice different types of questions:
a. What properties do some individual existing, experimentally accessible points in this space have?
b. Where in this space will we end up in the future?
c. What will be the properties of these points?
d. What's going on in generalizations/extrapolations from existing points to other parts of the space?
e. Are there some "effective theories" governing parts of the space? What are their domains of validity?
f. Are there some "fundamental theories" governing the space? What are their domains of validity?
g. What are the properties of the space? E.g., is it continuous?
h. What's the high-level evolutionary dynamic of our movement in this space?

III.
Use the metaphors from physics, history of science, econ, ... to understand how these look in other domains, and what the relations between the questions are (e.g. relations between construction of heat engines, thermodynamics, stat physics, engineering, related markets, maths, ...).

IV.
Having something like a "science of intelligent systems" seems like a plausibly decisive factor in our ability to robustly solve the problem.

Conceptual metaphors from other fields are often good prompts for thinking about how this may look, or what to search for

Why should we expect that True Names useful for research exist in general? It seems like there are reasons why they might not:

  • Messy and non-robust maps between any clean concept and what we actually care about, such that more of the difficulty in research is in figuring out the map. The Standard Model of physics describes all the important physics behind protein folding, but we actually needed to invent AlphaFold.
  • The True Name doesn't quite represent what we care about. Tiling agents is a True Name for agents building successors, but we don't care that agents can rigorously prove things about their successors.
  • The question is fundamentally ill-posed: what's the True Name of a crab? What's the True Name of a ghost?

Most of these examples are bad, but hopefully they get the point across.

But what if we instead design the system so that the leaked radio signal has zero mutual information with whatever signals are passed around inside the system? Then it doesn’t matter how much optimization pressure an adversary applies, they’re not going to figure out anything about those internal signals via leaked radio.

Flat out wrong. It's quite possible for A and B to have 0 mutual information. But A and B always have mutual information conditional on some C (assuming A and B each have information). It's possible for there to be absolutely no mutual information between any 2 of electricity use, leaked radio, and private key, yet there is mutual information between all 3. So if the adversary knows your electricity use, and can detect the leaked radio, then they know the key.
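
A minimal sketch of the construction being described (my own toy example; the setup and variable names are invented for illustration): let the private key and the electricity use be independent random bits, and let the leaked radio signal be their XOR. Every pair has essentially zero mutual information, yet conditioning on any one variable makes the other two fully informative about each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

key = rng.integers(0, 2, size=n)          # private key bit
electricity = rng.integers(0, 2, size=n)  # electricity-use bit, independent of the key
radio = key ^ electricity                 # leaked radio = XOR of the two

def mi_bits(x, y):
    """Plug-in mutual information (bits) between two binary samples."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = np.mean((x == a) & (y == b))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (np.mean(x == a) * np.mean(y == b)))
    return mi

print("I(key; radio)               ~", round(mi_bits(key, radio), 3))        # ~0 bits
print("I(key; electricity)         ~", round(mi_bits(key, electricity), 3))  # ~0 bits
# Conditioning on electricity use makes key and radio perfectly informative about each other:
cond_mi = sum(np.mean(electricity == e) * mi_bits(key[electricity == e], radio[electricity == e])
              for e in (0, 1))
print("I(key; radio | electricity) ~", round(cond_mi, 3))                    # ~1 bit
```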

I think this is a good description of what agent foundations is and why it might be needed. But the binary of 'either we get alignment by default or we need to find the True Name' isn't how I think about it.

Rather, there's some unknown parameter, something like 'how sharply does the pressure towards incorrigibility ramp up, what capability level does it start at, how strong is it'?

Setting this at 0 means alignment by default. Setting this higher and higher means we need various kinds of Prosaic alignment strategies which are better at keeping systems corrigible and detecting bad behaviour. And setting it at 'infinity' means we need to find the True Names/foundational insights.

Rohin:

My rough model is that there's an unknown quantity about reality which is roughly "how strong does the oversight process have to be before the trained model does what the oversight process intended for it to do". p(doom) mainly depends on whether the actors training the powerful systems have sufficiently powerful oversight processes.

Maybe one way of getting at this is to look at ELK - if you think the simplest dumbest ELK proposals probably work, that's Alignment by Default. The harder you think prosaic alignment is, the more complex an ELK solution you expect to need. And if you think we need agent foundations, you think we need a worst-case ELK solution.

Regardless of the exact starting point, seekers of “True Names” quickly find themselves recursing into a search for “True Names” of lower-level components of agency, like:

  • Optimization
  • Goals
  • World models
  • Abstraction

This is the big missing piece for me. Could you elaborate on how you go from trying to find the True Names of human values to things like what is an agent, abstraction, and embeddedness? 

Goals makes sense, but the rest are not obvious why they'd be important or relevant. I feel like this reasoning would lead you to thinking about meta-ethics or something, not embeddedness and optimization. 

I suspect I'm missing a connecting piece here that would make it all click. 

Imagine it's 1665 and we're trying to figure out the True Name of physical force - i.e. how hard it feels like something is pushing or pulling.

One of the first steps is to go through our everyday experience, paying attention to what causes stronger/weaker sensations of pushing and pulling, or what effects stronger/weaker sensations have downstream. We might notice, for instance, that heavier objects take more force to push, or that a stronger push accelerates things faster. So, we might expect to find some robust relationship between the True Names of force, mass, and acceleration. At the time, we already basically had the True Name of mass, but we still needed to figure out the True Name of acceleration.

Why do we need the True Names of mass and acceleration, rather than just trying to figure out force directly? Well, finding robust relationships between the True Names of multiple concepts is, historically, one of the main ways we've been able to tell that we have the right Names. We can use e.g. the extension of a spring to measure force, but then what makes us think this operationalization of "force" is going to robustly generalize in the ways we expect? One main reason we expect today's notion of "force" to robustly generalize is the extremely robust experimental relationship force = mass * acceleration.

(Side note: for mathematical concepts, like e.g. probability and information, the main reason we expect the formulations to robustly generalize is usually the mathematical proof of some relationship, ideally augmented with experimental evidence, rather than just experimental evidence alone.)

Also, while we're still searching for the relevant Names, relationships between concepts help steer the search - for instance, it's a lot easier to figure out the True Name of heat once we have the Name of temperature.

Anyway, to answer what I think is your real question here...

A Hypothetical Dialogue

Philosopher: Imagine that Alice and Bob both want an apple, and they ca-

Alignment Researcher: Whoa now, hold up.

Philosopher: ... I haven't even asked the question yet.

Alignment Researcher: What is this "wanting" thing?

Philosopher: You know, it's this feeling you get where -

Alignment Researcher: I don't suppose you have any idea how to formulate that mathematically?

Philosopher: How about as a utility function? I hear that formulation has some arguments going for it...

Alignment Researcher: <looks at some math and experimental results> somewhat dubious, but it looks like it's at least in the right ballpark. Ok, so Alice and Bob both "want" an apple, meaning that (all else equal) they will accept whatever trades (or tradeoffs) give them the apple, and turn down any trades (or tradeoffs) which lose them the apple? Or, in other words, they're each optimizing to get that apple.

Philosopher: Well, not exactly, we're not saying Alice and Bob always do the things which get them what they want. "Wanting" isn't a purely behavioral concept. But you know what, sure, let's go with that for now. So Alice and Bob both want an apple, and they can't both -

Alignment Researcher: Ok, hang on, I'm trying to code up this hypothetical in Python, and I'm still unsure about the type-signatures. What are the inputs of the utility functions?

Philosopher: um... <does some googling>... Well, the standard formulation says that Alice and Bob are expected utility maximizers, so the inputs to the utility function will be random variables in their world models.

Alignment Researcher: World models? What the heck are world models??

Philosopher: Well, since we're modelling them as ideal agents anyway, it's a Bayesian distribution over a bunch of random variables corresponding to things in the world, which we upd-

Alignment Researcher: "corresponding to things in the world"? I know how to do Bayesian updates on distributions, but it's not like the variables in those distributions necessarily "correspond to the world" in any particular way. Presumably we need this "correspondence" in order for Alice and Bob's internal variables representing the "apple" to correspond with some actual apple? Heck, what even is an "actual apple"? That's important, if we want Alice and Bob to "want" some actual thing in the world, and not just particular activations in their sense-data...

... Anyway, The Point Is

When we try to mathematically formulate even very basic ideas about ethics, we very quickly run into questions about how to formalize agency, world-models, etc.

Not sure if I disagree or if we're placing emphasis differently.

I certainly agree that there are going to be places where we'll need to use nice, clean concepts that are known to generalize. But I don't think that the resolutions to problems 1 and 2 will look like nice clean concepts (like in minimizing mutual information). It's not just human values that are messy and contingent; even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent. I think of some of my intuitions as my "real values" and others as mere "biases" in a thoroughly messy way.

But back on the first hand again, what's "messy" might be subjective. A good recipe for fitting values to me will certainly be simple and neat compared to the totality of information stored in my brain.

And I certainly want to move away from the framing that the way to deal with problems 1 and 2 is to say "Goodhart's law says that any difference between the proxy and our True Values gets amplified... so we just have to find our True Values" - I think this framing leads one to look for solutions in the wrong way (trying to eliminate ambiguity, trying to find a single human-comprehensible model of humans from which the True Values can be extracted, mistakes like that). But this is also kind of a matter of perspective - any satisfactory value learning process can be evaluated (given a background world-model) as if it assigns humans some set of True Values.

I think even if we just call these things differences in emphasis, they can still lead directly to disagreements about (even slightly) meta-level questions, such as how we should build trust in value learning schemes.

It's not just human values that are messy and contingent; even the pointer we want to use to gesture to those-things-we-want-to-treat-as-our-values is messy and contingent.

What's the evidence for this claim?

When I look at e.g. nails, the economic value of a nail seems reasonably complicated. Yet the "pointers to nail value" which we use in practice - i.e. competitive markets and reputation systems - do have clean, robust mathematical formulations.

Furthermore, before the mid-20th century, I expect that most people would have expected that competitive markets and reputation systems were inherently messy and contingent. They sure do look messy! People confuse messiness in the map for messiness in the territory.

I think of some of my intuitions as my "real values" and others as mere "biases" in a thoroughly messy way.

... this, for instance, I think is probably a map-territory confusion. The line between "real values" and "biases" will of course look messy when one has not yet figured out the True Name. That does not provide significant evidence of messiness in the territory.

Personally, I made this mistake really hard when I first started doing research in systems biology in undergrad. I thought the territory of biology was inherently messy, and I actually had an argument with my advisor that some of our research goals were unrealistic because of inherent biological messiness. In hindsight, I was completely wrong; the territory of biology just isn't that inherently messy. (My review of Design Principles of Biological Circuits goes into more depth on this topic.)

That said, the intuition that "the territory is messy" is responding to a real failure mode. The territory does not necessarily respect whatever ontology or model a human starts out with. People who expect a "clean" territory tend to be shocked by how "messy" the world looks when their original ontology/model inevitably turns out to not fit it very well. I think this is how people usually end up with the (sometimes useful!) intuition that the territory is messy.

Evidence & Priors

Note that the above mostly argued that the appearance of messiness is a feature of the map which yields little evidence about the messiness of the territory; even things with simple True Names look messy before we know those Names. But that still leaves unanswered two key questions:

  • Is there any way that we can get evidence of messiness of the territory itself?
  • What should our priors be regarding messiness in the territory?

One way to get positive evidence of messiness in the territory, for instance, is to see lots of smart people fail to find a clean True Name even with strong incentives to do so. Finding True Names is currently a fairly rare and illegible skill (there aren't a lot of Claude Shannons or Judea Pearls), so we usually don't have very strong evidence of this form in today's world, but there are possible futures in which it could become more relevant.

On the other hand, one way to get evidence of lack of messiness in the territory, even in places where we haven't yet found the True Names, is to notice that places which seem like canonical examples of very-probably-messy-territory repeatedly turn out to not be so messy. That was exactly my experience with systems biology, and is where my current intuitions on the matter originally came from.

Regarding priors, I think there's a decent argument that claims of messiness in the territory are always wrong, i.e. a messy territory is impossible in an important sense. The butterfly effect is a good example here: perhaps the flap of a butterfly's wings can change the course of a hurricane. But if the flap of any butterfly's wings has a significant chance of changing the hurricane's course, for each of the billions of butterflies in the world, then ignorance of just a few dozen wing-flaps wipes out all the information about all the other wing-flaps; even if I measure the flaps of a million butterfly wings, this gives me basically-zero information about the hurricane's course. (For a toy mathematical version of this, see here.)
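
A minimal sketch of the kind of toy model being gestured at here (my own version, not necessarily the one linked): let the hurricane's course be the parity of a huge number of wing-flaps, so every flap matters equally. Then even measuring all but a handful of the flaps gives essentially zero predictive power over the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n_flaps = 10_000       # total wing-flaps, each of which matters equally
n_known = 9_900        # flaps we have measured (ignorance of only 100)
n_worlds = 5_000       # independent hypothetical worlds to average over

flaps = rng.integers(0, 2, size=(n_worlds, n_flaps), dtype=np.int8)
course = flaps.sum(axis=1) % 2                 # hurricane "course" = parity of all flaps
guess = flaps[:, :n_known].sum(axis=1) % 2     # best guess from the flaps we know

print("prediction accuracy:", (guess == course).mean())   # ~0.5: no better than a coin flip
```

Across that whole range of epistemic states, the course is exactly as predictable as a fair coin - "well modeled as pure noise".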

The point of this example is that this "messy" system is extremely well modeled across an extremely wide variety of epistemic states as pure noise, which is in some sense quite simple. (Obviously we're invoking an epistemic state here, which is a feature of a map, but the existence of a very wide range of simple and calibrated epistemic states is a feature of the territory.) More generally, the idea here is that there's a duality between structure and noise: anything which isn't "simple structure" is well-modeled as pure noise, which itself has a simple True Name. Of course then we can extend it to talk about fractal structure, in which more structure appears as we make the model more precise, but even then we get simple approximations.

Anyway, that argument about nonexistence of messy territory is more debatable than the rest of this comment, so don't get too caught up in it. The rest of the comment still stands even if the argument at the end is debatable.

It's not clear to me that your metaphors are pointing at something in particular.

Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks. If this was all we needed, then yes, absolutely, I'm sure there's a similarly neat and simple way to instrumentalize human values - it's just going to fail if things are too smart, or too irrational, or too far in the future.

Biology being human-comprehensible is an interesting topic, and suppose I grant that it is - that we could have comprehensible explanatory stories for every thing our cells do, and that these stories aren't collectively leaving anything out. First off, I would like to note that such a collection of stories would still be really complicated relative to simple abstractions in physics or economics! Second, this doesn't connect directly to Goodhart's law. We're just talking about understanding biology, without mentioning purposes to which our understanding can be applied. Comprehending biology might help us generalize, in the sense of being able to predict what features will be conserved by mutation, or will adapt to a perturbed environment, but again this generalization only seems to work in a limited range, where the organism is doing all the same jobs with the same divisions between them.

The butterfly effect metaphor seems like the opposite of biology. In biology you can have lots of little important pieces - they're not individually redirecting the whole hurricane/organism, but they're doing locally-important jobs that follow comprehensible rules, and so we don't disregard them as noise. None of the butterflies have such locally-useful stories about what they're doing to the hurricane, they're all just applying small incomprehensible perturbations to a highly chaotic system. The lesson I take is that messiness is not the total lack of structure - when I say my room is messy, I don't mean that the arrangement of its component atoms has been sampled from the Boltzmann distribution - it's just that the structure that's there isn't easy for humans to use.

I'd like to float one more metaphor: K-complexity and compression.

Suppose I have a bit string of length 10^9, and I can compress it down to length 10^8. The "True Name hypothesis" is that the compression looks like finding some simple, neat patterns that explain most of the data and we expect to generalize well, plus a lot of "diff" that's the noisy difference between the simple rules and the full bitstring. The "fractal hypothesis" is that there are a few simple patterns that do some of the work, and a few less simple rules that do more of the work, and so on for as long as you have patience. The "total mess hypothesis" is that simple rules do a small amount of the work, and a lot of the 10^8 bits is big highly-interdependent programs that would output something very different if you flipped just a few bits. Does this seem about right?

Revenue of a nail factory is a good proxy for the quality of the nails produced, but only within a fairly small bubble around our current world. You can't make the factory-owner too smart, or the economy too irrational, or allow for too many technological breakthroughs to happen, or else the proxy breaks.

I think you missed the point of that particular metaphor. The claim was not that revenue of a nail factory is a robust operationalization of nail value. The claim was that a competitive nail market plus nail-maker reputation tracking is a True Name for a pointer to nail value - i.e. such a system will naturally generate economically-valuable nails. Because we have a robust mathematical formalization of efficient markets, we know the conditions under which that pointer-to-nail-value will break down: things like the factory owner being smart enough to circumvent the market mechanism, or the economy too irrational, etc.

The lesson I take is that messiness is not the total lack of structure - when I say my room is messy, I don't mean that the arrangement of its component atoms has been sampled from the Boltzmann distribution - it's just that the structure that's there isn't easy for humans to use.

I agree with this, and it's a good summary of the takeaway of the butterfly effect analogy. In this frame, I think our disagreement is about whether "structure which isn't easy for humans to use" is generally hard to use because the humans haven't yet figured it out (but they could easily use it if they did figure it out) vs structure which humans are incapable of using due to hardware limitations of the brain.

Suppose I have a bit string of length 10^9, and I can compress it down to length 10^8. ...

This is an analogy which I also considered bringing up, and I think you've analogized things basically correctly here. One important piece: if I can compress a bit string down to length 10^8, and I can't compress it any further, then that program of length 10^8 is itself incompressible - i.e. it's 10^8 random bits. As with the butterfly effect, we get a duality between structure and noise.

Actually, to be somewhat more precise: it may be that we could compress the length 10^8 program somewhat, but then we'd still need to run the decompressed program through an interpreter in order for it to generate our original bitstring. So the actual rule is something roughly like "any maximally-compressed string consists of a program shorter than roughly-the-length-of-the-shortest-interpreter, plus random bits" (with the obvious caveat that the short program and the random bits may not separate neatly).

I think you're saying: if a thing is messy, at least there can be a non-messy procedure / algorithm that converges to (a.k.a. points to) the thing. I think I'm with Charlie in feeling skeptical about this in regards to value learning, because I think value learning is significantly a normative question. Let me elaborate:

My genes plus 1.2e9 seconds of experience have built a fundamentally messy set of preferences, which are in some cases self-inconsistent, easily-manipulated, invalid-out-of-distribution, etc. It's easy enough to point to the set of preferences as a whole—you just say “Steve's preferences right now”.

In fact, one might eventually (I expect) be able to write down the learning algorithm, reward function, etc., that led to those preferences (but we won't be able to write down the many petabytes of messy training data), and we'll be able to talk about what the preferences look like in the brain. But still, you shouldn't and can't directly optimize according those preferences because they're self-inconsistent, invalid-out-of-distribution, they might involve ghosts, etc.

So then we have a normative question: if “fulfill Steve’s preferences” isn’t a straightforward thing, then what exactly should the AGI do? Maybe we should ask Steve what value learning ought to look like? But maybe I say “I don’t know”, or maybe I give an answer that I wouldn’t endorse upon reflection, or in hindsight. So maybe we should have the AGI do whatever Steve will endorse in hindsight? No, that leads to brainwashing.

Anyway, it's possible that we'll come up with an operationalization of value learning that really nails down what we think the AGI ought to do. (Let's say, for example, something like CEV but more specific.) If we do, to what extent should we expect this operationalization to be simple and elegant, versus messy? (For example, in my book, Stuart Armstrong research agenda v0.9 counts as rather messy.) I think an answer on the messier side is quite plausible. Remember, (1) this is a normative question, and (2) that means that the foundation on which it's built is human preferences (about what value learning ought to look like), and (3) as above, human preferences are fundamentally messy because they involve a lifetime of learning from data. This is especially true if we don't want to trample over individual / cultural differences of opinion about (for example) the boundary between advice (good) vs manipulation (bad).

(Low confidence on all this.)