In a recent paper, Scott Garrabrant and I formalized and extended the categories he proposed for Goodhart-like phenomena. (If you haven't read his post or the new paper, they are important background for most of this post.)

Here, I lay out my further intuitions about how and where the non-adversarial categories matter for AI safety. Specifically, I view these categories as particularly important for preventing accidental superhuman AI, or near-term paperclipping, which makes them especially crucial in the short term.

I do not think that most of the issues highlighted here are new, but I think the framing is useful, and I hope it presents clearly why causal mistakes by agentic AI are a harder problem than is normally appreciated.

Epistemic Status: Provisional and open to revision based on new arguments, but arrived at after significant consideration. I believe conclusions 1-4 are restatements of well-understood claims in AI safety, while conclusions 5 and 6 are less well appreciated.

Side Note: I am deferring discussion of adversarial Goodhart to the other paper and a later post; it is arguably more important, but in very different ways. The deferred topics include most issues with multiple interacting agentic AIs, and issues with pre-specifying a control scheme for a superhuman AI.

Goodhart Effects Review - Read the paper for details!

Regressional Goodhart - When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.

Extremal Goodhart - Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the relationship between the proxy and the goal was observed. This occurs in two forms: Model Insufficiency and Change in Regime.

Causal Goodhart - When the causal path between the proxy and the goal is indirect, intervening on the proxy can change its relationship to the goal, and optimizing can then cause perverse effects.

Adversarial Goodhart will not be discussed in this post. It occurs in two ways. Misalignment - The agent applies selection pressure knowing the regulator will apply different selection pressure on the basis of the metric, which allows the agent to hijack the regulator's optimization. Cobra Effect - The regulator modifies the agent's goal, usually via an incentive, to correlate it with the regulator's metric. The agent then either 1) applies selection pressure that creates extremal Goodhart effects or makes regressional Goodhart effects more severe, or 2) acts to change the causal structure, due to incompletely aligned goals, in a way that creates a Goodhart effect.
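To make the regressional case concrete, here is a minimal numeric sketch (the setup and numbers are mine, not from the paper): the proxy is the true goal plus independent noise, and selecting hard on the proxy systematically selects for the noise as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
goal = rng.normal(size=n)            # the true value of each option
proxy = goal + rng.normal(size=n)    # a noisy but well-correlated measurement

best = int(np.argmax(proxy))         # unconstrained optimization on the proxy
print(f"proxy of chosen option: {proxy[best]:.2f}")
print(f"goal  of chosen option: {goal[best]:.2f}")
# With equal signal and noise variance, the chosen option's expected goal value
# is only about half its proxy value: optimizing the proxy also optimized the
# proxy-goal gap.
```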

Regressional and Extremal Goodhart

The first two categories of Goodhart-like phenomena, regressional and extremal, are over-optimization mistakes, and in my view these mistakes should be avoidable. This is not to say we don't need to treat AI as a cryptographic rocket probe, or that we don't need to be worried about it - just that we already know what to be concerned about. This seems vaguely related to what Scott Alexander calls "Mistake Theory" - the risks would be technically solvable if we could convince people not to do the stupid things that make them happen.

Regressional Goodhart, as Scott Garrabrant correctly noted, is an unavoidable phenomenon when doing unconstrained optimization on a fixed metric which is imperfectly correlated with a goal. To avoid the problems of overoptimization despite this, a safe AI system must 1) have limited optimization power, to allow robustness to misalignment, perhaps via satisficing, low-impact agents, or suspendable / stoppable agents, and/or 2) use a metric which is adaptive, via techniques like oversight or reinforcement learning, so that humans can realign the AI. Safe approaches should also ensure there are other ways to enforce robustness to scaling up.
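As a rough illustration of the first option, here is a sketch (my own toy setup, reusing the proxy-plus-noise model above) comparing a hard maximizer with a satisficer that only demands a "good enough" proxy value:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
goal = rng.normal(size=n)
proxy = goal + rng.normal(size=n)

# Maximizer: apply full optimization pressure to the fixed proxy.
best = int(np.argmax(proxy))
max_gap = proxy[best] - goal[best]

# Satisficer: accept any option clearing a "good enough" threshold.
ok = proxy >= np.quantile(proxy, 0.99)
sat_gap = np.mean(proxy[ok] - goal[ok])

print(f"maximizer proxy-goal gap:          {max_gap:.2f}")
print(f"satisficer average proxy-goal gap: {sat_gap:.2f}")
# Backing off from the extreme point gives up some proxy value but buys a
# noticeably smaller divergence between the proxy and the goal.
```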

Conclusion 1 - Don't allow unconstrained optimization using a fixed metric.

Extremal Goodhart effects are also mistakes of overoptimization, though in less unavoidable ways, and my intuition is that they should be addressable similarly. We need to be able to detect the regime changes or the increasing misalignment of the metric, but the strategies that address regressional effects should be closely related, or useful in the same cases. Again, it's not easy, but it's a well-defined problem.

Conclusion 2 - When exploring the fitness landscape, don't jump down the optimization slope too quickly before double-checking externally. This is especially true when moving into out-of-sample areas.
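A minimal sketch of this out-of-sample failure mode (again my own toy example): a proxy fit in the ordinary operating regime keeps rewarding more extreme actions even after the true goal has turned around.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_goal(x):
    # The real relationship saturates and then reverses outside the
    # ordinary regime.
    return x - 0.1 * x**2

x_obs = rng.uniform(0, 2, size=500)                    # observed regime only
y_obs = true_goal(x_obs) + rng.normal(scale=0.05, size=500)
slope, intercept = np.polyfit(x_obs, y_obs, 1)         # linear proxy fit

for x in (1.0, 2.0, 10.0, 20.0):
    print(f"x={x:5.1f}  proxy={slope * x + intercept:7.2f}  goal={true_goal(x):7.2f}")
# The proxy extrapolates the observed trend, so pushing x to extreme values
# looks better and better to it while the true goal collapses - a change in
# regime the fit never saw.
```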

Despite the challenges, I think that the divergences between goals and metrics in the first two Goodhart-like effects can be understood and addressed beforehand, and these techniques are being actively explored. In fact, I think this describes at least a large plurality of the current work being done on AI alignment.

The Additional Challenge of Causality

Causal Goodhart, like the earlier two categories, is always a mistake of understanding. Unlike the first two, it seems less easily avoidable through caution. The difficulty of correctly inferring causality means that it's easy to accidentally screw up an agentic AI's world model in a way that allows causal mistakes to be made. I'm unsure whether the approaches being considered for AI safety are properly careful about this fact. (I am not very familiar with the various threads of AI safety research, so I may be mistaken on that count.)

Accounting for uncertainty about causal models is critical, but given the multiplicity of possible models, we run into the problems of computation seen in AIXI. (And even AIXI doesn't guarantee safety!)
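To give a sense of that multiplicity: the number of possible causal structures (directed acyclic graphs) over even a handful of variables is enormous. A short calculation using Robinson's recurrence for counting labeled DAGs:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))

for n in range(1, 9):
    print(n, count_dags(n))
# 1, 3, 25, 543, 29281, 3781503, 1138779265, 783702329343: the space of
# candidate causal models grows super-exponentially before we even start
# scoring them against data, let alone averaging over them.
```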

Inferring causal structure is also NP-hard. Scott's earlier post claims that "you can try to infer the causal structure of the variables using statistical methods, and check that the proxy actually causes the goal before you intervene on the proxy." The problem is that we can't actually infer causal structure well, even given RCTs, without simultaneously testing the full factorial set of cases. (And even then, statistics is hard, and can be screwed up accidentally in complex and unanticipated ways.) Humans infer causality partly intuitively, but in more complex systems they do so badly. They can be taught to do it better, but only in narrow domains.
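Here is a minimal sketch of why observation alone is not enough (my own toy example): two data-generating processes, one where the proxy causes the goal and one where a hidden common cause drives both, produce essentially the same observational statistics, yet intervening on the proxy only helps in the first.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# World A: the proxy really does cause the goal.
proxy_a = rng.normal(size=n)
goal_a = 0.5 * proxy_a + rng.normal(scale=0.75**0.5, size=n)

# World B: a hidden common cause drives both; the proxy itself does nothing.
z = rng.normal(size=n)
proxy_b = 0.5**0.5 * z + rng.normal(scale=0.5**0.5, size=n)
goal_b = 0.5**0.5 * z + rng.normal(scale=0.5**0.5, size=n)

print("corr in world A:", round(np.corrcoef(proxy_a, goal_a)[0, 1], 2))
print("corr in world B:", round(np.corrcoef(proxy_b, goal_b)[0, 1], 2))
# Both worlds show the same ~0.5 correlation, so observational data cannot
# distinguish them - yet forcing the proxy up raises the goal in world A and
# does nothing at all in world B.
```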

Conclusion 3 - Getting causality right is an intrinsically computationally hard and sample-inefficient problem, and building AI won't fix that.

As Pearl notes, policy is hard in part because knowing these complex causal factors exactly is hard. This isn't restricted to AI; it also happens in [insert basically any public policy that you think we should stop already here]. (Politics is the mind-killer, and policy debates often center on claims about causality. No, I won't give contemporary examples.)

We don't even get causality right in the relatively simple policy systems we already construct - hence Chesterton's fence, Boustead's Iron Law of Intervention, and the fact that intellectuals throughout history routinely advocate strongly for things that turn out to be bad when actually applied. They never actually apologize for accidentally starving and killing 5% of their population; this, of course, is because their actual idea was good, it was just implemented badly. Obviously, real killing of birds to reduce pests in China has never been tried.

Conclusion 4 - Sometimes the perverse effects of getting it a little bit wrong are really, really bad, especially because perverse effects may only be obvious after long delays.

There are two parts to this issue. The first is that mistaken causal structure can lead to regressional or extremal Goodhart. This is not causal Goodhart, and isn't more worrisome than those issues, since the solutions mentioned earlier still apply. The second part is that the action taken by the regulator may actually change the causal structure. They think they are doing something simple, like removing a crop-eating pest, but the assumed relationship between birds and crop damage ignores the fact that the birds also eat other pests. This is much more worrisome, and harder to avoid.

This second case is causal Goodhart. The mistake can occur as soon as you allow a regulator - machine learning, AI, or otherwise - to interact directly with wider complex systems in arbitrary ways to achieve specified goals, without specifying and restricting the methods to be used.
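A deliberately cartoonish sketch of that failure (all dynamics and numbers invented for illustration): the regulator's model says birds only eat grain, so culling them should raise the harvest, but acting on that model removes the predation that was keeping insect pests in check.

```python
def harvest(cull_fraction, years=3):
    """Toy ecology: birds eat a little grain, but also suppress insects."""
    birds, insects = 1.0, 1.0
    birds *= 1.0 - cull_fraction
    for _ in range(years):
        insects *= 1.0 + 1.5 * (1.0 - birds)   # fewer birds -> insects boom
    return max(0.0, 1.0 - 0.05 * birds - 0.04 * insects)

predicted_gain = 0.05                          # regulator's model: save the 5%
                                               # of grain the birds were eating
actual_change = harvest(cull_fraction=1.0) - harvest(cull_fraction=0.0)
print(f"predicted change in harvest: +{predicted_gain:.2f}")
print(f"actual change in harvest:    {actual_change:+.2f}")
# Intervening on the proxy ("fewer birds means less grain eaten") changes the
# causal structure the proxy relied on: with the birds gone, the insects
# explode and the harvest ends up far worse than doing nothing.
```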

These problems don't show up in currently deployed systems because humans typically choose the available action set based on the causal understanding that is needed. The challenge is also not seen in toy worlds, since testing domains are usually very well specified, and inferring causality becomes difficult only when the system being manipulated contains complex and not-fully-understood causal dynamics. (A possible counterexample is a story I heard secondhand about OpenAI developing what became the World of Bits system: giving an RL system access to a web browser and the mouse led to weird problems including, if I recall correctly, the entire system crashing.)

Conclusion 5 - This class of causal-mistake problem should be expected to show up as proto-AI systems are fully deployed, not beforehand, when they are tested in limited cases.

This class of problem does not seem to be addressed by most of the AI-risk approaches currently being suggested or developed. (The only approach that avoids it is using Oracle AIs.) There seems to be no alternative to using a tentative causal understanding for decision making if we allow any form of agentic AI, and the problems this causes are not usually obvious to either the observer or the agent until the decision has been implemented.

Note that attempting to minimize the impact of a choice is done using the same mistaken or imperfect causal model that led to the decision, so the problem is not avoidable this way. Humans providing reinforcement based on the projected outcomes of the decision are similarly unaware of the perverse effect, and impact minimization assumes that the projected impact is correct.
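A minimal sketch of that point (the actions and numbers are hypothetical, made up for illustration): an agent that penalizes predicted impact still computes that penalty inside its own causal model, so a mis-modeled action can look low-impact while being anything but.

```python
# For each hypothetical action: (value toward the goal, impact predicted by
# the agent's causal model, impact under the true causal structure).
actions = {
    "do_nothing":        (0.0, 0.0, 0.0),
    "adjust_thermostat": (0.3, 0.1, 0.1),
    "reroute_shipping":  (0.9, 0.2, 5.0),  # the model misses a downstream effect
}

PENALTY = 1.0  # weight on the impact term

def score(name):
    value, predicted_impact, _ = actions[name]
    return value - PENALTY * predicted_impact   # scored inside the agent's model

choice = max(actions, key=score)
value, predicted_impact, true_impact = actions[choice]
print(f"chosen: {choice}, predicted impact {predicted_impact}, true impact {true_impact}")
# The impact penalty comes from the same mistaken causal model that recommended
# the action, so the "low-impact" choice is exactly the one whose real-world
# impact the model failed to foresee.
```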

Conclusion 6 - Impact minimization strategies do not seem likely to fix the problems of causal mistakes.

Summary

The class of issues identified in the realm of Goodhart-like phenomena seems to illustrate both some potential advantages of the framing and some issues worth considering in AI safety. The problems identified in part simply restate problems that are already understood, but the framework seems worth further consideration. Most critically, a better understanding of causal mistakes and causal Goodhart effects would be particularly valuable. If the conclusions here are incorrect, understanding why would also be useful for understanding the ways in which AI risk can and cannot manifest.

Comments

Necromantic comment, sorry :P

I might be misinterpreting, but what I think you're saying is that if the humans make a mistake in using a causal model of the world and tell the AI to optimize for something bad-in-retrospect, this is "mistaken causal structure lead[ing] to regressional or extremal Goodhart", and thus not really causal Goodhart per se (by the categories you're intending). But I'm still a little fuzzy on what you mean to be actual factual causal Goodhart.

Is the idea that humans tell the AI to optimize for something that is not bad-in-retrospect, but in the process of changing the world the causal model the AI is using will move outside its domain of validity? Does this only happen if the AI's model of the world is lacking compared to the humans'?

Yes on point number 1, and partly on point number 2.

If humans don't have incredibly complete models for how to achieve their goals, but know they want a glass of water, telling the AI to put a cup of H2O in front of them can create weird mistakes. This can even happen because of causal connections the humans are unaware of. The AI might have better causal models than the humans, but still cause problems for other reasons. In this case, a human might not know the difference between normal water and heavy water, but the AI might decide that since there are two forms, it should have them present in equal amounts, which would be disastrous for reasons entirely beyond the understanding of the human who asked for the glass of water. The human needed to specify the goal differently, and was entirely unaware of what they did wrong - and in this case it would be months before the impacts of the weirdly-different-than-expected water showed up, so human-in-the-loop RL or other methods might not catch it.