Let me try to repair Goodhart's law to avoid these problems:
On general statistical grounds, we should expect two random variables to be uncorrelated unless there's a "good reason" to expect them to be correlated. Goodhart's law says that if U and V are correlated in some distribution, then: (1) if a powerful optimizer tries to maximize U, it will by default go far out of the distribution; (2) the mere fact that U and V were correlated in the distribution does not in itself constitute a "good reason" to expect them to be correlated far out of the distribution, so by default they won't be; (3) therefore we expect Goodhart's law "by default": you optimize U, thus go out of the distribution, thus break the correlation between U and V, and then V regresses back down to its mean.
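A toy simulation of steps (1) to (3), as a sketch only. It assumes a particular mechanism that the argument above does not commit to: a hidden common cause `z` makes U and V correlated in the distribution, and the optimizer raises U through a direct `push` that leaves `z` untouched. All names and numbers here are hypothetical.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# In distribution: a hidden common cause z drives both U and V.
z = [rng.gauss(0, 1) for _ in range(n)]
u_in = [zi + rng.gauss(0, 0.3) for zi in z]
v_in = [zi + rng.gauss(0, 0.3) for zi in z]

# Out of distribution: the optimizer adds a large push to U directly,
# which (by assumption) does nothing to z and hence nothing to V.
push = [rng.uniform(0, 50) for _ in range(n)]
u_out = [ui + pi for ui, pi in zip(u_in, push)]
v_out = v_in

in_corr = pearson(u_in, v_in)
out_corr = pearson(u_out, v_out)

# Mean of V among the most heavily optimized samples (U above 40).
v_top = [v for u, v in zip(u_out, v_out) if u > 40]
mean_v_top = sum(v_top) / len(v_top)

print(f"in-distribution corr(U, V):     {in_corr:.2f}")
print(f"out-of-distribution corr(U, V): {out_corr:.2f}")
print(f"mean V where U > 40:            {mean_v_top:.2f}")
```

In this toy model the correlation is strong in the distribution and nearly vanishes once the push dominates U, while V among the highest-U samples sits close to its in-distribution mean of 0 rather than following U upward.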
So then we can start going through examples:
Maybe I'm missing something, but this seems already captured by the normal notion of what Goodharting is in that it's about deviation from the objective, not the direction of that deviation.
The idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting the way it's used in practice.
After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".
Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting.
It's a bit hard to think of when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left with regret that I put in too much effort.
Most examples I can think of that look like potential anti-Goodharting seem the same: I don't mind that I overshot the target, but I do mind that I wasn't as efficient as I could have been.
There are two aspects to the Goodhart problem which are often conflated. One is trivially true for all proxy-true utility pairs; but the other is not.
Following this terminology, we'll say that V is the true goal, and U is the proxy. In the range of circumstances we're used to, U≈V; that's what makes U a good proxy. Then the Goodhart problem has two aspects to it:
1. Maximising U does not increase V as much as maximising V would.
2. When strongly maximising U, V starts to increase at a slower rate, and ultimately starts decreasing.
Aspect 1 is a tautology: the best way to maximise V is to... maximise V. Hence maximising U is almost certainly less effective at increasing V than maximising V directly.
But aspect 2 is not a tautology, and need not be true for generic proxy-true utility pairs (U,V). For instance, some pairs have the reverse Goodhart problem:
Are there utility functions that have anti-Goodhart problems? Yes, many. If (U,V) have a Goodhart problem, then (U,V′) has an anti-Goodhart problem if V′=2U−V.
Then in the range of circumstances we're used to, U=2U−U≈2U−V=V′. And, as V starts growing slower than U, V′ starts growing faster; when V starts decreasing, V′ starts growing more than twice as fast as U.
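A small numerical sketch of the construction V′=2U−V. The specific `true_v` below (which tracks U near 0, peaks at x=10, then declines) is a hypothetical example chosen for illustration; only the relation V′=2U−V comes from the text.

```python
def proxy_u(x: float) -> float:
    """Proxy utility U: here simply the amount of optimisation x."""
    return x

def true_v(x: float) -> float:
    """Hypothetical Goodhart-style V: V ≈ U near 0, then slows and declines."""
    return x - 0.05 * x * x

def reverse_v(x: float) -> float:
    """V' = 2U - V, the mirror image of V around U."""
    return 2 * proxy_u(x) - true_v(x)

def slope(f, x: float, h: float = 1e-6) -> float:
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# U has slope 1 everywhere, so these slopes are growth rates relative to U.
for x in (0.1, 5.0, 15.0):
    print(f"x={x:5.1f}  dV/dx={slope(true_v, x):6.2f}  dV'/dx={slope(reverse_v, x):6.2f}")
```

Since dV′/dU = 2 − dV/dU, any shortfall of V relative to U shows up as an equal surplus in V′: once V's slope is negative, V′'s slope exceeds 2.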
Are there more natural utility functions that have anti-Goodhart problems? Yes. One example is a total or average utilitarian who maximises the proxy "do the best for the worst off". In general, if V is your true utility and U is a prioritarian/conservative version of V (eg U=−e^(−V) or U=log(V) or other concave, increasing functions) then we have reverse Goodhart behaviour[1].
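To see the concave-proxy case numerically, here is a minimal sketch using U=1−e^(−V). The shift by 1 is an assumption made here so that U≈V near V=0 (the kind of rescaling the footnote asks for); the function names are hypothetical.

```python
import math

def proxy_u(v: float) -> float:
    """Concave, increasing proxy: U = 1 - exp(-V), so U ≈ V near V = 0."""
    return 1.0 - math.exp(-v)

def v_from_u(u: float) -> float:
    """True utility V reached when the proxy sits at u (inverse of proxy_u, u < 1)."""
    return -math.log(1.0 - u)

# As the proxy is pushed towards its upper bound of 1, V grows ever faster.
for u in (0.01, 0.5, 0.9, 0.99):
    print(f"U={u:4.2f}  V={v_from_u(u):5.2f}")
```

Near the typical range (small U) the two agree closely, but each further increment of U corresponds to a larger and larger gain in V, which is the reverse of the usual Goodhart pattern.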
So saying that we expect Goodhart problems (in the second sense) means that we know something special about V (and U). It's not a generic problem for all utility functions, but for the ones we expect to correspond to human preferences.
[1] We also need to scale the proxy so that V≈U on the typical range of circumstances; thus the conservatism of U is only visible away from the typical range.