I believe that most existing proposals for aligning AI with human values are unlikely to succeed in the limit of optimization pressure due to Goodhart's curse. I believe this strongly enough that it continues to surprise me a bit that people keep working on things that I think clearly won't work, though I think there are two explanations for this. One is that, unlike me, they expect to approach superhuman AGI slowly and so we will have many opportunities to notice when we are deviating from human values as a result of Goodhart's curse and make corrections. The other is that they are simply unaware of the force of the argument that convinces me because, although it has been written about before, I have not seen recent, pointed arguments for it rather than technical explanations of it and its effects, and my grokking of this point happened long ago on mailing lists of yore via more intuitive and less formal arguments than I see now. I can't promise to make my points as intuitive as I would like, but nonetheless I will try to address this latter explanation by saying a few words about why I am convinced.
Note: Some of this borrows heavily from a paper I have out for publication, but with substantial additions for readability by a wider audience.
Goodhart's Curse
Goodhart's curse is what happens when Goodhart's law meets the optimizer's curse. Let's review those two here briefly for completeness. Feel free to skip some of this if you are already familiar.
Goodhart's Law
As originally formulated, Goodhart's law says "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes". A more accessible expression of Goodhart's law, though, would be that when a measure of success becomes the target, it ceases to be a good measure. A well-known example of Goodhart's law comes from a program to exterminate rats in French-colonial Hanoi, Vietnam: the program paid a bounty for rat tails on the assumption that a rat tail represented a dead rat, but rat catchers would instead catch rats, cut off their tails, and release the rats so they could breed and produce new rats whose tails could be turned in for more bounties. There was a similar case in British-colonial India, where bounties for dead cobras, intended to incentivize reducing the cobra population, instead resulted in the creation of cobra farms. And of course we can't forget this classic, though apocryphal, tale:
In the old Soviet Union, the government rewarded factory managers for production quantity, not quality. In addition to ignoring quality, factory managers ignored customer needs. The end result was more and more tractors produced, for instance, even though these tractors just sat unused. Managers of factories that produced nails optimized production by producing either fewer larger and heavier nails or more smaller nails.
The fact that factories were judged by rough physical quotas rather than by their ability to satisfy customers – their customers were the state – had predictably bad results. If, to take a real case, a nail factory’s output was measured by number, factories produced large numbers of small pin-like nails. If output was measured by weight, the nail factories shifted to fewer, very heavy nails.
Although a joke, the natural endpoint might be the production of a single, giant nail. It's unknown, to the best of my knowledge, whether the nail example above is real, although reportedly something similar really did happen with shoes. Additional examples of Goodhart's law abound:
targeting easily-measured clicks rather than conversions in online advertising
optimizing for profits over company health in business
unintentionally incentivizing publication count over academic progress in academia
prioritizing grades and test scores over learning in schools
maximizing score rather than having fun in video games
As these examples demonstrate, most of us are familiar with Goodhart's law or something similar from everyday life, so it's not that surprising when we learn about it. The opposite seems to be true of the optimizer's curse: it is well studied but mostly invisible to us in daily life unless we take care to notice it.
The optimizer's curse
The optimizer's curse observes that when choosing among several possibilities, if we choose the option that is expected to maximize value, we will be "disappointed" (realize less than the expected value) more often than average. This happens because optimization acts as a source of bias in favor of overestimation, even if the estimated value of each option is not itself biased. And the curse is robust: even if an agent satisfices (settles for the lowest-valued option that still beats a neutral baseline) rather than optimizes, it will still suffer more disappointment than gratification. So each option's value can be estimated in an unbiased way, yet because selection favors the options with the highest estimates, we consistently end up picking options whose value is more likely to have been overestimated.
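To make this concrete, here's a minimal simulation of my own (a toy model, not anything rigorous): every option's estimate is unbiased, yet the option we pick because it looks best still disappoints on average.

```python
import random

def average_disappointment(num_options=10, noise=1.0, trials=100_000):
    """Average (estimated value - realized value) of the option we pick."""
    total = 0.0
    for _ in range(trials):
        true_values = [random.gauss(0, 1) for _ in range(num_options)]
        # Each estimate is unbiased: true value plus zero-mean noise.
        estimates = [v + random.gauss(0, noise) for v in true_values]
        best = max(range(num_options), key=lambda i: estimates[i])
        total += estimates[best] - true_values[best]
    return total / trials

if __name__ == "__main__":
    # Reliably prints a positive number: the chosen option underdelivers on average.
    print(f"average disappointment: {average_disappointment():.3f}")
```

Nothing in this toy is biased except the act of picking the maximum, and that alone is enough to produce systematic disappointment.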
The optimizer's curse has many opportunities to bite us. For example, a company trying to pick a project to invest in to earn the highest rate of return will consistently earn less return than predicted due to the optimizer's curse. Same goes for an investor picking investment instruments. Similarly a person trying to pick the best vacation will, on average, have a worse vacation than expected because the vacation that looks the best is more likely than the other options to be worse than predicted. And of course an AI trying to pick the policy that maximizes human value will usually pick a policy that performs worse than expected, but we'll return to that one later when we consider how it interacts with Goodhart.
I wish I had more, better examples of the optimizer's curse to offer you, especially documented real-world cases that are relatable, but most of what I can find seems to be about petroleum production (no, really!) or otherwise about managing and choosing among capital-intensive projects. The best I can offer you is this story from my own life about shoes:
For a long time, I wanted to find the best shoes. "Best" could mean many things, but basically I wanted the best shoe for all purposes. I wanted the shoe to be technically impressive, so it would have features like waterproofing, puncture-proofing, extreme durability, extreme thinness and lightness, the ability to be worn without socks, and breathability. I also wanted it to look classy and casual, able to mix and match with anything. You might say this is impossible, but I would have said you just aren't trying hard enough.
So I tried a lot of shoes. And in every case I was disappointed. One was durable but ugly, another was waterproof but made my feet smell, another looked good but was uncomfortable, and another was just too weird. The harder I tried to find the perfect shoe, the more I was disappointed. Cursed was I for optimizing!
This story isn't perfect: I was optimizing for multiple variables and making tradeoffs, and the solution was to find some set of tradeoffs I would be happiest with and to accept that I was mostly only going to move along the efficiency frontier rather than expand it by trying new shoes. So it teaches the wrong lesson unless we look at it through a very narrow lens. Better examples in the comments are deeply appreciated!
Before moving on, it's worth talking about attempts to mitigate the optimizer's curse. It would seem, since it is a systematic bias, that we could account for the optimizer's curse the same way we do most systematic biases: with better Bayesian reasoning. And we can, but in many cases this is difficult or impossible because we lack sufficient information about the underlying distributions to make the necessary corrections. Instead we find ourselves in a situation where we know our expectations are biased but cannot correct for them well enough to be sure we are no longer suffering from the bias, even when we try. Put another way, attempting to correct for the optimizer's curse without perfect information simply shifts the distortions from the original estimates to the corrections without eliminating the bias.
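Here is a toy sketch of what I mean (my own construction, assuming a simple normal model where each option's true value is drawn from a prior we may or may not know correctly): shrinking each estimate toward the prior mean removes the systematic disappointment only when the assumed prior is right; misjudge the prior and the bias reappears inside the correction.

```python
import random

def disappointment_after_correction(assumed_prior_var, true_prior_var=1.0,
                                    noise_var=1.0, num_options=10, trials=50_000):
    """Average (corrected estimate - realized value) of the option chosen after shrinkage."""
    total = 0.0
    for _ in range(trials):
        true_values = [random.gauss(0, true_prior_var ** 0.5) for _ in range(num_options)]
        estimates = [v + random.gauss(0, noise_var ** 0.5) for v in true_values]
        # Bayesian correction: shrink each estimate toward the prior mean (0),
        # using the *assumed* prior variance, which may not match reality.
        w = assumed_prior_var / (assumed_prior_var + noise_var)
        corrected = [w * e for e in estimates]
        best = max(range(num_options), key=lambda i: corrected[i])
        total += corrected[best] - true_values[best]
    return total / trials

if __name__ == "__main__":
    print("correct prior:     ", round(disappointment_after_correction(1.0), 3))  # roughly 0
    print("misspecified prior:", round(disappointment_after_correction(4.0), 3))  # clearly > 0
```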
Given how persistent the optimizer's curse is, it shouldn't surprise us that it pops up whenever we try to optimize for some measurable target, giving us Goodhart's curse.
Goodhart's curse
Combining Goodhart's law with the optimizer's curse, we get Goodhart's curse: attempts to optimize for a measure of success result in an increased likelihood of failing to hit the desired target. Or as someone on Arbital (probably Eliezer) put it: "neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V". In personal terms, you might say that the harder you try to get what you want, the more you'll find yourself doing things that cause you not to get what you want, despite trying to act otherwise. I think this point is unintuitive because it feels contrary to the normal narrative that success comes from trying, and that trying harder makes success more likely, but that might only appear to be true due to survivorship bias. To give you an intuitive feel for this personal expression of Goodhart's curse, another story from my life:
At some tender age, maybe around 11 or 12, I became obsessed with efficiency so I would have time to do more in my life.
There was the little stuff, like figuring out the "best" way to brush my teeth or get out of bed. There was the medium stuff, like finding ways to read faster or write without moving my hand as much. And there was the big stuff, like trying to figure out how to get by on less sleep and how to study topics in the optimal order. It touched everything, from shoe tying, to clothes putting on, to walking, to playing, to eating, and on and on. It was personal Taylorism gone mad.
To take a single example, let's consider the important activity of eating breakfast cereal and how that process can be made more efficient. There's the question of how to store the cereal, how to store the milk, how to retrieve the cereal and milk, how to pour the two into the bowl, how to hold the spoon, how to put the cereal in the mouth, how to chew, how to swallow, and how to clean up, to name just a few. Maybe I could save a few seconds if I held the spoon differently, or stored the cereal in a different container, or stored the milk on a different shelf in the refrigerator, or, or, or. By application of experimentation and observation I could get really good at eating cereal, saving maybe a minute or more off my daily routine!
Of course, this was out of a morning routine that lasted over an hour and included a lot of slack and waiting because I had three sisters and two parents and lived in a house with two bathrooms. But still, one whole minute saved!
By the time I was 13 or 14 I was over it. I had spent a couple of years working hard at efficiency, gotten little for it, and lost a lot in exchange. Doing all that efficiency work was hard, made things that were once fun feel like work, and, worst of all, wasn't delivering on the original purpose of doing more with my life. I had optimized for the measure—time to complete a task, number of motions to complete a task, etc.—at the expense of the target—getting more done. Yes, I was efficient at some things, but that efficiency was costing so much effort and willpower that I was worse off than if I had just ignored the kind of efficiency I was targeting.
In this story, as I did things that I thought would help me reach my target, I actually moved myself further away from it. Eventually it got bad enough that I noticed the divergence and was compelled to course correct, but this depended on my having known what the original target was in the first place. If I were not the optimizer, and instead, say, some impersonal apparatus like the state or an AI were, there's considerable risk the optimizer would have kept optimizing and diverging long after the divergence would have become clear to me. For an intuitive sense of how this has happened historically, I recommend Seeing Like a State.
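For a rough quantitative sense of the Arbital quote above, here's another small toy of my own (assuming the proxy U is simply the target V plus unbiased noise): the more candidates we search over, the further the winning candidate's proxy score diverges upward from its true value.

```python
import random

def average_divergence(num_candidates, trials=5_000, noise=1.0):
    """Average (U - V) of the candidate with the highest proxy score U."""
    gap = 0.0
    for _ in range(trials):
        true_v = [random.gauss(0, 1) for _ in range(num_candidates)]
        proxy_u = [v + random.gauss(0, noise) for v in true_v]  # U = V + unbiased error
        winner = max(range(num_candidates), key=lambda i: proxy_u[i])
        gap += proxy_u[winner] - true_v[winner]
    return gap / trials

if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        # The gap grows as the search gets more thorough.
        print(f"{n:4d} candidates -> average upward divergence of U from V: "
              f"{average_divergence(n):.2f}")
```

More optimization pressure means more divergence, which is exactly the worry when the optimizer never had access to the real target in the first place.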
I hope by this point you are convinced of the power and prevalence of Goodhart's curse (but if not please let me know your thoughts in the comments, especially if you have ideas about what could be said that would be convincing). Now we are poised to consider Goodhart's curse and its relationship to AI alignment.
Goodhart's curse and AI alignment
Let's suppose we want to build an AI that is aligned with human values. A high-level overview of a scheme for doing this is that we build an AI, check to see if it is aligned with human values so far, and then update it to be more aligned if it is not fully aligned already.
Although the details vary, this describes roughly the way methods like IRL and CIRL work, and possibly how HCH and safety via debate work in practice. Consequently, I think all of them will fail due to Goodhart's curse.
Caveat: I think HCH and debate-like methods may be able to work and avoid Goodhart's curse, though I'm not certain, and it would require careful design that I'm not sure the current work on these has done. I hope to have more to say on this in the future.
The way Goodhart's curse sneaks into these is that they all seek to apply optimization pressure to something observable that is not exactly the same thing as what we want. In the case of IRL and CIRL, it's an AI optimizing over inferred values rather than the values themselves. In HCH and safety via debate, it's a human preferentially selecting the AI that, based on observation, the human comes to believe does what it wants. So long as that observation step is there and we optimize based on observation, Goodhart's curse applies, and we can expect, with sufficient optimization pressure, that alignment will be lost, even and possibly especially without us noticing, because we're focused on the observable measure rather than the target.
Yikes!
Beyond Goodhart's curse
Do we have any hope of creating aligned AI if just making a (non-indifferent) choice based on an observation dooms us to Goodhart's curse?
Honestly, I don't know. I'm pretty pessimistic that we can solve alignment, yet in spite of this I keep working on it because I also believe it's the best chance we have. I suspect we may only be able to rule out solutions that are dangerous but not positively select for solutions that are safe, and may have to approach solving alignment by eliminating everything that won't work and then doing something in the tiny space of options we have left that we can't say for sure will end in catastrophe.
Maybe we can get around Goodhart's curse by applying so little optimization pressure that it doesn't happen? One proposal in this direction is quantilization. I remain doubtful, since without sufficient optimization it's not clear how we do better than picking at random.
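To illustrate the idea and my doubt (a minimal sketch of my own, assuming the usual framing of a quantilizer as sampling uniformly from the top q fraction of options ranked by the proxy, with a uniform base distribution over a finite set of candidates): milder selection does shrink the proxy/target gap, but it also gives up much of the value the proxy would have pointed us toward.

```python
import random

def evaluate(selector, num_options=100, trials=20_000, noise=1.0):
    """Return (average proxy-target gap, average true value) for a selection rule."""
    gap = true_total = 0.0
    for _ in range(trials):
        true_v = [random.gauss(0, 1) for _ in range(num_options)]
        proxy_u = [v + random.gauss(0, noise) for v in true_v]
        i = selector(proxy_u)
        gap += proxy_u[i] - true_v[i]
        true_total += true_v[i]
    return gap / trials, true_total / trials

def maximizer(proxy):
    # Full optimization pressure: take the single best-looking option.
    return max(range(len(proxy)), key=lambda i: proxy[i])

def quantilizer(proxy, q=0.1):
    # Mild optimization pressure: sample uniformly from the top q fraction by proxy score.
    ranked = sorted(range(len(proxy)), key=lambda i: proxy[i], reverse=True)
    return random.choice(ranked[: max(1, int(q * len(proxy)))])

if __name__ == "__main__":
    m_gap, m_val = evaluate(maximizer)
    q_gap, q_val = evaluate(quantilizer)
    print(f"maximizer:   gap = {m_gap:.2f}, true value achieved = {m_val:.2f}")
    print(f"quantilizer: gap = {q_gap:.2f}, true value achieved = {q_val:.2f}")
```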
Maybe we can get around Goodhart's curse by optimizing the target directly rather than a measure of it? Again, I remain doubtful, mostly due to epistemological issues suggesting all we ever have are observations and never the real thing itself.
Maybe we can overcome either or both issues via pragmatic means that negate enough of the problem that, although we don't actually eliminate Goodhart's curse completely, we eliminate enough of its effect that we can ignore it? Given the risks and the downsides I'm not excited about this approach, but it may be the best we have.
And, if all that wasn't bad enough, Goodhart's curse isn't even the only thing we have to watch out for! Scott Garrabrant and David Manheim have renamed Goodhart's curse to "Regressional Goodhart" to distinguish it from other forms of Goodharting where mechanisms other than optimization may be responsible for divergence from the target. The only reason I focus on Goodhart's curse is that it's the way proposed alignment schemes usually fail; other safety proposals may fail via other Goodharting effects.
All this makes it seem extremely likely to me that we aren't even close to solving AI alignment yet, to the point that we likely haven't even stumbled upon the general mechanism that will work, or if we have we haven't identified it as such. Thus, if there's anything upbeat I can end this on, it's that there's vast opportunity to do good for the world via work on AI safety.