Formalising Catastrophic Goodhart
This sequence partially overlaps with our paper Extinction Risks from AI: Invisible to Science?.
In this sequence of posts, I will try to lay out some of my thoughts on extreme versions of over-optimisation and their relation to extinction risk from AI.
Some ways in which I hope this will be useful, roughly in order of importance:
- This seems important for advocacy. Some of these findings indicate that it might be exceedingly difficult to create good formal models that show that extinction from AI is a threat --- even if it is. At the same time, it seems there are quite a few people who are sceptical of AI risk precisely because of the inability to give a rigorous description of the threat model. Consequently, I think it should be possible to re-package some of the ideas to be (possibly very) useful for advocacy. But I am no expert on this, so this is where others can help.
- Introducing some simple concepts that could improve alignment discussions. In particular, I quite like the notion of a weak Extinction-level Goodhart's Law, informally defined as "if you optimise an imperfectly aligned goal sufficiently hard, you die" (see the illustrative sketch after this list). This is because you might agree with (some version of) this law, yet be sceptical about AI risk (because you expect "sufficiently hard" to be unreachable in practice).
- Pointing towards a research agenda. Some of the notions I discuss are, at the moment, only informally defined. However, I think the posts point at a research agenda on "formalising Catastrophic Goodhart's Law", in the sense that they can be read as "If we ever get a formal theory of this topic, here is what some of the important bits -- examples, observations, results, and theorems -- are going to be. Now please fill in the details."
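To illustrate the flavour of formalisation I have in mind, here is one way the weak law could be written down. This is only a sketch, and the symbols are placeholders I am introducing here rather than definitions taken from the posts: $U^{*}$ stands for the intended goal, $U$ for the imperfectly aligned proxy that actually gets optimised, $\mathrm{opt}_{U}(\pi)$ for some measure of how hard a policy $\pi$ optimises $U$, and $\textsf{extinction}(\pi)$ for the claim that acting according to $\pi$ leads to an extinction-level outcome:

$$U \neq U^{*} \;\Longrightarrow\; \exists\, \theta \;\, \forall \pi :\; \mathrm{opt}_{U}(\pi) \ge \theta \,\Rightarrow\, \textsf{extinction}(\pi).$$

Read this way, the weak law is a purely existential claim about the threshold $\theta$, while the quantitative question is whether $\theta$ is low enough to be reached by systems we will actually build. This is the distinction drawn in the first post listed below.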
At the moment, the sequence contains the following posts:
- Weak vs Quantitative Extinction-level Goodhart's Law, which suggests distinguishing between the theoretical case of arbitrarily powerful over-optimisation and whatever degree of over-optimisation we should expect in practice.
- Extinction Risks from AI: Invisible to Science?, a linkpost to a paper which suggests that making a formal argument for AI risk, with the rigour we are used to in mathematics or computer science, might require models that are too complex to analyse rigorously. The paper does this by observing that if a model is to be informative for the purpose of investigating a specific argument for AI risk, it must satisfy certain properties. However, these properties seem to put a lower bound on the model's complexity.
- Which Model Properties are Necessary for Evaluating an Argument? and Dynamics Crucial to AI Risk Seem to Make for Complicated Models, two posts that zoom in on, respectively, the methodology and the key argument of the paper.
- Extinction-level Goodhart's Law as a Property of the Environment, which also argues that computer science and ML might have a fundamental blind spot regarding Extinction-level Goodhart's Law. However, this post approaches the topic from a different angle than the paper above: it starts by semi-formally defining Extinction-level Goodhart's Law (EGL), and then conjectures that EGL is false (with some caveats) in any environment that is simple enough to be amenable to formal analysis.