Daniel Filan's bottle cap example was featured prominently in "Risks from Learned Optimization" for good reason. I think it is a really clear and useful example of why you might want to care about the internals of an optimization algorithm and not just its behavior, and it helped motivate that framing in the paper.
Note that Abram Demski deserves a large part of the credit for that specific example (somewhere between 'half' and 'all'), as noted in the final sentence of the post.
A reminder, since this looks like it has a few upvotes from AF users: posts need 2 nominations to proceed to the review round.
Review by the author:
I continue to endorse the contents of this post.
I don't really think about the post that much, but the post expresses a worldview that shapes how I do my research - that agency is a mechanical fact about the workings of a system.
To me, the main contribution of the post is setting up a question: what's a good definition of optimisation that avoids the counterexamples of the post? Ideally, this definition would refer or correspond to the mechanistic properties of the system, so that people could somehow statically determine whether a given controller was an optimiser. To the best of my knowledge, no such definition has been developed. As such, I see the post as not having kicked off a fruitful public conversation, and its value, if any, lies in how it has changed the way other people think about optimisation.
I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets, and that the naive definition seems to work with one modification:
A system is downstream from an optimizer of some objective function to the extent that that objective function attains much higher values than would be attained if the system didn't exist, or were doing some other random thing.
I'm surprised nobody has yet replied that the two examples are both products of significant optimizers with relevant optimization targets.
Yes, this seems pretty important and relevant.
That being said, I think that that definition suggests that natural selection and/or the earth's crust are downstream from an optimiser of the number of Holiday Inns, or that my liver is downstream from an optimiser of my income, neither of which is right.
Probably it's important to relate 'natural subgoals' to some ideal definition - which offers some hope, since 'subgoal' is really a computational notion, so maybe investigation along these lines would offer a more computational characterisation of optimisation.
[EDIT: I made this comment longer and more contentful]
Okay, so another necessary condition for being downstream from an optimizer is being causally downstream. I'm sure there are other conditions, but the claim still feels like an important addition to the conversation.
I think my syntax/semantics idea is relevant to this question - especially the idea of different sets of environments. https://www.lesswrong.com/posts/EEPdbtvW8ei9Yi2e8/bridging-syntax-and-semantics-empirically
For example, suppose we have a super-intelligent bottle cap, dedicated to staying on the bottle (and with some convenient arms and manufacturing capability). This seems to be exactly an optimiser, one that we mere humans cannot expect to be able to get off the bottle.
In contrast, the standard bottle cap will only remain on the bottle in a much narrower set of circumstances (though the superintelligent bottle cap will also remain on in those circumstances).
So it seems that what distinguishes the standard bottle cap from a genuine optimiser is that the genuine optimiser will accomplish its role in a much larger set of (possibly antagonistic) environments, while the standard bottle cap will only do so in a much smaller set of circumstances.
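To make that distinction concrete, here is a minimal sketch (my own illustration with hypothetical names, not something from the linked post) of scoring a system by how robustly its target condition holds across a distribution of perturbed, possibly antagonistic environments:

```python
import random

def robustness(system, target_holds, environments, trials=1000):
    """Estimate the fraction of sampled environments in which the target
    condition still holds after the system acts in them. A standard bottle
    cap succeeds only on a narrow family of environments; the superintelligent
    cap succeeds on a much wider, possibly antagonistic, family."""
    successes = 0
    for _ in range(trials):
        outcome = system(random.choice(environments))  # run the system in a sampled environment
        successes += bool(target_holds(outcome))
    return successes / trials
```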
It's helped me hone my thinking on what is and isn't an optimiser (and a wireheader, and so on, for associated concepts).
Crossposted from my blog.
One thing I worry about sometimes is people writing code with optimisers in it, without realising that that's what they were doing. An example of this: suppose you were doing deep reinforcement learning, using optimisation to select a controller (that is, a neural network that takes a percept and returns an action) that generated high reward in some environment. Alas, unknown to you, this controller actually did optimisation itself to select actions that score well according to some metric that so far has been closely related to your reward function. In such a scenario, I'd be wary about your deploying that controller, since the controller itself is doing optimisation which might steer the world into a weird and unwelcome place.
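To illustrate the worry (a minimal sketch with hypothetical names, not the actual structure of any deep RL system), compare a controller that is just a fixed mapping from percepts to actions with one that internally searches over actions against its own proxy metric:

```python
def lookup_controller(percept, table):
    """A plain learned policy: maps each percept to an action via a fixed table."""
    return table[percept]

def searching_controller(percept, candidate_actions, proxy_metric, world_model):
    """A controller that does optimisation internally: it simulates each candidate
    action with its own world model and returns whichever action scores highest
    on its own proxy metric, which may only have tracked your reward so far."""
    return max(candidate_actions, key=lambda a: proxy_metric(world_model(percept, a)))
```

The two can look identical from the outside while training reward is high, which is why you would like to be able to tell them apart by inspecting the controller itself.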
In order to avoid such scenarios, it would be nice if one could look at an algorithm and determine if it was doing optimisation. Ideally, this would involve an objective definition of optimisation that could be checked from the source code of the algorithm, rather than something like "an optimiser is a system whose behaviour can't usefully be predicted mechanically, but can be predicted by assuming it near-optimises some objective function", since such a definition breaks down when you have the algorithm's source code and can compute its behaviour mechanically.
You might think about optimisation as follows: a system is optimising some objective function to the extent that that objective function attains much higher values than would be attained if the system didn't exist, or were doing some other random thing. This type of definition includes those put forward by Yudkowsky and Oesterheld. However, I think there are crucial counterexamples to this style of definition.
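As a rough operationalisation of that behavioural definition (a sketch of my own, with hypothetical names like `run_world` and `random_systems`), you would score a system by how much higher the objective ends up with it in place than with a random replacement:

```python
import random

def optimisation_score(objective, run_world, system, random_systems, trials=100):
    """Objective value of the world with the system in place, minus the average
    objective value when the system is swapped out for a randomly chosen
    alternative behaviour."""
    with_system = objective(run_world(system))
    baseline = sum(
        objective(run_world(random.choice(random_systems))) for _ in range(trials)
    ) / trials
    return with_system - baseline  # large and positive => "optimiser", on this account
```

The counterexamples below are cases where this score comes out high even though, intuitively, the system isn't optimising anything.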
Firstly, consider a lid screwed onto a bottle of water. If not for this lid, or if the lid had a hole in it or were looser, the water would likely exit the bottle via evaporation or being knocked over, but with the lid, the water stays in the bottle much more reliably than otherwise. As a result, you might think that the lid is optimising the water remaining inside the bottle. However, I claim that this is not the case: the lid is just a rigid object designed by some optimiser that wanted water to remain inside the bottle.
This isn't an incredibly compelling counterexample, since it doesn't qualify as an optimiser according to Yudkowsky's definition: it can be described more simply as a rigid object of a certain shape than as an optimiser, so by that definition it isn't one. I am somewhat uncomfortable with this move (surely systems that are sub-optimal in complicated ways that are easily predictable from their source code should still count as optimisers?), but it's worth coming up with another counterexample to which this objection won't apply.
Secondly, consider my liver. It's a complex physical system that's hard to describe, but if it were absent or behaved very differently, my body wouldn't work, I wouldn't remain alive, and I wouldn't be able to make any money, meaning that my bank account balance would be significantly lower than it is. In fact, subject to the constraint that the rest of my body works in the way that it actually works, it's hard to imagine what my liver could do which would result in a much higher bank balance. Nevertheless, it seems wrong to say that my liver is optimising my bank balance, and more right to say that it "detoxifies various metabolites, synthesizes proteins, and produces biochemicals necessary for digestion"---even though that gives a less precise account of the liver's behaviour.
In fact, my liver's behaviour has something to do with optimising my income: it was created by evolution, which was sort of an optimisation process for agents that reproduce a lot, which has a lot to do with me having a lot of money in my bank account. It also sort of optimises some aspects of my digestion, which is a necessary sub-process of me getting a lot of money in my bank account. This explains the link between my liver function and my income without having to treat my liver as a bank account funds maximiser.
What's a better theory of optimisation that doesn't fall prey to these counterexamples? I don't know. That being said, I think that such a theory should involve the internal details of the algorithms implemented by those physical systems. For instance, I think of gradient ascent as an optimisation algorithm because I can tell that at each iteration, it improves on its objective function a bit. Ideally, with such a definition you could decide whether an algorithm was doing optimisation without having to run it and see its behaviour, since one of the main points of a definition of optimisation is to help you avoid running systems that do it.
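As a concrete illustration of the gradient ascent point (a minimal sketch, not code from this post), the relevant property is visible in the source itself: for a small enough step size on a smooth objective, each iteration moves the parameter in a direction that locally increases the objective, and you can see that without running it on any particular problem.

```python
def gradient_ascent(grad, x0, step_size=0.01, iterations=1000):
    """Repeatedly nudge x a small step along the gradient of the objective."""
    x = x0
    for _ in range(iterations):
        x = x + step_size * grad(x)  # for small steps, this locally increases the objective
    return x

# Example: maximising f(x) = -(x - 3)**2, whose gradient is -2 * (x - 3).
best_x = gradient_ascent(lambda x: -2 * (x - 3), x0=0.0)  # tends towards x = 3
```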
Thanks to Abram Demski, who came up with the bottle-cap example in a conversation about this idea.