In the post introducing mesa optimization, the authors defined an optimizer as

a system [that is] internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

The paper continues by defining a mesa optimizer as an optimizer that was selected by a base optimizer.

However, there are a number of issues with this definition, as some have already pointed out.

First, I think by this definition humans are clearly not mesa optimizers. Most optimization we do is implicit. Yet humans are supposed to be the prototypical examples of mesa optimizers, which appears to be a contradiction.

Second, the definition excludes perfectly legitimate examples of inner alignment failures. To see why, consider a simple feedforward neural network trained by deep reinforcement learning to navigate my Chests and Keys environment. Since "go to the nearest key" is a good proxy for getting the reward, the neural network simply returns the action that, given the board state, moves the agent closer to the nearest key.

Is the feedforward neural network optimizing anything here? Hardly; it's just applying a heuristic. Note that you don't need anything like an internal A* search to find keys in a maze, because in many environments, following a wall until the key is within sight, and then performing a very shallow search (which doesn't have to be explicit), can work fairly well.
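To make this concrete, here is a minimal sketch (in Python) of the kind of heuristic described above. The state encoding, helper names, and action set are made up for illustration; this is not the actual Chests and Keys implementation, just a demonstration that key-seeking behavior needs no plan enumeration and no explicitly represented objective.

```python
# Hypothetical grid actions, encoded as (dx, dy) steps.
UP, DOWN, LEFT, RIGHT = (0, -1), (0, 1), (-1, 0), (1, 0)

def manhattan(a, b):
    """Grid distance between two (x, y) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def step_towards(pos, target):
    """One greedy step that reduces the distance to `target`."""
    dx, dy = target[0] - pos[0], target[1] - pos[1]
    if abs(dx) >= abs(dy):
        return RIGHT if dx > 0 else LEFT
    return DOWN if dy > 0 else UP

def heuristic_policy(agent_pos, visible_keys, wall_direction):
    """If no key is visible, keep moving along the wall; otherwise take a
    greedy step toward the nearest visible key. No search over plans and
    no objective function appear anywhere."""
    if not visible_keys:
        return wall_direction
    nearest = min(visible_keys, key=lambda k: manhattan(agent_pos, k))
    return step_towards(agent_pos, nearest)

# Example: a key is visible two cells to the right and one cell down.
action = heuristic_policy((1, 1), visible_keys=[(3, 2)], wall_direction=RIGHT)  # -> RIGHT
```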

As far as I can tell, Hjalmar Wijk introduced the term "malign generalization" to describe the failure mode that I think is most worth worrying about here. In particular, malign generalization happens when you train a system on objective function X, but at deployment it actually ends up doing Y, where Y is so bad that we'd prefer the system to fail completely. To me at least, this seems like a far more intuitive and less theory-laden way of framing inner alignment failures.

This way of reframing the issue allows us to keep the old terminology, namely that the concern is capability robustness without alignment robustness, while dropping all unnecessary references to mesa optimization.

Mesa optimizers could still form a natural class of things that are prone to malign generalization. But if even humans are not mesa optimizers, why should we expect mesa optimizers to be the primary real world examples of such inner alignment failures?

I think this is one of the major remaining open questions wrt inner alignment. Personally, I think there is a meaningful sense in which all the models I'm most worried about do some sort of search internally (at least to the same extent that humans do search internally), but I'm definitely uncertain about that. If true, though, it could be quite helpful for solving inner alignment, since it could enable us to factor models into pieces (either through architecture or transparency tools). Also:

As far as I can tell, Hjalmar Wijk introduced the term "malign generalization" to describe the failure mode that I think is most worth worrying about here.

Hjalmar actually cites this post by Paul Christiano as the source of that term—though Hjalmar's usage is slightly different.

I’m sympathetic to what I see as the message of this post: that talk of mesa-optimisation is too specific given that the practical worry is something like malign generalisation. I agree that it makes extra assumptions on top of that basic worry, which we might not want to make. I would like to see more focus on inner alignment than on mesa-optimisation as such. I’d also like to see a broader view of possible causes for malign generalisation, which doesn’t stick so closely to the analysis in our paper. (In hindsight our analysis could also have benefitted from taking a broader view, but that wasn’t very visible at the time.)

At the same time, speaking only in terms of malign generalisation (and dropping the extra theoretical assumptions of a more specific framework) is too limiting. I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment. I’m not sure that mesa-optimisation is the right view for that, but I do think that the right view will have something to do with goal-directedness.

I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment.

Even taking that as an assumption, it seems like if we accept that "mesa optimizer" doesn't work as a description of humans, then mesa optimization can't be the right view, and we should retreat to malign generalization while trying to figure out a better view.

We’re probably in agreement, but I’m not sure what exactly you mean by “retreat to malign generalisation”.

For me, mesa-optimisation’s primary claim isn’t the claim (call it Optimisers) that agents are well-described as optimisers; that one I’m happy to drop. It is the claim (call it Mesa≠Base) that, whatever the right way to describe them is, their intrinsic goals are in general distinct from the reward.

That’s a specific (if informal) claim about a possible source of malign generalisation: when intrinsic goals differ arbitrarily from the reward, systems that competently pursue them may lead to outcomes that are arbitrarily bad according to the reward. Humans don’t pose a counterexample to that, and it seems prima facie conceptually clarifying, so I wouldn’t throw it away. I’m not sure if you propose to do that, but strictly, that’s what "retreating to malign generalisation" could mean, as malign generalisation itself makes no reference to goals.

One might argue that until we have a good model of goal-directedness, Mesa≠Base reifies goals more than is warranted, so we should drop it. But I don’t think so – so long as one accepts goals as meaningful at all, the underlying model need only admit a distinction between the goal of a system and the criterion according to which a system was selected. I find it hard to imagine a model or view that wouldn’t allow this – this makes sense even in the intentional stance, whose metaphysics for goals is pretty minimal.

It’s a shame that Mesa≠Base is so entangled with Optimisers. When I think of mesa-optimisation, I tend to think more about the former than about the latter. I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers. The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.

From my perspective, there are three levels:

  • Most general: The inner agent could malignly generalize in some arbitrary bad way.
  • Middle: The inner agent malignly generalizes in such a way that it makes sense to call it goal-directed, and the mesa-goal (= intentional-stance-goal) is different from the base-goal.
  • Most specific: The inner agent encodes an explicit search algorithm, an explicit world model, and an explicit utility function (a minimal code sketch of this case follows the list).
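To make the most specific case concrete, here is a minimal sketch of an agent with an explicit search algorithm, an explicit world model, and an explicit utility function. Everything here (the grid dynamics, the goal, the horizon) is an illustrative placeholder, not anything from the paper.

```python
from itertools import product

# Hypothetical deterministic grid world; all names and values are illustrative.
ACTIONS = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}
GOAL = (4, 4)  # stand-in for whatever the explicit objective cares about

def world_model(state, action):
    """Explicit transition model: predicts the next state for an action."""
    dx, dy = ACTIONS[action]
    return (state[0] + dx, state[1] + dy)

def utility(state):
    """Explicit, internally represented objective over predicted outcomes."""
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))

def explicit_search_policy(state, horizon=3):
    """Enumerate every action sequence up to `horizon`, roll each one out
    through the world model, score the result with the utility function,
    and return the first action of the best-scoring plan."""
    best_plan, best_value = None, float("-inf")
    for plan in product(ACTIONS, repeat=horizon):
        outcome = state
        for action in plan:
            outcome = world_model(outcome, action)
        value = utility(outcome)
        if value > best_value:
            best_plan, best_value = plan, value
    return best_plan[0]

# e.g. explicit_search_policy((0, 0)) returns a move toward GOAL.
```

The contrast with a purely heuristic policy is the point: here the objective and the search over plans exist as separate, inspectable pieces.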

I worry about the middle case. It seems like upon reading the mesa optimizers paper, most people start to worry about the last case. I would like people to worry about the middle case instead, and test their proposed solutions against that. (Well, ideally they'd test it against the most general case, but if it doesn't work against that, which it probably won't, that isn't necessarily a deal breaker.) I feel better about people accidentally worrying about the most general case, rather than people accidentally worrying about the most specific case.

The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.

I like "inner alignment", and am not sure why you think it isn't specific enough.

I think we basically agree. I would also prefer people to think more about the middle case. Indeed, when I use the term mesa-optimiser, I usually intend to talk about the middle picture, though strictly that’s sinful as the term is tied to Optimisers.

Re: inner alignment

I think it’s basically the right term. I guess in my mind I want to say something like, “Inner Alignment is the problem of aligning objectives across the Mesa≠Base gap”, which shows how the two have slightly different shapes. But the difference isn’t really important.

Inner alignment gap? Inner objective gap?

I’m not sure what exactly you mean by “retreat to malign generalisation”.

When you don't have a deep understanding of a phenomenon, it's common to use some empirical description of what you're talking about, rather than using your current (and incorrect) model to interpret the phenomenon. The issue with using your current model is that it leads you to make incorrect inferences about why things happen, because you're relying too heavily on the model being internally correct.

Therefore, until we gain a deeper understanding, it's better to use the pre-theoretical description of what we're talking about. I'm assuming that's what Rohin meant by "retreat to malign generalization."

This is important because if we used the definition given in the paper, then this could affect which approaches we use to address inner alignment. For instance, we could try using some interpretability technique to discover the "objective" that a neural network was maximizing. But if our model of the neural network as an optimizer is ultimately incorrect, then the neural network won't have an explicit objective, making this technique very difficult.

I understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory.

In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.

I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers.

I think it's fairly easy to point out the problem using an alternative definition. If we just change the definition of mesa optimizer to reflect that we're using the intentional stance (in other words, we're interpreting the neural network as having goals, whether it's using an internal search or not), the mesa!=base description falls right out, and all the normal risks about building mesa optimizers still apply.

I’m not talking about finding an optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.

As a side note I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves. So, there is more to the concepts than just taking an interpretative stance, though of course to the extent that the concepts (even when refined by neuroscience) are pieces of a model being used to understand the world, they will form part of an interpretative stance.

I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.

I'm not sure what's unsatisfying about the characterization I gave? If we just redefined optimizer to mean an interpretation of the agent's behavior, specifically, that it abstractly pursues goals, why is that an unsatisfying way of showing the mesa != base issue?

ETA:

The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves.

I agree. And the relevance this plays is that in future systems that might experience malign generalization, we would want some model of how goals play a role in their architecture, because this could help us align the system. But until we have such architectures, or until we have models for how those future systems should behave, we should work abstractly.

First, I think by this definition humans are clearly not mesa optimizers.

I'm confused/unconvinced. Surely the 9/11 attackers, for example, must have been "internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system"? Can you give some examples of humans being highly dangerous without having done this kind of explicit optimization?

As far as I can tell, Hjalmar Wijk introduced the term “malign generalization” to describe the failure mode that I think is most worth worrying about here.

Can you give some realistic examples/scenarios of "malign generalization" that does not involve mesa optimization? I'm not sure what kind of thing you're actually worried about here.

Surely the 9/11 attackers, for example, must have been "internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system"?

ETA: I agree that if someone were to e.g. write a spreadsheet of all the things they could do, write down the costs of those actions, and then choose the one with the lowest cost, this would certainly count. And maybe terrorist organizations do a lot of deliberation that meets this kind of criterion. But I am responding to the more typical type of human action: walking around, seeking food, talking to others, working at a job.

There are two reasons why we might model something as an optimizer. The first reason is that we know that it is internally performing some type of search over strategies in its head, and then outputting the strategy that ranks highest under some explicit objective function. The second reason is that, given our ignorant epistemic state, our best model of that object is that it is optimizing some goal. We might call the second case the intentional stance, following Dennett.

If we could show that the first case was true in humans, then I would agree that humans would be mesa optimizers. However, my primary objection is that we could have better models of what the brain is actually doing. It's often the case that when you don't know how something works, the best way of understanding it is by modeling it as an optimizer. However, once you can look inside and see what's going on, this way of thinking gives way to better models that take into account the specifics of its operation.

I suspect that human brains are well modeled as optimizers from the outside, but that this view falls apart when considering specific cases. When the brain makes a decision, it usually considers at most three or four alternatives for each action it does. Most of the actual work is therefore done at the heuristics stage, not the selection part. And even at the selection stage, I have little reason to believe that it is actually comparing alternatives against an explicit objective function.

But since this is all a bit vague, and hard to see in the case of humans, I can provide the analogy that I gave in the post above.

At first glance, someone who looked at the agent in the Chests and Keys environment would assume that it was performing an internal search, and then selecting the action that ranked highest in its preference ordering, where its preference ordering was something like "more keys is better." This would be a good model, but we could still do better.

In fact, the only selection that's really happening is at the last stage of the neural network, when the max function is being applied over its output layer. Otherwise, all it's really doing is applying a simple heuristic: if there are no keys on the board, move along the wall; otherwise, move towards the key currently in sight. Since this can all be done in a simple feedforward neural network, I find it hard to see why the best model of its behavior should be an optimizer.
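To illustrate, here is a minimal sketch of such a feedforward policy, with random placeholder weights and made-up layer sizes. The only "selection" anywhere is the final argmax over a fixed output layer; there is no enumeration of plans and no internally represented objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a flattened board observation, one hidden layer, 4 actions.
OBS_DIM, HIDDEN_DIM, N_ACTIONS = 25, 32, 4
W1 = rng.normal(size=(OBS_DIM, HIDDEN_DIM))    # placeholder weights; a trained
W2 = rng.normal(size=(HIDDEN_DIM, N_ACTIONS))  # net would encode the heuristic

def policy(obs):
    """One forward pass: fixed matrix multiplies, a ReLU, then an argmax over
    the four action logits. That argmax is the only 'selection' step."""
    hidden = np.maximum(0.0, obs @ W1)
    logits = hidden @ W2
    return int(np.argmax(logits))

action = policy(rng.normal(size=OBS_DIM))  # an integer in {0, 1, 2, 3}
```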

When the brain makes a decision, it usually considers at most three or four alternatives for each action it does. Most of the actual work is therefore done at the heuristics stage, not the selection part. And even at the selection stage, I have little reason to believe that it is actually comparing alternatives against an explicit objective function.

Assuming this, it seems to me that the heuristics are being continuously trained by the selection stage, so that is the most important part even if heuristics are doing most of the immediate work in making each decision. And I'm not sure what you mean by "explicit objective function". I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function and instead only counting a symbolically represented function as "explicit"? If so, why would not being "explicit" disqualify humans as mesa optimizers? If not, please explain more what you mean?

Since this can all be done in a simple feedforward neural network, I find it hard to see why the best model of its behavior should be an optimizer.

I take your point that some models can behave like an optimizer at first glance but if you look closer it's not really an optimizer after all. But this doesn't answer my question: "Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here."

ETA: If you don't have a realistic example in mind, and just think that we shouldn't currently rule out the possibility that a non-optimizer might generalize in a way that is more dangerous than total failure, I think that's a good thing to point out too. (I had already upvoted your post based on that.)

I guess the objective function is encoded in the connections/weights of some neural network. Are you not counting that as an explicit objective function and instead only counting a symbolically represented function as "explicit"?

If the heuristics are continuously being trained, and this is all happening by comparing things against some criterion that's encoded within some other neural network, I suppose that's a bit like saying that we have an "objective function." I wouldn't call it explicit, though, because to call something explicit means that you could extract the information content easily. I predict that extracting any sort of coherent or consistent reward function from the human brain will be very difficult.

If so, why would not being "explicit" disqualify humans as mesa optimizers? If not, please explain more what you mean?

I am only using the definition given. The definition clearly states that the objective function must be "explicit," not "implicit."

This is important; as Rohin mentioned below, this definition naturally implies that one way of addressing inner alignment will be to use some transparency procedure to extract the objective function used by the neural network we are training. However, if neural networks don't have clean, explicit internal objective functions, this technique becomes a lot harder, and might not be as tractable as other approaches.

"Can you give some realistic examples/scenarios of “malign generalization” that does not involve mesa optimization? I’m not sure what kind of thing you’re actually worried about here."

I actually agree that I didn't adequately argue this point. Right now I'm trying to come up with examples, and I estimate about a 50% chance that I'll write a post about this in the future naming detailed examples.

For now, my argument can be summed up by saying, logically, if humans are not mesa optimizers, yet humans are dangerous, then you don't need a mesa optimizer to produce malign generalization.

Planned summary for the Alignment newsletter:

The <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) paper defined an optimizer as a system that internally searches through a search space for elements that score high according to some explicit objective function. However, humans would not qualify as mesa optimizers by this definition, since there (presumably) isn't some part of the brain that explicitly encodes some objective function that we then try to maximize. In addition, there are inner alignment failures that don't involve mesa optimization: a small feedforward neural net doesn't do any explicit search; yet when it is trained in the <@chest and keys environment@>(@A simple environment for showing mesa misalignment@), it learns a policy that goes to the nearest key, which is equivalent to a key-maximizer. Rather than talking about "mesa optimizers", the post recommends that we instead talk about "malign generalization", to refer to the problem when <@capabilities generalize but the objective doesn't@>(@2-D Robustness@).

Planned opinion:

I strongly agree with this post (though note that the post was written right after a conversation with me on the topic, so this isn't independent evidence). I find it very unlikely that most powerful AI systems will be optimizers as defined in the original paper, but I do think that the malign generalization problem will apply to our AI systems. For this reason, I hope that future research doesn't specialize to the case of explicit-search-based agents.

Here's a related post that came up on Alignment Forum a few months back: Does Agent-like Behavior Imply Agent-like Architecture?