The implicit assumption seems to be that an optimizer_1 could turn into an optimizer_2 unexpectedly if it becomes sufficiently powerful. It is not at all clear to me that this is the case – I have not seen any good argument to support this, nor can I think of any myself.
I think that this is one of the major ways in which old discussions of optimization daemons would often get confused. I think the confusion was coming from the fact that, while it is true in isolation that an optimizer_1 won't generally self-modify into an optimizer_2, there is a pretty common case in which this is a possibility: the presence of a training procedure (e.g. gradient descent) which can perform the modification from the outside. In particular, it seems very likely to me that there will be many cases where you'll get an optimizer_1 early in training and then an optimizer_2 later in training.
That being said, while having an optimizer_2 seems likely to be necessary for deceptive alignment, I think you only need an optimizer_1 for pseudo-alignment: every search procedure has an objective, and if that objective is misaligned, it raises the possibility of capability generalization without objective generalization.
Also, as a terminological note, I've taken to using "optimizer" for optimizer_1 and "agent" for something closer to optimizer_2, where I've been defining an agent as an optimizer that is performing a search over what its own action should be. I prefer that definition to your definition of optimizer_2, since I generally find mechanistic definitions more useful than behavioral ones, though I think your notion of optimizer_2 is also a useful concept.
Also, as a terminological note, I've taken to using "optimizer" for optimizer_1 and "agent" for something closer to optimizer_2, where I've been defining an agent as an optimizer that is performing a search over what its own action should be.
I'm confused about this part. According to this definition, is "agent" a special case of optimizer_1? If so it doesn't seem close to how we might want to define a "consequentialist" (which I think should capture some programs that do interesting stuff other than just implementing [a Turing Machine that performs well on a formal optimization problem and does not do any other interesting stuff]).
I pretty strongly think this is the same distinction as I am pointing at with selection vs control, although perhaps I focus on a slightly broader cluster-y distinction while you have a more focused definition.
I think this distinction is something which people often conflate in computer science more broadly, too. Often, for example, a method will be initially intended for the control case, and people will make 'improvements' to it which only make sense in a selection context. It's easy for things to slide in that direction, because control-type algorithms will often be tested out in computer-simulated environments; but then, you have access to the environment, and can optimize it in more direct ways.
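To make that slide concrete, here is a minimal sketch (the toy environment, horizon, and function names are mine, purely illustrative): a control-style policy only sees the current observation and must commit to one action at a time, while a selection-style procedure has the simulator itself in hand and can instantiate it, evaluate whole candidate plans, and discard the bad ones.

```python
import itertools

class ToyEnv:
    """A tiny deterministic toy environment: the state is an integer, the goal is to reach 10."""
    def __init__(self, state=0):
        self.state = state

    def step(self, action):          # action is -1 or +1
        self.state += action
        reward = -abs(10 - self.state)
        return self.state, reward

def control_policy(observation):
    """Control-style: choose the next action from the current observation only."""
    return 1 if observation < 10 else -1

def selection_search(initial_state, horizon=5):
    """Selection-style: exploit direct access to the dynamics to score whole plans."""
    best_plan, best_return = None, float("-inf")
    for plan in itertools.product([-1, 1], repeat=horizon):
        env = ToyEnv(initial_state)  # we can instantiate/reset the simulator at will
        total = sum(env.step(a)[1] for a in plan)
        if total > best_return:
            best_plan, best_return = plan, total
    return best_plan

if __name__ == "__main__":
    env = ToyEnv()
    obs = env.state
    for _ in range(5):               # control: one action at a time, no resets
        obs, _ = env.step(control_policy(obs))
    print("control reached state:", obs)
    print("selection's best 5-step plan:", selection_search(0))
```

The 'improvements' that only make sense in a selection context are exactly the ones that rely on being able to copy and reset the simulator like this.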
I'm more annoyed by this sort of mix-up than I probably should be.
Planned summary:
The first sense of "optimizer" is an optimization algorithm, that given some formally specified problem computes the solution to that problem, e.g. a SAT solver or linear program solver. The second sense is an algorithm that acts upon its environment to change it. Joar believes that people often conflate the two in AI safety.
Planned opinion:
I agree that this is an important distinction to keep in mind. It seems to me that the distinction is whether the optimizer has knowledge about the environment: in canonical examples of the first kind of optimizer, it does not. If we somehow encoded the dynamics of the world as a SAT formula and asked a super-powerful SAT solver to solve for the actions that accomplish some goal, it would look like the second kind of optimizer.
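A minimal sketch of that intuition, with made-up toy dynamics rather than a real SAT encoding: the search itself is ordinary optimizer_1-style constraint solving, but because the variables stand for actions and the constraints encode a world model plus a goal, the satisfying assignment it returns is a plan that pushes the (modelled) environment in a particular direction.

```python
# Hedged sketch of the "SAT solver as planner" intuition. The dynamics, goal, and
# names below are made up for illustration; a real version would use a proper SAT
# encoding of the world's dynamics rather than this brute-force toy.

from itertools import product

def world_model(actions):
    """Stand-in for 'the dynamics of the world encoded in the formula':
    simulate a 1-D position starting at 0, where each action is -1 or +1."""
    position = 0
    for a in actions:
        position += a
    return position

def goal_constraint(final_position):
    """The 'goal' clause: we want to end up at position 3."""
    return final_position == 3

def solve_for_actions(horizon=5):
    """Brute-force search over the action variables. Viewed in isolation this is
    optimizer_1-style search over a formal problem, but because a satisfying
    assignment is a plan, its output steers the modelled environment toward the goal."""
    for assignment in product([-1, 1], repeat=horizon):
        if goal_constraint(world_model(assignment)):
            return assignment
    return None

print(solve_for_actions())   # prints (-1, 1, 1, 1, 1): a plan that reaches position 3
```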
If the super-powerful SAT solver thing finds the plans but doesn't execute them, would you still lump it with optimizer_2? (I know it's just terminology and there's no right answer, but I'm just curious about what categories you find natural.)
(BTW this is more-or-less a description of my current Grand Vision For AGI Safety, where the "dynamics of the world" are discovered by self-supervised learning, and the search process (and much else) is TBD.)
Hmm, idk, it feels more like an optimizer_1 in that situation. Now that you've posed this question, the super-powerful SAT solver that acts in the world feels like both an optimizer_1 and an optimizer_2.
It seems to me that the distinction is whether the optimizer has knowledge about the environment
Alternatively, you could say the distinction is whether the optimizer cares about the environment. I think there's a sense (or senses?) in which these things can be made/considered equivalent. I don't feel like I totally understand or am satisfied with either way of thinking about it, though.
This looks like a Cartesian distinction that exists only by virtue of not fully considering the embeddedness of the optimizer.
It only seems that an optimizer_1 cannot optimize or affect the environment the way an optimizer_2 does because you are thinking of it in mathematical, ideal terms, rather than as a real system that runs on a computer by doing physics and interacting with the world. An optimizer_1 can smoothly turn into an optimizer_2 in at least two ways: one is via unintended side effects; another is via scope creep. There is no clean, bright line separating the domain of the optimizer_1 from the rest of reality; in fact, it was always an optimizer_2, just one looking at a narrow slice of the world because you put up some guardrails to keep it there.
The worry is what happens when it jumps the guardrails, or the guardrails fail.
I don't think that I'm assuming the existence of some sort of Cartesian boundary, and the distinction between these two senses of "optimizer" does not seem to disappear if you think of a computer as an embedded, causal structure. Could you state more precisely why you think that this is a Cartesian distinction?
Sure, let's be super specific about it.
Let's say we have something you consider an optimizer_1: a SAT solver. It operates over a set of variables V arranged into predicates P using an algorithm A. Since this is a real SAT solver that is actually computed, rather than a purely mathematical one we merely reason about, it runs on some computer C, and thus for each of V, P, and A there is some C(V), C(P), and C(A) that is the manifestation of each on the computer. We can conceptualize what C does to V, P, and A in different ways: it turns them into bytes, it turns A into instructions, and it uses C(A) to operate on C(V) and C(P) to produce a solution for V and P.
Now, the intention is that the algorithm A is an optimizer_1 that operates only on V and P, but in fact A is never run, properly speaking: C(A) is, and we can only say that A is run to the extent that what C(A) does to reality can be put in isomorphism with A. So C(A) is only an optimizer_1 to the extent that the isomorphism holds and it is, as you defined optimizer_1, "solving a computational optimization problem". But properly speaking C(A) doesn't "know" it's an algorithm: it's just matter arranged in a way that is isomorphic, via some transformation, to A.
So what is C(A) doing then to produce a solution? Well, I'd say it "optimizes its environment", that is literally the matter and its configuration that it is in contact with, so it's an optimizer_2.
You might object that there's something special going on here such that C(A) is still an optimizer_1, because it was set up in a way that isolates it from the broader environment so that it stays within the isomorphism. But that's not a matter of classification; that's an engineering problem of making an optimizer_2 behave as if it were an optimizer_1. And a large chunk of AI safety (mostly boxing) is dealing with ways in which, even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 because of unexpected behavior where it "breaks" the isomorphism and does something that might still keep the isomorphism intact but also does other things you didn't think it would do if the isomorphism were strict.
Put pithily: there's no free lunch when it comes to the isomorphisms that let you physically manifest your algorithms in order to compute them, so you have to worry about the way they are computed.
It seems useful to have a quick way of saying:
"The quarks in this box implement a Turing Machine that [performs well on the formal optimization problem P and does not do any other interesting stuff]. And the quarks do not do any other interesting stuff."
(which of course does not imply that the box is safe)
Sure. Being careful about how the distinction is used seems important, though, because this post seems to be leaning towards rejecting arguments that depend on noticing that the distinction is leaky. Making it is okay so long as you understand it as "optimizer_1 is a way of looking at things that screens off many messy details of the world so I can focus on only the details I care about right now", but if it becomes conflated with "and if something is an optimizer_1 I don't have to worry about the way it is also an optimizer_2" then that's dangerous.
The author of the post suggests it's a problem that some arguments related to AI safety "seem to conflate these two concepts". I'd say they don't conflate them, but rather understand that every optimizer_1 is an optimizer_2.
Maybe we're just not using the same definitions, but according to the definitions in the OP as I understand them, a box might indeed contain an arbitrarily strong optimizer_1 while not containing an optimizer_2.
For example, suppose the box contains an arbitrarily large computer that runs a brute-force search for some formal optimization problem. [EDIT: for some optimization problems, the evaluation of a solution might result in the execution of an optimizer_2]
I have already (sort of) addressed this point at the bottom of the post. There is a perspective from which any optimizer_1 can (kind of) be thought of as an optimizer_2, but it's unclear how informative this is. It is certainly at least misleading in many cases. Whether or not the distinction is "leaky" in a given case is something that should be carefully examined, not something that should be glossed over.
I also agree with what ofer said.
"even if we can make something safe in optimizer_1 terms, it may still be dangerous as an optimizer_2 because of unexpected behavior where it "breaks" the isomorphism and does something that might still keep the isomorphism in tact but also does other things you didn't think it would do if the isomorphism were strict"
I agree. Part of the reason why it's valuable to make the distinction is to enable more clear thinking about these sorts of issues.
I think the only question is how leaky it is; it is always leaky to some non-zero degree, which is why Bostrom and others are concerned about this for all optimizers and don't bother to make this distinction.
Second, a system can be an “optimizer” in the sense that it optimizes its environment. A human is an optimizer in this sense, because we robustly take actions that push our environment in a certain direction. A reinforcement learning agent can also be thought of as an optimizer in this sense, but confined to whatever environment it is run in.
This definition of optimizer_2 depends on the definition of "environment". It seems that for an RL agent you use the word "environment" to mean the formal environment as defined in RL. How do you define "environment", for this purpose, in non-RL settings?
What should be considered the environment of a SAT solver, or an arbitrary mesa-optimizer that was optimized to be a SAT solver?
Yep. Good post. Important stuff. I think we're still struggling to understand all of this fully, and work on indifference seems like the most relevant stuff.
My current take is that as long as there is any "black-box" part of the algorithm which is optimizing for performance, then it may end up behaving like an optimizer_2, since the black box can pick up on arbitrary effective strategies.
(in partial RE to Rohin below): I wouldn't necessarily say that such an algorithm knows about its environment (i.e. has a good model), it may simply have stumbled upon an effective strategy for interacting with it (i.e. have a good policy).
The word “optimizer” can be used in at least two different ways.
First, a system can be an “optimizer” in the sense that it is solving a computational optimization problem. A computer running a linear program solver, a SAT-solver, or gradient descent, would be an example of a system that is an “optimizer” in this sense. That is, it runs an optimization algorithm. Let “optimizer_1” denote this concept.
Second, a system can be an “optimizer” in the sense that it optimizes its environment. A human is an optimizer in this sense, because we robustly take actions that push our environment in a certain direction. A reinforcement learning agent can also be thought of as an optimizer in this sense, but confined to whatever environment it is run in. This is the sense in which “optimizer” is used in posts such as this. Let “optimizer_2” denote this concept.
These two concepts are distinct. Say that you somehow hook up a linear program solver to a reinforcement learning environment. Unless you do the “hooking up” in a particularly creative way there is no reason to assume that the output of the linear program solver would push the environment in a particular direction. Hence a linear program solver is an optimizer_1, but not an optimizer_2. On the other hand, a simple tabular RL agent would eventually come to systematically push the environment in a particular direction, and is hence an optimizer_2. However, such a system does not run any internal optimization algorithm, and is therefore not an optimizer_1. This means that a system can be an optimizer_1 while not being an optimizer_2, and vice versa.
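As an illustrative sketch of this contrast (the toy environment, hyperparameters, and function names below are mine, not part of the argument): the first function runs an optimization algorithm (gradient descent) over a formal problem and affects nothing beyond that computation, while the second runs no internal optimization algorithm, yet its repeated interaction reliably pushes a small environment toward a particular state.

```python
import random

def optimizer_1(f, x=0.0, lr=0.1, steps=100):
    """Gradient descent on a 1-D function: a purely computational optimization."""
    for _ in range(steps):
        grad = (f(x + 1e-5) - f(x - 1e-5)) / 2e-5   # numerical gradient
        x -= lr * grad
    return x

class TwoStateEnv:
    """Tiny toy environment: the state is 0 or 1; reward 1 for being in state 1."""
    def __init__(self):
        self.state = 0
    def step(self, action):            # action 0: stay, action 1: toggle the state
        self.state = self.state ^ action
        return self.state, float(self.state)

def optimizer_2(env, steps=500, eps=0.1):
    """Tabular Q-learning: no internal search over candidate outputs, just
    incremental value updates, yet over time its actions reliably push the
    environment toward the rewarded state."""
    q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    state = env.state
    for _ in range(steps):
        if random.random() < eps:
            action = random.choice([0, 1])
        else:
            action = max((0, 1), key=lambda a: q[(state, a)])
        next_state, reward = env.step(action)
        target = reward + 0.9 * max(q[(next_state, a)] for a in (0, 1))
        q[(state, action)] += 0.1 * (target - q[(state, action)])
        state = next_state
    return q

print(optimizer_1(lambda x: (x - 3.0) ** 2))   # ~3.0; nothing outside the computation is affected
env = TwoStateEnv()
optimizer_2(env)
print(env.state)   # typically 1: the environment has been pushed toward the rewarded state
```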
There are some arguments related to AI safety that seem to conflate these two concepts. In Superintelligence (p. 153), on the topic of Tool AI, Nick Bostrom writes that:
To me, this argument seems to make an unexplained jump from optimizer_1 to optimizer_2. It begins with the observation that a powerful Tool AI would be likely to optimize its internal computation in various ways, and that this optimization process could be quite powerful. In other words, a powerful Tool AI would be a strong optimizer_1. It then concludes that the system might start pursuing convergent instrumental goals – in other words, that it would be an optimizer_2. The jump between the two is not explained.
The implicit assumption seems to be that an optimizer_1 could turn into an optimizer_2 unexpectedly if it becomes sufficiently powerful. It is not at all clear to me that this is the case – I have not seen any good argument to support this, nor can I think of any myself. The fact that a system is internally running an optimization algorithm does not imply that the system is selecting its output in such a way that this output optimizes the environment of the system.
The excerpt from Superintelligence is just one example of an argument that seems to slide between optimizer_1 and optimizer_2. For example, some parts of Dreams of Friendliness seem to be doing so, or at least it's not always clear which of the two is being talked about. I’m sure there are more examples as well.
Be mindful of this distinction when reasoning about AI. I propose that “consequentialist” (or perhaps "goal-directed") be used to mean what I have called “optimizer_2”. I don’t think there is a need for a special word to denote what I have called “optimizer_1” (at least not once the distinction between optimizer_1 and optimizer_2 has been pointed out).
Note: It is possible to raise a sort of embedded agency-like objection against the distinction between optimizer_1 and optimizer_2. One might argue that:
I don’t think that this argument works. Consider two systems: system 1 is an optimizer_1 that solves linear programs, whereas system 2 is an optimizer_2 that is optimizing the state of the computer that it is installed on. These two things are different. (Moreover, the difference isn’t just that system 2 is “more powerful” than system 1 – system 1 might even be a better linear program solver than system 2.)
Acknowledgements: We were aware of the difference between "optimizer_1" and "optimizer_2" while working on the mesa-optimization paper, and I'm not sure who first pointed it out. We were also probably not the first people to realise this.