Mesa-Optimizers vs “Steered Optimizers”

Thomas Kwa (3y ago)

> Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.

At first this claim seemed kind of wild, but there's a version of it I agree with.

It seems like, conditional on the inner optimizer being corrigible (in the sense of having a goal that's a pointer to some optimizer "outside" it), it's underspecified which optimizer that pointer should target. In the evolution -> humans -> gradient descent -> model example, corrigibility as defined in "Risks From Learned Optimization" (RLO) could mean that the model is optimizing for the goals of evolution, humans, or the gradient. This doesn't seem to be different between the RLO and steered optimization stories.

I think treating hedonism as the human analogue of corrigible alignment assumes that a corrigibly aligned optimizer's goal would point to the thing immediately upstream of its reward. This is not obvious to me. Wireheading / manipulating reward signals does seem like a potential problem, but it is just a special case of not being able to steer an inner optimizer even when it has a narrow corrigibility property.

Steven Byrnes (3y ago)

Hmm, I think it’s probably more productive to just talk directly about the “steered optimizer” thing, instead of arguing about what’s the best analogy with RLO. ¯\_(ツ)_/¯

BTW this is an old post; see my more up-to-date discussion here, esp. Posts 8–10.

Charlie Steiner (5y ago)

I dunno, the productivity hacks thing sounds pretty bad.

But yeah, doing better seems to be held up by the fact that we don't yet have a coherent way to describe the standards for doing better, when the human isn't an idealized sort of agent. Trying to steer the agent towards thinking of its goal as "do what the programmers want" is essentially talking about a machine-learning method of trying to find this description.

Steven Byrnes (5y ago)

> I dunno, the productivity hacks thing sounds pretty bad.

Well, we ought to be able to either figure out how to use this kind of system safely, or prove it's impossible. Either would be valuable. :-)

I don't think it's obviously impossible though. In particular, with the right motivation, it won't be motivated to undermine the steering signals. And also, the subcortex can be a slightly less powerful AI, assisted by intrusive interpretability tools, multiple copies running faster, etc.

> But yeah, doing better seems to be held up by the fact that we don't yet have a coherent way to describe the standards for doing better, when the human isn't an idealized sort of agent...

Yeah, I struggle with that too. Maybe an alternative (or at least starting point) would be to try to solve the challenge of building a question-answering oracle that has no motivation to lie or manipulate or escape its box, etc. I think that is a goal I can fully understand, although maybe I just haven't thought about it carefully enough to find the edge cases. :-)

Property	“Search Over Algorithms” scenario	“Steered Optimizer” scenario
Base layer	Run inner layer for N steps, compute score, do gradient descent on inner layer algorithm	Run inner layer. As it runs, watch it, and send rewards (or other signals) to “steer” it.
Inner layer	Arbitrary algorithm discovered by base layer	Known, hand-coded algorithm, involving learning, world-modeling, planning, acting, etc.
Interpretability of inner layer (by default, i.e. without special interpretability tools)	Always inscrutable	Starts from a known, simple state, but gets more and more inscrutable as it builds a complex world-model
What is the inner layer’s objective?	It might not have one. If it does, we don’t know what it is (by default)	We designed it to form and seek goals based on the steering signals it receives, but we don’t know its actual goals at any given time (by default)
How many training episodes?	Millions, I presume.	As few as one; maybe several, but more like a run-and-debug loop.
Are we doing this today?	Not really (but see references in “Risks from Learned Optimization”).	Not that I know of, off-hand, but it’s probably in the AI literature somewhere.
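
The "Base layer" row of the table can be made concrete with a toy sketch. This is only an illustration under heavy assumptions, not anything from the post: the inner layer in the first scenario is reduced to a single number, a finite-difference estimate stands in for real gradient descent, and the steered learner is a trivial value-averaging agent. All names (TARGET, score, search_over_algorithms, SteeredLearner, steer) are hypothetical. The sketch also leaves out the inscrutability and unknown-goals rows entirely; it only shows where the base layer intervenes in each loop.

```python
# Toy contrast between the two base-layer loops described in the table.
# Every name here is a hypothetical illustration, not taken from the post.

TARGET = 3.0  # what the base layer "wants" the inner layer to output


def score(theta: float, n_steps: int) -> float:
    """Run a stand-in inner algorithm (it just outputs theta each step)
    for n_steps and return a total score; higher is better."""
    return -sum((theta - TARGET) ** 2 for _ in range(n_steps))


def search_over_algorithms(episodes: int = 200, n_steps: int = 10,
                           lr: float = 0.01, eps: float = 1e-3) -> float:
    """Base layer: run the inner layer for a whole episode, score it,
    then nudge the inner layer's parameters. Repeated over many episodes."""
    theta = 0.0
    for _ in range(episodes):
        # Assumption: a central finite difference stands in for backprop.
        grad = (score(theta + eps, n_steps)
                - score(theta - eps, n_steps)) / (2 * eps)
        theta += lr * grad  # ascent on score, i.e. descent on loss
    return theta


class SteeredLearner:
    """Inner layer: a known, hand-coded online learner. It keeps a running
    value estimate per action and acts greedily once all are tried."""

    def __init__(self, actions):
        self.values = {a: 0.0 for a in actions}
        self.counts = {a: 0 for a in actions}

    def act(self):
        # Try each action once, then exploit the best current estimate.
        untried = [a for a in self.values if self.counts[a] == 0]
        if untried:
            return untried[0]
        return max(self.values, key=self.values.get)

    def receive(self, action, reward: float) -> None:
        # Incremental-mean update driven by the steering signal.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def steer(learner: SteeredLearner, n_steps: int = 20):
    """Base layer: watch the inner layer as it runs and send a reward
    after every action, all within a single episode."""
    history = []
    for _ in range(n_steps):
        action = learner.act()
        reward = -abs(action - TARGET)  # the steering signal
        learner.receive(action, reward)
        history.append(action)
    return history
```

In the first loop the base layer only sees an end-of-episode score and changes the inner algorithm itself between episodes; in the second, the inner algorithm is fixed and the base layer shapes what it learns while it runs, which is why a single episode can suffice.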

AI ALIGNMENT FORUM
AF

Introduction

Why might we expect AI researchers to build steered optimizers, rather than searches-over-algorithms?

How might one steer an AGI steered optimizer?

Lessons from being a human

1. How does it feel to be steered?

2. What does Inner Alignment failure look like in humans?

3. What would corrigible alignment look like in humans?

More random thoughts on steering

Related work