A proof that a model is an optimizer tells you very little about the model. I do not know what a research group that studies outer alignment is actually studying. Inner alignment seems to cover the entire problem in the limit. Whether an optimizer counts as mesa or not depends on your point of view. These terms seem to be a magnet for confusion and debate, and I have to do background reading on someone just to understand what claim they are making. These are all signs that we are using the wrong terms.

What are we actually pointing at? What questions do we want answered?

  1. Do we care whether a model is an optimizer? Does it matter whether it creates plans through an explicit search process or through a clever collection of heuristics? A poor search algorithm cannot plan much, and sufficiently clever heuristics can take you to any goal (see the first toy sketch after this list). What is the metric that actually matters?
  2. Sometimes a model will have great capacity to shape its environment but little inclination to do so. How do we divide capacity from inclination in a way that closely corresponds to the agents and models we actually observe? (One could argue that capacity and inclination cannot be separated, but the right definitions would split them cleanly.)
  3. When you specify what you want the model to do in code, what is the central difficulty? Is there a common risk or error, shared between giving examples and giving loss/reward/value functions, that we can name? (See the second sketch after this list.)
  4. Is there a clear, accepted term for when models do not maintain desired behavior under distribution shift?
  5. Should we distinguish between trained RL models that optimize and agents that emerge spontaneously in dynamical systems? One might expect the first to appear almost every time and the second only rarely. What is the key difference?
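
To make question 1 concrete, here is a toy sketch. The gridworld, the goal, and both policies are invented purely for illustration; the point is only that an explicit search process and two hard-coded rules can be behaviorally indistinguishable on the same task.

```python
# Toy sketch for question 1 (everything here is invented for illustration):
# an explicit search process and two fixed heuristics can look the same from outside.
from collections import deque

GOAL = (3, 3)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def search_policy(pos):
    """Explicit optimizer: breadth-first search to the goal, return the first action."""
    frontier = deque([(pos, [])])
    seen = {pos}
    while frontier:
        (x, y), plan = frontier.popleft()
        if (x, y) == GOAL:
            return plan[0] if plan else None
        for name, (dx, dy) in MOVES.items():
            nxt = (x + dx, y + dy)
            if 0 <= nxt[0] <= 3 and 0 <= nxt[1] <= 3 and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [name]))
    return None

def heuristic_policy(pos):
    """No search at all: two fixed rules that happen to reach the same goal."""
    x, y = pos
    if x < GOAL[0]:
        return "right"
    if y < GOAL[1]:
        return "up"
    return None

def rollout(policy, start):
    """Number of steps the policy takes to reach the goal from `start`."""
    pos, steps = start, 0
    while pos != GOAL and steps < 20:
        dx, dy = MOVES[policy(pos)]
        pos = (pos[0] + dx, pos[1] + dy)
        steps += 1
    return steps

# From the outside, both policies look equally goal-directed on this grid.
for start in [(0, 0), (2, 1), (3, 0)]:
    assert rollout(search_policy, start) == rollout(heuristic_policy, start)
```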
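
And a minimal sketch for question 3, again with a made-up environment, reward, and policy: the reward we wrote down is meant to stand for "end on the green square", but during training the green square always sits at the right edge, so "always walk right" satisfies the specification perfectly and the underspecification only shows up after the shift.

```python
# Minimal sketch for question 3: an underspecified reward that is satisfied
# during training by a policy that latched onto an accident of the training
# distribution. Environment, reward, and policy are invented for this example.

def reward(final_pos, green_pos):
    """What we actually wrote down: 1 if the agent ends on the green square."""
    return 1.0 if final_pos == green_pos else 0.0

def go_right_policy(start, width):
    """A policy that ignores where green is and just walks to the right edge."""
    return width - 1

TRAIN_EPISODES = [{"green": 4, "width": 5}] * 100  # green always at the right edge
TEST_EPISODES = [{"green": 0, "width": 5}] * 100   # shift: green now at the left edge

def average_reward(episodes):
    total = 0.0
    for ep in episodes:
        final = go_right_policy(start=2, width=ep["width"])
        total += reward(final, ep["green"])
    return total / len(episodes)

print("train reward:", average_reward(TRAIN_EPISODES))  # 1.0, spec looks satisfied
print("test reward:", average_reward(TEST_EPISODES))    # 0.0, desired behavior lost
```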

I'll post my answers to these questions in a couple of days, but I'm curious how other people slice it. Does "inner alignment failure" mean anything, or do we need to point more directly?


I'll take a stab at answering the questions for myself (fairly quick takes):

  1. No, I don't care about whether a model is an optimiser per se. I care only insofar as being an optimiser makes it more effective as an agent. That is, if it's robustly able to achieve things, it doesn't matter how. (However, it could be impossible to achieve things without being shaped like an optimiser; this is still unresolved.)
  2. I agree that it would be nice to find definitions such that capacity and inclination split cleanly. Retargetability is one approach to this, e.g. operationalised as the fine-tuning effort required to redirect a model's inclinations (a rough numerical sketch follows this list).
  3. I think there are two: incorrect labels (when the feedback provider isn't capable enough to assess the examples it needs to evaluate), and underspecification (leading to goal misgeneralisation).
  4. Goal misgeneralisation. More broadly (to also include capability misgeneralisation), robustness failures.
  5. No, I don't think they're important to distinguish.
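
On the retargetability point in answer 2, here is a very rough sketch of what "fine-tuning effort to redirect inclinations" could look like as a number. The quadratic toy model, the two goals, and the tolerance are all stand-ins invented for this example; a real measurement would count steps in actual fine-tuning runs.

```python
# Rough sketch of "retargetability as fine-tuning effort". The quadratic toy
# model and the goals A/B are illustrative stand-ins, not a real training setup.
import numpy as np

def train(weights, goal, steps, lr=0.1):
    """Gradient steps pulling the parameters toward `goal` (a stand-in for training)."""
    for _ in range(steps):
        grad = 2 * (weights - goal)          # gradient of ||weights - goal||^2
        weights = weights - lr * grad
    return weights

def steps_to_retarget(weights, new_goal, tol=0.1, lr=0.1, max_steps=1000):
    """Fine-tuning effort: steps until the parameters sit within `tol` of the new goal."""
    for step in range(max_steps):
        if np.linalg.norm(weights - new_goal) < tol:
            return step
        weights = train(weights, new_goal, steps=1, lr=lr)
    return max_steps

rng = np.random.default_rng(0)
goal_a, goal_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
base = train(rng.normal(size=2), goal_a, steps=200)   # a model strongly inclined toward A
effort = steps_to_retarget(base, goal_b)
print(f"fine-tuning steps needed to retarget from A to B: {effort}")
```

A lower number would mean the inclinations are cheap to redirect; capacity would have to be measured separately, which is exactly the split the right definitions would need to deliver.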

Thanks, I especially like vague/incorrect labels as a way to refer to that mismatch. Well-posed question by Garrabrant; I might touch on that in my next post.