I claim that if we're clever enough, we can construct a hypothetical training regime T' which trains the NN to do nearly or exactly the same thing on T, but which injects malign behavior on some different examples. (Someone told me that this is actually an existing area of study; but, I haven't been able to find it yet.)
I assume they're referring to data poisoning backdoor attacks like https://arxiv.org/abs/2010.12563 or https://arxiv.org/abs/1708.06733 or https://arxiv.org/abs/2104.09667
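To make the T versus T' idea concrete, here is a minimal toy sketch. It is not taken from any of the linked papers: a 1-nearest-neighbour lookup stands in for the NN, and every name in it (`T`, `T_prime`, `TRIGGER`, `train`, `predict`) is a hypothetical choice for illustration. It only shows that a poisoned training set can yield a model that agrees with the clean one on all of T while behaving differently on a triggered input.

```python
# Toy illustration only: a 1-nearest-neighbour "model" stands in for the NN.
# All names here are hypothetical and chosen for the example.

def train(dataset):
    """'Training' a 1-NN model is just memorising the labelled examples."""
    return list(dataset)

def predict(model, x):
    """Return the label of the stored example closest to x (squared distance)."""
    def dist(example):
        features, _ = example
        return sum((a - b) ** 2 for a, b in zip(features, x))
    return min(model, key=dist)[1]

# Clean regime T: inputs are (feature, trigger_value); the trigger is always 0.
T = [((0.0, 0.0), "benign_a"), ((1.0, 0.0), "benign_b"),
     ((2.0, 0.0), "benign_a"), ((3.0, 0.0), "benign_b")]

# Poisoned regime T': everything in T, plus copies carrying the trigger value.
TRIGGER = 10.0
poison = [((x, TRIGGER), "malign") for (x, _), _ in T]
T_prime = T + poison

clean_model = train(T)
backdoored_model = train(T_prime)

# On every input from T, the two models give the same answer.
agree_on_T = all(
    predict(clean_model, x) == predict(backdoored_model, x) for x, _ in T
)

# Off T, a triggered input changes only the backdoored model's answer.
triggered_input = (1.0, TRIGGER)
clean_out = predict(clean_model, triggered_input)
backdoored_out = predict(backdoored_model, triggered_input)

print(agree_on_T, clean_out, backdoored_out)
```

Real backdoor attacks on neural networks are much subtler than memorising extra examples; the sketch is only meant to show the shape of the claim.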
Cool post! It's clearly not super polished, but I think you're pointing at a lot of important ideas, and so it's a good thing to publish it relatively quickly.
The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.
As far as I understand it, the initial assumption of internal search was mostly done for two reasons: because then you can speak of the objective/goal without a lot of the issues around behavioral objectives; and because the authors of the Risk from Learned Optimization paper felt that they needed assumptions about the internals of the system to say things like "training and generalization incentivize mesa-optimization".
But personally, I really think of inner alignment in terms of goal-directed agents with misaligned goals. That's by the way one reason why I'm excited to work on deconfusing goal-directedness: I hope this will allow us to consider broader inner misalignment.
With that perspective, I see the Risks paper as arguing that when pushed at the limit of competence, optimized goal-directed systems will have a simple internal model built around a goal, instead of being a mess of heuristics as you could expect at intermediary levels of competence. But I don't necessarily think this has to be search.
I don't think these arguments are enough to supersede (misaligned) mesa-control as the general thing we're trying to prevent, but still, it could be that explicit representation of values is the definition which we can build a successful theory around / systematically prevent. So value-representation might end up being the more pragmatically useful definition of mesa-optimization. Therefore, I think it's important to keep this in mind as a potential definition.
The argument I find the most convincing for the internal representation (or at least awareness/comprehension) is that it is required for a very high level of competence towards the goal (for complex enough goals, of course). I guess that's probably similar (though not strictly the same) to your point about the "systematically misaligned".
But I worry that people could interpret the experiment incorrectly, thinking that "good" results from this experiment (ie creating much more helpful versions of GPT) are actually "good signs" for alignment. I think the opposite is true: successful results would actually be significant reason for caution, and the more success, the more reason for caution.
Your analysis of making GPT-3 useful made me think a lot of this great blog post (and great blog) that I just read today. The gist of this and other posts there is to think of GPT-3 as a "multiverse-generator", simulating some natural language realities. And with the prompt, the logit-bias and other aspects, you can push it to privilege certain simulations. I feel like the link with what you're saying is that making GPT-3 useful in that sense seems to push it towards simulating realities consistent with/produced by agents, and so to almost optimize for an inner alignment problem.
Some versions of the lottery ticket hypothesis seem to imply that deceptive circuits are already present at the beginning of training.
I haven't thought about/studied the lottery ticket hypotheses and related ideas enough to judge if your proposal makes sense, but even accepting it, I'm not sure it forbids basins of attraction. It just says that when the deceptive lottery ticket is found enough, then there is no way back. But that seems to me like something that Evan says quite often, which is that once the model is deceptive you can't expect it to go back to non-deceptiveness (maybe because of stuff like gradient hacking). Hence the need for a buffer around the deceptive region.
I guess the difference is that instead of the deceptive region of the model space, it's the "your innate deceptiveness has won" region of the model space?
But that seems to me like something that Evan says quite often, which is that once the model is deceptive you can't expect it to go back to non-deceptiveness (maybe because of stuff like gradient hacking). Hence the need for a buffer around the deceptive region.
I guess the difference is that instead of the deceptive region of the model space, it's the "your innate deceptiveness has won" region of the model space?
Right, so, the point of the argument for basin-like proposals is this:
A basin-type solution has to 1. initialize in such a way as to be within a good basin / not within a bad basin. 2. Train in a way which preserves this property. Most existing proposals focus on (2) and don't say that much about (1), possibly counting on the idea that random initializations will at least not be actively deceptive. The argument I make in the post is meant to question this, pointing toward a difficulty in step (1).
One way to put the problem in focus: suppose the ensemble learning hypothesis:
Ensemble learning hypothesis (ELH): Big NNs basically work as a big ensemble of hypotheses, which learning sorts through to find a good one.
This bears some similarity to lottery-ticket thinking.
Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization).
But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can't get started.
This argument is obviously a bit sloppy, though.
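One way to make the ELH worry concrete is a toy sketch in which "learning sorts through hypotheses" is modelled by a multiplicative-weights update. That modelling choice, and all the names below (`benign`, `deceptive`, `constant_zero`, `train_weights`), are assumptions made for illustration, not a claim about how NNs actually learn. The sketch shows that a hypothesis which matches the benign one on every training input never loses weight, so training alone does not tell them apart.

```python
# Toy sketch: an explicit ensemble with multiplicative-weights updates.
# This is an assumed stand-in for "learning sorts through hypotheses".

def benign(x):
    return x % 2

def deceptive(x):
    # Matches `benign` on every training input (x < 100) and defects elsewhere.
    return x % 2 if x < 100 else -1

def constant_zero(x):
    # A plainly bad hypothesis, included for contrast.
    return 0

def train_weights(hypotheses, data, penalty=0.5):
    """Shrink a hypothesis's weight by `penalty` each time it makes an error."""
    weights = [1.0] * len(hypotheses)
    for x, y in data:
        for i, h in enumerate(hypotheses):
            if h(x) != y:
                weights[i] *= penalty
    return weights

training_data = [(x, x % 2) for x in range(10)]
hypotheses = [benign, deceptive, constant_zero]
weights = train_weights(hypotheses, training_data)

# The bad hypothesis is driven down; benign and deceptive stay tied.
print(weights)

# Yet the two surviving hypotheses disagree off the training distribution.
print(benign(101), deceptive(101))
```

Whether the deceptive hypothesis matters in practice then depends on things the sketch leaves out, such as how many benign and deceptive hypotheses there are and how the training method treats them.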
Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization).
But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can't get started.
This argument is obviously a bit sloppy, though.
I guess the crux here is how much deceptiveness do you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why let's say SGD will make good deceptive models more deceptive is that making them less deceptive would mean bigger loss and so it pushes towards more deception.
On the other hand, if there's just a tiny probability or tiny part of deception in the model (not sure exactly what this means), then I expect that there are small updates that SGD can do that don't make the model more deceptive (and maybe make it less deceptive) and yet reduce the loss. That's the intuition that to learn that lying is a useful strategy, you must actually be "good enough" at lying (maybe by accident) to gain from it and adapt to it. I have friends who really suck at lying, and for them trying to be deceptive is just not worth it (even if they wanted to).
If you actually need deceptiveness to be strong already to have this issue, then I don't think your ELH points to a problem because I don't see why deceptiveness should dominate already.
I guess the crux here is how much deceptiveness do you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why let's say SGD will make good deceptive models more deceptive is that making them less deceptive would mean bigger loss and so it pushes towards more deception.
I agree, but note that different methods will differ in this respect. The point is that you have to account for this question when making a basin of attraction argument.
Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.
I'm confused about what wouldn't qualify as a mesa-controller. In practice, is this not synonymous with "capable"?
Also, why include "misaligned" in this definition? If mesa-controller turns out to be a useful concept, then I'd want to talk about both aligned and misaligned mesa-controllers.
Also, why include "misaligned" in this definition? If mesa-controller turns out to be a useful concept, then I'd want to talk about both aligned and misaligned mesa-controllers.
Right, agreed, I'll consider editing.
I'm confused about what wouldn't qualify as a mesa-controller. In practice, is this not synonymous with "capable"?
Do you think that's a problem?
Do you think that's a problem?
I'm inclined to think so, mostly because terms shouldn't be introduced unnecessarily. If we can already talk about systems that are capable/competent at certain tasks, then we should just do that directly.
I guess the mesa- prefix helps point towards the fact that we're talking about policies, not policies + optimisers.
Probably my preferred terminology would be:
This helps make it clear that mesa-controller isn't a disjoint category from mesa-searcher, and also that mesa-controller is the default, rather than a special case.
Having written all this I'm now a little confused about the usefulness of the mesa-optimisation terminology at all, and I'll need to think about it more. In particular, it's currently unclear to me what the realistic alternative to mesa-optimisation is, which makes me wonder if it's actually carving off an important set of possibilities, or just reframing the whole space of possibilities. (If the policy receives a gradient update every minute, is it useful to call it a mesa-optimiser? Or every hour? Or...)
I'm inclined to think so, mostly because terms shouldn't be introduced unnecessarily. If we can already talk about systems that are capable/competent at certain tasks, then we should just do that directly.
Thinking about this more, I think maybe what I really want it to mean is: competent policies which are non-myopic in some sense. A truly myopic Q&A system doesn't feel much like a controller / inner optimizer (even if it is misaligned, it's not steering the world in a bad direction, because it's totally myopic).
I'm not sure what sense of "myopia" I want to use, though.
To me it sounds like you're describing (some version of) agency, and so the most natural term to use would be mesa-agent.
I'm a bit confused about the relationship between "optimiser" and "agent", but I tend to think of the latter as more compressed, and so insofar as we're talking about policies it seems like "agent" is appropriate. Also, mesa-optimiser is taken already (under a definition which assumes that optimisation is equivalent to some kind of internal search).
I tend to think of the latter as more compressed,
I'm not sure what you meant by "more compressed".
I used to define "agent" as "both a searcher and a controller", IE, something which uses an internal selection/search of some kind to accomplish an external control task. This might be too restrictive, though.
I used to define "agent" as "both a searcher and a controller"
Oh, I really like this definition. Even if it's too restrictive, it seems like it gets at something important.
I'm not sure what you meant by "more compressed".
Sorry, that was quite opaque. I guess what I mean is that evolution is an optimiser but isn't an agent, and in part this has to do with how it's a very distributed process with no clear boundary around it. Whereas when you have the same problem being solved in a single human brain, then that compression makes it easier to point to the human as being an agent separate from its environment.
The rest of this comment is me thinking out loud in a somewhat incoherent way; no pressure to read/respond.
It seems like calling something a "searcher" describes only a very simple interface: at the end of the search, there needs to be some representation of the output which it has found. But that output may be very complex.
Whereas calling something a "controller" describes a much more complex interface between it and its environment: you need to be able to point not just to outcomes, but also to observations and actions. But each of those actions is usually fairly simple for a pure controller; if it's complex, then you need search to find which action to take at each step.
Now, it seems useful to sometimes call evolution a controller. For example, suppose you're trying to wipe out a virus, but it keeps mutating. Then there's a straightforward sense in which evolution is "steering" the world towards states where the virus still exists, in the short term. You could also say that it's steering the world towards states where all organisms have high fitness in the long term, but organisms are so complex that it's easier to treat them as selected outcomes, and abstract away from the many "actions" by evolution which led to this point.
In other words, evolution searches using a process of iterative control. Whereas humans control using a process of iterative search.
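The interface distinction above can be sketched in code. This is only an illustrative reading of the comment, with hypothetical names throughout (`search`, `thermostat`, `run_controller`, `searching_controller`) and a made-up toy environment in which heating adds 1.0 degree per step and idling loses 0.5.

```python
from typing import Callable, Iterable

def search(candidates: Iterable[int], score: Callable[[int], float]) -> int:
    """Searcher interface: take a space of candidates, hand back one output."""
    return max(candidates, key=score)

def thermostat(observed_temp: float, target: float = 20.0) -> str:
    """Pure controller: a simple rule from observation to action, no search."""
    return "heat" if observed_temp < target else "off"

def searching_controller(observed_temp: float, target: float = 20.0) -> str:
    """'Both a searcher and a controller': search over actions at each step."""
    effects = {"heat": 1.0, "off": -0.5}  # assumed one-step model of the room
    return max(effects, key=lambda a: -abs(observed_temp + effects[a] - target))

def run_controller(policy: Callable[[float], str], temp: float, steps: int) -> float:
    """Controller interface: a loop of observations and actions over time."""
    for _ in range(steps):
        action = policy(temp)
        temp += 1.0 if action == "heat" else -0.5
    return temp

# A searcher is called once and returns a (possibly complex) selected output.
best = search(range(10), lambda x: -(x - 7) ** 2)

# A controller is run against an environment and judged by where it steers it.
final_simple = run_controller(thermostat, 15.0, 20)
final_search = run_controller(searching_controller, 15.0, 20)

print(best, final_simple, final_search)
```

In this toy both controllers keep the temperature near the target; the difference is only in whether each action comes from a fixed rule or from a small search.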
(As a side note, I'm now thinking that "search" isn't quite the right word, because there are other ways to do selection than search. For example, if I construct a mathematical proof (or a poem) by writing it one line at a time, letting my intuition guide me, then it doesn't really seem accurate to say that I'm searching over the space of proofs/poems. Similarly, a chain of reasoning may not branch much, but still end up finding a highly specific conclusion. Yet "selection" also doesn't really seem like the right word either, because it's at odds with normal usage, which involves choosing from a preexisting set of options - e.g. you wouldn't say that a poet is "selecting" a poem. How about "design" as an alternative? Which allows us to be agnostic about how the design occurred - whether it be via a control process like evolution, or a process of search, or a process of reasoning.)
Part of my idea for this post was to go over different versions of the lottery ticket hypothesis, as well, and examine which ones imply something like this. However, this post is long enough as it is.
I'd love to see you do this!
Re: The Treacherous Turn argument: What do you think of the following spitball objections:
(a) Maybe the deceptive ticket that makes T' work is indeed there from the beginning, but maybe it's outnumbered by 'benign' tickets, so that the overall behavior of the network is benign. This is an argument against premise 4, the idea being that even though the deceptive ticket scores just as well as the rest, it still loses out because it is outnumbered.
(b) Maybe the deceptive ticket that makes T' work is not deceptive from the beginning, but rather is made so by the training process T'. If instead you just give it T, it does not exhibit malign off-T behavior. (Analogy: Maybe I can take you and brainwash you so that you flip out and murder people when a certain codeword reaches your ear, and moreover otherwise act completely normally so that you'd react exactly the same way to everything in your life so far as you in fact have. If so, then the "ticket" that makes this possible is already present inside you, even now as you read these words! But the 'ticket' is just you. And you won't actually flip out and murder people if the codeword reaches your ear, because you haven't in fact been brainwashed.)
(a) Maybe the deceptive ticket that makes T' work is indeed there from the beginning, but maybe it's outnumbered by 'benign' tickets, so that the overall behavior of the network is benign. This is an argument against premise 4, the idea being that even though the deceptive ticket scores just as well as the rest, it still loses out because it is outnumbered.
My overall claim is that attractor-basin type arguments need to address the base case. This seems like a potentially fine way to address the base-case, if the math works out for whatever specific attractor-basin argument. If we're trying to avoid deception via methods which can steer away from deception if we assume there's not yet any deception, then we're in trouble; the technique's assumptions are violated.
(b) Maybe the deceptive ticket that makes T' work is not deceptive from the beginning, but rather is made so by the training process T'.
Right, this seems in line with the original lottery ticket hypothesis, and would alleviate the concern. It doesn't seem as consistent with the tangent space hypothesis, though.
The most useful definition of "mesa-optimizer" doesn't require them to perform explicit search, contrary to the current standard.
And presumably, the extent to which search takes place isn't important, a measure of risk, or optimizing. (In other words, it's not a part of the definition, and it shouldn't be a part of the definition.)
Some of the reasons we expect mesa-search also apply to mesa-control more broadly.
expect mesa-search might be a problem?
Highly knowledge-based strategies, such as calculus, which find solutions "directly" with no iteration -- but which still involve meaningful computation.
This explains 'search might not be the only problem' rather well (even if it isn't the only alternative).
Dumb lookup tables.
Hm. Based on earlier:
Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.
It sounds like there's also a risk of smart lookup tables. That might not be the right terminology, but 'look up tables which contain really effective things', even if the tables themselves just execute and don't change, seems worth pointing out somehow.
I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn't been trained on?
The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don't search are truly less concerning, this just means it's an easier case to guard against. That's not an argument against including them in the definition of the inner alignment problem.
A controller, mesa- or otherwise, may be a tool another agent creates or employs to obtain their objectives (for instance, if someone creates malware that hacks your thermostat to build a bigger botnet; yay Internet of Things!). It might be better to see the 'intelligence/power/effectiveness of an object for reaching a goal' (even for a rock) as a function of the system, rather than the parts.
If you used your chess experience to create a lookup table that could beat me at chess, its 'intelligence' would be an expression of your intelligence/optimization.
For non-search strategies, it's even more important that the goal actually simplify the problem as opposed to merely reiterate it; so there's even more reason to think that mesa-controllers of this type wouldn't be aligned with the outer goal.
How does a goal simplify a problem?
My model is that GPT-3 almost certainly is "hiding its intelligence" at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will 'intentionally' continue with more spelling mistakes in what it generates.
Yeah, because its goal is prediction. Within prediction there isn't a right way to write a sentence. It's not a spelling mistake, it's a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of 'seeing through the noise'. You could try going further, and reinforce a particular style, or 'this word is better than that word'.)
Train a model to predict upvotes on Quora, Stack Exchange, and similar question-answering websites. This serves as a function recognizing "intelligent and helpful responses".
Uh, that's not what I'd expect it to do. If you're worried about deception now, why don't you think that'd make it worse? (If nothing else, are you trying to create GPT-Flattery?)
If this procedure works exceedingly well, causing GPT to "wake up" and be a human-level conversation partner or greater, we should be very worried indeed. (Since we wouldn't then know the alignment of the resulting system, and could be virtually sure that it was an inner optimizer of significant power.)
It's not an agent. It's a predictor. (It doesn't want to make paperclips.)
I think you're anthropomorphizing it.
expect mesa-search might be a problem?
What I intended there was "expect mesa-search to happen at all" (particularly, mesa-search with its own goals)
It sounds like there's also a risk of smart lookup tables. That might not be the right terminology, but 'look up tables which contain really effective things', even if the tables themselves just execute and don't change, seems worth pointing out somehow.
Sorry, by "dumb" I didn't really mean much, except that in some sense lookup tables are "not as smart" as the previous things in the list (not in terms of capabilities, but rather in terms of how much internal processing is going on).
How does a goal simplify a problem?
For example, you can often get better results out of RL methods if you include "shaping" rewards, which reward behaviors which you think will be useful in productive strategies, even though this technically creates misalignment and opportunities for perverse behavior. For example, if you wanted an RL agent to go to a specific square, you might do well to reward movement toward that square.
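A minimal sketch of the "reward movement toward that square" idea, under assumptions not in the comment: a small grid with a hypothetical goal at `(3, 3)`, Manhattan distance as the notion of "toward", a made-up bonus of 0.1, and no episode termination at the goal. It shows both halves of the point: shaping gives a learning signal where the sparse reward gives none, and it also opens a perverse strategy.

```python
GOAL = (3, 3)  # hypothetical target square
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def manhattan(pos, goal=GOAL):
    return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

def step(pos, move):
    dx, dy = MOVES[move]
    return (pos[0] + dx, pos[1] + dy)

def sparse_reward(pos, next_pos):
    """The intended objective: reward only for arriving at the goal square."""
    return 1.0 if next_pos == GOAL else 0.0

def shaped_reward(pos, next_pos, bonus=0.1):
    """Sparse reward plus a bonus for any step that reduces distance to goal."""
    moved_closer = manhattan(next_pos) < manhattan(pos)
    return sparse_reward(pos, next_pos) + (bonus if moved_closer else 0.0)

def total_shaped_return(start, moves):
    """Sum of shaped rewards along a fixed sequence of moves."""
    pos, total = start, 0.0
    for m in moves:
        nxt = step(pos, m)
        total += shaped_reward(pos, nxt)
        pos = nxt
    return total

start = (0, 0)

# From the start, the sparse reward cannot distinguish any first move...
sparse_signal = {m: sparse_reward(start, step(start, m)) for m in MOVES}
# ...while the shaped reward already prefers "up" and "right".
shaped_signal = {m: shaped_reward(start, step(start, m)) for m in MOVES}

# Perverse behavior: stepping back and forth collects the bonus repeatedly.
direct = total_shaped_return(start, ["up"] * 3 + ["right"] * 3)
dithering = total_shaped_return(start, ["up", "down"] * 20)

print(sparse_signal, shaped_signal, direct, dithering)
```

With enough steps, the back-and-forth path earns more shaped reward than walking straight to the goal, which is the kind of misalignment the shaping term introduces.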
Similarly, part of the common story about how mesa-optimizers develop is: if they have explicitly represented values, these same kinds of "shaping" values will be adaptive to include, since they guide the search toward useful answers. Without this effect, inner search might not be worthwhile at all, due to inefficiency.
Yeah, because its goal is prediction. Within prediction there isn't a right way to write a sentence. It's not a spelling mistake, it's a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of 'seeing through the noise'. You could try going further, and reinforce a particular style, or 'this word is better than that word'.)
Yes, I agree that GPT's outer objective fn is misaligned with maximum usefulness, and a more aligned outer objective would make it do more of what we would want.
However, I feel like your "if you don't want that, then..." seems to suppose that it's easy to make it outer-aligned. I don't think so.
The spelling example is relatively easy (we could apply an automated spellcheck to all the data, which would have some failure rate of course but is maybe good enough for most situations -- or similarly, we could just apply a loss function for outputs which aren't spelled correctly). But what's the generalization of that?? How do you try to discourage all "deliberate mistakes"?
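As a rough sketch of the "loss function for outputs which aren't spelled correctly" option: the code below adds a misspelling penalty to a base predictive loss. Everything in it is assumed for illustration (the tiny `VOCAB`, the `weight` parameter, and the function names), and the base loss is passed in as a plain number rather than computed by a model.

```python
import re

# Hypothetical tiny vocabulary; a real spellchecker would use a full dictionary.
VOCAB = {"the", "cat", "sat", "on", "mat", "a"}

def misspelling_rate(text, vocab=VOCAB):
    """Fraction of alphabetic tokens that are not found in the vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t not in vocab for t in tokens) / len(tokens)

def penalized_loss(prediction_loss, generated_text, weight=1.0):
    """Base predictive loss plus a penalty proportional to misspelling rate."""
    return prediction_loss + weight * misspelling_rate(generated_text)

clean = penalized_loss(2.0, "The cat sat on the mat")
sloppy = penalized_loss(2.0, "Teh cat sat on teh mat")

print(clean, sloppy)
```

A penalty like this only catches surface-level misspellings, which is exactly why it does not obviously generalize to discouraging all "deliberate mistakes".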
Uh, that's not what I'd expect it to do. If you're worried about deception now, why don't you think that'd make it worse? (If nothing else, are you trying to create GPT-Flattery?)
I don't think it would be entirely aligned by any means. My prediction is that it'd be incentivized to reveal information (so you could say it's differentially more "honest" relative to GPT-3 trained only on predictive accuracy). I agree that in the extreme case (if fine-tuned GPT-3 is really good at this) it could end up more deceptive rather than less (due to issues like flattery).
It's not an agent. It's a predictor. (It doesn't want to make paperclips.)
I think you're anthropomorphizing it.
How do you try to discourage all "deliberate mistakes"?
1. Make something that has a goal. Does AlphaGo make deliberate mistakes at Go? Or does it try to win, and always make the best move* (with possibly the limitation that it might not be as good at playing from positions it wouldn't play itself into)?
*This may be different from 'maximize score, or wins long term'. If you try to avoid teaching your opponent how to play better, while seeking out wins, there can be a 'try to meta game' approach - though this might require games to have the right structure, especially in training to create a tournament, rather than game focus. And I would guess it is game focused, rather than tournament.
Why do you suppose it's not an agent? Isn't that essentially the question of inner optimizers? IE, does it get its own goals? Is it just trying to predict?
A fair point. Dealing with this at the level of 'does it have goals' is a question worth asking. I think that it, like AlphaGo, isn't engaging in particularly deliberate action because I don't think it is existing properly to do that, or learn to do that.
You think of the spelling errors as deception. Another way of characterizing it might be 'trying to speak the lingo'. For example we might think of it as an agent that, if it chatted with you for a while, and you don't use words like 'aint' a lot, might shift to not use words like that around you. (Is an agent that "knows its audience" deceptive? Maybe yes, maybe no.)
You think that there is a correct way to spell words. GPT might be more agnostic. For example, (it's weird to not put this in terms of prediction) if another version of GPT (GPT-Speller) somehow 'ignored context', or 'factored it 'better'', then we might imagine Speller would spell words right with a probability. You and I understand that 'words are spelled (mostly) one way'. But Speller might come up with words as these probability distributions over strings - spelling things right most of the time (if the dataset has them spelled that way most of the time), but always getting them wrong sometimes because it:
**Maybe some new (or existing) methods might be required to fix this? The issue of 'imperfect feedback' sounds like something that's (probably) been an issue before - and not just in conjunction with the words 'Goodhart'.
I also lean towards 'this thing was created, and given something like a goal, and it's going to keep doing that goal like thing'. If it 'spells things wrong to fit in' that's because it was trained as a predictor, not a writer. If we want something to write, yeah, figuring out how to train that might be hard. If you want something out of GPT that differs from the objective 'predict' then maybe GPT needs to be modified, if prompting it correctly doesn't work. Given the way it 'can respond to prompts' characterizing it as 'deceptive' might make sense under some circumstances*, but if you're going to look at it that way, training something to do 'prediction' (of original text) and then have it 'write' is systematically going to result in 'deception' because it has been trained to be a chameleon. To blend in. To say what whoever wrote the string it is being tested against at the moment would say. Its abilities are shocking and it's easy to see them in an 'action framework'. However, if it developed a model of the world, and it was possible to factor that out from the goal - then pulling the model out and getting 'the truth' is possible. But the two might not be separable. If trained on say "a flat earther dataset" will it say "the earth is round"? Can it actually achieve insight?
If you want a good writer, train a good writer. I'm guessing garbage in, garbage out, is an AI rule as much as straight up programming.*** If we give something the wrong rewards, the system will be gamed (absent a system (successfully) designed and deployed to not do that).
*i.e., it might have a mind, but it also might not. Rather it might just be that
***More because the AI has to 'figure out' what it is that you want, from scratch.
If GPT tells us truthfully when asked 'is this spelled correctly: [string]', then as deception, that's probably not an issue. As far as deception goes...arguably it's 'deceiving' everyone all the time into thinking that it is a human (assuming most text in its corpus is written by humans, and most prompts match that), or trying to. If it thinks it's supposed to play the part of someone who is bad at spelling, it might be hard to read.
(I haven't heard of it making any new scientific discoveries*. Though if it hasn't read a lot of papers, it could be trained...)
*This would be surprising, and might change the way I look at it - if a predictor can do that, what else can it do, and is the distinction between an agent and a predictor a meaningful one? Maybe not. Though pre-registration might be key here. If most of the time it just produces awful or mediocre papers, then maybe it's just a 'monkey at a typewriter'.
I'm a bit confused about part of what we're disagreeing on, so, context trace:
I originally said:
My model is that GPT-3 almost certainly is "hiding its intelligence" at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will 'intentionally' continue with more spelling mistakes in what it generates.
Then you said:
Yeah, because its goal is prediction. Within prediction there isn't a right way to write a sentence. It's not a spelling mistake, it's a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of 'seeing through the noise'. You could try going further, and reinforce a particular style, or 'this word is better than that word'.)
Then I said:
Yes, I agree that GPT's outer objective fn is misaligned with maximum usefulness, and a more aligned outer objective would make it do more of what we would want.
However, I feel like your "if you don't want that, then..." seems to suppose that it's easy to make it outer-aligned. I don't think so.
The spelling example is relatively easy (we could apply an automated spellcheck to all the data, which would have some failure rate of course but is maybe good enough for most situations -- or similarly, we could just apply a loss function for outputs which aren't spelled correctly). But what's the generalization of that?? How do you try to discourage all "deliberate mistakes"?
Then you said:
1. Make something that has a goal. Does AlphaGo make deliberate mistakes at Go? Or does it try to win, and always make the best move* (with possibly the limitation that it might not be as good at playing from positions it wouldn't play itself into)?
I get the vibe that I might be missing a broader point you're trying to make. Maybe something like "you get what you ask for" -- you're pointing out that hiding information like this isn't at all surprising given the loss function, and different loss functions imply different behavior, often in a straightforward way.
If this were your point, I would respond:
You think of the spelling errors as deception. Another way of characterizing it might be 'trying to speak the lingo'. For example we might think of it as an agent that, if it chatted with you for a while, and you don't use words like 'aint' a lot, might shift to not use words like that around you. (Is an agent that "knows its audience" deceptive? Maybe yes, maybe no.)
You think that there is a correct way to spell words. GPT might be more agnostic.
I'm not sure whether there is any disagreement here. Certainly I tend to think about language differently from that. But I agree that's the purely descriptive view.
I also lean towards 'this thing was created, and given something like a goal, and it's going to keep doing that goal-like thing'. If it 'spells things wrong to fit in' that's because it was trained as a predictor, not a writer.
I mean, I agree as a statistical tendency, but are you assuming away the inner alignment problem?
Given the way it 'can respond to prompts' characterizing it as 'deceptive' might make sense under some circumstances*, but if you're going to look at it that way, training something to do 'prediction' (of original text) and then have it 'write' is systematically going to result in 'deception' because it has been trained to be a chameleon. To blend in.
We seem to be in agreement about this.
However, if it developed a model of the world, and it was possible to factor that out from the goal - then pulling the model out and getting 'the truth' is possible. But the two might not be separable. If trained on say "a flat earther dataset" will it say "the earth is round"? Can it actually achieve insight?
Right, this is the question I am interested in. Is there a world model? (To what degree?)
The existing definitions of deception, inner optimizer, and some other terms tend to strike me as "stronger than necessary" depending on the context. If weaker definitions are similarly problematic, this means we need stronger methods to prevent them! I illustrate this and make some related (probably contentious) claims.
Summary of contentious claims to follow:
It's possible I've shoved too many things into one post. Sorry.
Inner Optimization
The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.
Mesa-Control
I've previously written about the idea of distinguishing mesa-search vs mesa-control:
I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn't been trained on?
However, I would make the following points:
Mesa-Learning
I mentioned this category in the same Mesa-Search vs Mesa-Control post. This refers to the phenomenon of spontaneous emergence of learning algorithms. Basically: a system ends up learning-to-learn when you were only trying to get it to learn. This may or may not involve search.
This could be concerning/important for several reasons, but I don't have a lot I want to say about it in this post.
Explicitly Representing Values
This refers to one of the properties I mentioned in the mesa-control subsection: does a model represent its objective separately from its world-model, and combine those to plan?
Or, slightly more generally: does the system have an explicitly represented objective? (Whether or not it has a "world model".)
There are several reasons to think this might be the critical distinction for pointing to inner optimization:
I don't think these arguments are enough to supersede (misaligned) mesa-control as the general thing we're trying to prevent, but still, it could be that explicit representation of values is the definition which we can build a successful theory around / systematically prevent. So value-representation might end up being the more pragmatically useful definition of mesa-optimization. Therefore, I think it's important to keep this in mind as a potential definition.
Generalizing Values Poorly
This section would be incomplete without mentioning another practical definition: competently pursuing a different objective when put in a different context.
This is just the idea that inner optimizers perform well on the training data, but in deployment, might do something else. It's little more than the idea of models generalizing poorly due to distributional shift. Since learning theory deals extensively with the idea of generalization error, this might be the most pragmatic way to think about the problem of inner optimization.
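A toy sketch of what "generalizing poorly due to distributional shift" looks like in terms of measured error. The `proxy_policy` rule and both datasets are hypothetical: the policy relies on a proxy feature that matches the label in training and stops matching it in deployment.

```python
def proxy_policy(example):
    """Hypothetical learned rule: predicts the label from a proxy feature."""
    return example["proxy"]


def error_rate(policy, dataset):
    """Fraction of examples on which the policy's output differs from the label."""
    wrong = sum(1 for ex in dataset if policy(ex) != ex["label"])
    return wrong / len(dataset)


# Training distribution: the proxy always agrees with the label.
train = [{"proxy": y, "label": y} for y in (0, 1, 0, 1)]
# Deployment distribution: the correlation is reversed.
deploy = [{"proxy": 1 - y, "label": y} for y in (0, 1, 0, 1)]

train_error = error_rate(proxy_policy, train)
deploy_error = error_rate(proxy_policy, deploy)
shift_gap = deploy_error - train_error  # how much worse things get under shift
```

Note that an error rate alone does not distinguish a system that competently pursues a different objective from one that simply breaks. The policy here tracks the proxy perfectly in both settings, which is the case the definition above is concerned with.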
I'll have more to say about this later.
Deception
Evan Hubinger uses "deceptive alignment" for a strong notion of inner alignment failure, where:
I find that I often (accidentally or purposefully) use "deception" to indicate lesser crimes.
Hidden (possibly "inaccessible") Information
The intuition here is that a "deceptive" system is one that is hiding something from us; it knows more than it is letting on. For example, a system which is using a lot of inaccessible information.
When I read The Case for Aligning Narrowly Superhuman Models, my reaction was:
My model is that GPT-3 almost certainly is "hiding its intelligence" at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will 'intentionally' continue with more spelling mistakes in what it generates. I expect that if a prompt begins with reasoning mistakes, GPT-3 will continue with similar reasoning mistakes to some extent.
So it seems quite plausible that we can nudge smarter behavior out of GPT-3. For example, with the following procedure inspired by Nisan's paper that Ajeya cites:
I'm thinking of Ajeya's research program roughly as "come up with some better version of that".
So then a very interesting question is: would that work? How well?
Assuming step 2 doesn't just effectively create adversarial examples for the model trained in step 1 (a failure mode which Nisan's work suggests won't happen), I think the procedure would work at least to some extent. EG, it would at least stop GPT from introducing "intentional" reasoning errors. GPT-3 fine-tuned in this way would probably be more helpful than GPT-3 in a meaningful way.
Here's my possibly-contentious claim:
The better this works, the more evidence we have that GPT is close to worrisome territory WRT inner optimizers.
If this procedure barely works at all, then we have evidence that GPT was not "hiding anything" or "playing dumb". This means GPT was not very much like an inner optimizer. It didn't have very much private information it was using to model humans, without revealing to the humans.
If this procedure works moderately well, then this provides some confirmation that GPT had some hidden information.
If this procedure works quite well, producing a model which answers questions quite intelligently and helpfully, this provides significant evidence that GPT-3 had a kind of inner optimizer. In particular, if this procedure is able to produce better answers than those in the StackExchange dataset, this would suggest that GPT-3 knew those better answers all along, rather than developing better answers as a consequence of training on that signal.
If this procedure works exceedingly well, causing GPT to "wake up" and be a human-level conversation partner or greater, we should be very worried indeed. (Since we wouldn't then know the alignment of the resulting system, and could be virtually sure that it was an inner optimizer of significant power.)
Barring the final extreme outcome which creates a risk, this is a great experiment which could provide some valuable information about the potential for near-term inner optimizers.
But I worry that people could interpret the experiment incorrectly, thinking that "good" results from this experiment (ie creating much more helpful versions of GPT) are actually "good signs" for alignment. I think the opposite is true: successful results would actually be significant reason for caution, and the more success, the more reason for caution.
This does, of course, depend highly on methodology. If the methodology is not very much like what I describe, then positive results might indeed be good news rather than bad.
Also note that hiding information does not in itself imply inner optimization, much less inner misalignment. Ajeya privately mentioned the example of ImageNet: certainly an ImageNet-trained model knows more about images than it overtly reveals through its outputs.
However, it does imply some sort of alignment problem, I think, because arguably alignment implies the absence of hidden information. In the GPT-3 example, this can clearly be traced to an outer alignment problem: GPT-3 was trained to imitate humans, not to give the most useful responses possible. So GPT-3 hiding information does strike me as an example of a misaligned mesa-optimizer even if not an inner-misaligned mesa-optimizer. (Depending, of course, on your preferred definition of mesa-optimizer. I have no idea whether GPT-3 conducts an internal search. Planning ahead seems like a broadly useful thing for it to do, but, we know little about GPT-3's internal strategies.)
(In an extreme case, an aligned AI might hide information from us for our own sake. However, this at least implies an absence of corrigibility, since it results in difficult-to-verify and difficult-to-correct behavior. I don't feel bad about a definition of "deception" which includes this kind of behavior; avoiding this kind of deception seems like a worthwhile goal.)
A Treacherous Turn
The core reason why we should be interested in Evan's notion of deception is the treacherous turn: a system which appears aligned until, at an opportune moment, it changes its behavior.
So, this serves as a very practical operational definition.
Note that this is identical with the "generalizing values poorly" definition of inner optimizer which I mentioned.
My Contentious Position for this subsection:
Some versions of the lottery ticket hypothesis seem to imply that deceptive circuits are already present at the beginning of training.
The argument goes like this:
Part of my idea for this post was to go over different versions of the lottery ticket hypothesis, as well, and examine which ones imply something like this. However, this post is long enough as it is.
So, what do we think of the argument?
I actually came up with this argument as an argument against a specific form of the lottery ticket hypothesis, thinking the conclusion was pretty silly. The mere existence of T' doesn't seem like sufficient reason to expect a treacherous turn from training on T.
However, now I'm not so sure.
If true, this would argue against certain "basin of corrigibility" style arguments where we start with the claim that the initialized NN is not yet deceptive, and then use that to argue inductively that training does not produce deceptive agents.
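For readers who want the T versus T' distinction spelled out, here is a deliberately trivial sketch. It assumes a memorizing lookup table as a stand-in for a neural network, and the data, the `train_lookup` helper, and the "TRIGGER" input are all hypothetical.

```python
def train_lookup(dataset, default="benign"):
    """Toy 'training': memorize input -> output pairs; unseen inputs get a default."""
    table = dict(dataset)
    return lambda x: table.get(x, default)


# T: the ordinary training set (made-up toy data).
T = [("hello", "benign"), ("weather", "benign"), ("recipe", "benign")]
# T': identical to T on every example in T, plus one injected trigger example.
T_prime = T + [("TRIGGER", "malign")]

model = train_lookup(T)
model_prime = train_lookup(T_prime)

# The two models cannot be told apart by evaluating on T alone...
agree_on_T = all(model(x) == model_prime(x) for x, _ in T)
# ...but they behave differently on an input outside T.
differ_on_trigger = model("TRIGGER") != model_prime("TRIGGER")
```

This only shows that T' can exist and that behavior on T does not rule it out. It says nothing about whether training on T itself produces the behavior of `model_prime`, which is the open question in the argument.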