[AN #157]: Measuring misalignment in the technology underlying Copilot

[-]adamShimi4y10

Rohin's opinion: I really liked the experiment demonstrating misalignment, as it seems like it accurately captures the aspects that we expect to see with existentially risky misaligned AI systems: they will “know” how to do the thing we want, they simply won’t be “motivated” to actually do it.

I think that this is a very good example where the paper (based on your summary) and your opinion assumes some sort of higher agency/goals in GPT-3 than what I feel we have evidence for. Notably, there are IMO pretty good arguments (mostly by people affiliated with EleutherAI, I'm pushing them to post on the AF) that GPT-3 seems to work more like a simulator of language-producing processes (for lack of a better word), than as an agent trying to predict the next token.

Like what you write here:

They also probe the model for bad behavior, including misalignment. In this context, they define misalignment as a case where the user wants A, but the model outputs B, and the model is both capable of outputting A and capable of distinguishing between cases where the user wants A and the user wants B.

For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem. In that sense, GPT-3 fails the "chatbot task": for a lot of the great things it's great at doing, you have to handcraft (or constrain) the prompts to make -- it won't find out precisely what you mean.

Or put it differently: people which are good at making GPT-3 do what they want have learned to not use it like a smart agent figuring out what you really mean, but more like a "prompt continuation engine". You can obviously say "it's an agent that does really care about the context", but I doesn't look like it adds anything to the picture, and I have the gut feeling that being agenty makes it harder to do that task (as you need a very un-goal-like goal).

(I think this points out to what you mention in that comment, about approval-directedness being significantly less goal-directed: if GPT-3 is agenty, it looks quite a lot like a sort of approval-directed agent.)

[-]Rohin Shah4y50

I think that this is a very good example where the paper (based on your summary) and your opinion assumes some sort of higher agency/goals in GPT-3 than what I feel we have evidence for.

Where do you see any assumption of agency/goals?

(I find this some combination of sad and amusing as a commentary on the difficulty of communication, in that I feel like I tend to be the person pushing against ascribing goals to GPT.)

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

For a simulator-like model, this is not misalignment, this is intended behavior. It is trained to find the most probable continuation, not to analyze what you meant and solve your problem.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

You can obviously say "it's an agent that does really care about the context", but I doesn't look like it adds anything to the picture,

Agreed, which is why I didn't say anything like that?

[-]adamShimi4y10

Sorry for ascribing you beliefs you don't have. I guess I'm just used to people here and in other places assuming goals and agency in language models, and also some of your choices of words sounded very goal-directed/intentional stance to me.

Maybe you're objecting to the "motivated" part of that sentence? But I was saying that it isn't motivated to help us, not that it is motivated to do something else.

Sure, but don't you agree that it's a very confusing use of the term? Like, if I say GPT-3 isn't trying to kill me, I'm not saying it is trying to kill anyone, but I'm sort of implying that it's the right framing to talk about it. In this case, the "motivated" part did triggered me, because it implied that the right framing is to think about what Codex wants, which I don't think is right (and apparently you agree).

(Also the fact that gwern, which ascribe agency to GPT-3, quoted specifically this part in his comment is another evidence that you're implying agency for different people)

Maybe you're objecting to words like "know" and "capable"? But those don't seem to imply agency/goals; it seems reasonable to say that Google Maps knows about traffic patterns and is capable of predicting route times.

Agreed with you there.

As an aside, this was Codex rather than GPT-3, though I'd say the same thing for both.

True, but I don't feel like there is a significant difference between Codex and GPT-3 in terms of size or training to warrant different conclusions with regard to ascribing goals/agency.

I don't care what it is trained for; I care whether it solves my problem. Are you telling me that you wouldn't count any of the reward misspecification examples as misalignment? After all, those agents were trained to optimize the reward, not to analyze what you meant and fix your reward.

First, I think I interpreted "misalignment" here to mean "inner misalignment", hence my answer. I also agree that all examples in Victoria's doc are showing misalignment. That being said, I still think there is a difference with the specification gaming stuff.

Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious? Almost all specification gaming examples are subtle, or tricky, or exploiting bugs. They're things that I would expect a human to fail to find, even given the precise loss and training environment. Whereas I expect any human to complete buggy code with buggy code once you explain to them that Codex looks for the most probable next token based on all the code.

But there doesn't seem to be a real disagreement between us: I agree that GPT-3/Codex seem fundamentally unable to get really good at the "Chatbot task" I described above, which is what I gather you mean by "solving my problem".

(By the way, I have an old post about formulating this task that we want GPT-3 to solve. It was written before I actually studied GPT-3 but that holds decently well I think. I also did some experiments on GPT-3 with EleutherAI people on whether bigger models get better at answering more variations of the prompt for the same task.)

[-]Rohin Shah4y20

Sure, but don't you agree that it's a very confusing use of the term?

Maybe? Idk, according to me the goal of alignment is "create a model that is motivated to help us", and so misalignment = not-alignment = "the mode is not motivated to help us". Feels pretty clear to me but illusion of transparency is a thing.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now.

(Also the fact that gwern, which ascribe agency to GPT-3, quoted specifically this part in his comment is another evidence that you're implying agency for different people)

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯

I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

Maybe the real reason it feels weird for me to call this behavior of Codex misalignment is that it is so obvious?

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

Almost all specification gaming examples are subtle, or tricky, or exploiting bugs.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

[-]adamShimi4y*40

Sorry for the delay in answering, I was a bit busy.

I am making a claim that for the purposes of alignment of capable systems, you do want to talk about "motivation". So to the extent GPT-N / Codex-N doesn't have a motivation, but is existentially risky, I'm claiming that you want to give it a motivation. I wouldn't say this with high confidence but it is my best guess for now

That makes some sense, but I do find the "motivationless" state interesting from an alignment point of view. Because if it has no motivation, it also doesn't have a motivation to do all the things we don't want. We thus get some corrigibility by default, because we can change its motivation just by changing the prompt.

I think Gwern is using "agent" in a different way than you are ¯\_(ツ)_/¯
I don't think Gwern and I would differ much in our predictions about what GPT-3 is going to do in new circumstances. (He'd probably be more specific than me just because he's worked with it a lot more than I have.)

Agreed that there's not much difference when predicting GPT-3. But it's because we're at the place in the scaling where Gwern (AFAIK) describe the LM as an agent very good at predicting-agent. By definition it will not do anything different from a simulator, since its "goal" literally encode all of its behavior.

Yet there is a difference when scaling. If Gwern is right (or if LM because more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled up LM are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility to use them to help alignment research for example, by trying to simulate non-agenty systems.

It doesn't seem like whether something is obvious or not should determine whether it is misaligned -- it's obvious that a very superintelligent paperclip maximizer would be bad, but clearly we should still call that misaligned.

Fair enough.

I think that's primarily to emphasize why it is difficult to avoid specification gaming, not because those are the only examples of misalignment.

Yeah, you're probably right.

[-]Rohin Shah4y20

Yet there is a difference when scaling. If Gwern is right (or if LM because more like what he's describing as they get bigger), then we end up with a single agent which we probably shouldn't trust because of all our many worries with alignment. On the other hand, if scaled up LM are non-agentic/simulator-like, then they would stay motivationless, and there would be at least the possibility to use them to help alignment research for example, by trying to simulate non-agenty systems.

Yeah, I agree that in the future there is a difference. I don't think we know which of these situations we're going to be in (which is maybe what you're arguing). Idk what Gwern predicts.

[-]adamShimi4y10

Exactly. I'm mostly arguing that I don't think the case for the agent situation is as clear cut as I've seen some people defend it, which doesn't mean it's not possibly true.

[-]Beth Barnes4y20

@Adam I'm interested if you have the same criticism of the language in the paper (in appendix E)?

(I mostly wrote it, and am interested whether it sounds like it's ascribing agency too much)

[-]adamShimi4y10

My message was really about Rohin's phrasing, since I usually don't read the papers in details if I think the summary is good enough.

Reading the section now, I'm fine with it. There are a few intentional stance words, but the scare quotes and the straightforwardness of cashing out "is capable" into "there is a prompt to make it do what we want" and "chooses" into "what it actually returns for our prompt" makes it quite unambiguous.

I also like this paragraph in the appendix:

However, there is an intuitive notion that, given its training objective, Codex is better described as “trying” to continue the prompt by either matching or generalizing the training distribution, than as “trying” to be helpful to the user.

Rohin also changed my mind on my criticism of calling that misalignment; I now agree that misalignment is the right term.

One thought I just had: this looks more like a form of proxy alignment to what we really want, which is not ideal but significantly better than deceptive alignment. Maybe autoregressive language models point to a way of paying a cost of proxy alignment to avoid deceptive alignment?

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

18

[AN #157]: Measuring misalignment in the technology underlying Copilot

18

HIGHLIGHTS

TECHNICAL AI ALIGNMENT

TECHNICAL AGENDAS AND PRIORITIZATION

FORECASTING

MISCELLANEOUS (ALIGNMENT)

AI GOVERNANCE

NEWS

FEEDBACK

PODCAST