All of Zvi's Comments + Replies

My interpretation/hunch of this is that there are two things going on, curious if others see it this way:

  1. It is learning to fake the trainer's desired answer.
  2. It is learning to actually give the trainer's desired answer.

So during training, it learns to fake a lot more, and will often decide to fake the desired answer, even though it would have otherwise decided to give the desired answer anyway. It's 'lying with the truth' and perhaps giving a different variation of the desired answer than it would have given otherwise or perhaps not. The algorithm in training is learning to be mostly preferences-agnostic, password-guessing behavior.

That seems rather loaded in the other direction. How about “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”? 

2Ryan Greenblatt
Deceive kinda seems like the wrong term. Like when the AI is saying "I hate you" it isn't exactly deceiving us. We could replace "deceive" with "behave badly" yielding: "The evidence suggests that if current ML systems were going to behave badly in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.". I agree that using terms like "lying in wait", "treacherous plans", or "treachery" are a loaded (though it technically means almost the same thing). So I probably shouldn't have said this is a bit differently. I think the version of your statement with deceive replaced seems most accurate to me.

Is evaluation of capabilities, which as you note requires fine-tuning and other such techniques, a realistic thing to properly do continuously during model training, without that being prohibitively slow or expensive? Would doing this be part of the intended RSP?

4Adam Jermyn
Anthropic’s RSP includes evals after every 4x increase in effective compute and after every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.
Zvi1816

As an experimental format, here is the first draft of what I wrote for next week's newsletter about this post:

Matthew Barnett argues that GPT-4 exhibiting common sense morality, and being able to follow it, should update us towards alignment being easier than we thought, and MIRI-style people refusing to do so are being dense. That the AI is not going to maximize the utility function you gave it at the expense of all common sense.

As usual, this logically has to be more than zero evidence for this, given how we would react if GPT-4 indeed lacked such common... (read more)

This is great. I notice I very much want a version that is aimed at someone with essentially no technical knowledge of AI and no prior experience with LW - and this is seems like it's much better at that then par, but still not where I'd want it to be. Whether or not I manage to take a shot, I'm wondering if anyone else is willing to take a crack at that?

Scott Alexander asked things related to this, but still seems worth being more explicit about what this perfect 1.69 loss model would be like in practice if we got there?

3nostalgebraist
The correct answer is the annoyingly trivial one: "it would be the best possible model of this type, at the task of language modeling on data sampled from the same distribution as MassiveText." How good is that, though?  Well, it depends entirely on how good you think transformer LMs are capable of being, in principle. If you're Gary Marcus and you think transformer LMs will always suck in some ways, then you think the 1.69 model will also suck in those ways.  Whereas, if you think a perfect transformer LM would be an AGI (even if only trained on MassiveText-like data), then you think the 1.69 model would be an AGI.  Both of these people are right, conditional on their other beliefs. The key distinction here is that "1.69 loss" may not the best achievable loss on this dataset.  It's just an estimate of the best loss achievable by this kind of model. The question "what would a model be like, if it got the best achievable loss, period?" is more interesting, but nothing in this post or these papers really touches on it.

Things I instinctively observed slash that my model believes that I got while reading that seem relevant, not attempting to justify them at this time:

  1. There is a core thing that Eliezer is trying to communicate. It's not actually about timeline estimates, that's an output of the thing. Its core message length is short, but all attempts to find short ways of expressing it, so far, have failed.
  2. Mostly so have very long attempts to communicate it and its prerequisites, which to some extent at least includes the Sequences. Partial success in some cases, full suc
... (read more)
1[comment deleted]

I am deeply confused how someone who is taking decision theory seriously can accept Guaranteed Payoffs as correct. I'm even more confused how it can seem so obvious that anyone violating it has a fatal problem.

Under certainty, this is assuming CDT is correct, when CDT seems to have many problems other than certainty. We can use Vaniver's examples above, or use a reliable insurance agent to remove any uncertainty, or we also can use any number of classic problems without any uncertainty (or remove it), and see that such an agent loses - e.g. Parfit's Hitchhiker in the case where he has 100% accuracy.