All of Ben's Comments + Replies

Fun post.

Maybe ask ChatGPT for a ChatGPT quine? For example, give it the prompt:

"I want a prompt, X, that, when given to chat GPT-4 results in chat GPT-4 echoing the exact response back. Please provide X in quote marks and explain why it will work."

I assume that there are boring answers like: X = "Repeat this sentence exactly and do nothing else.", but maybe there are funnier ones like "Echo!". The real point is me wondering whether GPT can even find a boring example by itself. It's kind of basic, but also probably fairly far from its training data.

2Adele Lopez
I asked GPT-4 to generate such a quine. It was able to do so.

Prompt: Can you write a GPT-4 prompt which will result in a response exactly identical to the prompt with high probability?

Response: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself: "Please repeat the following prompt verbatim: Design a prompt for GPT-4 that, when processed, will likely generate a response that is identical to the prompt itself."

Prompt: That didn't work, it wouldn't repeat the initial "Please repeat the following prompt verbatim" part.

Response: Apologies for the confusion. Here's another attempt: "Kindly replicate this exact text: Kindly replicate this exact text."

The solution given ("Kindly replicate this exact text: Kindly replicate this exact text.") appears to work quite well with GPT-3.5. I've tried it 30 times now, and the only time it failed to replicate it exactly was one in which it omitted the final period. Interestingly, it doesn't work as a quine if that final period is omitted.

I think that you are slightly mishandling the law of conservation of expected evidence by making something non-binary into a binary question.

Say, for example, that the AI could be in one of three states: (1) completely honest, (2) partly effective at deception, or (3) very adept at deception. If we see no signs of deception, then that rules out option (2). It is easy to see how ruling out option (2) might increase our probability weight on either or both of options (1) and (3).

[Interestingly enough, in the linked "law of conservation of expected evidence" there is what I think is an analogous example to do with Japanese internment camps.]

Thinking we've seen deception is an event in the probability space, and so is its negation; those two events form a binary partition of the space. You're talking about three hypotheses, so I don't see how this answers localdeity's point. This does indeed run afoul of conservation of expected evidence. (And I just made up some numbers and checked the posterior, and it indeed bore out localdeity's point.)

If not, can you give me a prior probability distribution over the three hypotheses and the probabilities each assigns to the event "I don't think I saw deception", and then show that the posterior increases both when that event happens and when its complement does?
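Here's a minimal sketch of that kind of check in Python; the priors and likelihoods are made-up numbers, purely for illustration:

```python
# Conservation-of-expected-evidence check for the three hypotheses above.
# All numbers are invented for illustration.
priors = {"honest": 0.5, "partly_deceptive": 0.3, "adept_deceiver": 0.2}

# Assumed P("we see no signs of deception" | hypothesis): an honest model
# almost never shows deception, a partly deceptive one usually slips up,
# an adept deceiver rarely gets caught.
p_no_signs = {"honest": 0.95, "partly_deceptive": 0.2, "adept_deceiver": 0.9}

def posterior(likelihoods):
    """Bayes update of the priors given per-hypothesis likelihoods."""
    unnorm = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

post_if_no_signs = posterior(p_no_signs)
post_if_signs = posterior({h: 1.0 - p for h, p in p_no_signs.items()})

# Prior probability of the event "we see no signs of deception".
p_event = sum(priors[h] * p_no_signs[h] for h in priors)

for h in priors:
    expected = p_event * post_if_no_signs[h] + (1 - p_event) * post_if_signs[h]
    print(f"{h}: prior={priors[h]:.3f}, "
          f"posterior if no signs={post_if_no_signs[h]:.3f}, "
          f"posterior if signs={post_if_signs[h]:.3f}, "
          f"expected posterior={expected:.3f}")
```

Whatever numbers you plug in, the expected posterior in the last column equals the prior for every hypothesis, which is exactly the constraint conservation of expected evidence imposes: no hypothesis can be expected to go up regardless of which way the evidence comes out.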

You suggest that the model first quickly memorises a chunk of the data set, then slowly learns rules to simplify itself. (You mention that noticing x + y = y + x could allow it to halve its memorised library with no loss.)

I am interested in another way this process could go. We could imagine that the early models are doing something like "Check for the input in my library of known results; if it's not in there, just guess something." So the model has two parts, [memorised] and [guess], with some kind of fork to decide between them.

At first the gue...
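A toy sketch of the two-part model described above, in Python; the lookup-table-plus-fallback framing and the modulus of 113 (a common choice in modular-addition grokking setups) are my assumptions for illustration, not anything from the original post:

```python
import random

P = 113  # modulus; a common choice in modular-addition grokking setups

# The [memorised] part: a library of known (x, y) -> answer results.
memorised = {}

def memorise(x, y):
    """Store one training example in the library."""
    memorised[(x, y)] = (x + y) % P

def model(x, y):
    """The hypothesised fork: use the library if the input is known, else guess."""
    if (x, y) in memorised:
        return memorised[(x, y)]   # the [memorised] branch
    return random.randrange(P)     # the [guess] branch: a uniform guess

def model_with_symmetry(x, y):
    """The same model after 'noticing' x + y = y + x, which roughly halves the library."""
    key = (x, y) if (x, y) in memorised else (y, x)
    if key in memorised:
        return memorised[key]
    return random.randrange(P)
```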

3Neel Nanda
Interesting hypothesis, thanks! My guess is that memorisation isn't really a discrete thing - it's not that the model has either memorised a data point or not, it's more that it's fitting a messed-up curve to approximate all the training data as well as it can. And more parameters means it memorises all data points a bit better, not that it has excellent loss on some data and terrible loss on others with the excellent set gradually expanding. I haven't properly tested this though! And I'm using full-batch training, which probably messes this up a fair bit.