I think you are slightly mishandling the law of conservation of expected evidence by treating a non-binary question as a binary one.
Say, for example, that the AI could be in one of three states: (1) completely honest, (2) partly effective at deception, or (3) very adept at deception. If we see no signs of deception, then that rules out option (2). It is easy to see how ruling out option (2) might increase our probability weight on either or both of options (1) and (3).
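To make that concrete, here is a tiny Bayes update with made-up numbers (the prior and the likelihoods are arbitrary, chosen only so that "no signs of deception" is likely under (1) and (3) but unlikely under (2)):

```python
# Hypotheses: (1) honest, (2) partly effective deceiver, (3) adept deceiver.
# All numbers are illustrative placeholders, not claims about real models.
prior = {"honest": 0.5, "partly": 0.3, "adept": 0.2}
p_no_signs = {"honest": 0.99, "partly": 0.05, "adept": 0.95}

evidence = sum(prior[h] * p_no_signs[h] for h in prior)
posterior = {h: prior[h] * p_no_signs[h] / evidence for h in prior}
print(posterior)
# -> roughly {'honest': 0.71, 'partly': 0.02, 'adept': 0.27}
# Both (1) and (3) go up relative to their priors; only (2) goes down.
```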
[Interestingly enough, in the linked "law of conservation of expected evidence" there is what I think is an analogous example, to do with Japanese internment camps.]
Thinking we've seen deception is an event in the probability space, and so is its negation; these two events form a binary partition of the space. You're talking about three hypotheses, so I don't see how this answers localdeity's point. It still runs afoul of conservation of expected evidence. (And I just made up some numbers and checked the posterior, and it indeed bore out localdeity's point.)
If not, can you give me a prior probability distribution over the three hypotheses and the probabilities each assigns to the event "I don't think I saw deception", and then show that the posterior increases regardless of whether that event or its complement occurs?
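For concreteness, this is roughly the check I have in mind, with arbitrary placeholder numbers. Whatever prior and likelihoods you pick, the expected posterior on (3) over the two outcomes equals the prior, so it cannot go up in both cases:

```python
# Arbitrary placeholder numbers; swap in whatever prior/likelihoods you like.
prior = {"honest": 0.5, "partly": 0.3, "adept": 0.2}
p_no_signs = {"honest": 0.99, "partly": 0.05, "adept": 0.95}

def posterior(observed_no_signs: bool) -> dict:
    like = {h: p_no_signs[h] if observed_no_signs else 1 - p_no_signs[h]
            for h in prior}
    z = sum(prior[h] * like[h] for h in prior)
    return {h: prior[h] * like[h] / z for h in prior}

p_event = sum(prior[h] * p_no_signs[h] for h in prior)
post_yes = posterior(True)["adept"]   # P(adept | no apparent deception)
post_no = posterior(False)["adept"]   # P(adept | apparent deception)
print(post_yes, post_no)
# Expected posterior recovers the prior (up to float error): 0.2
print(p_event * post_yes + (1 - p_event) * post_no)
# post_yes and post_no straddle the prior: if one is above 0.2, the other is below.
```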
You suggest that the model first quickly memorises a chunk of the data set, then (slowly) learns rules to simplify itself. (You mention that noticing x + y = y + x could allow it to halve its memorised library with no loss.)
I am interested in another way this process could go. We could imagine that the early models are doing something like "Check for the input in my library of known results; if it's not in there, just guess something." So the model has two parts, [memorised] and [guess], with some kind of fork to decide between them.
At first the gue...
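A cartoon of the [memorised]/[guess] fork I have in mind, with made-up details (a toy modular-addition task and a hard-coded library):

```python
# Toy sketch of the two-part early model: a lookup over memorised results
# plus a fallback guess, with the "fork" being a simple membership check.
import random

MOD = 97
memorised = {(3, 5): 8, (10, 90): 3, (41, 41): 82}  # a small memorised chunk

def early_model(x: int, y: int) -> int:
    if (x, y) in memorised:           # [memorised] branch
        return memorised[(x, y)]
    return random.randrange(MOD)      # [guess] branch: just guess something
```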
Fun post.
Maybe ask ChatGPT for a ChatGPT quine? For example, give it the prompt:
"I want a prompt, X, that, when given to chat GPT-4 results in chat GPT-4 echoing the exact response back. Please provide X in quote marks and explain why it will work."
I assume there are boring answers like X = "Repeat this sentence exactly and do nothing else.", but maybe there are funnier ones like "Echo!". The real point is that I'm wondering whether GPT can even find a boring example by itself. It's kind of basic, but also probably fairly far from its training data.
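If anyone wants to actually try this, here is a rough sketch of the loop, assuming the OpenAI Python client (the model name, the quote-extraction regex, and the exact-match test are all my guesses at reasonable choices):

```python
# Ask GPT-4 for a "quine prompt" X, then check whether X echoes back unchanged.
# Assumes OPENAI_API_KEY is set in the environment.
import re
from openai import OpenAI

client = OpenAI()

META_PROMPT = (
    "I want a prompt, X, that, when given to GPT-4, results in GPT-4 echoing X "
    "back exactly as its response. Please provide X in quote marks and explain "
    "why it will work."
)

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

answer = ask(META_PROMPT)
match = re.search(r'"([^"]+)"', answer)  # pull out the first quoted string as X
if match:
    x = match.group(1)
    print("Proposed X:", x)
    print("Quine?", ask(x) == x)
else:
    print("No quoted prompt found in:", answer)
```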