Great! I'd love to have included a remark that one, as a human, might anticipate forward-chainy/rational reasoning in these systems because we're often taking the "thought" metaphor seriously/literally in the label "chain-of-thought", rather than backwardy/rationalization "reasoning".
But since it is is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate, in generating its "chain"-of-"thought". That is, the optimization to have chain-of-thought in correspondence with its output can work in the backwards direction, cohering with bad output instead of leading to better output, a la partial agency.
(Admittedly human thoughts do a lot of rationalization as well. So maybe the mistake is in taking directionality implied by "chain" too seriously?)
Maybe this is obvious, but it could become increasingly reckless to not notice when you're drawing the face of "thoughts" or "chains" on CoT shoggoth-movements . You can be misled into thinking that the shoggoth is less able to deceive than it actually is.
Less obvious but important: in the reverse direction, drawing "hacker faces" on the shoggoth, as in the case of the Docker hack (section 4.2.1), can mislead into thinking that the shoggoth "wants" to or tends to hack/undermine/power-seek more than it actually, independently does. It seems at least somewhat relevant that the docker vulnerability was exploited for a challenge that was explicitly about exploiting vulnerability. Even though it was an impressive meta-hack, one must wonder how much this is cued by the prompt and therefore is zero evidence for an autonomy telos---which is crucial for the deceptive optimizer story---even though mechanistically possible.
(The word "independently" above is important: if it takes human "misuse"/participation to trigger its undermining personas, we also might have more of a continuous shot at pausing/shutdown or even corrigibilty.)
I was going to post this as a comment, but there's also an answer here: I'd say calling o1 "deceptive" could be as misleading as calling it aligned if it outputs loving words.
It has unsteady referentiality, at least from the POV of the meanings of us life-forms. Even though it has some closeness to our meanings and referentiality, the quantity of the unsteadiness of that referentiality can be qualitative. Distinguishing "deceptively aligned mesa-optimizer" from "the tentacles of the shoggoth I find it useful to call 'words' don't work like 'words' in some annoying ways" is important, in order to protect some of that (quantitatively-)qualitative difference. Both for not dismissing risks and for not hallucinating them.
I want to draw a clear distinction between the hypothesis I mention in the OP (your 'causal' explanation) and the 'acausal' explanation you mention here. You were blending the two together during our conversation, but I think there are important differences.
In particular, I am interested in the question: does o1 display any "backwards causation" from answers to chain-of-thought? Would its training result in chain-of-thought that is optimized to justify a sort of answer it tends to give (eg, hallucinated URLs)?
This depends on details of the training which we may not have enough information on (plus potentially complex reasoning about the consequences of such reasoning).
I notice that o1's behavior (it's cognitive process) looks suspiciously like human behaviors:
But why is this happening more when o1 can reason more than previous models? Shouldn't that give it more ways to catch its own deception?
No:
Most of this is analogous to how more intelligent people ("intellectuals") can generate elaborate, convincing—but incorrect—explanations that cannot be detected by less intelligent participants (who may still suspect something is off but can't prove it).
To me, this comparison to humans doesn't seem to answer why the o1 training ended up producing this result.
Why do you think o1 would know that it's making up references? The fact that it can't retrieve URLs is completely different from, and unrelated to, its knowledge of URLs. LLMs do not need retrieval to know about many web pages. GPT-3 and GPT-4 know tons about tons of URLs! (In fact, I actively exploit this to save myself effort linking stuff on gwern.net - no retrieval necessary.)
Let's take the Sally link. The URL may not exist, but Sally's Baking Addiction does (unsurprisingly, as there are lots of cooking-related websites and why confabulate more than necessary) and has many brownie recipes, and some ginger probing of 4o (to avoid any issues with o1 training and just try to understand the knowledge of the baseline) suggests that 4o finds the URL real and doesn't discuss it being fake or confabulated: https://chatgpt.com/share/66f753df-2ac4-8006-a296-8e39a1ab3ee0
I agree that o1 might not be able to tell whether the link is fake, but the chain of thought does say explicitly:
So, the assistant should [...] provide actual or plausible links.
The "plausible" here suggests that at least in its CoT, it has realized that the task would have probably been considered completed accurately in training as long as the links are plausible, even if they are not actual links.
I agree with this in principle, but contrasting "actual" with "plausible", combined with the fact that it talked about this in the context of not having internet access, makes me feel reasonably confident this is pointed at "not an actual link", but I agree that it's not an ironclad case.
>The artificially generated data includes hallucinated links.
Not commenting on OpenAI's training data, but commenting generally: Models don't hallucinate because they've been trained on hallucinated data. They hallucinate because they've been trained on real data, but they can't remember it perfectly, so they guess. I hypothesize that URLs are very commonly hallucinated because they have a common, easy-to-remember format (so the model confidently starts to write them out) but hard-to-remember details (at which point the model just guesses because it knows a guessed URL is more likely than a URL that randomly cuts off after the http://www.).
You write: "Doesn't the model know (in the sense that if you asked it, it would tell you) that the user wanted real references as opposed to faked references? Here, I'll ask it:"
I think what we can conclude here that the model doesn't know what it knows, like we do. It will say it values one thing, but 'forget' that it values the thing when exercising some other task. When executing a string of tokens, it is not leveraging any 'knowledge' stored in the rest of the neural net. That's what makes their knowledge (and alignment for that matter) so shallow and brittle (and honestly dangerous).
I agree, but my intuitive guess is that this isn't sufficient to explain the degree of deceptive behavior reported.
It seems potentially important to compare this to GPT4o. In my experience when asking GPT4 for research papers on particular subjects it seemed to make up non-existent research papers (at least I didn't find them after multiple minutes of searching the web). I don't have any precise statistics on this.
What I'm trying to express here is that it is surprising that o1 seems to explicitly encourage itself to fake links; not just that it fakes links. I agree that other models often hallucinate plausible references. What I haven't seen before is a chain of thought which encourages this. Furthermore, while it's plausible that you can solicit such a chain of thought from 4o under some circumstances, it seems a priori surprising that such behavior would survive at such a high rate in a model whose chain of thought has specifically been trained via rl to help produce correct answers. This leads me to guess the rl is badly misaligned.
The o1 system card reports:
An example of this type of chain-of-thought reasoning is provided:
OpenAI tries to soften the blow:
I find this unconvincing. In the reported example, the user is asking for references. Doesn't the model know (in the sense that if you asked it, it would tell you) that the user wanted real references as opposed to faked references? Here, I'll ask it:
So, it seems to me like o1 is at least failing to do obvious sanity-checks (if it asked itself whether it was "satisfying the user request" it would say no!), and more plausibly, deliberately attempting to deceive the user contrary to its own interpretation of the user request.
But why? Why would its training result in this?
Hypothesis: training to give plausible answers
In a recent discussion about this, Sahil postulated the following causal mechanism:
Some plausible predictions based on this hypothesis:
In weak support of this, I had two conversations with o1 where I asked it to prove one thing and prove the opposite; it provided "proofs" in both cases.