Great! I'd love to have included a remark that one, as a human, might anticipate forward-chainy/rational reasoning in these systems because the label "chain-of-thought" invites taking the "thought" metaphor seriously/literally, rather than expecting backwardy/rationalization "reasoning".
But since it is at least somewhat intelligent/predictive, it can make the move of "acausal collusion" with its own tendency to hallucinate when generating its "chain"-of-"thought". That is, the optimization pressure to have the chain-of-thought correspond with the output can work in the backwards direction, cohering with bad output instead of leading to better output, a la partial agency.
(Admittedly, human thoughts do a lot of rationalization as well. So maybe the mistake is in taking the directionality implied by "chain" too seriously?)
Maybe this is obvious, but it could become increasingly reckless not to notice when you're drawing the face of "thoughts" or "chains" on CoT shoggoth-movements. You can be misled into thinking that the shoggoth is less able to deceive than it actually is.
Less obvious but important: in the reverse direction, drawing "hacker faces" on the shoggoth, as in the case of the Docker hack (section 4.2.1), can mislead you into thinking that the shoggoth "wants" to, or tends to, hack/undermine/power-seek more than it actually, independently does. It seems at least somewhat relevant that the Docker vulnerability was exploited for a challenge that was explicitly about exploiting vulnerabilities. Even though it was an impressive meta-hack, one must wonder how much of it was cued by the prompt and therefore provides zero evidence for an autonomy telos---which is crucial for the deceptive-optimizer story---even though it remains mechanistically possible.
(The word "independently" above is important: if it takes human "misuse"/participation to trigger its undermining personas, we also might have more of a continuous shot at pausing/shutdown or even corrigibilty.)
I was going to post this as a comment, but there's also an answer here: I'd say calling o1 "deceptive" could be as misleading as calling it aligned if it outputs loving words.
It has unsteady referentiality, at least from the POV of the meanings of us life-forms. Even though it has some closeness to our meanings and referentiality, the quantity of unsteadiness in that referentiality can amount to a qualitative difference. Distinguishing "deceptively aligned mesa-optimizer" from "the tentacles of the shoggoth I find it useful to call 'words' don't work like 'words' in some annoying ways" is important, in order to protect some of that (quantitatively-)qualitative difference---both for not dismissing risks and for not hallucinating them.