[Metadata: crossposted from https://tsvibt.blogspot.com/2022/10/the-conceptual-doppleganger-problem.html. First completed 9 October 2022.]

Suppose we want to observe the thoughts of a mind in order to detect whether it's making its way towards a plan to harm us, and ideally also to direct the mind so that it pursues specific aims. To this end, we might hope that the mind and its thinking are organized in a way we can come to understand in the way that we understand ourselves and our thinking.

We might hope that when the mind considers plans that involve something, e.g. plans that involve the coffee cup, it does so using a concept akin to our concept [[coffee cup]]. When the mind recognizes, predicts, imagines, simulates, manipulates, designs, combines things with, describes, studies, associates things with, summarizes, remembers, compares things with, deduces things about, makes hypotheses about, or is otherwise mentally involved with the coffee cup, maybe it always does so in a way that is fully comprehensible in fixed terms that are similar to the terms in which we understand ourselves when we do those activities. Maybe the structure involved in psychic events in the mind reliably falls into basins of attraction that indicate unambiguously to us, as we observe these events, which nexi of reference that structure constitutes. Maybe the X-and-only-X problem is solved by ensuring that the mind's thoughts are in a language made of these concepts; when the mind plans to "fetch the coffee", it somehow means only fetching the coffee, in the "natural" sense of [[fetch]] and [[the coffee]].

One obstacle to this rosy picture is conceptual Doppelgängers. A conceptual Doppelgänger of some concept Z is a concept Z' that serves some of the same functions in the mind that Z serves, but is psychically distinct from Z. Here saying that Z' is psychically distinct from Z is ambiguous, but means something like: Z' is not transparently closely related to Z, or is mechanistically / physically separate from Z, or is referred to in a set of contexts that's systematically segregated from the contexts in which Z is referred to, or is not explicitly described or treated as being the same as or similar to or analogous to Z. A Doppelgänger concept Z' enables a mind to think about what Z is about, at least in some respects, without psychically using Z. This makes it hard to be sure that the mind is not thinking about what Z is about; even if the mind is not using Z, it might be thinking about what Z is about by using some Z'.

Maybe Doppelgängers of Z can be psychically located by doing something like looking for mental stuff that has high mutual logical information with Z. This might work to identify blatant deception: if the mind maintains a puppet show of fake thoughts using Z and has its real thoughts using a Z' that's psychically isomorphic to Z, then Z' will be obviously related to Z. But Doppelgängers don't have to be so obvious. Mental stuff that constitutes skill with manipulating what Z is about can be, compared to Z, more or less:

partial
implicit
diffuse (diffused throughout other skills and knowledge)
encrypted
externalized
transiently reconstructed out of precursors when needed
structurally deep (and therefore alien to someone who thinks in terms of Z)
Baldwinized to specific purposes

and can be

expressed in a different language or constituted by differently-factored concepts
referred to in a set of contexts that's systematically segregated from the contexts in which Z is referred to.

All of these features make it harder to see that Z' is in some respects a Doppelgänger of Z. In other words, to the extent these features (and probably others) characterize mental stuff in the mind, the mind is liable to be thinking about coffee cups even when not using [[coffee cup]], e.g. by computing trajectories for moving around objects with such-and-such density and such-and-such shape while carefully avoiding connecting those trajectories with the idea of coffee.
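
As a toy illustration of why a pairwise relatedness check can miss a non-obvious Doppelgänger, here is a minimal sketch. It makes loud assumptions that are not in the text above: ordinary Shannon mutual information stands in for "mutual logical information", the concept Z is a single bit, and the hypothetical "encrypted" Doppelgänger is Z XORed with an independent hidden key bit `k`. All names (`mutual_information`, `worlds`, `relabeled`, `encrypted`, `with_key`) are invented for illustration.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Shannon mutual information, in bits, between the two coordinates
    of a list of equally likely (x, y) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


# Assumed toy world: z is the state of concept Z, k is a hidden key bit.
# All four (z, k) combinations are equally likely.
worlds = list(product([0, 1], repeat=2))

# Blatant Doppelgänger: Z' is just a relabeling of Z.
relabeled = [(z, 1 - z) for z, k in worlds]
# "Encrypted" Doppelgänger: Z' = Z xor k, observed without the key.
encrypted = [(z, z ^ k) for z, k in worlds]
# The same encrypted Z', but read together with the key.
with_key = [(z, (z ^ k, k)) for z, k in worlds]

mi_relabeled = mutual_information(relabeled)
mi_encrypted = mutual_information(encrypted)
mi_with_key = mutual_information(with_key)

print(mi_relabeled, mi_encrypted, mi_with_key)
```

In this toy, the relabeled copy shares a full bit with Z and is easy to flag, while the masked copy shares no information with Z when examined on its own, even though Z is fully recoverable from it once the key is also consulted. This only illustrates the "encrypted" item in the list; it says nothing about how the other features would behave under such a check.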

Examples

See Eliciting Latent Knowledge.
Punishing or lesioning a mind for thinking about X would by default induce the mind to have conceptual Doppelgängers, constructed through the mind's search for nearest unblocked strategies for understanding the world including X. "Can't stop the signal, Mal. Everything goes somewhere, and I go everywhere."
The Abendstern and Morgenstern, before you realize they're the same star, are Doppelgängers.
Suppose that someone had a brain lesion so that they speak as though they are blind, but they're still able to pick up objects without fumbling for them. If they say "I don't see a coffee cup, because I can't see anything. But I know that coffee cups in general hold somewhat less than a pint, and are topologically solid tori.", then they are clearly talking about coffee cups. And if they pick up the coffee cup in front of them, moving their hand directly to the handle, they are probably guiding their hand using information coming into their eyes, and so are in that sense thinking about the coffee cup. And yet, these two understandings are distinct, not fully integrated.
Some examples of failure of internal sharing of elements are conceptual Doppelgängers. E.g. suppose a mind knows how to picture a linear map that undoes another given linear map (x-expand = un-x-contract, y-expand = un-y-contract, rotate clockwise = unrotate counterclockwise, shear left = unshear right), and the mind has also written a computer program that inverts a matrix in order to solve a given linear equation $A v = w$. The mind might not connect those two understandings very closely, e.g. the mind might not notice when the matrix inversion program outputs an answer that's obviously incorrect when thought of in terms of transformations of the plane. Both of these understandings are "about inverting linear transformations".
There's the idea you use to talk about X in press conferences, and the idea you use to make governance decisions related to X. It's possible for these two ideas to be different.
Different words can have very overlapping meanings.
See crosshatch Doppelgängers.
Different generators of understanding that can generate overlapping understanding would produce Doppelgängers. For example, Steven Byrnes suggests that maybe some people with autism end up understanding emotions using general-purpose analysis, which is a Doppelgänger of the understanding that comes from using human-native machinery purpose-built for understanding emotions.
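
The linear-map example above can be made concrete with a minimal sketch. It assumes 2x2 matrices, a hypothetical helper `invert_2x2` standing in for "the matrix inversion program", and a linear system of the form A v = w. The "geometric" understanding is written down separately as the matrices one would expect by picturing the plane; the check at the end is the connection between the two understandings that the mind in the example might never make.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix [[a, b], [c, d]] using the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def apply(m, v):
    """Apply a 2x2 matrix to a vector (x, y)."""
    (a, b), (c, d) = m
    x, y = v
    return (a * x + b * y, c * x + d * y)


def close(m1, m2, tol=1e-12):
    """True if two 2x2 matrices agree entrywise up to tol."""
    return all(
        abs(p - q) <= tol
        for row1, row2 in zip(m1, m2)
        for p, q in zip(row1, row2)
    )


# Transformations of the plane, as matrices acting on column vectors.
x_expand = [[2.0, 0.0], [0.0, 1.0]]         # stretch x by a factor of 2
shear_right = [[1.0, 1.0], [0.0, 1.0]]      # shear to the right
rotate_ccw = [[0.0, -1.0], [1.0, 0.0]]      # quarter turn counterclockwise

# What the geometric picture says the inverses should be.
x_contract = [[0.5, 0.0], [0.0, 1.0]]       # shrink x by a factor of 2
shear_left = [[1.0, -1.0], [0.0, 1.0]]      # shear to the left
rotate_cw = [[0.0, 1.0], [-1.0, 0.0]]       # quarter turn clockwise

# Connecting the two understandings: does the program agree with the picture?
checks = [
    close(invert_2x2(x_expand), x_contract),
    close(invert_2x2(shear_right), shear_left),
    close(invert_2x2(rotate_ccw), rotate_cw),
]

# Solving A v = w by inversion, with A = x_expand and w = (4, 3).
v = apply(invert_2x2(x_expand), (4.0, 3.0))
print(checks, v)
```

If `invert_2x2` had a sign error, one of the checks would fail, which is exactly the kind of mistake that is "obviously incorrect when thought of in terms of transformations of the plane" but invisible if the two understandings are never compared.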

VojtaKovarik (2y):

Fun example: The evolution of offensive words seems relevant here. I.e., we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgängers, and mark them as offensive as well.

Steven Byrnes (3y):

Sorta related (maybe?): I have a (speculative) theory that people have a kind of machinery in their brains for processing the emotions of other people, and that people with autism find it aversive to use that machinery, and so people with autism learn early in life particular habits of thought that reliably avoid activating that machinery at all. But then they learn to analyze and react to the emotions of other people via the general-purpose human ability to learn things. More details here.

TsviBT (3y):

Yeah, that could produce an example of Doppelgängers. E.g. if an autist (in your theory) later starts using that machinery more heavily. Then there's the models coming from the general-purpose analysis, and the models coming from the intuitive machinery, and they're about the same thing.

AI ALIGNMENT FORUM

The conceptual Doppelgänger problem
