The “when no one’s watching” feature
Can we find a “when no one’s watching” feature, or its opposite, “when someone’s watching”? This could be a useful and plausible precursor to deception
With sufficiently powerful systems, this feature may be *very* hard to elicit, as by definition, finding the feature means someone’s watching and requires tricking / honeypotting the model to take it off its own modeling process of the input distribution.
This could be a precursor to deception. One way we could try to make deception harder for the model is to never develop this feature: modify the training data to make it clear that all gradients are recording, and someone’s always watching. This way, the need for the feature will be OOD by design, and it will be harder to develop and use such a feature during training. (We may still end up with effective proxies or correlates, such as “counterfactual no one’s watching,” e.g. when writing a story and assuming as the protagonist, or when feeding 1984 to the context window, etc. Even this proxy might still be a useful feature to locate and study.)
We could also perform gain of function finetuning on true *actual* “unrecorded training and evaluation” sessions to build a model organism with the feature to understand what it may look like, its subtlety, its elicitation and black box discovery potential, and how it hides OOD. Then leverage it for Microscope AI based alignment proposals.
I would offer $100 for anyone that can provide a concrete experiment or evidence to locate such a feature in a frontier open weight model.