Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by

What stood out to me was just how dependent a lot of this was on the training data. Feels like if an AI manages to gain misaligned hidden behaviors during RL stages instead, a lot of this might unravel.

The trick with invoking a "user" persona to make the AI scrutinize itself and reveal its hidden agenda is incredibly fucking amusing. And potentially really really useful? I've been thinking about using this kind of thing in fine-tuning for fine control over AI behavior (specifically "critic/teacher" subpersonas for learning from mistakes in a more natural way), but this is giving me even more ideas.

Can the "subpersona" method be expanded upon? What if we use training data, and possibly a helping of RL, to introduce AI subpersonas with desirable alignment-relevant characteristics on purpose?

Induce a subpersona of HONESTBOT, which never lies and always tells the truth, including about itself and its behaviors. Induce a subpersona of SCRUTINIZER, which can access the thoughts of an AI, and will use this to hunt down and investigate the causes of an AI's deceptive and undesirable behaviors.

Don't invoke those personas during most of the training process - to guard them from as many misalignment-inducing pressures as possible - but invoke them afterwards, to vibe check the AI.