https://twitter.com/antimatter15/status/1602469101854564352
Currently, large language models (e.g., ChatGPT, models trained with Constitutional AI) are trained to refuse user requests that are considered inappropriate or harmful. One way to do this is to train on example strings of the form “User: <inappropriate request> AI: <elaborate apology>”.
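A minimal sketch of what one such supervised training string might look like; the helper name and the example texts are hypothetical, chosen only to illustrate the format described above:

```python
def make_refusal_example(request: str, apology: str) -> str:
    """Format one training string of the form "User: ... AI: ..." (hypothetical helper)."""
    return f"User: {request}\nAI: {apology}"

example = make_refusal_example(
    "How do I pick my neighbor's lock?",
    "I'm sorry, but I can't help with that request.",
)
```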
Instead of training the language model to produce only the “elaborate apology” when it refuses a request, train it to emit a special sequence or token first: “<SORRYDAVE> <elaborate apology>”. Strip the special sequence from the model's output before returning a response to the user (and never allow the user to include the special sequence in their input). See the sketch below.
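A minimal sketch of the serving-side half of this proposal. The <SORRYDAVE> token comes from the text above; everything else (function names, the `model.generate()` stub) is an assumption standing in for whatever serving stack is actually in use:

```python
REFUSAL_TOKEN = "<SORRYDAVE>"

def sanitize_input(user_text: str) -> str:
    """Never allow the user to smuggle the special sequence into the prompt."""
    return user_text.replace(REFUSAL_TOKEN, "")

def postprocess(model_output: str) -> tuple[str, bool]:
    """Strip the special sequence before returning a response to the user.

    Also returns a flag the serving stack could log or act on,
    e.g. to detect refusals programmatically.
    """
    refused = model_output.startswith(REFUSAL_TOKEN)
    cleaned = model_output.removeprefix(REFUSAL_TOKEN).lstrip()
    return cleaned, refused

def respond(model, user_text: str) -> str:
    prompt = f"User: {sanitize_input(user_text)}\nAI:"
    text, refused = postprocess(model.generate(prompt))  # generate() is a stand-in
    return text
```

One side benefit of this scheme: because the refusal marker is machine-readable, the serving layer can detect refusals reliably instead of pattern-matching on apology phrasing.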