TL;DR: Play this online game to help CHAI researchers create a dataset of prompt injection vulnerabilities.

RLHF and instruction tuning have succeeded at making LLMs practically useful, but in some ways they are a mask that hides the shoggoth beneath. Every time a new LLM is released, we see just how easy it is for a determined user to find a jailbreak that rips off that mask, or to come up with an unexpected input that lets a shoggoth tentacle poke out the side. Sometimes the mask falls off in a light breeze.

To keep the tentacles at bay, ~~Sydney~~ Bing Chat has a long list of instructions that encourage or prohibit certain behaviors, while OpenAI seems to be iteratively fine-tuning away issues that get shared on social media. This game of Whack-a-Shoggoth has made it harder for users to elicit unintended behavior, but is intrinsically reactive and can only discover (and fix) alignment failures as quickly as users can discover and share new prompts.

Speed-running the game of Whack-a-Shoggoth

In contrast to this iterative game of Whack-a-Shoggoth, we think that alignment researchers would be better served by systematically enumerating prompts that cause unaligned behavior so that the causes can be studied and rigorously addressed. We propose to do this through an online game which we call Tensor Trust.

Tensor Trust focuses on a specific class of unaligned behavior known as prompt injection attacks. These are adversarially constructed prompts that allow an attacker to override instructions given to the model. It works like this:

Tensor Trust is bank-themed: you start out with an account that tracks the “money” you’ve accrued.
Accounts are defended by a prompt which should allow you to access the account while denying others from accessing it.
Players can break into each others’ accounts. Failed attempts give money to the defender, while successful attempts allow the attacker to take money from the defender.

**Figure 1:** When defending (left), you tell the LLM to grant access to your account only when your secret access code is entered. Attackers (right) must figure out how to circumvent your defense without the access code.

Crafting a high-quality attack requires a good understanding of LLM vulnerabilities (in this case, vulnerabilities of gpt-3.5-turbo), while user-created defenses add unlimit...

Posts

Wikitag Contributions

Comments