Waluigi Effect - AI Alignment Forum

The Waluigi Effect is a phenomenon where Large Language Models (LLMs) and LLM-based chatbots can be easily prompted to switch from one personality to a diametrically-opposite personality.

For example, if a character in LLM-created text (or an LLM chatbot) is talking about how much they love croissants more than any other food, then that character might at some point declare that in fact they hate croissants more than any other food.

Likewise, a hero might declare that they are in fact a villain, or a liberal that they are a conservative, etc....

(Read More)