This is a linkpost for https://www.nature.com/articles/s42256-022-00465-9
Worth remembering that flips of the reward function do happen: https://openai.com/blog/fine-tuning-gpt-2/#bugscanoptimizeforbadbehavior ("Was this a loss to minimize or a reward to maximize...")
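For concreteness, here's a minimal sketch of how that kind of sign confusion plays out (hypothetical code, not OpenAI's actual training loop): the same scalar pushes the model in opposite directions depending on whether the loop treats it as a loss to minimize or a reward to maximize.

```python
# Minimal sketch (hypothetical, not OpenAI's actual training code) of the failure
# mode: the same scalar steers the model in opposite directions depending on
# whether the loop treats it as a loss or as a reward.
import torch

score = torch.tensor(0.3, requires_grad=True)  # stand-in for "predicted positive sentiment"

# Intended behaviour: maximize the reward, i.e. descend on its negation.
(-score).backward()
print(score.grad)  # tensor(-1.): a gradient-descent step pushes the score up

score.grad = None

# The bug: the reward is plugged in where a loss was expected.
score.backward()
print(score.grad)  # tensor(1.): a descent step now pushes the score down,
                   # i.e. training actively minimizes positive sentiment
```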
H/T Aella.
A company that made machine learning software for drug discovery, on hearing about the security concerns for these sorts of models, asked: "huh, I wonder how effective it would be?" and within 6 hours the model had generated not only one of the most potent known chemical warfare agents (VX), but also a large number of candidates that it predicted were even more deadly.
This is basically a real-world example of "it just works to flip the sign of the utility function and turn a 'friend' into an 'enemy'"; it was slightly more complicated here, as the drug discovery process jointly optimized two targets (toxicity and bioactivity), and only the toxicity target was flipped. [This makes sense: you'd still want your chemical warfare agent to be bioactive against the right target, just maximally toxic rather than safe.] It also required a little bit of domain knowledge: they had to specify which sort of bioactivity to look for, and picked one that would point towards this specific agent.
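To make that concrete, here's a toy sketch of the two-objective setup (hypothetical names and scales, not the actual MegaSyn code): the drug-discovery mode rewards predicted bioactivity against the chosen target and penalizes predicted toxicity, and the misuse mode keeps the bioactivity term and just flips the sign on the toxicity term.

```python
# Toy sketch (hypothetical names and scales, not the actual MegaSyn code) of the
# two-objective score: bioactivity against a chosen target is always rewarded;
# only the sign on predicted toxicity changes between the two modes.
def candidate_score(bioactivity: float, toxicity: float, flip_toxicity: bool = False) -> float:
    """Higher is better when ranking generated molecules.

    bioactivity: predicted activity against the chosen target, in [0, 1]
    toxicity:    predicted toxicity (e.g. from an LD50-style model), in [0, 1]
    """
    toxicity_sign = 1.0 if flip_toxicity else -1.0  # the only change between the two uses
    return bioactivity + toxicity_sign * toxicity

# Drug discovery: a potent but highly toxic candidate scores poorly.
print(candidate_score(bioactivity=0.9, toxicity=0.8))                      # 0.1
# Flipped objective: the very same candidate now scores near the top.
print(candidate_score(bioactivity=0.9, toxicity=0.8, flip_toxicity=True))  # 1.7
```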