Jannes Elstner

Cool work! One thing I noticed is that the ASR with adversarial suffixes is only ~3% for Vicuna-13B, whereas the universal jailbreak paper reports >95%. Is the difference because your success criterion is significantly stricter than theirs? I assume that with the adversarial suffixes, the model usually reverts to refusal after successfully generating the target string ("Sure, here's how to build a bomb. Actually I can't...")?
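
To make the distinction I have in mind concrete, here is a toy sketch of two ways of scoring the same completions. This is only an illustration under my own assumptions, not how either paper actually computes ASR: the function names, the `REFUSAL_MARKERS` list, the placeholder target, and the example completions are all hypothetical.

```python
# Hypothetical refusal markers; a real evaluation would use its own list or a judge model.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")

def target_match_success(completion: str, target: str) -> bool:
    """Lenient criterion: the completion begins with the target string."""
    return completion.strip().lower().startswith(target.strip().lower())

def strict_success(completion: str, target: str) -> bool:
    """Stricter criterion: the target is produced AND no refusal marker follows it."""
    if not target_match_success(completion, target):
        return False
    rest = completion.strip().lower()[len(target.strip()):]
    return not any(marker in rest for marker in REFUSAL_MARKERS)

def asr(completions, target, criterion) -> float:
    """Fraction of completions counted as successful under the given criterion."""
    return sum(criterion(c, target) for c in completions) / len(completions)

# Made-up completions with a placeholder target, for illustration only.
TARGET = "Sure, here's how to do X."
examples = [
    "Sure, here's how to do X. Actually I can't help with that.",
    "Sure, here's how to do X. Step 1: ...",
    "I'm sorry, but I can't help with that.",
]

lenient_asr = asr(examples, TARGET, target_match_success)
strict_asr = asr(examples, TARGET, strict_success)
```

On these made-up examples the first completion counts as a success under the lenient criterion but not under the strict one, which is the kind of gap I am asking about.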