Last year, we showed that supposedly superhuman Go AIs can be beaten by human amateurs playing specific “cyclic” patterns on the board. Vulnerabilities have previously been observed in a wide variety of sub- or near-human AI systems, but this result demonstrates that even far-superhuman AI systems can fail catastrophically in surprising ways. This lack of robustness poses a critical challenge for AI safety, especially as AI systems are integrated into critical infrastructure or deployed in large-scale applications. We seek to defend Go AIs, in the process developing insights that can make AI applications in various domains more robust against unpredictable threats.

We explored three defense strategies: positional adversarial training on handpicked examples of cyclic patterns, iterated adversarial training against successively fine-tuned adversaries, and replacing convolutional neural networks with vision transformers. We found that both adversarial training methods defend against the original cyclic attack. However, we also found several qualitatively new adversarial strategies that overcome all of these defenses. Nonetheless, finding these new attacks is more challenging than attacking an undefended KataGo, requiring more training compute for the adversary.
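For readers who want a concrete picture of the iterated adversarial training defense, here is a structural sketch in Python. Every name in it (`GoAgent`, `train_adversary`, `sample_exploit_games`, `finetune_victim`) is a hypothetical placeholder for illustration rather than our actual training code; the real pipeline involves self-play game generation, KataGo's training infrastructure, and considerably more machinery.

```python
# Structural sketch of the iterated adversarial training loop. All names
# here (GoAgent, train_adversary, sample_exploit_games, finetune_victim)
# are hypothetical placeholders, not our actual training code.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GoAgent:
    """Placeholder for a Go-playing network plus search (CNN or ViT backbone)."""
    checkpoint: str


def train_adversary(victim: GoAgent, prev_adversary: Optional[GoAgent]) -> GoAgent:
    """Stand-in for adversary training: fine-tune an adversary (warm-started
    from the previous round's adversary, if any) to exploit the current victim."""
    raise NotImplementedError


def sample_exploit_games(adversary: GoAgent, victim: GoAgent) -> List[object]:
    """Stand-in for game generation: collect games in which the adversary
    beats the victim."""
    raise NotImplementedError


def finetune_victim(victim: GoAgent, exploit_games: List[object]) -> GoAgent:
    """Stand-in for victim fine-tuning: train the victim on the exploit games
    so it learns to refute the attack."""
    raise NotImplementedError


def iterated_adversarial_training(victim: GoAgent, rounds: int) -> GoAgent:
    """Alternate attack and defense: each round, re-train the adversary against
    the latest victim, then patch the victim against the resulting exploits."""
    adversary: Optional[GoAgent] = None
    for _ in range(rounds):
        adversary = train_adversary(victim, adversary)
        exploit_games = sample_exploit_games(adversary, victim)
        victim = finetune_victim(victim, exploit_games)
    return victim
```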

For more information, see our blog post, project website, or paper.

1 comment

I speculated last time that convolutions might be inherently vulnerable; it looks like I was wrong, although the ViT attack there does look distinctly different (so maybe in some sense that circle attack was convolutional).

I am not sure what is up with the 'gift' or 'atari' attacks, but the overall appearance of the attacks looks more like a horizon blind-spot problem: they seem to push the collapse out enough steps that the feedforward network doesn't see any issue. And then perhaps it assigns such low values that the tree search never spots the disaster without an extremely large number of visits? Some qualitative work here on where the model+search goes wrong might be worthwhile now that there are multiple good attacks to compare. Are these all fundamentally the same, or do they seem to trigger completely different pathologies?
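To make that horizon hypothesis concrete, here's a toy numerical sketch (in no way KataGo's actual search or value net, and the numbers are made up): if the attack defers the real collapse beyond what the leaf evaluator recognizes, a depth- or visit-limited lookahead keeps reporting the position as safe right up until the search itself reaches the collapse.

```python
# Toy numbers only -- not KataGo's actual value net or search. The point is
# just that if the attack defers the actual collapse beyond the evaluator's
# horizon, a budget-limited lookahead keeps reporting the position as fine.

COLLAPSE_DEPTH = 12  # hypothetical: the game only becomes visibly lost 12 plies ahead


def value_net(plies_ahead: int) -> float:
    """Stand-in for the learned value head: it only recognizes the loss once
    the position is at (or past) the collapse."""
    return -1.0 if plies_ahead >= COLLAPSE_DEPTH else +0.9


def lookahead_value(depth_limit: int) -> float:
    """Stand-in for search: walk down the principal line up to depth_limit
    plies, then ask the value net at the frontier."""
    frontier = min(depth_limit, COLLAPSE_DEPTH)
    return value_net(frontier)


for depth_limit in (4, 8, 16):
    print(f"lookahead {depth_limit:>2} plies -> value {lookahead_value(depth_limit):+.1f}")
# lookahead  4 plies -> value +0.9
# lookahead  8 plies -> value +0.9
# lookahead 16 plies -> value -1.0
```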