Prospects for Alignment Automation: Interpretability Case Study

Jacob Pfau; Geoffrey Irving

It's really helpful to get such an overview, and it's an impressive body of work!

Building on the Frontier Safety Team's recent work on persuasion, do you see an expansion of human-AI interaction experiments?

Based on Dafoe's work, how is the AGI Safety Team currently thinking about structural risks and corresponding threat models? For instance, cyber and CBRN threats are recognized as misuse cases. Are there evals/red teaming planned for capabilities that could significantly impact, say, nuclear deterrence, especially in scenarios with asymmetric access to s...

AI ALIGNMENT FORUM

All of Tristram Price's Comments + Replies