Reverse-engineering using interpretability — AI Alignment Forum