User Comment Replies — AI Alignment Forum

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

Cody11mo1321

It also seems to have led to at least one claim in a policy memo that advocates of AI safety are being silly because mechanistic interpretability was solved.

Small nitpick (I agree with almost everything else in the post and am glad you wrote it up): this feels like an unfair criticism. I assume you are referring specifically to the statement in their paper that:

Although advocates for AI safety guidelines often allude to the "black box" nature of AI models, where the logic behind their conclusions is not transparent, recent advancements in the AI sect

...

Neel Nanda10mo2758

+1, I think the correct conclusion is "a16z are making bald-faced lies to major governments", not "a16z were misled by Anthropic hype".

Stephen Casper11mo79

Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements.

Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this.

All of Cody Rushing's Comments + Replies