Cody Rushing — AI Alignment Forum

It also seems to have led to at least one claim in a policy memo that advocates of AI safety are being silly because mechanistic interpretability was solved.

Small nitpick (I agree with mostly everything else in the post and am glad you wrote it up). This feels like an unfair criticism - I assume you are referring specifically to the statement in their paper that:

Although advocates for AI safety guidelines often allude to the "black box" nature of AI models, where the logic behind their conclusions is not transparent, recent advancements in the AI sector have resolved this issue, thereby ensuring the integrity of open-source code models.

I think Anthropic's interpretability team, while making maybe dubious claims about the impact of their work on safety, has been clear that mechanistic interpretability is far from 'solved.' For instance, Chris Olah in the linked NYT article from today:

“There are lots of other challenges ahead of us, but the thing that seemed scariest no longer seems like a roadblock,” he said.

Also, in the paper's section on Inability to Evaluate:

it's unclear that they're really getting at the fundamental thing we care about

I think they are overstating how far/useful mechanistic interpretability is currently. However, I don't think this messaging is close to 'mechanistic interpretability solves AI Interpretability' - this error is on a16z, not Anthropic.

Four ways learning Econ makes people dumber re: future AI

Cody Rushing4mo-10

Regardless of the whether these anti-pedagogy's are correct, I'm confused about why you think you've shown that learning econ made the economists dumber. It seems like the majority of your tweets you linked, excluding maybe Tweet 4, are actually just the economists discussing narrow AI and failing to consider general intelligence?

If you meant to say something like 'econ pedagogy makes it hard for economists to view AGI as something that could actually be intelligent in a way similar to humans', then I may be more inclined to agree with you.

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

Cody Rushing2y1321

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments