Neel Nanda

Sequences

GDM Mech Interp Progress Updates
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level
Interpreting Othello-GPT
200 Concrete Open Problems in Mechanistic Interpretability

Comments

I think the correct question is how much of an update you should make in an absolute sense rather than a relative sense. Many people in this community are overconfident, and if you decide that every person is less worth listening to than you thought, this doesn't change who you listen to, but it should make you a lot more uncertain in your beliefs.

Are the joint names separated by spaces? If not, the tokenization is going to be totally broken. More generally, I would be interested to see this tried with a code that, e.g., maps familiar tokens to obscure ones, or something like mapping the token with id k to id (maximum minus k). Tokens feel like the natural way an LLM would represent its processing, and thus encoded processing. Doing things in individual letters is kind of hard.
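To make the "id k to id maximum minus k" idea concrete, here is a minimal sketch. It assumes "maximum" means the largest valid token id, i.e. `vocab_size - 1`; the function name `reverse_id_cipher` and the vocabulary size in the usage note are hypothetical, and no real tokenizer is involved.

```python
def reverse_id_cipher(token_ids, vocab_size):
    """Map each token id k to (vocab_size - 1) - k.

    Assumption: "maximum" is the largest valid token id, vocab_size - 1.
    The mapping is its own inverse, so the same function decodes.
    """
    max_id = vocab_size - 1
    for k in token_ids:
        if not 0 <= k <= max_id:
            raise ValueError(f"token id {k} is outside [0, {max_id}]")
    return [max_id - k for k in token_ids]
```

For example, with a hypothetical vocabulary of 50,000 tokens, `reverse_id_cipher([0, 1, 49999], 50000)` gives `[49999, 49998, 0]`, and applying the function twice returns the original ids. Whether the resulting ids decode to anything readable depends entirely on the tokenizer.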

Thanks a lot for doing this work. This is substantially more evaluation awareness than I would have predicted. I'm not super convinced by the classifying transcript purpose experiments, since the evaluator model is plausibly primed to think about this stuff, but the monitoring results seem compelling and very concerning. I guess we really get concerned when it stops showing up in the chain of thought...

I would disagree that either one or four was achieved because, to my knowledge, the auditing game focused on finding, not fixing. It's also a bit ambiguous whether the prediction involves fixing with interpretability or fixing with unrelated means. It wouldn't surprise me if you could use the SAE-derived insights to filter out the relevant data if you really wanted to, but I'd guess an LLM classifier is more effective. Did they do that in the paper?

(To be clear, I think it's a great paper, and to the degree that there's a disagreement here it's that I think your predictions weren't covering the right comparative advantages of interp)

This is fantastic work, I'd be very excited to see more work in the vein of auditing games. It seems like one of the best ways so far to test how useful different techniques for understanding models are.

In my incredibly biased opinion, the GDM AGI safety team is great and an effective place to work on reducing AI x-risk, and I would love to get applications from people here

For posterity, this turned out to be a very popular technique for jailbreaking open source LLMs - see this list of the 2000+ "abliterated" models on HuggingFace (abliteration is a mild variant of our technique someone coined shortly after, I think the main difference is that you do a bit of DPO after ablating the refusal direction to fix any issues introduced?). I don't actually know why people prefer abliteration to just finetuning, but empirically people use it, which is good enough for me to call it beating baselines on some metric.
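As a rough illustration of what "ablating the refusal direction" can mean, here is a minimal sketch. It assumes the direction is a single vector in activation space and that ablation means projecting it out of an activation; the function name `ablate_direction` and the toy vectors are hypothetical, and a real implementation would operate on model activations or weights with a tensor library.

```python
import math


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`.

    Computes x - (x . r_hat) * r_hat, where r_hat is the unit vector
    along `direction`. Both inputs are plain lists of floats here.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    proj = sum(a * u for a, u in zip(activation, unit))
    return [a - proj * u for a, u in zip(activation, unit)]
```

For example, `ablate_direction([3.0, 4.0], [1.0, 0.0])` returns `[0.0, 4.0]`: the component along the first axis is removed and the rest is left untouched.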

This looks extremely comprehensive and useful, thanks a lot for writing it! Some of my favourite tips (like clipboard managers and rectangle) were included, which is always a good sign. And I strongly agree with "Cursor/LLM-assisted coding is basically mandatory".

I passed this on to my mentees - not all of this transfers to mech interp, in particular the time between experiments is often much shorter (eg a few minutes, or even seconds) and often almost an entire project is in de-risking mode, but much of it transfers. And the ability to get shit done fast is super important

I don't know much about CTF specifically, but based on my maths exam/olympiad experience I predict that there's a lot of tricks to go fast (common question archetypes, saved code snippets, etc) that will be top of mind for people actively practicing, but not for someone with a lot of domain expertise who doesn't explicitly practice CTF. I also don't know how important speed is for being a successful cyber professional. They might be able to get some of this speed up with a bit of practice, but I predict by default there's a lot of room for improvement.

Thanks a lot for sharing all this code and data, seems super useful for external replication and follow-on work. It might be good to link this post from the GitHub readme - I initially found the GitHub repo via the paper, but not this post, and I found the exposition in this post more helpful than the current readme.
