Knight Lee

Comments

Do you think my Multi-Agent Framing idea might work against the Waluigi attractor states problem?

Pardon my self-promotion, haha.

It seems like the post is implicitly referring to the next big paper on SAEs from one of these labs, similar in newsworthiness to the last Anthropic paper. A big paper won't be a negative result or a much smaller downstream application, and a big paper would compare its method against baselines if possible, so 165% is still within the ballpark.

I still agree with your comment, especially the recommendation for a time-based prediction (I explained in my other comment here).

Thank you for your alignment work :)

I like your post, and I like how you gave an overview of the big picture of mechanistic interpretability's present and future. That is important.

I agree that it is looking more promising over time, with Golden Gate Claude etc. I also agree that there is some potential for negatives. I can imagine an advanced AI editing itself using these tools, causing its goals to change, causing it to edit itself even more, in a feedback loop that leads to misalignment (this feels unlikely, and a superintelligence would be able to edit itself anyway).

I agree the benefits outweigh the negatives: yes, mechanistic interpretability tools could make AI more capable, but AI will eventually become capable anyway. What matters is whether the first superintelligence is aligned, and in my opinion it's much harder to align a superintelligence if you don't know what's going on inside.

One small detail is defining your predictions better, as Dr. Shah said. It doesn't hurt to convert your prediction to a time-based prediction. Just add a small edit to this post. You can still post an update after the next big paper even if your prediction is time-based.

A prediction based on the next big paper not only depends on unimportant details, like how many papers they spread their contents over, but also fails to depend on important details, like when the next big paper comes out. Suppose I predicted that the next big advancement beyond OpenAI's o1 will be able to get 90% on the GPQA Diamond, but didn't say when it'll happen. I'm not predicting very much in that case, and I can't judge how accurate my prediction was afterwards.

Your last prediction was about the Anthropic report/paper that was already about to be released, so by default you predicted the next paper again. This is very understandable.

Thank you for your alignment work :)