Kay Kozaronek

> Thus, if given the right incentives, it should be "easy" for our AI systems to avoid those kinds of catastrophes: they just need to not do it. To us, this is one of the core reasons for optimism about alignment.

I'm not sure I understand this correctly. Are you saying that one of the main reasons for optimism is that more competent models will be easier to align because we just need to give them "the right incentives"? 

What exactly do you mean by "the right incentives"?

Could you illustrate this with an example?

Thank you for organizing and outlining the learning steps, Neel. I found the inclusion of concrete success criteria very helpful. Would you be able to provide a rough time estimate for each step as well? I believe this would be useful not only to me but to others. In particular, I'd appreciate estimates for the four steps in the "Getting the Fundamentals" part of the curriculum.