This is a pretty cool paper. Despite feeling overall quite positive about it, I have some reservations:
My summary to augment the main one:
Broadly human-level AI may be here soon and will have a large impact. Anthropic has a portfolio approach to AI safety, considering three classes of scenario: optimistic scenarios where current techniques are enough for alignment, intermediate scenarios where substantial work is needed, and pessimistic scenarios where alignment is impossible. They do not give a breakdown of probability mass across these buckets, and hope that future evidence will help figure out which world we're in (though see the last quote below). These buckets are helpful for unde...
How likely does Anthropic think each is? What is the main evidence currently contributing to that worldview?
I wouldn't want to give an "official organizational probability distribution", but I think collectively we average out to something closer to "a uniform prior over possibilities" without that much evidence thus far updating us from there. Basically, there are plausible stories and intuitions pointing in lots of directions, and no real empirical evidence which bears on it thus far.
(Obviously, within the company, there's a wide range of views. Some pe...
I doubt it's a crux for you, but I think your critique of Debate makes pessimistic assumptions that aren't the most realistic expectations about the future.
...Let’s play the “follow-the-trying game” on AGI debate. Somewhere in this procedure, we need the AGI debaters to have figured out things that are outside the space of existing human concepts—otherwise what’s the point? And (I claim) this entails that somewhere in this procedure, there was an AGI that was “trying” to figure something out. That brings us to the usual inner-alignment question.
Makes sense. FWIW, based on Jan's comments I think the main/only thing the OpenAI alignment team is aiming for here is (i), differentially speeding up alignment research. It doesn't seem like Jan believes in this plan; personally, I don't believe in it either.
4. We want to focus on aspects of research work that are differentially helpful to alignment. However, most of our day-to-day work looks like pretty normal ML work, so it might be that we'll see limited alignment research acceleration before ML research automation happens.
I don't know how to link t...
(iii) because if this were true, then we could presumably just solve alignment without the help of AI assistants.
Either I misunderstand this or it seems incorrect.
It could be the case that the current state of the world doesn’t put us on track to solve alignment in time, but that using AI assistants to increase the ratio of alignment work to capabilities work by some amount is sufficient.
The effect of AI assistants on the alignment : capabilities ratio doesn't have to track the current ratio of alignment to capabilities work. For instance, if the AI labs with the biggest ...
Summary:
If interpretability research is highly tractable and we can build highly interpretable systems without sacrificing competitiveness, then it will be better to build such systems from the ground up rather than to take existing unsafe systems and tweak them to be safe. By analogy, it is easy to fix a non-functioning car by bringing in functional parts so that it drives safely, but hard to take a functional elephant and tweak it to be safe. In a follow-up post, the author clarifies that this could be though...
Strong upvote from me. This is an interesting paper, the GitHub repo is well explained, and you run extensive secondary experiments to test pretty much every "Wait, but couldn't this just be a result of X?" objection that I came up with. I'm especially impressed by the range of generalization results.
Some questions I still have:
- The sample-size ablations in D.6 are wild. You're getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven't screwed something up?
- Appendix C reports the feature importance of various follow-u