All of Jared Kaplan's Comments + Replies

Thanks, yeah, I meant that I was interested in a solution that would scale to arbitrarily superhuman AI capabilities at a "mere" capabilities cost (perhaps a very large cost that grows with AI capability, but one that does not bound the ultimate capability of the aligned system). This was a useful clarification for understanding your perspective; I may be wrong, but I could imagine it being useful to lead with this a bit more, i.e., "we don't know of, and would be very interested in, solutions that might be extremely costly but that avoid all counterexamples." Possibly you already say this and I just missed it.

Apologies for a possibly naive comment/question; perhaps this has been discussed elsewhere and you can just direct me there. But anyway...

I would find it helpful to see a strategy that ARC believes does in fact solve ELK, but fails only because it requires an unacceptably large capabilities hit. This would be useful for several reasons:

(1) it would help me to understand what kinds of strategies you believe really do escape counter-examples, 
(2) it would give me a better sense for how optimistic to be about the appr...

Paul Christiano
If you don't care about a capabilities hit, I think the salient strategy is training your model to predict human predictions rather than to predict reality. You can still do science, debate, etc. in order to improve those predictions.

If you care about getting superhuman capabilities (and going beyond recursive schemes, etc.), then I don't know of any easy way to "merely" pay a big capabilities hit. Certainly I don't know how to, e.g., solve the problem in a way that's merely very computationally expensive (and that does sound like it would be major progress towards a solution; I'd guess it would mean you are most of the way there).
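For concreteness, here is a minimal sketch of the "predict human predictions" baseline Paul describes: the predictor is supervised on what humans forecast will happen, rather than on ground-truth observations. All names below (`Predictor`, `human_forecasts`, the architecture and loss) are illustrative assumptions, not code from ARC or this thread.

```python
# Hypothetical sketch: train the predictor on human forecasts of the
# observations, not on the observations themselves. The model's ceiling
# is then the quality of the human forecasts that label the data.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class Predictor(nn.Module):
    def __init__(self, d_in: int, d_obs: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_obs)
        )

    def forward(self, x):
        return self.net(x)

def train_on_human_predictions(model, inputs, human_forecasts,
                               epochs=3, lr=1e-4):
    """Supervise with human-predicted observations.

    `human_forecasts` replaces the ground-truth observations an ordinary
    predictor would be trained on, so the training signal never pushes
    the model to know more than the humans producing the labels.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    loader = DataLoader(TensorDataset(inputs, human_forecasts),
                        batch_size=32, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```

The science+debate step in Paul's comment would act on the labels here (improving `human_forecasts`), not on the training objective, which is why this pays a capabilities cost rather than removing one.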

I think this is an interesting project, and one that (from a very different angle) I’ve spent a bit of time on, so here are a few notes on that, followed by a few suggestions. Stella, in another comment, made several great points that I agree with and that are similar in spirit to my suggestions.

Anyway, based on a fairly similar motivation of wanting to be able to "ask a LM what it's actually thinking/expecting", combined with the general tendency to want to do the simplest and cheapest thing possible first… and then try to make it even simpler still befor...
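The comment is truncated, but the simplest version of "asking a LM what it's actually expecting" can be sketched: compare the model's real next-token distribution on a prompt against what it says when asked in natural language. The choice of `gpt2` and the prompts below are illustrative assumptions only, not the commenter's actual setup.

```python
# Hypothetical sketch: contrast what a LM actually expects (its
# next-token distribution) with what it *says* it expects when asked.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"

# What the model actually expects: its next-token distribution.
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits
probs = logits[0, -1].softmax(dim=-1)
top = torch.topk(probs, k=5)
actual = [(tok.decode(i), p.item()) for i, p in zip(top.indices, top.values)]

# What the model says it expects when asked directly.
question = f'Q: What word comes next after "{prompt}"? A:'
ids = tok(question, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=5, do_sample=False)
stated = tok.decode(out[0, ids["input_ids"].shape[1]:])

print("actual top-5 next tokens:", actual)
print("model's stated answer:", stated)
```

Divergence between the two outputs is the kind of gap such a "cheapest thing first" probe would surface.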

Beth Barnes
I'm sympathetic to the desire to keep things simple, but I actually think that getting good at scalably collecting rich human data is probably the most valuable part of the project. I'd be really excited to see Anthropic either building an excellent internal human data team, or figuring out how to work productively with one of the existing human data provider startups.