User Comment Replies — AI Alignment Forum

High-stakes alignment via adversarial training [Redwood Research report]

2y20

That's right. We ran some follow-up experiments for a head-to-head comparison: the tools seem to speed up the contractors by 2x on the weak adversarial examples they were finding (and, anecdotally, speed us up a lot more when we use them to find more egregious failures). See https://www.lesswrong.com/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood#Quick_followup_results; an updated arXiv paper with those experiments is appearing on Monday.

Takeaways from our robust injury classifier project [Redwood Research]

dmz

3y10

The main reason is that we think we can learn faster in simpler toy settings for now, so we're doing that first. Implementing all the changes I described (particularly changing the task definition and switching to fine-tuning the generator) would basically mean starting over from scratch anyway.

High-stakes alignment via adversarial training [Redwood Research report]

dmz

3y20

Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)

High-stakes alignment via adversarial training [Redwood Research report]

dmz

3y20

I read Oli's comment as referring to the 2.4% -> 0.002% failure rate improvement from filtering.

3Paul Christiano3y

Ah, that makes sense. But the 26 minutes --> 13 minutes is from adversarial training holding the threshold fixed, right?

High-stakes alignment via adversarial training [Redwood Research report]

dmz

3y20

Yeah, I think that might have been wise for this project, although the ROC plot suggests that the classifiers don't differ much in performance even at noticeably higher thresholds.

For future projects, I think I'm most excited about confronting the problem directly by building techniques that can succeed in sampling errors even when they're extremely rare.

High-stakes alignment via adversarial training [Redwood Research report]

dmz

3y100

Excellent question -- I wish we had included more of an answer to this in the post.

I think we made some real progress on the defense side -- but I 100% was hoping for more and agree we have a long way to go.

I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven't actually tried it on the final classifier, but my guess is it takes at least several hours to find a crisp failure unassisted (whereas almost all ML models you can find are trivially breakable). We're interested in people giving it a shot! :)

Pa...

All of dmz's Comments + Replies