All of dmz's Comments + Replies

dmz
20

That's right. We did some followup experiments doing the head-to-head comparison: the tools seem to speed up the contractors by 2x for the weak adversarial examples they were finding (and anecdotally speed us up a lot more when we use them to find more egregious failures). See https://www.lesswrong.com/posts/n3LAgnHg6ashQK3fF/takeaways-from-our-robust-injury-classifier-project-redwood#Quick_followup_results; an updated arXiv paper with those experiments is appearing on Monday. 

dmz
10

The main reason is that we think we can learn faster in simpler toy settings for now, so we're doing that first. Implementing all the changes I described (particularly changing the task definition and switching to fine-tuning the generator) would basically mean starting over from scratch anyway.

dmz
20

Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)

dmz
20

I read Oli's comment as referring to the 2.4% -> 0.002% failure rate improvement from filtering.

3Paul Christiano
Ah, that makes sense. But the 26 minutes --> 13 minutes is from adversarial training holding the threshold fixed, right?
dmz
20

Yeah, I think that might have been wise for this project, although the ROC plot suggests that the classifiers don't differ much in performance even at noticeably higher thresholds.

For future projects, I think I'm most excited about confronting the problem directly by building techniques that can succeed in sampling errors even when they're extremely rare.

dmz
100

Excellent question -- I wish we had included more of an answer to this in the post.

I think we made some real progress on the defense side -- but I 100% was hoping for more and agree we have a long way to go.

I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven't actually tried it on the final classifier, but my guess is it takes at least several hours to find a crisp failure unassisted (whereas almost all ML models you can find are trivially breakable). We're interested in people giving it a shot! :)

Pa... (read more)