I agree with the critiques you make of specific papers (in section 2), but I'm less convinced by your diagnosis that these papers are attempting to manage/combat hype in a misguided way.
IMO, "underclaiming" is ubiquitous in academic papers across many fields -- including fields unrelated to NLP or ML, and fields where there's little to no hype to manage. Why do academics underclaim? Common reasons include:
I suspect 1+2+3 above, rather than hype management, explains the specific mistakes you discuss.
For example, Zhang et al 2020 seems like a case of #2. They cite Jia and Liang as evidence about a problem with earlier models, a problem they are trying to solve with their new method. It would be strange to "manage hype" by saying NLP systems can't do X, and then in the same breath present a new system which you claim does X!
Jang and Lukasiewicz (2021) is also a case of #2, describing a flaw primarily in order to motivate their own proposed fix.
Meanwhile, Xu et al 2020 seems like #3: it's a broad review paper on "adversarial attacks" which gives a brief description of Jia and Liang 2017 alongside brief descriptions of many other results, many of them outside NLP. It's true that the authors should not have used the word "SOTA" here, but it seems more plausible that this is mere sloppiness (they copied older, years-old descriptions of the Jia and Liang result from prior work) than an attempt to push a specific perspective about NLP.
I think a more useful framing might go something like:
An incentive to make the existing SOTA seem as bad as possible, to maximize the gap between it and your own new, sparkly, putatively superior method.
Here's an eyerolling example from yesterday or so: Delphi boasts about their new ethics dataset of n=millions & model which gets 91% vs GPT-3 at a chance level of 52%. Wow, how awful! But wait, we know GPT-3 does better than chance on other datasets like Hendrycks's ETHICS, so how can it do so badly where a much smaller model can do so well?
Oh, it turns out that that's zero-shot with their idiosyncratic format. The abstract just doesn't mention that when they do some basic prompt engineering (no p-tuning or self-distillation or anything) and include a few examples (i.e. a lot fewer than 'millions'), it gets more like... 84%. Oh.
Yeah, this all sounds right, and it's fairly close to the narrative I was using for my previous draft, which had a section on some of these motives.
The best defense I can give of the switch to the hype-centric framing, FWIW:
That said, I agree that this framing is a little bit too charitable, to the point of making implausible implications about some of these authors' motives in some cases, which isn't a good look. I also hadn't thought of the wasted effort point, which seems quite useful here. I'm giving a few talks about this over the next few weeks, and I'll workshop some tweaks to the framing with this in mind.
Some minor feedback points: just from reading the abstract and intro, this could be read as a non-sequitur: "It limits our ability to mitigate short-term harms from NLP deployments". Also, calling something a "short-term" problem doesn't seem necessary, and it may sound like you think the problem is not very important.
Thanks! Tentative rewrite for the next revision:
It harms our credibility in ways that can make it harder to mitigate present-day harms from NLP deployments. It also limits our ability to prepare for the potentially enormous impacts of more distant future advances.
I tried to stick to 'present-day' over 'short-term', but missed this old bit of draft text in the abstract.
The paper makes a slightly odd multi-step argument to try to connect to active debates in the field:
This comment is some quick feedback on those:
Weirdly, this even happens in papers that themselves aim to show positive results involving NNs.
citations to failures in old systems that we've since improved upon significantly.
This might not be a main point, but it could be fleshed out with an explanation of how something like that could be done marginally better. For example, adding:
"As opposed to explaining how that is relevant today, like:
"[Old technique] had [problem]. As [that area] has matured, [problem has been fixed in this way]. However, [slower deployment]/[more humans in the loop]/[other fix] would have reduced [problems]. Using [these fixes], or not making these critical systems (which is risky because _), can help ensure that [this new area], which [has the same problem] and probably will for [time] until it matures, does not have the same problems [old area] did [for length of time]."
But is that actually the right way to minimize the risk of harms? We should expect [that]
We should expect the impacts of these technologies to grow dramatically as they get better
Thanks! (Typo fixed.)
[Old technique] had [problem]...
For this point, I'm not sure how it fits into the argument. Could you say more?
Is there any empirical base...
Yeah, this is a missed opportunity that I haven't had the time/expertise to take on. There probably are comparable situations in the histories of other applied research fields, but I'm not aware of any good analogies. I suspect that a deep dive into some history-and-sociology-of-science literature would be valuable here.
What if the impact grows dramatically as...they get deployed widely? ...
I think this kind of discussion is already well underway within NLP and adjacent subfields like FAccT. I don't have as much to add there.
(Weird meta-note: Are you aware of something unusual about how this comment is posted? I saw a notification for it, but I didn't see it in the comments section for the post itself until initially submitting this reply. I'm newish to posting on Lightcone forums...)
Ah. When you say lightcone forums, what site are you on? What does the URL look like?
For this point, I'm not sure how it fits into the argument. Could you say more?
It's probably a tangent. The idea was:
1) Criticism is great.
2) Explaining how that could be improved is marginally better. (I then explained, for that case*, how citing 'old evidence' or 'old stuff' could still apply to new stuff. It was kind of a niche application of evidence, though. If someone had a good reason for using the old evidence, elaborating on that reason might help.)
*In abstract terms - I didn't have any examples in mind.
I can see the comment at the comment-specific AF permalink here:
https://www.alignmentforum.org/posts/RLHkSBQ7zmTzAjsio/nlp-position-paper-when-combatting-hype-proceed-with-caution?commentId=pSkdAanZQwyT4Xyit#pSkdAanZQwyT4Xyit
...but I can't see it among the comments at the base post URL here.
From my experience with the previous comment, I expect it'll appear at the latter URL once I reply?
[Old technique] had [problem]...
Ah, got it. That makes sense! I'll plan to say a bit more about when/how it makes sense to cite older evidence in cases like this.
Linkpost for https://cims.nyu.edu/~sbowman/bowman2021hype.pdf. To appear on arXiv shortly.
I'm sharing a position paper I put together as an attempt to introduce general NLP researchers to AI risk concerns. From a few discussions at *ACL conferences, it seems like a pretty large majority of active researchers aren't aware of the arguments at all, or at least aren't aware that they have any connection to NLP and large language model work.
The paper makes a slightly odd multi-step argument to try to connect to active debates in the field:
I'll be submitting this to ACL in a month. Comments/criticism welcome, here or privately (bowman@nyu.edu).