All of StellaAthena's Comments + Replies

I’m not sure when you developed this work, but the LLM.int8 paper identifies outliers as an essential factor in achieving performance for models larger than 2.7B parameters (see Fig. 1 and Fig. 3 especially). There’s also some follow-up work here and here. Very curiously, the GLM-130B paper reports that they don’t see outlier features at all, or the negative effects of their lack of impact.

I’ve spoken a bit about this with Tim (the LLM.int8 lead author) and some people at EleutherAI, and I’m wondering if there’s some kind of explicit or implicit regularizing e...

The distinction between the "large scale era" and the rest of DL looks rather suspicious to me. You don't give a meaningful defense of which points you label "large scale era" in your plot, and it largely looks like you took a handful of the most expensive models each year and gave them a different label.

On what basis can you conclude that Turing NLG, GPT-J, GShard, and Switch Transformers aren't part of the "large scale era"? The fact that they weren't literally the largest models trained that year?

There's also a lot of research that didn't make your analysis, in...

0Jaime Sevilla
It is not feasible to do an exhaustive analysis of all milestone models. We are necessarily missing some important ones, either because we are not aware of them, because they did not provide enough information to deduce the training compute, or because we haven't gotten around to annotating them yet. Our criteria for inclusion are outlined in appendix A. Essentially, they boil down to ML models that have been cited >1000 times, models that have some historical significance, and models that have been deployed in an important context (e.g. something that was deployed as part of the Bing search engine would count). For models from the last two years we were more subjective, since there hasn't been enough time for the more relevant work to stand the test of time. We also excluded 5 models that have abnormally low compute; see figure 4. We tried playing around with the selection of papers that was excluded and it didn't significantly change our conclusions, though obviously the dataset is biased in many ways. Appendix G discusses the possible biases that may have crept in.
0Jaime Sevilla
Great questions! I think it is reasonable to be suspicious of the large-scale distinction. I do stand by it: I think the companies discontinuously increased their training budgets around 2016 for some flagship models.[1] If you mix these models with the regular trend, you might believe that the trend was doubling very fast up until 2017 and then slowed down. It is not an entirely unreasonable interpretation, but it does a worse job of explaining the discontinuous jumps around 2016. Appendix E discusses this in depth. The way we selected the large-scale models is half intuition and half convenience. We compare the log compute of each model to the log compute of nearby papers (within 2 years), and we call it large scale if its log compute exceeds the mean of that sample by more than 0.72 standard deviations. I think there is a reasonable case for including NASv3, Libratus, Megatron-LM, T5-3B, OpenAI Five, Turing NLG, iGPT-XL, GShard (dense), Switch, DALL-E, Pangu-α, ProtT5-XXL and HyperClova on either side of this division. Arguably we should have been more transparent about the effects of choosing a different threshold; we will try to look into this more in the next update of the paper.
[1] See appendix F for a surface discussion.
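The selection rule described in the reply above can be sketched roughly as follows. This is only an illustration of the stated criterion, not the paper's actual code: the function name, the choice of base-10 logs, the use of the sample standard deviation, the decision to keep the model itself in its comparison sample, and the example data are all assumptions made here.

```python
import math
from statistics import mean, stdev

# Hypothetical (name, publication year, training compute in FLOP) records.
# These numbers are made up for illustration and are not from the paper.
HYPOTHETICAL_MODELS = [
    ("model_a", 2016.0, 1e18),
    ("model_b", 2016.5, 1e19),
    ("model_c", 2017.0, 1e20),
    ("model_d", 2017.2, 1e23),
]


def is_large_scale(models, target, window_years=2.0, threshold=0.72):
    """Return True if target's log compute is more than `threshold`
    standard deviations above the mean log compute of nearby models.

    Assumptions (not confirmed by the source): base-10 logs, sample
    standard deviation, and the target is part of its own sample.
    """
    _, year, flop = target
    nearby_logs = [
        math.log10(f) for (_, y, f) in models if abs(y - year) <= window_years
    ]
    if len(nearby_logs) < 2:
        return False  # not enough neighbours to estimate a spread
    sigma = stdev(nearby_logs)
    if sigma == 0:
        return False  # all neighbours have identical compute
    z_score = (math.log10(flop) - mean(nearby_logs)) / sigma
    return z_score > threshold


flagged = [m[0] for m in HYPOTHETICAL_MODELS
           if is_large_scale(HYPOTHETICAL_MODELS, m)]
```

Varying `threshold` in a sketch like this is one way to see how sensitive the large-scale label is to the cutoff, which is the transparency concern raised in the reply.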

If superintelligence is approximately multimodal GPT-17 plus reinforcement learning, then understanding how GPT-3-scale algorithms function is exceptionally important to understanding superintelligence.

Also, if superintelligence doesn’t happen then prosaic alignment is the only kind of alignment.

Due to the redundancy, changing any single weight—that is associated with one of those two pieces of logic—does not change the output.

You seem to be under the impression that the goal is to make the NN robust to single-weight perturbation. But gradient descent doesn’t modify a neural network one weight at a time, and so being robust to single-weight modification doesn’t come with any real guarantees. The backward pass could result in weights of both forks being updated.

1Ofer
Sure, but the gradient component that is associated with a given weight is still zero if updating that weight alone would not affect loss.
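A toy numerical check may help separate the two claims in this exchange. The function below is an assumption chosen for illustration and is not the construction from the quoted post: the output is `w1 * w2`, evaluated at the point where both weights are zero, so changing either weight alone leaves the output unchanged. The sketch estimates each gradient component by central finite differences and then compares single-weight changes with a joint change.

```python
def output(w1, w2):
    # Toy "network": each weight gates the other, so at (0, 0)
    # changing one weight alone cannot change the output.
    return w1 * w2


def loss(w1, w2, target=1.0):
    return (output(w1, w2) - target) ** 2


def partial(f, point, index, eps=1e-6):
    """Central finite-difference estimate of df/dw[index] at `point`."""
    up, down = list(point), list(point)
    up[index] += eps
    down[index] -= eps
    return (f(*up) - f(*down)) / (2 * eps)


origin = (0.0, 0.0)

# Each gradient component is zero, since no single-weight change moves the loss.
grad = [partial(loss, origin, i) for i in range(2)]

# Single-weight changes leave the loss at 1.0 ...
loss_w1_only = loss(0.5, 0.0)
loss_w2_only = loss(0.0, 0.5)

# ... but changing both weights together lowers it to 0.5625.
loss_both = loss(0.5, 0.5)
```

In this toy case the gradient is exactly zero, so a plain gradient step leaves both weights where they are, even though a simultaneous change to both would alter the loss. Insensitivity to single-weight changes and insensitivity to joint changes are therefore different properties.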

I don’t understand what the purported ontological crisis is. If ghosts exist, then I want them to be happy. That doesn’t require a dogmatic belief that there are ghosts at all. In fact, it can even be true when I believe ghosts don’t exist!

4Abram Demski
I mean, that's fair. But what if your belief system justified almost everything ultimately in terms of "making ancestors happy", and relied on a belief that ancestors are still around to be happy/sad? There are several possible responses which a real human might be tempted to make:
* Give up on those values which were justified via ancestor worship, and only pursue the few values which weren't justified that way.
* Value all the same things, just not based on ancestor worship any more.
* Value all the same things, just with a more abstract notion of "making ancestors happy" rather than thinking the ancestors are literally still around.
* Value mostly the same things, but with some updates in places where ancestor worship was really warping your view of what's valuable rather than merely serving as a pleasant justification for what you already think is valuable.
So we can fix the scenario to make a more real ontological crisis. It also bears mentioning -- the reason to be concerned about ontological crisis is, mostly, a worry that almost none of the things we express our values in terms of are "real" in a reductionistic sense. So an AI could possibly view the world through much different concepts and still be predictively accurate. The question then is, what would it mean for such an AI to pursue our values?