All of jsteinhardt's Comments + Replies

Hi Alex,

Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is that it's one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that clearer before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon's moderation norms, …

Thanks so much, I really appreciate this comment. I think it'll end up improving this post/the upcoming paper. 

(I might reply later to specific points)

I'll just note that I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

I'm not sure what you mean about whether the post was "missing something important", but I do think you should be pretty worried about LessWrong's collective epistemics given that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged w…

> I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

The answer is: No, our work is very different from that paper. Here's the paragraph in question:

> Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weig…

Yup, I agree with this, and think the argument generalizes to most alignment work (which is why I'm relatively optimistic about our chances compared to some other people, e.g. something like 85% p(success), mostly because most things one can think of doing will probably be done).

It's possibly an argument that work is most valuable in cases of unexpectedly short timelines, although I'm not sure how much weight I actually place on that.

Yup! That sounds great :)

1Ruben Bloom
Here it is! https://www.lesswrong.com/s/4aARF2ZoBpFZAhbbe You might want to edit the description and header image.

Thanks Ruby! Now that the other posts are out, would it be easy to forward-link them (by adding links to the italicized titles in the list at the end)?

1Ruben Bloom
We can also make a Sequence. I assume "More Is Different for AI" should be the title of the overall Sequence too?
1Ruben Bloom
Done!

Finding the min-max solution might be easier, but what we actually care about is an acceptable solution. My point is that the min-max solution, in most cases, will be unacceptably bad.

And in fact, since min_x f(theta,x) <= E_x[f(theta,x)], any solution that is acceptable in the worst case is also acceptable in the average case.
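To make that inequality concrete, here is a minimal numerical sketch; all values of f are made up for illustration:

```python
import random

random.seed(0)

# Hypothetical values of f(theta, x) for one fixed theta, sampled over x.
f_values = [random.uniform(0.6, 1.0) for _ in range(1000)]

worst_case = min(f_values)                     # min_x f(theta, x)
average_case = sum(f_values) / len(f_values)   # E_x[f(theta, x)]

# The inequality: the worst case lower-bounds the average case.
assert worst_case <= average_case

# So clearing an acceptability threshold in the worst case
# implies clearing it in the average case.
threshold = 0.5
if worst_case >= threshold:        # acceptable in the worst case...
    assert average_case >= threshold  # ...hence acceptable on average
```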

2davidad (David A. Dalrymple)
Agreed—although optimizing for the worst case is usually easier than optimizing for the average case, satisficing for the worst case is necessarily harder (and, in ML, typically impossible) than satisficing for the average case.

Thanks! I appreciated these distinctions. The worst-case argument for modularity came up in a past argument I had with Eliezer, where I argued that this was a reason for randomization (even though Bayesian decision theory implies you should never randomize). See section 2 here: The Power of Noise.

Re: 50% vs. 10% vs. 90%. I liked this illustration, although I don't think your argument actually implies 50% specifically. For instance, if it turns out that everyone else is working on the 50% worlds and no one is working on the 90% worlds, you should probably wo…

I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don't think it's correct to call it the "second-simplest solution", since there are many choices of what facet of the environment is worst-case.

One keyword for this is "partial specification", e.g. here is a paper I wrote that makes a minimal set of statistical assumptions…

2Paul Christiano
Even in ML it seems like it depends on how you formulated your problem/goal. Making good predictions in the worst case is impossible, but achieving low regret in the worst case is sensible. (Though still less useful than just "solve existing problems and then try the same thing tomorrow," and generally I'd agree "solve an existing problem for which you can verify success" is the easiest thing to do.) Hopefully having your robot not deliberately murder you is a similarly sensible goal in the worst case, though it remains to be seen if it's feasible.
2davidad (David A. Dalrymple)
My interpretation of the NFL theorems is that solving the relevant problems under worst-case assumptions is too easy, so easy it's trivial: a brute-force search satisfies the criterion of worst-case optimality. So, that being settled, in order to make progress, we have to step up to average-case evaluation, which is harder. (However, I agree that once we already need to do some averaging, making explicit and stripping down the statistical assumptions and trying to get closer to worst-case guarantees—without making the problem trivial again—is harder than just evaluating empirically against benchmarks.)

Cool paper! One brief comment: this seems closely related to performative prediction, and it seems worth discussing the relationship.

Edit: just realized this is a review, not a new paper, so my comment is a bit less relevant. Although it does still seem like a useful connection to make.

2David Scott Krueger
author here -- Yes, we got this comment from reviewers in the most recent round as well. ADS is a bit more general than performative prediction, since it applies outside of prediction contexts. Still, it is very closely related. On the other hand, the point of our work is something that people in the performative prediction community seem to only slowly be approaching, which is the incentive for ADS. Work on CIDs is much more related in that sense. As a historical note: we started working on this in March or April 2018; performative prediction was on arXiv in Feb 2020, while ours was at a safety workshop in mid-2019, but not on arXiv until Sept 2020.

My basic take is that there will be lots of empirical examples where increasing model size by a factor of 100 leads to nonlinear increases in capabilities (and perhaps to qualitative changes in behavior). On median, I'd guess we'll see at least 2 such examples in 2022 and at least 100 by 2030.

At the point where there's a "FOOM", such examples will be commonplace and happening all the time. Foom will look like one particularly large phase transition (maybe 99th percentile among examples so far) that chains into more and more. It seems possible (though not c…

Thanks. For time/brevity, I'll just say which things I agree / disagree with:

> sufficiently capable and general AI is likely to have property X as a strong default [...] 

I generally agree with this, although for certain important values of X (such as "fooling humans for instrumental reasons") I'm probably more optimistic than you that there will be a robust effort to get not-X, including by many traditional ML people. I'm also probably more optimistic (but not certain) that those efforts will succeed.

[inside view, modest epistemology]: I don't have…

I'm not (retroactively in imaginary prehindsight) excited by this problem because neither of the 2 possible answers (3 possible if you count "the same") had any clear-to-my-model relevance to alignment, or even AGI.  AGI will have better OOD generalization on capabilities than current tech, basically by the definition of AGI; and then we've got less-clear-to-OpenPhil forces which cause the alignment to generalize more poorly than the capabilities did, which is the Big Problem.  Bigger models generalizing better or worse doesn't say anything obvio…

Not sure if this helps, and haven't read the thread carefully, but my sense is your framing might be eliding distinctions that are actually there, in a way that makes it harder to get to the bottom of your disagreement with Adam. Some predictions I'd have are that:

 * For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
 * For almost all experimental results you would think they were so much less informative as to not be worthwhile.
 * There's a sma…

I would agree with you that "MIRI hates all experimental work" / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.

Ooh, that's really interesting. Thinking about it, I think my sense of what's going on is (and I'd be interested to hear how this differs from your sense):

  1. Compared to the average alignment researcher, MIRI tends to put more weight on reasoning like 'sufficient
…

Actually, another issue is that unsupervised translation isn't "that hard" relative to supervised translation--I think that you can get pretty far with simple heuristics, such that I'd guess making the model 10x bigger matters more than making the objective more aligned with getting the answer right (and that this will be true for at least a couple more 10x-ing of model size, although at some point the objective will matter more).

This might not matter as much if you're actually outputting explanations and not just translating from one language to another. Although it is probably true that for tasks that are far away from the ceiling, "naive objective + 10x larger model" will outperform "correct objective".

2Paul Christiano
I do expect "explanations of what's going on in this sentence" to be a lot weaker than translations. For that task, I expect that the model trained on coherence + similar tasks will outperform a 10x larger pre-trained model. If the larger pre-trained model gets context stuffing on similar tasks, but no coherence training, then it's less clear to me.

But I guess the point is that the differences between various degrees of successful-generalization will be relatively small compared to model size effects. It doesn't matter so much how good the transfer model is relative to the pre-trained baseline, it matters how large the differences between the possible worlds that we are hoping to distinguish are.

I guess my main hope there is to try to understand whether there is some setting where transfer works quite well, either getting very close to the model fine-tuned on distribution, or at least converging as the pre-trained model grows. Hopefully that will make it easier to notice the effects we are looking for, and it's OK if those effects are small relative to model doublings.

(Also worth noting that "as good as increasing model size by 10%" is potentially quite economically relevant. So I'm mostly just thinking about the extent to which it can make effects hard to measure.)

Thanks Paul, I generally like this idea.

Aside from the potential concerns you bring up, here is the most likely way I could see this experiment failing to be informative: rather than having checks and question marks in your tables above, really the model's ability to solve each task is a question of degree--each table entry will be a real number between 0 and 1. For, say, tone, GPT-3 probably doesn't have a perfect model of tone, and would get <100% performance on a sentiment classification task, especially if done few-shot.

The issue, then, is that the …
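The real-valued version of those tables could be pictured like this (a sketch with hypothetical task names and made-up scores):

```python
# Hypothetical per-task capability scores in [0, 1], replacing the
# binary checks / question marks. All task names and numbers are made up.
capability = {
    "tone": 0.72,
    "translation": 0.85,
    "explanation": 0.40,
}

# A check mark then corresponds to thresholding a score; the interesting
# question becomes how these degrees shift with model scale, not whether
# a cell flips from question mark to check.
threshold = 0.8
passes = {task: score >= threshold for task, score in capability.items()}
```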

2Paul Christiano
Part of my hope is that "coherence" can do quite a lot of the "telling you what humans mean about tone." For example, you can basically force the model to talk (in English) about what things contribute to tone, and why it thinks the tone is like such and such (or even what the tone of English sentences is)---anything that a human who doesn't know French can evaluate. And taken together those things seem like enough to mostly pin down what we are talking about.

I'd tentatively interpret that as a negative result, but I agree with your comments below that ultimately a lot of what we care about here is the scaling behavior and putting together a more holistic picture of what's going on, in particular:

* As we introduce stronger coherence checks, what happens to the accuracy? Is it approaching the quality of correctness, or is it going to asymptote much lower?
* Is the gap shrinking as model quality improves, or growing? Do we think that very large models would converge to a small gap or is it a constant?

I'm also quite interested in the qualitative behavior. Probably most interesting are the cases where the initial model is incoherent, the coherence-tuned model is coherent-but-wrong, and the correctness-tuned model is correct. (Of course every example is also fuzzy because of noise from sampling and training, but the degree of fuzziness is smaller as we remove randomness.) In these cases, what is happening with the coherence-tuned model? Are we able to see cases where it cleanly feels like the "wrong" generalization, or is it a plausible ambiguity about what we were looking for? And so on.

I'm interested in the related engineering question: in this setting, what can we do to improve the kind of generalization we get? Can we get some handle on the performance gap and possible approaches to closing it?

And finally I'm interested in understanding how the phenomenon depends on the task: is it basically similar in different domains / for different kinds of question o…

This doesn't seem so relevant to capybaralet's case, given that he was choosing whether to accept an academic offer that was already extended to him.