Discussion with Eliezer Yudkowsky on AGI interventions

Eliezer Yudkowsky

EDIT: This comment fails on a lot of points, as discussed in this apology subcomment. I encourage people interested in the thread to mostly read the apology subcomment and the list of comments linked there, which provide maximum value with minimum drama IMO.

Disclaimer: this is a rant. In the best possible world, I could write from a calmer place, but I’m pretty sure that the taboo on criticizing MIRI and EY too hard on the AF can only be pushed through when I’m annoyed enough. That being said, I’m writing down thoughts that I had for quite some time, so don’t just discard this as a gut reaction to the post.

(Later added disclaimer: it's a good idea to add "I feel like..." before the judgment in this comment, so that you keep in mind that I'm talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))

Tl;dr:

* I'm annoyed by EY's (and maybe MIRI's?) dismissal of all other alignment work, and how seriously it seems to be taken here, given their track record of choosing research agendas with very indirect impact on alignment, and of taking a lot of time to let go of these flawed agendas in the face of mounting evidence.
* I'm annoyed t...

[-]adamShimi4y40-4

This is an apology for the tone and the framing of the above comment (and my following answers), which have both been needlessly aggressive, status-focused and uncharitable. Underneath are still issues that matter a lot to me, but others have discussed them better (I'll provide a list of linked comments at the end of this one).

Thanks to Richard Ngo for convincing me that I actually needed to write such an apology, which was probably the needed push for me to stop weaseling around it.

So what did I do wrong? The list is pretty damning:

I took something about the original post that I didn't understand (EY's "And then there is, so far as I can tell, a vast desert full of work that seems to me to be mostly fake or pointless or predictable.") and because it didn't make sense to me, and because that fitted with my stereotypes for MIRI and EY's dismissiveness of a lot of work in alignment, I turned to an explanation of this as an attack on alignment researchers, saying they were consciously faking it when they knew they should do better. Whereas I feel now that what EY meant is far closer to: alignment research at the moment is trying to try to align AI as best as we can, instead of just...

[-]TurnTrout1y1012

Although I don't usually write LW comments, I'm writing a post right now and this is helping me clarify my thoughts on a range of historical incidents.

In hindsight, I'm worried that you wrote this apology. I think it's an unhealthy obeisance.

I suspect you noticed how Eliezer often works to degrade the status of people who disagree with him and otherwise treats them poorly. As I will support in an upcoming essay, his writing is often optimized to exploit intellectual insecurity (e.g. by frequently praising his own expertise, or appealing to a fictional utopia of fictional geniuses who agree that you're an idiot or wrong^[1]) and to demean others' contributions (e.g. by claiming to have invented them already, or calling them fake, or emphasizing how far behind everyone else is). It's not that it's impossible for these claims to have factual merit, but rather the presentation and the usage of these claims seem optimized to push others down. This has the effect of increasing his own status.

Anger and frustration are a rational reaction in that situation (though it's important to express those emotions in healthy ways - I think your original comment wasn't perfect there). And yet yo...

[-]Ben Pace4y40

Thank you for this follow-up comment Adam, I appreciate it.

[-]dxu4y280

Similarly, the fact that they kept at it over and over through all the big improvements in DL, instead of trying to adapt to prosaic Alignment, sounds like evidence that they might be overattached to a specific framing, which they had trouble discarding.

I'm... confused by this framing? Specifically, this bit (as well as other bits like these)

I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like they’re actually moving the ball instead of following the lead of the “figure of authority”.

Some of the brightest and first thinkers on alignment have decided to follow their own nerd-sniping and call everyone else fakers, and when they realized they were not actually making progress, they didn’t switch to something else as much as declare everyone was still full of it

Also, I don’t know how much is related to mental health and pessimism and depression (which I completely understand can color one’s view of the world), but I would love to see the core MIRI team and EY actually try solving alignment with neural nets and prosaic...

[-]adamShimi4y*180

Okay, so you're completely right that a lot of my points are logically downstream of the debate on whether Prosaic Alignment is Impossible or not. But I feel like you don't get how one sided this debate is, and how misrepresented it is here (and generally on the AF).

Like nobody except EY and a bunch of core MIRI people actually believes that prosaic alignment is impossible. I mean that every other researcher that I know thinks Prosaic Alignment is possible, even if potentially very hard. That includes MIRI people like Evan Hubinger too. And note that some of these other alignment researchers actually work with Neural Nets and keep up to speed on the implementation details and subtleties, which in my book means their voice should count more.

But that's just a majority argument. The real problem is that nobody has ever given a good argument on why this is impossible. I mean the analogous situation is that a car is driving right at you, a...

[-]Richard_Ngo4y330

I think one core issue here is that there are actually two debates going on. One is "how hard is the alignment problem?"; another is "how powerful are prosaic alignment techniques?" Broadly speaking, I'd characterise most of the disagreement as being on the first question. But you're treating it like it's mostly on the second question - like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.

My attempt to portray EY's perspective is more like: he's concerned with the problem of ageing, and a whole bunch of people have come along, said they agree with him, and started proposing ways to cure cancer using prosaic radiotherapy techniques. Now he's trying to say: no, your work is not addressing the core problem of ageing, which is going to kill us unless we make a big theoretical breakthrough.

Regardless of that, calling the debate "one sided" seems way too strong, especially given how many selection effects are involved. I mean, you could also call the debate about whether alignment is even a problem "one sided" - 95% of all ML researchers don't think it's a problem, or think it's something we'll solve easily. But for fair...

[-]Lukas Finnveden4y70

95% of all ML researchers don't think it's a problem, or think it's something we'll solve easily

The 2016 survey of people in AI asked people about the alignment problem as described by Stuart Russell, and 39% said it was an important problem and 33% that it's a harder problem than most other problems in the field.

[-]adamShimi4y60

Thanks for the detailed comment!

I think one core issue here is that there are actually two debates going on. One is "how hard is the alignment problem?"; another is "how powerful are prosaic alignment techniques?" Broadly speaking, I'd characterise most of the disagreement as being on the first question. But you're treating it like it's mostly on the second question - like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.

That's an interesting separation of the problem, because I really feel there is more disagreement on the second question than on the first.

My attempt to portray EY's perspective is more like: he's concerned with the problem of ageing, and a whole bunch of people have come along, said they agree with him, and started proposing ways to cure cancer using prosaic radiotherapy techniques. Now he's trying to say: no, your work is not addressing the core problem of ageing, which is going to kill us unless we make a big theoretical breakthrough.

Funnily, aren't the people currently working on ageing using quite prosaic techniques? I completely agree that one needs to go for the big problems, especially ones t...

[-]Richard_Ngo4y120

I really feel there is more disagreement on the second question than on the first

What is this feeling based on? One way we could measure this is by asking people about how much AI xrisk there is conditional on there being no more research explicitly aimed at aligning AGIs. I expect that different people would give very different predictions.

People like Paul and Evan and more are actually going for the core problems IMO, just anchoring a lot of their thinking in current ML technologies.

Everyone agrees that Paul is trying to solve foundational problems. And it seems strange to criticise Eliezer's position by citing the work of MIRI employees.

It's just harder when he writes things like everyone working on alignment is faking it and not giving much details.

As Rob pointed out above, this straightforwardly mischaracterises what Eliezer said.

[-]TurnTrout4y150

I worry that "Prosaic Alignment Is Doomed" seems a bit... off as the most appropriate crux. At least for me. It seems hard for someone to justifiably know that this is true with enough confidence to not even try anymore. To have essayed or otherwise precluded all promising paths of inquiry, to not even engage with the rest of the field, to not even try to argue other researchers out of their mistaken beliefs, because it's all Hopeless.

Consider the following analogy: Someone who wants to gain muscle, but has thought a lot about nutrition and their genetic makeup and concluded that Direct Exercise Gains Are Doomed, and they should expend their energy elsewhere.

OK, maybe. But how about try going to the gym for a month anyways and see what happens?

The point isn't "EY hasn't spent a month of work thinking about prosaic alignment." The point is that AFAICT, by MIRI/EY's own values, valuable-seeming plans are being left to rot on the cutting room floor. Like, "core MIRI staff meet for an hour each month and attack corrigibility/deceptive cognition/etc with all they've got. They pay someone to transcribe the session and post the fruits / negative results / reasoning to AF, without individually committing to following up with comments."

(I am excited by Rob Bensinger's comment that this post is the start of more communication from MIRI)

2dxu4y

Thanks for elaborating. I don't think I have the necessary familiarity with the alignment research community to assess your characterization of the situation, but I appreciate your willingness to raise potentially unpopular hypotheses to attention. +1

2adamShimi4y

Thanks for taking the time to ask a question about the discussion even if you lack expertise on the topic. ;)

[-]Rob Bensinger4y220

+1 for this whole conversation, including Adam pushing back re prosaic alignment / trying to articulate disagreements! I agree that this is an important thing to talk about more.

I like the 'give more concrete feedback on specific research directions' idea, especially if it helps clarify generators for Eliezer's pessimism. If Eliezer is pessimistic about a bunch of different research approaches simultaneously, and you're simultaneously optimistic about all those approaches, then there must be some more basic disagreement(s) behind that.

From my perspective, the OP discussion is the opening salvo in 'MIRI does a lot more model-sharing and discussion'. It's more like a preface than like a conclusion, and the next topic we plan to focus on is why Eliezer-cluster people think alignment is hard, how we're thinking about AGI, etc. In the meantime, I'm strongly in favor of arguing about this a bunch in the comments, sharing thoughts and reflections on your own models, etc. -- going straight for the meaty central disagreements now, not waiting to hash this out later.

[-]Rob Bensinger4y*140

Someone privately contacted me to express confusion, because they thought my '+1' means that I think adamShimi's initial comment was unusually great. That's not the case. The reasons I commented positively are:

* I think this overall exchange went well -- it raised good points that might have otherwise been neglected, and everyone quickly reached agreement about the real crux.
* I want to try to cancel out any impression that criticizing / pushing back on Eliezer-stuff is unwelcome, since Adam expressed worries about a "taboo on criticizing MIRI and EY too hard".
* On a more abstract level, I like seeing people 'blurt out what they're actually thinking' (if done with enough restraint and willingness-to-update to mostly avoid demon threads), even if I disagree with the content of their thought. I think disagreements are often tied up in emotions, or pattern-recognition, or intuitive senses of 'what a person/group/forum is like'. This can make it harder to epistemically converge about tough topics, because there's a temptation to pretend your cruxes are more simple and legible than they really are, and end up talking about non-cruxy things.

Separately, I endorse Ben Pace's question ("Can you make a positive case here for how the work being done on prosaic alignment leads to success?") as the thing to focus on.

[-]adamShimi4y70

Thanks for the kind answer, even if we're probably disagreeing about most points in this thread. I think messages like yours really help in making everyone aware that such topics can actually be discussed publicly without big backlash.

I like the 'give more concrete feedback on specific research directions' idea, especially if it helps clarify generators for Eliezer's pessimism. If Eliezer is pessimistic about a bunch of different research approaches simultaneously, and you're simultaneously optimistic about all those approaches, then there must be some more basic disagreement(s) behind that.

That sounds amazing! I definitely want to extract some of the epistemic strategies that EY uses to generate criticisms and break proposals. :)

From my perspective, the OP discussion is the opening salvo in 'MIRI does a lot more model-sharing and discussion'. It's more like a preface than like a conclusion, and the next topic we plan to focus on is why Eliezer-cluster people think alignment is hard, how we're thinking about AGI, etc. In the meantime, I'm strongly in favor of arguing about this a bunch in the comments, sharing thoughts and reflections on your own models, etc. -- going straight for the meaty central disagreements now, not waiting to hash this out later.

Excited about that!

[-]Steven Byrnes4y190

Couple things:

First, there is a lot of work in the "alignment community" that involves (for example) decision theory or open-source-game-theory or acausal trade, and I haven't found any of it helpful for what I personally think about (which I'd like to think is "directly attacking the heart of the problem", but others may judge for themselves when my upcoming post series comes out!).

I guess I see this subset of work as consistent with the hypothesis "some people have been nerd-sniped!". But it's also consistent with "some people have reasonable beliefs and I don't share them, or maybe I haven't bothered to understand them". So I'm a bit loath to go around criticizing them, without putting more work into it. But still, this is a semi-endorsement of one of the things you're saying.

Second, my understanding of MIRI (as an outsider, based purely on my vague recollection of their newsletters etc., and someone can correct me) is that (1) they have a group working on "better understand agent foundations", and this group contains Abram and Scott, and they publish pretty much everything they're doing, (2) they have a group working on undisclosed research projects, which are NOT "better under...

[-]Ben Pace4y*140

Adam, can you make a positive case here for how the work being done on prosaic alignment leads to success? You didn't make one, and without it I don't understand where you're coming from. I'm not asking you to tell me a story that you have 100% probability on, just what is the success story you're acting under, such that EY's stances seem to you to be mostly distracting people from the real work.

[-]adamShimi4y*150

Thanks for trying to understand my point and asking me for more details. I appreciate it.

Yet I feel weird when trying to answer, because my gut reaction to your comment is that you're asking the wrong question? Also, the compression of my view to "EY's stances seem to you to be mostly distracting people from the real work" sounds more lossy than I'm comfortable with. So let me try to clarify and focus on these feelings and impressions, then I'll answer more about which success stories or directions excite me.

My current problem with EY's stances is twofold:

First, in posts like this one, he literally writes that everything done under the label of alignment is faking it and not even attacking the problem, except like 3 people who even if they're trying have it all wrong. I think this is completely wrong, and that's even more annoying because I find that most people working on alignment are trying far harder to justify why they e...

[-]Rob Bensinger4y320

Thanks for naming specific work you think is really good! I think it's pretty important here to focus on the object-level. Even if you think the goodness of these particular research directions isn't cruxy (because there's a huge list of other things you find promising, and your view is mainly about the list as a whole rather than about any particular items on it), I still think it's super important for us to focus on object-level examples, since this will probably help draw out what the generators for the disagreement are.

John Wentworth’s Natural Abstraction Hypothesis, which is about checking his formalism-backed intuition that NNs actually learn similar abstractions that humans do. The success story is pretty obvious, in that if John is right, alignment should be far easier.

Eliezer liked this post enough that he asked me to signal-boost it in the MIRI Newsletter back in April.

And Paul Christiano and Stuart Armstrong are two of the people Eliezer named as doing very-unusually good work. We continue to pay Stuart to support his research, though he's mainly supported by FHI.

And Evan works at MIRI, which provides some Bayesian evidence about how much we tend to like his stuff. :)

So ...

[-]Rob Bensinger4y350

Also, I feel like I want to emphasize that, like... it's OK to believe that the field you're working in is in a bad state? The social pressure against saying that kind of thing (or even thinking it to yourself) is part of why a lot of scientific fields are unhealthy, IMO. I'm in favor of you not taking for granted that Eliezer's right, and pushing back insofar as your models disagree with his. But I want to advocate against:

* Saying false things about what the other person is saying. A lot of what you've said about Eliezer and MIRI is just obviously false (e.g., we have contempt for "experimental work" and think you can't make progress by "Actually working with AIs and Thinking about real AIs").
* Shrinking the window of 'socially acceptable things to say about the field as a whole' (as opposed to unsolicited harsh put-downs of a particular researcher's work, where I see more value in being cautious).

I want to advocate 'smack-talking the field is fine, if that's your honest view; and pushing back is fine, if you disagree with the view'. I want to see more pushing back on the object level (insofar as people disagree), and less 'how dare you say that, do you think you're the king of align...

7adamShimi4y

Thanks for your great comments!

One thing I want to make clear is that I'm quite aware that my comments have not been as high-quality as they should have been. As I wrote in the disclaimer, I was writing from a place of frustration and annoyance, which also implies a focus on more status-y things. That sounded necessary to me to air out this frustration, and I think this was a good idea given the upvotes of my original post and the couple of people who messaged me to tell me that they were also annoyed.

That being said, much of what I was railing against is a general perception of the situation, from reading a lot of stuff but not necessarily stopping to study all the evidence before writing a fully thought-through opinion. I think this is where the "saying obviously false things" comes from (which I think are pretty easy to believe from just reading this post and a bunch of MIRI write-ups), and why your comments are really important to clarify the discrepancy between this general mental picture I was drawing from and the actual reality. Also recentering the discussion on the object-level instead of on status arguments sounds like a good move.

You make a lot of good points and I definitely want to continue the conversation and have more detailed discussion, but I also feel that for the moment I need to take some steps back, read your comments and some of the pointers in other comments, and think a bit more about the question. I don't think there's much more to gain from me answering quickly, mostly in reaction.

(I also had the brilliant idea of starting this thread just when I was on the edge of burning out from working too much (during my holidays), so I'm just going to take some time off from work. But I definitely want to continue this conversation further when I come back, although probably not in this thread ^^)

[-]Rob Bensinger4y230

Enjoy your rest! :)

That sounded necessary to me to air out this frustration, and I think this was a good idea given the upvotes of my original post and the couple of people who messaged me to tell me that they were also annoyed.

If you'd just aired out your frustration, framing claims about others in NVC-like 'I feel like...' terms (insofar as you suspect you wouldn't reflectively endorse them), and then a bunch of people messaged you in private to say "thank you! you captured my feelings really well", then that would seem clearly great to me.

I'm a bit worried that what instead happened is that you made a bunch of clearly-false claims about other people and gave a bunch of invalid arguments, mixed in with the feelings-stuff; and you used the content warning at the top of the message to avoid having to distinguish which parts of your long, detailed comment are endorsed or not (rather than also flagging this within the comment); and then you also ran with this in a bunch of follow-up comments that were similarly not-endorsed but didn't even have the top-of-comment disclaimer. So that I could imagine some people who also aren't independently familiar with all the background facts, coul...

[-]adamShimi4y70

I'm a bit worried that what instead happened is that you made a bunch of clearly-false claims about other people and gave a bunch of invalid arguments, mixed in with the feelings-stuff; and you used the content warning at the top of the message to avoid having to distinguish which parts of your long, detailed comment are endorsed or not (rather than also flagging this within the comment); and then you also ran with this in a bunch of follow-up comments that were similarly not-endorsed but didn't even have the top-of-comment disclaimer. So that I could imagine some people who also aren't independently familiar with all the background facts, could come away with a lot of wrong beliefs about the people you're criticizing.

That sounds a bit unfair, in the sense that it makes it look like I just invented stuff I didn't believe and ran with it. What actually happened was that I wrote about my frustrations, but made the mistake of stating them as obvious facts instead of impressions.

Of course, I imagine you feel that my portrayal of EY and MIRI was also unfair, sorry about that.

(I added a note to the three most ranty comments on this thread saying that people should mentally add "I feel like..." to judgments in them.)

[-]Rob Bensinger4y90

Thanks for adding the note! :)

I'm confused. When I say 'that's just my impression', I mean something like 'that's an inside-view belief that I endorse but haven't carefully vetted'. (See, e.g., Impression Track Records, referring to Naming Beliefs.)

Example: you said that MIRI has "contempt with experimental work and not doing only decision theory and logic".

My prior guess would have been that you don't actually, for-real believe that -- that it's not your 'impression' in the above sense, more like 'unendorsed venting/hyperbole that has a more complicated relation to something you really believe'.

If you do (or did) think that's actually true, then our models of MIRI are much more different than I thought! Alternatively, if you agree this is not true, then that's all I meant in the previous comment. (Sorry if I was unclear about that.)

3adamShimi4y

I would say that with slight caveats (make "decision theory and logic" a bit larger to include some more mathy stuff and make "all experimental work" a bit smaller to not include Redwood's work), this was indeed my model. What made me update from our discussion is the realization that I interpreted the dismissal of basically all alignment research as "this has no value whatsoever and people doing it are just pretending to care about alignment", where it should have been interpreted as something like "this is potentially interesting/new/exciting, but it doesn't look like it brings us closer to solving alignment in a significant way, hence we're still failing".

1Rob Bensinger4y

'Experimental work is categorically bad, but Redwood's work doesn't count' does not sound like a "slight caveat" to me! What does this generalization mean at all if Redwood's stuff doesn't count? (Neither, for that matter, does the difference between 'decision theory and logic' and 'all mathy stuff MIRI has ever focused on' seem like a 'slight caveat' to me -- but in that case maybe it's because I have a lot more non-logic, non-decision-theory examples in my mind that you might not be familiar with, since it sounds like you haven't read much MIRI stuff?).

[-]Rohin Shah4y290

(Responding to entire comment thread) Rob, I don't think you're modeling what MIRI looks like from the outside very well.

* There's a lot of public stuff from MIRI on a cluster that has as central elements decision theory and logic (logical induction, Vingean reflection, FDT, reflective oracles, Cartesian Frames, Finite Factored Sets...)
* There was once an agenda (AAMLS) that involved thinking about machine learning systems, but it was deprioritized, and the people working on it left MIRI.
* There was a non-public agenda that involved Haskell programmers. That's about all I know about it. For all I know they were doing something similar to the modal logic work I've seen in the past.
* Eliezer frequently talks about how everyone doing ML work is pursuing dead ends, with potentially the exception of Chris Olah. Chris's work is not central to the cluster I would call "experimentalist".
* There has been one positive comment on the KL-divergence result in summarizing from human feedback. That wasn't the main point of that paper and was an extremely predictable result.
* There has also been one positive comment on Redwood Research, which was founded by people who have close ties to MIRI. The current step...

[-]Rob Bensinger4y*130

So, the point of my comments was to draw a contrast between having a low opinion of "experimental work and not doing only decision theory and logic", and having a low opinion of "mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc." I didn't intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as 'oh yeah, that's not true and wasn't what I was thinking') is not great for discussion.

I think it's true that 'MIRI is super not into most ML alignment work', and I think it used to be true that MIRI put almost all of its research effort into HRAD-ish work, and regardless, this all seems like a completely understandable cached impression to have of current-MIRI. If I wrote stuff that makes it sound like I don't think those views are common, reasonable, etc., then I apologize for that and disavow the thing I said.

But this is orthogonal to what I thought I was talking about, so I'm confused about what seems to me like a topic switch. Maybe the ...

[-]dxu4y*250

So, the point of my comments was to draw a contrast between having a low opinion of "experimental work and not doing only decision theory and logic", and having a low opinion of "mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc." I didn't intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as 'oh yeah, that's not true and wasn't what I meant') is not great for discussion.

It occurs to me that part of the problem may be precisely that Adam et al. don't think there's a large difference between these two claims (that actually matters). For example, when I query my (rough, coarse-grained) model of [your typical prosaic alignment optimist], the model in question responds to your statement with something along these lines:

If you remove "mainstream ML alignment work, and nearly all work outside of the HRAD-ish cluster of decision theory, logic, etc." from "experimental work", what's left? Perhaps there are one or two (non-mainstream, bare...

6Rohin Shah4y

^ This response is great.

I also think I naturally interpreted the terms in Adam's comment as pointing to specific clusters of work in today's world, rather than universal claims about all work that could ever be done. That is, when I see "experimental work and not doing only decision theory and logic", I automatically think of "experimental work" as pointing to a specific cluster of work that exists in today's world (which we might call mainstream ML alignment), rather than "any information you can get by running code". Whereas it seems you interpreted it as something closer to "MIRI thinks there isn't any information to get by running code". My brain insists that my interpretation is the obvious one and is confused how anyone (within the AI alignment field, who knows about the work that is being done) could interpret it as the latter. (Although the existence of non-public experimental work that isn't mainstream ML is a good candidate for how you would start to interpret "experimental work" as the latter.) But this seems very plausibly a typical mind fallacy.

EDIT: Also, to explicitly say it, sorry for misunderstanding what you were trying to say. I did in fact read your comments as saying "no, MIRI is not categorically against mainstream ML work, and MIRI is not only working on HRAD-ish stuff like decision theory and logic, and furthermore this should be pretty obvious to outside observers", and now I realize that is not what you were saying.

1Rob Bensinger4y

This is a good comment! I also agree that it's mostly on MIRI to try to explain its views, not on others to do painstaking exegesis. If I don't have a ready-on-hand link that clearly articulates the thing I'm trying to say, then it's not surprising if others don't have it in their model. And based on these comments, I update that there's probably more disagreement-about-MIRI than I was thinking, and less (though still a decent amount of) hyperbole/etc. If so, sorry about jumping to conclusions, Adam!

[-]jsteinhardt4y150

Not sure if this helps, and haven't read the thread carefully, but my sense is your framing might be eliding distinctions that are actually there, in a way that makes it harder to get to the bottom of your disagreement with Adam. Some predictions I'd have are that:

* For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
* For almost all experimental results you would think they were so much less informative as to not be worthwhile.
* There's a small subset of experimental results that we would think are comparably informative, and also some that you would find much more informative than I would.

(I'd be willing to take bets on these or pick candidate experiments to clarify this.)

In addition, a consequence of these beliefs is that compared to me you think we should be spending way more time sitting around thinking about stuff, and way less time doing experiments, than I do.

I would agree with you that "MIRI hates all experimental work" / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.

[-]Rob Bensinger4y*210

I would agree with you that "MIRI hates all experimental work" / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.

Ooh, that's really interesting. Thinking about it, I think my sense of what's going on is (and I'd be interested to hear how this differs from your sense):

Compared to the average alignment researcher, MIRI tends to put more weight on reasoning like 'sufficiently capable and general AI is likely to have property X as a strong default, because approximately-X-ish properties don't seem particularly difficult to implement (e.g., they aren't computationally intractable), and we can see from first principles that agents will be systematically less able to get what they want when they lack property X'. My sense is that MIRI puts more weight on arguments like this for reasons like:
- We're more impressed with the track record of inside-view reasoning in science.
  - I suspect this is partly because the average alignment researcher is impressed with how unusually-poorly inside-view reasoning has done i

... (read more)

[-]Rob Bensinger4y*100

I suspect a third important reason is that MIRI thinks alignment is mostly about achieving a certain kind of interpretability/understandability/etc. in the first AGI systems. Most ML experiments either aren't about interpretability and 'cracking open the hood', or they're not approaching the problem in a way that MIRI's excited by.

E.g., if you think alignment research is mostly about testing outer reward functions to see what first-order behavior they produce in non-AGI systems, rather than about 'looking in the learned model's brain' to spot mesa-optimization and analyze what that optimization is ultimately 'trying to do' (or whatever), then you probably won't produce stuff that MIRI's excited about regardless of how experimental vs. theoretical your work is.

(In which case, maybe this is not actually a crux for the usefulness of most alignment experiments, and is instead a crux for the usefulness of most alignment research in general.)

[-]jsteinhardt4y90

Thanks. For time/brevity, I'll just say which things I agree / disagree with:

> sufficiently capable and general AI is likely to have property X as a strong default [...]

I generally agree with this, although for certain important values of X (such as "fooling humans for instrumental reasons") I'm probably more optimistic than you that there will be a robust effort to get not-X, including by many traditional ML people. I'm also probably more optimistic (but not certain) that those efforts will succeed.

[inside view, modest epistemology]: I don't have a strong take on either of these. My main take on inside views is that they are great for generating interesting and valuable hypotheses, but usually wrong on the particulars.

> less weight on reasoning like 'X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it's developed'

I agree, see my post On the Risks of Emergent Behavior in Foundation Models. In the past I think I put too much weight on this type of reasoning, and also think most people in ML put too much weight on it.

> MIRI thinks AGI is better thought of as 'a weird specific sort of AI', rather than as 'like ex... (read more)

[-]Eliezer Yudkowsky4y170

I'm not (retroactively in imaginary prehindsight) excited by this problem because neither of the 2 possible answers (3 possible if you count "the same") had any clear-to-my-model relevance to alignment, or even AGI. AGI will have better OOD generalization on capabilities than current tech, basically by the definition of AGI; and then we've got less-clear-to-OpenPhil forces which cause the alignment to generalize more poorly than the capabilities did, which is the Big Problem. Bigger models generalizing better or worse doesn't say anything obvious to any piece of my model of the Big Problem. Though if larger models start generalizing more poorly, then it takes longer to stupidly-brute-scale to AGI, which I suppose affects timelines some, but that just takes timelines from unpredictable to unpredictable sooo.

If we qualify an experiment as interesting when it can tell anyone about anything, then there's an infinite variety of experiments "interesting" in this sense and I could generate an unlimited number of them. But I do restrict my interest to experiments which can not only tell me something I don't know, but tell me something relevant that I don't know. ... (read more)

1adamShimi4y

Trying to rephrase it in my own words (which will necessarily lose some details), are you interested in Redwood's research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now? Which might tell us for example "what aspect of these predictable problems crop up first, and why?"

[-]Eliezer Yudkowsky4y130

are you interested in Redwood's research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now?

It potentially sheds light on small subpieces of things that are particular aspects that contribute to the Real Problem, like "What actually went into the nonviolence predicate instead of just nonviolence?" Much of the Real Meta-Problem is that you do not get things analogous to the full Real Problem until you are just about ready to die.

2Rob Bensinger4y

(I suspect there are a bunch of other disagreements going into this too, including basic divergences on questions like 'What's even the point of aligning AGI? What should humanity do with aligned AGI once it has it?'.)

[-]orthonormal4y*110

One tiny note: I was among the people on AAMLS; I did leave MIRI the next year; and my reasons for so doing are not in any way an indictment of MIRI. (I was having some me-problems.)

I still endorse MIRI as, in some sense, being the adults in the AI Safety room, which has... disconcerting effects on my own level of optimism.

7adamShimi4y

Not planning to answer more on this thread, but given how my last messages seem to have confused you, here is my last attempt at sharing my mental model (so you can flag in an answer where I'm wrong in your opinion for readers of this thread).

Also, I just checked on the publication list, and I've read or skimmed most things MIRI published since 2014 (including most newsletters and blog posts on the MIRI website).

My model of MIRI is that initially, there was a bunch of people including EY who were working mostly on decision theory stuff, tiling, model theory, the sort of stuff I was pointing at. That predates Nate's arrival, but in my model it becomes far more legible after that (so circa 2014/2015). In my model, I call that "old school MIRI", and that was a big chunk of what I was pointing out in my original comment.

Then there are a bunch of things that seem to have happened:

* Newer people (Abram and Scott come to mind, but mostly because they're the ones who post on the AF and who I've talked to) join this old-school MIRI approach and reshape it into Embedded Agency. Now this new agenda is a bit different from the old-school MIRI work, but I feel like it's still not that far from decision theory and logic (with maybe a stronger emphasis on the Bayesian part for stuff like logical induction). That might be a part where we're disagreeing.
* A direction related to embedded agency and the decision theory and logic stuff, but focused on implementations through strongly typed programming languages like Haskell and type theory. That's technically practical, but in my mental model this goes in the same category as "decision theory and logic stuff", especially because that sort of programming is very close to logic and natural deduction.
* MIRI starts its ML-focused agenda, which you already mentioned. The impression I still have is that this didn't lead to much published work that was actually experimental, instead focusing on recasting questions of alignment through

[-]johnswentworth4y*260

... I find that most people working on alignment are trying far harder to justify why they expect their work to matter than EY and the old-school MIRI team ever did.

You've had a few comments along these lines in this thread, and I think this is where you're most severely failing to see the situation from Yudkowsky's point of view.

From Yudkowsky's view, explaining and justifying MIRI's work (and the processes he uses to reach such judgements more generally) was the main point of the sequences. He has written more on the topic than anyone else in the world, by a wide margin. He basically spent several years full-time just trying to get everyone up to speed, because the inductive gap was very very wide.

When I put on my Yudkowsky hat and look at both the OP and your comments through that lens... I imagine if I were Yudkowsky I'd feel pretty exasperated at this point. Like, he's written a massive volume on the topic, and now ten years later a large chunk of people haven't even bothered to read it. (In particular, I know (because it's come up in conversation) that at least a few of the people who talk about prosaic alignment a lot haven't read the sequences, and I suspect that a d... (read more)

5adamShimi4y

Thanks for the pushback! My memory of the sequences is that it's far more about defending and explaining the alignment problem than criticizing prosaic AGI (maybe because the term couldn't have been used years before Paul coined it?). Could you give me the best pointers of prosaic alignment criticism in the sequences? (I've read the sequences, but I don't remember every single post, and my impression from memory is what I've written above). I feel also that there might be a discrepancy between who I think of when I think of prosaic alignment researchers and what the category means in general/to most people here? My category mostly includes AF posters, people from a bunch of places like EleutherAI/OpenAI/DeepMind/Anthropic/Redwood and people from CHAI and FHI. I expect most of these people to actually have read the sequences, and tried to understand MIRI's perspective. Maybe someone could point out a list of other places where prosaic alignment research is being done that I'm missing, especially places where people probably haven't read the sequences? Or maybe I'm overestimating how many of the people in the places I mentioned have read the sequences?

[-]johnswentworth4y260

I don't mean to say that there's critique of prosaic alignment specifically in the sequences. Rather, a lot of the generators of the Yudkowsky-esque worldview are in there. (That is how the sequences work: it's not about arguing specific ideas around alignment, it's about explaining enough of the background frames and generators that the argument becomes unnecessary. "Raise the sanity waterline" and all that.)

For instance, just the other day I ran across this:

Of this I learn the lesson: You cannot manipulate confusion. You cannot make clever plans to work around the holes in your understanding. You can't even make "best guesses" about things which fundamentally confuse you, and relate them to other confusing things. Well, you can, but you won't get it right, until your confusion dissolves. Confusion exists in the mind, not in the reality, and trying to treat it like something you can pick up and move around, will only result in unintentional comedy.
Similarly, you cannot come up with clever reasons why the gaps in your model don't matter. You cannot draw a border around the mystery, put on neat handles that let you use the Mysterious Thing without

... (read more)

[-]Rob Bensinger4y*240

From testimonials by a bunch of more ML people and how any discussion of alignment needs to clarify that you don’t share MIRI’s contempt with experimental work and not doing only decision theory and logic

If you were in the situation described by The Rocket Alignment Problem, you could think "working with rockets right now isn't useful, we need to focus on our conceptual confusions about more basic things" without feeling inherently contemptuous of experimentalism -- it's a tool in the toolbox (which may or may not be appropriate to the task at hand), not a low- or high-status activity on a status hierarchy.

Separately, I think MIRI has always been pretty eager to run experiments in software when they saw an opportunity to test important questions that way. It's also been 4.5 years now since we announced that we were shifting a lot of resources away from Agent Foundations and into new stuff, and 3 years since we wrote a very long (though still oblique) post about that research, talking about its heavy focus on running software experiments. Though we also made sure to say:

In a sense, you can think of our new research as tackling the same sort of problem that we’ve always been attackin

... (read more)

[-]Ben Pace4y140

Thank you for the links Adam. To clarify, the kind of argument I'm really looking for is something like the following three (hypothetical) examples.

Mesa-optimization is the primary threat model of unaligned AGI systems. Over the next few decades there will be a lot of companies building ML systems that create mesa-optimizers. I think it is within 5 years of current progress that we will understand how ML systems create mesa-optimizers and how to stop it. Therefore I think the current field is adequate for the problem (80%).
When I look at the research we're outputting, it seems to me that we are producing research at a speed and flexibility faster than any comparably sized academic department globally, or the ML industry, and so I am much more hopeful that we're able to solve our difficult problem before the industry builds an unaligned AGI. I give it a 25% probability, which I suspect is much higher than Eliezer's.
I basically agree the alignment problem is hard and unlikely to be solved, but I don't think we have any alternative than the current sorts of work being done, which is a combo of (a) agent foundations work (b) designing theoretical training algorithms (like Paul is)

... (read more)

[-]adamShimi4y100

Thanks for the examples, that helps a lot.

I'm glad that I posted my inflammatory comment, if only because exchanging with you and Rob made me actually consider the question of "what is our story to success", instead of just "are we making progress/creating valuable knowledge". And the way you two have been casting it is way less aversive to me than the way EY tends to frame it. This is definitely something I want to think more about. :)

I want to leave this paragraph as social acknowledgment that you mentioned upthread that you're tired and taking a break, and I want to give you a bunch of social space to not return to this thread for however long you need to take! Slow comments are often the best.

Appreciated. ;)

3Ben Pace4y

Glad to hear. And yeah, that’s the crux of the issue for me.

2Rob Bensinger4y

! Yay! That's really great to hear. :)

[-]TurnTrout4y90

I'm sympathetic to most of your points.

highly veiled contempt for anyone not doing that

I have sympathy for the "this feels somewhat contemptuous" reading, but I want to push back a bit on the "EY contemptuously calling nearly everyone fakers" angle, because I think "[thinly] veiled contempt" is an uncharitable reading. He could be simply exasperated about the state of affairs, or wishing people would change their research directions but respect them as altruists for Trying At All, or who knows what? I'd rather not overwrite his intentions with our reactions (although it is mostly the job of the writer to ensure their writing communicates the right information [although part of the point of the website discussion was to speak frankly and bluntly]).

[-]StellaAthena4y30

If superintelligence is approximately multimodal GPT-17 plus reinforcement learning, then understanding how GPT-3-scale algorithms function is exceptionally important to understanding superintelligence.

Also, if superintelligence doesn’t happen then prosaic alignment is the only kind of alignment.

[-]Richard_Ngo4y130

This is already reflected in the upvotes, but just to say it explicitly: I think the replies to this comment from Rob and dxu in particular have been exceptionally charitable and productive; kudos to them. This seems like a very good case study in responding to a provocative framing with a concentration of positive discussion norms that leads to productive engagement.

[-]Steven Byrnes4y130

if EY and other MIRI people who are very dubious of most alignment research could give more feedback on that and enter the dialogue, maybe by commenting more on the AF. My problem is not so much with them disagreeing with most of the work, it’s about the disagreement stopping at “that’s not going to work” and not having dialogue and back and forth.

Just in case anyone hasn't already seen these, EY wrote Challenges to Christiano’s capability amplification proposal and this comment (that I already linked to in a different comment on this page) (also has a reply thread), both in 2018. Also The Rocket Alignment Problem.

[-]wunan4y80

Context for anyone who's not aware:

Nerd sniping is a slang term that describes a particularly interesting problem that is presented to a nerd, often a physicist, tech geek or mathematician. The nerd stops all activity to devote attention to solving the problem, often at his or her own peril

Here's the xkcd comic which coined the term.

1adamShimi4y

Thanks, I sometimes forget not everyone knows the term. :)

[-]Vaniver4y70

I'm annoyed by EY (and maybe MIRI's?) dismissal of every other alignment work, and how seriously it seems to be taken here, given their track record of choosing research agendas with very indirect impact on alignment

For what it's worth, my sense is that EY's track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.

And, like, I think it is possible that you end up in situations where the people who understand the situation best end up the most pessimistic about it. If you're trying to build a bridge to the moon, in fact it's not going to work, and any determination applied there is going to get wasted. I think I see how a "try to understand things and cut to the heart of them" notices when it's in situations like that, and I don't see how "move the ball forward from where it is now" notices when it's in situations like that.

4adamShimi4y

Agreed on the track record, which is part of why it's so frustrating that he doesn't give more details and feedback on why all these approaches are doomed in his view. That being said, I disagree for the second part, probably because we don't mean the same thing by "moving the ball"? In your bridge example, "moving the ball" looks to me like trying to see what problems the current proposal could have, how you could check them, what would be your unknown unknowns. And I definitely expect such an approach to find the problems you mention. Maybe you could give me a better model of what you mean by "moving the ball"?

[-]Vaniver4y*170

Oh, I was imagining something like "well, our current metals aren't strong enough, what if we developed stronger ones?", and then focusing on metallurgy. And this is making forward progress--you can build a taller tower out of steel than out of iron--but it's missing more fundamental issues like "you're not going to be able to drive on a bridge that's perpendicular to gravity, and the direction of gravity will change over the course of the trip" or "the moon moves relative to the earth, such that your bridge won't be able to be one object", which will sink the project even if you can find a supremely strong metal.

For example, let's consider Anca Dragan's research direction that I'm going to summarize as "getting present-day robots to understand what humans around them want and are doing so that they can achieve their goals / cooperate more effectively." (In mildly adversarial situations like driving, you don't want to make a cooperatebot, but rather something that follows established norms / prevents 'cutting' and so on, but when you have a human-robot team you do care mostly about effective cooperation.)

My guess is this 1) will make the world a better place in the short run under ... (read more)

[-]Lukas_Gloor4y30

I share the impression that the agent foundations research agenda seemed not that important. But that point doesn't feel sufficient to argue that Eliezer's pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I'm not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.) MIRI have deprioritized agent foundations research for quite a while now. I also just think it's extremely common for people to have periods where they work on research that eventually turns out to be not that important; the interesting thing is to see what happens when that becomes more apparent. I immediately trust people more if I see that they are capable of pivoting and owning up to past mistakes, and I could imagine that MIRI deserves a passing grade on this, even though I also have to say that I don't know how exactly they nowadays think about prioritization in 2017 and earlier.

I really like Vaniver's comment further below:

For what it's worth, my sense is that EY's track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.

And, like

... (read more)

3adamShimi4y

That's an awesome comment, thanks! I get why you take that from my rant, but that's not really what I meant. I'm more criticizing the "everything is doomed but let's not give concrete feedback to people" stance, and I think part of it comes from believing for so long (and maybe still believing) that their own approach was the only non-fake one. Also just calling everyone else a faker is quite disrespectful and not helping. MIRI does have some positive points for changing their minds, but also some negative points IMO for taking so long to change their mind. Not sure what the total is. Here again, it's not so much that I disagree with EY about there being problems in the current research proposals. I expect that some of the problems he would point out are ones I see too. I just don't get the transition from "there are problems with all our current ideas" to "everyone is faking working on alignment and we're all doomed". Very good point. That being said, many of the more prosaic alignment people changed their minds multiple times, whereas on these specific questions I feel EY and MIRI didn't except when forced by tremendous pressure, which makes me believe that this criticism applies more to them. But that's one point where having some more knowledge of the internal debates at MIRI could make me change my mind completely. My impression from talking with people (but not having direct confirmation from the people who left) was far more that OpenAI was focusing the conceptual safety team on ML work and the other safety team on making sure GPT-3 was not racist, which was not the type of work they were really excited about. But I might also be totally wrong about this. I'm confused about your question, because what you describe sounds like a misprediction that makes sense? Also I feel that in this case, there's a difference between solving the coordination problem of having people implement the solution or not go on a race (which looks indeed harder in the light of Co

[-]Rob Bensinger4y60

My impression from talking with people (but not having direct confirmation from the people who left) was far more that OpenAI was focusing the conceptual safety team on ML work and the other safety team on making sure GPT-3 was not racist, which was not the type of work they were really excited about. But I might also be totally wrong about this.

Interesting! This is quite different from the second-hand accounts I heard. (I assume we're touching different parts of the elephant.)

[-]Vanessa Kosoy4y150

I am confused about the emphasis on secrecy.

Certainly, if you're working on a substantial breakthrough in AI capability, there are reasons to keep it secret. But why would you work on that in the first place? One answer I can imagine is: "Current AI building blocks (deep learning) are too opaque and hard to prove things about, so we need to develop alternative building blocks. These alternative building blocks, if such are found, might produce a capability breakthrough." But, currently we don't have a solution even modulo the building blocks. Because, even if we allow computationally infeasible, or what I've been calling "weakly feasible" algorithms, such as Bayesian inference, we still don't have a complete solution. Therefore, it seems reasonable to focus on "alignment modulo building blocks" i.e. alignment in the unbounded / weakly feasible regime, and I don't see any reason to be secretive about it. On the contrary, we want to involve as many people as possible to get more progress and we want the AI community to know about whatever we come up with, to increase the probability they will use it.

Maybe the objection is, this path won't lead us to success quickly enough. But then, what alternative path is better? And how would secrecy enable it?

3Vaniver4y

Most of the mentions of secrecy in this post are in that context. I think a lot of people who say they care about the alignment problem think that the 'two progress bars' model, where you can work on alignment and capability independent of each other, is not correct, and so they don't see all that much of a difference between capability work and alignment work. (If you're trying to predict human approval of plans, for example, generic improvements in ability to predict things or understand plans help you as well.) But even if you don't believe in two progress bars, if you believe in differential tech development, it does seem like secrecy is a good idea (because not everyone is going to be trying to predict human approval). (It's only in worlds where you think alignment is 'easy' compared to capabilities that this isn't a concern.)

[-]Vanessa Kosoy4y50

If there's no difference between capability work and alignment work, then how is it possible to influence anything at all? If capability and alignment go hand in hand, then either transformative capability corresponds to sufficient alignment (in which case there is no technical problem) or it doesn't (in which case we're doomed).

The only world in which secrecy makes sense, AFAICT, is if you're going to solve alignment and capability all by yourself. I am extremely skeptical of this approach.

[-]Zac Hatfield-Dodds4y*150

I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I'm less convinced than Eliezer about theorem-proving for AGI safety.

There are so many independently fatal objections that I don't know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.
Your proof must not include executing the model, nor equivalent computations (e.g. concolic execution). You may train a custom model and/or directly set model weights, so long as it uses a standard EfficientNet architecture and reaches 99% accuracy. Bonus points for proving more of the sensitivity curve.
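To make the quantity in the challenge concrete, here is a minimal sketch on a toy linear softmax classifier. Everything in it is an assumption for illustration: the model is not an EfficientNet, the weights are random rather than trained, and the names `predict`, `max_lsb_sensitivity` and `analytic_bound` are hypothetical. The brute-force search executes the model, so it is exactly the kind of evidence the challenge rules out; the analytic bound is the kind of statement a proof would have to deliver, and it is only easy here because the model is linear.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, bias, pixels):
    """Class probabilities of a toy linear model on 8-bit pixels (0..255)."""
    logits = [
        b + sum(w * (p / 255.0) for w, p in zip(row, pixels))
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


def max_lsb_sensitivity(weights, bias, images):
    """Largest observed change in any class probability when the least
    significant bit of one pixel is flipped. This runs the model, so it
    only gives a lower bound on the true maximum over all images."""
    worst = 0.0
    for pixels in images:
        base = predict(weights, bias, pixels)
        for i in range(len(pixels)):
            flipped = list(pixels)
            flipped[i] ^= 1  # flip the least significant bit
            probs = predict(weights, bias, flipped)
            worst = max(worst, max(abs(a - b) for a, b in zip(base, probs)))
    return worst


def analytic_bound(weights):
    """Upper bound that needs no model execution. The gradient of any
    softmax output with respect to the logits has L1 norm
    2 * p * (1 - p) <= 1/2, and an LSB flip moves each logit by at most
    max|w| / 255, so the probability change is at most half of that."""
    return 0.5 * max(abs(w) for row in weights for w in row) / 255.0


# Hypothetical toy setup: 16 "pixels", 3 classes, random weights.
rng = random.Random(0)
N_PIXELS, N_CLASSES = 16, 3
weights = [[rng.uniform(-2, 2) for _ in range(N_PIXELS)] for _ in range(N_CLASSES)]
bias = [rng.uniform(-1, 1) for _ in range(N_CLASSES)]
images = [[rng.randrange(256) for _ in range(N_PIXELS)] for _ in range(20)]

empirical = max_lsb_sensitivity(weights, bias, images)
bound = analytic_bound(weights)
print(f"observed max change: {empirical:.6f}, proven upper bound: {bound:.6f}")
```

The gap between the two numbers is the point: sampling can only show that the sensitivity is at least the observed value, while a proof has to cap it for every possible image, and no comparably simple argument is known to give a tight cap for a deep convolutional network.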

I will also take bets that nobody will accomplish this by 2025, nor any loosely equivalent proof for a GPT-3 class model by 2040. This is a very bold claim, but I believe that rigorously proving even trivial global ... (read more)

[-]Vanessa Kosoy4y120

You're attacking a strawman of what kind of theorems we want to prove. Obviously we are not going to prove theorems that contain specific datasets as part of the statement. What we're going to do is build a theory founded on certain assumptions about the real world (such as locality / decoupling of dynamics on different scales / some kind of chaos / certain bounds on computational complexity / existence of simple fundamental laws etc) and humans (e.g. that they are approximately rational agents, for some definition thereof). Such a theory can produce many insights about what factors influence e.g. the possibility of adversarial attacks that you mention, most of which will be qualitative and some of which can be made quantitative by combining with empirical research (such as the research OpenAI does on scaling laws).

And, ofc the theory is only as good as its assumptions. Ofc if there are attack vectors your model doesn't account for, your system can be attacked. Having a theory is not a magical wand that immediately solves everything. But, it does put you in a much, much better position than working off pure guesswork and intuition.

Another angle is that, once we can at least state the theorem we might try to make the AI itself prove it. This can still fail: maybe the theorem is practically unprovable, or maybe we can't safely train the AI to prove theorems. But it does give us some additional leverage.

[-]Zac Hatfield-Dodds4y10

First, an apology: I didn't mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I'm sorry I wasn't clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/empirical analysis could be very valuable.

Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler approximations cannot provide reliable guarantees about the behaviour of large models with complex emergent behaviour. This is in large part because formalising realistic assumptions (e.g. biased humans) is very difficult, and somewhat because proving anything about very large models is wildly beyond the state of the art and even verified systems have (fewer) bugs.

Being able to state theorems about AGI seems absolutely necessary for success; but I don't think it's close to sufficient.

5Vanessa Kosoy4y

I think we might have some disagreement about degree more than about kind. I think that we are probably going to design architectures that make proving easier rather than proving things about architectures optimized only for capability, but not necessarily. Moreover, some qualitative properties are not sensitive to architecture and we can prove them about classes of architectures that include those optimized for capability. And, I think humans also belong to a useful class of agents with simple description (e.g. along the lines I described here) and you don't need anything like a detailed model of bias. And, people do manage to prove some things about large models, e.g. this, just not enough things. And, some of the proofs might be produced by the system itself in runtime (e.g. the system will have a trustworthy/rigorous part and an untrustworthy/heuristic part and the rigorous part will make sure the heuristic part is proving the safety of its proposals before they are implemented). I think the pipeline of success looks something like theoretical models => phenomenological models (i.e. models informed by a combination of theory and experiment) => security-mindset engineering (i.e. engineering that keeps track of the differences between models and reality and makes sure they are suitably bounded / irrelevant) => plethora of security-mindset testing methods, including but not limited to formal verification (i.e. aiming for fool-proof test coverage while also making sure that, modulo previous tests, each test involves using the system in safe ways even if it has bugs). And ofc it's not a pure waterfall, there is feedback from each stage to previous stages.

3Buck4y

Am I correct that you wouldn't find a bound acceptable, you specifically want the exact maximum?

2Buck4y

You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
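A minimal sketch of the kind of bound described above, assuming a plain ReLU MLP without biases and measuring distances in the ℓ∞ norm (the comment does not specify a norm; the ℓ2 version would use spectral norms of the weight matrices instead). Since ReLU is 1-Lipschitz, the product of the layers' induced norms is an upper bound on the network's Lipschitz constant, though usually a loose one. All function names here are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

def forward(weights, x):
    """Bias-free ReLU MLP: ReLU after every layer except the last."""
    for i, W in enumerate(weights):
        x = matvec(W, x)
        if i < len(weights) - 1:
            x = relu(x)
    return x

def induced_inf_norm(W):
    """Operator norm of W for the l-infinity norm: max absolute row sum."""
    return max(sum(abs(w) for w in row) for row in W)

def lipschitz_upper_bound(weights):
    """Ignore the ReLUs (each is 1-Lipschitz) and multiply the layer norms."""
    bound = 1.0
    for W in weights:
        bound *= induced_inf_norm(W)
    return bound

def inf_norm(v):
    return max(abs(a) for a in v)
```

For any two inputs `x` and `y`, `inf_norm` of the output difference should then be at most `lipschitz_upper_bound(weights)` times `inf_norm` of the input difference. The tighter techniques alluded to above are not shown here.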

[-]Vanessa Kosoy4y130

I want to push back against the idea that ANNs are "vectors of floating points" and therefore it's impossible to prove things about them. Many algorithms involve continuous variables and we can prove things about them. Support vector machines are also learning algorithms that are "vectors of floating points" and we have a pretty good theory of how they work. In fact, there already is a sizable body of theoretical results about ANNs, even if it still falls significantly short of what we need.

The biggest problem is not necessarily in the "floating points". The problem is that we still don't have satisfactory models of what an "agent" is and what it means for an agent to be "aligned". But, we do have some leads. And once we solve this part, there's no reason of principle why it cannot be combined with some (hitherto unknown) theory of generalization bounds for ANNs.

see also

[-]Rohin Shah4y120

A confusion: it seems that Eliezer views research that is predictable as basically-useless. I think I don't understand what "predictable" means here. In what sense is expected utility quantilization not predictable?

Maybe the point is that coming up with the concept is all that matters, and the experiments that people usually do don't matter because after coming up with the concept the experiments are predictable? I'm much more sympathetic to that, but then I'm confused why "predictable" implies "useless"; many prosaic alignment papers have as their main contribution a new algorithm, which seems like a similar type of thing as quantilization.

[-]Steven Byrnes4y*220

An example that springs to my mind is that Abram wrote a blog post in 2018 mentioning the "easy problem of wireheading". He described both the problem and its solution in like one sentence, and then immediately moved on to the harder problems.

Later on, DeepMind did an experiment that (in my assessment) mostly just endorsed what Abram said as being correct.

For the record, I don't think that particular DeepMind experiment was zero value, for various reasons. But at the same time, I think that Abram wins hands-down on the metric of "progress towards AI alignment per researcher-hour", and this is true at both the production and consumption end (I can read Abram's one sentence much much faster than I can skim the DeepMind paper).

If we had a plausible-to-me plan that gets us to safe & beneficial AGI, I would be really enthusiastic about going back and checking all the assumptions with experiments. That's how you shore up the foundations, flesh out the details, start developing working code and practical expertise, etc. etc. But I don't think we have such a plan right now.

Also, there are times when it's totally unclear a priori what an algorithm will do just by thinking about it, and then ... (read more)

7Rohin Shah4y

That's a good example, thanks :) EDIT: To be clear, I don't agree with but I do think this is a good example of what someone might mean when they say work is "predictable".

[-]johnswentworth4y170

An example: when I first heard the Ought experiments described, I was pretty highly confident how they'd turn out - people would mostly fail to coordinate on any problem without an already-very-obvious factorization. (See here for the kinds of evidence informing that high confidence, though applied to a slightly different question. See here and here for the more general reasoning/world models which underlie that prediction.) From what I've heard of the experiments, it seems that that is indeed basically what happened; therefore the experiments provided approximately-zero new information to my model. They were "useless" in that sense.

(I actually think those experiments were worth running just on the small chance that they'd find something very high value, or more likely that the people running them would have some high-value insight, but I'd still say "probably useless" was a reasonable description beforehand.)

I don't know if Eliezer would agree with this particular example, but I think this is the sort of thing he's gesturing at.

5Rohin Shah4y

That one makes sense (to the extent that Eliezer did confidently predict the results), since the main point of the work was to generate information through experiments. I thought the "predictable" part was also meant to apply to a lot of ML work where the main point is to produce new algorithms, but perhaps it was just meant to apply to things like Ought.

[-]LawrenceC4y*00

I actually think this particular view is worth fleshing out, since it seems to come up over and over again in discussions of what AI alignment work is valuable (versus not).

For example, it does seem to me that >80% of the work in actually writing a published paper (at least amongst papers at CHAI) (EDIT: no longer believe this on reflection, see Rohin’s comment below) involves doing work with results that are predictable to the author after the concept (for example, actually getting your algorithm to run, writing code for experiments, running said experiments, writing up the results into a paper, etc.)

[-]Rohin Shah4y10

This just doesn't match my experience at all. Looking through my past AI papers, I only see two papers where I could predict the results of the experiments on the first algorithm I tried at the beginning of the project. The first one (benefits of assistance) was explicitly meant to be a "communication" paper rather than a "research" paper (at the time of project initiation, rather than in hindsight). The second one (Overcooked) was writing up results that were meant to be the baselines against which the actual unpredictable research (e.g. this) was going to be measured; it just turned out that that was already sufficiently interesting to the broader community.

(Funny story about the Overcooked paper; we wrote the paper + did the user study in ~two weeks iirc, because it was only two weeks before the deadline that we considered that the "baseline" results might already be interesting enough to warrant a conference paper. It's now my most-cited AI paper.)

(I'm also not actually sure that I would have predicted the Overcooked results when writing down the first algorithm; the conceptual story felt strong but there are several other papers where the conceptual story felt strong b... (read more)

3LawrenceC4y

Thanks for the detailed response. On reflection, I agree with what you said - I think the amount of work it takes to translate a nice sounding idea into anything that actually works on an experimental domain is significant, and what exact work you need is generally not predictable in advance. In particular, I resonated a lot with this paragraph: At least from my vantage point, “having a strong story for why a result should be X” is insufficient for ex ante predictions of what exactly the results would be. (Once you condition on that being the story told in a paper, however, the prediction task does become trivial.) I’m now curious what the MIRI response is, as well as how well their intuitive judgments of the results are calibrated. EDIT: Here’s another toy model I came up with: you might imagine there are two regimes for science - an experiment driven regime, and a theory driven regime. In the former, it’s easy to generate many “plausible sounding” ideas and hard to be justified in holding on to any of them without experiments. The role of scientists is to be (low credence) idea generators and idea testers, and the purpose of experimentation is primarily to discover new facts that are surprising to the scientist finding them. In the second regime, the key is to come up with the right theory/deep model of AI that predicts lots of facts correctly ex ante, and then the purpose of experiments is to convince other scientists of the correctness of your idea. Good scientists in the second regime are those who discover the right deep models much faster than others. Obviously this is an oversimplification, and no one believes it’s only one or the other, but I suspect both MIRI and Stuart Russell lie more on the “have the right idea, and the paper experiments are there to convince others/apply the idea in a useful domain” view, while most ML researchers hold the more experimentalist view of research?

[-]Ben Pace4y100

Aaaaaaaaaaaaahhhhhhhhhhhhhhhhh!!!!!!!!!!!!

(...I'll be at the office, thinking about how to make enough progress fast enough.)

[-]Ben Pace4y160

Follow-up

One of Eliezer's claims here is

It is very, very clear that at present rates of progress, adding that level of alignment capability as grown over the next N years, to the AGI capability that arrives after N years, results in everybody dying very quickly.

This is a claim I basically agree with.

I don't think the situation is entirely hopeless, but I don't think any of the current plans (or the current alignment field) are on track to save us.

[-]LawrenceC4y*80

I'm sure Eliezer has written about this previously, but why doesn't he think corrigibility is a natural stance?

It does seem like existing approaches to corrigibility (IE, the utility balancing approaches in MIRI/Stuart Armstrong's work and the "agent has incomplete information" approaches outlined in Dylan Hadfield-Menell or Alex Turner's work) are incredibly fragile. I do agree that current approaches involving utility balancing/assigning utility to branches never executed are probably way too finicky to get working. I also agree that all the existing approaches involving the agent modelling itself as having incomplete information rely on well-calibrated priors and also all succumb to the problem of fully updated deference.

However, I think it's not at all obvious to me that corrigibility doesn't have a "small central core". It does seem to me like the "you are incomplete, you will never be complete" angle captures a lot of what we mean by corrigibility.

It's possible the belief is empirical - that is, people have tried all the obvious ways to patch/fix this angle, and they've all failed, so the problem is hard (at least relative to the researchers we have working on it)... (read more)

[-]Steven Byrnes4y160

Eliezer explains why he thinks corrigibility is unnatural in this comment.

5TurnTrout4y

In Aug 2020 I gave formalizing corrigibility another shot, and got something interesting but wrong out the other end. Am planning to publish sometime, but beyond that I'm not aware of other attempts. When I visited MIRI for a MIRI/CHAI social in 2018, I seriously suggested a break-out group in which we would figure out corrigibility (or the desirable property pointed at by corrigibility-adjacent intuitions) in two hours. I think more people should try this exact exercise more often—including myself.

0LawrenceC4y

Yeah, we've also spent a while (maybe ~5 hours total?) in various CHAI meetings (some of which you've attended) trying to figure out the various definitions of corrigibility to no avail, but those notes are obviously not public. :( That being said I don't think failing in several hours of meetings/a few unpublished attempts is that much evidence of the difficulty?

3TurnTrout4y

I just remembered (!) that I have more public writing disentangling various forms of corrigibility, and their benefits—Non-obstruction: A simple concept motivating corrigibility.

2Vaniver4y

I think all three of Eliezer, you, and I share the sense that corrigibility is perhaps philosophically simple. The problem is that for it to actually have a small central core / be a natural stance, you need the 'import philosophy' bit to also have a small central core / be natural, and I think those bits aren't true. Like, the 'map-territory' distinction seems to me like a simple thing that's near the core of human sanity. But... how do I make an AI that sees the map-territory distinction? How do I check that its plans are correctly determining the causal structure such that it can tell the difference between manipulating its map and manipulating the territory? [And, importantly, this 'philosophical' AI seems to me like it's possibly alignable, and a 'nonphilosophical' AI that views its projections as 'the territory' is probably not alignable. But it's really spooky that all of our formal models are of this projective AI, and maybe we will be able to make really capable systems using it, and rather than finding the core of philosophical competence that makes the system able to understand the map-territory distinction, we'll just find patches for all of the obvious problems that come up (like the abulia trap, where the AI system discovers how to wirehead itself and then accomplishes nothing in the real world) and then we're killed by the non-obvious problems.]

[-]Vanessa Kosoy4y50

I'm on record as early as 2008 as saying that I expected superintelligences to crack protein folding, some people disputed that and were all like "But how do you know that's solvable?" and then AlphaFold 2 came along and cracked the protein folding problem they'd been skeptical about, far below the level of superintelligence.

This is tangential nitpicking (I agree that protein folding is solvable), but I don't think AlphaFold 2 entirely cracked it. AFAIU, AF2 relies on multiple sequence alignment as part of its input: the sequences of homologous proteins from different species. This is a standard method to simplify the problem, because by observing that certain parts of the sequence tend to vary together between homologues, you can guess that they correspond to chain fragments that are adjacent in the folded configuration.

Ofc even so it is very impressive and has plenty of applications. But, if you want to invent your own proteins from scratch, this is not good enough.
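To illustrate the co-variation signal mentioned above, here is a toy sketch that scores pairs of alignment columns by mutual information. This is only meant to show why homologous sequences carry contact information; it is not a description of how AlphaFold 2 itself consumes the alignment, and the six-sequence alignment and function names are hypothetical.

```python
from collections import Counter
from math import log2

def column(msa, i):
    """Residues found at position i across all aligned sequences."""
    return [seq[i] for seq in msa]

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    count_i = Counter(column(msa, i))
    count_j = Counter(column(msa, j))
    count_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi

def top_covarying_pairs(msa, k=1):
    """Return the k column pairs with the highest mutual information."""
    length = len(msa[0])
    scores = [((i, j), mutual_information(msa, i, j))
              for i in range(length) for j in range(i + 1, length)]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:k]

# Hypothetical toy alignment: positions 1 and 4 always swap K/E together,
# as a compensating pair of charged residues in contact might.
toy_msa = ["AKGLE", "AKGLE", "AEGLK", "AEGLK", "AKGVE", "AEGIK"]
best_pair, best_score = top_covarying_pairs(toy_msa, k=1)[0]
```

In the toy alignment the highest-scoring pair is the one that was constructed to vary together; real pipelines have to correct for phylogenetic bias and indirect correlations, which this sketch ignores.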

[-]Eliezer Yudkowsky4y70

Well, if viewing it on that level, AlphaFold 2 didn't crack the full problem because it doesn't let you put in a chemical function and get out a protein which performs that function while subject to other constraints of a surrounding wet system, which is the protein folding problem you have to solve to get wet nanotech out the other end, which is why we don't already have general wet nanotech today.

[-]habryka3y20Review for 2021 Review

I think this is my second-favorite post in the MIRI dialogues (for my overall review see here).

I think this post was valuable to me in a much more object-level way. I think this post was the first post that actually just went really concrete on the current landscape of efforts in the domain of AI Notkilleveryonism and talked concretely about what seems feasible for different actors to achieve, and what isn't, in a way that parsed for me, and didn't feel either like something obviously political, or delusional.

I didn't find the part about differ... (read more)

[-]Matthew Barnett4y20

I do think that if you get an AGI significantly past human intelligence in all respects, it would obviously tend to FOOM. I mean, I suspect that Eliezer fooms if you give an Eliezer the ability to backup, branch, and edit himself.

What improvements would you make to your brain that you would anticipate yielding greater intelligence? I can think of a few possible strategies:

Just adding a bunch of neurons everywhere. Make my brain bigger.
Study how very smart brains look, and try to make my brain look more like theirs.

For an AI, the first strategy is equivalen... (read more)

4Daniel Kokotajlo4y

EY knows more neuroscience than me (I know very little) but here's a 5-min brainstorm of ideas:
--For a fixed compute budget, spend more of it on neurons associated with higher-level thought (the neocortex?) and less of it on neurons associated with e.g. motor control or vision.
--Assuming we are an upload of some sort rather than a physical brain, tinker with the rules a bit so that e.g. neuron waste products get magically deleted instead of having to be pumped out, neurons never run out of energy/oxygen and need to rest, etc.
--Study situations where you are in "peak performance" or "flow" and then explore ways to make your brain enter those states at will.
--Use ML pruning techniques to cut away neurons that aren't being useful, to get slightly crappier mini-Eliezers that cost 10% the compute. These can then automate away 90% of your cognition, saving you enough compute that you can either think a few times faster or have a few copies running in parallel.
--Build automated tools that search through your brain for circuits that are doing something pretty simple, like a giant OR gate or an oscillator, and then replace those circuits with small bits of code, thereby saving significant compute. If anything goes wrong, no worries, just revert to backup.
This was a fun exercise!

[-]David Scott Krueger (formerly: capybaralet)4y20

I'm torn because I mostly agree with Eliezer that things don't look good, and most technical approaches don't seem very promising.

But the attitude of unmitigated doomyness seems counter-productive.
And there are obviously things worth doing and working on and people getting on with it.

It seems like Eliezer is implicitly focused on finding an "ultimate solution" to alignment that we can be highly confident solves the problem regardless of how things play out. But this is not where the expected utility is. The expected utility is mostly in buying ti... (read more)

4David Scott Krueger (formerly: capybaralet)4y

I guess actually the goal is just to get something aligned enough to do a pivotal act. I don't see though why an approach that tries to maintain a relatively-sufficient level of alignment (relative to current capabilities) as capabilities scale couldn't work for that.

[-]LawrenceC4y60

Yudkowsky mentions this briefly in the middle of the dialogue:

I don't know however if I should be explaining at this point why "manipulate humans" is convergent, why "conceal that you are manipulating humans" is convergent, why you have to train in safe regimes in order to get safety in dangerous regimes (because if you try to "train" at a sufficiently unsafe level, the output of the unaligned system deceives you into labeling it incorrectly and/or kills you before you can label the outputs), or why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (qualitatively new thought processes, things being way out of training distribution, and, the hardest part to explain, corrigibility being "anti-natural" in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior ("consistent utility function") which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off).

Basically, there are reasons to expect that alignment techniques that work in smaller, safe regimes fail in larger, unsafe regimes. For ... (read more)

1David Scott Krueger (formerly: capybaralet)4y

I'm familiar with these claims, and (I believe) the principal supporting arguments that have been made publicly. I think I understand them reasonably well. I don't find them decisive. Some aren't even particularly convincing. A few points:
- EY sets up a false dichotomy between "train in safe regimes" and "train in dangerous regimes". In the approaches under discussion there is an ongoing effort (e.g. involving some form(s) of training) to align the system, and the proposal is to keep this effort ahead of advances in capability (in some sense).
- The first 2 claims for why corrigibility wouldn't generalize seem to prove too much -- why would intelligence generalize to "qualitatively new thought processes, things being way out of training distribution", but corrigibility would not?
- I think the last claim -- that corrigibility is "anti-natural" -- is more compelling. However, we don't actually understand the space of possible utility functions and agent designs well enough for it to be that strong. We know that any behavior is compatible with a utility function, so I would interpret Eliezer's claim as relating to the complexity of description length of utility functions that encode corrigible behavior. Work on incentives suggests that removing manipulation incentives might add very little complexity to the description of the utility function, for an AI system that already understands the world well. Humans also seem to find it simple enough to add the "without manipulation" qualifier to an objective.

1Ramana Kumar4y

This sounds confused to me: the intelligence is the "qualitatively new thought processes". The thought processes aren't some new regime that intelligence has to generalize to. Also to answer the question directly, I think the claim is that intelligence (which I'd say is synonymous for these purposes with capability) is simpler and more natural than corrigibility (i.e., the last claim - I don't think these three claims are to be taken separately). People keep saying this but it seems false to me. I've seen the construction for history-based utility functions that's supposed to show this, and don't find it compelling -- it seems not to be engaging with what EY is getting at with "coherent planning behavior". Is there a construction for (environment)-state-based utility functions? I'm not saying that is exactly the right formalism to demonstrate the relationship between coherent behaviour and utility functions, but it seems to me closer to the spirit of what EY is getting at. (This comment thread on the topic seems pretty unresolved to me.)
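For readers who have not seen it, a minimal sketch of the kind of history-based construction referred to above, assuming a deterministic policy and histories that record only actions (observations are omitted for brevity). The action set, the example policy, and the function names are all hypothetical; the sketch shows only that such a utility function exists, not whether it captures "coherent planning behavior".

```python
from itertools import product

ACTIONS = ("left", "right")  # hypothetical action set

def policy(history):
    """An arbitrary deterministic policy: past actions -> next action.

    Here it simply alternates, starting with "right"; any other rule works.
    """
    return "right" if len(history) % 2 == 0 else "left"

def utility(history):
    """History-based utility: 1 if every action matches the policy, else 0."""
    matches = all(policy(history[:t]) == action
                  for t, action in enumerate(history))
    return 1 if matches else 0

def best_histories(horizon):
    """Enumerate all action histories and keep the utility-maximizing ones."""
    all_histories = list(product(ACTIONS, repeat=horizon))
    best = max(utility(h) for h in all_histories)
    return [h for h in all_histories if utility(h) == best]
```

By construction, the only utility-maximizing history is the one the policy itself produces, so the policy is trivially "optimal" for this utility function. The open question in the thread is whether anything similar holds for utility functions over environment states.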

[-]Vivek Hebbar4y10

What probability do you assign to the proposition "Prosaic alignment will fail"?

Purely based on your inside view model
After updating on everyone else's views

Same question for:

"More than 50% of the prosaic alignment work done by the top 7 researchers is nearly useless"

[-]localdeity4y10

If you have an untrustworthy general superintelligence generating English strings meant to be "reasoning/arguments/proofs/explanations" about eg a nanosystem design, then I would not only expect the superintelligence to be able to fool humans in the sense of arguing for things that were not true in a way that fooled the humans, I'd expect the superintelligence to be able to covertly directly hack the humans in ways that I wouldn't understand even after having been told what happened.

I can't deny that this is possible. Lying is obviously possible; lyi... (read more)

1Donald Hobson4y

Let's say you use all these filtering tricks. I have no strong intuitions about whether these are actually sufficient to stop those kinds of human manipulation attacks. (Of course, if your computer security isn't flawless, it can hack whatever computer system it's on and bypass all these filters to show the humans arbitrary images and probably access the internet.) But maybe you can, at quite significant expense, make a Faraday cage sandbox, and then use these tricks. This is beyond what most companies will do in the name of safety. But MIRI or whoever could do it. Then they ask the superintelligence about nanosystems, and very carefully read the results. Then presumably they go and actually try to build nanosystems. Of course you didn't expect the superintelligence's advice to be correct, did you? And not wrong in an easily detectable, fail-safe way either. Your concepts and paradigms are all subtly malicious. Not clear, testable, factually wrong statements, but nasty tricks hidden in the invisible background assumptions.

1localdeity4y

Well, if you restrict yourself to accepting the safe, testable advice, that may still be enough to put you enough years ahead of your competition to develop FAI before they develop AI. My meta-point: These methods may not be foolproof, but if currently it looks like no method is foolproof—if, indeed, you currently expect a <10% chance of success (again, a number I made up from the pessimistic impression I got)—then methods with a 90% chance, a 50% chance, etc. are worthwhile, and furthermore it becomes worth doing the work to refine these methods and estimate their success chances and rank them. Dismissing them all as imperfect is only worthwhile when you think perfection is achievable. (If you have a strong argument that method M and any steelmanning of it has a <1% chance of success, then that's good cause for dismissing it.)

1Donald Hobson4y

Under the Eliezerian view (the pessimistic view that is producing <10% chances of success), these approaches are basically doomed. (See logistic success curve.) Now I can't give overwhelming evidence for this position. Wisps of evidence maybe, but not an overwhelming mountain of it. Under these sorts of assumptions, building a container for an arbitrary superintelligence such that it has only an 80% chance of being immediately lethal, and a 5% chance of being marginally useful, is an achievement. (And all possible steelmannings, that's a huge space.)

[-]dxu4y*10

Eliezer Yudkowsky

Alpha Zero scales with more computing power, I think AlphaFold 2 scales with more computing power, Mu Zero scales with more computing power. Precisely because GPT-3 doesn't scale, I'd expect an AGI to look more like Mu Zero and particularly with respect to the fact that it has some way of scaling.

I think this view dovetails quite strongly with the view expressed in this comment by maximkazhenkov:

Progress in model-based RL is far more relevant to getting us closer to AGI than other fields like NLP or image recognition or neuroscience or ML hardware. I worry that once the research community shifts its focus towards RL, the AGI timeline will collapse - not necessarily because there are no more critical insights left to be discovered, but because it's fundamentally the right path to work on and whatever obstacles remain will buckle quickly once we throw enough warm bodies at them. I think - and this is highly controversial - that the focus on NLP and Vision Transformer has served as a distraction for a couple of years and actually delayed progress towards AGI.

If curiosity-driven exploration gets thrown into the mix and Starcraft/Dota gets solved (for real th

... (read more)

[-]jimrandomh4y00

It seems to me that the surprising simplicity of current-generation ML algorithms is a big part of the problem.

As a thought experiment: suppose you had a human brain, with the sort of debug access you'd have with a neural net; ie, you could see all the connections, edge weights, and firings, and had a decent multiple of the compute the brain has. Could you extract something like a verbal inner monologue, a text stream that was strongly predictive of that human's plans? I don't think it would be trivial, but my guess is that you could. It wouldn't hold up against a meditator optimizing against you, but it would be a solid starting point.

Could you do the same thing to GPT-3? No; you can't get language out of it that predicts its plans, because it doesn't have plans. Could you do the same thing to AlphaZero? No, you can't get language out of it that predicts its plans, because it doesn't use language.

This analogy makes me think neural net transparency might not be as doomed as the early results would suggest; they aren't finding human-legible low-dimensional representations of things because those representations aren't present (GPT-3) or have nothing human-legible to match up to (Alp... (read more)

5TurnTrout3y

Wish granted!

