How does one write a good and useful review of a technical post on the Alignment Forum?

I don’t know. Like many people, I tend to comment and give feedback on posts closely related to my own research, or to write down my own ideas when reading the paper. Yet this is quite different from the quality peer-review that you can get (if you’re lucky) in more established fields. And from experience, such quality reviews can improve the research dramatically, give some prestige to it, and help people navigate the field.

In an attempt to understand what makes a good review for the Alignment Forum, Joe Collman, Jérémy Perret (Gyrodiot on LW) and me are launching a project to review many posts in depth. The goal is to actually write reviews of various posts, get feedback on their usefulness from authors and readers alike, and try to extract from them some knowledge about how to go about doing such reviews for the field. We hope to have enough insights to eventually write some guidelines that could be used in an official AF review process.

On that note, despite the support of members of the LW team, this project isn’t official. It’s just the three of us trying out something.

Now, the reason for the existence of this post (and why it is a question) is that we’re looking for posts to review. We already have some in mind, but they are necessarily biased towards what we’re more comfortable about. This is where you come in, to suggest a more varied range of posts.

Anything posted on the AF goes, although we will not take into account things that are clearly not “research outputs” (like transcripts of podcasts or pointers to surveys). This means that posts about specific risks, about timelines, about deconfusion, about alignment schemes, and more, are all welcome.

We would definitely appreciate it if you add a reason to your suggestion, to help us decide whether to include the post on our selection. Here is a (non-exhaustive) list of possible reasons:

  • This post is one of the few studying this very important question
  • This is my post and I want some feedback
  • This post was interesting but I cannot decide what to make of it
  • This post is very representative of a way to do AI Alignment research
  • This post is very different from most of AI Alignment research

Thanks in advance, and we’re excited about reading your suggestions!

New Answer
New Comment

7 Answers sorted by

johnswentworth

170

Related to the role of peer review: a lot stuff on LW/AF is relatively exploratory, feeling out concepts, trying to figure out the right frames, etc. We need to be generally willing to ask discuss incomplete ideas, stuff that hasn't yet had the details ironed out. For that to succeed, we need community discussion standards which tolerate a high level of imperfect details or incomplete ideas. I think we do pretty well with this today.

But sometimes, you want to be like "come at me bro". You've got something that you're pretty highly confident is right, and you want people to really try to shoot it down (partly as a social mechanism to demonstrate that the idea is in fact as solid and useful as you think it is). This isn't something I'd want to be the default kind of feedback, but I'd like for authors to be able to say "come at me bro" when they're ready for it, and I'd like for posts which survive such a review to be perceived as more epistemically-solid/useful.

With that in mind, here's a few of my own AF posts which I'd submit for a "come at me bro" review:

  • Probability as Minimal Map - I claim this is both a true and useful interpretation of probability distributions. Come at me bro.
  • Public Static: What Is Abstraction - I claim that this captures all of the key pieces of what "abstraction" means. Come at me bro.
  • Writing Causal Models Like We Write Programs - I claim that this approach fully captures the causal semantics of typical programming languages, the "gears of computation", and "what programs mean". Come at me bro.
  • The Fusion Power Generator Scenario (and this comment) - I claim that any alignment scheme which relies on humans using an AI safely, or relies on humans asking the right questions, is either very limited or not safe. (In particular, this includes everything in the HCH cluster.) Come at me bro.
  • Human Values Are A Function Of Humans' Latent Variables - I claim that this captures all of the conceptually-difficult pieces of "what are human values?", and shows that those conceptual difficulties can be faithfully captured in a Bayesian framework. Come at me bro.

For all of these, things like "this frame is wrong" or "this seems true but not useful" are valid objections. I'm not just claiming that the proofs hold.

But sometimes, you want to be like "come at me bro". You've got something that you're pretty highly confident is right, and you want people to really try to shoot it down (partly as a social mechanism to demonstrate that the idea is in fact as solid and useful as you think it is). This isn't something I'd want to be the default kind of feedback, but I'd like for authors to be able to say "come at me bro" when they're ready for it, and I'd like for posts which survive such a review to be perceived as more epistemically-solid/useful.

Yeah, when I think about implementing a review process for the Alignment Forum, I'm definitely thinking about something you can ask for more polished research, in order to get external feedback and a tag saying this is peer review (for prestige and reference).

Thanks for the suggestions! We'll consider them. :)

Charlie Steiner

70

Steve's big thoughts on alignment in the brain probably deserve a review. Component posts include  https://www.lesswrong.com/posts/diruo47z32eprenTg/my-computational-framework-for-the-brain , https://www.lesswrong.com/posts/DWFx2Cmsvd4uCKkZ4/inner-alignment-in-the-brain , https://www.lesswrong.com/posts/jNrDzyc8PJ9HXtGFm/supervised-learning-of-outputs-in-the-brain

Interestingly, I think there aren't any of my posts I should recommend - basically all of them are speculative. However, I did have a post called Gricean communication and meta-preferences that I think is still fairly interesting speculation, and I've never gotten any feedback on it at all, so I'll happily ask for some: https://www.lesswrong.com/posts/8NpwfjFuEPMjTdriJ/gricean-communication-and-meta-preferences .  

Jaime Sevilla

70

Suggestion 1: Utility != reward by Vladimir Mikulik. This post attempts to distill the core ideas of mesa alignment. This kind of distillment increases the surface area of AI Alignment, which is one of the key bottlenecks of the area (that is, getting people familiarized with the field, motivated to work on it and with a handle on some open questions to work on). I would like an in-depth review because it might help us learn how to do it better!

Suggestion 2: me and my coauthor Pablo Moreno would be interested in feedback in our post about quantum computing and AI alignment. We do not think that the ideas of the paper are useful in the sense of getting us closer to AI alignment, but I think it is useful to have signpost explaining why avenues that might seem attractive to people coming into the field are not worth exploring, while introducing them to the field in a familiar way (in this case our audience are quantum computing experts). One thing that confuses me is that some people have approached me after publishing the post asking me why I think that quantum computing is useful for AI alignment, so I'd be interested in feedback on what went wrong on the communication process given the deflationary nature of the article. 

Daniel Kokotajlo

30

Great idea! Thanks for doing this!

Unsurprisingly, I'd love it if you reviewed any of my posts.

Since you said "technical," I suggest this one in particular. It's a big deal IMO because Armstrong & Mindermann's argument has been cited approvingly by many people and seems to be still widely regarded as correct, but if I'm right, it's actually a bad argument. I'd love a third perspective on this that helps me figure out what's going on.

More generally I'd recommend sorting all AF posts by karma and reviewing the ones that got the most, since presumably karma correlates with how much people here like the post and thus it's extra important to find flaws in high-karma posts.

I was indeed expecting you to suggest one of your post. But that's one of the valid reasons I listed, and I didn't know about this one, so it's great!

We'll consider it. :)

1Daniel Kokotajlo
Insofar as you want to do others of mine, my top recommendation would be this one since it got less feedback than I expected and is my most important timelines-related post of all time IMO.
1Adam Shimi
If we do only one, which one do you think matters the most?
2Daniel Kokotajlo
I'm more interested in feedback on the +12 OOMs one because it's more decision-relevant. It's more of a fuzzy thing, not crunchy logic like the first one I recommended, and therefore less suitable for your purposes (or so I thought when I first answered your question, now I am not sure)

Gordon Seidoh Worley

30

I wrote this post as a summary of a paper I published. It didn't get much attention, so I'd be interesting in having you all review it.

 https://www.lesswrong.com/posts/JYdGCrD55FhS4iHvY/robustness-to-fundamental-uncertainty-in-agi-alignment-1

To say a little more, I think the general approach I lay out in here for taking towards safety work is worth considering more deeply and points towards a better process for choosing interventions in attempts to build aligned AI. I think what's more important than the specific examples where I apply the method is the method itself, but thus far as best I can tell folks did not much engage with that, so unclear to me if that's because they disagree, think it's too obvious, or what.

Thanks for the suggestion! It's great to have some methodological posts!

We'll consider it. :)

Gordon Seidoh Worley

30

I think the generalized insight from Armstrong's no free lunch paper is still underappreciated in that I sometimes see papers that, to me, seem to run up against this and fail to realize there's a free variable in their mechanisms that needs to be fixed if they want them to not go off in random directions.

https://www.lesswrong.com/posts/LRYwpq8i9ym7Wuyoc/other-versions-of-no-free-lunch-in-value-learning

Thanks for the suggestion!

I didn't know about this post. We'll consider it. :)

Gordon Seidoh Worley

20

Another post of mine I'll recommend you:

https://www.lesswrong.com/posts/k8F8TBzuZtLheJt47/deconfusing-human-values-research-agenda-v1

This is the culmination of a series of post on "formal alignment", where I start out saying "what it would mean to formally state what it would mean to build aligned AI" and then from that try to figure out what we'd have to figure out in order to achieve that.

Over the last year I've gotten pulled in other directions so not pushed this line of research forward much, plus I reached a point with it where it was clear it required different specialization than I have to make additional progress, but I still think it presents a different approach to what others are doing in the space of work towards AI alignment and think you might find it interesting to review (along with the preceding posts in the series) for that reason.

Thanks for the suggestion!

We want to go through the different research agendas (and I already knew about yours), as they give different views/paradigms on AI Alignment. Yet I'm not sure how relevant a review of such posts are. In a sense, the "reviewable" part is the actual research that underlies the agenda, right?

1Joe Collman
I don't see a good reason to exclude agenda-style posts, but I do think it'd be important to treat them differently from more here-is-a-specific-technical-result posts. Broadly, we'd want to be improving the top-level collective AI alignment research 'algorithm'. With that in mind, I don't see an area where more feedback/clarification/critique of some kind wouldn't be helpful. The questions seem to be: What form should feedback/review... take in a given context? Where is it most efficient to focus our efforts? Productive feedback/clarification on high-level agendas seems potentially quite efficient. My worry would be to avoid excessive selection pressure towards paths that are clear and simply justified. However, where an agenda does use specific assumptions and arguments to motivate its direction, early 'review' seems useful.