2021 AI Alignment Literature Review and Charity Comparison

Ought co-founder here. Seems worth clarifying how Elicit relates to alignment (cross-posted from EA forum):

1 - Elicit informs how to train powerful AI through decomposition

Roughly speaking, there are two ways of training AI systems:

- End-to-end training
- Decomposition of tasks into human-understandable subtasks

We think decomposition may be a safer way to train powerful AI if it can scale as well as end-to-end training.

Elicit is our bet on the compositional approach. We’re testing how feasible it is to decompose large tasks like “figure out the answer to this science question by reading the literature” by breaking them into subtasks like:

- Brainstorm subquestions that inform the overall question
- Find the most relevant papers for a (sub-)question
- Answer a (sub-)question given an abstract for a paper
- Summarize answers into a single answer
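As a rough illustration of what such a decomposition could look like in code, here is a minimal Python sketch. It is not Elicit's actual implementation: every name in it (`Model`, `Paper`, `answer_question`, the prompt strings, the keyword-overlap retrieval, and the `toy_model` stand-in for a language model) is a hypothetical chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: anything that maps a prompt string to a completion.
# A real system would wrap a language model call here.
Model = Callable[[str], str]


@dataclass
class Paper:
    title: str
    abstract: str


def brainstorm_subquestions(model: Model, question: str) -> List[str]:
    """Subtask 1: ask the model for subquestions, one per line."""
    reply = model(f"List subquestions that inform: {question}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def find_relevant_papers(corpus: List[Paper], question: str, k: int = 2) -> List[Paper]:
    """Subtask 2: rank papers by naive keyword overlap (a stand-in for real retrieval)."""
    words = set(question.lower().split())

    def overlap(paper: Paper) -> int:
        return len(words & set(paper.abstract.lower().split()))

    return sorted(corpus, key=overlap, reverse=True)[:k]


def answer_from_abstract(model: Model, question: str, paper: Paper) -> str:
    """Subtask 3: answer one (sub-)question from one abstract."""
    return model(f"Abstract: {paper.abstract}\nQuestion: {question}\nAnswer:")


def summarize(model: Model, question: str, answers: List[str]) -> str:
    """Subtask 4: combine the partial answers into a single answer."""
    joined = "\n".join(answers)
    return model(f"Question: {question}\nPartial answers:\n{joined}\nSummary:")


def answer_question(model: Model, corpus: List[Paper], question: str) -> str:
    """Chain the four subtasks; each intermediate result is human-readable."""
    partial_answers = []
    for subquestion in brainstorm_subquestions(model, question):
        for paper in find_relevant_papers(corpus, subquestion):
            partial_answers.append(answer_from_abstract(model, subquestion, paper))
    return summarize(model, question, partial_answers)


def toy_model(prompt: str) -> str:
    """Deterministic stand-in for a language model, for demonstration only."""
    if prompt.startswith("List subquestions"):
        return "Does caffeine affect sleep?\nDoes caffeine affect focus?"
    if prompt.startswith("Abstract:"):
        return "partial answer"
    return f"summary of {prompt.count('partial answer')} partial answers"


corpus = [
    Paper("Sleep study", "caffeine and sleep duration in adults"),
    Paper("Focus study", "caffeine and focus during long tasks"),
]
result = answer_question(toy_model, corpus, "Is caffeine good for productivity?")
```

The point of the structure is that each intermediate step (the subquestions, the chosen papers, the per-paper answers) is a readable artifact that a human could inspect or give feedback on, rather than a hidden part of one end-to-end call.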

Over time, more of this decomposition will be done by AI assistants.

At each point in time, we want to push the compositional approach to the limits of current language models, and keep up with (or exceed) what’s possible through end-to-end training. This requires that we overcome engineering barriers in gathering human feedback and orchestrating calls to models in a way that doesn’t depend much on current architectures.

I view this as the natural continuation of our past work where we studied decomposition using human participants. Unlike then, it’s now possible to do this work using language models, and the more applied setting has helped us a lot in reducing the gap between research assumptions and deployment.

2 - Elicit makes AI differentially useful for AI & tech policy, and other high-impact applications

In a world where AI capabilities scale rapidly, I think it’s important that these capabilities can support research aimed at guiding AI development and policy, and more generally help us figure out what’s true and make good plans as much as they help persuade and optimize goals with fast feedback or easy specification.

Ajeya mentions this point in The case for aligning narrowly superhuman models:

"Better AI situation in the run-up to superintelligence: If at each stage of ML capabilities progress we have made sure to realize models’ full potential to be helpful to us in fuzzy domains, we will be going into the next stage with maximally-capable assistants to help us navigate a potentially increasingly crazy world. We’ll be more likely to get trustworthy forecasts, policy advice, research assistance, and so on from our AI assistants. Medium-term AI challenges like supercharged fake news / clickbait or AI embezzlement seem like they would be less severe. People who are pursuing more easily-measurable goals like clicks or money seem like they would have less of an advantage over people pursuing hard-to-measure goals like scientific research (including AI alignment research itself). All this seems like it would make the world safer on the eve of transformative AI or AGI, and give humans more powerful and reliable tools for dealing with the TAI / AGI transition."

Beth mentions the more general point in Risks from AI persuasion under possible interventions:

“Instead, try to advance applications of AI that help people understand the world, and advance the development of truthful and genuinely trustworthy AI. For example, support API customers like Ought who are working on products with these goals, and support projects inside OpenAI to improve model truthfulness.”

Owain_Evans, 4y

Re: the Bay Area vs. other places. At this point, there's a fair amount of (messy) empirical evidence about how much being in the Bay Area impacts performance relative to being in other places. You could match organizations by area of research and do a comparison between the Bay and London/Oxford/Cambridge. E.g. OpenAI and Anthropic vs. DeepMind, OpenPhil (long-termist research) vs. FHI-GPI-CSER, CHAI vs. Oxford and DeepMind. While people are not randomly assigned to these organizations, there is enough overlap of personnel that the observational evidence is likely to be meaningful. This kind of comparison seems preferable to general arguments such as "the Bay Area is expensive and has bad epistemics".

(In terms of general arguments, I'd also mention that the Bay Area has the best track record in the world by a huge margin for producing technology companies and is among the top 5 regions in the world for cutting-edge scientific research.)

ETA: I tried to clarify my thoughts in the reply to Larks.

Larks, 4y

Is your argument about personnel overlap that one could do some sort of mixed-effects regression, with location as the primary independent variable and controls for individual productivity? If so, I'm somewhat skeptical about the tractability: the sample size is not that big, the data seems messy, and I'm not sure it would necessarily capture the fundamental thing we care about. I'd be interested in the results if you wanted to give it a go though!

More importantly, I'm not sure this analysis would be that useful. Geography-based priors only really seem useful for factors we can't directly observe; for an organization like CHAI, our direct observations will almost entirely screen off this prior. The prior is only really important for factors where direct measurement is difficult, and hence we can't update away from the prior, but for those we can't do the regression. (Though I guess we could do the regression on known firms/researchers and extrapolate to new unknown orgs/individuals.)

The way this plays out here is that we've already spent the vast majority of the article examining the research productivity of the organizations; geography-based priors only matter insofar as you think they can proxy for something else that is not captured in this.

As befits this being a somewhat secondary factor, it's worth noting that I think (though I haven't explicitly checked) I have in the past supported Bay Area organisations more than non-Bay-Area ones.

Owain_Evans, 4y

I agree with most of this -- and my original comment should have been clearer. I'm wondering if the past five years of direct observations lead you to update the geography-based prior (which has been included in your alignment review since 2018). How much do you expect the quality of alignment work to differ between a new organization based in the Bay and one based somewhere else? (No need to answer: I realize this is probably a small consideration and I don't want to start an unproductive thread on this topic.)

Neel Nanda, 4y

"I do wonder if vision problems are unusually tractable here; would it be so easy to visualise what individual neurons mean in a language model?"

We actually released our first paper trying to extend Circuits from vision to language models yesterday! You can't quite interpret individual neurons, but we've found some examples of where we can interpret what an individual attention head is doing.

Kaj_Sotala, 4y

I would be happy to see you write a top-level post about this paper. :)

Neel Nanda, 4y

Thanks! I'm probably not going to have time to write a top-level post myself, but I liked Evan Hubinger's post about it.

evhub, 4y

(Moderation note: added to the Alignment Forum from LessWrong.)

Owain_Evans, 4y

This is a fantastic resource and seems like a great project for a research assistant. As with Rohin Shah's alignment newsletter, I'm excited to see this project continue and (potentially) expand.

Larks, 4y

Thanks, that's very kind of you!

Vanessa Kosoy, 4y

I noticed that you didn't mention infra-Bayesianism, not in 2020 and not this year. Any particular reason?

Larks, 4y

I prioritized posts by named organizations.
- Diffractor does not list any institutional affiliations on his user page.
- No institution I noticed listed the post/sequence on their 'research' page.
- No institution I contacted mentioned the post/sequence.
- No post in the sequence was that high in the list of 2021 Alignment Forum posts, sorted by karma.
- Several other filtering methods also did not identify the post.

However, upon reflection, it does seem to be MIRI-affiliated, so perhaps it should have been included; if I have time I may review it and edit it in later.

Vanessa Kosoy, 4y

Notice that in MIRI's summary of 2020 they wrote "From our perspective, our most interesting public work this year is Scott Garrabrant’s Cartesian frames model and Vanessa Kosoy’s work on infra-Bayesianism."

Owain_Evans, 4y

Evans et al.'s Truthful AI: Developing and governing AI that does not lie is a detailed and length piece discussing a lot of issues around truthfulness for AI agents. This includes conceptual, practical and governance issues, especially with regard conversation bots. They argue for truthfulness (or at least, non-negligently-false)

The link should include "that does not lie".
length --> lengthy

Lin et al.'s TruthfulQA: Measuring How Models Mimic Human Falsehoods provides a series of test questions to study how 'honest' various text models are. Of course, these models are trying to copy human responses, not be honest, so because many of the questions allude to common misconceptions, the more advanced models 'lie' more often. Interestingly they also used GPT-3 to evaluate the truth of these answers. See also the discussion here. Researchers from OpenPhil were also named authors on the paper. #Other

"OpenPhil" --> OpenAI
As a minor clarification, all the results in the paper are based on human evaluation of truth. But we show that GPT-3 can be used as a fairly reliable substitute for human evaluation under certain conditions.

Larks, 4y

Thanks, fixed in both copies.

