I want to make an actionable map of AI alignment.
After years of reading papers, blog posts, online exchanges, books, and occasionally hidden documents about AI alignment and AI risk, and of having extremely interesting conversations about them, I find that most arguments I encounter feel familiar at best, rehashed at worst. This should mean I have a good map of the field under discussion.
I have been, however, frustrated by how little actual advice I could derive from this map. The message I understood from most agendas was "here are the tractable sub-problems we will work on and why they should be solved". I didn't find much justification for why they are critically important, or why one particular set of problems should be explored instead of the next research group's set.
So I looked harder. I found useful mapping work, yet nothing quite like what I was looking for. I also found related concerns in this post and this comment thread.
You'll find, in the following sections, my (current) selection of:
"cartography work", to draw a map of relevant arguments and concepts;
research agendas, from research groups or individuals;
points of entry for newcomers.
Here are the caveats. The list is not exhaustive, though I did try to cover as many visible ideas as possible, and there is significant overlap and cross-referencing between the items listed here. Some references I consider useful (e.g. this) have not made the cut. I attempted to categorize the resources by focus, but a handful could have ended up in a different category; please don't rely on the categorization too much.
My comments aren't summaries so much as justifications for why I included each reference. They also liberally reuse the original formulations. Please tell me if I have strongly misrepresented any of the ideas.
All these references, and countless comments scattered across LessWrong, the Alignment Forum, and the Effective Altruism Forum, will hopefully help me build something actionable: something that would let newcomers and experts explore the field with more clarity and make better decisions.
My short-term plan is to create minimal interactive explanations for the relevance of various propositions in AI alignment, with the option to question and expand their premises. I want to do this for a first few high-level ideas, and if it goes well, expand to a first full scenario.
The long-term plan is to map as many propositions and available scenarios as possible, to have a common framework in which to compare research directions. My intuition (to be challenged) is that there’s broad agreement in the field on most premises I could describe, and that we would benefit a lot from locating cruxes (e.g. here). My overarching motivation is to reduce research debt.
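To make the short-term plan more concrete, here is a minimal sketch of the kind of structure I have in mind: propositions linked by typed relationships (support, conditional support, entailment), whose premises can be expanded and flagged as cruxes. This is purely illustrative Python; all names and the example propositions are hypothetical, not drawn from any existing tool or agenda.

```python
# Purely illustrative sketch of an argument map: propositions with typed
# links between them, expandable premises, and crux flags. All names here
# are hypothetical.
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    SUPPORTS = "supports"
    CONDITIONALLY_SUPPORTS = "conditionally supports"
    ENTAILS = "entails"
    CONTRADICTS = "contradicts"


@dataclass
class Proposition:
    statement: str
    premises: list["Link"] = field(default_factory=list)
    # A crux: a premise whose rejection would change one's conclusion.
    is_crux: bool = False


@dataclass
class Link:
    source: Proposition
    relation: Relation


# A tiny example fragment of a map.
big_deal = Proposition("Advanced AI would be a very big deal.")
accidents = Proposition(
    "Powerful optimizers can fail catastrophically by accident.",
    is_crux=True,
)
prioritize = Proposition(
    "AI alignment research should be prioritized now.",
    premises=[
        Link(big_deal, Relation.SUPPORTS),
        Link(accidents, Relation.CONDITIONALLY_SUPPORTS),
    ],
)


def expand(prop: Proposition, depth: int = 0) -> None:
    """Print a proposition and, recursively, the premises it rests on."""
    marker = " [crux]" if prop.is_crux else ""
    print("  " * depth + f"- {prop.statement}{marker}")
    for link in prop.premises:
        print("  " * (depth + 1) + f"({link.relation.value})")
        expand(link.source, depth + 2)


expand(prioritize)
```

An actual map would of course need sources, counterarguments, and a real interface on top of such a skeleton; the point is only that premises and cruxes can be made explicit and queryable.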
The references here will be my first source of information; the second will be discussions. If you are the author of one of the resources below, and/or if you have had more conversations about alignment-related arguments than you can remember, and you want to share your insights, please reach out to me. I will do my best to answer in a timely manner.
Thanks to Adam Shimi, Alexis Carlier and Maxime Riché for reviewing drafts of this post and suggesting resources!
Argument mapping & reviews
Disentangling arguments for the importance of AI safety
Richard Ngo - January 2019
Splits the core motivating arguments for AI safety into six rough categories: maximizers being dangerous, target loading, prosaic alignment, human safety, misuses/vulnerabilities, and large impact.
Makes the case for more clarity around the fundamental ideas, for analysis of the arguments and description of deployment scenarios, and for making the assumptions behind research agendas more explicit.
Clarifying some key hypotheses in AI alignment
Ben Cottier, Rohin Shah - August 2019
Creates a diagram linking hypotheses, scenarios, agendas, and catastrophic problems. It selects for debated and important arguments rather than claiming comprehensiveness, and links ideas through diverse relationships (support, conditional support, entailment, etc.).
The post itself goes into more detail on the hypotheses, with resources listed for each one.
My personal cruxes for working on AI safety
Buck Shlegeris - January 2020
The first section of the talk highlights the limits of heuristic arguments, and the usefulness of spelling out premises and making a deliberate effort to build compelling arguments for your personal stance.
The talk then proceeds to detail the speaker’s own argument for AI alignment work. Many commenters express their gratitude for all this exposition.
How sure are we about this AI stuff?
Ben Garfinkel - February 2019
Runs through the intuitive arguments behind AI risk prioritization: “AI as a big deal”, instability, lock-in, and accidents. Explains why each of them is not forceful on its own, or has missing pieces and details.
Argues that fleshing out these arguments further is a neglected and potentially high-value task.
A shift in arguments for AI risk
Tom Adamczewski - February 2019
Describes the evolution of AI risk arguments, from early descriptions of the alignment problem, to discontinuities as a premise for Bostrom’s Superintelligence, to alignment issues without discontinuity. Also describes non-alignment catastrophes, such as misuse risks.
Calls for clarification of arguments related to AI risk, especially on the subject of discontinuities, for better prioritization, and reduction of costly misunderstandings.
Scenarios, forecasting & strategy
AI Impacts (selected references)
AI Impacts contributors - Since 2014
The website in general is dedicated to building AI forecasting resources, to inform arguments and decisions; some of their content relates closely to AI risk arguments.
What failure looks like
Paul Christiano - March 2019
Describes two scenarios for AI catastrophe which don’t depend on a fast surprise takeover by a powerful AI system. Also notable for the level of engagement in the comments.
Disjunctive Scenarios of Catastrophic AI Risk
Kaj Sotala - February 2018
Breaks down a wide range of scenarios leading to (at least) catastrophic risk, by decomposing them into a variety of factors: strategic advantage, takeoff speed, autonomy acquisition, plurality of agents, etc.
Explores the idea that many different combinations of these factors could be realized, each leading to a catastrophe (as opposed to a single privileged scenario, which may receive too much focus).
The Main Sources of AI Risk?
Wei Dai, Daniel Kokotajlo - March 2019 (last updated March 2020)
Thirty-two (and counting) high-level scenarios for AI catastrophe. Wei Dai emphasizes that they aren’t disjunctive, as some scenarios may subsume or cause others. Daniel Kokotajlo (who maintains and updates the list) suggests it could be refined, expanded and reorganized.
Chris Olah’s views on AGI safety
Evan Hubinger - November 2019
Reports Chris Olah's arguments on the importance of transparency and interpretability, and on how to improve the field of machine learning to make progress on these issues.
Classification of global catastrophic risks connected with artificial intelligence
Alexey Turchin, David Denkenberger - January 2018
Lists and categorizes a wide range of catastrophic scenarios along many factors: narrow or general AI, near-term or long-term, misuse or accident, and more, with references.
Agendas & reports focused on problem framing
Embedded agency
Scott Garrabrant, Abram Demski - November 2018
Clarifies and motivates technical research stemming from the idea of embedded agents, where AI systems are no longer logically separated from their environment, implying modeling and self-modification issues, among others.
Describes the subproblems associated with that hypothesis: decision theory, embedded world-models, robust delegation, and subsystem alignment.
AI Governance: A Research Agenda
Allan Dafoe - August 2018
From the Center for the Governance of AI, Future of Humanity Institute. The agenda aims for breadth over depth, gathering as many questions relevant to AI Governance as possible in 53 pages and providing extensive references for further details. It doesn’t focus on prioritization, nor on tractability or impact estimates.
The questions are divided into three clusters: technical landscape (modeling and forecasting AI progress, mapping AI capabilities, and technical AI safety), AI politics (transformation of government, of the job market, and regulatory concerns), and ideal AI governance (desirable values, institutions and scenarios).
Building safe artificial intelligence: specification, robustness, and assurance
Pedro A. Ortega, Vishal Maini, DeepMind - September 2018
Motivates DeepMind’s technical AI safety research, dividing it into three areas: specification (how to define the purpose of a system, whether explicitly designed or emergent), robustness (how to prevent, anticipate, defend against, and recover from perturbations), and assurance (how to understand, evaluate, and actively control the behavior of a system).
The post defines a broad array of technical terms. The challenges are grounded in problems already present in current AI systems, and in simple environments (gridworlds).
Concrete Problems in AI Safety
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané - June 2016
Describes and motivates five technical safety research problems in machine learning-based systems, tractable through direct experimentation, in toy environments and/or small-scale models. All problems, sub-problems, and proposed abstract solutions are grounded in the existing machine learning literature.
The authors also argue for the increasing relevance of these problems as AI capabilities progress.
Reframing Superintelligence: Comprehensive AI Services as General Intelligence
K. Eric Drexler - January 2019
Expands on Bostrom’s Superintelligence, through a mesh of forty high-level statements addressing the possibility of an intelligence explosion, the nature of advanced machine intelligence, the relationship between goals and intelligence, the use and control of advanced AI, and near/long-term considerations in AI safety & strategy.
The common underlying framing is a service-centered model of general intelligence, suggesting the integration of a diversity of task-oriented systems built incrementally, rather than mostly independent, self-improving superintelligent agents.
AI alignment reviews
AI Alignment Research Overview
Jacob Steinhardt - October 2019
Outlines four broad categories of technical work: technical alignment (how to create aligned AI), detecting failures (how to proactively check for alignment), methodological understanding (best practices), and system-building (how to do the previous three for large systems).
All problems (or sub-problems, for the first category) are explored through a high-level definition, motivation, solution desiderata, possible research avenues, personal takes, and references.
The Landscape of AI Safety and Beneficence Research
Richard Mallah - January 2017
Maps a large set of concepts and techniques in AI safety. The core content can be explored in this interactive visualization. The concepts are primarily organized through a hierarchical map, with secondary links for related ideas. All concepts are given high-level descriptions with references.
The stated purpose of the work is to provide a comprehensive map and a reference set of concepts for the field, to be extended through further research.
AI Alignment 2018-19 Review
Rohin Shah - January 2020
The first section of the post is dedicated to recent work in basic AI risk analysis: new explorations of goal-directedness and comprehensive AI services, as well as new write-ups for or against AI risk (many of which are listed in this very document).
The rest of the post details recent work on the many sub-problems of AI alignment, noting that its 300-plus references were selected from a larger set of around 500 articles and clustered for readability (the reader shouldn’t take the chosen categorization as authoritative).
2019 AI Alignment Literature Review and Charity Comparison
Larks - December 2019
Sorts AI alignment work by its source rather than by topic. It highlights the agendas of the various research teams, lists the collaborations between them, and also references a wide range of independent research.
In addition, the post details the funding of the various organizations involved in the field, and offers methodological comments on prioritization, funding, and research avenues.
AGI Safety Literature Review
Tom Everitt, Gary Lea, Marcus Hutter - May 2018
Focuses specifically on powerful AI systems: plausible conceptual models; forecasting of capability increase and risks; technical safety problems; design ideas and concepts; and public policy.
The paper explores safety problems shared by multiple research agendas, and summarizes a wide range of publications in the domain.
Introductory material
Benefits & Risks of Artificial Intelligence
Future of Life Institute - November 2015 (first version)
Summarizes, in an accessible way, the very high-level case for AI alignment research and the most common naive objections and misconceptions, with references for further reading.
Superintelligence: Paths, Dangers, Strategies
Nick Bostrom - July 2014
Makes the case for the risk from superintelligent entities (not necessarily AI systems, though AI is presented as their most probable origin). The book is an early, long-form, philosophical introduction to numerous concepts such as the control problem, takeoff speeds, the treacherous turn, instrumental convergence, decisive strategic advantage, value loading, and many more.
Human Compatible: Artificial Intelligence and the Problem of Control
Stuart Russell - October 2019
Makes the case for the risk from advanced AI systems through failure of alignment. The book describes the continued progress in AI capabilities, critically reviews the major arguments around AI risk and forecasting, and argues for early safety research, highlighting significant hurdles to solve and possible research avenues.
Potential Risks from Advanced Artificial Intelligence: The Philanthropic Opportunity
Holden Karnofsky - May 2016
Makes the philanthropic case for AI risk research, describing three classes of risk: misuse risk (malevolent or value-locking uses of powerful technology), accident risk (typically stemming from alignment failure), and other risks (such as structural effects due to automation, or the dissemination of increasingly capable tools). Also explains several principles for prioritization work.
Current Work in AI Alignment
Paul Christiano - June 2019
Decomposes the then-current main approaches in AI alignment research by building a tree diagram and giving friendly high-level explanations of the ideas. The exploration is itself biased towards iterated amplification, which is put in its broader context.
Arbital - Explore AI Alignment
Many authors - From 2014 to 2018
Provides detailed explanations for many concepts in AI Alignment, in an explorable way. Now in an archived state.
Robert Miles’s YouTube channel
Robert Miles - Since 2014
Clear and friendly explanations of many concepts in AI alignment. For introductory material, it is best to start with his Computerphile videos, produced before the channel’s creation.
Technical agendas focused on possible solutions
Iterated Amplification
Paul Christiano - October 2018
Describes iterated amplification, an alignment technique for powerful ML-based systems, and spells out the core hypotheses behind the validity of the technique. The fourth section details the associated research directions and desiderata for AI alignment research.
Value Learning
Rohin Shah, Paul Christiano, Stuart Armstrong, Jacob Steinhardt, Owain Evans - October 2018
Investigates and motivates value learning, discussing the arguments stemming from the idea of a powerful AI system pursuing a particular utility function, using human behavior as a data source. Clearly restates the core arguments in the conclusion post.
Reframing Impact
Alex Turner - July 2019
Explores and motivates new ways to work with impact measures, a common component of various approaches in AI safety research, and how to think about scenarios where a powerful AI system takes wide-ranging decisions and actions.
Research Agenda v0.9: Synthesising a human's preferences into a utility function
Stuart Armstrong - June 2019
Clarifies and motivates a technical agenda for building specific assumptions into AI systems that would let them infer human preferences, as an instrumental goal for aligning with them.
Deconfusing Human Values Research Agenda v1
G Gordon Worley III - March 2020
Defines a technical agenda for building a formal expression of the structure of human values, modeling them as the input of their decision process.
The Learning-Theoretic AI Alignment Research Agenda
Vanessa Kosoy - July 2018
Details and motivates philosophically a technical agenda to ground AI alignment in statistical and computational learning theory, as well as algorithmic information theory.
Scalable agent alignment via reward modeling: a research direction
Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg - November 2018
A DeepMind paper which defines a more specific agenda around reward modeling: learning a reward function from interaction with a human user and optimizing it through reinforcement learning, in a way that scales to complex and general domains.
AI Safety Needs Social Scientists
Geoffrey Irving, Amanda Askell - February 2019
Explores and motivates the debate approach to alignment: learning human values through experiments, asking questions and arbitrating between arguments. Tied to OpenAI's AI safety via debate paper.
Special mentions
Technical AGI safety research outside AI
Richard Ngo - October 2019
The first section, “Studying and understanding safety problems”, motivates this very project. The entire post is full of interesting problems to solve.
AI safety resources
Victoria Krakovna - Regularly updated since August 2017
Provides a wealth of useful references, which significantly helped expand this list. Still receiving updates!
To reiterate, just above the comment box: I'm looking for insights. If your favorite reference is missing; if you spot a glaring error; if you have a strong opinion on research directions; if you share my frustrations, or disagree: do share! (Yes, the post is long, please don't let that stop you from engaging).