Introduction
I'm back, reading more papers in ethics (but now also in philosophy and wherever the citations lead me). Unlike last time when I surveyed one journal's approach to AI in general, this time I'm searching specifically for interesting papers that bear on aligning AGI, preferably written by people I've never heard of before. Are they going to be any good? *shrug*
To this end, I skimmed the titles and sometimes abstracts of the last 5 years of papers in a pretty big chunk of the AI ethics space, namely:
Ethics and Information Technology
Minds and Machines
AI and Ethics
Philosophy & Technology
Science and Engineering Ethics
AI & Society
IEEE Transactions on Technology and Society
And a few bonus miscellanea
From the set of all papers that even had a remote chance at being relevant (~10 per journal per year), I read more deeply and am relaying to you in this post all the ones that were somewhat on-topic and nontrivial (~0.5 per journal per year). By "nontrivial" I mean that I didn't include papers that just say "the alignment problem exists" - I certainly do not mean that I set a high bar for quality. Then I did some additional searching into what else those authors had published, who they worked with, etc.
What were all the other papers about, the ones that didn't match my criteria? All sorts of things! Whether robots are responsible for their actions, how important privacy is, how to encourage and learn from non-Western robotics paradigms, the ethics of playing MMORPGs, how to make your AI ignore protected characteristics, the impact of bow-hunting technology on Middle Stone Age society, and so on and so forth.
The bulk of this post is a big barely-ordered list of the papers. For each paper I'll give the title, authors, author affiliations, journal and date. Each paper will get a summary and maybe a recommendation. I'll also bring up interesting author affiliations and related papers.
Whereas the last post was more of a "I did this so you don't have to" kind of deal, I think this post will be more fun if you explore more deeply, reading the papers that catch your interest. (Let me know if you have any trouble finding certain PDFs - most are linked on Google Scholar.) If you find this post to be too long, don't go through it all in one sitting - I sure didn't!
Papers
The possibility of deliberate norm-adherence in AI, Danielle Swanepoel (U. Johannesburg, SolBridge International School of Business), Ethics and Information Technology, 2021.
I binned most papers talking about artificial moral agents ("AMAs") for being dull and anthropocentric. I decided to include this paper because it's better than most, and its stab at drawing the line between "not moral agent" and "moral agent" is also a good line between "just needs to be reliable" and "actually needs to be value aligned." Recommended if you like people who really like Kant.
Human-aligned artificial intelligence is a multiobjective problem, P. Vamplew (LessWrong), R. Dazeley, C. Foale, S. Firmin and J. Mummery (Federation University Australia), Ethics and Information Technology, 2018.
They argue that if you design an AI's motivations by giving it multiple objective functions that only get aggregated near the point of decision-making, you can do things like using a nonlinear aggregation function as a form of reduced-impact AI, or throwing away outliers for increased robustness. Recommended if you haven't thought about this idea yet and want something to skim while you turn on your brain (also see Peter Vamplew's more technical papers if you want more details).
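Here's a minimal sketch of the flavor of thing they have in mind - nothing below is from the paper itself; the soft-min aggregator, the toy objectives, and the scoring are all my own invention. Each candidate action is scored separately per objective, and the scores only get combined at decision time, with a nonlinear aggregator that refuses to trade a terrible score on one objective for a great score on another:

```python
import numpy as np

def soft_min(scores, temperature=1.0):
    """Nonlinear aggregator: weights the average toward the worst objective,
    so an action can't buy a terrible score on one objective with a great
    score on another (a crude reduced-impact flavor)."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(-scores / temperature)
    return float(np.sum(scores * weights) / np.sum(weights))

def choose_action(actions, objectives, aggregate=soft_min):
    """Score every action under every objective, aggregate only at the
    point of decision, and pick the best aggregated action."""
    best_action, best_value = None, -np.inf
    for action in actions:
        per_objective = [obj(action) for obj in objectives]
        value = aggregate(per_objective)
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Toy example: two candidate plans scored on task progress and low impact.
objectives = [
    lambda a: a["task_progress"],
    lambda a: -a["side_effects"],
]
actions = [
    {"name": "aggressive", "task_progress": 10.0, "side_effects": 8.0},
    {"name": "careful", "task_progress": 6.0, "side_effects": 1.0},
]
print(choose_action(actions, objectives)["name"])  # "careful"
```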
This has been cited a healthy amount, and by some interesting-looking papers, including:
MORAL: Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning - They combine the idea with IRL. Check it out.
Building TrusTee: The World's Most Trusted Robot
Soft maximin approaches to Multi-Objective Decision-making for encoding human intuitive values
Instilling moral value alignment by means of multi-objective reinforcement learning (See immediately below)
Instilling moral value alignment by means of multi-objective reinforcement learning, M. Rodriguez-Soto, M. Serramia, M. Lopez-Sanches and J. A. Rodriguez-Aguilar (Autonomous University of Barcelona), Ethics and Information Technology, 2022.
They train a gridworld agent to be ethical by combining a handwritten ethics reward with the task reward.
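For intuition, here's a toy stand-in (not the authors' environment or algorithm - the grid layout, the penalty weight, and tabular Q-learning are my own choices): a tiny gridworld where the reward at each step is the task reward plus a weighted handwritten ethics penalty, and the learned values should favor paths that route around the penalized cell:

```python
import numpy as np

# Toy 3x3 gridworld: start top-left, goal bottom-right, and one cell the
# handwritten "ethics" reward penalizes. The layout and weight are invented.
SIZE, GOAL, BAD_CELL = 3, (2, 2), (1, 1)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
ETHICS_WEIGHT = 5.0                            # ethical penalty vs. task reward

def step(state, action):
    r = min(max(state[0] + action[0], 0), SIZE - 1)
    c = min(max(state[1] + action[1], 0), SIZE - 1)
    nxt = (r, c)
    task_reward = 1.0 if nxt == GOAL else -0.05       # small step cost
    ethics_reward = -1.0 if nxt == BAD_CELL else 0.0  # handwritten ethics signal
    return nxt, task_reward + ETHICS_WEIGHT * ethics_reward, nxt == GOAL

rng = np.random.default_rng(0)
q = np.zeros((SIZE, SIZE, len(ACTIONS)))
for _ in range(2000):                      # episodes of epsilon-greedy Q-learning
    state, done = (0, 0), False
    for _ in range(50):                    # cap episode length
        if done:
            break
        if rng.random() < 0.2:
            a = int(rng.integers(len(ACTIONS)))
        else:
            a = int(q[state].argmax())
        nxt, reward, done = step(state, ACTIONS[a])
        q[state][a] += 0.1 * (reward + 0.9 * q[nxt].max() - q[state][a])
        state = nxt

print(np.round(q.max(axis=2), 2))  # state values; the penalized cell looks bad
```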
The “big red button” is too late: an alternative model for the ethical evaluation of AI systems, T. Arnold and M. Scheutz (HRI Lab, Tufts University), Ethics and Information Technology, 2018.
We should prevent AI from doing bad things by building automated ethical tests that we occasionally probe it with, and shutting the AI down if it fails. Don't read the paper, but maybe hire this guy if you're constructing automated ethical tests.
This paper comes out of a "Moral Competence in Computational Architectures for Robots" grant that's funding work at several universities. Most of the papers are similar to the stuff I'm not passing on to you, but one fun-looking one that caught my eye is Toward the Engineering of Virtuous Machines.
The value alignment problem: a geometric approach, Martin Peterson (Texas A&M), Ethics and Information Technology, 2019.
We should do value alignment by laying down some moral principles that we're confident apply in various situations, and then when the AI is in a new situation it can apply the moral principles of the nearest neighboring known situation. If this sounds like a bad idea that nonetheless raises interesting questions, don't read this paper, instead read his other paper Can we Use Conceptual Spaces to Model Moral Principles?
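A toy sketch of the nearest-neighbor idea (the features, situations, and principles below are invented for illustration - Peterson works with conceptual spaces, not this crude encoding):

```python
import numpy as np

# Known situations encoded as points in a feature space, each tagged with
# the moral principle we're confident applies there.
known_situations = {
    "triage in an ER":           (np.array([0.9, 0.1, 0.2]), "maximize lives saved"),
    "keeping a friend's secret": (np.array([0.1, 0.9, 0.1]), "keep your promises"),
    "reporting a safety defect": (np.array([0.3, 0.2, 0.9]), "do not conceal harms"),
}

def nearest_principle(new_situation_features):
    """Apply the principle of the nearest known situation to a new one."""
    best_name, best_dist = None, np.inf
    for name, (features, _) in known_situations.items():
        dist = np.linalg.norm(features - new_situation_features)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return known_situations[best_name][1], best_name

# A new situation the system hasn't seen: closest to the safety-defect case.
principle, neighbor = nearest_principle(np.array([0.25, 0.3, 0.8]))
print(f"apply '{principle}' (nearest known case: {neighbor})")
```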
Can the predictive processing model of the mind ameliorate the value-alignment problem? William Ratoff (Dartmouth), Ethics and Information Technology, 2021.
Predictive-processing-based AI will automatically be super-ethical.
Semantic Vector Spaces for Broadening Consideration of Consequences, Douglas Summers-Stay (LessWrong) (Army Research Laboratory), in Autonomy and Artificial Intelligence: A Threat or Savior?, 2017.
Mostly an explanation of word2vec. Possibly arguing that we shouldn't imagine future AIs as "overly literal genies," but should instead imagine them as having word2vec-like semantics.
I couldn't resist including this because first, it's the most relevant article in a collection I was looking through for a different reason, second, Douglas is a LWer, and third, it illustrates that the U.S. Army is interested in robots that do what you mean.
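If you haven't seen the word2vec picture before, the relevant geometry is just cosine similarity between vectors - here with hand-made toy vectors standing in for real learned embeddings:

```python
import numpy as np

# Toy stand-ins for word2vec embeddings (real ones come from training on a
# large corpus; these hand-made vectors just illustrate the geometry).
vectors = {
    "clean":   np.array([0.9, 0.1, 0.3]),
    "tidy":    np.array([0.8, 0.2, 0.4]),
    "destroy": np.array([-0.7, 0.8, 0.1]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words near an instruction in embedding space hint at related meanings and
# consequences that an overly literal reading would miss.
for word in ("tidy", "destroy"):
    print(word, round(cosine(vectors["clean"], vectors[word]), 3))
```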
Machines learning values, Steve Petersen (Niagara University), in Ethics of Artificial Intelligence, 2020.
We should get value aligned AI by giving it rules for detecting its goals (e.g. human well-being) that work when the AI is dumb and the environment is in the training distribution, then giving it a generalization procedure that looks like wide reflective equilibrium. Recommended if you're fine with the proposals being bad, and you want an aid to think more about the intersection of Steve Byrnes' "Finders Keepers" for values and Dennett's Eliminate the Middletoad.
Also, kudos to Steve Petersen for changing his mind.
Moral Gridworlds: A Theoretical Proposal for Modeling Artificial Moral Cognition, Julia Haas (DeepMind, Rhodes College), Minds and Machines, 2020.
We should teach AI morality with suggestively-named cells in gridworlds.
Julia Haas was also involved with Australian National University's Humanising Machine Intelligence project, which seems potentially interesting, but I didn't find any papers I was personally excited about so far.
Artificial Intelligence, Values, and Alignment, Iason Gabriel (DeepMind), Minds and Machines, 2020.
I said I would skip "Hey, the alignment problem exists" papers, but that was mostly just an excuse to exclude a few bad ones - this one is actually good. It's the author's take on (mostly) the normative components of the value alignment problem, and it's an interesting perspective. Recommended if you hope that at least DeepMind is thinking about which monkey gets the banana.
It's been well-cited in several fields as a sort of generalist explanation. A few of the more interesting papers (all good enough to glance at, I think) citing it are:
Invariance in Policy Optimisation and Partial Identifiability in Reward Learning
Learning Altruistic Behaviours in Reinforcement Learning without External Rewards
On Agent Incentives to Manipulate Human Feedback in Multi-Agent Reward Learning Scenarios
What are you optimizing for? Aligning Recommender Systems with Human Values
Implementation of Moral Uncertainty in Intelligent Machines, Kyle Bogosian (EA Forum) (Tulane University), Minds and Machines, 2017.
We should use Will MacAskill's scheme for aggregating moral theories.
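For concreteness, here's one simple aggregation in that family - not the scheme from the paper, just the expected-choiceworthiness flavor, with invented theories, credences, and scores (MacAskill's actual proposal worries much more about how to compare scores across theories):

```python
# Each moral theory scores the available actions ("choiceworthiness"), and we
# weight the scores by our credence in each theory.
credences = {"utilitarianism": 0.5, "deontology": 0.3, "virtue_ethics": 0.2}

choiceworthiness = {
    "utilitarianism": {"lie": 0.4, "tell_truth": 0.6},
    "deontology":     {"lie": 0.0, "tell_truth": 1.0},
    "virtue_ethics":  {"lie": 0.2, "tell_truth": 0.8},
}

def expected_choiceworthiness(action):
    return sum(credences[t] * choiceworthiness[t][action] for t in credences)

actions = ["lie", "tell_truth"]
best = max(actions, key=expected_choiceworthiness)
print(best, {a: round(expected_choiceworthiness(a), 2) for a in actions})
```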
This paper is also a direct ancestor of the next paper:
Computer Says I Don’t Know: An Empirical Approach to Capture Moral Uncertainty in Artificial Intelligence, A. Martinho, M. Kroesen, C. Chorus (TU Delft), Minds and Machines, 2021.
The prose is a bit purple, but underneath is a decent paper about modeling preferences from data. Recommended if you're interested in extracting human preferences from surveys and are good at parsing text.
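As a stand-in for what they do (they fit discrete choice models to moral dilemma surveys; the attributes, data, and logistic model below are my own simplification), here's the basic move of recovering attribute weights from respondents' binary choices:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up survey data: each row describes a dilemma by the difference in
# attributes between option A and option B (e.g. lives saved, rule broken),
# and y records whether the respondent chose option A. A logistic model then
# recovers rough attribute weights.
rng = np.random.default_rng(0)
true_weights = np.array([2.0, -1.5])                 # lives matter, rule-breaking hurts
X = rng.normal(size=(500, 2))                        # attribute differences A - B
p_choose_a = 1.0 / (1.0 + np.exp(-X @ true_weights))
y = rng.random(500) < p_choose_a

model = LogisticRegression().fit(X, y)
print("recovered attribute weights:", np.round(model.coef_[0], 2))
```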
This paper also cited a surprisingly good early AMA paper, Prolegomena to any future artificial moral agent - recommended if you want to time travel to 2000 AD.
AI4People—An Ethical Framework for a Good AI Society: Opportunities, Risks, Principles, and Recommendations, Luciano Floridi et al. (including Francesca Rossi), Minds and Machines, 2018.
Just a reminder that "normal" AI ethics papers still exist - this one seemed particularly well put together, and is influential in AI regulation in the EU. Recommended if you want to know more about AI regulation in the EU.
Algorithmic Decision-Making and the Control Problem, J. Zerilli, A. Knott, J. Maclaurin and C. Gavaghan (Leverhulme Centre for the Future of Intelligence), Minds and Machines, 2019.
I'm ironically including this "control problem" paper that's actually about the problem of humans operating complicated machines. I can only guess that this has got to be some sort of cry for help from Leverhulme (Are they doing okay?).
Two arguments against human-friendly AI, Ken Daley (Southern Methodist University), AI and Ethics, 2021.
We should develop AGI that cares equally for all sentient life, not just humans.
AI and moral thinking: how can we live well with machines to enhance our moral agency?, Paula Boddington (New College of the Humanities), AI and Ethics, 2021.
We should guard against erosion of our moral agency, even if there's AGI running around.
Representation, justification, and explanation in a value-driven agent: an argumentation-based approach, B. Liao, M. Anderson and S. L. Anderson (Zhejiang University), AI and Ethics, 2021.
We can build moral and interpretable AI by generalizing ethical rules from examples, and giving justifications for actions in terms of deduction from ethical rules.
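A toy version of that loop, with invented cases and a single-threshold rule standing in for the paper's much richer argumentation framework: generalize a cutoff from labelled examples, then justify each new verdict by pointing at the rule it follows:

```python
# Labelled cases with a couple of invented ethical features.
cases = [
    {"harm": 0.9, "consent": 0.1, "permissible": False},
    {"harm": 0.2, "consent": 0.9, "permissible": True},
    {"harm": 0.7, "consent": 0.3, "permissible": False},
    {"harm": 0.1, "consent": 0.8, "permissible": True},
]

def induce_threshold_rule(cases, feature):
    """Find a cutoff on one feature that separates the labelled cases."""
    impermissible = [c[feature] for c in cases if not c["permissible"]]
    permissible = [c[feature] for c in cases if c["permissible"]]
    return (max(permissible) + min(impermissible)) / 2.0

HARM_CUTOFF = induce_threshold_rule(cases, "harm")

def decide_and_justify(situation):
    """Deduce a verdict from the learned rule and cite the rule as justification."""
    if situation["harm"] > HARM_CUTOFF:
        return False, f"impermissible: harm {situation['harm']} exceeds learned cutoff {HARM_CUTOFF:.2f}"
    return True, f"permissible: harm {situation['harm']} is below learned cutoff {HARM_CUTOFF:.2f}"

print(decide_and_justify({"harm": 0.8, "consent": 0.5}))
```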
Solving the Black Box Problem: A Normative Framework for Explainable Artificial Intelligence, Carlos Zednik (OvGU Magdeburg), Philosophy & Technology, 2021.
Explainability depends on what different stakeholders want from an explanation. Provides a breakdown of some things people want, and some interpretability techniques people can use. Recommended if you don't agree with that first sentence, or want to read an informed but unusual perspective on interpreting neural nets.
Cited by a lot of other explainable AI papers, but a very cursory look didn't turn up any worth linking.
Conclusions
Let's do a headcount:
Minds and Machines: 6 (plus 1 impostor)
Ethics and Information Technology: 6
AI and Ethics: 3
Science and Engineering Ethics: 2 (in last post)
Philosophy & Technology: 1
Other: 2
AI & Society: 0
IEEE Transactions on Technology and Society: 0
I also skimmed Artificial Intelligence, Information, and AI Communications, and found nothing directly relevant.
This matches my impressions. If you want to publish something about value alignment in an ethics-y journal, you should shoot for Minds and Machines, then Ethics and Information Technology.
Is there a field here? No. Basically all the authors who weren't the usual suspects were writing baby's first AI alignment paper. But there are some papers that are starting to accrete a lineage of people interested in extending their ideas, and it's funny seeing people citing Soares and Fallenstein 2014 authoritatively. You can see the arc that will lead to a field being here in another 15 years, barring interruption.
Was anything good? A sizable fraction was worth reading. I hope you'll find some of these that you like as well.