Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

LCA: Loss Change Allocation for Neural Network Training (Janice Lan et al) (summarized by Robert): This paper introduces the Loss Change Allocation (LCA) method, whose purpose is to gain insight into the training process of deep neural networks. The method allocates the change in overall loss (measured on the whole training set) at each training iteration across every parameter; the allocation is iteratively refined until the overall approximation error is less than 1%. A parameter's loss change allocation can be either positive or negative: if it is negative, the parameter is said to have helped training at that iteration, and if it is positive, the parameter hurt training. Since this measurement is per-parameter and per-iteration, it can be aggregated into per-layer LCA, or any other sum over parameters and training iterations.
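To make the bookkeeping concrete, here is a minimal first-order sketch of the allocation in PyTorch (the function name and argument layout are my own assumptions, not the authors' code); the paper goes further and refines the path integral between consecutive parameter values until the total allocation matches the true loss change to within 1%.

```python
import torch

def first_order_lca(model, loss_fn, train_loader, theta_before, theta_after):
    """Minimal first-order sketch of Loss Change Allocation (hypothetical helper,
    not the authors' implementation). Allocates the change in the whole-training-set
    loss between two consecutive iterates to each parameter as grad_i * delta_i;
    the paper refines this path integral until the approximation error is < 1%."""
    # Load the pre-update parameters and compute the gradient of the training loss there.
    with torch.no_grad():
        for p, w in zip(model.parameters(), theta_before):
            p.copy_(w)
    model.zero_grad()
    for x, y in train_loader:
        loss_fn(model(x), y).backward()  # gradients accumulate over the whole training set
    # Per-parameter allocation: negative entries helped training, positive entries hurt it.
    return [p.grad.detach() * (w_after - w_before)
            for p, w_before, w_after in zip(model.parameters(), theta_before, theta_after)]
```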

The authors use the method to gain a number of insights into the training process of several small neural networks (trained on MNIST and CIFAR-10).

First, they validate that learning is very noisy: on average only half of the parameters help at any given iteration. The distribution of LCA values is heavier-tailed than a normal distribution and fairly symmetrical. Moreover, individual parameters tend to alternate between helping and hurting, each helping only about 50% of the time.

Second, they look at the LCA aggregated per layer and summed over the entire training process, and show that in the CIFAR ResNet model the first and last layers hurt overall (i.e. have positive LCA). In an attempt to remedy this and understand the causes, the authors try freezing these layers or reducing their learning rate. The first layer can't be fixed: freezing makes its LCA 0, but later layers' LCA increases in turn, so the overall final loss stays the same. However, for the last layer, freezing or reducing the learning rate increases the overall performance of the network, as the last layer's LCA decreases by more than all the other layers' LCAs increase. They also hypothesize that by reducing the momentum for the last layer, they can give it fresher information and make it more likely to learn. They find that this does work, though in this setting previous layers' LCA increases to compensate, leaving overall performance unchanged.

Finally, the authors show that learning seems to be synchronised across layers: layers hit local LCA minima at the same training iterations, in a statistically significant way. They show this must be due to a combination of parameter motion and the gradient, as neither on its own explains the phenomenon.

Robert's opinion: I really liked this paper. The method is simple (although computationally expensive), and gives novel insights. I think understanding how deep learning training works is important as it can help us design better training processes, not just for better performance but for other properties we want the training process to induce. I think there's a lot of future work which could be done with this method, in making it more efficient and then applying it to larger models in domains other than vision. I'd also be interested in seeing if this can be used to understand which parts of the training set help and hurt training; for example seeing whether there's any correlation between the points of synchronised learning and the datapoints in the minibatch at that training iteration. Note: I'd recommend reading the paper (including the appendices) to see the graphs and visualisations the authors produced to demonstrate their arguments, as they're much easier to understand than a textual description.

Rohin's opinion: I also really enjoyed this paper; it provides great empirical evidence about how neural networks work. I'd be inclined to analyze the results somewhat differently. In particular, suppose that when calculating LCA, we made the following changes:

1. We used the loss on the training batches instead of the full training set.

2. We didn't improve the approximation error (i.e. we just used the point estimate of the gradient calculated during training).

3. We trained using stochastic gradient descent (SGD) (as opposed to say Adam or Momentum-SGD).

Then all LCA values would be negative (explanation in this comment). So, when the paper shows experiments where LCA values are positive (i.e. the parameters / layers are anti-learning), we can attribute those effects to some combination of these three factors.
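To spell out the simplest case (this is my gloss on that comment, not something stated in the paper): with vanilla SGD and a first-order allocation measured on the same minibatch loss that produced the update, every parameter's contribution is non-positive by construction,

```latex
\Delta\theta_i = -\eta\, g_i
\quad\Longrightarrow\quad
\mathrm{LCA}_i \;\approx\; g_i \,\Delta\theta_i \;=\; -\eta\, g_i^2 \;\le\; 0 ,
```

where g_i is the gradient of the minibatch loss and η is the learning rate.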

Take the observation that learning is very noisy. I would guess that this is primarily because of the first point: there are many many ways to improve the loss on a tiny little minibatch, but only a tiny fraction of those are capturing "real effects" that would improve the loss on the full large training dataset. Likely in the large majority of cases, the update doesn't capture a "real effect", and so it's a coin flip whether or not it will help with the loss on the full training dataset. A large probability of a coin flip + a small probability of a "real effect" gets you to an improvement slightly over half the time. This explanation applies across parameters, iterations, layers, etc.

Similarly, they find that learning is synchronized across layers. I think this is also primarily because of the first point. My guess is that there are some batches of data that are more "canonical" than others, that are easiest to learn from. In the case where we see synchronization for each class, this could be as simple as that particular training batch having more examples of that class than other training batches.

I’d be interested in seeing experiments in which we start with the version of LCA where everything is negative, and make only one of the changes. This would allow us to narrow down which particular change causes a given effect, kind of like an ablation study.

TECHNICAL AI ALIGNMENT

ITERATED AMPLIFICATION

How does iterated amplification exceed human abilities? (Issa Rice)

LEARNING HUMAN INTENT

Shared Autonomy via Hindsight Optimization (Shervin Javdani et al) (summarized by Rohin): This paper considers a shared autonomy task in which a user controls a robot to achieve some goal, and the robot learns to assist the user without knowing the goal in advance. They formalize this as a POMDP in which the state includes the user's goal, which the robot does not get to observe. However, the POMDP observation model assigns higher probability to user actions that better achieve the goal (a standard Boltzmann rationality model), which allows the robot to reason about what the goal must be. In practice, for computational tractability, rather than solving the full POMDP, the robot chooses actions using a technique called hindsight optimization, which assumes that the robot will never learn any more information about the user's goal.
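For intuition, here is a rough sketch of the two pieces (goal inference under a Boltzmann user model, and hindsight optimization over the resulting belief); the array shapes and function names are my own assumptions, not the paper's code:

```python
import numpy as np

def belief_update(belief, q_user, state, user_action, beta=1.0):
    """One Bayesian step of goal inference under a Boltzmann-rational user model.
    belief: array [n_goals], current P(goal)
    q_user: array [n_goals, n_states, n_user_actions], the value of each user action
            if the user were pursuing each goal (assumed to be given)."""
    logits = beta * q_user[:, state, :]                        # [n_goals, n_actions]
    likelihood = np.exp(logits[:, user_action]) / np.exp(logits).sum(axis=1)
    posterior = belief * likelihood
    return posterior / posterior.sum()

def hindsight_optimization_action(belief, q_robot, state):
    """Hindsight optimization: act as if the goal will never become better known,
    so just maximize the belief-weighted value of the robot's actions."""
    expected_q = belief @ q_robot[:, state, :]                 # [n_robot_actions]
    return int(np.argmax(expected_q))
```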

Rohin's opinion: The formulation of a POMDP with uncertainty over the goal is remarkably similar to the formulation of Cooperative Inverse Reinforcement Learning (AN #69) (and predates it), with the main difference being that there is only one actor (the robot hardware).

Imitation Learning via Off-Policy Distribution Matching (Ilya Kostrikov et al) (summarized by Zach): One way to view imitation learning is as a distribution matching problem: the agent is rewarded based on how well it can imitate the state distribution induced by the expert. In recent years, distribution matching via adversarial methods such as GAIL has become a popular approach to imitation learning. However, one weakness of these methods is that they require on-policy samples, which means the agent must interact with the environment. In this paper, the authors present an off-policy method for distribution matching that can work without environment interaction. They do this by building on the prior work of DualDICE, a policy-agnostic method to estimate distribution ratios between agent and expert, which can then be used to provide a reward to the agent. This allows the optimal policy to be estimated directly from demonstrations, without any need for agent interaction. The authors run a few experiments and show that the method has performance comparable to behavioral cloning in the off-policy setting and to adversarial methods in the on-policy setting.
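A rough sketch of how an estimated ratio turns into an imitation reward (this is the general idea rather than the paper's exact objective; `log_ratio_net` is a hypothetical stand-in for a DualDICE-style estimator trained from demonstrations and off-policy agent data):

```python
import torch

def distribution_matching_reward(log_ratio_net, state, action):
    """If log_ratio_net(s, a) estimates log d_expert(s, a) / d_agent(s, a),
    then using it as a per-step reward means maximizing expected return pushes
    the agent's state-action distribution toward the expert's (it minimizes
    their KL divergence). Conceptual sketch, not the paper's implementation."""
    with torch.no_grad():
        return log_ratio_net(torch.cat([state, action], dim=-1))
```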

Prerequisites: DualDICE

Read more: GAIL

Zach's opinion: This is a cool application of density-ratio estimation via DualDICE. While the experiments are a bit weak, the fact that an off-policy method exists to do distribution matching is interesting in its own right. Moreover, the method seems able to compete with both BC and GAIL-like methods, which is intriguing.

VERIFICATION

Ethical Mission Definition and Execution for Maritime Robots Under Human Supervision (Don Brutzman et al) (summarized by Rohin) (H/T Jon Rodriguez): While underwater robots can perform missions that humans cannot, they cannot be held liable for their actions. Our society requires that someone be responsible for (and can be held liable for) the actions of any such robot, leading to a form of the specification problem: how do we program robots such that it is reasonable to hold their operators accountable for their actions?

This paper divides mission execution into three main parts: the execution level (hardware control), the tactical level (low-level behaviors), and the strategic level (what the robot should do). It proposes that, at the strategic level, we use formal methods to specify what the robot should do. The language should be expressive enough to be useful, while remaining limited enough to allow exhaustive testing. They propose using state machines augmented with constraints. The constraints can specify things like "the robot must stay at least 10m away from obstacles". The state machine decides which behaviors to execute, and each behavior can have one of three results: success, failure, or exception (in the case that a constraint would have been violated had the behavior continued operating).
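As a sketch of that contract (all class and method names here are my own placeholders, not from the paper): each behavior runs under a set of constraints and reports one of the three results, and the state machine dispatches on that result.

```python
from enum import Enum

class Result(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    EXCEPTION = "exception"   # a constraint would have been violated

def run_behavior(behavior, constraints):
    """Run one tactical-level behavior, aborting with an exception result as soon
    as continuing would violate any strategic-level constraint (e.g. "stay at
    least 10m away from obstacles")."""
    for proposed_state in behavior.steps():
        if any(not c.allows(proposed_state) for c in constraints):
            behavior.abort()
            return Result.EXCEPTION
        behavior.commit(proposed_state)
    return Result.SUCCESS if behavior.succeeded() else Result.FAILURE

def run_mission(state_machine, constraints):
    """The strategic level: a state machine whose transitions depend only on the
    three possible behavior results, keeping it small enough to test exhaustively."""
    state = state_machine.initial_state
    while state is not None:
        result = run_behavior(state.behavior, constraints)
        state = state.next_state(result)
```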

Rohin's opinion: It's interesting to see other groups also aiming to have what are essentially robustness guarantees, but motivated instead from the perspective of responsibility and liability. The actual method seems reasonable for the impoverished systems we have today, where we must specify everything that we want the system to do.

FORECASTING

FLI Podcast: On Superforecasting (Lucas Perry and Robert de Neufville)

MISCELLANEOUS (ALIGNMENT)

Formal Metaethics and Metasemantics for AI Alignment (June Ku) (summarized by Rohin): This website presents in great detail a process by which an agent might use data from human brains in order to infer a utility function for a single human (also spelling out what assumptions need to be made along the way), and then how it could combine the utility functions from different humans to arrive at "a fully technical ethical goal function". Emphasis is placed on solving the philosophical problems of metaethics and mental content. Quoting the website, they "suppose that unlimited computation and a complete low-level causal model of the world and the adult human brains in it are available".

Approaches to Deploying a Safe Artificial Moral Agent (Olivier Couttolenc) (summarized by Rohin): This post investigates which of the current moral theories would most reduce existential risk if we programmed it into an AI system, and settles on Aristotelian virtue ethics (over utilitarianism and Kant's categorical imperative).

NEAR-TERM CONCERNS

FAIRNESS AND BIAS

Algorithmic Fairness from a Non-ideal Perspective (Sina Fazelpour et al) (summarized by Rohin): The field of fairness has aimed to develop objective metrics of fairness, which can then be optimized for in order to produce a just AI system. Unfortunately, many intuitively desirable fairness metrics are fundamentally incompatible, and cannot be simultaneously achieved except in special circumstances. Should we lose all hope for fairness?

This paper argues that the problem is that we have been building idealized theories, drawing on a distinction from political philosophy between ideal and non-ideal modes of theorizing. An ideal theory describes an optimal, ideal world, and then identifies injustices by searching for discrepancies between the real world and the idealized one. This leads to three major flaws:

1. It can lead to systematic neglect of some injustices and distortions of our understanding of other injustices. For example, group parity metrics of fairness applied to college admissions would identify East Asian students as privileged relative to white students despite historical and institutional discrimination.

2. It does not offer sufficient practical guidance about what should be done, sometimes leading to misguided mitigation strategies. Consider college admissions again. A disparate learning process aims to be blind to protected characteristics (like gender) while still achieving demographic parity. This forces the model to penalize features that correlate with being male. As a result, we end up rewarding women who go into female-dominated fields and penalizing women who go into male-dominated fields! This was presumably not what we wanted.

3. It does not make clear who among decision-makers is responsible for intervening to correct specific injustices.

The authors suggest that the research community move towards a non-ideal mode of theorizing, in which there is more emphasis on having a deep empirical understanding of the problem (including the various causal factors, rather than summary statistics), and using empirically-informed choices of treatments, rather than modifying ML algorithms to optimize a mathematically defined metric.

Rohin's opinion: I really enjoyed this paper, and my summary doesn't do it justice -- it makes several other good points. I feel similarly about alignment: I feel relatively pessimistic about formal definitions of concepts like goal-directedness (AN #35) or safe exploration (AN #76), and feel much better about schemes that don't assume a formal definition of concepts and instead learn them from humans (or don't require them at all).

Another thing that jumped out at me was that their description of the non-ideal mode of theorizing focuses a lot on understanding what exactly is going on, which is very similar to the concepts of interpretability and universality (AN #81) in alignment.

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

The Ingredients of Real World Robotic Reinforcement Learning (Henry Zhu, Justin Yu, Abhishek Gupta et al) (summarized by Rohin): Suppose we wanted to train a robot to perform a task in the real world, and we didn't want to deal with the headache of sim-to-real transfer. Typically, since all of our experience must be collected in the real world, we would need a human to reset the robot to its initial state. The key idea of this paper is that the point of resets is to ensure that the robot explores a diverse set of states, causing it to learn a robust policy; this can instead be achieved by learning a perturbation policy whose objective is to take the robot to states it hasn't visited before. They then combine this with representation learning (so that they can learn from pixels) and use a classifier that distinguishes goal states from non-goal states as the reward function, to get a fully automated setup where, once you start the robot's training, it trains itself without any human in the loop.
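A minimal sketch of that loop (all names here are placeholders I chose, not the authors' code): alternate a task policy, rewarded by the goal classifier, with a perturbation policy rewarded for novelty, so that after a single initial reset the robot keeps training itself.

```python
def reset_free_training(env, task_policy, perturb_policy, goal_classifier,
                        novelty_model, episode_len=200, n_rounds=1000):
    """Sketch of reset-free training: no human resets after the first one."""
    obs = env.reset()                      # one initial reset only
    for _ in range(n_rounds):
        # Phase 1: pursue the task; reward = P(goal | obs) from the learned classifier.
        for _ in range(episode_len):
            action = task_policy.act(obs)
            obs, _, _, _ = env.step(action)
            task_policy.observe(obs, reward=goal_classifier.prob_goal(obs))
        # Phase 2: perturb; reward = novelty, pushing the robot toward unvisited states.
        for _ in range(episode_len):
            action = perturb_policy.act(obs)
            obs, _, _, _ = env.step(action)
            perturb_policy.observe(obs, reward=novelty_model.novelty(obs))
            novelty_model.update(obs)
        task_policy.update()
        perturb_policy.update()
```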

Read more: Paper: The Ingredients of Real World Robotic Reinforcement Learning

Rohin's opinion: This is a cool proof of concept, but the learned perturbation policy can only take you so far -- no learned perturbation policy is going to allow you to e.g. pick up an object after it is dropped, as you would want if you're training a robot to manipulate a Rubik's cube (AN #70). It seems hard to overcome this sort of problem in a fully automated and learned way (though perhaps you could use more classical techniques to have a "hardcoded" but still automated reset policy).

NEWS

CLR Open Positions: Researchers and Summer Research Fellows (summarized by Rohin): The Center on Long-Term Risk is looking for researchers and summer research fellows to work on high-quality research relevant to s-risks, including on (among other areas) multiagent systems. The application deadline is May 13.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.

COMMENTS
I’d be interested in seeing experiments in which we start with the version of LCA where everything is negative, and made only one of the changes. This would allow us to narrow down which particular change causes a given effect, kind of like an ablation study.

Fwiw, I think this is one of the easier papers to replicate in deep learning, and so would make a great starter project for someone trying to get into deep learning and/or AI safety. I also think the resulting analysis could be publishable at a top ML conference.

You might worry about whether this differentially advantages safety or capabilities. My view is that improved understanding of deep learning is positive for the world (see here), and I also think that enough people who have thought about the problem agree with me that you shouldn't worry about the unilateralist's curse. But there are people who would argue for the opposite position too.

Does anyone have a review of June Ku's "Metaethica.AI"? Nate and Jessica get acknowledgements - maybe you have a gloss? I'm having a little trouble figuring out what's going on. From giving it an hour or so, it seems like it's using functional isomorphism to declare what pre-found 'brains' in a pre-found model of the world are optimizing, and then sort of vaguely constructing a utility function over external referents found by more functional isomorphism (Ramsey-Lewis method).

Am I right that it doesn't talk about how to get the models it uses? That it uses functional isomorphism relatively directly, with few nods (I saw something about mean squared error in the pseudocode, but couldn't really decipher it) to how humans might have models that aren't functionally isomorphic to the real world, and to how the most-isomorphic thing out there might not be what humans want to refer to?

(I suspect you weren't asking me, but just in case you were, I don't know the answers to these questions; they're pretty far outside of my expertise.)