Highlights

Playing hard exploration games by watching YouTube (Yusuf Aytar, Tobias Pfaff et al): There are many YouTube videos demonstrating how to play levels of games such as Montezuma's Revenge. Can we use these demonstrations to solve the hard exploration tasks in Atari? One challenge is that the videos have slightly different visual properties (like color and resolution). The authors propose to learn a shared feature space using auxiliary losses in which the network must predict the number of timesteps between two frames of a video, or the offset between a video clip and an audio clip from the same trajectory. Using this shared feature space, they define a reward function that encourages the agent to take trajectories whose features match those of the demonstrations. In experiments they exceed human performance on Atari games with hard exploration problems.
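
To make the reward idea concrete, here is a minimal sketch of the kind of imitation reward you could define once you have the shared embedding: the agent gets a sparse bonus whenever its current observation's embedding is close to the next checkpoint along a demonstration. The function names, the similarity threshold, and the 0.5 bonus are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def checkpoint_reward(agent_emb, demo_checkpoint_embs, next_idx, threshold=0.92):
    """Sparse imitation reward: pay a small bonus whenever the agent's embedded
    observation is close (cosine similarity above `threshold`) to the next
    checkpoint embedding taken from the YouTube demonstration.  Embeddings are
    assumed to come from the shared feature space learned with the temporal-
    and audio-alignment auxiliary tasks.  Returns (reward, updated next_idx)."""
    if next_idx >= len(demo_checkpoint_embs):
        return 0.0, next_idx                        # demonstration exhausted
    target = demo_checkpoint_embs[next_idx]
    sim = np.dot(agent_emb, target) / (
        np.linalg.norm(agent_emb) * np.linalg.norm(target) + 1e-8)
    if sim > threshold:
        return 0.5, next_idx + 1                    # matched: advance checkpoint
    return 0.0, next_idx
```

A standard RL agent can then be trained on this reward (possibly added to the game score), so the demonstrations guide exploration without requiring the agent to copy actions frame by frame.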

My opinion: It seems to me that this is how we'll have to solve exploration in practice if we don't want huge sample complexity, though I know other researchers are optimistic about solving exploration using curiosity or diversity. It's pretty exciting that they could use a source of data that was already present in the real world.

Technical AI alignment

Problems

The simple picture on AI safety (alexflint): Argues that the problem of AI safety should be distilled into a simple core, consisting of two simple (but not easy) problems -- the technical engineering problem of how to build a safe superintelligence, and the coordination problem of how to prevent an unaligned superintelligence from being built first.

Iterated distillation and amplification

Amplification Discussion Notes (William_S)

Learning human intent

Learning Safe Policies with Expert Guidance (Jessie Huang et al): Expert demonstrations can be consistent with many possible reward functions. Instead of simply trying to mimic the demonstration, the authors consider all rewards that are consistent with the demonstration, and then act to maximize value under the worst-case reward among them, leading to safe behavior.
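
A minimal sketch of the maximin selection step, assuming you have already computed the value of each action under each candidate reward consistent with the demonstration (the name and the discrete-action framing are illustrative simplifications of the paper's setup):

```python
import numpy as np

def maximin_action(q_per_reward):
    """Pick the action whose worst-case value, over all reward hypotheses
    consistent with the demonstration, is highest.
    `q_per_reward` has shape (num_reward_hypotheses, num_actions)."""
    worst_case = q_per_reward.min(axis=0)   # pessimistic value of each action
    return int(worst_case.argmax())

# Example: action 2 has the best guaranteed value across both hypotheses.
q = np.array([[1.0, 0.2, 0.8],
              [0.1, 0.3, 0.7]])
print(maximin_action(q))  # -> 2
```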

My opinion: This is very related to Inverse Reward Design, where instead of maximin planning we use risk-averse planning, and instead of considering all rewards compatible with an expert demonstration we consider all reward functions that are probable given the reward function the designer actually wrote down.
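
To illustrate the contrast: maximin scores an action by its single worst outcome over all compatible rewards, while risk-averse planning weights reward hypotheses by their posterior probability and penalizes bad tails rather than the absolute worst case. The quantile-style risk measure below is my own illustration, not the exact objective from either paper.

```python
import numpy as np

def risk_averse_value(values, probs, alpha=0.1):
    """Score one action by the probability-weighted average of its worst
    alpha-probability tail of outcomes (a CVaR-style measure), rather than by
    the hard minimum over all reward hypotheses.
    `values[i]` is the action's value under reward hypothesis i,
    `probs[i]` that hypothesis's posterior probability (summing to 1)."""
    order = np.argsort(values)              # outcomes from worst to best
    cum = np.cumsum(probs[order])
    tail = order[cum <= alpha]              # the worst alpha-probability tail
    if len(tail) == 0:
        tail = order[:1]                    # fall back to the single worst outcome
    return float(np.average(values[tail], weights=probs[tail]))
```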

Handling groups of agents

Scalable Centralized Deep Multi-Agent Reinforcement Learning via Policy Gradients (Arbaaz Khan et al)

Verification

Training verified learners with learned verifiers (Krishnamurthy (Dj) Dvijotham, Sven Gowal, Robert Stanforth et al)

Miscellaneous (Alignment)

How To Solve Moral Conundrums with Computability Theory (Jongmin Jerome Baek)

AI strategy and policy

How a Pentagon Contract Became an Identity Crisis for Google (Scott Shane et al): After Google accepted a share of the contract for the Defense Department's Maven program, the company has been internally fractured, with many employees strongly opposing the use of AI for military applications.

My opinion: Stories like this make me optimistic that we can actually coordinate AI researchers to take appropriate safety precautions when developing advanced AI systems, even if the economic incentives point in the other direction (and I'm not sure they do).

AI capabilities

Reinforcement learning

Playing hard exploration games by watching YouTube (Yusuf Aytar, Tobias Pfaff et al): Summarized in the highlights!

Meta-Gradient Reinforcement Learning (Zhongwen Xu et al)

Deep learning

Do Better ImageNet Models Transfer Better?: See Import AI

Meta learning

Meta-Learning with Hessian Free Approach in Deep Neural Nets Training (Boyu Chen et al)
