Highlights

OpenAI Five (Many people at OpenAI): OpenAI has trained a team of five neural networks to play a particular set of Dota heroes in a mirror match (playing against the same set of heroes) with a few restrictions, and have started to beat amateur human players. They are aiming to beat a team of top professionals at The International in August, with the same set of five heroes, but without any other restrictions. Salient points:

The method is remarkably simple -- it's a scaled up version of PPO with training data coming from self-play, with reward shaping and some heuristics for exploration, where each agent is implemented by an LSTM.
There's no human data apart from the reward shaping and exploration heuristics.
Contrary to most expectations, they didn't need anything fundamentally new in order to get long-term strategic planning. I was particularly surprised by this. Some interesting thoughts from OpenAI researchers in this thread -- in particular, assuming good exploration, the variance of the gradient should scale linearly with the duration, and so you might expect you only need linearly more samples to counteract this.

They used 256 dedicated GPUs and 128,000 preemptible CPUs. A Hacker News comment estimates the cost at $2500 per hour, which would put the likely total cost in the millions of dollars.
They simulate 900 years of Dota every day, which is a ratio of ~330,000:1, suggesting that each CPU is running Dota ~2.6x faster than real time. In reality, it's probably running many times faster than that, but preemptions, communication costs, synchronization etc. all lead to inefficiency.
There was no explicit communication mechanism between agents, but they all get to observe the full Dota 2 state (not pixels) that any of the agents could observe, so communication is not really necessary.
A version of the code with a serious bug was still able to train to beat humans. Not encouraging for safety.
Alex Irpan covers some of these points in more depth in Quick Opinions on OpenAI Five.
Gwern comments as well.

My opinion: I might be more excited by an approach that was able to learn from human games (which are plentiful), and perhaps finetune with RL, in order to develop an approach that could generalize to more tasks in the future, where human data is available but a simulator is not. (Given the ridiculous sample complexity, pure RL with PPO can only be used in tasks with a simulator.) On the other hand, an approach that leveraged human data would necessarily be at least somewhat specific to Dota. A dependence on human data is unlikely to get us to general intelligence, whereas this result suggests that we can solve tasks that have a simulator, exploration strategy, and a dense reward function, which really is pushing the boundary on generality. This seems to be gdb's take: "We are very encouraged by the algorithmic implication of this result — in fact, it mirrors closely the story of deep learning (existing algorithms at large scale solve otherwise unsolvable problems). If you have a very hard problem for which you have a simulator, our results imply there is a real, practical path towards solving it. This still needs to be proven out in real-world domains, but it will be very interesting to see the full ramifications of this finding."

Paul's research agenda FAQ (zhukeepa): Exactly what it sounds like. I'm not going to summarize it because it's long and covers a lot of stuff, but I do recommend it.

Technical AI alignment

Technical agendas and prioritization

Conceptual issues in AI safety: the paradigmatic gap (Jon Gauthier): Lots of current work on AI safety focuses on what we can call "mid-term safety" -- the safety of AI systems that are more powerful and more broadly deployed than the ones we have today, but work using relatively similar techniques as the ones we use today. However, it seems plausible that there will be a paradigm shift in how we build AI systems, and if so it's likely that we will have a new, completely different set of mid-term concerns, rendering the previous mid-term work useless. For example, at the end of the 19th century, horse excrement was a huge public health hazard, and "mid-term safety" would likely have been about how to remove the excrement. Instead, the automobile was developed and started replacing horses, leading to new set of mid-term concerns (eg. pollution, traffic accidents), and any previous work on removing horse excrement became near-useless.

My opinion: I focus almost exclusively on mid-term safety (while thinking about long-term safety), not because I disagree with this argument, but in spite of it. I think there is a good chance that any work I do will be useless for aligning superintelligent AI because of a paradigm shift, but I do it anyway because it seems very important on short timelines, which are easier to affect; and I don't know of other approaches to take that would have a significantly higher probability of being useful for aligning superintelligent AI.

Optimization Amplifies (Scott Garrabrant): One model of the difference between mathematicians and scientists is that a scientist is good at distinguishing between 0.01%, 50% and 99.99%, whereas a mathematician is good at distinguishing between 99.99% and 100%. Certainly it seems like if we can get 99.99% confidence that an AI system is aligned, we should count that as a huge win, and not hope for more (since the remaining 0.01% is extremely hard to get), so why do we need mathematicians? Scott argues that optimization is particularly special, in that the point of very strong optimization is to hit a very narrow target, which severely affects extreme probabilities, moving them from 0.01% to near-100%. For example, if you draw a million samples from a normal distribution and optimize for the largest one, it is almost certain to be 4 standard deviations above the mean (which is incredibly unlikely for a randomly chosen sample). In this sort of setting, the deep understanding of a problem that you get from a mathematician is still important. Note that Scott is not saying that we don't need scientists, nor that we should aim for 100% certainty that an AI is aligned.

My opinion: I think I agree with this post? Certainly for a superintelligence that is vastly smarter than humans, I buy this argument (and in general am not optimistic about solving alignment). However, humans seem to be fairly good at keeping each other in check, without a deep understanding of what makes humans tick, even though humans often do optimize against each other. Perhaps we can maintain this situation inductively as our AI systems get more powerful, without requiring a deep understanding of what's going on? Overall I'm pretty confused on this point.

Another take on agent foundations: formalizing zero-shot reasoning (zhukeepa): There are strong incentives to build a recursively self-improving AI, and in order to do this without value drift, the AI needs to be able to reason effectively about the nature of changes it makes to itself. In such scenarios, it is insufficient to "reason with extreme caution", where you think really hard about the proposed change, and implement it if you can't find reasons not to do it. Instead, you need to do something like "zero-shot reasoning", where you prove under some reasonable assumptions that the proposed change is good. This sort of reasoning must be very powerful, enabling the AI to eg. build a spacecraft that lands on Mars, after observing Earth for one day. This motivates many of the problems in MIRI's agenda, such as Vingean reflection (self-trust), logical uncertainty (how to handle being a bounded reasoner), counterfactuals, etc., which all help to formalize zero-shot reasoning.

My opinion: This assumes an ontology where there exists a utility function that an AI is optimizing, and changes to the AI seem especially likely to change the utility function in a random direction. In such a scenario, yes, you probably should be worried. However, in practice, I expect that powerful AI systems will not look like they are explicitly maximizing some utility function. If you change some component of the system for the worse, you are likely to degrade its performance, but not likely to drastically change its behavior to cause human extinction. For example, even in RL (which is the closest thing to expected utility maximization), you can have serious bugs and still do relatively well on the objective. A public example of this is in OpenAI Five (https://blog.openai.com/openai-five/), but I also hear this expressed when talking to RL researchers (and see this myself). While you still want to be very careful with self-modification, it seems generally fine not to have a formal proof before making the change, and evaluating the change after it has taken place. (This would fail dramatically if the change drastically changed behavior, but if it only degrades performance, I expect the AI would still be competent enough to notice and undo the change.) It may be the case that adversarial subprocesses could take advantage of these sorts of bugs, but I expect that we need adversarial-subprocess-specific research to address this, not zero-shot reasoning.

The Learning-Theoretic AI Alignment Research Agenda (Vadim Kosoy): This agenda aims to create a general abstract theory of intelligence (in a manner similar to AIXI, but with some deficiencies removed). In particular, once we use the framework of reinforcement learning, regret bounds are a particular way of provably quantifying an agent's intelligence (though there may be other ways as well). Once we have this theory, we can ground all other AI alignment problems within it. Specifically, alignment would be formalized as a value learning protocol that achieves some regret bound. With this formalization, we can solve hard metaphilosophy problems such as "What is imperfect rationality?" through the intuitions gained from looking at the problem through the lens of value learning protocols and universal reinforcement learning.

My opinion: This agenda, like others, is motivated by the scenario where we need to get alignment right the first time, without empirical feedback loops, both because we might be facing one-shot success or failure, and because the stakes are so high that we should aim for high reliability subject to time constraints. I put low probability on the first reason (alignment being one-shot), and it seems much less tractable, so I mostly ignore those scenarios. I agree with the second reason, but aiming for this level of rigor seems like it will take much longer than the time we actually have. Given this high level disagreement, it's hard for me to evaluate the research agenda itself.

Iterated distillation and amplification

Paul's research agenda FAQ (zhukeepa): Summarized in the highlights!

Agent foundations

Forecasting using incomplete models (Vadim Kosoy)

Logical uncertainty and Mathematical uncertainty (Alex Mennen)

Learning human intent

Policy Approval (Abram Demski): Argues that even if we had the true human utility function (assuming it exists), an AI that optimizes it would still not be aligned. It also sketches out an idea for learning policies instead of utility functions that gets around these issues.

My opinion: I disagree with the post but most likely I don't understand it. My strawman of the post is that it is arguing for imitation learning instead of inverse reinforcement learning (which differ when the AI and human know different things), which seems wrong to me.

Human-Interactive Subgoal Supervision for Efficient Inverse Reinforcement Learning (Xinlei Pan et al)

Multi-agent Inverse Reinforcement Learning for General-sum Stochastic Games (Xiaomin Lin et al)

Adversarial Exploration Strategy for Self-Supervised Imitation Learning (Zhang-Wei Hong et al)

Preventing bad behavior

Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes (Shun Zhang et al): As we saw in Alignment Newsletter #11, one approach to avoiding side effects is to create a whitelist of effects that are allowed. In this paper, the agent learns both a whitelist of allowed effects, and a blacklist of disallowed effects. They assume that the MDP in which the agent is acting has been factored into a set of features that can take on different values, and then separate the features as locked (unchangeable), free (changeable), or unknown. If there are no unknown features, then we can calculate the optimal policy using variants of standard techniques (for example, by changing the transition function to remove transitions that would change locked features, and then running any off-the-shelf MDP solver). However, this would require the operator to label all features as locked or unlocked, which would be very tedious. To solve this, they allow the agent to query the operator whether a certain feature is locked or unlocked, and provide algorithms that reduce the number of queries that the agent needs to make in order to find an optimal safe policy.

My opinion: This seems like a good first step towards whitelisting -- there's still a lot of hardcoded knowledge from a human (which features to pay attention to, the transition function) and restrictions (the number of relevant features needs to be small), but it takes a problem and provides a solution that works in that setting. In the recent whitelisting approach, I was worried that the whitelist simply wouldn't include enough transitions for the agent to be able to do anything useful. Since this approach actively queries the operator until it finds a safe policy, that is no longer an issue. However, the corresponding worry would be that it takes prohibitively many queries before the agent can do anything useful. (Their empirical evaluation is on toy gridworlds, so this problem did not come up.) Another worry previously was that whitelisting causes an agent to be "clingy", that is, it wants to prevent all changes to non-whitelisted features, even if they are caused by physical laws, or other humans. A similar problem could arise here when this is generalized to dynamic and/or multiagent environments.

Handling groups of agents

Learning Social Conventions in Markov Games (Adam Lerer and Alexander Peysakhovich)

Interpretability

Open the Black Box Data-Driven Explanation of Black Box Decision Systems (Dino Pedreschi et al)

Interpretable Discovery in Large Image Data Sets (Kiri L. Wagstaff et al)

Near-term concerns

Adversarial examples

On Adversarial Examples for Character-Level Neural Machine Translation (Javid Ebrahimi et al)

AI capabilities

Reinforcement learning

OpenAI Five (Many people at OpenAI): Summarized in the highlights!

Retro Contest: Results (John Schulman et al): OpenAI has announced the results of the Retro Contest. The winning submissions were modified versions of existing algorithms like joint PPO and Rainbow, without any Sonic-specific parts.

A Tour of Reinforcement Learning: The View from Continuous Control (Benjamin Recht)

Evolving simple programs for playing Atari games (Dennis G Wilson et al)

Accuracy-based Curriculum Learning in Deep Reinforcement Learning (Pierre Fournier et al)