Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.
Audio version here (may not be up yet).
Highlights
Scaling Laws for Neural Language Models (Jared Kaplan, Sam McCandlish et al) (summarized by Nicholas): This paper empirically measures the effect of scaling model complexity, data, and computation on the cross entropy loss for neural language models. A few results that I would highlight are:
Performance depends strongly on scale, weakly on model shape: Loss depends more strongly on the number of parameters, the size of the dataset, and the amount of compute used for training than on architecture hyperparameters.
Smooth power laws: All three of these show power-law relationships that don’t flatten out even at the highest performance they reached.
Sample efficiency: Larger models are more efficient than small models in both compute and data. For maximum computation efficiency, it is better to train large models and stop before convergence.
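To make the "smooth power laws" concrete: the fitted relationships have the generic form below (the constants and exact exponents are reported in the paper; the exponents are small, roughly in the 0.05–0.1 range), where N is the number of non-embedding parameters, D the dataset size, and C the training compute.

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```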
There are lots of other interesting conclusions in the paper not included here; section 1.1 provides a very nice one page summary of these conclusions, which I'd recommend you read for more information.
Nicholas's opinion: This paper makes me very optimistic about improvements in language modelling; the consistency of the power law implies that language models can continue to improve just by increasing data, compute, and model size. However, I would be wary of generalizing these findings to make any claims about AGI, or even other narrow fields of AI. As they note in the paper, it would be interesting to see if similar results hold in other domains such as vision, audio processing, or RL.
A Constructive Prediction of the Generalization Error Across Scales (Jonathan S. Rosenfeld et al) (summarized by Rohin): This earlier paper also explicitly studies the relationship of test error to various inputs, on language models and image classification (the previous paper studied only language models). The conclusions agree with the previous paper quite well: it finds that smooth power laws are very good predictors for the influence of dataset size and model capacity. (It fixed the amount of compute, and so did not investigate whether there was a power law for compute, as the previous paper did.) Like the previous paper, it found that it basically doesn't matter whether the model size is increased by scaling the width or the depth of the network.
ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters (Rangan Majumder et al) (summarized by Asya): This paper introduces ZeRO and DeepSpeed, system optimizations that enable training significantly larger models than we have before.
Data parallelism is a way of splitting data across multiple machines to increase training throughput. Instead of training a model sequentially on one dataset, the dataset is split into shards and a replica of the model is trained on each shard in parallel. Each machine computes gradients on its shard via backpropagation; these gradients are then combined (e.g., averaged) and used to update the model parameters. Previously, data parallelism approaches were memory-constrained because the entire model still had to fit on each GPU, which becomes infeasible for billion- to trillion-parameter models.
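As a toy illustration of those mechanics (a single-process sketch with a made-up linear model, not the actual distributed implementation): splitting a batch across "machines", computing per-shard gradients, and averaging them recovers the full-batch gradient.

```python
# Minimal single-process sketch of data parallelism: the batch is split into shards,
# each "machine" computes a gradient on its shard, and the shard gradients are
# averaged -- recovering the full-batch gradient used for the parameter update.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 10)), rng.normal(size=64)   # toy regression data
w = np.zeros(10)                                        # model replicated on every machine

def grad(X_shard, y_shard, w):
    # Gradient of mean squared error for a linear model on one shard.
    return 2 * X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

num_machines = 4
shard_grads = [grad(Xs, ys, w)
               for Xs, ys in zip(np.split(X, num_machines), np.split(y, num_machines))]
combined = np.mean(shard_grads, axis=0)                 # the "combine gradients" step

assert np.allclose(combined, grad(X, y, w))             # matches the full-batch gradient
```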
Instead of replicating the full set of model states on each machine, ZeRO partitions the model states (parameters, gradients, and optimizer states) across machines, resulting in a per-machine memory reduction that is linear in the number of machines (e.g., splitting across 64 GPUs yields a 64x memory reduction).
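Here is a back-of-the-envelope version of that claim, using the rough accounting from the ZeRO paper (about 16 bytes of model state per parameter for mixed-precision Adam training: 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer states; activation memory is ignored):

```python
# Back-of-the-envelope memory comparison under the ~16 bytes/parameter assumption above.
BYTES_PER_PARAM = 16
params = 100e9                      # a hypothetical 100-billion-parameter model
num_gpus = 64

baseline_per_gpu = params * BYTES_PER_PARAM              # plain data parallelism:
                                                          # every GPU holds everything
zero_per_gpu = params * BYTES_PER_PARAM / num_gpus        # full ZeRO: states partitioned
                                                          # across the data-parallel GPUs

print(f"baseline: {baseline_per_gpu / 1e9:.0f} GB per GPU")   # ~1600 GB -- infeasible
print(f"ZeRO:     {zero_per_gpu / 1e9:.0f} GB per GPU")       # ~25 GB -- fits on one GPU
```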
In addition to ZeRO, Microsoft is releasing DeepSpeed, an easy-to-use library for PyTorch (a popular open-source machine learning framework) that offers ZeRO along with several other performance optimizations. They claim that the library allows for models that are 10x bigger, up to 5x faster to train, and up to 5x cheaper. They use DeepSpeed to train a 17-billion-parameter language model which exceeds state-of-the-art results in natural language processing.
Asya's opinion: I think this is a significant step in machine learning performance which may not be used heavily until average model sizes in general increase. The technique itself is pretty straightforward, which makes me think that as model sizes increase there may be a lot of similar "low-hanging fruit" that yield large performance gains.
Technical AI alignment
Learning human intent
Meta-Inverse Reinforcement Learning with Probabilistic Context Variables (Lantao Yu, Tianhe Yu et al) (summarized by Sudhanshu): This work explores improving performance on multi-task inverse reinforcement learning in a single-shot setting by extending Adversarial Inverse Reinforcement Learning (AN #17) with "latent context variables" that condition the learned reward function. The paper makes two notable contributions: 1) It details an algorithm to simultaneously learn a flexible reward function and a conditional policy with competitive few-shot generalization abilities from expert demonstrations of multiple related tasks without task specifications or identifiers; 2) The authors empirically demonstrate strong performance of a policy trained on the inferred reward of a structurally similar task with modified environmental dynamics, claiming that in order to succeed "the agent must correctly infer the underlying goal of the task instead of simply mimicking the demonstration".
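As a rough sketch of the central idea (hypothetical module and variable names, not the authors' code): the learned reward network takes a latent context vector, inferred from demonstrations of a new task, as an extra input, so a single set of weights can represent a whole family of task rewards.

```python
# Sketch of a context-conditioned reward function; the context encoder that maps
# expert demonstrations to a latent vector is assumed to exist elsewhere.
import torch
import torch.nn as nn

class ContextConditionedReward(nn.Module):
    def __init__(self, state_dim, action_dim, context_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action, context):
        # `context` is the latent task variable inferred from demonstrations at
        # meta-test time; the same weights serve every task in the family.
        return self.net(torch.cat([state, action, context], dim=-1))

# Usage: infer a context from a demonstration of a new task, then score transitions.
reward_fn = ContextConditionedReward(state_dim=4, action_dim=2, context_dim=8)
context = torch.randn(1, 8)          # stand-in for the demonstration encoder's output
r = reward_fn(torch.randn(1, 4), torch.randn(1, 2), context)
```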
Sudhanshu's opinion: Since this work "integrates ideas from context-based meta-learning, deep latent variable generative models, and maximum entropy inverse RL" and covers the relevant mathematics, it is an involved, if rewarding, study of multi-task IRL. I am convinced that this is a big step forward for IRL, but I'd be interested in seeing comparisons on more complicated setups.
Data efficiency is implied to be a desirable quality, and the paper makes the case that the method learns from a limited number of demonstrations at meta-test time. However, it does not specify how many demonstrations were required for each task during meta-training. Additionally, for two environments, tens of millions of environment interactions were required, which is entirely infeasible for real systems.
Miscellaneous (Alignment)
The Incentives that Shape Behaviour (Ryan Carey, Eric Langlois et al) (summarized by Asya): This post and paper introduce a method for analyzing the safety properties of a system using a causal theory of incentives (past (AN #49) papers (AN #61)). An incentive is something an agent must do to best achieve its goals. A control incentive exists when an agent must control some component of its environment in order to maximize its utility, while a response incentive is present when the agent's decision must be causally responsive to some component of its environment. These incentives can be analyzed formally by drawing a causal influence diagram, which represents a decision problem as a graph where each variable depends on the values of its parents.
For example, consider a recommender algorithm that decides what posts to show in order to maximize clicks. In the causal influence diagram representing this system, the node 'posts to show' is a decision we control, and it has a direct effect on the node we want to maximize, 'clicks'. However, 'posts to show' may also have a direct effect on the node 'influenced user opinions', which itself affects 'clicks'. In the system as it stands, in addition to the desirable control incentive on 'clicks', there is an undesirable control incentive on 'influenced user opinions', since opinions themselves influence 'clicks'. To remove the undesirable incentive, we could reward the system for predicted clicks under a model of the original user opinions, rather than for actual clicks.
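As a rough illustration (using networkx as a stand-in; the paper's actual criterion is defined on causal influence diagrams with explicit decision and utility nodes), the recommender example can be drawn as a small graph, and a node picks up a control incentive roughly when the decision can affect the utility through it:

```python
# Crude sketch of the recommender example; the path check below is only a rough proxy
# for the paper's graphical criterion for control incentives.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("posts to show", "clicks"),                      # direct effect on the utility
    ("posts to show", "influenced user opinions"),    # side effect of the decision...
    ("influenced user opinions", "clicks"),           # ...which also drives the utility
])
decision, utility = "posts to show", "clicks"

for node in G.nodes:
    if node in (decision, utility):
        continue
    # The decision influences this node, and this node influences the utility,
    # so (roughly) the agent has an incentive to control it.
    if nx.has_path(G, decision, node) and nx.has_path(G, node, utility):
        print(f"control incentive on: {node!r}")      # -> 'influenced user opinions'
```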
Asya's opinion: I really like this formalization of incentives, which come up frequently in AI safety work. It seems like some people are already (AN #54) using (AN #71) this framework, and this seems low-cost enough that it's easy to imagine a world where this features in the safety analysis of algorithm designers.
Read more: Paper: The Incentives that Shape Behaviour