The composition paper seems to exemplify my intuition for how NNs work. The models are both very small and trained on little data, but image classification seems to be much easier than NLP (which is why the DL revolution came to image classification many years before NLP), so it's enough to train the CNN to have fairly meaningful, disentangled representations of the kind we expect; their RNN model, however, continues to grope through relatively superficial associations and tricks, as the text database is relatively tiny. I'd predict that if they analyzed much larger networks, like BiT or GPT-3, they'd find much more composition, much less reliance on polysemanticity, and less vulnerability to easy 'copy-paste' adversarial examples.
Yup, I generally agree (both with the three predictions, and the general story of how NNs work).
HIGHLIGHTS
Compositional Explanations of Neurons (Jesse Mu et al) (summarized by Robert): Network dissection is an interpretability technique introduced in 2017, which uses a dataset of images with dense (i.e. per-pixel) labels of concepts, objects and textures. The method measures the areas of high activation of specific channels in a convolutional neural network, then compares these areas with the labelled areas in the dataset. If there's a high similarity for a particular channel (measured by the intersection of the two areas divided by their union), then we can say this channel is recognising or responding to this human-interpretable concept.
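To make the alignment score concrete, here is a minimal sketch of the intersection-over-union computation described above (the names are illustrative, and the original method's thresholding details, e.g. per-unit activation quantiles, differ):

```python
import numpy as np

def iou_score(activation_map, concept_mask, threshold):
    # Binarize the channel's (upsampled) activation map, then compare it with
    # the pixel-level concept annotation via intersection-over-union.
    active = activation_map > threshold
    intersection = np.logical_and(active, concept_mask).sum()
    union = np.logical_or(active, concept_mask).sum()
    return intersection / union if union > 0 else 0.0
```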
This paper introduces an extension of this idea: instead of just using the basic concepts (and matching areas in the dataset), they search through logical combinations of concepts (respectively areas) to try and find a compositional concept which matches the channel's activations. For example, a channel might respond to (water OR river) AND NOT blue. This is still a concept humans can understand (bodies of water which aren't blue), but it enables us to explain the behaviour of a larger number of neurons than the original network dissection method. Their work also extends the method to natural language inference (NLI): they interpret neurons in the penultimate layer of a BiLSTM-based network trained to predict whether a sentence entails, contradicts, or is neutral with respect to another. Here they create their own features based on words, lexical similarity between the two sentences, and part-of-speech tags.
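As a rough illustration of the compositional extension (a sketch, not the paper's code; in the paper such formulas are found by search, e.g. beam search over candidate formulas, keeping the one whose composed mask has the highest IoU with the channel's activation mask):

```python
import numpy as np

def compose_example(concept_masks):
    # Evaluate the example formula (water OR river) AND NOT blue on boolean
    # pixel masks; the composed mask is then scored against the channel's
    # activations with the same IoU measure as before.
    water, river, blue = (concept_masks[c] for c in ("water", "river", "blue"))
    return np.logical_and(np.logical_or(water, river), np.logical_not(blue))
```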
Using their method, they find that channels in image classifiers do learn compositional concepts that seem useful. Some of these concepts are semantically coherent (e.g. the example above), and some seem to have multiple unrelated concepts entangled together (e.g. operating room OR castle OR bathroom). In the NLI network, they see that many neurons seem to learn shallow heuristics based on bias in the dataset - e.g. the appearance of single words (like "nobody") which are highly informative about the classification.
Finally, they use their method to create copy-paste adversarial examples (like in Activation Atlas (AN #49)). In the Places365 dataset (where the goal is to classify places), they can crudely add images that appear in compositional concepts aligned with highly contributing neurons, making those neurons fire more and hence changing the classification. Some of these examples generalise across classifier architectures, implying a bias present in the dataset.
Robert's opinion: I think work which targets specific neurons and what they're doing is interesting as it can give us a very low-level understanding of the model, which I feel is necessary to achieve the level of understanding required by alignment solutions which use interpretability (i.e. those in An overview of 11 proposals for building safe advanced AI (AN #102)). The main limitation of this approach is that it currently requires a large amount of dense human labelling of the datasets, and if a concept isn't in the labels of the dataset, then the method won't be able to explain a neuron using this concept. Also, the fact that their interpretability method is able to give insights (in the form of creating copy-paste examples) is a useful sign it's actually doing something meaningful, which I think some other interpretability methods lack.
TECHNICAL AI ALIGNMENT
LEARNING HUMAN INTENT
Learning to Summarize with Human Feedback (Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler et al) (summarized by Rohin): OpenAI has been working on finetuning language models from human preferences (AN #67). This blog post and paper show the progress they have made on text summarization in particular since their last release.
As a reminder, the basic setup is similar to that of Deep RL from Human Preferences: we get candidate summaries by executing the policy, have humans compare which of two summaries is better, and use this feedback to train a reward model that can then be used to improve the policy. The main differences in this paper are:
1. They put in a lot of effort to ensure high data quality. Rather than having MTurk workers compare summaries, they hire a few contractors who are paid a flat hourly rate, and they put substantial effort into communicating what they care about, to ensure high agreement between labelers and researchers.
2. Rather than collecting preferences in an online training setup, they collect large batches at a time, and run a relatively small number of iterations of alternating between training the reward model and training the policy. My understanding is that this primarily makes it simpler from a practical perspective, e.g. you can look at the large batch of data you collected from humans and analyze it as a unit.
3. They initialize the policy from a model that is first pretrained in an unsupervised manner (as in GPT-3 (AN #102)) and then finetuned on the reference summaries using supervised learning.
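To make the reward-model step of the setup above concrete, here is a minimal sketch of a pairwise preference loss of the kind used for this sort of reward learning (names are illustrative; the paper's implementation details may differ):

```python
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # reward_chosen / reward_rejected: shape (batch,) scalar rewards assigned by
    # the reward model to the human-preferred summary and to the other summary.
    # Maximize the log-probability that the preferred summary gets the higher reward.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```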
On the Reddit task they train on, their summaries are preferred over the reference summaries (though since the reference summaries have varying quality, this does not imply that their model is superhuman). They also transfer the policy to summarize CNN / DailyMail news articles and find that it still outperforms the supervised model, despite not being trained at all for this setting (except inasmuch as the unsupervised pretraining step saw CNN / DailyMail articles).
An important ingredient in this success is that they ensure their policy doesn’t overoptimize the reward, by adding a term to the reward function that penalizes deviation from the supervised learning baseline. They show that if they put a very low weight on this term, the policy overoptimizes the reward model and starts producing bad outputs.
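A minimal sketch of what such a penalized reward can look like (the specific KL-style log-ratio form and the beta value here are illustrative assumptions, not quoted from the post):

```python
def penalized_reward(reward_model_score, logprob_policy, logprob_sft, beta=0.05):
    # reward_model_score: r(x, y) from the learned reward model
    # logprob_policy:     log pi_RL(y | x) under the current policy
    # logprob_sft:        log pi_SFT(y | x) under the supervised baseline
    # The log-ratio term penalizes the policy for drifting away from the supervised
    # baseline; with a very low beta the policy can overoptimize the reward model.
    return reward_model_score - beta * (logprob_policy - logprob_sft)
```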
Read more: Paper: Learning to summarize from human feedback
Rohin's opinion: This paper is a great look at what reward learning would look like at scale. The most salient takeaways for me were that data quality becomes very important and having very large models does not mean that the reward can now be optimized arbitrarily.
FORECASTING
Does Economic History Point Toward a Singularity? (Ben Garfinkel) (summarized by Rohin): One important question for the long-term future is whether we can expect accelerating growth in the near future (see e.g. this recent report (AN #105)). For AI alignment in particular, the answer to this question could have a significant impact on AI timelines: if some arguments suggested that it would be very unlikely for us to have accelerating growth soon, we should probably be more skeptical that we will develop transformative AI soon.
So far, the case for accelerating growth relies on one main argument that the author calls the Hyperbolic Growth Hypothesis (HGH). This hypothesis posits that the growth rate rises in tandem with the population size (intuitively, a higher population means more ideas for technological progress which means higher growth rates). This document explores the empirical support for this hypothesis.
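One standard way to formalize the contrast (my gloss, not from the document): if the growth rate itself scales with a power of output/population, the trajectory diverges in finite time, whereas a constant growth rate gives ordinary exponential growth.

```latex
\frac{\dot{Y}}{Y} = a\,Y^{b}
\;\Longrightarrow\;
\begin{cases}
Y(t) \propto (t^{*} - t)^{-1/b}, & b > 0 \quad \text{(hyperbolic: finite-time singularity at } t^{*}) \\
Y(t) \propto e^{a t}, & b = 0 \quad \text{(constant exponential growth)}
\end{cases}
```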
I’ll skip the messy empirical details and jump straight to the conclusion: while the author agrees that growth rates have been increasing in the modern era (roughly, the Industrial Revolution and everything after), he does not see much support for the HGH prior to the modern era. The data seems very noisy and hard to interpret, and even when using this noisy data it seems that models with constant growth rates fit the pre-modern era better than hyperbolic models. Thus, we should be uncertain between the HGH and the hypothesis that the Industrial Revolution triggered a one-off transition to increasing growth rates that have now stabilized.
Rohin's opinion: I’m glad to know that the empirical support for the HGH seems mostly limited to the modern era, and may be weakly disconfirmed by data from the pre-modern era. I’m not entirely sure how I should update -- it seems that both hypotheses would be consistent with future accelerating growth, though HGH predicts it more strongly. It also seems plausible to me that we should still assign more credence to HGH because of its theoretical support and relative simplicity -- it doesn’t seem like there is strong evidence suggesting that HGH is false, just that the empirical evidence for it is weaker than we might have thought. See also Paul Christiano’s response.
NEAR-TERM CONCERNS
MACHINE ETHICS
Reinforcement Learning Under Moral Uncertainty (Adrien Ecoffet et al) (summarized by Rohin): Given that we don’t have a perfect ethical theory ready to load into an AI system, and we don’t seem poised to get one any time soon, it seems worth looking into approaches that can deal with moral uncertainty. Drawing on the literature on moral uncertainty in philosophy, the authors consider several methods by which multiple moral theories can be aggregated, such as averaging over the theories, making decisions through a voting system, and having the theories compete to control the agent’s overall actions. They implement several of these in RL agents, and test them on simple gridworld versions of various trolley problems. They find that all of the methods have advantages and disadvantages.
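A toy sketch of two of the aggregation ideas mentioned above (credence-weighted averaging and voting); this is purely illustrative, and the paper's actual schemes, and the normalization issue discussed in the opinion below, are more subtle:

```python
import numpy as np

def aggregate_action(q_values, credences, method="average"):
    # q_values:  (n_theories, n_actions) array, each theory's value for each action
    # credences: (n_theories,) array of credences in each theory, summing to 1
    if method == "average":
        scores = credences @ q_values                 # credence-weighted expected value
    else:  # "vote"
        scores = np.zeros(q_values.shape[1])
        for q, c in zip(q_values, credences):
            scores[int(np.argmax(q))] += c            # each theory votes for its favourite action
    return int(np.argmax(scores))
```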
Rohin's opinion: The central challenge here is that normalizing different moral theories so that they are comparable is difficult (AN #60) (see Section 2.3). This issue plagues even computationally intractable idealizations like assistance games (AN #69) that can perform full Bayesian updating on different moral theories. I’d love to see better theoretical solutions for this challenge.
OTHER PROGRESS IN AI
DEEP LEARNING
Deploying Lifelong Open-Domain Dialogue Learning (Kurt Shuster, Jack Urbanek et al) (summarized by Rohin): Most research in natural language processing (NLP) follows a paradigm in which we first collect a dataset via crowdsourced workers, and then we train a model on this dataset to solve some task. Could we instead have lifelong learning, in which a model could continue learning after being deployed, getting better and better the more it is used? This paper shows one instantiation of such an approach, in a fantasy role-playing game.
The authors take the previously developed LIGHT role-playing setting, and gamify it. The human player talks to a language model while playing some role, and earns stars and badges for saying realistic things (as evaluated by another language model). Rather than paying crowdsourced workers to provide data, the authors instead merely advertise their game, which people then play for fun, reducing the cost of data acquisition. They find that in addition to reducing costs, this results in a more diverse dataset, and also leads to faster improvements in automated metrics.
Rohin's opinion: Ultimately we’re going to want AI systems that learn and improve over time, even during deployment. It’s exciting to see an example of what that might look like.
UNSUPERVISED LEARNING
Understanding View Selection for Contrastive Learning (Yonglong Tian et al) (summarized by Flo): Contrastive multiview learning (AN #92) is a self-supervised approach to pretraining classifiers in which different views of each data point are created and an encoder is trained to minimize the distance between encodings of views of the same data point while maximizing the distance between encodings of views of different data points.
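For concreteness, here is a minimal sketch of an InfoNCE-style contrastive objective between two batches of view encodings (the exact loss and hyperparameters used in the paper may differ):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) encodings; z1[i] and z2[i] are two views of the same image.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature        # pairwise similarities between all views
    targets = torch.arange(z1.size(0))        # the matching view is the positive for each row
    return F.cross_entropy(logits, targets)   # pull positives together, push the rest apart
```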
The efficacy of this approach depends on the choice of views as well as the downstream task the neural network is going to be trained for. To find the most promising views, the authors propose the Infomin principle: all views should keep task-relevant information while the mutual information between views is minimized. The principle is supported by several observations. Firstly, earlier approaches to contrastive learning in the image domain that use data augmentation to preserve object identity while creating diverse views can be seen as an implicit application of the Infomin principle. Secondly, varying the mutual information between views (for example by changing the distance between two cropped views of the same image) produces an inverted U-curve for downstream performance, corresponding to poor performance when there is too much or too little mutual information between the views. Lastly, the authors also find an inverted U-curve in performance across colour spaces when using channels as views, and the Lab colour space, which was built to mimic human colour perception, is close to the optimum, meaning that human colour perception might be near-optimal for self-supervised representation learning.
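One way to state the Infomin principle (my paraphrase): choose views that share as little information as possible while each still retaining all task-relevant information,

```latex
(v_1^{*}, v_2^{*}) = \arg\min_{v_1, v_2} I(v_1; v_2)
\quad \text{subject to} \quad I(v_1; y) = I(v_2; y) = I(x; y),
```

where x is the input, y the downstream label, and I(·;·) denotes mutual information.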
The authors then use the Infomin principle to select image augmentations for contrastive pretraining and improve the state of the art in linear readout on ImageNet from 69.3% to 73% for Top-1 accuracy and from 89% to 91.1% for Top-5 accuracy.
Read more: What makes for good views for contrastive learning
Flo's opinion: While the Infomin principle seems powerful and their results look impressive, I am not really convinced that the principle actually played an important role in finding the image augmentations they ended up using: there is little description of how that happened, and the augmentations look more like the result of combining previously used approaches and doing some hyperparameter optimization.
HIERARCHICAL RL
Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions (Michael Chang et al) (summarized by Zach): Increasing the scalability of learning systems is a central challenge in machine learning. One framework is to organize an RL agent as a ‘super’ agent, i.e. a large collection of simpler agents that each make decisions according to their own incentives. If it were possible to get the incentives right, the dominant equilibria would be identical to the optimal solution for the original RL problem.
In this paper, the authors introduce a framework for decentralizing decision-making by appealing to auction theory. There is a separate simple agent for each action. At every timestep, a Vickrey auction is run in which each agent bids for the super agent to execute its particular action. The trick is that when an agent wins the bid and acts on a state, it then ‘owns’ the next state it produces, and ‘earns’ the result of the auction in the next round. (At the end of an episode, the owner of the final state earns the reward of the trajectory.) Intuitively, an agent wants to bid on states in which it can make progress towards earning the final reward, as those will be states that other agents want to buy. The authors show that this scheme incentivizes each agent to bid the Q-value of its action in the given state, which would then lead to an optimal policy.
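A minimal sketch of a single auction round under this scheme (names and bookkeeping are illustrative; the learning procedure the paper builds on top of this is not shown):

```python
import numpy as np

def vickrey_auction_step(bids):
    # bids: (n_agents,) array, each primitive agent's bid for acting in the current state
    # (assumes at least two agents). The highest bidder's action is executed by the
    # super agent; by the Vickrey rule it pays the second-highest bid, which is
    # credited to the owner of the current state.
    order = np.argsort(bids)
    winner = int(order[-1])           # highest bidder gets to act
    price = float(bids[order[-2]])    # pays the second-highest bid
    return winner, price
```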
The authors test out this approach with some simple MDPs. They also investigate a task where they try to get the agents to rotate MNIST images so that a classifier will recognize them. Finally, they investigate task transfer by training agents on simple sub-tasks and then reusing those agents to learn a related task making use of both sub-tasks.
Read more: Paper: Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions
Zach's opinion: Imagine Twitch Plays, but where you use a reputation to buy and sell your actions. The actual idea in the paper is slightly more mundane than this, because the primitives are the bidders. Reinforcement Learning with Competitive Ensembles of Information-Constrained Primitives (AN #66) is a similar piece of work that also uses primitives as the basic level of selection. However, its incentive mechanism is different: agents pay according to how much information from the environment they use and then get a reward back for their actions. Still, there’s good reason to think options could work as well, since both of these papers provide evidence that primitives that learn sub-tasks are useful in new tasks.
NEWS
Cooperative AI Workshop (summarized by Rohin): This NeurIPS workshop has the goal of improving the cooperation skills of AI systems (whether with humans or other machines), which encompasses a very wide range of research topics. The deadline to submit is September 18.
Senior Systems Safety Engineer (OpenAI) (summarized by Rohin): OpenAI is hiring for a senior systems safety engineer. From my read of the job description, it seems like the goal is to apply the principles from Engineering a Safer World (AN #112) to AI development.
Early-career funding for individuals interested in improving the long-term future (summarized by Rohin): This Open Philanthropy program aims to provide support for people who want to focus on improving the long-term future. The primary form of support would be funding for graduate school, though other one-off activities that build career capital also count. They explicitly say that people interested in working on AI policy or risks from transformative AI should apply to this program (possibly in addition to their AI fellowship (AN #66)). The stage 1 deadline is January 1, but if you submit earlier they aim to respond within 10 working days.
FEEDBACK
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
PODCAST
An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.