It's currently hard to know where to start when trying to get better at thinking about alignment. So below I've listed a few dozen exercises which I expect to be helpful. They assume a level of background alignment knowledge roughly equivalent to what's covered in the technical alignment track of the AGI safety fundamentals course. They vary greatly in difficulty - some are standard knowledge in ML, some are open research questions. I’ve given the exercises star ratings from * to *** for difficulty (note: not for length of time to complete - many require reading papers before engaging with them). However, I haven't tried to solve them all myself, so the star ratings may be significantly off.

I've erred on the side of including exercises which seem somewhat interesting and alignment-related even when I'm uncertain about their value; when working through them, you should keep the question "is this actually useful? Why or why not?" in mind as a meta-exercise. This post will likely be updated over time to remove less useful exercises and add new ones.

I'd appreciate any contributions of:

  1. Comments about which exercises seem most or least useful.
  2. Answers to the exercises.
  3. More exercises! The ideal exercises are nerdsnipe-style problems which can be stated clearly and seem well-defined, but lead into interesting depths when explored.

Reward learning

  1. * Look at the examples of human feedback mechanisms discussed in the reward-rational implicit choice paper. Think of another type of human feedback. What is the choice set? What is the grounding function?
    1. This paper by Anthropic introduces a technique called context distillation. Describe this in terms of the reward-rational implicit choice framework.
    2. * Estimate the bandwidth of information conveyed by different types of human feedback. Describe a rough model for how this might change as training progresses. By contrast, how much information is conveyed by the choice of a programmatic reward function? (Consider both the case where the agent is given the exact reward function, and where it learns from reward observations.)
  2. * Look at the examples of biases discussed in learning the preferences of ignorant agents. Identify another bias which similarly influences human decision-making. Describe an example situation where a human with that bias might make the wrong decision. Formulate an algorithm that infers that human’s true preferences.
    1. Some answers here.
  3. ** Given that humans can be assigned any values, why does reward learning ever work in practice?
  4. ** Explain why cooperative inverse reinforcement learning doesn’t solve the alignment problem.
    1. Answer here.
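As a starting point for exercise 1.2, note that a single feedback query can convey at most log2(|choice set|) bits, since the human can only give one of |choice set| distinguishable responses. A minimal sketch of that upper bound (the feedback types and numbers here are just illustrative):

```python
import math

def bits_per_query(choice_set_size: int) -> float:
    """Upper bound on information per feedback query: log2 of the
    number of distinguishable responses the human can give."""
    return math.log2(choice_set_size)

# Pairwise comparison between two trajectories: 2 possible answers.
pairwise = bits_per_query(2)                  # 1 bit

# Full ranking of k trajectories: k! possible answers.
k = 5
ranking = bits_per_query(math.factorial(k))   # log2(120) ≈ 6.9 bits

# Scalar rating on a 10-point scale: 10 possible answers.
rating = bits_per_query(10)                   # log2(10) ≈ 3.3 bits

print(f"pairwise comparison: {pairwise:.1f} bits")
print(f"ranking of {k} trajectories: {ranking:.1f} bits")
print(f"10-point rating: {rating:.1f} bits")
```

The real bandwidth is lower, since human answers are noisy and become predictable as the policy improves; modelling how the effective information per query decays over training is the interesting part of the exercise.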

Agency

  1. ** In this paper, researchers devised a test for whether a recurrent network is doing planning: seeing whether its performance improves when given more time to “think” before it has to act. In the AlphaGo paper, researchers compared the performance of their MCTS+neural network algorithm against the network alone. Think of another test we could run that would give us evidence about the extent to which a given neural network is internally doing planning.
  2. * Consider HCH, an attempted formalisation of “a human’s enlightened judgment”. Why might an implementation of HCH not be aligned? What assumptions would be needed to prevent that?
    1. *** In a later post, Paul defines a stronger version of HCH which “increases the complexity-theoretic expressiveness of HCH. The old version could be computed in EXPTIME, while the new version can compute any decidable function.” Try to rederive a new version of HCH with these properties.
    2. Answer here.
  3. * Ask the OpenAI API about what steps it would take to perform some long-term plan. Work in groups: think of a task that you expect it will be difficult to generate a good plan for, and then see who can design a prompt that will produce the best plan from the API.
    1. * Some steps of a plan generated by the API can also be performed by the API - e.g. a step which requires writing a poem about a given topic. What’s the hardest task you can find for which the API can not only generate a plan, but also perform each of the steps in that plan?
  4. ** Pearl argues that neural networks trained on supervised or self-supervised data can’t learn to reason about interventions and counterfactuals (see this post for an explanation of the distinction). What’s the strongest counterargument against his position?

Reinforcement learning

  1. ** How is supervised learning on reward-maximising trajectories related (mathematically) to policy gradient with sparse, binary rewards?
  2. ** What decision theories are implemented by different RL algorithms?
    1. Some answers here.
  3. ** What might lead an RL agent to learn a policy which sacrifices reward in its current episode to get higher reward in a later episode?
    1. Some answers in section 7 here.
  4. * Self-play in zero-sum two-player games converges to an optimal strategy (given sufficient assumptions about the model class). In other games, this isn’t the case - why not?
  5. ** Evaluate this paper (Reward is Enough). Does their argument hold up?
    1. ** After doing that: consider a bird practicing singing, which listens to its own song, and does RL using the rule: the better the song sounds, the higher the reward. But the bird is also deciding how much time to spend practicing singing versus foraging, etc. And the worse it sings, the more important it is to practice! So you really want the rule: the worse the song sounds, the more rewarding it is to practice singing. How could you resolve this conflict?
    2. Some answers here.
  6. * Why can a behaviourally cloned policy perform well when run for a small number of timesteps, but poorly when run over longer horizons? How can this be fixed?
    1. Some answers here.
  7. ** If a deep Q-learning agent is trained in an environment where some actions lead to large negative rewards, it will never stop trying those actions: the trained policy will sometimes take them even when it isn’t exploring randomly via epsilon-greedy. Why does this happen? How could it be prevented?
    1. Some answers here.
  8. ** RL agents have become capable of competent behaviour over longer and longer episodes. What difficulties arise in trying to measure improvements in how long they can act competently for? What metrics are most useful?
    1. The same question, but for sample efficiency rather than episode length.
    2. Some answers here.
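For exercise 6, the standard compounding-error intuition can be checked with a toy model: if the cloned policy makes an unrecoverable off-distribution mistake with independent probability eps at each step, the chance of still behaving competently decays exponentially in the horizon (eps = 0.01 is an arbitrary illustrative choice):

```python
# Toy model of distribution shift under behavioural cloning:
# at each timestep the cloned policy makes an off-distribution
# mistake with probability eps, and (pessimistically) never recovers.
def p_on_distribution(eps: float, horizon: int) -> float:
    """Probability the policy is still on-distribution after `horizon` steps."""
    return (1 - eps) ** horizon

eps = 0.01  # 1% per-step error rate, chosen purely for illustration
for horizon in [10, 100, 1000]:
    print(f"T={horizon:4d}: P(still on-distribution) = "
          f"{p_on_distribution(eps, horizon):.3f}")
```

One standard fix is interactive data collection, as in DAgger, where the expert relabels the states the learner actually visits, so the training distribution tracks the learner's own trajectory distribution.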

Neural networks

  1. * Consider this paper on modularity in neural networks. Evaluate their metric of clustering; what others could we use instead?
  2. ** Consider the following alignment proposal: a neural network has two output heads, one of which chooses actions, the other of which predicts the longer-term consequences of those actions. Suppose that we train the latter head to maximise human-evaluated prediction quality. What differences might we expect from backpropagating that loss all the way through the network, versus only backpropagating through the prediction head? What complications arise if we try to train the prediction head via RL? What advantages might there be of doing so?
  3. ** “Gradient hacking” is a hypothesised phenomenon by which a model decides its actions partly on the basis of its observations of its own parameters, thereby changing the way its parameters are updated. Does the gradient hacking mechanism described in the linked post work? If not, does any variant of it work?
    1. Some answers here.
  4. * Read Jacob Steinhardt’s list of examples of emergent shifts in machine learning. Can you think of any others? What about shifts that you expect in the near future?
  5. ** What might it look like for the circuits hypothesis to be false?
  6. This paper discusses the metric of “effective data transferred”. What are the limitations of this metric? What are some alternative ways to measure data transfer?

Alignment theory

  1. * Consider extending reinforcement learning to the case where rewards can depend on the parameters of a model. Why do classic convergence proofs no longer work?
    1. *** Are there any limiting assumptions which might lead to interesting theoretical results?
  2. ** One concern with proposals to train using loss functions that depend directly on neural activations is that if we train a network to avoid carrying out any particular piece of cognition, that cognition will instead just be distributed across the network in a way that we can’t detect. Describe a toy example of a cognitive trait that we can currently detect automatically. Design an experiment to determine whether, after training to remove that trait, the network has learned to implement an equivalent trait in a less-easily-detectable way.
  3. *** Rederive some of the proofs in the following papers. For b) and c), explain what assumptions are being made about the optimality of the agents involved, and how they might break down in practice:
    1. Seeking Power is Convergently Instrumental in MDPs
    2. AI safety via debate (and see also proofs about the effects of adding cross-examination)
    3. Alignment proposals and complexity classes (and this follow-up)
    4. Some answers here.
  4. *** Produce a proposal for the ELK prize (note that this requires engaging with the ELK writeup, which is very long).
  5. ** Suppose that we’re training a model via behavioural cloning of a human, but the human starts off with different prior knowledge to the model (either more knowledge, or less knowledge). How might this lead the model to behave in a misaligned way?
    1. Some answers here.

Agent foundations

  1. Open-source game theory
  2. ** Selection theorems
  3. ** Fixed point exercises
  4. *** The 5 and 10 problem

Evolution and economics

  1. * An old study split insects into several groups which each lived together, and artificially selected in favour of smaller groups, in an attempt to study whether they would evolve to voluntarily restrain their breeding. Predict the outcome of the study.
    1. Some answers here. Did the bias discussed in this post influence your expectations?
  2. ** What might explain why there are so few hermaphroditic animal species, given that a species in which every individual can bear children could potentially produce twice as many offspring per generation?
    1. Some answers here.
  3. * Read this post about evolving to extinction. Mathematically demonstrate that segregation-distorters could in fact lead a species to evolve to extinction.
  4. * Evaluate Fletcher and Doebeli’s model of the evolution of altruism.
    1. Use the model to show how the green-beard effect could lead to the evolution of (a certain type of) altruism.
  5. ** Why are roughly equal numbers of males and females born in most species?
    1. Some answers here.
  6. * Comparing GDP across time requires reference to a standard basket of goods and services. What difficulties might this cause in taking GDP comparisons at face value?
    1. Some answers here.
  7. ** Evaluate Roodman’s model of explosive economic growth.
  8. * In cooperative game theory, the “core” is the term for the set of allocations of payoffs to agents where no subset of the agents can form a coalition to improve their payoffs. For example, consider a group of N miners, who have discovered large bars of gold. Assume that two miners can carry one piece of gold, and so the payoff of any coalition S is floor(|S|/2). If N is even, then the core consists of the single payoff distribution where each miner gets ½ a bar. If N is odd, then the core is empty (because the miner who is left out can always make a better offer to some miner who currently has a gold-carrying partner). Identify the core for the following games:
    1. A game with 2001 players: 1000 of them have 1 left shoe, 1001 have 1 right shoe. A left-shoe/right-shoe pair can be sold for $10.
    2. Mr A and Mr B each have three gloves. Any two gloves make a pair that they can sell for $5.
    3. Answers here.
  9. * How should coalitions decide how to split the payoffs they receive? The concept of Shapley values provides one answer. Convince yourself that Shapley values have the properties of linearity, null player and the stand-alone test described in the linked article.
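The games in exercises 8 and 9 are small enough to check by brute force. A minimal sketch (the helper names here are mine, not standard library functions): `in_core` tests whether one candidate allocation can be blocked by any coalition, and `shapley_values` averages marginal contributions over all join orders.

```python
from itertools import combinations, permutations
from math import factorial, floor

def in_core(allocation, players, value):
    """An allocation is in the core if it distributes the grand coalition's
    full payoff and no coalition could do better for itself by splitting off."""
    if abs(sum(allocation.values()) - value(players)) > 1e-9:
        return False  # not efficient
    return all(
        sum(allocation[p] for p in S) >= value(S) - 1e-9
        for r in range(1, len(players) + 1)
        for S in combinations(players, r)
    )

def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = ()
        for p in order:
            phi[p] += value(coalition + (p,)) - value(coalition)
            coalition += (p,)
    return {p: v / factorial(len(players)) for p, v in phi.items()}

# Miners: two miners carry one bar, so a coalition S is worth floor(|S|/2) bars.
def miners(S):
    return floor(len(S) / 2)

print(in_core({i: 0.5 for i in range(4)}, tuple(range(4)), miners))    # True: N even
print(in_core({i: 1 / 3 for i in range(3)}, tuple(range(3)), miners))  # False: N odd

# Glove game from 8.2: Mr A and Mr B each hold 3 gloves; a pair sells for $5.
def gloves(S):
    return 5 * floor(3 * len(S) / 2)

print(shapley_values(("A", "B"), gloves))  # symmetry gives each $7.5
```

Note that `in_core` only tests a single candidate allocation; showing the core is *empty* for odd N still requires the offer-undercutting argument (or a linear program over all allocations).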

Some important concepts in ML

These are intended less as exercises and more as pointers to open questions at the cutting edge of deep learning.

  1. Scaling laws
    1. Why do they have the form they do?
    2. Some answers here and here
  2. Memorisation in neural networks
  3. Double descent
  4. The lottery ticket hypothesis
  5. Games with spinning-top structure
  6. Gradient noise scale (see also here)
  7. OpenAI Requests for Research
  8. OpenAI Requests for Research 2
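For the scaling-laws item, a useful warm-up before asking *why* the power-law form arises is seeing how such a law is fitted at all: a power law L(N) = a·N^(-b) is a straight line in log-log space. A sketch on synthetic data (the constants a = 10 and b = 0.076 are made up for illustration, with the exponent loosely echoing published parameter-scaling estimates):

```python
import numpy as np

# Synthetic "loss vs. model size" data following L(N) = a * N**-b,
# with illustrative constants and a little multiplicative noise.
rng = np.random.default_rng(0)
N = np.logspace(6, 10, 20)            # model sizes from 1e6 to 1e10 params
a_true, b_true = 10.0, 0.076
L = a_true * N ** -b_true * np.exp(rng.normal(0, 0.01, N.shape))

# A power law is linear in log-log space: log L = log a - b * log N.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
a_hat, b_hat = np.exp(intercept), -slope
print(f"fitted a ≈ {a_hat:.2f}, b ≈ {b_hat:.3f}")
```

The interesting question is then which mechanisms (data manifold dimension, feature frequency distributions, etc.) would *predict* a straight line over four orders of magnitude rather than merely fit one.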

Miscellaneous

  1. * Fill in your estimates in Cotra’s timeline model. Does the model broadly make sense to you; are there ways you’d change it?
  2. * Try playing OpenAI’s implementation of the Debate game.
  3. ** Identify an important concept in alignment that isn’t currently very well-explained; write a more accessible explanation.
Comments

I particularly appreciate the questions that ask one to look at a way that a problem was reified/specified/ontologized in a particular domain and asks for alternative such specifications. I thought Superintelligence (2014) might be net harmful because it introduced a lot of such specifications that I then noticed were hard to think around. I think there are a subset of prompts from the online course/book Framestorming that might be useful there, I'll go see if I can find them.

Great list of interesting questions, trains of thought and project ideas in this post.

I was a little surprised not to find any exercises on interpretability. Perhaps there was a reason for excluding it, but if not, here is an idea for another exercise/group of exercises to include (perhaps it could be merged into the "Neural networks" section):

Interpretability

  1. Mechanistic interpretability is a research direction that systematically investigates the neurons or nodes of a neural network or ML model to try and understand what various neurons are doing. This approach has been applied already to learn about some of the neurons in early vision models as well as transformer language models. This research has also yielded some findings on how some groups of neurons work together, called "circuits".

    Review the above links and use a similar approach to investigate some neurons in another kind of neural network. There are many kinds of neural networks that could be investigated, but a few examples include reinforcement-learning (RL) agents, generative adversarial networks (GANs), protein-folding networks, etc. 
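One concrete way to start on an exercise like this is the simplest mechanistic-interpretability tool: finding the dataset examples that most strongly activate a chosen neuron. A numpy sketch on a tiny random network (everything here is illustrative, not a real trained model or a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer network with random weights, standing in
# for a trained model purely for illustration.
W1 = rng.normal(size=(16, 8))   # 16 input features -> 8 hidden neurons
b1 = rng.normal(size=8)

def hidden_activations(x):
    """ReLU activations of the hidden layer for a batch of inputs."""
    return np.maximum(0, x @ W1 + b1)

# A small stand-in "dataset" of inputs.
data = rng.normal(size=(1000, 16))
acts = hidden_activations(data)

# For one neuron, find the top-5 maximally-activating dataset examples --
# the usual first step when guessing what a neuron responds to.
neuron = 3
top5 = np.argsort(acts[:, neuron])[-5:][::-1]
print("top examples for neuron", neuron, ":", top5)
print("their activations:", acts[top5, neuron])
```

In a real investigation the payoff comes from looking at *what the top examples have in common* (and at the weights feeding the neuron), then testing the resulting hypothesis with hand-constructed inputs.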

Suggestion on Agency 2.1: rephrase so that the "Before reading his post" part comes before the link to the post. I assume there'll otherwise be some overzealous link followers.

Thanks, done!

Curated. Exercises are crucial for the mastery of topics and the transfer of knowledge, it's great to see someone coming up with them for the nebulous field of Alignment.

This post, by example, seems like a really good argument that we should spend a little more effort on didactic posts of this sort. E.g. rather than just saying "physical systems have multiple possible interpretations," we could point people to a post about a gridworld with a deterministic system playing the role of the agent, such that there are a couple different pretty-good ways of describing this agent that mostly agree but generalize in different ways.

This perspective might also be a steelmanning of that sort of paper where there's an abstract argument that does all the work, and then some code that tells you nothing new if you followed the abstract argument. The code (in the steelmanned story) isn't just there to make the paper be in the right literary genre or provide a semi-trustworthy signal that you're not a crank who makes bad abstract arguments, it's a didactic tool to help the reader do these sorts of exercises.

** Explain why cooperative inverse reinforcement learning doesn’t solve the alignment problem.

Feedback: I clicked through to the provided answer and had a great deal of difficulty understanding how it was relevant - it makes a number of assumptions about agents and utility functions and I wasn't able to connect it to why I should expect an agent trained using CIRL to kill me.

FWIW here's my alternative answer:

CIRL agents are bottlenecked on the human overseer's ability to provide them with a learning signal through demonstration or direct communication.  This is unlikely to scale to superhuman abilities in the agent, so superintelligent agents simply will not be trained using CIRL.

In other words it's only a solution to "Learn from Teacher" in Paul's 2019 decomposition of alignment, not to the whole alignment problem.