Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.
Audio version here (may not be up yet).
Highlights
Reward-rational (implicit) choice: A unifying formalism for reward learning (Hong Jun Jeon, Smitha Milli et al) (summarized by Rohin): We've got algorithms for learning preferences from demonstrations (AN #12) (possibly ranked (AN #60)), comparisons (AN #67), proxy rewards (AN #69), and even the observed state (AN #45). The insight of this paper is that these are all instances of a simple underlying formalism.
Specifically, these forms of preference learning can be described by two properties: (1) the set of choices that the human picks from and (2) how each choice corresponds to a distribution over agent trajectories. Given these properties, we assume that the human makes their choice according to a Boltzmann-rational model (where the human is more likely to choose an option if it leads to higher expected reward). We have now specified a likelihood over the choice given the reward, and we can use Bayes rule to infer a distribution over the reward given the human's choice.
Consider more exotic types of feedback, such as the human's decision to turn the agent off (AN #69). Here, the human has two options: turning the agent off (corresponding to the agent staying still forever), or letting it continue (corresponding to the agent taking the trajectory that maximizes its current expected reward). If the agent has the right reward function, then the Boltzmann-rational human would let it continue; as a result, if the human instead tries to turn the agent off, Bayes rule allows the agent to infer that its belief about the reward must be wrong. Thus, even this decision of whether to turn the agent off can be captured in this framework.
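To make the recipe concrete, here is a minimal numerical sketch of the Boltzmann-rational Bayesian update, applied to the off-switch choice above. Everything specific in it -- the two candidate reward hypotheses, the trajectory values, the prior, and the rationality coefficient -- is made up for illustration and is not from the paper.

```python
import numpy as np

def boltzmann_posterior(prior, choice_values, chosen, beta=1.0):
    """Bayesian update over reward hypotheses given a Boltzmann-rational choice.

    prior:         shape (n_hypotheses,), prior over candidate rewards
    choice_values: shape (n_choices, n_hypotheses), value of each option's
                   trajectory (distribution) under each reward hypothesis
    chosen:        index of the option the human actually picked
    beta:          rationality coefficient (higher = more rational human)
    """
    # Likelihood of each option under each hypothesis: softmax over the options.
    exp_vals = np.exp(beta * choice_values)
    likelihood = exp_vals / exp_vals.sum(axis=0, keepdims=True)
    # Bayes rule: posterior is proportional to likelihood of the observed choice times prior.
    unnormalized = likelihood[chosen] * prior
    return unnormalized / unnormalized.sum()

# Off-switch example with two made-up reward hypotheses:
#   H0: the agent's current reward belief is right (its plan is valuable).
#   H1: the agent's reward belief is wrong (its plan is actually harmful).
# The human's two options: let the agent continue, or switch it off.
prior = np.array([0.8, 0.2])
choice_values = np.array([
    #   H0     H1
    [10.0, -10.0],   # option 0: let the agent continue its planned trajectory
    [ 0.0,   0.0],   # option 1: switch the agent off (it stays still forever)
])

print(boltzmann_posterior(prior, choice_values, chosen=1))
# The posterior shifts sharply toward H1: seeing the human reach for the off
# switch is evidence that the agent's current reward belief is mistaken.
```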
The paper then shows two examples of new feedback types that can be generated from this framework: first, credit assignment, in which the human identifies a subset of the trajectory that had maximal reward, and second, meta-choice, where the choice of which type of feedback to give can itself give information about the reward function.
Rohin's opinion: I like this paper; it's a very clear explanation of a "recipe" used to develop preference learning algorithms (especially at CHAI and Interact, two of the labs I'm a part of). It is particularly applicable to the case where there's a separate training phase in which the human gives feedback on the reward function, and a deployment phase in which the agent optimizes the reward function. Things get murkier once you move to a more online setting in which the human and agent act simultaneously, as in assistance games / CIRL games (AN #69), where the agent can learn from pragmatic actions (see also the deep RL version (AN #64)). In particular, while previously we could separate out the grounding of choices to trajectories from the interpretation of those trajectories (the Boltzmann-rational model), this can no longer be done in an assistance game: the way the agent interprets the trajectories changes what the agent does, which in turn changes the choices available to the human at the next timestep. I'd be excited for more work on understanding this setting.
Technical AI alignment
Learning human intent
Goal-conditioned Imitation Learning (Yiming Ding, Carlos Florensa et al) (summarized by Zach): Goal-conditioned tasks are objectives that can be specified at the start of an episode. Specifically, the objective is set to encourage the agent to reach an arbitrary state in the environment. This paper investigates using goal-conditioning to improve the performance of imitation learning algorithms. The authors build on prior work on Hindsight Experience Replay (HER), a method that allows standard RL algorithms to learn from failure by relabeling final states as goal states. One drawback of HER is that the search process is breadth-first, since we don't know which search directions are useful before we encounter the true goal state. This can complicate exploration. On the other hand, when we have access to expert demonstrations, as in imitation learning, we can generally avoid breadth-first search and instead focus on copying the demonstrations using a method such as generative adversarial imitation learning (GAIL). However, with GAIL we evaluate entire agent trajectories as similar to or dissimilar from the expert demonstrations. Yet we could also view different points along the trajectory as sub-goals, which greatly augments the demonstration dataset. Using this insight, the authors extend goal-conditioning to the imitation learning setting. They test their goal-conditioned algorithm on a variety of basic manipulation tasks and show that with goal relabeling the tasks are learned faster and to a higher quality than with other approaches such as GAIL or HER.
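As a rough illustration of the relabeling idea (not the authors' full algorithm, which also trains a GAIL-style discriminator on the goal-conditioned data), here is a sketch of how a single expert demonstration can be expanded into many (state, action, goal) tuples by treating later states as goals; the toy trajectory and function names are invented for the example.

```python
import numpy as np

def relabel_demo(states, actions, goals_per_step=3, rng=None):
    """Turn one demonstration into goal-conditioned (state, action, goal) tuples
    by treating states visited later in the same demonstration as goals."""
    if rng is None:
        rng = np.random.default_rng(0)
    transitions = []
    for t in range(len(actions)):
        future = np.arange(t + 1, len(states))   # indices of states reached after time t
        picked = rng.choice(future, size=min(goals_per_step, len(future)), replace=False)
        for g in picked:
            transitions.append((states[t], actions[t], states[g]))
    return transitions

# Toy 1-D demonstration: the expert walks from state 0 to state 5.
states = np.arange(6)             # s_0 .. s_5
actions = np.ones(5, dtype=int)   # always "move right"
data = relabel_demo(states, actions)
print(len(data), "goal-conditioned transitions from a single 5-step demonstration")
```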
Zach's opinion: The basic idea of augmenting demonstration data by relabeling the goal is clever. While I understood that HER suffers from being a breadth-first search algorithm, I hadn't considered the possibility that GAIL might be limited to extracting only direct information from the demonstrations. Generalizing GAIL so that it can reach arbitrary states allows for a smooth transition between learning from demonstrations and achieving the desired outcome.
The two-layer model of human values, and problems with synthesizing preferences (Kaj Sotala) (summarized by Rohin): This post points out a problem with the recent preference synthesis research agenda (AN #60) (and presumably other value learning agendas as well): these agendas tend to require simple models of how human behavior, speech, or mental models relate to human preferences. However, in reality, it seems likely that the brain is a big learning machine without any innate "values", and what we experience as our conscious selves is a "strategy" chosen by this learning machine, and as such does not have a sensible interpretation as something that optimizes for "values". The author suggests that value learning agendas need to deal directly with the fact that there are these two "layers" in humans, and presents some preliminary thoughts that don't reach any particular conclusions.
Rohin's opinion: I think this is an important critique: it seems to me that the hardest part of the three principles suggested in Human Compatible (AN #69) is the one that requires human preferences to be tied to human behavior. It seems quite hard to make an accurate and robust model for this, for reasons like the ones given in this post.
Using vector fields to visualise preferences and make them consistent (Michael Aird et al) (summarized by Rohin): This post proposes that we represent a person's preferences as follows: for every state, we have a vector whose direction specifies how the person would most like the state to change, and whose magnitude specifies the intensity of the preference. Under suitable conditions on the state space, this defines a vector field. Intransitive or circular preferences correspond to the curl of the vector field. The authors propose that a consistent set of preferences can then be inferred by "removing the curl", e.g. by using the Helmholtz decomposition.
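A minimal sketch of what "removing the curl" could look like numerically, using an FFT-based Helmholtz decomposition on a toy 2-D state space. The example preference field, the periodic grid, and the boundary conditions are all assumptions made for this demo rather than anything from the post.

```python
import numpy as np

n = 64
xs = np.linspace(0, 2 * np.pi, n, endpoint=False)
x, y = np.meshgrid(xs, xs, indexing="ij")

# Toy preference field = a consistent part (the gradient of a "utility" cos(x)+cos(y))
# plus an intransitive, circular part that just pushes the person around in loops.
u = -np.sin(x) + (-np.sin(y))   # x-component of the preference vector
v = -np.sin(y) + np.sin(x)      # y-component

# Fourier-space Helmholtz decomposition: the curl-free component is grad(phi),
# where laplacian(phi) = div(F).
k = 2 * np.pi * np.fft.fftfreq(n, d=xs[1] - xs[0])
kx, ky = np.meshgrid(k, k, indexing="ij")
u_hat, v_hat = np.fft.fft2(u), np.fft.fft2(v)
div_hat = 1j * kx * u_hat + 1j * ky * v_hat
k2 = kx**2 + ky**2
k2[0, 0] = 1.0                   # avoid dividing by zero at the mean mode
phi_hat = -div_hat / k2
phi_hat[0, 0] = 0.0
u_consistent = np.real(np.fft.ifft2(1j * kx * phi_hat))
v_consistent = np.real(np.fft.ifft2(1j * ky * phi_hat))

# The recovered consistent part matches the gradient of cos(x)+cos(y), i.e.
# (-sin(x), -sin(y)), with the circular (curly) part stripped away.
print(np.max(np.abs(u_consistent - (-np.sin(x)))))  # essentially zero
```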
Preventing bad behavior
Pessimism About Unknown Unknowns Inspires Conservatism (Michael Cohen et al) (summarized by Rohin): The argument for AI risk typically involves some point at which an AI system does something unexpected and bad in a new situation that we haven't seen before (as in e.g. a treacherous turn). One way to mitigate the risk is to simply detect new situations, and ensure the AI system does something known to be safe in such situations, e.g. deferring to a human, or executing some handcoded safe baseline policy. Typical approaches involve a separate anomaly detection model. This paper considers: can we use the AI system itself to figure out when to defer to a mentor?
The key insight is that if an AI system maintains a distribution over rewards, and "assumes the worst" about the reward in new situations, then simply by deferring to the mentor with higher probability when the mentor would get higher expected reward, it will end up deferring to the mentor in new situations. Hence, the title: by making the agent pessimistic about unknown unknowns (new situations), we get a conservative agent that defers to its mentor in new situations.
This is formalized in an AIXI-like setting, where we have agents that can have beliefs over all computable programs, and we only consider an online learning setting where there is a single trajectory over all time (i.e. no episodes). The math is fairly dense and I didn't try to fully understand it; as a result my summary may be inaccurate. The agent maintains a belief over world models (which predict how the environment evolves and how reward is given) and mentor models (which predict what the mentor will do, where the mentor's policy can depend on the true world model). It considers the β most likely world models (where β is a hyperparameter between 0 and 1). It computes the worst-case reward it could achieve under these world models, and the expected reward that the mentor achieves. It is more likely to defer to the mentor when the mentor's expected reward is higher (relative to its worst-case reward).
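Here is a toy numerical sketch of the deferral rule as I understand it; the paper's actual construction is in an AIXI-like setting over all computable world models, so everything below -- the handful of hand-made hypotheses, the reading of β as posterior mass to be covered, and the sigmoid deferral probability -- is a simplified stand-in rather than the paper's algorithm.

```python
import numpy as np

def defer_probability(posterior, agent_values, mentor_values, beta=0.9, temp=1.0):
    """posterior: belief over world models; *_values: value of the agent's best plan /
    the mentor's policy under each world model; beta: how much posterior mass the
    pessimist insists on covering (an assumed stand-in for the paper's hyperparameter)."""
    order = np.argsort(posterior)[::-1]
    cum = np.cumsum(posterior[order])
    keep = order[: np.searchsorted(cum, beta) + 1]   # most likely models covering >= beta mass
    pessimistic_agent = agent_values[keep].min()     # assume the worst among trusted models
    weights = posterior[keep] / posterior[keep].sum()
    mentor_expected = np.dot(weights, mentor_values[keep])
    # Defer to the mentor more often when the mentor beats the pessimistic value of acting.
    return 1.0 / (1.0 + np.exp(-(mentor_expected - pessimistic_agent) / temp))

posterior = np.array([0.5, 0.3, 0.15, 0.05])       # belief over 4 toy world models
agent_values = np.array([10.0, 9.0, 8.0, -100.0])  # agent's plan is disastrous in one rare model
mentor_values = np.array([5.0, 5.0, 5.0, 5.0])     # the mentor is always merely "fine"

print(defer_probability(posterior, agent_values, mentor_values, beta=0.9))    # ~0.05: mostly act
print(defer_probability(posterior, agent_values, mentor_values, beta=0.999))  # ~1.0: defer
```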
Such an agent queries the mentor finitely many times and eventually takes actions that are at least as good as the mentor's choices in those situations. In addition, for events with some bound on complexity, we can set things up (e.g. by having a high β) such that for any event, with high probability the agent never causes the event to occur unless the mentor has already caused the event to occur some time in the past. For example, with high probability the agent will never push the big red button in the environment, unless it has seen the mentor push the big red button in the past.
Rohin's opinion: I think it is an underrated point that in some sense all we need to do to avoid x-risk is to make sure AI systems don't do crazy high-impact things in new situations, and that risk aversion is one way to get such an agent. This is also how Inverse Reward Design (AN #69) gets its safety properties: when faced with a completely new "lava" tile that the agent has never seen before, the paper's technique only infers that it should be uncertain about the tile's reward. However, the expected reward is still 0, and to get the agent to actually avoid the lava you need to use risk-averse planning.
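As a toy illustration of why expected reward alone doesn't avoid the lava while risk-averse planning does (made-up numbers, not the Inverse Reward Design algorithm itself):

```python
import numpy as np

# The agent is uncertain about the reward of a never-before-seen "lava" tile,
# with a posterior symmetric around zero.  In expectation the tile looks
# harmless, but a risk-averse criterion (here, a low quantile) avoids it.
rng = np.random.default_rng(0)
lava_reward_samples = rng.normal(loc=0.0, scale=5.0, size=10_000)   # uncertain tile reward
path_through_lava = lava_reward_samples + 1.0    # shortcut: step on the tile, save a step
path_around_lava = np.zeros_like(lava_reward_samples)               # known-safe detour

print(path_through_lava.mean())              # ~1.0: expected reward says take the shortcut
print(np.quantile(path_through_lava, 0.05))  # very negative: risk aversion says avoid it
print(np.quantile(path_around_lava, 0.05))   # 0.0: the detour wins under risk-averse planning
```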
The case for pessimism is similar to the case for impact measures, and similar critiques apply: it is not clear that we can get a value-agnostic method that is both sufficiently safe to rule out all catastrophes, and sufficiently useful to replace other AI techniques. The author himself points out that if we set β high enough to be confident it is safe, the resulting agent may end up always deferring to the mentor, and so not actually be of any use. Nonetheless, I think it's valuable to point out these ways that seem to confer some nice properties on our agents, even if they can't be pushed to the extremes for fear of making the agents useless.
AI strategy and policy
AI Alignment Podcast: On the Long-term Importance of Current AI Policy (Lucas Perry, Nicolas Moës and Jared Brown) (summarized by Rohin): While this podcast covers both the details of current policy and the long-term impact of engaging in policy today, I'm mostly interested in the latter, and so will simply quote Lucas's summary of points for that part:
1) Experience gained on short-term AI policy issues is important to be considered a relevant advisor on long-term AI policy issues coming up in the future.
2) There are very few people that care about AGI safety currently in government, politics or in policy communities.
3) There are opportunities to influence current AI policy decisions in order to provide a fertile ground for future policy decisions or, better but rarer, to be directly shaping AGI safety policy today through evergreen texts. Future policy that is implemented is path dependent on current policy that we implement today. What we do now is precedent setting.
4) There are opportunities today to develop a skillset useful for other policy issues and causes.
5) Little resource is being spent on this avenue for impact, so the current return on investment is quite good.
Rohin's opinion: I think quite a lot about points 1 and 3, which I think also apply to technical safety research, not just policy. For our research to have an impact, it is necessary that either the research or its authors have enough credibility to actually influence decision-makers. In addition, the problems we will face in the future could depend on technical work done today: for example, if we were convinced that (say) AIs trained via evolution are too risky, we could push for AI to be developed in other ways now.
FLI Podcast: Distributing the Benefits of AI via the Windfall Clause (Lucas Perry and Cullen O’Keefe) (summarized by Rohin): Last week, we had a brief summary of the Windfall Clause paper. This podcast goes into more depth on the potential benefits of and objections to the clause: it's in some sense a more accessible and conversational elaboration of many of the points made in the paper.
Other progress in AI
Reinforcement learning
What Can Learned Intrinsic Rewards Capture? (Zeyu Zheng, Junhyuk Oh et al) (summarized by Rohin): This paper studies whether a learned reward function can serve as a locus of knowledge about the environment, that can be used to accelerate training of new agents. In particular, such a learned intrinsic reward can help with test-time adaptation: in a novel environment, the intrinsic reward can quickly "tell" the agent e.g. where it should explore -- even if in the new environment the agent has a different action space, or uses a different learning algorithm (situations that meta learning would typically not be able to handle).
The authors use a meta-gradient approach to learn an intrinsic reward function that, when used to train a new agent over a “lifetime” (which consists of multiple episodes), leads to the best cumulative reward over the lifetime. Experiments on gridworlds demonstrate that these learned intrinsic rewards: (1) switch between early exploration and later exploitation, (2) explore only for information that is relevant for optimal behavior, (3) capture invariant causal relationships, and (4) can anticipate and adapt to changes in the extrinsic reward within a lifetime.
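To illustrate the meta-gradient structure (and nothing more), here is a toy sketch in which a 2-armed bandit stands in for the paper's gridworlds and the meta-gradient is estimated by finite differences instead of being backpropagated through the inner updates; all of it is an invented example rather than the authors' method.

```python
import numpy as np

TRUE_REWARDS = np.array([0.0, 1.0])   # extrinsic reward: arm 1 is actually better

def lifetime_return(eta, n_episodes=5, lr=1.0):
    """Train a fresh agent whose inner loop maximizes the *intrinsic* reward eta,
    and return the *extrinsic* reward it accumulates over its lifetime."""
    logits = np.zeros(2)
    total = 0.0
    for _ in range(n_episodes):
        probs = np.exp(logits) / np.exp(logits).sum()
        # Inner policy-gradient step on the intrinsic reward (softmax gradient).
        logits += lr * probs * (eta - np.dot(probs, eta))
        probs = np.exp(logits) / np.exp(logits).sum()
        total += np.dot(probs, TRUE_REWARDS)          # extrinsic return this episode
    return total

# Outer loop: improve the intrinsic reward so that freshly trained agents do well
# over a whole lifetime (finite-difference estimate of the meta-gradient).
eta, eps, meta_lr = np.zeros(2), 1e-4, 0.5
for _ in range(300):
    grad = np.array([
        (lifetime_return(eta + eps * np.eye(2)[i]) - lifetime_return(eta - eps * np.eye(2)[i]))
        / (2 * eps)
        for i in range(2)
    ])
    eta += meta_lr * grad

print(eta)   # the learned intrinsic reward ends up favoring the better arm
```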
Rohin's opinion: A common intuition that many researchers have is that specifying what to do (the reward function) should be easier than specifying how to do it (the policy). In practice, this doesn't seem to be the case for deep learning, where imitation via inverse reinforcement learning (inferring a reward function and optimizing it) seems to be similar to imitation learning via behavior cloning ("copying" the policy). Similarly, this method seems broadly similar to meta learning algorithms like MAML and RL^2, though it does outperform them on one (probably carefully designed) transfer learning task.
Deep learning
The Next Decade in AI: Four Steps Towards Robust Artificial Intelligence (Gary Marcus) (summarized by Rohin): This paper suggests a few directions which would allow us to build more robust AI systems with better "understanding" of the world: specifically, it highlights symbol manipulation, encoded knowledge, reasoning, and cognitive models as areas of research for the next decade.
See also Import AI #187 and Matthew Barnett's summary.
Rohin's opinion: The AI claims made in this paper seem pretty reasonable to me, though I think the paper overstates how much deep learning aficionados disagree with them. I certainly agree for example that existing deep learning systems do not generalize well outside of their training environment, or that AI systems will need to work with abstract knowledge, or that AI systems will have to learn from external, cultural knowledge represented in natural language. And while I am perhaps not as enamored of deep learning as (say) OpenAI or DeepMind, I'm a pretty big fan of it, and try to design algorithms where deep learning can do most of the "heavy lifting".
News
FHI Summer Research Fellowship (summarized by Rohin): This six-week summer fellowship allows fellows to take the lead on a project relevant to the long-term future, working with an FHI Research Scholar. Application deadline is March 22.