Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.
Stuart Russell at CHAI has published a book about AI safety. Expect a bonus newsletter this week summarizing the book and some of the research papers that underlie it!
Audio version here (may not be up yet).
Highlights
Reframing Impact - Part 1 (Alex Turner) (summarized by Rohin): This sequence has exercises that will be spoiled by this summary, so take a moment to consider whether you want to read the sequence directly.
This first part of the sequence focuses on identifying what we mean by impact, presumably to help design an impact measure in the future. The punch line: an event is impactful to an agent if it changes the agent's ability to get what it wants. This is Attainable Utility (AU) theory. To quote the sequence: "How could something possibly be a big deal to us if it doesn't change our ability to get what we want? How could something not matter to us if it does change our ability to get what we want?"
Some implications and other ideas:
- Impact is relative to an agent: a new church is more impactful if you are a Christian than if not.
- Some impact is objective: getting money is impactful to almost any agent that knows what money is.
- Impact is relative to expectations: A burglar robbing your home is impactful to you (you weren't expecting it) but not very impactful to the burglar (who had planned it out). However, if the burglar was unsure whether the burglary would succeed, then its success or failure would be impactful to them.
While this may seem obvious, past work (AN #10) has talked about impact as being caused by changes in state. While of course any impact does involve a change in state, this is the wrong level of abstraction to reason about impact: fundamentally, impact is related to what we care about.
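As a concrete (toy) illustration of the AU idea, here is a minimal Python sketch; the formalization is an assumption of this summary rather than anything from the sequence, and it uses the agent's (possibly mistaken) beliefs about its Q-values rather than the true optimal ones:

```python
# Toy sketch of Attainable Utility (AU) theory (an assumption of this summary,
# not code from the sequence): the impact of an event to an agent is the
# change in the value the agent believes it can attain.

def attainable_utility(state, q_values):
    """Best value the agent believes it can get from `state`: the max over
    actions of its (possibly mistaken) Q-values for its own utility."""
    return max(q_values[state].values())

def impact(state_before, state_after, q_values):
    """How impactful the transition is *to this agent*: the change in its
    ability to get what it wants."""
    return abs(attainable_utility(state_after, q_values)
               - attainable_utility(state_before, q_values))

# The burglary example: a big deal to the victim (large drop in attainable
# utility), barely impactful to the burglar, who expected it all along.
victim_q = {"before": {"stay": 10.0}, "after": {"stay": 2.0}}
burglar_q = {"before": {"rob": 5.0}, "after": {"rob": 5.0}}
print(impact("before", "after", victim_q))   # 8.0
print(impact("before", "after", burglar_q))  # 0.0
```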
Rohin's opinion: To quote myself from a discussion with Alex, "you're looking at the optimal Q-function for the optimal utility function and saying 'this is a good measure of what we care about' and of course I agree with that". (Although this is a bit inaccurate -- it's not the optimal Q-function, but the Q-function relative to what we expect and know.)
This may be somewhat of a surprise, given that I've been pessimistic about impact measures in the past. However, my position is that it's difficult to simultaneously get three desiderata: value-agnosticism, avoidance of catastrophes, and usefulness. This characterization of impact is very explicitly dependent on values, and so doesn't run afoul of that. (Also, it just makes intuitive sense.)
This part of the sequence also changed some of my thinking on impact measures. In particular, the sequence makes a distinction between objective impact, which applies to all (or most) agents, and value impact. This is similar to the idea of convergent instrumental subgoals, and to the idea that large-scale multiagent training (AN #65) can lead to generally useful behaviors that can be applied to novel tasks. It seems plausible to me that we could make value-agnostic impact measures that primarily penalize this objective impact, and this might be enough to avoid catastrophes. This would prevent us from using AI for big, impactful tasks, but could allow for AI systems that pursue small, limited tasks. I suspect we'll see thoughts along these lines in the next parts of this sequence.
Technical AI alignment
Technical agendas and prioritization
AI Safety "Success Stories" (Wei Dai) (summarized by Matthew): It is difficult to measure the usefulness of various alignment approaches without clearly understanding what type of future they end up being useful for. This post collects "Success Stories" for AI -- disjunctive scenarios in which alignment approaches are leveraged to ensure a positive future. Whether these scenarios come to pass will depend critically on background assumptions, such as whether we can achieve global coordination, or solve the most ambitious safety issues. Mapping these success stories can help us prioritize research.
Matthew's opinion: This post does not exhaust the possible success stories, but it gets us a lot closer to being able to look at a particular approach and ask, "Where exactly does this help us?" My guess is that most research ends up being only minimally helpful for the long run, and so I consider inquiry like this to be very useful for cause prioritization.
Preventing bad behavior
Formal Language Constraints for Markov Decision Processes (Eleanor Quint et al) (summarized by Rohin): Within the framework of RL, the authors propose using constraints defined by DFAs (deterministic finite automata), either to eliminate safety failures or to keep agents from exploring clearly ineffective policies (which accelerates learning). Constraints can be defined on any auxiliary information that can be computed from the "base" MDP. A constraint can either restrict the action space, forcing the agent to take an action that doesn't violate the constraint, which they term a "hard" constraint; or it can impose a penalty on the agent, thus acting as a form of reward shaping, which they term a "soft" constraint. They consider two constraints: one that prevents the agent from "dithering" (going left, then right, then left, then right), and one that prevents the agent from "overactuating" (going in the same direction four times in a row). They evaluate their approach with these constraints on Atari games and MuJoCo environments, and show that it leads to increased reward and decreased constraint violations.
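To make the mechanism concrete, here is a rough sketch (an illustration assumed for this summary, not the paper's implementation) of a dithering-detecting DFA used as either a hard or a soft constraint:

```python
# Illustrative sketch (not the paper's code): a DFA that flags "dithering"
# (left, right, left, right), used as a hard constraint (mask out actions
# that would violate it) or a soft constraint (penalize violations).

LEFT, RIGHT = 0, 1

# dfa[state][action] -> next state; state 4 is the absorbing violation state.
DITHER_DFA = {
    0: {LEFT: 1, RIGHT: 0},   # nothing matched yet
    1: {LEFT: 1, RIGHT: 2},   # seen L
    2: {LEFT: 3, RIGHT: 0},   # seen L, R
    3: {LEFT: 1, RIGHT: 4},   # seen L, R, L
    4: {LEFT: 4, RIGHT: 4},   # violated
}
VIOLATION = 4

def allowed_actions(dfa_state, actions=(LEFT, RIGHT)):
    """Hard constraint: only offer actions that don't violate the DFA."""
    return [a for a in actions if DITHER_DFA[dfa_state][a] != VIOLATION]

def shaped_reward(env_reward, dfa_state, action, penalty=1.0):
    """Soft constraint: reward shaping that penalizes violating the DFA."""
    violated = DITHER_DFA[dfa_state][action] == VIOLATION
    return env_reward - penalty * float(violated)

# During an episode, step the DFA alongside the environment:
#   dfa_state = DITHER_DFA[dfa_state][action]
```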
Rohin's opinion: This method seems like a good way to build in domain knowledge about what kinds of action sequences are unlikely to work in a domain, which can help accelerate learning. Both of the constraints in the experiments do this. The paper also suggests using the technique to enforce safety constraints, but the experiments don't involve any safety constraints, and conceptually there do seem to be two big obstacles. First, the constraints will depend on state, but it is very hard to write such constraints given access only to actions and high-dimensional pixel observations. Second, you can only prevent constraint violations by removing actions one timestep before the constraint is violated: if there is an action that will inevitably lead to a constraint violation in 10 timesteps, there's no way in this framework to not take that action. (Of course, you can use a soft constraint, but this is then the standard technique of reward shaping.)
In general, methods like this face a major challenge: how do you specify the safety constraint that you would like to avoid violating? I'd love to see more research on how to create specifications for formal analysis.
Interpretability
Counterfactual States for Atari Agents via Generative Deep Learning (Matthew L. Olson et al)
Adversarial examples
Robustness beyond Security: Representation Learning (Logan Engstrom et al) (summarized by Cody): Earlier this year, a provocative paper (AN #62) out of MIT claimed that adversarial perturbations weren’t just spurious correlations, but were, at least in some cases, features that generalize to the test set. A subtler implied point of the paper was that robustness to adversarial examples wasn’t a matter of resolving the model’s misapprehensions, but rather one of removing the model’s sensitivity to features that would be too small for a human to perceive. If we do this via adversarial training, we get so-called “robust representations”. The same group has now put out another paper, asking the question: are robust representations also human-like representations?
To evaluate how human-like the representations are, they propose the following experiment: take a source image, and optimize it until its representations (penultimate layer activations) match those of some target image. If the representations are human-like, the result of this optimization should look (to humans) very similar to the target image. (They call this property “invertibility”.) Normal image classifiers fail miserably at this test: the image looks basically like the source image, making it a classic adversarial example. Robust models on the other hand pass the test, suggesting that robust representations usually are human-like. They provide further evidence by showing that you can run feature visualization without regularization and get meaningful results (existing methods result in noise if you don’t regularize).
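A sketch of the inversion test as I understand it, assuming a PyTorch model whose penultimate-layer activations are exposed by a hypothetical `model_features` callable (illustrative, not the authors' code):

```python
# Sketch of the "invertibility" experiment (illustrative, not the authors'
# code). `model_features` is a hypothetical callable returning the
# penultimate-layer activations of a trained classifier.
import torch

def invert_representation(model_features, source, target, steps=200, lr=0.1):
    """Optimize `source` until its representation matches `target`'s.
    For a robust model the result should look like the target; for a standard
    model it still looks like the source (a classic adversarial example)."""
    with torch.no_grad():
        target_rep = model_features(target)
    x = source.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = ((model_features(x) - target_rep) ** 2).sum()
        loss.backward()
        optimizer.step()
        x.data.clamp_(0.0, 1.0)  # stay a valid image
    return x.detach()
```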
Cody's opinion: I found this paper clear, well-written, and straightforward in its empirical examination of how the representations learned by standard and robust models differ. I also have a particular interest in this line of research, since I have thought for a while that we should be more clear about the fact that adversarially-susceptible models aren’t wrong in some absolute sense, but relative to human perception in particular.
Rohin’s opinion: I agree with Cody above, and have a few more thoughts.
Most of the evidence in this paper suggests that the learned representations are “human-like” in the sense that two images that have similar representations must also be perceptually similar (to humans). That is, by enforcing that “small change in pixels” implies “small change in representations”, you seem to get for free the converse: “small change in representations” implies “small change in pixels”. This wasn’t obvious to me: a priori, each feature could have corresponded to 2+ “clusters” of inputs.
The authors also seem to be making a claim that the representations are semantically similar to the ones humans use. I don’t find the evidence for this as compelling. For example, they claim that when putting the “stripes” feature on a picture of an animal, only the animal gets the stripes and not the background. However, when I tried it myself in the interactive visualization, it looked like a lot of the background was also getting stripes.
One typical regularization for feature visualization is to jitter the image while optimizing it, which seems similar to selecting for robustness to imperceptible changes, so it makes sense that using robust features helps with feature visualization. That said, there are several other techniques for regularization, and the authors didn’t need any of them, which is very interesting. On the other hand, their visualizations don't look as good to me as those from other papers.
Read more: Paper: Adversarial Robustness as a Prior for Learned Representations
Robustness beyond Security: Computer Vision Applications (Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom et al) (summarized by Rohin): Since a robust model seems to have significantly more "human-like" features (see post above), it should be able to help with many tasks in computer vision. The authors demonstrate results on image generation, image-to-image translation, inpainting, super-resolution, and interactive image manipulation, all of which are done simply by optimizing the image to maximize the probability of a particular class label or the value of a particular learned feature.
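In the same spirit as the inversion sketch above, class-conditional generation can be approximated by starting from noise and ascending the logit of the desired class (again an illustrative sketch under assumed names, not the authors' code):

```python
# Illustrative sketch: image synthesis from a (robust) classifier by gradient
# ascent on the target class logit, starting from random noise.
import torch

def synthesize(robust_model, target_class, shape=(1, 3, 224, 224),
               steps=300, lr=0.05):
    x = torch.rand(shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -robust_model(x)[0, target_class]  # maximize the class logit
        loss.backward()
        optimizer.step()
        x.data.clamp_(0.0, 1.0)  # keep a valid image
    return x.detach()
```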
Rohin's opinion: This provides more evidence of the utility of robust features, though all of the comments from the previous paper apply here as well. In particular, looking at the results, my non-expert guess is that they are probably not state-of-the-art (but it's still interesting that one simple method is able to do well on all of these tasks).
Read more: Paper: Image Synthesis with a Single (Robust) Classifier
Critiques (Alignment)
Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More (summarized by Rohin): See Import AI.
Miscellaneous (Alignment)
What You See Isn't Always What You Want (Alex Turner) (summarized by Rohin): This post makes the point that for Markovian reward functions on observations, since any given observation can correspond to multiple underlying states, we cannot know just by analyzing the reward function whether it actually leads to good behavior: it also depends on the environment. For example, suppose we want an agent to collect all of the blue blocks in a room together. We might simply reward it for having blue in its observations: this might work great if the agent only has the ability to pick up and move blocks, but won't work well if the agent has a paintbrush and blue paint. This makes the reward designer's job much more difficult. However, the designer could use techniques that don't require a reward on individual observations, such as rewards that can depend on the agent's internal cognition (as in iterated amplification), or rewards that can depend on histories (as in Deep RL from Human Preferences).
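A toy illustration of the point (the specific reward and numbers are made up for this summary): the same observation-based reward can be fine or badly misaligned depending on what actions the environment offers.

```python
# Made-up toy example: a Markovian reward on observations that rewards "blue
# in view". Whether it produces good behavior depends on the environment.
import numpy as np

def blue_reward(observation):
    """Fraction of pixels in an H x W x 3 RGB observation that are blue."""
    blue = ((observation[..., 2] > 0.8)
            & (observation[..., 0] < 0.3)
            & (observation[..., 1] < 0.3))
    return float(blue.mean())

room = np.zeros((4, 4, 3))
room[0, 0] = [0.0, 0.0, 1.0]     # one blue block in the room
painted = np.zeros((4, 4, 3))
painted[..., 2] = 1.0            # everything painted blue
print(blue_reward(room), blue_reward(painted))  # 0.0625 vs 1.0: painting wins

# With only "pick up and move" actions, maximizing blue_reward roughly means
# gathering the blue blocks; add a paintbrush and the same reward function
# makes painting the room the optimal policy.
```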
Rohin's opinion: I certainly agree that we want to avoid reward functions defined on observations, and this is one reason why. It seems like a more general version of the wireheading argument to me, and applies even if you think that the AI won't be able to wirehead, as long as it is capable enough to find other plans for getting high reward besides the one the designer intended.
Other progress in AI
Reinforcement learning
Behaviour Suite for Reinforcement Learning (Ian Osband et al) (summarized by Zach): Collecting clear, informative and scalable problems that capture important aspects of how to design general and efficient learning algorithms is difficult; many current environments used to evaluate RL algorithms introduce confounding variables that make new algorithms hard to assess. The authors' main contribution is the Behaviour Suite for Reinforcement Learning (bsuite), an open-source library that facilitates reproducible and accessible research on core issues in RL. Each experiment aims to capture a core issue, such as 'exploration' or 'memory', in a way that can be easily tested and evaluated; bsuite instantiates all of the experiments in code and automates the evaluation and analysis of any RL agent on the suite. The suite is designed to be flexible and includes code to run experiments in parallel on Google Cloud, a Jupyter notebook for analysis, and integration with OpenAI Gym.
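A rough usage sketch, based on my reading of the project (the exact function names are an assumption and may differ from the released version; check the repository's README):

```python
# Rough usage sketch; names like load_and_record, sweep.SWEEP, and
# GymFromDMEnv are assumptions about the released API.
import bsuite
from bsuite import sweep
from bsuite.utils import gym_wrapper

for bsuite_id in sweep.SWEEP:  # e.g. 'catch/0', 'memory_len/3', ...
    env = gym_wrapper.GymFromDMEnv(
        bsuite.load_and_record(bsuite_id, save_path='/tmp/bsuite'))
    obs, done = env.reset(), False
    while not done:  # in practice, run the prescribed number of episodes
        action = env.action_space.sample()  # substitute your agent here
        obs, reward, done, info = env.step(action)
# Results are logged to save_path; the provided Jupyter notebook aggregates
# them into per-issue (exploration, memory, ...) scores and summary reports.
```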
Zach's opinion: It's safe to say that work towards good evaluation metrics for RL agents is a good thing. I think this paper captures a lot of the notions of what makes an agent 'good' in a way that seems readily generalizable. The evaluation time on the suite is reasonable, no more than 30 minutes per experiment. Additionally, the ability to produce automated summary reports in standard formats is a nice feature. One thing that seems to be missing from the core set of experiments is a good notion of transfer learning capability beyond simple generalization. However, the authors readily note that the suite is a work in progress, so I wouldn't be surprised if something covering that were added in time.
Rohin's opinion: The most interesting thing about work like this is what "core issues" they choose to evaluate -- it's not clear to me whether e.g. "memory" in a simple environment is something that future research should optimize for.
Read more: See Import AI