At the time of writing, I have just (nearly) finished my PhD at Oxford. During that time, most of my research has been motivated by the goal of developing a theoretical foundation for the field of reward learning. The purpose of this sequence is to explain and motivate this research agenda, and to provide an accessible overview of some of my results. It will be divided into several smaller posts:
I hope that this sequence will make it easier to get an overview of my recent research. I also want to welcome discussion, so please leave a comment if you have any questions. Finally, I think this research agenda is promising, and it has proven to be very tractable, yet not many people work on it (or in related areas). I therefore also want to advertise this research, and make this post a good starting point for people who may be interested in working on similar questions themselves.
Defining the Theoretical Reward Learning Research Agenda
In one sentence, the aim of this research agenda is to develop a mature theoretical foundation for the field of reward learning (and relevant adjacent areas). This is very closely related to work that has been done in the AI safety space for quite some time, but I have approached it from a slightly different angle. Concretely, this research agenda involves answering questions such as:
If we had satisfactory answers to these questions, then we would have a much better understanding of which reward learning methods are reliable (if any). For example, given a reward learning algorithm, we would be able to answer questions such as:
And so on.
I have made substantial progress on most of the questions above, and with a bit more effort, I think we could soon have a complete solution to most (or all) of them.
Motivation and Background
There are several reasons why I am excited about the theoretical reward learning research agenda. First of all, it tackles one of the core difficulties of AI safety in a fairly direct way, namely the difficulty of specifying what we want AI systems to do (aka “outer alignment”). Progress on this research agenda therefore seems likely to be useful in a fairly broad range of scenarios, and applicable to a fairly broad range of AI systems (in contrast to research that is closely tied to specific alignment strategies, specific types of AI systems, or very specific assumptions).
Moreover, I want AI safety solutions that produce explicit, quantitative safety guarantees, underpinned and motivated by explicit, auditable assumptions. I don’t think that purely empirical methods are adequate for producing satisfactory safety assurances for very powerful AI systems. A very mature form of mechanistic interpretability could produce such quantitative safety guarantees, but at the moment, the field of interpretability does not appear to be making progress quickly enough. An alternative is to turn to formal learning theory, which is what the theoretical reward learning research agenda does. Such research can yield quantitative safety guarantees in the form of convergence guarantees, generalisation bounds, regret bounds, and the like, and if such results are formulated properly, then they can provide useful constraints on the behaviour of even superhuman AI systems. This general approach therefore has some hope of providing useful quantitative safety guarantees (in addition to other benefits, which I discuss below).
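To give a sense of the shape such a guarantee might take, here is an informal sketch (my own illustrative phrasing, not a result quoted from any specific paper): if $d$ is a suitable pseudometric on reward functions, then one would like results of the form

$$d(R, \hat{R}) \le \epsilon \;\Longrightarrow\; J_R(\pi^*_R) - J_R(\pi^*_{\hat{R}}) \le C \cdot \epsilon,$$

where $R$ is the true reward, $\hat{R}$ is the learnt reward, $\pi^*_{R'}$ denotes a policy that is optimal for $R'$, $J_R$ is the expected return under $R$, and $C$ is a constant that depends on the environment. In words: a small error in the learnt reward should translate into a bounded regret for the policy trained against it.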
When I explain my research to people, I sometimes get the reaction, “Why work on reinforcement learning? Isn’t that outdated, compared to LLMs?” Let me therefore also preempt this question. In general, there are many ways to get an AI system to do what we want — for example, we can use supervised learning, imitation learning, prompting, or reward maximisation. In some cases we can also use more exotic methods, such as direct manipulation of latent activation vectors in trained models. Moreover, reward maximisation is not always the best (or most practical) method. For example, LLMs are primarily trained through supervised learning, and then fine-tuned through (regularised) reward optimisation. However, the reward optimisation step is not essential, and we can obtain similar results by using only supervised fine-tuning or prompting. So what makes reward functions important?
The fundamental reason why I think reward optimisation is especially important (among all methods for specifying goals to AI systems) is that it is the only method that is not inherently limited by what we already know how to do. For example, suppose that we want to design a chess-playing AI. We can do this using only supervised learning and imitation learning (by training on chess games between strong human players). We may even be able to get slightly superhuman performance this way, by combining the best play of many different human players. However, if we want to design a chess-playing AI that can invent completely new strategies and entirely outclass human chess players, then we must use something analogous to reward maximisation (together with a search algorithm, an RL algorithm, or some other alternative to these). The same principle holds in other domains; without reward optimisation, we are always anchored to current human performance. Right now, LLMs are mainly used to automate things that humans already know how to do. However, when we want to move beyond that, I expect that reward optimisation (or something closely analogous to it) will become important again.
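As a toy illustration of this point, here is a minimal sketch (with a made-up chain environment; none of this is taken from the papers discussed below) in which behavioural cloning of a mediocre “expert” can at best match that expert, while optimising the true reward finds a strictly better policy:

```python
import numpy as np

# A toy 5-state chain MDP (hypothetical, purely for illustration).
# Action 0 moves left, action 1 moves right; reward 1.0 for landing in the rightmost state.
n_states, n_actions, gamma = 5, 2, 0.9
R = np.zeros(n_states)
R[-1] = 1.0

def step(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)

# A suboptimal "expert" that only moves right 60% of the time.
expert = np.zeros((n_states, n_actions))
expert[:, 1] = 0.6
expert[:, 0] = 0.4

# "Imitation learning" here just means copying the expert's action distribution exactly.
imitation = expert.copy()

# Reward optimisation: value iteration on the true reward function.
V = np.zeros(n_states)
for _ in range(200):
    Q = np.array([[R[step(s, a)] + gamma * V[step(s, a)] for a in range(n_actions)]
                  for s in range(n_states)])
    V = Q.max(axis=1)
greedy = np.eye(n_actions)[Q.argmax(axis=1)]  # deterministic greedy policy

def evaluate(policy, episodes=2000, horizon=30, seed=0):
    """Monte Carlo estimate of the expected discounted return from state 0."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        s = 0
        for t in range(horizon):
            a = rng.choice(n_actions, p=policy[s])
            s = step(s, a)
            total += gamma ** t * R[s]
    return total / episodes

print("imitation return:", evaluate(imitation))  # capped by the expert's skill
print("optimised return:", evaluate(greedy))     # exceeds the expert
```

The point is not the specific numbers, but that the imitation policy inherits the expert’s limitations, whereas the reward-optimising policy is limited only by the reward signal and the environment.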
I therefore still expect that we will, at some point, have to provide an AI system with a reward function (or something very similar), and tell it to optimise that reward function within some large and open-ended domain. If that is the case, then we will want to know whether the method that was used to create that reward function is reliable, and what assumptions are required to ensure that it is reliable. We will also want to know what kinds of failure modes we should expect, and whether there are any ways to mitigate those failure modes. Answering those questions in turn requires a better theoretical understanding of the dynamics of reward learning and reward optimisation, in line with the questions I outlined above.
It is also important to note that none of this relies on the assumption that the AI system is trained through reinforcement learning (with Bellman updates, etc.). Note that “reinforcement learning” refers both to a problem setting and to a family of algorithms. The research agenda I have outlined is about the problem setting, not the algorithms. The problem setting is in turn highly general, and essentially covers the domain of all sequential decision problems.
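To be concrete, by the “problem setting” I mean (roughly) the standard formalism of sequential decision problems: a Markov decision process $\langle S, A, \tau, \mu_0, R, \gamma \rangle$, together with the objective of finding a policy $\pi$ that maximises the expected discounted return

$$J(\pi) = \mathbb{E}_{\xi \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})\right].$$

Nothing in this formulation presupposes that $\pi$ is obtained via Bellman updates, policy gradients, or any other particular algorithm.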
Another reason why I am excited about the theoretical reward learning research agenda is that it is (plausibly) important to have a deeper general understanding of the dynamics of pursuing goals, regardless of how exactly those goals are pursued. For example, game theory is highly relevant for understanding the dynamics of conflicts and strategic interactions in many different kinds of situations, even if the agents involved in those situations aren’t explicitly trying to compute Nash equilibria. Similarly, a better understanding of reward optimisation may be informative in many kinds of situations, even if those situations don’t involve agents that explicitly try to optimise reward functions.
Moreover, progress on this research agenda seems to be very tractable. I have personally made notable progress on several of these questions in a fairly short period of time, and I don’t anticipate that there would be any substantial difficulties in making even more progress. If this research agenda gets a bit more attention, then I think it would be entirely realistic to develop something like a “mathematical theory of outer alignment” on a timescale of maybe 1-3 years (and this is not accounting for the possibility of getting strong AI proof assistants within that time span).
That being said, I should also mention that there are (of course) important areas of AI safety that this research agenda doesn’t touch on at all. Most notably, it does not help with interpretability or inner alignment (except possibly that it may have something to say about what kinds of circuits we should look for in AI systems). Moreover, inner alignment may plausibly be a bigger problem than outer alignment (though this is also unclear).
Special Focus On IRL
Many of my recent papers have had a special focus on inverse reinforcement learning (IRL). In this section, I will briefly explain the unifying theme of these papers, and explain why I have focused on these problems in particular.
An IRL algorithm attempts to infer a person’s preferences based on their behaviour — this is usually formalised as the problem of inferring a reward function R from a policy π that has been computed from R. To do this, we need a model of how π relates to R (in other words, a model of how a person’s behaviour relates to their preferences). This is called a behavioural model. The behavioural model may in turn be specified manually, or learnt from data.
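For example, one common (manually specified) behavioural model is Boltzmann rationality, which assumes that the person takes actions with probability proportional to the exponential of their optimal Q-values:

$$\pi(a \mid s) = \frac{\exp(\beta \, Q^*_R(s,a))}{\sum_{a'} \exp(\beta \, Q^*_R(s,a'))},$$

where $\beta$ is a rationality (or “inverse temperature”) parameter. Other common choices include exact optimality and various forms of $\epsilon$-optimality.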
Of course, the relationship between a person’s preferences and their behaviour is in reality extremely complex. Indeed, a complete solution to this problem would also amount to a solution to the problem of how to create AGI. As such, while we can use various methods to create behavioural models that are more and more accurate, we should not expect to be able to create behavioural models that are perfectly accurate (regardless of how they are created). Therefore, we must expect the behavioural model to be misspecified in real-life cases.
This now raises the question of how robust IRL is to misspecification of the behavioural model. Is a mostly accurate behavioural model enough to ensure that the learnt reward function is mostly accurate, or can slight errors in the behavioural model lead to large errors in the inferred reward function? Stated differently, is the error in the inferred reward function somehow proportional to the error in the behavioural model? If it is, then IRL may be a feasible approach to learning representations of human preferences, whereas if it is not, then we should probably give up on IRL. I have dedicated a number of papers to investigating this question.
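To give a flavour of how this question can be made precise (stated informally here, and somewhat simplified compared to the definitions in the papers below): fix a pseudometric $d$ on the space of reward functions, let $f$ be the behavioural model that the IRL algorithm assumes, and let $g$ be the true relationship between rewards and behaviour. We can then say that $f$ is robust to misspecification with $g$ (up to $\epsilon$) if

$$f(R_1) = g(R_2) \;\Longrightarrow\; d(R_1, R_2) \le \epsilon$$

for all reward functions $R_1, R_2$. In other words, whenever the assumed model fits behaviour that was actually generated under the true model, the inferred reward can only be wrong by a bounded amount.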
There are several reasons why I have focused on this question in particular, out of all the potential questions in the theoretical reward learning research agenda. The first reason is (very simply) that it would be very good if we could get IRL to work. IRL is the only reward learning method that is self-supervised, which means that we have far (far) more data for IRL algorithms than we do for other reward learning algorithms (such as RLHF). However, in order to know whether IRL can be trusted, we must first understand how IRL is affected by misspecification. Is a roughly plausible model of bounded rationality enough to learn a pretty good reward function? If so, IRL will probably work! However, if IRL is highly sensitive to misspecification, then it will probably not work.
However, this is not my only reason for working on IRL. Another important reason is that I believe that in order to make deep progress in research, it is often necessary to tackle problems that are sufficiently hard. If a problem is too easy, it may be possible to solve it through shortcuts, whereas if it’s hard enough, then it cannot be solved without also making progress on deeper underlying issues. This “deeper” progress is in turn often useful independently of the particular problem that it was used to solve. The issue of how sensitive IRL is to misspecification is one of the (seemingly) hardest concrete questions I have identified within the theoretical reward learning research agenda, and so it is a good candidate for a problem that is “sufficiently hard”. In other words, I also see the problem of misspecification in IRL as a kind of “research whetstone” which may lead to further and more general progress in the theory of reward learning.
Another reason worth mentioning is that there is a kind of duality between IRL and reward optimisation. In IRL, we go from a policy to a reward function, whereas in reward optimisation, we go from a reward function to a policy (this is, after all, what the word “inverse” refers to in “inverse reinforcement learning”). This means that in order to reason about the properties of IRL, we need to reason about the relationship between reward functions and (different types of) policies. Such reasoning may in turn also be useful for reasoning about the process of producing policies from reward functions. In other words, the specific subproblems involved in determining how sensitive IRL is to misspecified behavioural models overlap heavily with the specific subproblems involved in determining the consequences of incorrectly specified reward functions (and this latter problem, in turn, relates to almost all other questions in the theory of reward learning).
A final reason is that some people in the AI safety research community have a strategy for AI safety that uses IRL as a cornerstone (usually in the form of CIRL). This adds a further reason to scrutinise the overall theoretical feasibility of IRL.
Papers
So far, these are the papers I have written or contributed to that I consider to be part of the theoretical reward learning research agenda. This includes papers written mostly by me, papers written in collaboration with other people, and papers written by other people under my supervision:
Here is also a selection of papers that I was not involved with, but which I consider to contribute to the broader aims of the theoretical reward learning research agenda, and which I want to highlight in particular:
(This list is not exhaustive — more relevant papers may be found here, and in the references of the papers linked above.)
It is worth noting that Partial Identifiability and Misspecification in Inverse Reinforcement Learning is a long paper that combines the results from Invariance in Policy Optimisation and Partial Identifiability in Reward Learning, Misspecification in Inverse Reinforcement Learning, Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification, and STARC: A General Framework For Quantifying Differences Between Reward Functions. If you plan on reading all of these papers in detail, then it may be a good idea to instead just read this paper — it combines all the results in one place, with a unified framing and terminology, has streamlined some things compared to the earlier papers, and also contains a small handful of new results.
The Rest of this Sequence
In the coming entries of this sequence, I will provide in-depth summaries of some of my papers, and explain their setup and results in more detail (but less detail than what is provided in the papers themselves). Specifically, the papers I will cover are:
In addition to this, there will also be a post giving shorter overviews of some other papers and results from this research agenda, without going into as much depth as the posts above. Finally, in the last entry of this sequence, I will attempt to provide some helpful starting points for anyone who may be interested in contributing to this research themselves, in the form of a list of concrete open problems and some tools for approaching them.
If you have any questions, then please ask them in the comments!