I want to thank Sebastian Farquhar, Laurence Midgley and Johan van den Heuvel for feedback and discussion on this post.
Some time ago I asked the question “What is the role of Bayesian ML in AI safety/alignment?”. The responses from the EA and Bayesian ML communities were very helpful. Thus, I decided to collect and distill the answers and provide more context for current and future AI safety researchers.
Clarification: I don’t think many people (<1% of the alignment community) should work on Bayesian ML, nor do I think it is the most promising path to alignment. I just want to provide a perspective and give an overview. I personally am not that bullish on Bayesian ML anymore (see the shortcomings below), but I’m in a relatively unique position where I have a decent overview of both AI safety and the Bayesian ML literature, so I think an overview post like this might be helpful.
A working definition
There is no agreed-upon definition of Bayesian ML. I use the term for systems that broadly have any of the following properties:
Implicitly or explicitly use Bayes theorem.
Approximate and quantify uncertainty for their estimates, e.g. return distributions instead of point estimates, and allow for the specification of prior distributions.
Have a latent state that can be continuously updated with new data without being fully retrained, e.g. via conjugate inference (see the sketch below). This is the vaguest property since it plausibly also applies to pre-training and fine-tuning LLMs, which are usually not seen as explicitly Bayesian algorithms.
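For concreteness, here is a toy sketch of the third property: a conjugate (Beta-Bernoulli) model whose latent state is updated observation by observation, with no retraining. All numbers are made up.

```python
import numpy as np

# Minimal illustration of property 3: a Beta-Bernoulli model whose latent state
# (the Beta parameters) is updated online with each new observation.
alpha, beta = 1.0, 1.0           # uniform prior over the success probability

def update(alpha, beta, observation):
    """One step of exact conjugate inference: Beta(a, b) -> Beta(a + x, b + 1 - x)."""
    return alpha + observation, beta + (1 - observation)

for x in [1, 0, 1, 1, 0, 1]:     # a stream of binary outcomes
    alpha, beta = update(alpha, beta, x)

posterior_mean = alpha / (alpha + beta)
print(f"posterior mean after 6 observations: {posterior_mean:.3f}")  # 5/8 = 0.625
```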
Roles
High-level - Future AI systems might be Bayesian
This section is largely inspired by a response from Emtiyaz Khan and a different response from Sebastian Farquhar.
There are a lot of things that current ML systems do poorly in comparison to humans. They are often not as data-efficient as humans, they don’t generalize well, they are often not robust to adversarial inputs, they often can’t learn during deployment, and so on (none of these properties is strictly necessary for a system to be catastrophically dangerous, so their absence is not a guarantee of safety). However, many of these properties would likely exist in a system that could be called AGI.
The Bayesian framework provides some answers to these problems. Bayesian methods are often more data-efficient, they can be easily updated through Bayes theorem, they are sometimes more robust to adversarial inputs (see here or here), and so on. In practice, it is often hard to build Bayesian methods that fulfill all of these properties, but in theory, they should exist.
Therefore, while current Bayesian systems often underperform compared to their non-Bayesian counterparts, we might have to turn to Bayesian systems in the future if we want to have agents with all of these properties. In this case, the Bayesian framing is a bet on the future trajectory of ML rather than a statement about current AI systems.
Some people within the Bayesian ML community have stated this view in the past and work primarily on Bayesian ML. Emtiyaz Khan, for example, works on the Bayes-duality project which aims to “develop a new learning paradigm for Artificial Intelligence that learns like humans in an adaptive, robust, and continuous fashion”. Andrew Gordon Wilson is one of the leading researchers in Bayesian ML and much of his work is inspired by the problems of current ML systems I described above.
I personally think that these problems are real and important, but I’m not sure that the answer to them has to be Bayesian, or Bayesian in the way we expect. For example, I could imagine that an RL agent becomes Bayesian after sufficiently long training without any explicit Bayesian inductive bias or other Bayesian design choices by the people who train it (see here for evidence in toy models and here for behavioral flags of Bayesian behavior). Furthermore, I want to clarify that I think researchers should adopt a Bayesian mindset, but that doesn’t imply that the system itself has to be explicitly designed in a Bayesian fashion.
Not overfitting to recent trends
You could frame the point above as “not overfitting to recent trends”. On the macro scale, AI has seen many different frameworks, from symbolic AI through probabilistic graphical models to Deep Learning. On a smaller scale, Deep Learning has seen many different tasks like image classification, RL, language modeling, etc., and architectures like CNNs, RNNs, transformers, and many more. All of these trends have their ups and downs, and following the latest hype cycle is often not the optimal strategy if you want to make fundamental discoveries. Therefore, if you think that future AI systems are Bayesian, the performance of today’s systems is merely a small distraction in a decades-long project, and most of the current attention is just one of many hype cycles.
On the other hand, Deep Learning seems to be really powerful, so I don’t expect a completely different framework within the next couple of years and think that most Bayesian ML will be built on top of Deep Learning systems rather than being orthogonal to them.
“Knowing what we don’t know” & Out-of-distribution detection
One of the core applications of Bayesian ML is uncertainty quantification. Often, uncertainty quantification is intended to give you better out-of-distribution (OOD) detection, i.e. the model tells you what it doesn’t know. For example, if you train an image classifier on dogs and cats and you ask it to classify a giraffe, you want the system to be able to tell you “this is neither a dog nor a cat, I don’t know what that is”. In theory, a well-calibrated Bayesian model is a great way to address this problem. In practice, Bayesian NNs are often better calibrated wrt OOD performance than vanilla NNs, but still far from as good as we would want them to be (e.g. comparable to or better than humans), and there are important caveats about what kind of OOD behavior we are talking about (see e.g. here).
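To make this use case concrete, here is a minimal sketch of one popular approximate-Bayesian recipe, MC dropout, used to produce an uncertainty score for OOD detection. The architecture, dimensions and thresholding are purely illustrative, not a recommendation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical classifier with dropout; sizes are illustrative (e.g. flattened images).
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(256, 2),   # two classes: dog vs. cat
)

def predictive_entropy(x, n_samples=30):
    """Approximate the predictive distribution with MC dropout and return its
    entropy as a (rough) uncertainty score."""
    model.train()  # keep dropout active at prediction time
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(x), dim=-1) for _ in range(n_samples)]
        ).mean(dim=0)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

# A giraffe image should (ideally) get a higher score than in-distribution inputs:
# is_ood = predictive_entropy(x) > threshold
```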
One of the core problems of alignment is ensuring robust OOD behavior and robustness against distribution shift, e.g. if you train an agent in a specific environment and then deploy it in a different environment, you want the agent to act “reasonably”. In other words, the fact that you shift the input distribution should not imply that the agent will take extreme and potentially dangerous actions. The important caveat here is that we first want robust goal generalization and then robust capabilities generalization. When capabilities generalize but goals don’t, this is a recipe for failure. Most academics who work on robust generalization work on the capabilities and not the goals part, so working on the goals is likely especially neglected.
There are many active scholars who work on Bayesian ML and OOD detection, such as Agustinus Kristiadi, Andrew Gordon Wilson, Sebastian Nowozin, Vincent Fortuin, Roger Grosse, Alexander Immer and more.
I personally think that OOD robustness is an important topic for alignment, but I’m not sure if Bayesian models are the best answer. I found that the talk “I can’t believe Bayesian DL is not better” gave a good intuition for why some of the current Bayesian methods might not be as good as we would expect.
Learning from human preferences
One possible approach to reduce the misspecification of goal functions is to let the system learn the reward function from human feedback (related to but not the same as inverse reinforcement learning (IRL); see e.g. this overview paper). There are many ways in which this reward function could be learned without Bayesian methods, but I think there are two arguments for why you might want to model this in a Bayesian fashion. Firstly, you might want to model the learned reward function as a distribution over functions rather than one single function. This makes your reward function more robust and enables probabilistic assessments. Secondly, LHF/IRL has to be somewhat data-efficient to be practical because these settings usually don’t come with many datapoints; a user just doesn’t want to teach the model forever before it becomes useful. A straightforward project here would be to apply Bayesian ML to RLHF.
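As a sketch of what such a project could look like, here is one simple way to approximate a distribution over reward functions: a small deep ensemble trained on pairwise preferences via a Bradley-Terry likelihood. All names, shapes and hyperparameters are hypothetical; a more faithfully Bayesian version would replace the ensemble with e.g. a Laplace approximation or a variational posterior over the reward network’s weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_reward_net(obs_dim=64):
    return nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, 1))

# A crude approximation of a posterior over reward functions: an ensemble.
ensemble = [make_reward_net() for _ in range(5)]

def preference_loss(net, seg_a, seg_b, pref):
    """Bradley-Terry likelihood of human preferences over trajectory segments.
    seg_*: [batch, timesteps, obs_dim]; pref: [batch], 1.0 if segment A was preferred."""
    r_a = net(seg_a).squeeze(-1).sum(dim=1)   # total predicted reward of segment A
    r_b = net(seg_b).squeeze(-1).sum(dim=1)
    return F.binary_cross_entropy_with_logits(r_a - r_b, pref)

def reward_with_uncertainty(obs):
    """Mean and spread of the ensemble's reward predictions. The spread can be
    used to act conservatively, e.g. optimize mean - k * std."""
    rewards = torch.stack([net(obs) for net in ensemble])
    return rewards.mean(dim=0), rewards.std(dim=0)
```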
IRL is one possible approach to address the alignment problem and has therefore gotten a lot of attention from AI safety researchers in the past. The Center for Human-Compatible AI, for example, has a long list of papers on IRL, and active members of the safety community such as Rohin Shah, David Lindner and Adam Gleave have worked on it at some point in their careers.
I personally find the approach of IRL theoretically clean and very interesting, and I think that the Bayesian angle could provide some benefits. However, I ultimately expect most human values and value systems to be complex and somewhat inconsistent, so learning them requires models that can represent such complex functions. Currently, foundation models such as GPT-N that are fine-tuned with RLHF seem like the best approach for that (and this approach doesn’t have a Bayesian motivation). Current Bayesian methods, on the other hand, often require tractable distributions (e.g. Gaussians) or hand-crafted models, which I think are not suited for the necessary scale and complexity. However, I could imagine some combination of DL and Bayesian methods providing a decent solution in the long run.
Reward uncertainty in RL
One related idea is to specify the reward as a distribution over functions rather than a single function. A distribution might reduce overfitting and Goodharting and might lead to systems that are more robust to distributional shift. Since the Bayesian approach is a natural first choice for specifying distributions, it could be a good fit for reward uncertainty in RL.
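A hedged sketch of how this could look in practice: instead of maximizing the mean of the reward posterior, optimize a lower quantile of it, so the agent is penalized exactly where the reward model is uncertain. The function name and quantile level are illustrative.

```python
import torch

def conservative_reward(reward_samples, alpha=0.1):
    """reward_samples: [n_posterior_samples, batch] rewards drawn from p(r | data).
    Returns the alpha-quantile per state-action: a pessimistic reward signal that
    discourages exploiting regions where the posterior disagrees with itself."""
    return torch.quantile(reward_samples, q=alpha, dim=0)
```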
Instrumental - Bayesians understand the world better
Even if you think Bayesian ML will play no role in AI safety, some skills are likely helpful and transferable nonetheless.
A deep understanding of probability and statistics: Understanding the basics like probability distributions, hypothesis testing, sampling strategies, etc. is useful even if you never work with Bayesian ML. In the end, ML is mostly designing algorithms that learn from data, and thus understanding these tools is an advantage.
The Bayesian framework: Bayes theorem is simple yet powerful. Viewing the world through priors, likelihoods and evidence is a nice lens to understand it. Probabilistic graphical models and Bayesian hypothesis testing have a similar “clean but powerful” vibe and I personally think I often understand the world better when using these concepts.
Statistical modeling: Many data-driven disciplines such as neuroscience, economics, medicine, etc. have started to use Bayesian statistics because it allows you to express concepts that are hard to express otherwise, e.g. a principled quantification of uncertainty, better model comparisons and explicit priors. Furthermore, some of the bottlenecks of Bayesian methods such as software packages or computing power have been reduced in recent years. I expect this trend to continue; thus, it makes sense to know the basics of Bayesian data analysis when working on practical applications (a tiny worked example follows below this list). I was surprised by how useful the basic Bayesian data science toolkit was for me when doing quantitative AI forecasting, e.g. with Epoch.
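To illustrate the priors-likelihoods-evidence framing with a tiny worked example (the data are made up), here is a grid approximation of a posterior over a rate parameter, including the model evidence one would use for model comparison.

```python
import numpy as np

# Toy Bayesian data analysis: infer a conversion rate from 7 successes in 20 trials.
theta = np.linspace(0.001, 0.999, 999)        # grid over the rate parameter
prior = np.ones_like(theta) / theta.size      # flat prior
likelihood = theta**7 * (1 - theta)**13       # Binomial likelihood (up to a constant)

evidence = np.sum(likelihood * prior)         # p(data), useful for model comparison
posterior = likelihood * prior / evidence

print("posterior mean:", np.sum(theta * posterior))   # ≈ 8/22 ≈ 0.36
```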
Real-world examples
I think a good example of a real-world application of the Bayesian lens is the paper “RL with KL penalties is better viewed as Bayesian inference”. The authors show that a specific technique for training LLMs (KL-regularized RL, as used in RLHF) can also be phrased as a variational inference problem, which provides a neat Bayesian interpretation. I feel like similar situations happen all the time, where people design a specific technique and later realize that it has a relatively clean Bayesian interpretation and thus connects to a lot of other things we already know and value. For example, I found it valuable to think of regularization techniques as priors.
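To make the claim tangible, here is a small numerical sanity check (toy numbers, not from the paper) that the policy maximizing expected reward minus a KL penalty to a reference policy is exactly the reference policy tilted by exp(reward / beta), i.e. a Bayesian-posterior-like object.

```python
import numpy as np

# Toy check (5 actions, made-up numbers): the policy maximizing
#   E_pi[r] - beta * KL(pi || pi0)
# is pi*(a) ∝ pi0(a) * exp(r(a) / beta), i.e. a "posterior" with prior pi0
# and likelihood proportional to exp(r / beta).
pi0 = np.array([0.2, 0.2, 0.2, 0.2, 0.2])   # reference policy (the prior)
r = np.array([1.0, 0.5, 0.0, -0.5, 2.0])    # reward per action
beta = 0.7

post = pi0 * np.exp(r / beta)
post /= post.sum()                           # closed-form optimum

def objective(pi):
    return (pi * r).sum() - beta * (pi * np.log(pi / pi0)).sum()

# Every randomly drawn policy scores at most as high as the tilted prior.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(5), size=1000)
assert all(objective(p) <= objective(post) + 1e-9 for p in candidates)
```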
One important question is, of course, how much predictive power these clean Bayesian interpretations have and I’m personally undecided about that.
Causality
Some people think that causality is a key ingredient both for more capable AI systems and for safer AI systems. In the case of capabilities, some suggest that current systems learn spurious correlations rather than causal relationships, which prevents them from generalizing correctly. In the case of safety, some suggest that agents act according to their incentives, and that to understand and specify incentives correctly, we need to understand their underlying causal mechanisms and the available counterfactual actions.
Causal inference is an active subfield of ML with many active members, such as the group of Bernhard Schölkopf in Tübingen and a lot of other scientists. On the safety side, the group of Tom Everitt at DeepMind spearheads the work on causal incentives to model and investigate the incentives of different agents.
Causality doesn’t strictly require Bayesian inference, but the two overlap. The most common way to model causal relationships, for example, is via structural causal models, which are closely related to Bayes nets (roughly, a Bayes net augmented with causal semantics). Furthermore, causal modeling requires many components from the probabilistic ML toolbox, since causal quantities are expressed via probabilities and distributions. Therefore, a background in Bayesian ML is very helpful for contributing to causal ML research.
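For readers less familiar with this overlap, here is a toy structural causal model (entirely made up). It encodes the same conditional distributions a Bayes net would, plus an intervention semantics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy structural causal model: Z -> X, Z -> Y, X -> Y.
def sample(n, do_x=None):
    z = rng.normal(size=n)
    x = 0.8 * z + rng.normal(size=n) if do_x is None else np.full(n, do_x)
    y = 1.5 * x - 0.5 * z + rng.normal(size=n)
    return z, x, y

# The observational X-Y association mixes the causal effect (1.5) with the
# confounding path through Z; intervening on X isolates the causal effect.
_, _, y1 = sample(100_000, do_x=1.0)
_, _, y0 = sample(100_000, do_x=0.0)
print("average effect of do(X=1) vs do(X=0):", round((y1 - y0).mean(), 2))  # ≈ 1.5
```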
I personally think causal incentives and causal ML could be relevant for safety, but I expect the connection to be very convoluted. For example, I think that the causal models that end up being used in practice (and have the potential to be dangerous) are not neat human-designed statistical causal models. Rather, I expect these causal models to be very messy and stored in the weights of RL agents that learned them by interacting with their environment. Therefore, understanding more about causality could be an important component of AI safety but can’t be applied to state-of-the-art models without advanced interpretability techniques.
Combining Bayes Nets and Neural Networks
There are many synergies between Bayes Nets (or probabilistic graphical models more broadly) and Neural Networks. Bayes Nets work well in the low-data regime, they are often interpretable (though with limits) and allow for the incorporation of priors. NNs, on the other hand, work well with large amounts of unstructured data and are much more scalable. In some sense, the two approaches complement each other. Thus, it seems intuitive that there should be a combination of Bayes Nets and NNs that gets the best of both worlds, e.g. a high-level Bayes Net module that is concerned with abstract reasoning and a low-level NN module that automates perception. One of the reasons why this would be helpful for alignment is that the high-level abstract variables would be more interpretable and controllable (at least that’s the naive hypothesis).
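To give a flavor of the naive hypothesis, here is a deliberately simple sketch of the division of labor: an NN handles perception and a tiny hand-specified Bayes net handles the abstract context. Everything here is hypothetical and far simpler than anything you would use in practice.

```python
import numpy as np

def nn_likelihood(image):
    # stand-in for a trained classifier: p(image | object) for ["cat", "dog"]
    return np.array([0.7, 0.3])

prior_given_location = {              # p(object | location): the high-level variables
    "shelter": np.array([0.5, 0.5]),
    "cat_cafe": np.array([0.95, 0.05]),
}

def posterior(image, location):
    post = nn_likelihood(image) * prior_given_location[location]
    return post / post.sum()          # Bayes rule over the high-level variable

print(posterior(image=None, location="cat_cafe"))  # perception and context combined
```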
Johan and I have tried a minimal version of this approach in Johan's Master's thesis (not yet public) but didn’t think it was very promising. It was hard to get these hybrid systems to train reliably and the final results weren’t much more interpretable than an NN. However, we only explored two possible ways to combine NNs and Bayes Nets, so this shouldn’t be seen as strong evidence. Probably there are better ways that we haven’t considered yet.
Shortcomings of Bayesian ML for AI safety
While Bayesian ML has some nice theoretical properties and framings, I think there are still some fundamental shortcomings. These include:
Bayesian DL tends to lag behind the state of the art by multiple years. Many approaches in Bayesian DL are much more computationally expensive than their non-Bayesian alternatives or not applicable to large state-of-the-art models for other reasons. There are sometimes cheaper approximations but they come with other shortcomings. In general, my current intuition about Bayesian DL is that it is really nice in theory but very hard to get to work well in practice.
Real AGI is probably messy. I expect AGI to be a large messy patchwork of Deep Learning modules. Even if all of these modules are Bayesian wrt their specific loss function, it doesn’t mean that the entire system acts in a Bayesian way or brings any theoretical guarantees. Analogously, you could argue that individual neurons in the brain act in a Bayesian way or that some high-level learning of humans can be phrased in a Bayesian fashion but that doesn’t imply any theoretical guarantees about human behavior.
Bayesian DL doesn’t answer the alignment problems I find most important. I expect that in the long run deception and inner alignment will pose the biggest (not necessarily the most likely) risks from advanced AI. The Bayesian lens might be helpful to think about them but I think it doesn’t provide that much value. Other approaches that are orthogonal to Bayesian ML such as interpretability seem more suited in my opinion. Obviously, the importance of this point strongly hinges on what you think are the most relevant problems in AI safety.
Note that I still think the Bayesian lens is the correct way to think about the world and the right way to do statistics. I’m merely saying that I personally don’t think Bayesian ML will play a big role in alignment.
Implications and conclusions
I have summarized my best understanding of where Bayesian ML could be helpful for AI safety. I once thought that the Bayesian lens might be a good way to address some of the core problems of alignment; this was one of my main motivations for choosing a Ph.D. in the field. After working with Bayesian ML for more than two years now, I feel like there are some interesting ideas and perspectives, but it doesn’t address the most fundamental challenges of AI safety.
Therefore, I see work on Bayesian ML for AI safety as a way to hedge against the failure of other approaches and diversify our bets as a community. Concretely, I think that Bayesian ML would become more relevant if the Deep Learning paradigm breaks and scaling turns out to be insufficient for generalization. In that case, a new, more explicitly Bayesian paradigm could be the answer. However, I don’t think that many AI safety researchers (probably <1%) should work on these topics at this point in time because other approaches just seem much more promising.