In this post, we describe a generalization of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) called crisp supra-MDPs and supra-POMDPs. The new feature of these decision processes is that the stochastic transition dynamics are multivalued, i.e. specified by credal sets. We describe how supra-MDPs give rise to crisp causal laws, the hypotheses of infra-Bayesian reinforcement learning. Furthermore, we discuss how supra-MDPs can approximate MDPs by a coarsening of the state space. This coarsening allows an agent to be agnostic about the detailed dynamics while still having performance guarantees for the full MDP.
Analogously to the classical theory, we describe an algorithm to compute a Markov optimal policy for supra-MDPs with finite time horizons. We also prove the existence of a stationary optimal policy for...
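A minimal sketch of such a finite-horizon algorithm. The toy supra-MDP below (states, actions, rewards, and credal sets are all my own illustrative choices, not from the post) assumes each credal set is specified by finitely many extreme points; since the expected value is linear in the transition distribution, the worst case over the credal set is attained at one of these extremes, so backward induction only needs a minimum over a finite list:

```python
import numpy as np

# Hypothetical toy supra-MDP: 2 states, 2 actions, horizon 3.
# credal[(s, a)] lists the extreme points of the credal set of
# next-state distributions for taking action a in state s.
n_states, n_actions, horizon = 2, 2, 3
credal = {
    (0, 0): [np.array([0.9, 0.1]), np.array([0.6, 0.4])],
    (0, 1): [np.array([0.5, 0.5])],                  # singleton: exact dynamics
    (1, 0): [np.array([0.2, 0.8]), np.array([0.4, 0.6])],
    (1, 1): [np.array([1.0, 0.0])],
}
reward = np.array([[0.0, 1.0],   # reward[s, a]
                   [2.0, 0.0]])

# Backward induction with pessimistic (maxmin) backups:
# V_t(s) = max_a [ r(s, a) + min over the credal set of E[V_{t+1}] ].
V = np.zeros(n_states)           # terminal value is zero
policy = []
for t in range(horizon):
    Q = np.empty((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            worst = min(float(p @ V) for p in credal[(s, a)])
            Q[s, a] = reward[s, a] + worst
    policy.append(Q.argmax(axis=1))  # Markov policy for this time-to-go
    V = Q.max(axis=1)

print(V)  # pessimistic optimal values with the full horizon remaining
```

The minimum inside the backup is what distinguishes this from ordinary value iteration: the agent evaluates each action against the worst transition distribution the credal set allows.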
Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory again.
As a brief summary, I argued that the rise of deep learning posed an existential challenge to the dominant theoretical paradigm of statistical learning theory, because neural networks are enormously expressive and therefore have very high complexity by classical measures. The response from the field was to attempt to quantify other ways in which the hypothesis class of neural networks used in practice is simple, using alternative metrics of complexity. Zhang et al. 2016 showed that standard neural network architectures trained with standard training methods could memorize large quantities of randomly labeled data,...
Yep
Credal sets, a special case of infradistributions[1] in infra-Bayesianism and classical objects in imprecise probability theory, provide a means of describing uncertainty without assigning exact probabilities to events as in Bayesianism. This is significant because, as argued in the introduction to this sequence, Bayesianism is inadequate as a framework for AI alignment research. We will focus on credal sets rather than general infradistributions for simplicity of exposition.
Recall that the total-variation metric is one example of a metric on the set of probability distributions over a finite set. A set is closed with respect to a metric if it contains all of its limit points with respect to that metric. For example, let X = {0, 1}. The set Δ(X) of probability distributions over X is given by Δ(X) = {(p, 1 − p) : p ∈ [0, 1]}.
There is a bijection between Δ(X) and the closed interval [0, 1], which is...
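As a concrete illustration of the total-variation metric on distributions over a two-element set (the numbers below are my own toy example, not from the post):

```python
import numpy as np

# Total-variation distance between distributions over a finite set:
# d_TV(p, q) = (1/2) * sum_x |p(x) - q(x)|, i.e. the largest difference
# in probability that p and q assign to any event.
def tv_distance(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

p = np.array([0.7, 0.3])   # distributions over a two-element set,
q = np.array([0.4, 0.6])   # identified with points of [0, 1] via p -> p(0)
print(tv_distance(p, q))   # agrees with |0.7 - 0.4| under that identification
```

On a two-element set the total-variation distance reduces to the ordinary distance between the corresponding points of the interval, which is one way to see that the bijection respects the metric structure.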
a subset? Why is it not just that product space? I'm assuming it's because this is a set of partial functions, but I don't see how taking a subset lets you account for that.
You're absolutely right, it should be a quotient space, not a subspace. In principle, it can be represented as a closed subspace of the product of copies of
In this case, as written, you don't need to say "An open set is then an arbitrary union of basis elements"
Actually, we do? For example, consider the space
However, thi...
The goal of this post is to give a summary of classical reinforcement learning theory that primes the reader to learn about infra-Bayesianism, which is a new framework for reinforcement learning that aims to solve problems related to AI alignment. We will concentrate on basic aspects of the classical theory that have analogous concepts in infra-Bayesianism, and explain these concepts using infra-Bayesianism conventions. The more technical proofs are contained in the proof section.
For the first part of this sequence and for links to other writings, see What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism.
One special case of reinforcement learning is the case of stochastic bandits. For example, a bird may have a choice of three levers. When the bird steps on a lever, either some...
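A minimal simulation of the levers example. The Bernoulli reward probabilities and the epsilon-greedy learner below are my own illustrative choices (the post's example is truncated here), sketching what "learning which lever is best" looks like in code:

```python
import random

# Hypothetical three-armed Bernoulli bandit matching the bird-and-levers
# story: stepping on lever i yields food (reward 1) with probability
# means[i], otherwise nothing. The means are unknown to the learner.
random.seed(0)
means = [0.2, 0.5, 0.8]

def pull(arm):
    return 1 if random.random() < means[arm] else 0

# Epsilon-greedy learner: usually step on the empirically best lever,
# but explore a uniformly random lever 10% of the time.
counts, totals = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(5000):
    if random.random() < 0.1:
        arm = random.randrange(3)
    else:
        est = [totals[i] / counts[i] if counts[i] else float("inf")
               for i in range(3)]
        arm = est.index(max(est))   # untried levers count as infinitely promising
    r = pull(arm)
    counts[arm] += 1
    totals[arm] += r

best = max(range(3), key=lambda i: totals[i] / counts[i])
print(best)  # with high probability the learner settles on lever 2
```

After a few thousand pulls the empirical means concentrate around the true ones, so the greedy choice almost always lands on the lever with the highest reward probability.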
You're right, it's supposed to be
where
Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call "fitness-seeking"—a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern.
In this piece, I lay out what I take to be the central mechanisms by which fitness-seeking motivations might lead to human disempowerment, and propose mitigations to them. While the analysis is inherently speculative, this kind of speculation seems worthwhile: AI control emerged from explicitly taking scheming motivations seriously and asking what interventions are implied, and my hope is that developing mitigations for fitness-seeking will benefit from similar forward-looking analysis.
Fitness-seekers are, in many ways, notably safer than what I'll...
I think you're making the distinction more confusing than it has to be.
There are things that have motivational pull, and there are things that don't, but I do them anyway because they are a necessary step toward what I actually want.
Say I want to get an apple, and the easiest way to get one is to go to the store and buy some. Going to the store in this story is clearly an instrumental goal, and enjoying eating my apple is a terminal goal.[1]
Things that are instrumental can acquire the property of being terminal by association in our brain, because of how huma...
Double descent is a puzzling phenomenon in machine learning where increasing model size/training time/data can initially hurt performance, but then improve it. Evan Hubinger explains the concept, reviews prior work, and discusses implications for AI alignment and understanding inductive biases.
Shifting the losses by one time step doesn't really matter, since we're mostly interested in the shape of the regret bound, which (up to mild changes in the constants) is not affected by this.