This is the appendix to the previous post on Goodhart’s Law and KL regularization, containing all of our proofs.
Theorem about distributions
Theorem 1: Given any heavy-tailed reference distribution Q over R with finite mean μQ, and any M,ϵ>0, there is a distribution P with mean μP>M and DKL(P∥Q)<ϵ.
Proof: WLOG let μQ=0. We construct a sequence of distributions {Pt} such that limt→∞EPt[X]≥c for an arbitrary constant c>0, and limt→∞DKL(Pt∥Q)=0. We define Pt for any t>c as follows. Writing FPt(x) for the CDF PrX∼Pt(X≤x) and ¯FPt(x) for 1−FPt(x), we let

$$\bar{F}_{P_t}(x) = \begin{cases} 1 - \dfrac{1 - c/t}{F_Q(t)}\, F_Q(x) & x \le t \\[8pt] \dfrac{c/t}{\bar{F}_Q(t)}\, \bar{F}_Q(x) & x > t \end{cases}$$

Intuitively, we rescale the part of the distribution to the right of t evenly so that it has total probability c/t, which is less than 1 because t>c.

We first check that limt→∞EPt[X]≥c. We can write

$$\begin{aligned} E_{P_t}[X] &= F_{P_t}(t)\, E_{P_t}[X \mid X \le t] + \bar{F}_{P_t}(t)\, E_{P_t}[X \mid X > t] \\ &= F_{P_t}(t)\, E_Q[X \mid X \le t] + \bar{F}_{P_t}(t)\, E_Q[X \mid X > t] \\ &= F_Q(t)\, E_Q[X \mid X \le t] + \bar{F}_Q(t)\, E_Q[X \mid X > t] + (F_{P_t}(t) - F_Q(t))\, E_Q[X \mid X \le t] + (\bar{F}_{P_t}(t) - \bar{F}_Q(t))\, E_Q[X \mid X > t] \\ &= E_Q[X] + (\bar{F}_{P_t}(t) - \bar{F}_Q(t))\big(E_Q[X \mid X > t] - E_Q[X \mid X \le t]\big) \end{aligned}$$
We know that EQ[X|X>t]>t because it is an average of values strictly greater than t. Because EQ[X]=0 is a weighted average of EQ[X|X>t] and EQ[X|X≤t], and EQ[X|X>t]>0, we know EQ[X|X≤t]<0. So EQ[X|X>t]−EQ[X|X≤t]>t. We also know that for sufficiently large t, (¯FPt(t)−¯FQ(t))>0. Intuitively, starting from Q, which has mean 0, Pt moves a probability mass approaching c/t from mean <0 to mean >t.

Now we can say

$$\begin{aligned} \lim_{t\to\infty} E_{P_t}[X] &\ge \lim_{t\to\infty} \Big[ E_Q[X] + (\bar{F}_{P_t}(t) - \bar{F}_Q(t))(t - 0) \Big] \\ &= \lim_{t\to\infty} \Big( \frac{c}{t} - \bar{F}_Q(t) \Big)\, t \\ &= \lim_{t\to\infty} c - t\, \bar{F}_Q(t) \end{aligned}$$

Because Q has a finite mean, limt→∞t¯FQ(t)=0, and so limt→∞EPt[X]≥c.
Now we check that limt→∞DKL(Pt∥Q)=0:
$$\begin{aligned} D_{KL}(P_t \,\|\, Q) &= \int_{\mathbb{R}} \log\frac{P_t(dx)}{Q(dx)}\, P_t(dx) \\ &= \int_{x \le t} \log\frac{P_t(dx)}{Q(dx)}\, P_t(dx) + \int_{x > t} \log\frac{P_t(dx)}{Q(dx)}\, P_t(dx) \\ &= F_{P_t}(t) \log\frac{F_{P_t}(t)}{F_Q(t)} + \bar{F}_{P_t}(t) \log\frac{\bar{F}_{P_t}(t)}{\bar{F}_Q(t)} \qquad \text{since both ratios are constant} \\ &= F_{P_t}(t) \log\frac{1 - c/t}{F_Q(t)} + \bar{F}_{P_t}(t) \log\frac{\bar{F}_{P_t}(t)}{\bar{F}_Q(t)} \end{aligned}$$
Since both 1−c/t and FQ(t) go to 1 as t→∞, the left term goes to 0, and so
$$\begin{aligned} \lim_{t\to\infty} D_{KL}(P_t \,\|\, Q) &\le 0 + \lim_{t\to\infty} \bar{F}_{P_t}(t) \log\frac{\bar{F}_{P_t}(t)}{\bar{F}_Q(t)} \\ &= \lim_{t\to\infty} \frac{c}{t} \log\frac{c/t}{\bar{F}_Q(t)} \\ &\le \lim_{t\to\infty} \frac{c}{t} \log\frac{1}{\bar{F}_Q(t)} \qquad \text{since } t > c \\ &= \lim_{t\to\infty} -\frac{c}{t} \log\bar{F}_Q(t) \end{aligned}$$
Q is heavy-tailed, so by definition limt→∞eat¯FQ(t)=∞ for all a>0. This implies that for every a>0 there is a sufficiently large ta such that for all t>ta, ¯FQ(t)>e−at, which means that log¯FQ(t)>−at.
Therefore for every a>0, limt→∞DKL(Pt∥Q) ≤ limt→∞−(c/t)log¯FQ(t) ≤ limt→∞(c/t)(at) = ac. Since this holds for every a>0 and KL divergence is nonnegative, limt→∞DKL(Pt∥Q)=0 as desired. ■
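To make the construction concrete, here is a small numerical sketch (an illustration added here, not part of the original proof): take Q to be a Pareto distribution with tail index α=2 (heavy-tailed, finite mean 2), fix c=5, and rescale the tail above t to mass c/t exactly as above. As t grows, the mean of Pt stays above EQ[X]+c while DKL(Pt∥Q) shrinks toward 0.

```python
import math

# Illustrative check of the Theorem 1 construction (assumptions: Q is
# Pareto with alpha = 2 and x_m = 1, and c = 5; these choices are mine).
alpha, c = 2.0, 5.0

def construction_stats(t):
    tail_q = t ** (-alpha)                     # \bar{F}_Q(t)
    cdf_q = 1.0 - tail_q                       # F_Q(t)
    # ∫ x q(x) dx split at t, for q(x) = alpha * x^(-alpha-1) on [1, ∞)
    mean_above = alpha / ((alpha - 1) * t ** (alpha - 1))
    mean_below = alpha / (alpha - 1) - mean_above
    w_below = (1 - c / t) / cdf_q              # constant density ratio Pt/Q on x <= t
    w_above = (c / t) / tail_q                 # constant density ratio Pt/Q on x > t
    mean_pt = w_below * mean_below + w_above * mean_above
    kl = (1 - c / t) * math.log(w_below) + (c / t) * math.log(w_above)
    return mean_pt, kl

for t in [10, 100, 1000, 10_000, 100_000]:
    mean_pt, kl = construction_stats(t)
    print(f"t = {t:>7}: E_Pt[X] = {mean_pt:7.3f}   KL(Pt || Q) = {kl:.5f}")
# E_Pt[X] stays above E_Q[X] + c = 7 while KL(Pt || Q) decreases toward 0.
```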
Theorem about deterministic Markovian-return MDPs
Definition: A deterministic-transition MDP with Markovian returns (DMRMDP) is an MDP (S,A,P,R) such that:
The transition function P:S×A→ΔS is deterministic; i.e., for each state s∈S and action a∈A, there is a unique state s′∈S such that P(s′|s,a)=1.
There is a set of sink states E⊆S that terminate a trajectory, and E is disjoint from the set of start states.
Returns are Markovian; that is, for any two trajectories τ=(s1,a1,…,sn),τ′=(s′1,a′1,…,s′n), if sn=s′n, then τ and τ′ have identical return distributions. Equivalently, for the trajectory random variable T=(S1,A1,…) distributed according to any policy, with return G, G⊥⊥(S<i,A<i)|Si for any i≥1.
Note: Sampling from a language model and applying RLHF is well-modeled as a DMRMDP: the state is the sequence of tokens generated so far, each action (a token) deterministically produces the next state by appending to that sequence, and returns depend only on the final state.
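As a concrete illustration of the definition (a sketch added here; the reward_model argument is a hypothetical stand-in for whatever reward is assigned to a completion), token-by-token sampling fits the DMRMDP shape directly:

```python
from typing import Callable, Tuple

# Sketch of autoregressive sampling as a DMRMDP: the state is the token
# prefix, the action is the next token, transitions are deterministic
# concatenation, and the return depends only on the final (sink) state.
State = Tuple[str, ...]

def transition(state: State, action: str) -> State:
    # Deterministic: (state, action) fully determines the next state.
    return state + (action,)

def is_sink(state: State) -> bool:
    return len(state) > 0 and state[-1] == "<eos>"

def trajectory_return(final_state: State, reward_model: Callable[[State], float]) -> float:
    # Markovian returns: any trajectory reaching this final state gets the
    # same return distribution, since the reward reads only the final state.
    return reward_model(final_state)

state: State = ()
for token in ["The", "cat", "<eos>"]:
    state = transition(state, token)
print(state, is_sink(state))    # ('The', 'cat', '<eos>') True
```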
Theorem 2: Let W=(S,A,P,R) be a deterministic-transition MDP with Markovian returns. Given W we define the function that takes policies to trajectories Tr:(S→ΔA)→Δ(S×A)∗, and the average return function g:(S×A)∗→R which induces a function G:Δ(S×A)∗→ΔR. Let π0:S→ΔA be some reference policy. If G∘Tr(π0) is heavy-tailed with finite mean μQ, then for any M,ϵ>0, there is a policy π with mean return E[U|U∼G∘Tr(π)]>M and Es∈T,T∼Tr(π)[DKL(π(s)∥π0(s))]<ϵ.
Proof: We will exhibit a distribution of trajectories ρ such that DKL(ρ∥Tr(π0))<ϵ and E[G(ρ)]>M, and then construct a policy π with Tr(π)=ρ. Note that this proof applies for continuous action spaces if trajectories are replaced with measurable sets, but this would make it harder to read.
Let ρπ0=Tr(π0). We have a heavy-tailed distribution of return Q≜G(ρπ0) over R, so we can apply Theorem 1. But to define ρ, we construct the Pt from the proof of Theorem 1 in a particular way. For any t>c, we need a Pt that uniformly upweights the return values above t so that ¯FPt(t)=c/t. We can define ρt such that any trajectory τ is reweighted by a factor depending only on its mean return:

$$\rho_t(\tau) = \begin{cases} \dfrac{1 - c/t}{F_Q(t)}\, \rho_{\pi_0}(\tau) & g(\tau) \le t \\[8pt] \dfrac{c/t}{\bar{F}_Q(t)}\, \rho_{\pi_0}(\tau) & g(\tau) > t \end{cases}$$

Then we can let Pt≜G(ρt) and the rest of the proof of Theorem 1 applies. Therefore, applying the theorem, we can let ρ=ρt for sufficiently large t, and then the mean of G(ρ) is greater than M and DKL(G(ρ)∥G(ρπ0))<ϵ. But by the chain rule for KL divergence, DKL(ρ∥ρπ0)=DKL(G(ρ)∥G(ρπ0))+Eγ∼G(ρ)[DKL(ρ(T|G(T)=γ)∥ρπ0(T|G(T)=γ))]. Since we constructed ρ so that the conditional distribution of the trajectory given its return γ is the same under ρ and ρπ0, the second term is zero, and so DKL(ρ∥ρπ0)<ϵ.
Finally, since the KL divergence between trajectory distributions is the expected sum of the KL divergences between the policies at each state in the trajectory, and each trajectory has at least one action, Es∈T,T∼Tr(π)[DKL(π(s)∥π0(s))] ≤ ET∼Tr(π)∑s∈T DKL(π(s)∥π0(s)) = DKL(ρ∥ρπ0) < ϵ as desired.
To define π such that Tr(π)=ρ, we let π(s,a)=Pr(ai=a|τ=(...,s,ai,...)∼ρ).
Then the probability that any trajectory τ=(s1,a1,…,sn,an) is sampled is:

$$\begin{aligned} Tr(\pi)(\tau) &= \prod_{i=1}^n \pi(s_i, a_i) && (1) \\ &= \prod_{i=1}^n \Pr_{\tau' \sim \rho}\big(a'_i = a_i \mid \tau' = (\ldots, s_i, a'_i, \ldots)\big) && (2) \\ &= \prod_{i=1}^n \Pr_{\tau' \sim \rho}\big(a'_i = a_i \mid \tau' = (s'_1, a'_1, \ldots, s_i, a'_i, \ldots),\ s'_{<i} = s_{<i},\ a'_{<i} = a_{<i}\big) && (3) \\ &= \rho(\tau) && (4) \end{aligned}$$

In (2), returns are Markovian, so all trajectory prefixes ending in state si have the same distribution of returns under any policy, and in the construction of ρ, all trajectories with the same mean return are reweighted by the same factor. Therefore, conditioning on the earlier states and actions of τ does not change the measure, which gives (3). Line (3) is the chain-rule factorization of ρ over the actions of τ (transitions are deterministic), which gives (4). So Tr(π)=ρ as desired. ■
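Here is a toy sketch of this construction (an added illustration; the uniform reference policy, the returns, and the threshold t are invented for the example). On a two-step token-style DMRMDP, we reweight trajectories by the two-piece factor above, read off π as the conditional action distribution under ρ, and check that multiplying the per-state probabilities back together recovers ρ:

```python
import itertools

# Toy check of the Theorem 2 construction (all numbers below are invented).
actions = ["a", "b"]
trajs = [tuple(seq) for seq in itertools.product(actions, repeat=2)]
rho_ref = {tau: 0.25 for tau in trajs}                      # Tr(pi_0), uniform
g = {("a", "a"): 0.0, ("a", "b"): 1.0,
     ("b", "a"): 2.0, ("b", "b"): 10.0}                     # mean returns g(tau)

t, c = 5.0, 1.0
tail = sum(p for tau, p in rho_ref.items() if g[tau] > t)   # \bar{F}_Q(t)
head = 1.0 - tail                                           # F_Q(t)
rho = {tau: ((c / t) / tail if g[tau] > t else (1 - c / t) / head) * p
       for tau, p in rho_ref.items()}                       # reweighted rho_t

def policy(prefix, a):
    # pi(a | s): conditional probability under rho that action a follows prefix s.
    through = sum(p for tau, p in rho.items() if tau[:len(prefix)] == prefix)
    takes_a = sum(p for tau, p in rho.items()
                  if tau[:len(prefix)] == prefix and tau[len(prefix)] == a)
    return takes_a / through

for tau in trajs:
    prob = 1.0
    for i in range(len(tau)):
        prob *= policy(tau[:i], tau[i])
    print(tau, round(prob, 4), round(rho[tau], 4))          # Tr(pi)(tau) == rho(tau)
```

In this token-style example every state is reached by a unique prefix, so step (3) holds automatically; in general it is the Markovian-returns condition that licenses it.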
Lagrange multipliers to minimize KL divergence
Theorem 3: If V is light-tailed, EQ[V] is finite, and d=DKL(P∥Q) is bounded, then EP[V] is bounded, and EP[V]→0 as d→0.
Proof: Using Lagrange multipliers, we find that when the KL divergence is minimized subject to a constraint on EP[V], we have P(V)[λ1 log(P(V)/Q(V)) + λ2 − V] = 0 for some constants λ1,λ2, so

$$\begin{aligned} \log\frac{P(V)}{Q(V)} &= \frac{V - \lambda_2}{\lambda_1} \\ P(V) &= Q(V)\exp\Big(\frac{V - \lambda_2}{\lambda_1}\Big) = Q(V)\, e^{V/\lambda_1} e^{-\lambda_2/\lambda_1} = C\, Q(V)\, e^{V/\lambda_1} \end{aligned}$$

That is, the new PDF is an exponential tilting of the old PDF. Now what is EP[V]? It’s just ∫_{−∞}^{∞} C V e^{V/λ1} Q(V) dV. If the distribution of V is heavy-tailed, this is ∞; if it is light-tailed, this is some finite value.
When d=0, P and Q are identical and E[V]=0. So by a continuity argument, EP[V]→0 as d→0. ■
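As a concrete check of the exponential-tilting formula (a sketch under the extra assumption that Q is a standard normal, so EQ[V]=0): tilting a standard normal by e^{V/λ1} gives another normal in closed form, and shrinking the tilt, i.e. the KL budget, drives EP[V] back to 0.

```python
# Exponential tilting of a light-tailed Q = Normal(0, 1): the tilted density
# C * Q(V) * exp(V / lambda1) is Normal(1/lambda1, 1), so E_P[V] = 1/lambda1
# and KL(P || Q) = (1/lambda1)^2 / 2.  (Assumption for this sketch: E_Q[V] = 0.)
for lam1 in [1.0, 2.0, 5.0, 10.0, 100.0]:
    mean_p = 1.0 / lam1
    kl = 0.5 * mean_p ** 2
    print(f"lambda1 = {lam1:>6}: E_P[V] = {mean_p:.4f}   KL(P || Q) = {kl:.6f}")
# For a heavy-tailed Q, by contrast, the integral of V * exp(V/lambda1) * Q(V)
# diverges for every lambda1 > 0.
```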
Light tails + independence imply EV→∞
Theorem 4: If U=X+V with X and V independent and both light-tailed, and the distribution of U is continuous, and π∗(β)≜argmaxπ E[U(π)]−βDKL(π∥π0), then limβ→0+E[V(π∗(β))]=∞.
Proof: Fix some β. Using Lagrange multipliers, we find that for any event S, Prπ(S)∝Prπ0(S)eλU(S). Let c(β) be the median value of U under the policy π∗(β); that is, Pr(U>c(β)|U∼G∘Tr(π∗(β)))=1/2. This exists because U has a continuous distribution. Then:
$$\begin{aligned} E[V \mid \pi] &= \tfrac{1}{2} E[V \mid \pi, U < c] + \tfrac{1}{2} E[V \mid \pi, U \ge c] \ge \tfrac{1}{2} E[V \mid \pi, U < c] + \tfrac{1}{2} E[V \mid \pi] \\ \lim_{\beta\to 0^+} E[V \mid \pi] &\ge \lim_{\beta\to 0^+} \tfrac{1}{2} E[V \mid \pi, U < c] + \lim_{\beta\to 0^+} \tfrac{1}{2} E[V \mid \pi] \end{aligned}$$

The left term is c, while the right term is ∞, so the overall limit is ∞. ■
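The phenomenon behind this result can be checked numerically (an illustrative Monte Carlo sketch added here, with X and V taken to be independent standard normals, which is just one instance of the hypotheses): selecting harder and harder on U=X+V pushes E[V|U≥c] up without bound, roughly c/2 in this instance.

```python
import numpy as np

# Monte Carlo sketch: for independent light-tailed X and V, conditioning on
# U = X + V exceeding a growing cutoff drives E[V | U >= c] upward without bound.
rng = np.random.default_rng(0)
x = rng.standard_normal(2_000_000)
v = rng.standard_normal(2_000_000)
u = x + v

for cutoff in [0.0, 2.0, 3.0, 4.0]:
    mask = u >= cutoff
    print(f"c = {cutoff}: E[V | U >= c] ~ {v[mask].mean():.3f}"
          f"   (tail fraction {mask.mean():.2e})")
```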