This is the appendix to the previous post on Goodhart’s Law and KL regularization, containing all of our proofs.

Theorem about distributions

Theorem 1: Given any heavy-tailed reference distribution $Q$ over $\mathbb{R}$ with mean $\mu$, and any $M \in \mathbb{R}$ and $\epsilon > 0$, there is a distribution $P$ with mean at least $M$ and $D_{KL}(P \| Q) \le \epsilon$.

Proof: WLOG let $\mu = 0$. We construct a sequence of distributions $P_t$ such that $\liminf_{t \to \infty} \mathbb{E}_{X \sim P_t}[X] \ge c$ for any chosen constant $c > 0$, and $\lim_{t \to \infty} D_{KL}(P_t \| Q) = 0$. We define $P_t$ for any $t > c$ thusly. Writing $F$ for the CDF $F(x) = \Pr_{X \sim Q}[X \le x]$ and $\bar F$ for $1 - F$, we let

$$P_t(A) \;=\; \frac{1 - c/t}{F(t)}\, Q\big(A \cap (-\infty, t]\big) \;+\; \frac{c/t}{\bar F(t)}\, Q\big(A \cap (t, \infty)\big)$$

for every measurable $A \subseteq \mathbb{R}$. (Heavy-tailedness guarantees $\bar F(t) > 0$, so this is well-defined.)

Intuitively, we rescale the part of the distribution to the right of $t$ evenly to have total probability $c/t$, which is less than 1 because $t > c$.

We must check that $\liminf_{t \to \infty} \mathbb{E}_{X \sim P_t}[X] \ge c$. We can write

$$\mathbb{E}_{X \sim P_t}[X] \;=\; \Big(1 - \frac{c}{t}\Big)\, \mathbb{E}_Q[X \mid X \le t] \;+\; \frac{c}{t}\, \mathbb{E}_Q[X \mid X > t].$$

We know that $\mathbb{E}_Q[X \mid X > t] \ge t$ because it is an integral of values strictly greater than $t$. Because $0 = \mathbb{E}_Q[X]$ is a weighted average of $\mathbb{E}_Q[X \mid X \le t]$ and $\mathbb{E}_Q[X \mid X > t]$, and $\mathbb{E}_Q[X \mid X > t] \ge t > 0$, we know $\mathbb{E}_Q[X \mid X \le t] \le 0$. So $\big(1 - \frac{c}{t}\big)\,\mathbb{E}_Q[X \mid X \le t] \ge \mathbb{E}_Q[X \mid X \le t]$. We also know that $\bar F(t) < \frac{c}{t}$ for sufficiently large $t$, since $Q$ has a finite mean and so $t\,\bar F(t) \to 0$. Intuitively, starting from $Q$, which has mean 0, $P_t$ moves a probability mass approaching $\frac{c}{t}$ from the region with mean $< 0$ to the region with mean $> t$.

Now we can say

$$\mathbb{E}_{X \sim P_t}[X] \;\ge\; \Big(1 - \frac{c}{t}\Big)\, \mathbb{E}_Q[X \mid X \le t] \;+\; \frac{c}{t}\cdot t \;\ge\; \mathbb{E}_Q[X \mid X \le t] + c.$$

Because $Q$ has a finite mean, $\mathbb{E}_Q[X \mid X \le t] = -\frac{1}{F(t)}\,\mathbb{E}_Q\!\big[X \cdot \mathbf{1}_{X > t}\big] \to 0$ as $t \to \infty$, and so $\liminf_{t \to \infty} \mathbb{E}_{X \sim P_t}[X] \ge c$.

Now we check that $\lim_{t \to \infty} D_{KL}(P_t \| Q) = 0$:

$$D_{KL}(P_t \| Q) \;=\; \Big(1 - \frac{c}{t}\Big) \ln \frac{1 - c/t}{F(t)} \;+\; \frac{c}{t} \ln \frac{c/t}{\bar F(t)}.$$

Since both $F(t)$ and $1 - \frac{c}{t}$ go to $1$ as $t \to \infty$, the left term goes to $0$, and so

$$\limsup_{t \to \infty} D_{KL}(P_t \| Q) \;=\; \limsup_{t \to \infty} \frac{c}{t} \ln \frac{c/t}{\bar F(t)} \;\le\; \limsup_{t \to \infty} \frac{c}{t} \ln \frac{1}{\bar F(t)},$$

where the last step uses $\ln\frac{c}{t} < 0$ for $t > c$.

$Q$ is heavy-tailed, so by definition $\lim_{t \to \infty} e^{\lambda t}\,\bar F(t) = \infty$ for every $\lambda > 0$. This implies that for every $\lambda > 0$ there is a sufficiently large $t_0$ so that $\bar F(t) \ge e^{-\lambda t}$ for all $t \ge t_0$, which means that $\frac{c}{t} \ln \frac{1}{\bar F(t)} \le \frac{c}{t}\cdot \lambda t = c\lambda$ for all $t \ge t_0$.

Therefore $\limsup_{t \to \infty} D_{KL}(P_t \| Q) \le c\lambda$ for every $\lambda > 0$, which since KL divergence is nonnegative means that $\lim_{t \to \infty} D_{KL}(P_t \| Q) = 0$ as desired. Taking $c > M$ and then $t$ large enough that $\mathbb{E}_{X \sim P_t}[X] \ge M$ and $D_{KL}(P_t \| Q) \le \epsilon$ gives the distribution $P$ required by the theorem.
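To make the construction concrete, here is a minimal numerical sketch (the Pareto reference distribution, the shape $\alpha = 1.5$, the target increase $c = 10$, and the grid of $t$ values are all illustrative assumptions): it places mass $c/t$ on the tail beyond $t$ exactly as above and reports the resulting mean and KL divergence.

```python
# Numerical sketch of the Theorem 1 construction with a Pareto(alpha) reference
# distribution (alpha, c, and the grid of t values are illustrative choices).
import math

alpha = 1.5                      # heavy-tailed with finite mean for alpha > 1
mu = alpha / (alpha - 1)         # mean of the reference distribution Q

def survival(t):                 # F-bar(t) = Pr_Q[X > t] for Pareto with x_min = 1
    return t ** (-alpha)

def mean_above(t):               # E_Q[X | X > t]
    return alpha * t / (alpha - 1)

def mean_below(t):               # E_Q[X | X <= t]
    return (mu - survival(t) * mean_above(t)) / (1.0 - survival(t))

def tilted_mean_and_kl(t, c):
    """Mean and KL(P_t || Q) for the P_t that puts mass c/t on {X > t}."""
    delta, F, Fbar = c / t, 1.0 - survival(t), survival(t)
    mean = (1 - delta) * mean_below(t) + delta * mean_above(t)
    kl = (1 - delta) * math.log((1 - delta) / F) + delta * math.log(delta / Fbar)
    return mean, kl

c = 10.0                         # target increase of the mean over mu
for t in (1e2, 1e4, 1e6, 1e8):
    mean, kl = tilted_mean_and_kl(t, c)
    print(f"t={t:.0e}  mean={mean:7.2f}  (mu + c = {mu + c:.2f})  KL={kl:.6f}")
```

With these numbers the mean stays above $\mu + c$ at every $t$ while the KL divergence falls toward zero, which is the tradeoff the proof exploits.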

Theorem about deterministic Markovian-return MDPs

Definition: A deterministic-transition MDP with Markovian returns (DMRMDP) is an MDP $(S, A, T, R)$ such that:

  • The transition function $T$ is deterministic, i.e., for each state $s$ and action $a$, there exists a unique state $s'$ such that $T(s' \mid s, a) = 1$.
  • There is a set of sink states $S_{\mathrm{sink}} \subseteq S$ that terminate a trajectory, which is disjoint from the set of start states.
  • Returns are Markovian; that is, for any two trajectories $\tau = (s_0, a_0, \ldots, s_n)$ and $\tau' = (s'_0, a'_0, \ldots, s'_m)$, if $s_n = s'_m$, then $\tau$ and $\tau'$ have identical return distributions. Equivalently, for the trajectory random variable $\mathcal{T} = (S_0, A_0, \ldots, S_N)$ distributed according to any policy, with return $G$, we have $\Pr[G = g \mid \mathcal{T}] = \Pr[G = g \mid S_N]$ for any $g$.

Note: Sampling from a language model and applying RLHF is well-modeled as a DMRMDP, since the state is a sequence of tokens (actions) which results deterministically from the previous state and the last token, and returns depend only on the final state.
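As a concrete rendering of this note, the sketch below treats token sequences as states; the vocabulary, length cap, and toy reward are hypothetical choices made for the example.

```python
# Minimal sketch of the language-model setting as a DMRMDP (all specifics here
# are illustrative assumptions).
import random

EOS, MAX_LEN, VOCAB = "<eos>", 8, ("the", "cat", "sat", "<eos>")

def step(state: tuple, action: str) -> tuple:
    """Deterministic transition: the next state is the sequence plus the new token,
    so T(s' | s, a) = 1 for exactly one successor s'."""
    return state + (action,)

def is_sink(state: tuple) -> bool:
    """Sink states end the trajectory: an <eos> token or hitting the length cap."""
    return len(state) >= MAX_LEN or (len(state) > 0 and state[-1] == EOS)

def trajectory_return(final_state: tuple) -> float:
    """Stand-in reward model scoring only the finished sequence, so returns are
    Markovian (they depend on the trajectory only through its final state)."""
    return -float(len(final_state))          # toy preference for short outputs

random.seed(0)
state = ()                                   # start state: the empty sequence
while not is_sink(state):
    state = step(state, random.choice(VOCAB))   # a uniform stand-in policy
print(state, trajectory_return(state))
```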

Theorem 2: Let $\mathcal{M}$ be a deterministic-transition MDP with Markovian returns. Given $\mathcal{M}$ we define the function that takes policies to trajectory distributions, $\mathrm{Tr} : \Pi \to \Delta(\mathcal{T})$, and the average return function $g : \mathcal{T} \to \mathbb{R}$, which induces a function $G : \Delta(\mathcal{T}) \to \Delta(\mathbb{R})$ (the pushforward taking a distribution over trajectories to its distribution of mean returns). Let $\pi_{\mathrm{ref}}$ be some reference policy. If $G(\mathrm{Tr}(\pi_{\mathrm{ref}}))$ is heavy-tailed with finite mean $\mu$, then for any $M \in \mathbb{R}$ and $\epsilon > 0$, there is a policy $\pi$ with mean return at least $M$ and $D_{KL}(\pi \| \pi_{\mathrm{ref}}) \le \epsilon$.

Proof: We will exhibit a distribution of trajectories $T^+ \in \Delta(\mathcal{T})$ such that $\mathbb{E}_{\tau \sim T^+}[g(\tau)] \ge M$ and $D_{KL}(T^+ \| \mathrm{Tr}(\pi_{\mathrm{ref}})) \le \epsilon$, and then construct a policy $\pi$ with $\mathrm{Tr}(\pi) = T^+$. Note that this proof applies for continuous action spaces if trajectories are replaced with measurable sets, but this would make it harder to read.

Let $Q = G(\mathrm{Tr}(\pi_{\mathrm{ref}}))$. We have a heavy-tailed distribution of return $Q$ over $\mathbb{R}$, so we can apply Theorem 1. But to define $\pi$, we can construct $P_t$ in the proof of Theorem 1 in a particular way. For any $t$, we need a $P_t$ that uniformly upweights values of mean return greater than $t$ so that their total probability becomes $c/t$. We can define $T_t \in \Delta(\mathcal{T})$ such that any trajectory $\tau$ is upweighted by a factor depending only on its mean return:

$$T_t(\tau) \;=\; \mathrm{Tr}(\pi_{\mathrm{ref}})(\tau) \cdot \begin{cases} \dfrac{c/t}{\bar F(t)} & \text{if } g(\tau) > t, \\[2mm] \dfrac{1 - c/t}{F(t)} & \text{if } g(\tau) \le t, \end{cases}$$

where $F$ and $\bar F$ are the CDF and survival function of $Q$ as in Theorem 1.

Then we can let $P_t = G(T_t)$ and the rest of the proof of Theorem 1 applies. Therefore, applying the theorem, we can let $T^+ = T_t$ for sufficiently large $t$, and then $\mathbb{E}_{\tau \sim T^+}[g(\tau)] \ge M$ and $D_{KL}\big(G(T^+) \,\|\, G(\mathrm{Tr}(\pi_{\mathrm{ref}}))\big) \le \epsilon$. But by the chain rule for KL divergence,

$$D_{KL}(T^+ \| \mathrm{Tr}(\pi_{\mathrm{ref}})) \;=\; D_{KL}\big(G(T^+) \,\|\, G(\mathrm{Tr}(\pi_{\mathrm{ref}}))\big) \;+\; \mathbb{E}_{v \sim G(T^+)}\Big[D_{KL}\big(T^+(\cdot \mid g(\tau) = v) \,\big\|\, \mathrm{Tr}(\pi_{\mathrm{ref}})(\cdot \mid g(\tau) = v)\big)\Big].$$

Since we constructed $T_t$ so that the probabilities of each $\tau$ conditional on its mean return being $v$ are equal under $T^+$ and $\mathrm{Tr}(\pi_{\mathrm{ref}})$, the second term is zero, and we also have $D_{KL}(T^+ \| \mathrm{Tr}(\pi_{\mathrm{ref}})) \le \epsilon$.

Finally, since the KL divergence between trajectory distributions is the expected sum of the KL divergences between the policies at each action in the trajectory, and each trajectory has at least one action, $D_{KL}(\pi \| \pi_{\mathrm{ref}}) \le D_{KL}(\mathrm{Tr}(\pi) \| \mathrm{Tr}(\pi_{\mathrm{ref}})) \le \epsilon$ as desired.
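This decomposition of trajectory-level KL into per-state action KLs can be checked directly on a toy example. The sketch below (the two-token horizon and the particular policy probabilities are arbitrary choices for the demo) enumerates a deterministic-transition MDP and confirms that the KL divergence between trajectory distributions equals the expected sum of per-state action KL divergences.

```python
import itertools
import math

ACTIONS = ("a", "b")
HORIZON = 2                      # every trajectory takes exactly two actions

def pi(state):                   # tuned policy (probabilities are arbitrary choices)
    p_a = 0.8 if len(state) == 0 else 0.6
    return {"a": p_a, "b": 1.0 - p_a}

def pi_ref(state):               # reference policy: uniform at every state
    return {"a": 0.5, "b": 0.5}

def traj_prob(policy, traj):
    """Probability of a trajectory; the deterministic transition appends the token."""
    p, state = 1.0, ""
    for a in traj:
        p *= policy(state)[a]
        state += a
    return p

def action_kl(p, q):
    return sum(p[a] * math.log(p[a] / q[a]) for a in ACTIONS)

trajs = ["".join(t) for t in itertools.product(ACTIONS, repeat=HORIZON)]

# KL divergence between the two trajectory distributions, by direct enumeration.
kl_traj = sum(traj_prob(pi, t) * math.log(traj_prob(pi, t) / traj_prob(pi_ref, t))
              for t in trajs)

# Expected sum of per-state action KL divergences along trajectories from pi.
kl_per_step = 0.0
for t in trajs:
    state, total = "", 0.0
    for a in t:
        total += action_kl(pi(state), pi_ref(state))
        state += a
    kl_per_step += traj_prob(pi, t) * total

print(kl_traj, kl_per_step)      # the two quantities coincide
```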

To define $\pi$ such that $\mathrm{Tr}(\pi) = T^+$, we let $\pi(a \mid s) = \Pr_{\tau \sim T^+}\big[\tau \text{ takes action } a \text{ at } s \,\big|\, \tau \text{ visits } s\big]$.

Then the probability that any trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots, s_n)$ is sampled is:

$$
\begin{aligned}
\mathrm{Tr}(\pi)(\tau) &= \prod_{i=0}^{n-1} \pi(a_i \mid s_i) && (1) \\
&= \prod_{i=0}^{n-1} \Pr_{\tau' \sim T^+}\big[\tau' \text{ takes } a_i \text{ at } s_i \,\big|\, \tau' \text{ visits } s_i\big] && (2) \\
&= \prod_{i=0}^{n-1} \Pr_{\tau' \sim T^+}\big[\tau' \text{ takes } a_i \text{ at } s_i \,\big|\, \tau' \text{ begins with } (s_0, a_0, \ldots, s_i)\big] && (3) \\
&= T^+(\tau). && (4)
\end{aligned}
$$

In (2), returns are Markovian, so all trajectory prefixes ending in state $s_i$ have the same distribution of returns under any policy. In the construction of $T^+$, all trajectories with the same mean return have equal measure relative to $\mathrm{Tr}(\pi_{\mathrm{ref}})$. Therefore, conditioning on earlier states and actions of $\tau$ does not change the measure, so we can write (3). Step (4) follows because, with deterministic transitions, the product of these conditional probabilities telescopes into the probability of the whole trajectory. So $\mathrm{Tr}(\pi) = T^+$ as desired.
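Here is a small self-contained sketch of this construction (the uniform reference policy, the two-token episodes, and the exponential return-dependent weights are assumptions made for the example): it reweights a reference trajectory distribution by a factor of its return, reads off the induced policy as above, and checks that the policy regenerates the reweighted distribution.

```python
import itertools
import math

ACTIONS = ("a", "b")

def mean_return(final_state):        # Markovian return: a function of the final state
    return final_state.count("a")    # toy choice: count of "a" tokens

# Reference trajectory distribution: a uniform policy over two-token episodes.
trajs = ["".join(t) for t in itertools.product(ACTIONS, repeat=2)]
T_ref = {t: 0.25 for t in trajs}

# Reweight every trajectory by a factor that depends only on its return.
weights = {t: T_ref[t] * math.exp(mean_return(t)) for t in trajs}
Z = sum(weights.values())
T_plus = {t: w / Z for t, w in weights.items()}

def policy(state):
    """pi(a | s): mass of T_plus trajectories that extend s with a, given they reach s."""
    reach = sum(p for t, p in T_plus.items() if t.startswith(state))
    return {a: sum(p for t, p in T_plus.items() if t.startswith(state + a)) / reach
            for a in ACTIONS}

def traj_prob(traj):                 # probability of traj under the induced policy
    p, state = 1.0, ""
    for a in traj:
        p *= policy(state)[a]
        state += a
    return p

for t in trajs:                      # Tr(pi) reproduces T_plus, as the proof claims
    print(t, round(T_plus[t], 6), round(traj_prob(t), 6))
```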

Lagrange multipliers to minimize KL divergence

Theorem 3: If $V$ is light-tailed under the reference distribution $Q$, $\mathbb{E}_Q[V]$ is finite, and $D_{KL}(P \| Q)$ is bounded, then $\mathbb{E}_P[V]$ is bounded, and $\mathbb{E}_P[V] \to \mathbb{E}_Q[V]$ as $D_{KL}(P \| Q) \to 0$.

Using Lagrange multipliers, we find that when KL divergence is minimized subject to a constraint on $\mathbb{E}_P[V]$, we have $\ln \frac{p(v)}{q(v)} = \beta v - \ln Z$ for some constants $\beta, Z$, so

$$p(v) \;=\; \frac{q(v)\, e^{\beta v}}{Z}.$$

That is, the new PDF is an exponential tilting of the old PDF. Now what is $Z$? It's just the normalizing constant $\mathbb{E}_{V \sim Q}\big[e^{\beta V}\big]$. If the distribution of $V$ is heavy-tailed, this is $\infty$ for every $\beta > 0$; if it is light-tailed, this is some finite value for sufficiently small $\beta > 0$.

When $\beta = 0$, $P$ and $Q$ are identical and $\mathbb{E}_P[V] = \mathbb{E}_Q[V]$. So by a continuity argument, $\mathbb{E}_P[V] \to \mathbb{E}_Q[V]$ as $D_{KL}(P \| Q) \to 0$.
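A small numerical sketch of this tilting argument, using a geometric reference distribution (the distribution and the grid of $\beta$ values are assumptions for the demo): the normalizer is finite because the geometric distribution is light-tailed, the tilted mean rises with $\beta$, and the KL cost shrinks to zero as $\beta \to 0$, matching the continuity claim. For a heavy-tailed reference (say a power-law pmf), the sum defining $Z$ would diverge for every $\beta > 0$.

```python
# Exponential tilting of a light-tailed reference distribution: Geometric(r)
# with pmf q(k) = (1 - r) r^k for k = 0, 1, 2, ...  (r and the betas are
# illustrative choices).
import math

R = 0.5
BASE_MEAN = R / (1 - R)               # mean of the untilted distribution

def tilt(beta, r=R):
    """Mean and KL(P || Q) for the tilted pmf p(k) = q(k) e^{beta k} / Z."""
    ratio = r * math.exp(beta)
    assert ratio < 1.0, "Z diverges: beta is too large for this reference"
    Z = (1 - r) / (1 - ratio)         # sum_k (1 - r) (r e^beta)^k
    mean = ratio / (1 - ratio)        # the tilt of a geometric is geometric
    kl = beta * mean - math.log(Z)    # E_p[log(p/q)] = beta E_p[K] - log Z
    return mean, kl

print(f"base mean = {BASE_MEAN:.3f}")
for beta in (0.0, 0.05, 0.1, 0.2, 0.4, 0.6):
    mean, kl = tilt(beta)
    print(f"beta={beta:.2f}  tilted mean={mean:7.3f}  KL={kl:.5f}")
```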

Light tails + independence imply 

Theorem 4: If $U = V + X$ with $V$ and $X$ both light-tailed, and the distribution of $U$ is continuous, and $V$ and $X$ are independent, then .

Proof: Fix some . Using Lagrange multipliers, we find that for any event . Let $m$ be the median value of $U$ under the policy; that is, $\Pr[U \ge m] = \tfrac{1}{2}$. This exists because $U$ has a continuous distribution. Then:

The left term is , while the right term is , so the overall limit is 
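A Monte Carlo sketch of the light-tails-plus-independence setting (the choice $V, X \sim \mathcal{N}(0,1)$ independent is an assumption for the demo, not the theorem's exact setting): selecting for a high proxy value $U = V + X$ also drives up the conditional mean of the true value $V$.

```python
# Monte Carlo: conditioning on a high proxy U = V + X when V and X are
# independent and light-tailed (standard normals here, an illustrative choice).
import random
import statistics

random.seed(0)
SAMPLES = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1_000_000)]

for threshold in (0, 1, 2, 3, 4):
    kept = [v for v, x in SAMPLES if v + x >= threshold]
    print(f"t={threshold}  E[V | U >= t] ~ {statistics.fmean(kept):.3f}  "
          f"(n={len(kept)})")
```

With these particular distributions the conditional mean of $V$ grows roughly like $t/2$; with a heavy-tailed error term the selected mass would instead come from the error's tail, which is the failure mode the rest of the post is about.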
