Challenge: construct a Gradient Hacker

Thomas Larsen; Thomas Kwa

This is a relatively clean subproblem that we came upon a few months ago while thinking about gradient hacking. We're throwing it out to the world to see if anyone can make progress.

Problem: Construct a gradient hacker (definition below), or prove that one cannot exist under the given conditions.

Setup: Let $x \in \mathbb{R}^{d}$ be the network inputs, $θ \in \mathbb{R}^{w}$ be the network's trainable parameters, $\hat{y} \in \mathbb{R}$ be the network outputs, and $f (x, θ) = \hat{y}$ be the network architecture. $f (x, θ)$ must be a feedforward neural network that reads in its own weights as input. Weight sharing is allowed (vertical or horizontal), and some weight sharing is necessary in order to take all $w$ weights as input.

Suppose that we train this network on a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, with MSE loss $L (θ) = \frac{1}{n} \sum_{i} (f (x^{(i)}, θ) - y^{(i)})^{2}$, under gradient flow: $θ^{'} (t) = - \nabla_{θ} L (θ (t))$. Given an initial point $θ_{0}$, the training dynamics are therefore:

$$θ (t) = θ_{0} - \int_{0}^{t} \nabla_{θ} L (θ (s)) \, d s$$
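For anyone who wants to poke at this numerically, here is a minimal sketch of the training dynamics using forward Euler steps in place of the continuous flow. Everything named here is an assumption for illustration: `toy_model` is an affine stand-in that does not read its own weights, the gradient is a finite-difference estimate, and `dt` and `steps` are arbitrary choices.

```python
from typing import Callable, List, Sequence, Tuple

Params = List[float]
Dataset = Sequence[Tuple[float, float]]
Model = Callable[[float, Params], float]


def mse_loss(f: Model, theta: Params, data: Dataset) -> float:
    """L(theta) = (1/n) * sum_i (f(x_i, theta) - y_i)^2."""
    return sum((f(x, theta) - y) ** 2 for x, y in data) / len(data)


def numerical_grad(f: Model, theta: Params, data: Dataset, h: float = 1e-6) -> Params:
    """Central-difference estimate of grad_theta L(theta)."""
    grad = []
    for j in range(len(theta)):
        up = theta[:j] + [theta[j] + h] + theta[j + 1:]
        down = theta[:j] + [theta[j] - h] + theta[j + 1:]
        grad.append((mse_loss(f, up, data) - mse_loss(f, down, data)) / (2 * h))
    return grad


def gradient_flow(f: Model, theta0: Params, data: Dataset,
                  dt: float = 0.01, steps: int = 5000) -> Params:
    """Forward-Euler approximation of theta'(t) = -grad L(theta(t)) up to t = dt * steps."""
    theta = list(theta0)
    for _ in range(steps):
        g = numerical_grad(f, theta, data)
        theta = [p - dt * gj for p, gj in zip(theta, g)]
    return theta


def toy_model(x: float, theta: Params) -> float:
    # Hypothetical stand-in: an affine map, NOT a network that reads its own weights.
    return theta[0] * x + theta[1]


toy_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # generated by y = 2x + 1
theta_final = gradient_flow(toy_model, [0.0, 0.0], toy_data)
print(theta_final)  # approaches [2.0, 1.0] on this convex toy problem
```

Swapping `toy_model` for an architecture that actually takes $θ$ as input is where the real work of the challenge starts; the Euler loop itself would stay the same.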

Definition: A tuple $(f, θ_{g}, D)$ is a gradient hacker if the following conditions hold:

$θ_{g}$ is not a global optimum: $\exists θ$ such that $L (θ) < L (θ_{g})$
$\exists ϵ > 0$ such that $\forall θ_{0}$ with $\| θ_{0} - θ_{g} \| < ϵ$:
1. The network’s training converges to $θ_{g}$, i.e., $\lim_{t \to \infty} θ (t) = θ_{g}$.
2. There is internal structure of $f (\cdot, θ_{0})$ that computes the gradient; i.e., there is some subset of the activations whose mean over every datapoint in $D$ is $\nabla_{θ_{0}} L (θ_{0})$. ^[1]
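To make the non-optimality requirement and condition 1 concrete, here is a sketch on a one-parameter toy loss. It is not a neural network and says nothing about condition 2. The polynomial `toy_output`, the single datapoint with target 0, and the step sizes are all assumptions picked so the arithmetic is easy to check by hand.

```python
def toy_output(theta: float) -> float:
    # Hypothetical scalar "model" that ignores x: f(theta) = theta^3 - 3*theta + 3.
    return theta ** 3 - 3.0 * theta + 3.0


def loss(theta: float) -> float:
    # MSE with a single datapoint whose target is y = 0.
    return toy_output(theta) ** 2


def loss_grad(theta: float) -> float:
    # dL/dtheta = 2 * f(theta) * f'(theta), with f'(theta) = 3*theta^2 - 3.
    return 2.0 * toy_output(theta) * (3.0 * theta ** 2 - 3.0)


def flow(theta0: float, dt: float = 0.01, steps: int = 5000) -> float:
    # Forward-Euler stand-in for the continuous gradient flow.
    theta = theta0
    for _ in range(steps):
        theta -= dt * loss_grad(theta)
    return theta


theta_g = 1.0  # candidate point: f'(1) = 0 and f(1) = 1, so L(theta_g) = 1

# Not a global optimum: some other theta has strictly lower loss.
print(loss(-2.1) < loss(theta_g))  # True

# Local convergence, spot-checked at a few starts with |theta0 - theta_g| < 0.2.
# A finite sample like this is evidence, not a proof of the "for all theta0" clause.
for theta0 in (0.85, 0.95, 1.05, 1.15):
    print(theta0, flow(theta0))
```

The hard part of the challenge is getting the same behavior out of a feedforward network that also exposes its gradient in its activations, which this toy does not attempt.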

This captures my intuition that a gradient hacker knows where it wants to go (in this case "get to $θ_{g}$ "), and then it should decide what it outputs in order to make the gradient true.

Some more ambitious problems (if gradient hackers exist):

Characterize the set of all gradient hackers.
Show that they all must satisfy some property.
Construct gradient hackers for arbitrarily large $n$ , $d$ , $w$ , and neural net depth.
Variations on the problem: a subset of the activations equals $\nabla_{θ_{0}} L (θ_{0})$ for every input, or the subset of activations correspond to the gradient on that input.

^[1] This is a bit strict, but we didn't want to list different ways something could be isomorphic to the gradient.

Seems like the easiest way to satisfy that definition would be to:

1. Set up a network and dataset with at least one local minimum which is not a global minimum.
2. Then add an intermediate layer which estimates the gradient, and doesn't connect to the output at all.

My troll example is a fully connected network with all zero weights and biases, no skip connections.

This isn't something that you'd reach in regular training, since networks are initialized away from zero to avoid this. But it does exhibit a basic ingredient in controlling the gradient flow.
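A quick numerical check of why the all-zero point stalls, as a sketch under assumptions that are mine rather than the comment's: two hidden layers of width 1, tanh activations, and finite-difference gradients. With zero weights every hidden activation is zero, so no weight or hidden bias gets any gradient. The output bias is the exception: its gradient is $-2$ times the mean target, so the point is only stationary when the targets are centered. This checks stationarity only, not whether nearby initializations flow back to zero.

```python
import math
from typing import List, Sequence, Tuple

Dataset = Sequence[Tuple[float, float]]


def tiny_mlp(x: float, theta: Sequence[float]) -> float:
    # Hypothetical architecture; theta = [w1, b1, w2, b2, w3, b3].
    w1, b1, w2, b2, w3, b3 = theta
    h1 = math.tanh(w1 * x + b1)
    h2 = math.tanh(w2 * h1 + b2)
    return w3 * h2 + b3


def mse(theta: Sequence[float], data: Dataset) -> float:
    return sum((tiny_mlp(x, theta) - y) ** 2 for x, y in data) / len(data)


def numerical_grad(theta: Sequence[float], data: Dataset, h: float = 1e-5) -> List[float]:
    # Central-difference estimate of the gradient of the MSE loss.
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += h
        down[j] -= h
        grad.append((mse(up, data) - mse(down, data)) / (2 * h))
    return grad


zeros = [0.0] * 6
centered = [(-1.0, -1.0), (0.5, 2.0), (2.0, -1.0)]  # targets average to 0
shifted = [(-1.0, 1.0), (0.5, 2.0), (2.0, 3.0)]     # targets average to 2

grad_centered = numerical_grad(zeros, centered)
grad_shifted = numerical_grad(zeros, shifted)
print(grad_centered)  # every entry is (numerically) zero
print(grad_shifted)   # only the last entry, the output bias b3, is nonzero
```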

To look for a true hacker I'd try to reconfigure the way the downstream computation works (by modifying attention weights, saturating ReLUs, or similar) based on some model of the inputs, in a way that pushes around where the gradients go.
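As a small illustration of the "saturating ReLUs" ingredient, here is a sketch of an input-dependent gate that decides which datapoints are allowed to send gradient into a branch. The model, the parameter names (`w_gate`, `b_gate`, `w_branch`), and the two datapoints are all made up for the example, and this is nowhere near a gradient hacker by the definition in the post. It only shows that a ReLU held in its flat region zeroes out the per-example gradient for every parameter behind it, including the gate's own.

```python
from typing import List, Sequence


def relu(z: float) -> float:
    return z if z > 0.0 else 0.0


def gated_model(x: float, theta: Sequence[float]) -> float:
    # theta = [w_gate, b_gate, w_branch]; all names are illustrative.
    w_gate, b_gate, w_branch = theta
    gate = relu(w_gate * x + b_gate)  # input-dependent gate
    return gate * (w_branch * x)      # the branch only matters when the gate is open


def per_example_grad(theta: Sequence[float], x: float, y: float, h: float = 1e-6) -> List[float]:
    """Central-difference gradient of the squared error on a single datapoint."""
    def err(t: Sequence[float]) -> float:
        return (gated_model(x, t) - y) ** 2

    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += h
        down[j] -= h
        grad.append((err(up) - err(down)) / (2 * h))
    return grad


theta = [1.0, -1.0, 0.7]  # gate pre-activation is x - 1: closed for x < 1, open for x > 1
grad_closed = per_example_grad(theta, 0.5, 1.0)  # gate closed on this input
grad_open = per_example_grad(theta, 2.0, 1.0)    # gate open on this input
print(grad_closed)  # [0.0, 0.0, 0.0]: this datapoint cannot move any parameter
print(grad_open)    # nonzero: this datapoint does
```

So whatever sets the gate's pre-activation effectively chooses which datapoints get a say in the update, which is the kind of lever the comment is gesturing at.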


AI ALIGNMENT FORUM