Gradient Hacking describes a scenario where a mesa-optimizer in an AI system acts in a way that intentionally manipulates the way that gradient descent updates it, likely to preserve its own mesa-objective in future iterations of the AI.
See also: Inner Alignment
Gradient Hacking describes a scenario where a mesa-optimizer in an AI system acts in a way that intentionally manipulates the way that gradient descent updates it, likely to preserve its own mesa-objective in future iterations of the AI.
See also: Inner Alignment