AI ALIGNMENT FORUM

Gradient Hacking

• Applied to AI Can be “Gradient Aware” Without Doing Gradient hacking. by Sodium 2mo ago

• Applied to Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor by Roger Dearnaley 1y ago

• Applied to Interpreting the Learning of Deceit by Roger Dearnaley 1y ago

• Applied to Research Log, RLLMv2: Phi-1.5, GPT2XL and Falcon-RW-1B as paperclip maximizers by Miguel de Guzman 1y ago

• Applied to Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation by duck_master 1y ago

• Applied to Eliciting Credit Hacking Behaviours in LLMs by omegastick 1y ago

• Applied to What an actually pessimistic containment strategy looks like by elbow921 1y ago

• Applied to Gradient hacking via actual hacking by Max H 2y ago

• Applied to Challenge: construct a Gradient Hacker by RobertM 2y ago

• Applied to Gradient hacking is extremely difficult by Beren Millidge 2y ago

• Applied to Gradient Filtering by Arun Jose 2y ago

• Applied to [ASoT] Simulators show us behavioural properties by default by Arun Jose 2y ago

• Applied to (Extremely) Naive Gradient Hacking Doesn't Work by ojorgensen 2y ago

• Applied to Gradient Hacker Design Principles From Biology by Multicore 2y ago

Multicore v1.1.0 Aug 27th 2022 GMT (+262) LW0

Gradient hacking describes a scenario in which a mesa-optimizer within an AI system deliberately acts to influence how gradient descent updates it, most likely in order to preserve its own mesa-objective in future iterations of the AI.
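
The definition rests on one mechanical fact: what a model computes in its forward pass determines which of its parameters gradient descent changes. The toy below is only a sketch of that fact, not a gradient hacker and not drawn from any of the posts listed here. The model, the `train` function, and the names `w`, `theta`, and `gate` are all hypothetical. A parameter `theta` sits behind a closed ReLU gate, so it receives zero gradient and keeps its initial value while `w` is trained normally.

```python
def train(steps=200, lr=0.1):
    """Fit y = 2x with a hypothetical toy model: pred = w * x + relu(gate) * theta."""
    w, theta, gate = 0.0, 5.0, -1.0  # gate < 0, so relu(gate) == 0
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

    for _ in range(steps):
        for x, y in data:
            g = max(gate, 0.0)            # ReLU applied to the gate
            pred = w * x + g * theta
            err = pred - y                # derivative of 0.5 * err**2 w.r.t. pred

            # Hand-derived gradients of the per-example loss 0.5 * err**2.
            grad_w = err * x
            grad_theta = err * g          # zero while the gate is closed
            grad_gate = err * theta * (1.0 if gate > 0 else 0.0)

            w -= lr * grad_w
            theta -= lr * grad_theta
            gate -= lr * grad_gate

    return w, theta, gate


if __name__ == "__main__":
    w, theta, gate = train()
    print("w:", w, "theta:", theta, "gate:", gate)
```

Here the shielding is hard-coded by the choice of initial values. Gradient hacking in the sense defined above would additionally require that the mesa-optimizer itself arranges its computation so that updates leave its mesa-objective intact.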

See also: Inner Alignment

• Applied to Gradient hacking: definitions and examples by Raymond Arnold 2y ago

• Applied to Is Fisherian Runaway Gradient Hacking? by Ryan Kidd 3y ago