Steering RL Training: Benchmarking Interventions Against Reward Hacking — AI Alignment Forum