I work on deceptive alignment and reward hacking at Anthropic.
This is cool work! There are two directions I'd be excited to see explored in more detail:
Thank you for catching this.
These linked to section titles in our draft Google Doc for this post. I have replaced them with references to the appropriate sections of this post.
Thank you for pointing this out. I should have been clearer.
I have added a link to the 7 samples in which the model tampers with both its reward and the tests, according to our operational definition: https://github.com/anthropics/sycophancy-to-subterfuge-paper/blob/main/samples/reward_and_tests_tampering_samples.md
I have added the following sentence to the caption of Figure 1 (which links to the markdown file above):
I have added the following section to the discussion:
These changes should go live on arXiv around 5pm on Monday, due to the arXiv release schedule.