Seb Farquhar

Testing Gemini models for scheming tendencies

by Vika, David Lindner, Seb Farquhar, and Rohin Shah

As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models would sabotage their own safeguards if given the opportunity. Our new papers focus on propensity for...

May 2947

Seb Farquhar

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

MONA: Managed Myopia with Approval Feedback

Testing Gemini models for scheming tendencies
