AI ALIGNMENT FORUM
Anthropic (org)
• Applied to Cicadas, Anthropic, and the bilateral alignment problem by kromem, 10h ago
• Applied to EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024 by Stephen Casper, 1d ago
• Applied to Anthropic: Reflections on our Responsible Scaling Policy by Zac Hatfield-Dodds, 3d ago
• Applied to Anthropic AI made the right call by bhauth, 1mo ago
• Applied to OMMC Announces RIP by Adam Scholl, 2mo ago
• Applied to Vaniver's thoughts on Anthropic's RSP by Gunnar Zarncke, 4mo ago
• Applied to Introducing Alignment Stress-Testing at Anthropic by Gunnar Zarncke, 4mo ago
• Applied to On Anthropic’s Sleeper Agents Paper by Gunnar Zarncke, 4mo ago
• Applied to Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation by Soroush Pour, 6mo ago
• Applied to Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy by Zac Hatfield-Dodds, 7mo ago
• Applied to Comparing Anthropic's Dictionary Learning to Ours by Robert_AIZI, 7mo ago
• Applied to Measuring and Improving the Faithfulness of Model-Generated Reasoning by HenningB, 8mo ago
• Applied to Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust by Zac Hatfield-Dodds, 8mo ago
• Applied to Towards Monosemanticity: Decomposing Language Models With Dictionary Learning by Zac Hatfield-Dodds, 8mo ago
• Applied to Amazon to invest up to $4 billion in Anthropic by RobertM, 8mo ago
• Applied to AI Awareness through Interaction with Blatantly Alien Models by Vojtech Kovarik, 10mo ago
• Applied to Frontier Model Forum by RobertM, 10mo ago
• Applied to Frontier Model Security by RobertM, 10mo ago