AI ALIGNMENT FORUM
ChengCheng

Posts

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google (10 karma, 2mo, 0 comments)

GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning (8 karma, 5mo, 0 comments)

Pacing Outside the Box: RNNs Learn to Plan in Sokoban (25 karma, 8mo, 7 comments)

Does robustness improve with scale? (8 karma, 8mo, 0 comments)

VLM-RM: Specifying Rewards with Natural Language (11 karma, 1y, 0 comments)

Uncovering Latent Human Wellbeing in LLM Embeddings (11 karma, 2y, 1 comment)

Wikitag Contributions

Comments
