AI ALIGNMENT FORUM

Ollie J

Posts

35 [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations

10mo

0

18 Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

1y

0
