AI ALIGNMENT FORUM
Mikita Balesni

Posts

Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations (70 karma, 12d, 1 comment)
Frontier Models are Capable of In-context Scheming (84 karma, 4mo, 9 comments)
Toward Safety Cases For AI Scheming (29 karma, 5mo, 0 comments)
Apollo Research 1-year update (42 karma, 10mo, 0 comments)
How I select alignment research projects (14 karma, 1y, 4 comments)
A starter guide for evals (26 karma, 1y, 0 comments)
Understanding strategic deception and deceptive alignment (33 karma, 2y, 0 comments)
Paper: LLMs trained on “A is B” fail to learn “B is A” (56 karma, 2y, 0 comments)
Paper: On measuring situational awareness in LLMs (44 karma, 2y, 13 comments)
Announcing Apollo Research (89 karma, 2y, 4 comments)
