AI ALIGNMENT FORUM
Mikita Balesni

Posts

18 | How to evaluate control measures for LLM agents? A trajectory from today to superintelligence | 11d | 0
70 | Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations | 1mo | 1
84 | Frontier Models are Capable of In-context Scheming | 5mo | 9
29 | Toward Safety Cases For AI Scheming | 6mo | 0
42 | Apollo Research 1-year update | 11mo | 0
14 | How I select alignment research projects | 1y | 4
26 | A starter guide for evals | 1y | 0
33 | Understanding strategic deception and deceptive alignment | 2y | 0
56 | Paper: LLMs trained on “A is B” fail to learn “B is A” | 2y | 0
44 | Paper: On measuring situational awareness in LLMs | 2y | 13
