AI ALIGNMENT FORUM
Meg
Posts, sorted by new (karma · title · posted · comments)
80 · Auditing language models for hidden objectives · 12d · 3
118 · Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · 1y · 69
49 · Steering Llama-2 with contrastive activation additions · 1y · 23
39 · Towards Understanding Sycophancy in Language Models · 1y · 0
56 · Paper: LLMs trained on “A is B” fail to learn “B is A” · 2y · 0
44 · Paper: On measuring situational awareness in LLMs · 2y · 13