AI ALIGNMENT FORUM
Miles Turpin
Research scientist at NYU Alignment Research Group
Posts
Sorted by: New

Reward hacking behavior can generalize across tasks (7mo; 41 karma; 1 comment)
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought (9mo; 13 karma; 0 comments)
Some Quick Follow-Up Experiments to "Taken out of context: On measuring situational awareness in LLMs" (1y; 17 karma; 0 comments)
Unfaithful Explanations in Chain-of-Thought Prompting (2y; 22 karma; 0 comments)