AI ALIGNMENT FORUM
Nate Thomas

Redwood Research and Constellation

Posts

21 Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter

1y

2

22 Causal scrubbing: results on induction heads

2y

0

20 Causal scrubbing: results on a paren balance checker

2y

2

11 Causal scrubbing: Appendix

2y

1

103 Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

2y

29

59 Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

2y

5

65 High-stakes alignment via adversarial training [Redwood Research report]

3y

15

21 We're Redwood Research, we do applied alignment research, AMA

3y

2

Comments

Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter

Nate Thomas 1y 10

Thanks, Neel! It should be fixed now.

Takeaways from our robust injury classifier project [Redwood Research]

Nate Thomas 3y 912

Note that it's unsurprising that a different model categorizes this correctly because the failure was generated from an attack on the particular model we were working with. The relevant question is "given a model, how easy is it to find a failure by attacking that model using our rewriting tools?"