AI ALIGNMENT FORUM
Can

Posts

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders (score 42, 4mo, 1 comment)

OthelloGPT learned a bag of heuristics (score 43, 9mo, 1 comment)

An adversarial example for Direct Logit Attribution: memory management in gelu-4l (score 8, 2y, 0 comments)

Understanding mesa-optimization using toy models (score 21, 2y, 0 comments)

Comments
