AI ALIGNMENT FORUM

Sparse Autoencoders (SAEs)

• Applied to Learning Multi-Level Features with Matryoshka SAEs by Bart Bussmann 7d ago

• Applied to Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces by Matthew A. Clarke 8d ago

• Applied to Matryoshka Sparse Autoencoders by Noa Nabeshima 10d ago

• Applied to SAEBench: A Comprehensive Benchmark for Sparse Autoencoders by Can 15d ago

• Applied to Are SAE features from the Base Model still meaningful to LLaVA? by Shan Chen 21d ago

• Applied to Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders by PaulPauls 1mo ago

• Applied to Analyzing how SAE features evolve across a forward pass by bensenberner 2mo ago

• Applied to SAEs are highly dataset dependent: a case study on the refusal direction by Connor Kissane 2mo ago

• Applied to Evolutionary prompt optimization for SAE feature visualization by neverix 2mo ago

• Applied to SAE Probing: What is it good for? Absolutely something! by Subhash Kantamneni 2mo ago

• Applied to A suite of Vision Sparse Autoencoders by Louka Ewington-Pitsos 2mo ago

• Applied to SAEs you can See: Applying Sparse Autoencoders to Clustering by Robert_AIZI 2mo ago

• Applied to On the Practical Applications of Interpretability by Ruben Bloom 2mo ago

• Applied to It's important to know when to stop: Mechanistic Exploration of Gemma 2 List Generation by Gerard Boxo 2mo ago

• Applied to Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution by Kola Ayonrinde 2mo ago

• Applied to SAE features for refusal and sycophancy steering vectors by neverix 2mo ago

• Applied to HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix by Jaehyuk Lim 3mo ago

• Applied to Domain-specific SAEs by jacob_drori 3mo ago