AI ALIGNMENT FORUM

Interpretability (ML & AI)

•

Applied to Towards a Unified Interpretability of Artificial and Biological Neural Networks by jan_bauer 20h ago

•

Applied to A short critique of Omohundro's "Basic AI Drives" by Soumyadeep Bose 3d ago

•

Applied to Learning Multi-Level Features with Matryoshka SAEs by Bart Bussmann 3d ago

•

Applied to Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces by Matthew A. Clarke 4d ago

•

Applied to Matryoshka Sparse Autoencoders by Noa Nabeshima 6d ago

•

Applied to Testing which LLM architectures can do hidden serial reasoning by Filip Sondej 6d ago

•

Applied to SAEBench: A Comprehensive Benchmark for Sparse Autoencoders by Can 12d ago

•

Applied to Backdoors have universal representations across large language models by Amirali Abdullah 16d ago

•

Applied to Gradient Routing: Masking Gradients to Localize Computation in Neural Networks by Alex Turner 17d ago

•

Applied to Are SAE features from the Base Model still meaningful to LLaVA? by Shan Chen 17d ago

•

Applied to Deep Learning is cheap Solomonoff induction? by Lucius Bushnaq 19d ago

•

Applied to Intricacies of Feature Geometry in Large Language Models by Satvik Golechha 19d ago

•

Applied to Beyond Gaussian: Language Model Representations and Distributions by Matt Levinson 20d ago

•

Applied to AXRP Episode 38.2 - Jesse Hoogland on Singular Learning Theory by DanielFilan 25d ago

•

Applied to Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders by PaulPauls 1mo ago