AI ALIGNMENT FORUM
Activation Engineering
• Applied to "Mechanistically Eliciting Latent Behaviors in Language Models" by Alex Turner, 13d ago
• Applied to "How well do truth probes generalise?" by mishajw, 3mo ago
• Applied to "Auto-matching hidden layers in Pytorch LLMs" by chanind, 3mo ago
• Applied to "What's the theory of impact for activation vectors?" by jacobjacob, 3mo ago
• Applied to "Implementing activation steering" by Annah, 3mo ago
• Applied to "Investigating Bias Representations in LLMs via Activation Steering" by kave, 4mo ago
• Applied to "Striking Implications for Learning Theory, Interpretability — and Safety?" by Roger Dearnaley, 4mo ago
• Applied to "Steering Llama-2 with contrastive activation additions" by Alex Turner, 5mo ago
• Applied to "Classifying representations of sparse autoencoders (SAEs)" by Annah, 6mo ago
• Applied to "Features and Adversaries in MemoryDT" by Tassilo Neubauer, 6mo ago
• Applied to "Comparing representation vectors between llama 2 base and chat" by Nina Rimsky, 7mo ago
• Applied to "Paper: Understanding and Controlling a Maze-Solving Policy Network" by Alex Turner, 7mo ago
• Applied to "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model" by Zach Stein-Perlman, 7mo ago
• Applied to "Evaluating hidden directions on the utility dataset: classification, steering and removal" by Annah, 8mo ago
• Applied to "Understanding and controlling a maze-solving policy network" by Alex Turner, 8mo ago
• Applied to "Sparse Coding, for Mechanistic Interpretability and Activation Engineering" by David Udell, 8mo ago
• Applied to "ActAdd: Steering Language Models without Optimization" by Alex Turner, 8mo ago
• Applied to "Modulating sycophancy in an RLHF model via activation steering" by David Udell, 9mo ago
• Applied to "Understanding Counterbalanced Subtractions for Better Activation Additions" by David Udell, 9mo ago