AI ALIGNMENT FORUM
Activation Engineering
• Applied to Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models by Alex Turner, 18d ago
• Applied to A Sober Look at Steering Vectors for LLMs by Joschka Braun, 1mo ago
• Applied to Avoiding jailbreaks by discouraging their representation in activation space by Guido Bergman, 3mo ago
• Applied to Validating / finding alignment-relevant concepts using neural data by Bogdan Ionut Cirstea, 3mo ago
• Applied to Programming Refusal with Conditional Activation Steering by Bruce W. Lee, 3mo ago
• Applied to Activation Engineering Theories of Impact by Jakub Nowak, 5mo ago
• Applied to I found >800 orthogonal "write code" steering vectors by Jacob G-W, 5mo ago
• Applied to An Introduction to Representation Engineering - an activation-based paradigm for controlling LLMs by Jan Wehner, 5mo ago
• Applied to Control Vectors as Dispositional Traits by Jan Wehner, 6mo ago
• Applied to Representation Tuning by Jan Wehner, 6mo ago
• Applied to LLMs Universally Learn a Feature Representing Token Frequency / Rarity by Sean Osier, 6mo ago
• Applied to Jailbreak steering generalization by Nina Panickssery, 6mo ago
• Applied to Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller by Henry Cai, 6mo ago
• Applied to Introducing SARA: a new activation steering technique by Alejandro Tlaie Boria, 6mo ago
• Applied to Mechanistically Eliciting Latent Behaviors in Language Models by Alex Turner, 8mo ago