AI ALIGNMENT FORUM
Sycophancy
Applied to Towards a Science of Evals for Sycophancy by andrejfsantos 2mo ago
Applied to Disproving the "People-Pleasing" Hypothesis for AI Self-Reports of Experience by rife 2mo ago
Applied to SAE features for refusal and sycophancy steering vectors by neverix 6mo ago
Applied to Evaluating LLaMA 3 for political sycophancy by alma.liezenga 6mo ago
Applied to Two new datasets for evaluating political sycophancy in LLMs by alma.liezenga 6mo ago
Applied to Sycophancy to subterfuge: Investigating reward tampering in large language models by Raymond Arnold 10mo ago
Applied to Antagonistic AI by Xybermancer 1y ago
Applied to Steering Llama-2 with contrastive activation additions by Alex Turner 1y ago
Applied to Reducing sycophancy and improving honesty via activation steering by Maxime Riché 1y ago
Created by Maxime Riché, v1.0.0, Dec 18th 2023 GMT