AI ALIGNMENT FORUM

Sycophancy

Applied to Towards a Science of Evals for Sycophancy by andrejfsantos 2mo ago

Applied to Disproving the "People-Pleasing" Hypothesis for AI Self-Reports of Experience by rife 2mo ago

Applied to SAE features for refusal and sycophancy steering vectors by neverix 6mo ago

Applied to Evaluating LLaMA 3 for political sycophancy by alma.liezenga 6mo ago

Applied to Two new datasets for evaluating political sycophancy in LLMs by alma.liezenga 6mo ago

Applied to Sycophancy to subterfuge: Investigating reward tampering in large language models by Raymond Arnold 10mo ago

Applied to Antagonistic AI by Xybermancer 1y ago

Applied to Steering Llama-2 with contrastive activation additions by Alex Turner 1y ago

Applied to Reducing sycophancy and improving honesty via activation steering by Maxime Riché 1y ago

Maxime Riché v1.0.0 Dec 18th 2023 GMT LW1

Created by Maxime Riché 1y ago