AI ALIGNMENT FORUM
Sycophancy
This page is a stub.
Posts tagged Sycophancy
Most Relevant
1
90
Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison, Evan Hubinger
10mo
21
2
49
Steering Llama-2 with contrastive activation additions
Nina Panickssery, Wuschel Schulz, NickGabs, Meg, Evan Hubinger, Alex Turner
1y
23
0
52
Reducing sycophancy and improving honesty via activation steering
Nina Panickssery
2y
4
1
13
SAE features for refusal and sycophancy steering vectors
neverix, Dmitrii Kharlapenko, Arthur Conmy, Neel Nanda
6mo
0
0
3
Two new datasets for evaluating political sycophancy in LLMs
alma.liezenga
6mo
0