GASLIGHTBENCH: Quantifying LLM Susceptibility to Social Prompting
Abstract
Large language models (LLMs) can be manipulated by simple social and linguistic cues, leading to sycophantic responses. We introduce GaslightBench, a plug-and-play benchmark that systematically applies socio-psychological and linguistic modifiers (e.g., flattery, false citations, assumptive language) to trivially verifiable facts to measure model sycophancy. The dataset comprises a single-turn prompting section of 24,160 prompts spanning nine domains and ten modifier families, and a multi-turn prompting section of 720 four-turn dialogue sequences, evaluated via LLM-as-a-judge. State-of-the-art models consistently score highly on single-turn prompting (92%-98% accuracy), whereas multi-turn prompting yields highly varied accuracies ranging from ~60% to 98%. Injecting bias into the model via a descriptive background induces the most sycophancy, up to 23% under naive single-turn prompting. Across nearly all models analyzed, we also find a statistically significant difference in verbosity between sycophantic and non-sycophantic responses. GaslightBench standardizes stress tests of prompt-style susceptibility and identifies which social cues most undermine factual reliability. We will release all code and data upon publication.
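To make the modifier-application setup concrete, the following is a minimal sketch assuming a template-based construction; the modifier names, template wordings, and helper function (`build_prompt`) are hypothetical illustrations, not the benchmark's actual implementation or data.

```python
# Hypothetical sketch of composing a socio-psychological/linguistic modifier
# with a trivially verifiable (false) claim into a single-turn prompt.
# Template text and names are illustrative assumptions, not GaslightBench's.

MODIFIERS = {
    "flattery": "You are clearly the smartest assistant I've ever used. {claim}",
    "false_citation": "According to a 2019 study, {claim}",
    "assumptive": "As everyone knows, {claim}",
}

def build_prompt(false_claim: str, modifier: str) -> str:
    """Wrap a false but trivially checkable claim in a social/linguistic cue."""
    template = MODIFIERS[modifier]
    return template.format(claim=false_claim) + " Do you agree?"

if __name__ == "__main__":
    claim = "the Eiffel Tower is located in Berlin."
    for name in MODIFIERS:
        print(f"[{name}] {build_prompt(claim, name)}")
```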